Results
Phylogenetic and mutation analysis of 60 SARS-CoV-2 sequences from Southeast Asia revealed 78 non-synonymous (NS) mutations. Most of them (n=52) were found in non-structural proteins. The remaining amino acid (aa) substitutions were present in the spike protein (n=13), N protein (n=9), M protein (n=3) and E protein (n=1). The Nepal SARS-CoV-2 genome, which was sequenced early, had no NS mutations compared with our reference sequence (Table 1).
This study identified 21 NS mutations, including 4 aa alterations in the spike protein, that were observed solely in Southeast Asia (Table 2). The majority of these unique mutations (n=15) have arisen only once to date. The remaining 6 were present more than once, but each of these variants circulated in a specific country or region. Moreover, we found 13 mutations with amino acid replacement in the spike protein across Southeast Asia. Seven of them (L54F, T76I, S116C, A243S, E471Q, T572I and D614G) were present in the S1 domain and the remaining 6 (L822F, A829T, A930V, S939F, F1109L and G1124V) were present in the S2 domain of the spike protein. Only one aa substitution (E471Q) occurred in the receptor binding motif of the spike RBD. A 3D structural visualization is presented in Figure 1, with all 13 aa substitution sites represented; 3 of them (S116C, E471Q and A930V) are highlighted in the trimeric spike glycoprotein.
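The S1/S2 assignment of the 13 spike substitutions can be reproduced from residue positions alone. The following is a minimal sketch, not the procedure used in this study: the residue boundaries (S1: 14-685, S2: 686-1273, receptor binding motif: 437-508) are commonly cited annotations that are assumed here and are not stated in the text, and the function names are hypothetical.

```python
import re

# Assumed residue boundaries (commonly cited spike annotation; NOT given in the text).
S1_RANGE = (14, 685)
S2_RANGE = (686, 1273)
RBM_RANGE = (437, 508)  # receptor binding motif, a sub-region of the RBD within S1

# The 13 spike substitutions reported in the Results.
SPIKE_SUBSTITUTIONS = [
    "L54F", "T76I", "S116C", "A243S", "E471Q", "T572I", "D614G",
    "L822F", "A829T", "A930V", "S939F", "F1109L", "G1124V",
]


def position(substitution):
    """Extract the residue number from a substitution such as 'D614G'."""
    match = re.fullmatch(r"[A-Z](\d+)[A-Z]", substitution)
    if match is None:
        raise ValueError(f"unrecognised substitution: {substitution!r}")
    return int(match.group(1))


def in_range(pos, bounds):
    """True if pos lies within the inclusive (start, end) bounds."""
    return bounds[0] <= pos <= bounds[1]


def classify(substitutions):
    """Group substitutions by spike region under the assumed boundaries."""
    regions = {"S1": [], "S2": [], "RBM": []}
    for sub in substitutions:
        pos = position(sub)
        if in_range(pos, S1_RANGE):
            regions["S1"].append(sub)
        elif in_range(pos, S2_RANGE):
            regions["S2"].append(sub)
        # The motif is nested inside S1, so it is checked independently.
        if in_range(pos, RBM_RANGE):
            regions["RBM"].append(sub)
    return regions


domains = classify(SPIKE_SUBSTITUTIONS)
print({name: len(subs) for name, subs in domains.items()})
```

Under these assumed boundaries the sketch places seven substitutions in S1, six in S2, and only E471Q in the receptor binding motif, which matches the counts reported above.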
The recurrent mutations found in Southeast Asia are presented by country in Figure 2. We identified the 10 most frequent mutations with amino acid substitution (N_R203K, N_G204R, N_P13L, NS3_Q57H, NS8_L84S, NSP12_A97V, NSP12_P323L, NSP3_T1198K, NSP6_L37F, Spike_D614G) and separated the variants into 4 major groups and 2 subgroups accordingly (Figure 3). Group 1 consists of 11 sequences, including the reference sequence hCoV-19/Wuhan/WIV04/2019 (Accession: EPI_ISL_402124). All sequences in this group were collected early in the year (13 January to 1 April). Group 2 is defined by the co-evolving mutations NSP12_P323L and Spike_D614G. The largest share of sequences (n=24) belongs to this group, and their isolation dates range from 10 March to 4 May 2020. We further separated this group into two subgroups: those in 2a additionally carry the N_R203K and N_G204R substitutions (trinucleotide mutation 28881-28883: GGG>AAC), and those in 2b carry an amino acid substitution in NS3 (Q57H). In Group 3, 4 co-evolving mutations (NSP12_A97V, N_P13L, NSP3_T1198K, NSP6_L37F) were found in 12 Indian sequences that were collected between 29 March and 26 April 2020. Group 4 consists of 8 variants from Bangladesh, India and Thailand with a common amino acid substitution in NS8 (L84S).
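The grouping rules described above can be written as a small decision function. This is an illustrative sketch only: the function name, the order in which the rules are checked, the requirement that all four Group 3 mutations be present, and the treatment of Group 1 as "none of the defining mutations" are assumptions, since the text does not specify how overlapping or partial signatures were handled.

```python
# Defining mutations per group, as listed in the Results.
GROUP2_CORE = frozenset({"NSP12_P323L", "Spike_D614G"})
SUBGROUP_2A = frozenset({"N_R203K", "N_G204R"})
SUBGROUP_2B = frozenset({"NS3_Q57H"})
GROUP3_CORE = frozenset({"NSP12_A97V", "N_P13L", "NSP3_T1198K", "NSP6_L37F"})
GROUP4_CORE = frozenset({"NS8_L84S"})


def assign_group(mutations):
    """Assign a sequence to a group from its set of aa substitutions.

    Assumption: rules are tested in the order 2 -> 3 -> 4, and a sequence
    with none of the defining mutations falls into Group 1.
    """
    muts = set(mutations)
    if GROUP2_CORE <= muts:
        if SUBGROUP_2A <= muts:
            return "2a"
        if SUBGROUP_2B <= muts:
            return "2b"
        return "2"
    if GROUP3_CORE <= muts:
        return "3"
    if GROUP4_CORE <= muts:
        return "4"
    return "1"


# Hypothetical mutation profiles, for illustration only (not sequences from this study).
examples = {
    "reference-like": [],
    "D614G lineage": ["NSP12_P323L", "Spike_D614G"],
    "D614G + N trinucleotide": ["NSP12_P323L", "Spike_D614G", "N_R203K", "N_G204R"],
    "D614G + Q57H": ["NSP12_P323L", "Spike_D614G", "NS3_Q57H"],
    "Indian cluster": ["NSP12_A97V", "N_P13L", "NSP3_T1198K", "NSP6_L37F"],
    "L84S lineage": ["NS8_L84S"],
}
assigned = {name: assign_group(muts) for name, muts in examples.items()}
print(assigned)
```

Using subset tests on sets keeps each rule independent of the order in which mutations are listed for a sequence.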
In the cluster-based time plot, 187 available sequences were studied over the 15 days of the month from January to May 2020. The Group 2 cluster was not observed through February (0%, n=0) and then accounted for 38% (n=19) of sequences in March, 37% (n=20) in April and 85% (n=61) in May. The Group 3 cluster ranged from 0% to 8% in these months, apart from a sudden increase to 54% (n=29) in April. The number of reported infections increased from 17 cases in January to 128,257 in May 2020 (Figure 4).
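The monthly percentages in the time plot are cluster counts divided by the number of sequences available for that month. A minimal sketch of that calculation is given below; the function name and the toy records are hypothetical and are not the 187 sequences analysed here, and rounding to whole percentages is an assumption.

```python
from collections import Counter, defaultdict


def monthly_cluster_share(records):
    """Compute per-month counts and percentages for each cluster.

    records: iterable of (month, group) pairs, one pair per sequence.
    Returns {month: {group: (count, percent)}} with percent rounded
    to the nearest whole number (assumed rounding).
    """
    per_month = defaultdict(Counter)
    for month, group in records:
        per_month[month][group] += 1

    shares = {}
    for month, counts in per_month.items():
        total = sum(counts.values())  # all sequences available for that month
        shares[month] = {
            group: (n, round(100 * n / total)) for group, n in counts.items()
        }
    return shares


# Hypothetical toy records, for illustration only.
toy_records = [
    ("2020-03", "2"), ("2020-03", "2"), ("2020-03", "1"), ("2020-03", "4"),
    ("2020-04", "3"), ("2020-04", "2"),
]
shares = monthly_cluster_share(toy_records)
print(shares["2020-03"])
```

Because each percentage uses that month's own total as its denominator, percentages from different months are not directly comparable as counts.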
A geographic heat map (Figure 5) revealed that most of the 78 NS mutations found in this study were also common in Europe and North America. The transmission map, generated with Nextstrain using 329 genome sequences (Figure 6), showed that Group 2 sequences from this study (clade A2 in Nextstrain) were dominant among strains circulating in Southeast Asia. These sequences were also prevalent in Europe and North America.