Results
Phylogenetic and mutation analysis of 60 SARS-CoV-2 sequences from
Southeast Asia revealed 78 non-synonymous (NS) mutations. Most of them
(n=52) were found in non-structural proteins. Other mutations with amino
acid (aa) substitutions were present in the spike protein (n=13), N protein
(n=9), M protein (n=3) and E protein (n=1). The Nepal SARS-CoV-2 genome,
which was sequenced early, had no NS mutations compared to our reference
sequence (Table 1).
This study identified 21 NS mutations, including 4 aa alterations in
the spike protein, that were observed solely in Southeast Asia (Table 2). The
majority of these unique mutations (n=15) have arisen only once to date. The
remaining 6 were present more than once, but each of these variants
circulated in a specific country or region. Moreover, we found 13
mutations with aa substitutions in the spike protein across Southeast
Asia. Seven of them (L54F, T76I, S116C, A243S, E471Q, T572I and D614G)
were present in the S1 domain and the remaining 6 (L822F, A829T, A930V,
S939F, F1109L and G1124V) were present in the S2 domain of the spike
protein. Only one aa substitution (E471Q) occurred in the
receptor-binding motif of the spike receptor-binding domain (RBD). A 3D
structural visualization is presented
in Figure 1, with 13 aa substitution sites represented; of them, 3 aa
substitutions (S116C, E471Q and A930V) are highlighted in the trimeric
spike glycoprotein.
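The S1/S2 assignment described above can be reproduced from residue positions alone. The following is a minimal sketch, not the procedure used in this study: the residue ranges for S1, S2 and the receptor-binding motif are approximate values assumed here for illustration, and the function names are hypothetical.

```python
import re

# Approximate residue ranges, assumed for illustration (not taken from this study).
S1_RANGE = (14, 685)
S2_RANGE = (686, 1273)
RBM_RANGE = (437, 508)

# The 13 spike substitutions listed in the text.
SPIKE_SUBSTITUTIONS = [
    "L54F", "T76I", "S116C", "A243S", "E471Q", "T572I", "D614G",
    "L822F", "A829T", "A930V", "S939F", "F1109L", "G1124V",
]


def residue_position(substitution):
    """Return the residue number from a label such as 'D614G'."""
    match = re.fullmatch(r"[A-Z](\d+)[A-Z]", substitution)
    if match is None:
        raise ValueError(f"unrecognised substitution label: {substitution}")
    return int(match.group(1))


def within(position, bounds):
    """True if position lies inside the inclusive (low, high) range."""
    low, high = bounds
    return low <= position <= high


def classify(substitution):
    """Return (domain, in_rbm) for one spike substitution."""
    position = residue_position(substitution)
    if within(position, S1_RANGE):
        domain = "S1"
    elif within(position, S2_RANGE):
        domain = "S2"
    else:
        domain = "outside S1/S2"
    return domain, within(position, RBM_RANGE)


domains = {"S1": [], "S2": []}
rbm_hits = []
for sub in SPIKE_SUBSTITUTIONS:
    domain, in_rbm = classify(sub)
    domains.setdefault(domain, []).append(sub)
    if in_rbm:
        rbm_hits.append(sub)

print(len(domains["S1"]), len(domains["S2"]), rbm_hits)
```

Under the assumed ranges, this yields 7 substitutions in S1, 6 in S2, and E471Q as the only one inside the receptor-binding motif, in line with the counts reported above.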
The recurrent mutations found in Southeast Asia are presented by country
in Figure 2. We identified the 10 most frequent mutations with
amino acid substitutions (N_R203K, N_G204R, N_P13L, NS3_Q57H,
NS8_L84S, NSP12_A97V, NSP12_P323L, NSP3_T1198K,
NSP6_L37F, Spike_D614G) and separated the variants into 4 major groups
and 2 subgroups accordingly (Figure 3). Group 1 consists of 11
sequences, including the reference sequence hCoV-19/Wuhan/WIV04/2019
(Accession: EPI_ISL_402124). All sequences within this group
were observed early in 2020 (13 January to 1 April). Group 2
is defined by the co-evolving mutations NSP12_P323L and Spike_D614G.
Most of the sequences (n=24) belong to this group, and their isolation
dates range from 10 March to 4 May 2020. We further separated this
group into two subgroups: those in 2a carry the additional N_R203K and
N_G204R substitutions, which result from a trinucleotide mutation
(28881-28883: GGG>AAC), and those in 2b carry the NS3 substitution Q57H.
In Group 3, four co-evolving mutations (NSP12_A97V, N_P13L, NSP3_T1198K,
NSP6_L37F) were found in 12 Indian sequences collected
between 29 March and 26 April 2020. Group 4 consists of 8 variants from
Bangladesh, India and Thailand with a common amino acid substitution at
NS8 (L84S).
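The grouping rules described above can be written as a simple set-membership test. The sketch below is illustrative only and is not the clustering procedure used in this study. Two points are assumptions: the order in which the rules are checked, and the treatment of Group 1 as the set of sequences carrying none of the marker mutations. The function name `assign_group` is hypothetical.

```python
# Marker mutations for each group, as named in the text.
GROUP_2 = {"NSP12_P323L", "Spike_D614G"}
GROUP_2A = {"N_R203K", "N_G204R"}
GROUP_2B = {"NS3_Q57H"}
GROUP_3 = {"NSP12_A97V", "N_P13L", "NSP3_T1198K", "NSP6_L37F"}
GROUP_4 = {"NS8_L84S"}


def assign_group(mutations):
    """Assign a sequence to a group from its set of amino acid substitutions.

    Assumptions: rules are checked in the order 2, 3, 4, and any sequence
    without a complete marker set falls into Group 1.
    """
    observed = set(mutations)
    if GROUP_2 <= observed:
        if GROUP_2A <= observed:
            return "2a"
        if GROUP_2B <= observed:
            return "2b"
        return "2"
    if GROUP_3 <= observed:
        return "3"
    if GROUP_4 <= observed:
        return "4"
    return "1"


# Hypothetical mutation profiles, for illustration only.
examples = {
    "reference-like": [],
    "D614G with N 203/204": ["NSP12_P323L", "Spike_D614G", "N_R203K", "N_G204R"],
    "D614G with Q57H": ["NSP12_P323L", "Spike_D614G", "NS3_Q57H"],
    "L84S only": ["NS8_L84S"],
}
for label, profile in examples.items():
    print(label, "->", assign_group(profile))
```

Checking Group 2 first means a sequence that carried both the Group 2 and Group 4 markers would be reported as Group 2; the text does not say how such a case was handled, so this precedence is a design choice of the sketch.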
In the cluster-based time plot, 187 available sequences were studied
in 15-day intervals from January to May 2020. The Group 2
cluster was not observed through February (0%, n=0) but accounted for 38%
(n=19) of sequences in March, 37% (n=20) in April and 85% (n=61) in May.
The Group 3 cluster ranged from 0% to 8% in those months, except for a
sudden increase to 54% (n=29) in April. The number of infections increased
from 17 cases in January to 128,257 in May 2020 (Figure 4).
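The counts and percentages above can be cross-checked against each other. The following sketch back-calculates the monthly denominator implied by each reported pair. These implied totals are estimates derived from rounded percentages, not figures reported in this study, and the helper names are hypothetical.

```python
# (count, rounded percentage) pairs as reported in the text.
REPORTED = {
    ("Group 2", "March"): (19, 38),
    ("Group 2", "April"): (20, 37),
    ("Group 2", "May"): (61, 85),
    ("Group 3", "April"): (29, 54),
}


def implied_total(count, percent):
    """Estimate the number of sequences in a month from a count and its
    rounded percentage. The result is approximate because of rounding."""
    return round(count * 100 / percent)


def share(count, total):
    """Percentage of `total` represented by `count`, rounded to an integer."""
    return round(100 * count / total)


implied = {key: implied_total(n, pct) for key, (n, pct) in REPORTED.items()}

for (group, month), total in implied.items():
    count, percent = REPORTED[(group, month)]
    # Recompute the percentage from the implied total as a consistency check.
    print(group, month, "implied total:", total,
          "recomputed share:", share(count, total), "reported:", percent)
```

Both April entries imply a denominator of about 54 sequences, which suggests that the Group 2 and Group 3 percentages for that month were computed over the same set of sequences.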
A geographic heat map (Figure 5) revealed that most of the 78 NS
mutations found in this study were also common in Europe and North
America. The transmission map, generated with Nextstrain using 329
genome sequences (Figure 6), showed that Group 2 sequences (the A2 clade
in Nextstrain) from this study were dominant among strains
circulating in Southeast Asia. These sequences were also
prevalent in Europe and North America.