Sequence trimming
As the DNA sequences from GISAID were not annotated, it was first
necessary to trim each sequence and isolate the coding sequence of the
surface glycoprotein. BioPerl was used to isolate the surface
glycoprotein gene from a total of 731 SARS-CoV-2 genomes. A simple
Perl script was written to remove the sequence upstream of the ATG start
codon and downstream of the TAA stop codon (Supplementary Figure 3).
Incomplete sequences were then removed, leaving 702 sequences for
further analysis.
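
The trimming and filtering logic can be illustrated with a minimal Python sketch. This is not the Perl script from Supplementary Figure 3. The function names trim_to_cds and is_complete are hypothetical, and two assumptions are made that the text does not state: the input is already restricted to the region around the gene so that the first ATG is the intended start codon, and a sequence counts as "incomplete" if no in-frame TAA is found or if it contains ambiguous bases.

```python
def trim_to_cds(seq, start="ATG", stop="TAA"):
    """Return the sequence from the first start codon through the first
    in-frame stop codon, or None if either is missing.

    Assumption: the first ATG in `seq` is the intended start codon.
    """
    seq = seq.upper()
    begin = seq.find(start)
    if begin == -1:
        return None
    # Walk codon by codon in the reading frame set by the start codon.
    for i in range(begin + 3, len(seq) - 2, 3):
        if seq[i:i + 3] == stop:
            return seq[begin:i + 3]
    return None


def is_complete(cds):
    """Assumed completeness rule: a CDS was found, its length is a
    multiple of three, and it contains only unambiguous bases."""
    return cds is not None and len(cds) % 3 == 0 and set(cds) <= set("ACGT")


if __name__ == "__main__":
    # Toy sequences, not real GISAID data.
    examples = ["ccATGAAATTTTAAgg", "ATGNNNTAA", "CCCC"]
    trimmed = [trim_to_cds(s) for s in examples]
    kept = [c for c in trimmed if is_complete(c)]
    print(kept)
```

On real genomes the start position would normally be located more robustly, for example by aligning against a reference gene, since the first ATG in a full genome is not the start of the surface glycoprotein gene.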
EMBOSS Transeq was used to translate the DNA sequences of the surface
glycoprotein into amino acid sequences.
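
For readers who want to see what this translation step involves, the following Python sketch translates a coding sequence in frame 1 with the standard genetic code. It is a stand-in for illustration and is not EMBOSS code. The function name translate is hypothetical, and rendering stop codons as "*" and codons with ambiguous bases as "X" is an assumption about the desired output format.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(cds):
    """Translate a DNA coding sequence in frame 1.

    Stop codons become '*'. Codons containing anything other than
    A, C, G, or T become 'X'. Trailing bases that do not fill a whole
    codon are ignored.
    """
    cds = cds.upper()
    protein = []
    for i in range(0, len(cds) - len(cds) % 3, 3):
        protein.append(CODON_TABLE.get(cds[i:i + 3], "X"))
    return "".join(protein)


if __name__ == "__main__":
    # Toy sequence, not real data.
    print(translate("ATGAAATTTTAA"))
```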