Sequence trimming
Because the DNA sequences from GISAID were not annotated, they first had to be trimmed to isolate the coding sequence of the surface glycoprotein. BioPerl was used to isolate the surface glycoprotein gene from a total of 731 SARS-CoV-2 genomes. A simple Perl script was designed to remove the sequence upstream of the ATG start codon and downstream of the TAA stop codon (Supplementary Figure 3). Incomplete sequences were then removed, leaving 702 sequences for further analysis.
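The trimming step can be illustrated with a minimal sketch. It is written in Python as a stand-in and is not the Perl script of Supplementary Figure 3. The function names, the use of a short `start_motif` to locate the ATG start codon, and the definition of an incomplete sequence (no start motif, no in-frame TAA, or any character other than A, C, G, T) are all assumptions made for illustration.

```python
def trim_to_cds(genome, start_motif):
    """Return the sequence from start_motif through the first in-frame TAA.

    start_motif is a hypothetical anchor: the ATG start codon plus a few
    following bases, used to pick the right ATG in the genome.
    Returns None if the motif or an in-frame TAA cannot be found.
    """
    genome = genome.upper()
    start = genome.find(start_motif.upper())
    if start == -1:
        return None
    # Walk codon by codon from the start codon, staying in frame.
    # Only TAA is treated as the stop codon, as described in the text.
    for i in range(start, len(genome) - 2, 3):
        if genome[i:i + 3] == "TAA":
            return genome[start:i + 3]
    return None


def is_complete(cds):
    """Assumed completeness check: trimming succeeded and only A/C/G/T remain."""
    return cds is not None and set(cds) <= set("ACGT")


# Toy example with a made-up sequence (not real SARS-CoV-2 data).
toy_genome = "ccggATGTTTGCATAAggcc"
toy_cds = trim_to_cds(toy_genome, "ATGTTT")
print(toy_cds)               # ATGTTTGCATAA
print(is_complete(toy_cds))  # True
```

Anchoring on a short motif rather than on the first ATG in the genome avoids trimming at an unrelated upstream start codon. The motif itself would have to be chosen from a reference sequence.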
EMBOSS Transeq was used to translate the DNA sequences of the surface glycoprotein into amino acid sequences.
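For readers unfamiliar with the translation step, the following Python sketch shows what translating a trimmed coding sequence with the standard genetic code involves. It does not call EMBOSS Transeq or reproduce its options. The `translate` function is hypothetical, and writing stop codons as `*` and unresolved codons as `X` is a convention chosen for this sketch.

```python
from itertools import product

# Standard genetic code, with codons ordered by base T, C, A, G
# at each of the three positions.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(cds):
    """Translate a DNA coding sequence in frame 1.

    Stop codons become '*'; codons containing anything other than
    A/C/G/T become 'X'. Trailing bases that do not fill a codon are ignored.
    """
    cds = cds.upper()
    return "".join(
        CODON_TABLE.get(cds[i:i + 3], "X")
        for i in range(0, len(cds) - 2, 3)
    )


print(translate("ATGTTTGCATAA"))  # MFA*
```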