2.1 | Computation
Accurate annotation of sORFs using computational tools is challenging
not only because their short lengths impede statistical analyses,
but also because they exhibit intermediate conservation relative to
longer genes, which has been interpreted as evidence for the de
novo evolution of some microproteins. Notwithstanding these challenges,
algorithms and machine learning strategies are currently being developed
to better find sORFs within genomes. Some computational efforts rely on
phylogeny, nucleotide and amino acid homology, and secondary structure
to identify unannotated sORFs with sequence or structural similarities
to canonical proteins; examples include PhyloCSF and miPFinder.
Additional dimensions of predictive information, including the presence
of a ribosome binding site upstream of bacterial sORF start codons or a
Kozak consensus sequence surrounding a eukaryotic sORF start codon, have
been applied to sORF prediction. Ambitiously, OpenProt predicts all
AUG-initiated sORFs and alternative ORFs (alt-ORFs) within all known
mRNAs for several organisms, and curates experimental evidence (or lack
thereof) for their expression. Finally, deep forest and deep learning
models have been used to predict sORFs in individual microbial genomes,
as well as in microbiomes and metagenomes.
These methods have highlighted new sORFs in intergenic regions,
in noncoding RNAs, and in multicistronic/dual-coding mRNAs.
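
To make two of the ideas above concrete, the following is a minimal Python sketch of enumerating AUG-initiated ORFs in a single transcript and flagging whether each start codon sits in a simplified Kozak context (a purine at position -3 and a G at position +4). It is an illustration only, not the pipeline used by OpenProt or any other tool named here; the function names, the dictionary keys, and the default length cutoffs (`min_codons`, `max_codons`) are hypothetical choices made for this example.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def has_kozak_context(seq, start):
    """Simplified Kozak test (an assumption of this sketch):
    purine (A/G) at -3 and G at +4, with the A of ATG as +1."""
    if start < 3 or start + 3 >= len(seq):
        return False  # not enough flanking sequence to evaluate
    return seq[start - 3] in "AG" and seq[start + 3] == "G"


def find_atg_orfs(mrna, min_codons=10, max_codons=100):
    """List ATG-initiated ORFs in the forward frames of one transcript.

    Length is counted in codons, including the start codon and
    excluding the stop codon. ORFs without an in-frame stop are skipped.
    """
    seq = mrna.upper().replace("U", "T")  # accept RNA or DNA input
    orfs = []
    for start in range(len(seq) - 2):
        if seq[start:start + 3] != "ATG":
            continue
        # Walk downstream codon by codon until the first in-frame stop.
        for pos in range(start + 3, len(seq) - 2, 3):
            if seq[pos:pos + 3] in STOP_CODONS:
                n_codons = (pos - start) // 3
                if min_codons <= n_codons <= max_codons:
                    orfs.append({
                        "start": start,      # 0-based index of the A in ATG
                        "end": pos + 3,      # exclusive, includes the stop
                        "codons": n_codons,
                        "kozak": has_kozak_context(seq, start),
                    })
                break
    return orfs


if __name__ == "__main__":
    # Toy transcript: a 4-codon ORF preceded by a Kozak-like context.
    toy = "GCCACC" + "ATG" + "GCT" * 3 + "TAA"
    for orf in find_atg_orfs(toy, min_codons=2):
        print(orf)
```

Because every ATG is treated as a candidate start, nested and overlapping ORFs in different frames are all reported; a real predictor would combine such candidates with conservation, ribosome profiling, or proteomic evidence before calling an sORF.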