Genetic database construction and sequence sampling
Sequences for nirS and eNOR genes from SURF MAG 42 (Table
S1) were used as BLAST queries (Camacho et al., 2009) against three
genomic repositories:
- Genome databases constructed for 21 Chloroflexi genomes assembled from
deep-subsurface MAG data (Jungbluth, Amend and Rappé, 2017; Momper et al., 2017) (Table S1).
- Genome databases constructed for 86 genomes recently assembled as MAGs
from sludge bioreactors (Parks et al., 2017) (Table S3).
- The full NCBI non-redundant protein database (as of 25 September
2019) (Agarwala et al., 2018).
Additionally, putative environmental homologs were evaluated by using
protein sequences from SURF MAG 42 to query NCBI’s non-redundant
environmental metagenomic sequence database (env-nr, as of June
2020) (Agarwala et al., 2018) (Supplementary Datafile S2).
Hits from all databases (Table S4) were combined and assessed for
quality; hits with E-values ≤ 1 × 10⁻¹⁰ were included for
initial analyses. To capture diversity while limiting imprecision and
biased sampling of overrepresented groups (e.g., Proteobacteria), hits
were subsampled to one representative per genus, with the exception of members of the
Chloroflexi (to fully capture the taxonomic distribution of the novel
gene variant). One additional, divergent multispecies hit was allowed
per genus. The genus-level filter was also removed for C1, where
non-Chloroflexi hits were severely limited (see below). Duplicate
sequences (from strains with multiple genome entries or in multiple
databases surveyed) were removed.
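
The filtering and subsampling steps described above can be illustrated with a minimal Python sketch. It is not the code used in this study. The `Hit` record, its field names, the `multispecies` flag, and the function `subsample_hits` are hypothetical. Choosing the lowest E-value hit as the genus representative, taking the first multispecies record per genus as the "divergent" one, treating duplicates as identical sequence strings, and removing duplicates before subsampling are all assumptions, since the text does not specify them.

```python
from dataclasses import dataclass

E_VALUE_CUTOFF = 1e-10  # inclusion threshold stated in the text


@dataclass(frozen=True)
class Hit:
    """Hypothetical record for one BLAST hit (field names are illustrative)."""
    accession: str
    phylum: str
    genus: str
    evalue: float
    sequence: str
    multispecies: bool = False  # assumed flag for multispecies records


def subsample_hits(hits, exempt_phylum="Chloroflexi", cutoff=E_VALUE_CUTOFF):
    """Filter by E-value, drop duplicate sequences, subsample per genus."""
    # Step 1: keep only hits at or below the E-value cutoff,
    # ordered from best (lowest) to worst E-value.
    passing = sorted((h for h in hits if h.evalue <= cutoff),
                     key=lambda h: h.evalue)

    # Step 2: remove duplicates, keeping the best-scoring copy.
    # Assumption: "duplicate" means an identical sequence string.
    best_by_sequence = {}
    for hit in passing:
        best_by_sequence.setdefault(hit.sequence, hit)
    unique = list(best_by_sequence.values())  # still ordered by E-value

    # Step 3: genus-level subsampling, with the exempt phylum kept in full.
    kept = []
    genera_with_representative = set()
    genera_with_multispecies = set()
    for hit in unique:
        if hit.phylum == exempt_phylum:
            kept.append(hit)  # no genus-level filter for this phylum
        elif hit.multispecies:
            # Assumption: the first multispecies record per genus stands in
            # for the one additional divergent hit allowed by the text.
            if hit.genus not in genera_with_multispecies:
                genera_with_multispecies.add(hit.genus)
                kept.append(hit)
        elif hit.genus not in genera_with_representative:
            genera_with_representative.add(hit.genus)
            kept.append(hit)
    return kept
```

In this sketch, removing the genus-level filter for a particular clade (as was done for C1) would correspond to skipping step 3 for the hits assigned to that clade.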