Future work and similarity with other methods
Inbred or haploid genotypic datasets enjoy huge quality advantages over heterozygous datasets at comparable levels of sequencing depth. This study used a minimum depth threshold of 5 for P1 and P2 datasets, which should theoretically lead to 93.75% of truly heterozygous sites being called correctly (assuming no amplification bias) and which actually resulted in ~80-90% of the raw data being discarded (Table 1). The luxury of relaxing or removing depth thresholds in inbred datasets results in retention of much more data, and summarizing heterozygosity by taxa or by SNP in inbred datasets simplifies the removal of cross-contaminated DNA samples and homeo-SNPs respectively. In this study, dual alignment of reads from interspecific hybrids to both parental genomes (P1+P2) resulted in effectively inbred datasets that enabled more rigorous quality control, displayed higher concordance following downsampling, and provided more robust estimation of population structure compared to standard alignment against a single reference genome. Although this study used Beagle imputation for purposes of comparing different alignment strategies, datasets resulting from dual alignment could also be imputed using FSFHap, an imputation method designed for inbred populations (Swarts et al., 2014), whereas P1 and P2 datasets could not. The practical conclusion of this study is that dual alignment allows interspecific hybrids to be genotyped and imputed as efficiently and inexpensively as inbreds.
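The 93.75% figure quoted above can be reproduced with a short calculation. The sketch below assumes a simple model that is not spelled out in the text: each read at a truly heterozygous site samples one of the two alleles independently with equal probability, there is no sequencing error, and a site is called heterozygous whenever both alleles are observed at least once. The function name `prob_both_alleles_observed` and its `allele_fraction` parameter are illustrative only.

```python
def prob_both_alleles_observed(depth: int, allele_fraction: float = 0.5) -> float:
    """Probability that `depth` reads at a heterozygous site include both alleles.

    Assumes independent sampling of reads and no sequencing error.
    `allele_fraction` is the chance that a single read carries the first allele;
    0.5 corresponds to no amplification bias.
    """
    if depth < 1:
        raise ValueError("depth must be at least 1")
    if not 0.0 <= allele_fraction <= 1.0:
        raise ValueError("allele_fraction must be between 0 and 1")
    # The site is miscalled as homozygous if every read carries the same allele.
    p_only_first = allele_fraction ** depth
    p_only_second = (1.0 - allele_fraction) ** depth
    return 1.0 - p_only_first - p_only_second


if __name__ == "__main__":
    for depth in (1, 2, 5, 10):
        print(depth, prob_both_alleles_observed(depth))
```

Under these assumptions a depth of 5 gives 1 - 2 × 0.5^5 = 0.9375. Setting `allele_fraction` away from 0.5 lowers the value at any given depth, which is why the figure is conditional on the absence of amplification bias.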
The parental genomes in this study are estimated to have diverged 38 million years ago for Pistacia (P. atlantica vs P. integerrima) (Xie et al., 2014) and 45 million years ago for Juglans (J. microcarpa vs J. regia) (Stevens et al., 2018). This study used 90 bp Illumina reads trimmed to 64 bp for faster processing through the TASSEL GBS pipeline (Glaubitz et al., 2014); 65% and 76% of these reads mapped uniquely to the Pistacia and Juglans P1+P2 genomes, respectively. Longer reads, which are more likely to span a site that distinguishes the two parental genomes, could be used to apply this strategy to hybrids with lower divergence, and perhaps even to hybrids between heterotic groups within a species. Alternatively, strategies that make use of a “pan-genome”, including the Practical Haplotype Graph (Bradbury et al., 2022), may achieve a similar result by including enough representative reference contigs to ensure that all reads align to a homologous (non-homeologous) sequence. The strategy described here could also be applied to transcriptome data of hybrids to investigate allele-specific or species-specific patterns of expression and co-expression.
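One way to picture the combined P1+P2 reference used for dual alignment is as a concatenation of the two parental assemblies, with each contig name tagged by parent so that a uniquely mapped read can be attributed to one subgenome. The sketch below is an assumption about how such a reference might be prepared, not a description of the pipeline used in this study; the function names `combine_references` and `parent_of` and the `P1`/`P2` tags are hypothetical.

```python
def combine_references(p1_fasta: str, p2_fasta: str,
                       p1_tag: str = "P1", p2_tag: str = "P2") -> str:
    """Concatenate two FASTA texts, prefixing every contig name with its parent tag."""

    def tag_headers(fasta_text: str, tag: str) -> list:
        tagged = []
        for line in fasta_text.splitlines():
            if line.startswith(">"):
                # Header line: ">chr1" becomes ">P1_chr1" (or ">P2_chr1").
                tagged.append(f">{tag}_{line[1:]}")
            elif line.strip():
                # Sequence line: kept unchanged; blank lines are dropped.
                tagged.append(line)
        return tagged

    lines = tag_headers(p1_fasta, p1_tag) + tag_headers(p2_fasta, p2_tag)
    return "\n".join(lines) + "\n"


def parent_of(contig_name: str) -> str:
    """Recover the parent tag from a contig name produced by combine_references."""
    return contig_name.split("_", 1)[0]


if __name__ == "__main__":
    combined = combine_references(">chr1\nACGT\n", ">chr1\nACGA\n")
    print(combined, end="")
    print(parent_of("P2_chr1"))
```

Tagging the contig names keeps identically named chromosomes from the two parents distinct in the combined file, and lets downstream steps split genotype calls by parental genome using the name alone.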