DISCUSSION
In this study, we designed and developed four R functions that automate tasks commonly needed in conservation genomic analyses: (1) filter.sex.linked, to identify and remove sex-linked loci; (2) infer.sex, to infer the genetic sex of individuals using sex-linked loci; (3) filter.excess.het, to remove loci with abnormally high heterozygosity; and (4) gl2colony, to produce input files for parentage analysis software. Applying these functions to genomic data for two bird species revealed that standard filters, such as those based on read depth and call rate, are inefficient at removing sex-linked loci: they removed fewer than half of Z-linked loci and only 29-63% of gametologs. In the two studied species, the failure to comprehensively remove sex-linked loci led to one or more of: (i) overestimation of population FIS by up to 9% and of the number of private alleles by up to 8%; (ii) incorrect inference of sex differences in individual heterozygosity; (iii) capture of genomic differences between the sexes instead of population structure; and (iv) inference of ~11% fewer parent-offspring relationships in parentage analyses. We also found that our functions were capable of identifying all sex-linked loci using as few as 15 known males and 15 known females, through a preliminary run of filter.sex.linked, followed by running infer.sex and then re-running filter.sex.linked.
Appropriate filtering is a challenging part of population genomic analyses. It is widely acknowledged that filtering can significantly affect the inferences drawn from different analyses, ranging from ‘simple’ standard measures like heterozygosity all the way to GEA (e.g., Fu 2014; Linck & Battey 2019; Graham et al. 2020; William et al. 2022; Ahrens et al. 2021). Given this awareness, there is surprisingly little mention of best practices for filtering out sex-linked loci from SNP datasets in population genomics research (but see Benestan et al. 2017 and Trenkel et al. 2020). Unless they use per-marker FST or dartR's gl.report.sexlinked function to explicitly identify sex-linked markers, studies rarely address them and seem to rely mainly on read-depth and missing-data filters to remove sex-linked loci from large SNP datasets. We have demonstrated that this untargeted approach fails to remove ~19-29% of all sex-linked loci. Filtering sex-linked markers based only on assumed synteny with chromosomal locations in a heterospecific reference genome can also fail to account for neo-sex chromosomes in evolutionary studies (Morales et al. 2018). Recent discoveries of neo-sex chromosome systems in Sylvioidea (Sigeman et al. 2020; Sigeman et al. 2022), Australian robins (Gan et al. 2019), insects (Wang et al. 2022) and other systems highlight the dangers of assuming synteny with reference genomes of other species when detecting sex-linked loci. Thus, we propose that using our filter.sex.linked function to remove sex-linked loci before applying SNP quality filters constitutes best practice, because it ensures that downstream filters are in fact evaluating the quality of autosomal loci.
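The logic of identifying sex-linked loci from known-sex individuals can be illustrated with a minimal Python sketch for a ZW system (females ZW, males ZZ), as in birds. This is not the implementation of filter.sex.linked: the function names, the 0/1/2/missing genotype encoding and the fixed tolerance `tol` are assumptions made for illustration, and a real analysis would replace the crude thresholds with formal statistical tests. The sketch relies only on the expected patterns: W-linked loci are scored in females only, hemizygous females cannot be heterozygous at Z-linked loci, and gametologs appear heterozygous in females but homozygous in males.

```python
from typing import Optional, Sequence

# Assumed encoding: 0 or 2 = homozygous, 1 = heterozygous, None = missing call
Genotype = Optional[int]


def call_rate(genotypes: Sequence[Genotype]) -> float:
    """Fraction of individuals with a non-missing genotype."""
    return sum(g is not None for g in genotypes) / len(genotypes)


def het_rate(genotypes: Sequence[Genotype]) -> float:
    """Fraction of called genotypes that are heterozygous (0.0 if none called)."""
    called = [g for g in genotypes if g is not None]
    return sum(g == 1 for g in called) / len(called) if called else 0.0


def classify_locus(females: Sequence[Genotype],
                   males: Sequence[Genotype],
                   tol: float = 0.1) -> str:
    """Label one locus under a ZW system using known-sex individuals.

    `tol` is a hypothetical allowance for genotyping error, not a
    parameter of any published function.
    """
    f_call, m_call = call_rate(females), call_rate(males)
    f_het, m_het = het_rate(females), het_rate(males)
    if f_call >= 1 - tol and m_call <= tol:
        return "W-linked"    # scored in females, missing in males
    if f_het >= 1 - tol and m_het <= tol:
        return "gametolog"   # Z and W copies differ, so females look heterozygous
    if f_het <= tol and m_het > tol:
        return "Z-linked"    # hemizygous females are never heterozygous
    return "autosomal"
```

For example, a locus genotyped as `[0, 2, 0, 2, 0]` in females and `[1, 0, 1, 2, 1]` in males would be labelled Z-linked by this sketch. With small samples, a rare autosomal allele could produce the same pattern by chance, which is why a threshold rule of this kind is only a starting point.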
We showed that the failure to remove sex-linked loci meant that a considerable proportion (7.8% and 5.7%) of the SNPs in the final datasets were not autosomal and therefore yielded incorrect estimates of population diversity. Interestingly, the bias in genetic diversity estimates caused by sex-linked loci varied unpredictably among populations and was not influenced by the within-population sex ratio (Figure 5). This is likely because many factors are involved in addition to sex bias in the sample, such as the proportions of different types of sex-linked loci, their different allelic frequencies in the populations, the total amount of sex-linked versus autosomal loci, the sex-chromosome-to-autosome diversity ratio, and the level of recombination between sex chromosomes. This highlights the necessity of searching for and carefully filtering out sex-linked loci, because it would be hard to control for their presence in other ways (e.g., by including sample sex ratio in statistical models).
Despite the relatively small impact of sex-linked loci on population Ho, there was a significant impact on individual Ho that was large enough to erroneously indicate that YTH females were 5% less heterozygous than males (Table 5). This spurious significant difference could have mistakenly suggested that females are philopatric (which is not true in cassidix; Smales 2004) or that they experience less inbreeding depression for survival (the reverse is true in cassidix; Harrisson et al. 2019). Had these hypotheses not been known in advance to be incorrect, they might have been accepted or at least investigated further; thus, poor filtering of sex-linked loci can lead to incorrect ecological and evolutionary inferences and to wasted resources.
Our results also illustrated how the presence of sex-linked SNPs can obscure population structure. The first PC on EYR data showed population structure due to geographically separated groups. When sex-linked markers were not removed, however, the second PC simply captured the genetic differences between the sexes, obscuring the fact that the second largest source of genetic variation actually comes from within the Muckleford population (Figure 6). This masking of population structure by a few sex-linked loci has also been observed in the Discriminant Analysis of Principal Components (DAPC) of two species of lobsters (Benestan et al. 2017). If not checked against sex, the split into two groups along PC2 could have been interpreted as, for instance, the presence of two cryptic sympatric species. Researchers studying populations with little genetic variation should be particularly careful, because this effect is expected to be more pronounced for populations with low genetic differentiation.
Importantly, we found that failing to remove sex-linked loci led to ~11% fewer correct parentage assignments (Table 6). Such a substantial loss of correct assignments could have repercussions for the management of endangered species. For example, releases of captive-bred individuals and translocations/introductions are usually planned to avoid releasing close relatives in the same group, in order to maximize genetic diversity and discourage inbreeding (e.g., cassidix, Harrisson et al. 2016; Frankham et al. 2017). Removing sex-linked loci will be even more crucial in the absence of a set of known parentages with which to calibrate parentage analyses, as is likely to apply to many species of conservation concern, such as (i) those whose breeding season cannot be monitored because it occurs in inaccessible locations or because of lack of resources, (ii) polygamous and cooperative-breeding species, and (iii) those with external fertilisation, such as amphibian and fish species (Nakamura 2009). Accounting for sex-linked loci is also likely to have the largest impact on species with large sex chromosomes (including neo-sex chromosomes, which have been discovered in many taxa including EYR), because sex-linked loci will represent a large proportion of the potential genomic markers for parentage analysis (Sigeman et al. 2022; Beukeboom & Perrin 2014; Gan et al. 201).
The functions we propose were created with the needs of conservation genomicists and wildlife managers in mind. Sexing individuals is especially important for species without sexual dimorphism, or for sexually dimorphic species whose young cannot be sexed from their appearance. With the combination of the functions filter.sex.linked and infer.sex, we offer a formal statistical framework that systematically identifies and uses sex-linked loci to make sex assignments with as few as 15 known-sex individuals of each sex. Unlike current practices, infer.sex was designed to use the complementary information contained in all types of sex-linked loci available, which makes the sex assignments more robust. The use of all types of sex-linked loci will be advantageous for low-density marker datasets because it uses information that would otherwise be neglected, and it facilitates development of SNP panels that include sex-specific loci (Blåhed et al. 2018; Willis et al. 2020). It also allows for error-checking and confirming congruence between the genetic and phenotypic sex of individuals, which may assist in detecting cases of environmental sex reversal (Stelkens & Wedekind 2010). The separation of sex-linked loci can be used to validate the assembly of W and Y chromosomes, and to study sex-specific processes (e.g., natural selection, philopatry). Furthermore, it reduces the cost in time, genetic material and resources of using other sexing methods (e.g., PCR amplification of CHD1-Z and CHD1-W genes; Fridolfsson & Ellegren 1999).
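As an illustration of how the different types of sex-linked loci provide complementary evidence, the following Python sketch assigns sex to a single individual in a ZW system by letting each locus type cast a vote. It is not the infer.sex algorithm: the function name, the 0/1/2/missing genotype encoding and the cut-offs (0.5 and 0.1) are hypothetical choices for illustration only. An individual is assigned a sex only when all available locus types agree; otherwise it is flagged as uncertain, which is the kind of incongruence that could prompt a check of the phenotypic sex record.

```python
from typing import List, Optional, Sequence, Tuple

# Assumed encoding: 0 or 2 = homozygous, 1 = heterozygous, None = missing call
Genotype = Optional[int]


def _het_fraction(genos: Sequence[Genotype]) -> Optional[float]:
    """Heterozygous fraction of called genotypes, or None if nothing was called."""
    called = [g for g in genos if g is not None]
    return sum(g == 1 for g in called) / len(called) if called else None


def infer_sex_zw(w_linked: Sequence[Genotype],
                 z_linked: Sequence[Genotype],
                 gametologs: Sequence[Genotype]) -> Tuple[str, List[str]]:
    """Combine votes from three locus types for one individual (females ZW)."""
    votes: List[str] = []
    if w_linked:
        # W-linked loci should be scored in females and missing in males
        called = sum(g is not None for g in w_linked) / len(w_linked)
        votes.append("F" if called > 0.5 else "M")
    z_het = _het_fraction(z_linked)
    if z_het is not None:
        # Only ZZ males can be heterozygous at Z-linked loci
        votes.append("M" if z_het > 0.1 else "F")
    gam_het = _het_fraction(gametologs)
    if gam_het is not None:
        # Females carry divergent Z and W copies, so they look heterozygous
        votes.append("F" if gam_het > 0.5 else "M")
    if votes and len(set(votes)) == 1:
        return votes[0], votes
    return "uncertain", votes
```

Because each locus type is evaluated separately, the sketch still returns an assignment when one type is absent from a low-density dataset, and it reports the individual votes so that disagreements can be inspected.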
The function filter.excess.het provides a statistically backed method to identify artefactual multilocus SNPs that show abnormally high heterozygosity. The function circumvents the problem of choosing an arbitrary heterozygosity threshold by, instead, identifying loci whose heterozygosity is ≥ 0.5 and that also show a significant excess of heterozygotes beyond sampling error. This has the advantage of taking into account random sampling and genotyping errors that affect loci differently. In fact, this approach is available in VCFtools but not yet in dartR, snpR or SNPfiltR (Hohenlohe et al. 2011; Danecek et al. 2011; Mijangos et al. 2022; Hemstrom & Jones 2022; DeRaad 2022). Nonetheless, we would like to emphasize that this is not a Hardy-Weinberg equilibrium filter (which requires critical thinking to be correctly applied and interpreted; Waples 2015), and that it should be used only when looking to obtain neutral autosomal loci (cf. looking for signatures of selection).
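The idea of combining a heterozygosity threshold with a significance test can be sketched in Python as follows. The exact test used by filter.excess.het is not described here, so the sketch assumes a one-sided exact binomial test against a heterozygote frequency of 0.5, the maximum expected under random mating for a biallelic locus; the function names and the default `alpha` are likewise illustrative assumptions.

```python
from math import comb


def binom_upper_tail(k: int, n: int, p: float = 0.5) -> float:
    """Exact P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


def has_excess_het(n_het: int, n_called: int, alpha: float = 0.05) -> bool:
    """Flag a locus only if heterozygosity is >= 0.5 AND significantly above 0.5.

    The null proportion of 0.5 and `alpha` are assumptions for this sketch,
    not documented settings of any published function.
    """
    if n_called == 0:
        return False
    if n_het / n_called < 0.5:
        return False  # below the threshold, so never tested
    return binom_upper_tail(n_het, n_called) < alpha
```

With 30 genotyped individuals, a locus with 16 heterozygotes (observed heterozygosity 0.53) is not flagged because that count is well within sampling error of 0.5, whereas a locus with 25 heterozygotes is flagged. A fixed cut-off at 0.5 would have removed both.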
In conclusion, we demonstrated how incomplete removal of sex-linked loci can bias conservation genomic inferences. We argue that comprehensively removing sex-linked loci should be best practice when handling genomic data, and we offer convenient, easy-to-use resources to automate this and other bioinformatic steps. The functions presented here can be integrated into bioinformatic pipelines and widely used R packages such as dartR, sambaR, SNPfiltR and snpR. By developing functions that can be easily adopted by conservation biologists and incorporated in wildlife management workflows, this study will contribute to a better understanding of the processes occurring in threatened species, such as inbreeding, inbreeding depression and population structure.