INTRODUCTION
Population genetic datasets are a rich source of information for wildlife managers (Hoffmann et al. 2015; Hohenlohe et al. 2021). They provide data on genetic structure, adaptation and evolutionary trajectories of species and populations (e.g., local adaptation, hybridization, population dynamics, evolutionary potential; Willi et al. 2021). They can reveal biological and ecological processes that could not otherwise be studied (e.g., mating systems, sex-specific dispersal and gene flow; Ellegren 2014; Amos et al. 2014). In addition, they help to identify genetic problems in small populations—notably loss of genetic diversity, inbreeding, inbreeding depression—and develop simple and cost-effective management solutions towards their conservation (e.g., genetic augmentation, genetic rescue; Frankham et al. 2017; Harrisson et al. 2019; Kardos 2021).
With the massive amount of genomic data that can be generated, the level of expertise in bioinformatics required for analysing genomic datasets has increased (MacMahon et al. 2014; Holderegger et al. 2019; Hohenlohe et al. 2020). Conservation geneticists spend a great part of their time learning the use of new software, which reduces their availability to engage in other important activities needed to bridge the gap between research and conservation practice (e.g., facilitating communication with wildlife managers, building relationships with primary industry, informing and shaping policy; Galla et al. 2016; Taylor et al. 2017; Britt et al. 2018). Accordingly, there is much interest in creating easy-to-use resources to automate and streamline dataset filtering and genomic analyses. This has included the development of packages forR, which tends to be a more welcoming environment for biologists than does command-line software (e.g., dartR : Gruber et al. 2018, Mijangos et al. 2022; SambaR : de Jong et al. 2021; snpR : Hemstrom & Jones 2022; SNPfiltR: DeRaad 2022; Hogg et al. 2022).
Most population genetic analyses assume autosomal loci; thus best-practice filtering includes removal of sex-linked loci from SNP datasets. If sex-linked loci are not removed, estimates of population genetic diversity such as heterozygosity, Wright’s fixation indices including F IS, polymorphism, and allelic richness may be biased depending on the sex ratio of the sample and the sex-chromosome-to-autosome diversity ratio (Ellegren 2009; Allendorf et al. 2022; Frankham et al. 2017). Assessment of population genetic structure also benefits from the removal of sex-linked loci because they can mask genetic structure that is due to evolutionary processes (e.g., gene flow, natural selection, genetic drift; Pritchard et el. 2000; Radosavljević et al. 2015; Benestan et al. 2016). Similarly, parentage analyses assume autosomal Mendelian inheritance and so their accuracy can be affected by the presence of sex-linked loci because they create apparent genetic mismatches between true parent-offspring pairs (Jones & Wang 2010). On the other hand, focusing on sex-linked markers can help assign sex to individuals of sexually-monomorphic species, as well as reveal interesting patterns of sex-specific ecology and evolution (e.g., natural selection, philopatry; Castella et al. 2001; Pavlova et al. 2013; Arnold & Wilkinson 2015). Thus, correct identification of sex-linked loci is important for making appropriate management recommendations.
In animal species, the two most common chromosomal sex-determination systems are XY and ZW. In an XY system, typical for mammals and some insects, males are the heterogametic sex with one X and one Y chromosome, and females are the homogametic sex with two X chromosomes. In contrast, in the ZW system, typical for birds, and some reptiles and insects, females are heterogametic (ZW) and males homogametic (ZZ) (Beaukeboom & Perrin 2014). The SNP markers on sex chromosomes can be classified into three types with different inheritance and characteristics (Figure 1):
  1. Those present only on the W or Y chromosome (hereafter ‘W-linked/Y-linked’; Figure 1 in yellow). In SNP datasets, such markers are called only in the heterogametic sex and are missing in the homogametic sex.
  2. Those present only on the Z or X chromosome (hereafter ‘Z-linked/X-linked’; Figure 1 in orange). In SNP datasets, the heterogametic sex possesses only one allele (i.e., they arehemizygous ) and individuals appear homozygous when genotyped. The homogametic sex, which possesses two alleles, can be heterozygous or homozygous as for an autosomal locus.
  3. Those present in homologous regions of both sex chromosomes, Zand W or X and Y, and similar enough to be considered alleles of the same locus (hereafter ‘gametologs’, Figure 1 in green). In some cases, gametologous loci have one allele that is found exclusively on one sex chromosome while the other allele appears exclusively on the other. As a result, all members of the heterogametic sex appear heterozygous, and the homogametic sex homozygous. These loci are known as ‘fixed’ gametologs and are typical of old sex chromosomes. In other cases (i.e., in recently evolved [neo-] sex chromosomes), the ‘Z-allele’ (or ‘X-allele’) is still found on some versions of the W (or Y) chromosome, and thus, some individuals of the heterogametic sex are homozygous. In these cases, the gametologs are ‘non-fixed’.
The simplest way to distinguish sex-linked loci from autosomal ones is to identify those found in reads that mapped to the sex chromosomes of the reference genome. However, this is not possible when (i) a reference genome is not available–as is the case for most wildlife species–andde novo genotyping is required, (ii) there is little conserved synteny between the studied genome and the reference, or (iii) the W/Y chromosome of the reference genome is fragmented into numerous unmapped scaffolds, as is common in many genome projects (Carvalho & Clark 2013).
Some methods to identify sex-linked SNPs have been developed. MendelChecker, for example, uses the deviation from Mendelian inheritance to calculate the probability that a specific SNP is sex-linked, with the disadvantage that it requires genotype probabilities and pedigree information for analysis (Chen et al. 2014). Other methods use a set of individuals of known sex to test whether the allele frequencies of a given locus differ between the sexes. For instance, RADSEX is a command-line software that uses identical raw reads as non-polymorphic markers and uses their presence or absence in males and females to identify those significantly associated with sex (Feron et al. 2021). Some other studies have identified sex-linked markers by testing for differentiation between the sexes usingF ST, but this approach can be used only for Z-linked/X-linked and gametologous loci (Benestan et al. 2017; Drinan et al. 2018; Trenkel et al. 2020). Function gl.report.sexlinked fromdartR package (v2; Mijangos et al. 2022) uses arbitrary heterozygosity thresholds as default parameters to identify fixed gametologs, and can be used to identify non-fixed gametologs and Z-linked/X-linked loci by fine-tuning parameters (Pavlova et al. 2022). Nevertheless, this approach has the disadvantages that there are no clear instructions on how to tune parameters, the user has to manually adjust thresholds on a trial-and-error basis for each genomic dataset, and its precision declines with heterozygosity, risking either the erroneous removal of autosomal loci with rare alleles or the failure to remove sex-linked loci with low heterozygosity. Overall, these methods could be improved upon by developing an intuitive statistical approach that systematically identifies and distinguishes among types of SNPs (autosomal, W-linked/Y-linked, Z-linked/X-linked, and gametologs) that is automated in a ready-to-use R function with little user intervention needed.
In the same way that it is possible to use a set of known-sex individuals to identify sex-linked loci, the opposite is also possible: use a set of known sex-linked loci to identify the sex of an individual. Sex-assignment is usually done utilizing a handful of sex-linked loci of only one type (Trenkel et al. 2020). For example, if using non-fixed ZW-gametologs (for which heterozygous individuals are never male), an individual is declared female if it is heterozygous for at least one locus, yet by chance, depending on allelic frequencies and the number of evaluated gametologs, some females may not be heterozygous for any of the loci. Similarly, depending on genotyping error rates, some males may appear heterozygous for some loci. To our knowledge, despite the rich information that the three types of sex-linked loci contain to improve sex-assignment, the comparison of their information is rarely, if ever, done. Thus, the sexing of individuals using large SNP datasets can benefit from a methodical procedure that uses the information from all available sex-linked loci and that can be integrated as a standard step in bioinformatics pipelines.
Another best-practice during filtering autosomal and sex-linked datasets is minimizing the presence of ‘multilocus’ SNPs (also known as multilocus contigs, multicopy loci or homeologs; Hohenlohe et al. 2011; Willis et al. 2017; O’Leary et al. 2018). These artefactual SNPs arise during bioinformatic processing of raw reads and are product of erroneously fusing two physically separate loci that are very similar, either because they are paralogs, repetitive elements or otherwise very much alike. Because a multilocus SNP is actually two loci, multilocus SNPs tend to present abnormally high read depths. This characteristic allows their removal by setting a maximum read depth threshold during filtering (usually twice the mode or mean; Willis et al. 2017). In some cases, there are fixed or near-fixed differences between the artificially fused loci, which makes multilocus SNPs exhibit heterozygosity well-above the expectation of 0.5 for biallelic markers at Hardy-Weinberg proportions. As a consequence, these SNPs can inflate estimates of heterozygosity (O’Leary et al. 2018). A common practice to identify these artefactual loci using heterozygosity is to set an arbitrary maximum threshold (e.g., heterozygosity ≥ 0.6). It has been found that using more than one approach to identify multilocus SNPs–and removing those that are flagged by any method—constitutes the best strategy (Willis et al. 2017).
Lastly, parentage analyses and sibship reconstruction using molecular markers have great relevance in wildlife conservation. Resolving unknown parent-offspring relationships gives insights into the behaviour, ecology and evolution of plant and animal populations (e.g., extra-pair mating, inbreeding avoidance, dispersal, natural selection, effective population size; Flanagan & Jones 2018). Their application extends into very practical instances such as monitoring the success of translocations and genetic rescue, and spotting illegal trade of wild individuals (Fitzpatrick et al. 2020; Van Rossum 2022; Mucci et al. 2020). Moreover, captive breeding programs also benefit from parentage analyses that allow them to estimate founder relationships (typically assumed unrelated), and validate pedigrees and correct errors (Moran et al. 2021; Overbeek et al. 2020; Galla et al. 2021). Among the variety of parentage analysis software in existence, one of the most popular is COLONY, which simultaneously infers sibship and parentage, and can handle thousands of SNPs (Jones & Wang 2010). However, handling large amounts of genetic data in order to format it into the specific input file for COLONY requires some degree of bioinformatics expertise (Flanagan & Jones 2018). Often, researchers need to create different input files because several runs are usually required to maximize detection of true relationships. This can be a time-consuming task worth automating.
In this study, we aim to create four R functions to assist researchers analyzing reduced-representation genomic datasets. The study consists of two parts. First, we describe four R functions that we designed to automate common tasks in conservation genomic studies: (1) identify and remove sex-linked loci (functionfilter.sex.linked ), (2) use sex-linked loci to identify the genetic sex of individuals (function infer.sex ), (3) filter out excessively heterozygous loci that are likely to be genotyping errors (function filter.excess.het ), and (4) create input files for parentage analyses in COLONY (function gl2colony ). Second, we use the four new functions to process genetic datasets for two bird species and show how incomplete removal of sex-linked loci affects downstream analyses of (i) population genetic diversity, (ii) individual heterozygosity, (iii) population genetic structure, and (iv) parentage.