Material & Methods

Bait development

For an overview over the whole bait development process, see Figure 1.

Transcriptome assemblies

To target expressed genomic regions relevant for investigating signature of selection, genome-guided transcriptome assemblies were used for bait design. Raw Illumina RNA-Seq data from blood, epidermis, subdermis, muscle and heart tissue of the white shark (Carcharodon carcharias ) were obtained from Richards et al. (2013), Marra et al. (2017) and Marra et al. (2019). The data was quality checked with FastQC (v. 0.11.5, Babraham Bioinformatics 2010) and detected adapters trimmed via Trimmomatic (v. 0.39, Bolger et al. 2014). Paired end (PE) data were filtered for TruSeq3-PE-2 adapters in palindrome mode, minimum adapter length of 1 bp and retention of both reads, with the addition of TruSeq2-SE-adapter clipping for data obtained from Marra et al. (2019). Single end (SE) data were filtered for TruSeq3-SE adapters.
Genome-guided transcriptome assemblies were constructed via Scallop (Shao and Kingsford 2017) and StringTie (Pertea et al. 2015; Kovaka et al. 2019). For assembly, RNA reads were aligned to the white shark reference genome (Marra et al. 2019) in four data sets: Specimen-specific and data type-specific (SE vs. PE), via the splice-aware mapper STAR (v. 2.7.5a_2020-06-29, Dobin et al. 2013). Genome indices were generated with 150 bp splice junction overhang and usage of the parent GFF attribute to assign the exons to the transcripts. Sorted bam-files were then generated in basic two-pass mode, including the XS field. Splice junction overhang was again 150 bp for all data sets and the parent GFF attribute was used.
Transcripts of the individual, aligned data sets were assembled in StringTie (v. 2.1.4, Pertea et al. 2015; Kovaka et al. 2019), and the results were merged via StringTie’s merge mode. Furthermore, Scallop (v. 0.10.5, Shao and Kingsford 2017) was used to assemble transcripts of each aligned data set, using default options for unstranded library types. The resulting four Scallop-assemblies were then merged via TACO (v. 0.7.3, Niknafs et al. 2017), with no minimum length or expression filtering applied.
Expressed exon sequences were extracted from gtf to fasta-format via the software gffread (v. 0.12.1, Pertea and Pertea 2020). Assemblies were then compared for quality via the tool rnaQUAST (v. 2.1.0, Bushmanova et al. 2016), using the PE read data. Estimation of the presence of Benchmarking Universal Single-Copy Orthologs (BUSCO) for vertebrates (Seppey et al. 2019) were included in the analysis. Genes and transcripts were not inferred in the process, as provided via the genome annotation. After analysis, the Scallop/TACO-derived assembly was chosen for further downstream analysis because the assembly covered more annotated genes as well as vertebrate BUSCOs than the StringTie-derived assembly.

Sequence selection for probe set design

To reduce the probability of inclusion of paralogues into the bait design, multiple filtering steps were applied to the assembled transcriptomes. For this, transcripts were aligned to the genome and to themselves with minimap2 (v. 2.17-r941, Li 2018). First, the transcripts were aligned to the genome in splice-aware mode, with parameters that in theory allow for up to 33%, but in practice for up to 15%, sequence divergence (personal communication by the software’s authors). Paralogous genes with 15% divergence or less should therefore be covered by alignments from the same transcript. Up to 1 Mio secondary alignments to the genome were included to allow for hits to putative paralogous genomic loci. In a second step, all transcripts were aligned to each other, not allowing for splicing but instead for up to 20% of sequence divergence, and again including up to 1 Mio secondary hits to catch as many paralogues as possible.
Continuing from the alignment output, sequences for the target capture probe design were selected via a custom-written Python 3 script (v. 3.8.5, Rossum and Boer 1991). In a first step, potential paralogous sequences were detected, to be later excluded from the probe set. First, all transcripts that aligned to more than one locus in the genome were marked as possible paralogues. Furthermore, two transcripts mapping to each other were also marked as possible paralogues if the two transcripts did not overlap in their mapping locus in the genome. If one, or both of two transcripts mapping to each other, did not map to the genome at all, both were declared as possible paralogues. Finally, all transcripts that mapped to a potential paralog were themselves marked as potential paralogues in a last, repetitive step.
Annotated genes were then selected to be included in the probe design if they corresponded to a transcript, therefore being acknowledged as expressed, but excluded if they corresponded to a transcript previously identified as a potential paralogue, therefore excluding putative paralogue gene copies. Next, to avoid genomic linkage between loci included in our design, all selected genes from the same scaffold were filtered for being located at least 250,000 bp apart. Then, sequences were cut to their first 200 bases and sequences with two or more consecutive missing nucleotides were discarded. 49 additional genes annotated as part of a major histocompatibility complex (MHC) were included in the bait set with their whole assembled nucleotide sequence (Marra et al. 2019). Another four full genes suggested to be under positive selection in some shark species were included, two of them possibly involved in reproduction (VAMP4 and TCTEX1D2 ) and two in brain development (YWHAE and ARL6IP5 ) (Swift et al. 2016).
Baits were designed as a myBaits Custom DNA-Seq kit by Arbor Biosciences (Ann Arbor, USA) with 80-mer probes and a tiling density of 3x, as recommended for fresh as well as degraded DNA.

Capture probe testing

DNA extraction & library preparation

To test the efficiency of the designed baits, three individuals each of bull sharks (Carcharhinus leucas ), tope (Galeorhinus galeus ), porbeagle (Lamna nasus ), shortfin mako (Isurus oxyrinchus ) and spurdog (Squalus acanthias ), as well as nine basking sharks (Cetorhinus maximus ) and twelve white sharks (Carcharodon carcharias , Table 1 and Supplementary File 1) were used.
All sharks were sampled either via muscle plugs or fin clips, except for the basking sharks from which skin mucus swabs were taken. Genomic DNA was extracted following a standard phenol-chloroform extraction approach (Sambrook et al. 1989) or with the DNeasy Blood & Tissue Kit (QIAGEN N.V., Venlo, The Netherlands), including an over-night incubation step. In the following, DNA was fragmented via sonication with a M220 Focused-ultrasonicator (Covaris, LLC, Woburn, USA) to a target peak length of 350 bp and libraries were prepared using the NEBNext® Ultra™ II DNA Library Prep Kit for Illumina (New England Biolabs, Inc., Ipswich, USA) and 8 nucleotide long NEBNext® Multiplex Oligos for Illumina (New England Biolabs, Inc., Ipswich, USA) with 530 ng as a starting amount. Library quality and quantity was checked on an Agilent Tape Station 4150 (Agilent Technologies, Inc., Santa Clara, USA). Three target gene capture reactions were performed according to manufacturer´s high sensitivity hybridization protocol v. 5.01, by pooling twelve individual per capture reaction and using equal starting amount of each individual library within each capture reaction (namely 500 ng, 400 ng, and 300 ng). The white shark samples were pooled in one single pool, while the individuals of the other species were randomly mixed between two pools, so that each reaction contained representatives of each of the six species. Hybridization and wash temperatures were 60 °C to account for bait- and target sequence divergence, and hybridization was performed over 72 hours. The captured libraries were sequenced together on an Illumina NextSeq 500 platform (Illumina, Inc., San Diego, USA) with a 2x150 bp mid-output NextSeq 500/550 v. 2.5. kit (Illumina, Inc., San Diego, USA). After sequencing, reads were demultiplexed with bcl2fastq (v. 2.20.0.422, Illumina, Inc., San Diego, USA).

Data analysis

Sequencing data was analyzed for technical artefacts with FastQC (v. 0.11.9, Babraham Bioinformatics 2010). Poly-G tails were then trimmed with cutadapt (v. 3.5, Martin 2011) via the option “nextseq-trim”, and a phred threshold of 20. Trimmomatic (v. 0.39, Bolger et al. 2014) was then used in palindrome paired end mode to remove internally provided TruSeq3-PE-2 adapters and low-quality base calls, with a maximum seed mismatch count of 2, a palindrome clip threshold of 30, a simple clip threshold of 10, and minimum detected adapter length threshold of 1. Both reads were retained after adapter clipping. Reads were then cleaned of low quality, with a phred threshold of 20 for clipping bases at both read ends, and for trimming within a sliding window of 4 bp. All reads with a minimum length of 36 bp were retained.
Read mapping, cleaning and Single Nucleotide Polymorphisms (SNPs) calling was performed following a pipeline developed elsewhere (Choquet et al. 2019; Choo et al. 2020), which was adapted to the needs of this study. Reads were mapped to the white shark reference genome (NCBI accession number GCF_017639515.1, Rhie et al. 2021; Sayers et al. 2022) using BWA-MEM (v. 0.7.17-r1188, Li 2013). Reads with ambiguous alignments were then removed via SAMtools (v. 1.10, Danecek et al. 2021), with filtering for reads with multiple mapping loci (XA flag), reads that produced chimeric alignments (SA flag), non-mapping reads, reads with secondary or supplementary alignments, and duplicated reads (flag 3332, “UNMAP,SECONDARY,DUP,SUPPLEMENTARY”). In addition, duplicate reads were removed and all remaining reads grouped into one single read group via Picard (v. 2.26.6, Broad Institute 2019).
SNPs were then called and processed using the Genome Analysis Toolkit (GATK, v. 4.2.3.0, McKenna et al. 2010), following the project’s best practice recommendations (Van der Auwera and O’Connor 2020) and sped up using GNU Parallel (v. 20161222, Tange 2011). Variants were called for each individual with HaplotypeCaller and afterwards combined for each species separately with CombineGVCFs. Species-specific joined genotyping was performed using the GenotypeGVCFs tool, allowing for a maximum of 18 alternative alleles. SNPs were then extracted with SelectVariants, and, after visualization of quality metrics via the VariantsToTable tool and R (v. 4.1.2, R Core Team 2019), hard filtered with VariantFiltration. Lenient filtering thresholds were used, with one filter expression of QualByDepth (QD) < 2.0, FisherStrand (FS) > 60.0, RMSMappingQuality (MQ) < 50.0, MappingQualityRankSumTest (MQRankSum) < -5.0 or ReadPosRankSumTest (ReadPosRankSum) < -5.0, in accordance with parameters of previous studies by our group (Choo et al., 2020; Choquet et al., 2019).
Afterwards, SNPs were filtered to retain only biallelic SNPs, using a custom Python3 script (Supplementary File 2, Python3 v. 3.8.5, van Rossum and de Boer 1991) and PLINK2 (v. 2.00a3, Chang et al. 2015) for allele frequency computation. In addition, only SNPs present in at least 80% of the genotypes were retained, as well as SNPs with a minimum read coverage of 5, using VCFtools (v. 0.1.16, Danecek et al. 2011).
Furthermore, to allow for better comparison between data sets, SNP numbers were calculated for target regions only. For this, the bait sequences were aligned to the white shark reference genome from NCBI (accession number GCF_017639515.1, Rhie et al. 2021; Sayers et al. 2022) with Bowtie 2 (v. 2.4.5, Langmead and Salzberg 2012) in end-to-end mode. All target loci with multiple alignments were then excluded (“XS:i” flag). The resulting coordinate set, produced with BEDTools (v. 2.27.1, Quinlan and Hall 2010), was used in the filtering steps described above to keep the detection of SNPs within the actual target regions.