Introduction
De novo mutations (DNMs) arise from mutational events that occur
during gametogenesis in the germ cells of either parent, or postzygotically in
both somatic and germ cells of the individual carrying them. On average,
one to two DNMs can be found in the coding region of a person’s genome
(Durbin et al., 2010; O’Roak et al., 2011; Xu et al., 2011). DNMs are of
particular significance due to their contribution to many diseases and
genetic disorders, notably those affecting individual fitness such as
intellectual disability and male infertility (Awadalla et al., 2010;
Veltman and Brunner, 2012; Gilissen et al., 2014; Acuna-Hidalgo et al.,
2016; Taylor et al., 2019; Oud et al., 2022). It has been shown that
approximately 80% of DNMs are of paternal origin (Kong et al., 2012;
Goldmann et al., 2016; Yuen et al., 2016; Oud et al., 2022). A major
factor known to contribute to an increase in DNMs in individuals is
advanced parental age at the time of conception, particularly paternal
age (Kong et al., 2012; Goldmann et al., 2016). Investigating the
parental origin and timing of DNMs not only provides biological insight
into how these DNMs arise and how they come to underlie genetic
disorders, but has also been shown to be important for determining the
recurrence risk of these disorders (Campbell et al., 2014; Almobarak et
al., 2020).
Phasing analysis interrogates the
diploid genome, allowing allele separation of the parental chromosomes.
This not only helps to determine the parental origin and timing of DNMs,
but is also critical for identifying compound heterozygous mutations and
for investigating allele-specific expression, linked variants, and structural
variation (Tewhey et al., 2011; Soifer et al., 2020; Ebert et al.,
2021). With short-read whole genome sequencing (WGS) of parent-offspring
trios, 15-20% of DNMs can be successfully phased and their parent-of-origin
called (Goldmann et al., 2016). However, this percentage is expected to
be even lower in whole exome sequencing (WES). These phasing challenges can be
attributed to limited sequencing read lengths, the presence of
intronic gaps, and the reduced amount of genetic variation in exonic
regions compared to intronic regions (Frigola et al., 2017). By
definition, germline DNMs are absent from the parental somatic
cells, so their discovery requires exome or genome sequencing of
parent-offspring trios. In a next step, the
parent-of-origin and zygosity of a DNM can be identified by targeted
amplification and long-read sequencing of a region spanning the DNM as
well as one or more parentally informative single nucleotide
polymorphisms (iSNPs). While this appears straightforward, long-read
sequencing is subject to both random and positional errors, which may result in
false variants being used for phasing, reducing the reliability of downstream
analysis (Magi et al., 2018; Watson and Warr, 2019).
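The read-based phasing logic described above can be illustrated with a minimal Python sketch. It is not the workflow used in this study; the function names (`allele_origins`, `phase_dnm`), the input layout (one `(DNM allele, iSNP allele)` pair per read) and the `min_fraction` agreement threshold are all hypothetical choices made for illustration. An iSNP is treated as informative only when the child is heterozygous and the parental genotypes allow exactly one assignment of the child's two alleles to father and mother.

```python
from collections import Counter


def allele_origins(child, father, mother):
    """Map each child allele at an iSNP to 'paternal' or 'maternal'.

    Genotypes are 2-tuples of alleles. Returns None when the site is not
    parentally informative (child homozygous, or both assignments possible).
    """
    a, b = child
    if a == b:
        return None
    # Keep every assignment (paternal allele, maternal allele) that is
    # compatible with the parental genotypes.
    feasible = [
        {p: "paternal", m: "maternal"}
        for p, m in ((a, b), (b, a))
        if p in father and m in mother
    ]
    return feasible[0] if len(feasible) == 1 else None


def phase_dnm(reads, dnm_alt, origins, min_fraction=0.8):
    """Assign a parent-of-origin from reads spanning both DNM and iSNP.

    reads: iterable of (dnm_allele, isnp_allele) pairs, one per read.
    min_fraction: assumed threshold guarding against sequencing errors.
    """
    # Only reads carrying the DNM allele vote, via their iSNP allele.
    votes = Counter(
        origins[isnp]
        for dnm, isnp in reads
        if dnm == dnm_alt and isnp in origins
    )
    total = sum(votes.values())
    if total == 0:
        return "unphased"
    parent, count = votes.most_common(1)[0]
    return parent if count / total >= min_fraction else "unphased"


# Toy example: father A/A, mother A/G, child A/G, so G is maternal.
origins = allele_origins(("A", "G"), ("A", "A"), ("A", "G"))
reads = [("T", "A")] * 9 + [("T", "G")] + [("C", "G")] * 10
result = phase_dnm(reads, dnm_alt="T", origins=origins)
```

In this toy example nine of the ten reads carrying the DNM allele also carry the paternal iSNP allele, so the DNM is assigned to the paternal haplotype; the single discordant read stands in for the kind of sequencing error that the threshold is meant to tolerate.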
There are numerous methodologies to target genomic regions for
enrichment prior to sequencing, with the majority being divided into
PCR- or CRISPR-based approaches (Hafford-Tear et al., 2019; Gilpatrick
et al., 2020; Player et al., 2020). Importantly, when sequence
data from CRISPR targeting approaches are mapped to the reference genome,
off-target mapping is several-fold greater than with PCR-based
methods, and target coverage is therefore often many-fold lower
(Hafford-Tear et al., 2019; McDonald et al., 2021); costs per target
are also significantly higher.
Inherent challenges also exist with
PCR approaches, including the presence of inhibitors, variable target
length, optimisation time, amplification bias, and nucleotide errors
(Potapov and Ong, 2017; Shagin et al., 2017). However, despite these
challenges with long-range PCR enrichment, the approach is arguably more
effective for scaled-up targeted phasing at present.
This study aims to identify the parent-of-origin and zygosity of 109
distinct DNMs, previously identified in infertile men by patient-parent
trio exome sequencing (Oud et al., 2022), using a targeted long-range
PCR approach for phasing. Targeted amplification of regions
encompassing each unique DNM is performed using an optimised long-range
PCR workflow designed to quickly increase PCR success rates and reduce
pre-sequencing base error. The combination of patient-parent trio exome
data, targeted Oxford Nanopore Technologies (ONT) sequencing and
validated DNMs is used to improve
phasing and allele frequency confidence. Critical aspects of the process
are assessed to ascertain the practicality of the approach for
large-scale use, including amplification length, long-read sequencing error
rates and overall phasing performance.