Annotating and prioritising genomic variants using the Ensembl Variant Effect Predictor - a tutorial
Benjamin Moore, Sarah E Hunt, M. Ridwan Amode, Irina M Armean, Diana
Lemos, Aleena Mushtaq, Andrew Parton, Helen Schuilenburg, Michał Szpak,
Anja Thormann, Emily Perry, Stephen J Trevanion, Paul Flicek, Andrew D
Yates, Fiona Cunningham
European Molecular Biology Laboratory, European Bioinformatics
Institute, Wellcome Genome Campus, Hinxton, Cambridge, CB10 1SD, United
Kingdom
Grant numbers
Ensembl Variation Resources receive funding from the Wellcome Trust
(grant numbers WT108749/Z/15/Z, WT200990/Z/16/Z, WT201535/Z/16/Z,
WT212925/Z/18/Z), the BBSRC (BB/S020152/1) and the European Molecular
Biology Laboratory. This project has also received funding from the
European Union’s Horizon 2020 research and innovation programme under
grant agreement n°825575.
Abstract
The Ensembl Variant Effect Predictor (VEP) is a freely available, open
source tool for the annotation and filtering of genomic variants. It
predicts variant molecular consequence using the Ensembl/GENCODE or
RefSeq gene sets. It also reports phenotype associations from databases
such as ClinVar, allele frequencies from studies including gnomAD, and
predictions of deleteriousness from tools such as SIFT and CADD. Ensembl
VEP includes filtering options to customise variant prioritisation. It
is well supported and updated roughly quarterly to incorporate the
latest gene, variant and phenotype association information.
Ensembl VEP analysis can be performed using a highly configurable,
extensible command-line tool, a Representational State Transfer (REST)
application programming interface (API) and a user-friendly web
interface. These access methods are designed to suit different levels of
bioinformatics experience and meet different needs in terms of data
size, visualisation and flexibility. In this tutorial, we will describe
performing variant annotation using the Ensembl VEP web tool, which
enables sophisticated analysis through a simple interface.
Keywords
Variant annotation, filtering, VEP, “molecular consequence”, variant
prioritisation
Main Text
Introduction
Genome and exome sequencing are becoming routine in clinical research
and diagnostic settings, as an individual’s genotype may provide insight
into disease mechanism, progression and treatment. Each sequenced genome
contains 4.1 to 5.0 million variant sites (1000 Genomes Project
Consortium et al., 2015), many of which will be rare but benign alleles,
so additional information is required to enable variant interpretation
and prioritisation. As the scale of data production increases, robust
and efficient software tools are needed to support variant annotation
and filtering.
Variant interpretation requires i) the mapping of variants to
transcripts and predictions of molecular consequence; ii) the
consideration of all current knowledge relating to a variant; and iii)
the application of predictive algorithms to evaluate the impact of
change at the locus. Appropriate resources are available: the reference
gene sets are regularly updated; the number of assertions of phenotype
association in the literature and in key databases continues to grow;
population frequency studies are expanding to include more individuals
and report more detailed catalogues of rare variants; and variant
pathogenicity prediction is an active area of tool development.
In the Ensembl Project (Howe et al., 2021) we create high-quality gene
sets, predict genomic regions involved in gene regulation and collate
large-scale sets of variant and phenotype association data. Ensembl VEP
(McLaren et al., 2016) builds on these resources and integrates results
from variant assessment algorithms to enable convenient but extensive
variant annotation. We provide regular updates, approximately every 3
months, to both the VEP software and associated data to ensure the
latest information can be used for analysis. Here we present a tutorial
describing the Ensembl VEP web interface, detailing the available
analysis options and filters.
Tutorial
Data Input
Navigate to the Ensembl VEP homepage by clicking on the ‘VEP’ link in
the blue navigation bar on the Ensembl homepage
(https://www.ensembl.org/index.html).
The Ensembl VEP homepage links to the three different VEP interfaces and
detailed documentation. Click on ‘Launch VEP’ to open the web form,
which is divided into sections for data input and optional analysis
configuration (Figure 1).
The human GRCh38 assembly is selected by default, but a link provides
access to a dedicated GRCh37 tool. Other species can be selected using
the ‘Add/remove species’ option. To make the management of multiple
analyses simpler, a name can be assigned to the job.
Data can be input by (1) pasting into the text box, (2) uploading a file
or (3) providing a URL for a file on a public server. The text box is
suitable for small-scale datasets. To analyse a larger dataset, provide
a URL or use the file upload option which supports a maximum file size
of 50 megabytes (or around 2 million lines in a compressed VCF).
Ensembl VEP supports a range of data input formats including:
- variant call format (VCF);
- Human Genome Variation Society (HGVS) descriptions (den Dunnen et al.,
2016), using Ensembl, RefSeq or LRG accessions;
- variant identifiers (from databases including dbSNP, ClinVar and
UniProt);
- ambiguous gene-based descriptions often used in literature (for
example ‘BRCA2:p.Val2466Ala’).
VCF is the standard exchange format used in next-generation sequencing
pipelines so Ensembl VEP is optimised to analyse variants in this
format.
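The input formats listed above can be told apart with a few simple checks. The sketch below is illustrative only: the patterns are assumptions chosen for this example and are not Ensembl VEP's own format detection, which is more thorough. It recognises only dbSNP-style rsIDs among variant identifiers, and it gives HGVS descriptions and gene-based descriptions the same label because both use the same `prefix:type.change` shape.

```python
import re

# Illustrative patterns only (assumptions for this sketch, not VEP's parser).
RSID = re.compile(r"^rs\d+$")                    # dbSNP-style identifier
HGVS = re.compile(r"^[^\s:]+:[cgmnpr]\.\S+$")    # accession-or-gene:type.change


def guess_input_format(line: str) -> str:
    """Return a rough label for one line of variant input."""
    text = line.strip()
    # Split on any whitespace, since pasted VCF lines often lose their tabs.
    fields = text.split()
    # A VCF data line has at least CHROM, POS, ID, REF, ALT with an integer POS.
    if len(fields) >= 5 and fields[1].isdigit():
        return "vcf"
    if RSID.match(text):
        return "variant_identifier"
    if HGVS.match(text):
        return "hgvs_like"
    return "unknown"


# Example lines, given for illustration only.
for example in ["1 230710048 rs699 A G . . .",
                "rs699",
                "ENST00000366667.6:c.803C>T",
                "BRCA2:p.Val2466Ala"]:
    print(example, "->", guess_input_format(example))
```

A check of this kind can be useful for catching mixed-format input before submission, since a single job is easier to interpret when all lines use one format.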
Transcript set selection
Predicting the molecular consequence of a genomic variant is an
essential step in interpretation and requires extensive, accurate gene
annotation. There are two commonly used human gene sets: Ensembl/GENCODE
(Frankish et al., 2021) and RefSeq (O’Leary et al., 2019). The two sets
are generated using similar but not identical evidence and algorithms,
and so differ slightly. VEP can analyse variants using either gene set,
the two sets combined, or GENCODE Basic (which contains a small subset
of representative transcripts for each gene). Select your preference in
the ‘Transcript database to use’ section (Figure 1).
The VEP algorithm compares each variant to each transcript in the
selected set and reports the relative transcript location of the variant
(for example exonic, upstream) with any predicted molecular consequence
(for example missense, frameshift). Consequences are described using
Sequence Ontology terms (SO; Cunningham et al., 2015) to enable
comparison and integration with results from other systems.
Transcript-related identifiers
HUGO Gene Nomenclature Committee (HGNC) gene symbols, versioned
transcript accessions and transcript types (for example: AGT,
ENST00000366667.6, protein coding respectively) are returned by default.
Use the ‘Identifiers’ section (Figure 2) to add further information,
including Ensembl or RefSeq protein identifiers, UniProt protein
accessions and HGVS variant descriptions at protein and transcript level
to your output.
Frequencies and citations
With over seven hundred million variants in dbSNP (version 154, May
2020) alone, the majority of variants found in an individual will have
already been described. This information can be crucial to
interpretation. Ensembl VEP searches databases including dbSNP, COSMIC
and HGMD and reports any variants at the same location as your input
variants. For databases with redistribution restrictions, variants are
matched on location alone (i.e., with no allele specificity) and names
are reported. For fully open databases, variants are matched by allele
and key additional information is reported. By default, we only report
matches to variants passing our quality filtering (for example, those
mapping to multiple genomic locations are excluded); to include all
variants in the search check the ‘Include flagged variants’ option.
In rare disease studies it is useful to filter out variants using
reference population frequencies, as variants common in the general
population are less likely to be causative. Use the ‘Variants and
frequency data’ section (Figure 3) to select the reference
dataset to be searched. Allele frequencies from the Genome Aggregation
Database (gnomAD; Karczewski et al., 2020) and 1000 Genomes Project
(1000 Genomes Project Consortium et al., 2015) are currently available.
The American College of Medical Genetics and Genomics (ACMG) guidelines
(Richards et al., 2015) use an allele frequency of 5% as stand-alone
evidence that a variant allele is not pathogenic. For a single causative
variant, ACMG recommend that frequency filters are set higher than
disease prevalence. Filter cut-offs should be higher if it is possible
that multiple variants are acting together.
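The frequency filtering described above amounts to a simple threshold test. The following is a minimal sketch, assuming each variant carries one allele frequency value, or `None` when the allele was not observed in the reference dataset. The field names, the example records and the 1% cut-off are hypothetical; an appropriate cut-off depends on the disease model, as noted above.

```python
from typing import Optional


def passes_frequency_filter(allele_frequency: Optional[float],
                            max_frequency: float) -> bool:
    """Keep a variant if it is absent from, or rare in, the reference set."""
    # None means the allele was not observed in the reference population,
    # so it cannot be excluded as common.
    if allele_frequency is None:
        return True
    return allele_frequency <= max_frequency


# Hypothetical records; "af" stands in for a population allele frequency.
variants = [
    {"id": "var1", "af": 0.12},    # common: removed
    {"id": "var2", "af": 0.0004},  # rare: kept
    {"id": "var3", "af": None},    # unobserved: kept
]

MAX_AF = 0.01  # hypothetical cut-off chosen for this example
rare = [v["id"] for v in variants if passes_frequency_filter(v["af"], MAX_AF)]
print(rare)
```

Treating a missing frequency as "keep" is a deliberate choice here: in a rare disease analysis, a variant that is absent from the reference populations is usually of more interest than one that is common.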
Select the ‘Variant synonyms’ option to display the names of variants in
databases such as ClinVar, UniProt and PharmGKB. In your results, the
names will be linked to the relevant entries in the source databases, so
the details held in these resources can be examined. Check the ‘PubMed
identifiers’ button to return a list of any publications describing the
variant with links to full text resources where available. Citation and
synonym information is matched on variant name or location and is not
allele specific.
Transcript Selection
Transcriptomic sequencing from multiple tissues has resulted in the
annotation of increasing numbers of transcript isoforms for many genes.
Assessing large numbers of predictions for each variant is
time-consuming but important to ensure no information is missed. To
support downstream filtering, VEP reports transcript type (such as
protein coding or pseudogene) and, for Ensembl transcripts, two
prioritisation metrics. Transcript Support Level (TSL) summarises the
amount of evidence supporting a transcript into a numeric score. APPRIS
(Rodriguez et al., 2017) identifies principal transcript isoforms for
genes in vertebrate species using protein structural information,
functionally important residues and evidence from cross-species
alignments. These options are listed in the ‘Transcript annotation’
section and are reported in Ensembl VEP results by default.
MANE (Matched Annotation from NCBI and EMBL-EBI) transcripts are also
reported by default to facilitate transcript prioritisation. MANE Select
transcripts are single representative transcripts for each protein
coding human gene, chosen by the European Molecular Biology Laboratory’s
European Bioinformatics Institute (EMBL-EBI) and the National Center for
Biotechnology Information (NCBI). They are recommended as the default
transcript where one is needed for reporting. An additional transcript
is required to report all clinically relevant variants in a small number
of genes, including LAMA3 and SCN2A. MANE Plus Clinical transcripts are
being assigned to meet this need. MANE transcripts are identical between
the RefSeq and Ensembl/GENCODE sets and match the GRCh38 reference
genome sequence. MANE Select transcripts are available for 78% of
protein coding genes and MANE Plus Clinical transcripts for 55 genes in
Ensembl release 104 (May 2021). Selection of the MANE option flags these
recommended transcripts and reports both RefSeq and Ensembl transcript
identifiers.
The Ensembl canonical transcript is a single default transcript
available for every gene, in every species. The same Ensembl algorithm
is used to pick the MANE Select transcript and the canonical transcript
in human, so the two are the same where a MANE Select transcript exists. Check the
‘Identify canonical transcripts’ option to highlight these transcripts
in your results if you require a default for every gene.
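When one result per variant is wanted, the transcript flags described in this section can be combined into an ordering. The sketch below is one possible prioritisation, not a scheme used or recommended by Ensembl VEP: it prefers MANE Select, then MANE Plus Clinical, then the canonical transcript, and breaks ties on Transcript Support Level, assuming a lower TSL value means better support. The column names (`MANE_SELECT`, `MANE_PLUS_CLINICAL`, `CANONICAL`, `TSL`) and the example rows are assumptions and should be checked against the columns in your own output.

```python
def transcript_rank(row: dict) -> tuple:
    """Build a sort key; smaller tuples are preferred."""
    has_mane_select = bool(row.get("MANE_SELECT"))
    has_mane_plus = bool(row.get("MANE_PLUS_CLINICAL"))
    is_canonical = row.get("CANONICAL") == "YES"
    tsl = row.get("TSL")
    # A missing TSL sorts after every real level (assumed to run from 1 to 5).
    tsl_value = int(tsl) if tsl is not None else 6
    # False sorts before True, so "not flag" puts flagged transcripts first.
    return (not has_mane_select, not has_mane_plus, not is_canonical, tsl_value)


def pick_transcript(rows: list) -> dict:
    """Return the preferred row among the rows for one variant."""
    return min(rows, key=transcript_rank)


# Hypothetical rows for a single variant; identifiers are placeholders.
rows = [
    {"Feature": "tx_a", "TSL": "1"},
    {"Feature": "tx_b", "MANE_SELECT": "refseq_placeholder",
     "CANONICAL": "YES", "TSL": "2"},
    {"Feature": "tx_c", "TSL": None},
]
print(pick_transcript(rows)["Feature"])
```

Reducing the output to one transcript per variant risks hiding a consequence that is only seen in another isoform, so a ranking like this is better suited to reporting than to the initial review of results.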
Protein domains
When a variant maps to the protein, understanding which domain it falls
in can provide clues as to possible impact on function. InterPro is an
integrated resource for protein families, domains and sites, combining
information from several different protein signature databases. We run
InterProScan (Jones et al., 2014) on all Ensembl protein sequences to
identify domains and these are reported in VEP. Check the ‘Protein
domains’ option (Figure 4) to report these results and any overlapping
PDBe structures.
Regulatory elements
Variants in the non-coding regions of the genome are more difficult to
interpret than those falling within genes, and are also important in
disease (Zhang et al., 2015). In the Ensembl Project, we use data from
large scale projects including ENCODE, IHEC and Blueprint, to predict
regions in the human genome that influence gene regulation. We classify
them into types such as ‘promoter’ and ‘enhancer’ (Zerbino et al.,
2015). Select the ‘Regulatory data’ option (Figure 4) to identify where
your variants overlap such regions. This analysis can be configured to
report all results or only those from specific cell types.
Phenotype and disease associations
Access to phenotype or disease associations previously reported for your
variants or the genes they overlap is essential. There is a large body
of information available in different databases but performing multiple
searches across different resources is time consuming. In Ensembl, we
aggregate phenotype and disease associations from a variety of sources,
including Orphanet, the Cancer Gene Census, OMIM, ClinVar and the
NHGRI-EBI GWAS Catalog, into a standardised format (Hunt et al., 2018).
This information is searched by Ensembl VEP and summary information
reported. ClinVar assertions of variant clinical significance are
reported by default and, importantly, these are matched by allele and
not just variant location. Select the ‘Phenotypes’ option (Figure 4) to
retrieve a list of phenotype associations for overlapping genes and
previously reported variants, with links to fuller information.
Results from additional sources are available. DisGeNET (Piñero et al.,
2020) is a database of gene and variant disease associations. Select
this option to view summary results including disease names and PubMed
identifiers, which are linked to full text publications. The Mastermind
Genomic Search Engine (Chunn et al., 2020)
(https://www.genomenon.com/mastermind) holds gene, variant, disease,
phenotype and therapy evidence mined from millions of scientific
articles. Select this option to return links to the Mastermind website,
which is free to access with registration.
Prediction packages
An increasing number of pathogenicity scoring algorithms are being
developed to aid variant interpretation. It must, however, be remembered
that predictions often use the same training sets and/or evidence, so
agreement between two algorithms does not necessarily provide additional
evidence for a rating. We calculate scores for all possible amino acid
substitutions in all Ensembl proteins using SIFT (Kumar et al., 2009)
and PolyPhen-2 (Adzhubei et al., 2010). These results are returned by
default.
dbNSFP, the database for nonsynonymous SNPs’ functional predictions
(Liu et al., 2020), contains pre-calculated scores for over 20
algorithms. Select this option (Figure 5) to browse the ‘Fields to
include’ menu and configure the precise
results set to be returned. Combined Annotation-Dependent Depletion
(CADD; Rentzsch et al., 2019) is a framework for scoring the
deleteriousness of genomic variants using a wide range of different
information including conservation, functional information and protein
level pathogenicity predictions. Select this option to view scores for
variants in both coding and non-coding loci.
Variants which disrupt splicing have also been implicated in human
disease (Ward et al., 2010). We optionally report results from the
well-established MaxEntScan (Yeo et al., 2014); SpliceAI (Jaganathan et
al., 2019), which takes a machine learning approach; and the ensemble
scores provided in the dbscSNV (Liu et al., 2020) database. Select these
options in the ‘Splicing predictions’ section (Figure 5).
Filtering and Advanced options
The options in these sections will not be required for the majority of
analyses. The ‘Filters’ section (Figure 6) allows the results returned
to be restricted by allele frequency, to contain only variants in coding
sequence or to be reduced to a subset of the available
variant-transcript combinations. However, we recommend filtering
results after the analysis instead, which allows greater flexibility. The
‘Advanced options’ allow you to change the way VEP analyses variants
internally (a smaller batch size will reduce memory requirements but
increase run time) and control whether insertions and deletions in
repetitive sequence are expressed at their most 3’ position prior to
consequence evaluation.
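An insertion or deletion within a repeat can be written at several equivalent positions, which is why the choice of position matters. The sketch below shows the general idea of shifting such a change to its right-most equivalent position. It is not Ensembl VEP's implementation: it assumes a forward-strand sequence held in memory, 0-based coordinates, and function names invented for this example.

```python
def shift_deletion_right(ref: str, start: int, end: int) -> tuple:
    """Shift a deletion of ref[start:end] to its right-most equivalent span."""
    # Sliding the window one base right gives the same result whenever the
    # base leaving the window equals the base entering it.
    while end < len(ref) and ref[start] == ref[end]:
        start += 1
        end += 1
    return start, end


def shift_insertion_right(ref: str, pos: int, inserted: str) -> tuple:
    """Shift an insertion placed before ref[pos] as far right as possible."""
    # If the next reference base matches the first inserted base, the
    # insertion can be rotated by one base and moved one position right.
    while pos < len(ref) and ref[pos] == inserted[0]:
        inserted = inserted[1:] + inserted[0]
        pos += 1
    return pos, inserted


# Deleting one "CA" unit from a (CA)3 repeat: every placement gives "GCACAT".
ref = "GCACACAT"
start, end = shift_deletion_right(ref, 1, 3)
print(start, end, ref[:start] + ref[end:])
```

Applying the same convention to every variant means that equivalent descriptions of one change are evaluated at the same position, and therefore receive the same predicted consequence.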
Results
Having configured your analysis, click the ‘Run’ button at the bottom of
the form. Analysis jobs run on our compute farm and the time required
will depend on the number of input variants and range of options chosen.
The ‘Recent jobs’ table displays the status of all your analyses and has
options to edit and resubmit, share or discard jobs. Results can be
saved by logging into an Ensembl account. Once a job has the status of
‘Done’, clicking on ‘View Results’ will display the results table.
Summary statistics and charts display an overview of the results on the
output page (Figure 7). There is also a table with a preview of the
detailed results and a simple interface to configure filtering of the
output. To aid variant prioritisation, multiple filters can be combined
using basic logical relationships, allowing the creation of complex
customised queries. For example, ‘Consequence is
protein_altering_variant’ plus ‘CADD PHRED >=30’ plus
‘gnomAD AF is not defined’ will report variants which are predicted to
change protein sequence, are in the 0.1% most deleterious changes
predicted by CADD and are not seen in the gnomAD exome variant set.
Importantly, we report the most specific SO term but enable querying by
parent terms. For example, when the consequence ‘protein altering
variant’ is selected, both missense and frameshift variants are reported.
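The example query above can be expressed as a small filter function. This is a sketch under stated assumptions rather than the web tool's own logic: the column names (`Consequence`, `CADD_PHRED`, `gnomAD_AF`) and the rows are hypothetical, and the ontology mapping is a hand-written, flattened excerpt in which each term is treated as a direct child of `protein_altering_variant`. The helper `phred_to_top_fraction` shows why a PHRED-scaled score of 30 corresponds to the top 0.1% of scored changes, assuming the usual scaling of -10 × log10(rank fraction).

```python
from typing import Optional

# Flattened excerpt of the consequence hierarchy (an assumption for this
# sketch; the real Sequence Ontology has more terms and more levels).
SO_PARENT = {
    "missense_variant": "protein_altering_variant",
    "frameshift_variant": "protein_altering_variant",
    "inframe_deletion": "protein_altering_variant",
}


def is_a(term: Optional[str], ancestor: str) -> bool:
    """True if term equals ancestor or descends from it in SO_PARENT."""
    while term is not None:
        if term == ancestor:
            return True
        term = SO_PARENT.get(term)
    return False


def phred_to_top_fraction(phred: float) -> float:
    """Fraction of scored changes ranked at or above a PHRED-scaled score."""
    return 10 ** (-phred / 10)


def keep(row: dict) -> bool:
    """All three conditions must hold (logical AND)."""
    cadd = row.get("CADD_PHRED")
    return (is_a(row.get("Consequence"), "protein_altering_variant")
            and cadd is not None and cadd >= 30
            and row.get("gnomAD_AF") is None)


# Hypothetical result rows.
rows = [
    {"id": "v1", "Consequence": "missense_variant", "CADD_PHRED": 32.0},
    {"id": "v2", "Consequence": "synonymous_variant", "CADD_PHRED": 35.0},
    {"id": "v3", "Consequence": "frameshift_variant", "CADD_PHRED": 31.0,
     "gnomAD_AF": 0.002},
    {"id": "v4", "Consequence": "frameshift_variant", "CADD_PHRED": 12.0},
]
kept = [r["id"] for r in rows if keep(r)]
print(kept)
```

Matching on ancestors rather than exact terms is what allows a single parent term to stand in for every more specific consequence beneath it.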
The results interface allows you to download your output in VCF and
other formats for further analysis or export the variation or gene list
to the Ensembl BioMart tool to extract additional data, such as gene
homologues and sequences.
Results are displayed in a table (Figure 8) with a single line per
combination of variant allele and transcript or regulatory element.
Click on the ‘Show/hide columns’ button to configure which columns are
displayed if you wish to view a subset of the results. Cells containing
many records (as can happen, for example, for PubMed IDs) will initially
be compressed and need expanding to view. The results table displays
only a summary of the information available for a variant. You can
easily examine evidence for your variants of interest in greater detail.
Links enable you to access relevant publications in Europe PMC or view
details in resources such as UniProt, ClinVar and PDBe. The table is
also a convenient access point to data held in Ensembl: it has links to
the variant location on the genome browser and detailed information
about any genes, transcripts or variants the input variant overlaps.
Ensembl VEP interfaces
The Ensembl VEP web tool enables analysis configuration and results
filtering via a simple interface. It is ideal for analysing small sets
of variants and interactively assessing the results. We provide two
other interfaces that are more appropriate for the integration of VEP
annotations in web views or for large scale analyses. Here we briefly
describe these REST and command line interfaces.
Language-agnostic computational access to VEP analysis is available
through the Ensembl REST API. The VEP REST service
(https://rest.ensembl.org) supports similar options to the web tool and
is suitable for programmatic integration into web pages or analysis
pipelines. HGVS notation, position and allele-based descriptions and a
range of common variant names are supported as input and up to 200
variants can be submitted in a single request.
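Because a single request accepts a limited number of variants, larger lists need to be split into batches. The sketch below builds, but does not send, one POST request per batch of HGVS descriptions. The endpoint path (`/vep/{species}/hgvs`) and the `hgvs_notations` body key reflect our understanding of the Ensembl REST documentation and should be checked there before use; the function name and the example notations are placeholders invented for this sketch.

```python
import json
from urllib.request import Request

SERVER = "https://rest.ensembl.org"
MAX_PER_REQUEST = 200  # per-request limit quoted in the text above


def build_vep_hgvs_requests(hgvs_notations: list, species: str = "human") -> list:
    """Return one unsent POST request per batch of HGVS descriptions."""
    requests = []
    for start in range(0, len(hgvs_notations), MAX_PER_REQUEST):
        batch = hgvs_notations[start:start + MAX_PER_REQUEST]
        # Assumed body shape: a JSON object with a list under "hgvs_notations".
        body = json.dumps({"hgvs_notations": batch}).encode("utf-8")
        requests.append(Request(
            f"{SERVER}/vep/{species}/hgvs",
            data=body,
            headers={"Content-Type": "application/json",
                     "Accept": "application/json"},
            method="POST",
        ))
    return requests


# Placeholder notations, generated only to demonstrate batching.
notations = [f"ENST00000366667:c.{i}C>T" for i in range(1, 451)]
batches = build_vep_hgvs_requests(notations)
print(len(batches), "requests prepared")
```

Each prepared request could then be sent with `urllib.request.urlopen` and the response decoded with `json.load`. Keeping construction separate from sending makes it straightforward to add pauses or retries between batches.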
The command line tool is the most powerful and flexible way to use
Ensembl VEP. It supports more analysis options than the other
interfaces. There is also no limit on input file size, making it
suitable for the annotation of large variant sets identified through
whole genome sequencing. The use of custom gene, variant and other
annotation sets is supported, enabling analysis against private data.
While VEP can be run by anyone comfortable with command line tools,
those with basic programming skills can easily create extensions to add
novel, custom functionality. Run time depends on the number and
complexity of options selected: basic analysis of a whole exome
(~200,000 variants) takes under 5 minutes while a single
genome (~4.5 million variants) will take around an hour.
A Docker image is available to simplify installation. A
results-filtering tool is also available in the Ensembl VEP command line
package. Full instructions for installation and options for running
Ensembl VEP locally can be found in our online documentation
(https://www.ensembl.org/vep).
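For use inside a pipeline, the command line tool and its results-filtering companion can be called from a script. The sketch below only assembles the two argument lists and prints them. It assumes that `vep` and `filter_vep` are installed and on the path, that a local annotation cache is available, and that the options shown (`--input_file`, `--output_file`, `--species`, `--assembly`, `--cache`, `--vcf`, `--fork`, `--filter`) behave as we understand them from the online documentation. File names and the filter expression are hypothetical, and all options should be verified against the installed version.

```python
import shlex


def build_vep_command(input_vcf: str, output_file: str,
                      assembly: str = "GRCh38", forks: int = 4) -> list:
    """Assemble an argument list for a basic cache-based annotation run."""
    return [
        "vep",
        "--input_file", input_vcf,
        "--output_file", output_file,
        "--species", "homo_sapiens",
        "--assembly", assembly,
        "--cache",             # use a locally installed annotation cache
        "--vcf",               # write output in VCF format
        "--fork", str(forks),  # number of parallel processes
    ]


def build_filter_command(input_file: str, output_file: str,
                         filter_expression: str) -> list:
    """Assemble an argument list for the results-filtering tool."""
    return [
        "filter_vep",
        "--input_file", input_file,
        "--output_file", output_file,
        "--filter", filter_expression,
    ]


annotate = build_vep_command("sample.vcf", "sample.vep.vcf")
select = build_filter_command("sample.vep.vcf", "sample.filtered.vcf",
                              "Consequence is missense_variant")
# Print shell-quoted commands for inspection instead of running them.
print(shlex.join(annotate))
print(shlex.join(select))
```

Passing either list to `subprocess.run(command, check=True)` would execute it. Building the command as a list avoids shell quoting problems with filter expressions that contain spaces.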