Annotating and prioritising genomic variants using the Ensembl Variant Effect Predictor - a tutorial
Benjamin Moore, Sarah E Hunt, M. Ridwan Amode, Irina M Armean, Diana Lemos, Aleena Mushtaq, Andrew Parton, Helen Schuilenburg, Michał Szpak, Anja Thormann, Emily Perry, Stephen J Trevanion, Paul Flicek, Andrew D Yates, Fiona Cunningham
European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, Cambridge, CB10 1SD, United Kingdom
Grant numbers
Ensembl Variation Resources receive funding from the Wellcome Trust (grant numbers WT108749/Z/15/Z, WT200990/Z/16/Z, WT201535/Z/16/Z, WT212925/Z/18/Z), the BBSRC (BB/S020152/1) and the European Molecular Biology Laboratory. This project has also received funding from the European Union’s Horizon 2020 research and innovation programme under grant agreement n°825575.

Abstract

The Ensembl Variant Effect Predictor (VEP) is a freely available, open source tool for the annotation and filtering of genomic variants. It predicts variant molecular consequence using the Ensembl/GENCODE or RefSeq gene sets. It also reports phenotype associations from databases such as ClinVar, allele frequencies from studies including gnomAD, and predictions of deleteriousness from tools such as SIFT and CADD. Ensembl VEP includes filtering options to customise variant prioritisation. It is well supported and updated roughly quarterly to incorporate the latest gene, variant and phenotype association information.
Ensembl VEP analysis can be performed using a highly configurable, extensible command-line tool, a Representational State Transfer (REST) application programming interface (API) and a user-friendly web interface. These access methods are designed to suit different levels of bioinformatics experience and meet different needs in terms of data size, visualisation and flexibility. In this tutorial, we will describe performing variant annotation using the Ensembl VEP web tool, which enables sophisticated analysis through a simple interface.

Keywords

Variant annotation, filtering, VEP, “molecular consequence”, variant prioritisation
Main Text

Introduction

Genome and exome sequencing are becoming routine in clinical research and diagnostic settings, as an individual’s genotype may provide insight into disease mechanism, progression and treatment. Each sequenced genome contains 4.1 to 5.0 million variant sites (1000 Genomes Project Consortium et al., 2015), many of which will be rare but benign alleles, so additional information is required to enable variant interpretation and prioritisation. As the scale of data production increases, robust and efficient software tools are needed to support variant annotation and filtering.
Variant interpretation requires i) the mapping of variants to transcripts and predictions of molecular consequence; ii) the consideration of all current knowledge relating to a variant; and iii) the application of predictive algorithms to evaluate the impact of change at the locus. Appropriate resources are available: the reference gene sets are regularly updated; the number of assertions of phenotype association in the literature and in key databases continues to grow; population frequency studies are expanding to include more individuals and report more detailed catalogues of rare variants; and variant pathogenicity prediction is an active area of tool development.
In the Ensembl Project (Howe et al., 2021), we create high-quality gene sets, predict genomic regions involved in gene regulation and collate large-scale sets of variant and phenotype association data. Ensembl VEP (McLaren et al., 2016) builds on these resources and integrates results from variant assessment algorithms to enable convenient but extensive variant annotation. We provide regular updates, approximately every 3 months, to both the VEP software and associated data to ensure the latest information can be used for analysis. Here we present a tutorial describing the Ensembl VEP web interface, detailing the available analysis options and filters.

Tutorial

Data Input

Navigate to the Ensembl VEP homepage by clicking on the ‘VEP’ link in the blue navigation bar on the Ensembl homepage (https://www.ensembl.org/index.html). The Ensembl VEP homepage links to the three different VEP interfaces and detailed documentation. Click on ‘Launch VEP’ to open the web form, which is divided into sections for data input and optional analysis configuration (Figure 1).
The human GRCh38 assembly is selected by default, but a link provides access to a dedicated GRCh37 tool. Other species can be selected using the ‘Add/remove species’ option. To make the management of multiple analyses simpler, a name can be assigned to the job.
Data can be input by (1) pasting into the text box, (2) uploading a file or (3) by providing a URL for a file on a public server. The text box is suitable for small-scale datasets. To analyse a larger dataset, provide a URL or use the file upload option which supports a maximum file size of 50 megabytes (or around 2 million lines in a compressed VCF).
Ensembl VEP supports a range of data input formats. VCF is the standard exchange format used in next-generation sequencing pipelines, so Ensembl VEP is optimised to analyse variants in this format.
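As a minimal illustration of VCF input, the Python sketch below formats a few variants as tab-separated VCF data lines of the kind that could be pasted into the text box. The coordinates and the helper name to_vcf_line are hypothetical and chosen only for illustration; the eight fixed columns follow the VCF specification.

```python
# Minimal sketch: format variants as VCF data lines.
# The coordinates used here are made-up examples, not real findings.

VCF_HEADER = "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO"


def to_vcf_line(chrom: str, pos: int, ref: str, alt: str, var_id: str = ".") -> str:
    """Return one tab-separated VCF data line with empty QUAL, FILTER and INFO."""
    if pos < 1:
        raise ValueError("VCF positions are 1-based")
    return "\t".join([chrom, str(pos), var_id, ref, alt, ".", ".", "."])


# Hypothetical variants: (chromosome, position, reference allele, alternate allele)
variants = [("1", 230710048, "A", "G"), ("2", 265023, "C", "T")]
vcf_text = "\n".join([VCF_HEADER] + [to_vcf_line(*v) for v in variants])
print(vcf_text)
```

Missing values are written as a single dot, which is the VCF convention for an empty field.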

Transcript set selection

Predicting the molecular consequence of a genomic variant is an essential step in interpretation and requires extensive, accurate gene annotation. There are two commonly used human gene sets: Ensembl/GENCODE (Frankish et al., 2021) and RefSeq (O’Leary et al., 2019). The two sets are generated using similar but not identical evidence and algorithms, and so differ slightly. VEP can analyse variants using either gene set, the two combined, or GENCODE Basic (which contains a small subset of representative transcripts for each gene). Select your preference in the ‘Transcript database to use’ section (Figure 1).
The VEP algorithm compares each variant to each transcript in the selected set and reports the relative transcript location of the variant (for example exonic, upstream) with any predicted molecular consequence (for example missense, frameshift). Consequences are described using Sequence Ontology terms (SO; Cunningham et al., 2015) to enable comparison and integration with results from other systems.

Transcript-related identifiers

HUGO Gene Nomenclature Committee (HGNC) gene symbols, versioned transcript accessions and transcript types (for example: AGT, ENST00000366667.6, protein coding respectively) are returned by default. Use the ‘Identifiers’ section (Figure 2) to add further information, including Ensembl or RefSeq protein identifiers, UniProt protein accessions and HGVS variant descriptions at protein and transcript level to your output.

Frequencies and citations

With over seven hundred million variants in dbSNP (version 154, May 2020) alone, the majority of variants found in an individual will have already been described. This information can be crucial to interpretation. Ensembl VEP searches databases including dbSNP, COSMIC and HGMD and reports any variants at the same location as your input variants. For databases with redistribution restrictions, variants are matched on location alone (i.e., with no allele specificity) and names are reported. For fully open databases, variants are matched by allele and key additional information is reported. By default, we only report matches to variants passing our quality filtering (for example, those mapping to multiple genomic locations are excluded); to include all variants in the search check the ‘Include flagged variants’ option.
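The difference between location-only and allele-specific matching can be sketched in a few lines of Python. This is an illustration of the concept only, not the matching code used by Ensembl VEP; the Variant type and the example records are hypothetical.

```python
from typing import NamedTuple


class Variant(NamedTuple):
    chrom: str
    pos: int
    ref: str
    alt: str


def matches_location(query: Variant, known: Variant) -> bool:
    """Location-only match: same chromosome and position, alleles ignored."""
    return (query.chrom, query.pos) == (known.chrom, known.pos)


def matches_allele(query: Variant, known: Variant) -> bool:
    """Allele-specific match: same location and the same reference and alternate alleles."""
    return matches_location(query, known) and (query.ref, query.alt) == (known.ref, known.alt)


# Hypothetical input variant and known record at the same site with a different alternate allele
query = Variant("7", 1000, "C", "T")
known = Variant("7", 1000, "C", "G")
print(matches_location(query, known))  # True
print(matches_allele(query, known))    # False
```

A location-only match therefore signals that something has been reported at the site, but not necessarily the same allele change as the input variant.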
In rare disease studies it is useful to filter out variants using reference population frequencies, as variants common in the general population are less likely to be causative. Use the ‘Variants and frequency data’ section (Figure 3) to select the reference dataset to be searched. Allele frequencies from the Genome Aggregation Database (gnomAD; Karczewski et al., 2020) and 1000 Genomes Project (1000 Genomes Project Consortium et al., 2015) are currently available.
The American College of Medical Genetics and Genomics (ACMG) guidelines (Richards et al., 2015) use a 5% allele frequency as stand-alone evidence that a variant allele is not pathogenic. For a single causative variant, the ACMG recommends selecting frequency filters that are higher than disease prevalence. Filter cut-offs should be higher still if multiple variants may be acting together.
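A frequency filter of this kind can be sketched as follows. The cutoff of 0.01, the record layout and the function name are assumptions made for illustration; an appropriate cutoff depends on disease prevalence and inheritance model, as discussed above.

```python
from typing import Optional


def passes_frequency_filter(allele_frequency: Optional[float], cutoff: float) -> bool:
    """Keep alleles that are absent from the reference set or rarer than the cutoff."""
    return allele_frequency is None or allele_frequency < cutoff


# Hypothetical records; an 'af' of None means the allele was not observed in the reference population
records = [
    {"id": "var1", "af": None},
    {"id": "var2", "af": 0.0002},
    {"id": "var3", "af": 0.12},
]
rare = [r["id"] for r in records if passes_frequency_filter(r["af"], cutoff=0.01)]
print(rare)  # ['var1', 'var2']
```

Alleles with no reported frequency are retained here on the assumption that absence from a reference population is compatible with a rare causative variant.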
Select the ‘Variant synonyms’ option to display the names of variants in databases such as ClinVar, UniProt and PharmGKB. In your results, the names will be linked to the relevant entries in the source databases, so the details held in these resources can be examined. Check the ‘PubMed identifiers’ button to return a list of any publications describing the variant with links to full text resources where available. Citation and synonym information is matched on variant name or location and is not allele specific.

Transcript Selection

Transcriptomic sequencing from multiple tissues has resulted in the annotation of increasing numbers of transcript isoforms for many genes. Assessing large numbers of predictions for each variant is time-consuming but important to ensure no information is missed. To support downstream filtering VEP reports transcript type (such as protein coding or pseudogene) and, for Ensembl transcripts, two prioritisation metrics. Transcript Support Level (TSL) summarises the amount of evidence supporting a transcript into a numeric score. APPRIS (Rodriguez et al., 2017) identifies principal transcript isoforms for genes in vertebrate species using protein structural information, functionally important residues and evidence from cross-species alignments. These options are listed in the ‘Transcript annotation’ section and are reported in Ensembl VEP results by default.
MANE (Matched Annotation from NCBI and EMBL-EBI) transcripts are also reported by default to facilitate transcript prioritisation. MANE Select transcripts are single representative transcripts for each protein coding human gene, chosen by the European Molecular Biology Laboratory’s European Bioinformatics Institute (EMBL-EBI) and the National Center for Biotechnology Information (NCBI). They are recommended as the default transcript where one is needed for reporting. An additional transcript is required to report all clinically relevant variants in a small number of genes, including LAMA3 and SCN2A. MANE Plus Clinical transcripts are being assigned to meet this need. MANE transcripts are identical between the RefSeq and Ensembl/GENCODE sets and match the GRCh38 reference genome sequence. MANE Select transcripts are available for 78% of protein coding genes and MANE Plus Clinical transcripts for 55 genes in Ensembl release 104 (May 2021). Selection of the MANE option flags these recommended transcripts and reports both RefSeq and Ensembl transcript identifiers.
The Ensembl canonical transcript is a single default transcript available for every gene, in every species. The same Ensembl algorithm is used to pick the MANE Select transcript and the canonical transcript in human, so the two are the same where a MANE Select transcript exists. Check the ‘Identify canonical transcripts’ option to highlight these transcripts in your results if you require a default for every gene.
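One possible way to use these flags when reducing results to a single transcript per gene is sketched below: prefer a MANE Select transcript, then fall back to the Ensembl canonical transcript. The dictionary keys (MANE_SELECT, CANONICAL, Feature) and the transcript names are assumptions about how downloaded results might be represented and should be checked against your own output.

```python
from typing import Optional


def pick_transcript(rows: list) -> Optional[dict]:
    """Choose one row among the rows reported for a single variant and gene.

    Preference order: MANE Select, then Ensembl canonical, then the first row.
    """
    for row in rows:
        if row.get("MANE_SELECT"):  # non-empty value marks a MANE Select transcript
            return row
    for row in rows:
        if row.get("CANONICAL") == "YES":
            return row
    return rows[0] if rows else None


# Hypothetical rows for one variant overlapping three transcripts of one gene
example_rows = [
    {"Feature": "transcript_A", "MANE_SELECT": "", "CANONICAL": ""},
    {"Feature": "transcript_B", "MANE_SELECT": "refseq_match_B", "CANONICAL": "YES"},
    {"Feature": "transcript_C", "MANE_SELECT": "", "CANONICAL": ""},
]
print(pick_transcript(example_rows)["Feature"])  # transcript_B
```

Reducing to one transcript is convenient for reporting, but the remaining rows may still carry relevant predictions and are worth retaining for review.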

Protein domains

When a variant maps to the protein, understanding which domain it falls in can provide clues as to possible impact on function. InterPro is an integrated resource for protein families, domains and sites, combining information from several different protein signature databases. We run InterProScan (Jones et al., 2014) on all Ensembl protein sequences to identify domains and these are reported in VEP. Check the ‘Protein domains’ option (Figure 4) to report these results and any overlapping PDBe structures.

Regulatory elements

Variants in the non-coding regions of the genome are also important in disease (Zhang et al., 2015) but are more difficult to interpret than those falling within genes. In the Ensembl Project, we use data from large scale projects including ENCODE, IHEC and Blueprint, to predict regions in the human genome that influence gene regulation. We classify them into types such as ‘promoter’ and ‘enhancer’ (Zerbino et al., 2015). Select the ‘Regulatory data’ option (Figure 4) to identify where your variants overlap such regions. This analysis can be configured to report all results or only those from specific cell types.

Phenotype and disease associations

Access to phenotype or disease associations previously reported for your variants or the genes they overlap is essential. There is a large body of information available in different databases but performing multiple searches across different resources is time consuming. In Ensembl, we aggregate phenotype and disease associations from a variety of sources, including Orphanet, the Cancer Gene Census, OMIM, ClinVar and the NHGRI-EBI GWAS Catalog, into a standardised format (Hunt et al., 2018). This information is searched by Ensembl VEP and summary information reported. ClinVar assertions of variant clinical significance are reported by default and, importantly, these are matched by allele and not just variant location. Select the ‘Phenotypes’ option (Figure 4) to retrieve a list of phenotype associations for overlapping genes and previously reported variants, with links to fuller information.
Results from additional sources are available. DisGeNET (Piñero et al., 2020) is a database of gene and variant disease associations. Select this option to view summary results including disease names and PubMed identifiers, which are linked to full text publications. The Mastermind Genomic Search Engine (Chunn et al., 2020) (https://www.genomenon.com/mastermind) holds gene, variant, disease, phenotype and therapy evidence mined from millions of scientific articles. Select this option to return links to the Mastermind website, which is free to access with registration.

Prediction packages

An increasing number of pathogenicity scoring algorithms are being developed to aid variant interpretation. It must be remembered, however, that predictions often use the same training sets and/or evidence, so agreement between two algorithms does not necessarily provide additional evidence for a rating. We calculate scores for all possible amino acid substitutions in all Ensembl proteins using SIFT (Kumar et al., 2009) and PolyPhen-2 (Adzhubei et al., 2010). These results are returned by default.
dbNSFP, the database for nonsynonymous SNPs’ functional predictions (Liu et al., 2020), contains pre-calculated scores for over 20 algorithms. Select this option (Figure 5) to browse the ‘Fields to include’ menu and configure the precise results set to be returned. Combined Annotation-Dependent Depletion (CADD; Rentzsch et al., 2019) is a framework for scoring the deleteriousness of genomic variants using a wide range of different information including conservation, functional information and protein level pathogenicity predictions. Select this option to view scores for variants in both coding and non-coding loci.
Variants which disrupt splicing have also been implicated in human disease (Ward et al., 2010). We optionally report results from the well-established MaxEntScan (Yeo et al., 2014); SpliceAI (Jaganathan et al., 2019), which takes a machine learning approach; and the ensemble scores provided in the dbscSNV (Liu et al., 2020) database. Select these options in the ‘Splicing predictions’ section (Figure 5).

Filtering and Advanced options

The options in these sections will not be required for the majority of analyses. The ‘Filters’ section (Figure 6) allows the results returned to be restricted by allele frequency, to contain only variants in coding sequence or to be reduced to a subset of the available variant-transcript combinations. However, we recommend filtering results after the analysis instead, as this allows greater flexibility. The ‘Advanced options’ allow you to change the way VEP analyses variants internally (a smaller batch size will reduce memory requirements but increase run time) and to control whether insertions and deletions in repetitive sequence are expressed at their most 3’ position prior to consequence evaluation.
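The idea of expressing a deletion in repetitive sequence at its most 3’ position can be illustrated with a short Python sketch. This is a simplified illustration on a forward-strand sequence with 0-based coordinates, not the implementation used by Ensembl VEP; the function name and example sequence are hypothetical.

```python
def shift_deletion_3prime(ref: str, start: int, length: int) -> int:
    """Return the most 3' (rightmost) 0-based start of an equivalent deletion.

    Deleting ref[start:start + length] gives the same sequence as deleting one base
    further right whenever the first deleted base equals the base just after the
    deleted span, so the deletion is moved right while that condition holds.
    """
    while start + length < len(ref) and ref[start] == ref[start + length]:
        start += 1
    return start


# Hypothetical sequence with a CA repeat: deleting 'CA' at index 1 is equivalent
# to deleting 'CA' at index 5, which is the most 3' representation.
print(shift_deletion_3prime("GCACACAT", 1, 2))  # 5
```

Both representations give the same resulting sequence (GCACAT in this example), which is why a consistent convention is needed before comparing or annotating indels in repeats.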

Results

Having configured your analysis, click the ‘Run’ button at the bottom of the form. Analysis jobs run on our compute farm and the time required will depend on the number of input variants and range of options chosen. The ‘Recent jobs’ table displays the status of all your analyses and has options to edit and resubmit, share or discard jobs. Results can be saved by logging into an Ensembl account. Once a job has the status of ‘Done’, clicking on ‘View Results’ will display the results table.
Summary statistics and charts display an overview of the results on the output page (Figure 7). There is also a table with a preview of the detailed results and a simple interface to configure filtering of the output. To aid variant prioritisation, multiple filters can be combined using basic logical relationships, allowing the creation of complex customised queries. For example, ‘Consequence is protein_altering_variant’ plus ‘CADD PHRED >=30’ plus ‘gnomAD AF is not defined’ will report variants which are predicted to change protein sequence, are in the 0.1% most deleterious changes predicted by CADD and are not seen in the gnomAD exome variant set. Importantly, we report the most specific SO term but enable querying by parent terms. For example, when the consequence of ‘protein altering variant’ is selected, missense and frameshift variants are reported.
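The example query above can be sketched in Python to show how the three filters combine and how querying by a parent SO term also captures its child terms. The column names (Consequence, CADD_PHRED, gnomAD_AF), the example rows and the small hand-written table of child terms are assumptions for illustration; only the two child terms named in the text are included. The helper phred_to_top_fraction shows why a PHRED-scaled score of 30 corresponds to the top 0.1% of predicted changes.

```python
# Hand-written subset of Sequence Ontology child terms (only those named in the text)
SO_CHILDREN = {"protein_altering_variant": {"missense_variant", "frameshift_variant"}}


def has_consequence(row: dict, term: str) -> bool:
    """True if any reported consequence is the queried term or one of its listed children."""
    accepted = {term} | SO_CHILDREN.get(term, set())
    return bool(accepted & set(row["Consequence"].split(",")))


def phred_to_top_fraction(phred: float) -> float:
    """Convert a PHRED-scaled score to the fraction of changes ranked at or above it."""
    return 10 ** (-phred / 10)


def keep(row: dict) -> bool:
    """Protein altering, CADD PHRED >= 30 and no gnomAD allele frequency defined."""
    cadd = row.get("CADD_PHRED")
    return (
        has_consequence(row, "protein_altering_variant")
        and cadd is not None
        and cadd >= 30
        and row.get("gnomAD_AF") is None
    )


# Hypothetical result rows
rows = [
    {"id": "v1", "Consequence": "missense_variant", "CADD_PHRED": 32.0, "gnomAD_AF": None},
    {"id": "v2", "Consequence": "missense_variant", "CADD_PHRED": 32.0, "gnomAD_AF": 0.001},
    {"id": "v3", "Consequence": "synonymous_variant", "CADD_PHRED": 31.0, "gnomAD_AF": None},
    {"id": "v4", "Consequence": "frameshift_variant,splice_region_variant", "CADD_PHRED": 35.0, "gnomAD_AF": None},
    {"id": "v5", "Consequence": "missense_variant", "CADD_PHRED": 12.0, "gnomAD_AF": None},
]
kept = [r["id"] for r in rows if keep(r)]
print(kept)  # ['v1', 'v4']
```

Applying filters to downloaded results in this way keeps the full annotation available, so cut-offs can be revised without rerunning the analysis.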
The results interface allows you to download your output in VCF and other formats for further analysis or export the variation or gene list to the Ensembl BioMart tool to extract additional data, such as gene homologues and sequences.
Results are displayed in a table (Figure 8) with a single line per combination of variant allele and transcript or regulatory element. Click on the ‘Show/hide columns’ button to configure which columns are displayed if you wish to view a subset of the results. Cells containing many records (as can happen, for example, for PubMed IDs) will initially be compressed and need expanding to view. The results table displays only a summary of the information available for a variant, but you can easily examine evidence for your variants of interest in greater detail. Links enable you to access relevant publications in Europe PMC or view details in resources such as UniProt, ClinVar and PDBe. The table is also a convenient access point to data held in Ensembl: it has links to the variant location on the genome browser and detailed information about any genes, transcripts or variants the input variant overlaps.

Ensembl VEP interfaces

The Ensembl VEP web tool enables analysis configuration and results filtering via a simple interface. It is ideal for analysing small sets of variants and interactively assessing the results. We provide two other interfaces that are more appropriate for the integration of VEP annotations in web views or for large scale analyses. Here we briefly describe these REST and command line interfaces.
Language-agnostic computational access to VEP analysis is available through the Ensembl REST API. The VEP REST service (https://rest.ensembl.org) supports similar options to the web tool and is suitable for programmatic integration into web pages or analysis pipelines. HGVS notation, position and allele-based descriptions and a range of common variant names are supported as input and up to 200 variants can be submitted in a single request.
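A minimal Python sketch of programmatic access is given here, splitting a list of HGVS descriptions into batches that respect the 200-variant limit. The endpoint path (/vep/human/hgvs) and the hgvs_notations request field are taken from our reading of the public REST documentation and should be checked against the current documentation before use; the function names are hypothetical.

```python
import json
import urllib.request

REST_SERVER = "https://rest.ensembl.org"
MAX_PER_REQUEST = 200  # per-request limit stated in the text


def batches(items: list, size: int = MAX_PER_REQUEST):
    """Yield consecutive slices of at most `size` items."""
    for i in range(0, len(items), size):
        yield items[i:i + size]


def build_request(hgvs_batch: list, species: str = "human") -> urllib.request.Request:
    """Build a POST request for one batch (endpoint and field name are assumptions)."""
    body = json.dumps({"hgvs_notations": hgvs_batch}).encode("utf-8")
    return urllib.request.Request(
        f"{REST_SERVER}/vep/{species}/hgvs",
        data=body,
        headers={"Content-Type": "application/json", "Accept": "application/json"},
        method="POST",
    )


def annotate(hgvs_list: list) -> list:
    """Submit all batches in turn and collect the decoded JSON results."""
    results = []
    for batch in batches(hgvs_list):
        with urllib.request.urlopen(build_request(batch)) as response:
            results.extend(json.load(response))
    return results
```

Calling annotate with a list of HGVS strings performs one request per batch; error handling and rate limiting are omitted here and would be needed in a production pipeline.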
The command line tool is the most powerful and flexible way to use Ensembl VEP. It supports more analysis options than the other interfaces. There is also no limit on input file size, making it suitable for the annotation of large variant sets identified through whole genome sequencing. The use of custom gene, variant and other annotation sets is supported, enabling analysis against private data. While VEP can be run by anyone comfortable with command line tools, those with basic programming skills can simply create extensions to add novel, custom functionality. Run time depends on the number and complexity of options selected: basic analysis of a whole exome (~200,000 variants) takes under 5 minutes while a single genome (~4.5 million variants) will take around an hour. A Docker image is available to simplify installation. A results-filtering tool is also available in the Ensembl VEP command line package. Full instructions for installation and options for running Ensembl VEP locally can be found in our online documentation (https://www.ensembl.org/vep).
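For pipeline use, a command line invocation can be assembled and launched from Python. The sketch below only builds and prints the command; the file names are hypothetical, and the options shown (--input_file, --output_file, --cache, --assembly, --vcf, --fork) reflect our understanding of the command line tool and should be checked against the documentation for the installed release.

```python
import shlex


def build_vep_command(input_file: str, output_file: str,
                      assembly: str = "GRCh38", forks: int = 4) -> list:
    """Assemble an argument list for a basic cache-based run with VCF output."""
    return [
        "vep",
        "--input_file", input_file,
        "--output_file", output_file,
        "--cache",               # use a locally installed annotation cache
        "--assembly", assembly,
        "--vcf",                 # write results in VCF format
        "--fork", str(forks),    # number of parallel processes
    ]


command = build_vep_command("input.vcf", "annotated.vcf")
print(shlex.join(command))
```

The argument list can then be passed to subprocess.run(command, check=True) on a system where Ensembl VEP and its cache are installed.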