Discussion
In the last 10 years, multiple public databases such as the Exome Variant Server (https://evs.gs.washington.edu/EVS/) (ExomeVariantServer), the 1000 Genomes Project (https://www.ncbi.nlm.nih.gov/variation/tools/1000genomes/) (Genomes Project et al. 2015), ExAC (gs://gnomad-public/legacy) (Lek et al. 2016), gnomAD (https://gnomad.broadinstitute.org/) (Karczewski et al. 2020) and ABraOM (https://abraom.ib.usp.br/search.php) (Naslavsky et al. 2017) have facilitated the identification of rare pathogenic variants in the represented populations by publicly sharing the genomic data of mostly healthy individuals in an accessible way. However, these databases are de-identified and do not harbor phenotypic information about the included individuals. Therefore, they are not optimized for the interpretation of rare variants possibly associated with rare phenotypes, particularly those characterized by mild presentation, incomplete penetrance, and/or late onset.
To overcome this limitation, databases such as DECIPHER (Firth et al. 2009), MyGene2/Geno2MP (Chong et al. 2016), VariantMatcher (Wohler et al. 2021) and Franklin (Genoox) have created a public way to share genomic and phenotypic data from individuals with rare phenotypes that is easily accessible to researchers, clinicians, health care providers, and patients. While each is queried in a slightly different way, they all harbor accessible genomic and phenotypic information from patients with rare phenotypes. The use of these databases has supported the identification of novel disease-causing variants and the more precise classification of many variants of uncertain significance (VUSs). To date, most of the VUSs investigated in databases such as MyGene2/Geno2MP, VariantMatcher and Franklin could be classified as benign after close comparison of the phenotypes, facilitating the identification of stronger candidate causative variants for the phenotypes being investigated (Wohler et al. 2021).
We plan to follow the successful Matchmaker Exchange (MME) model to connect these databases and others in a federated network using the GA4GH Data Connect standard. This will facilitate data sharing, the identification of individuals harboring the same variant, and the exchange of phenotypic information, making variant classification more specific. Users will be able to choose the most appropriate database in which to share their data and easily query other connected databases for similar cases. After each query, the users involved will automatically and simultaneously receive an email notification reporting the presence or absence of a match in the queried database(s). A match notification email will contain the matching data (genomic +/- phenotypic features), the contact information of the users to whom they matched, and additional metadata that will be shared at the discretion of the databases that harbor the matching cases. Subsequently, the matched users can choose to contact each other to exchange further information about their cases, including detailed phenotypic information.
If there is no match in any of the queried databases, the submission will be stored only in the database from which the query originated, not in the external databases queried. Users who wish to repeat the query in the future will need to send the submission again. In some databases, such as VariantMatcher, users have the option to automatically resend the data from their submissions to the other chosen databases on a periodic basis.
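The query flow described above can be illustrated with a minimal sketch. It is not an implementation of the GA4GH Data Connect API; the names `Node`, `federated_query`, and `build_notification` are hypothetical, variants are reduced to simple tuples, and email delivery is replaced by a plain text message. The sketch only shows the storage rule (a submission is kept by the originating database and never by the queried ones) and the fact that a notification is produced whether or not a match is found.

```python
from typing import Dict, List, Set, Tuple

# Hypothetical variant representation: (chromosome, position, ref, alt).
Variant = Tuple[str, int, str, str]


class Node:
    """A toy stand-in for one connected database (hypothetical)."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.stored: Set[Variant] = set()

    def submit(self, variant: Variant) -> None:
        # Store a submission in this database.
        self.stored.add(variant)

    def has_variant(self, variant: Variant) -> bool:
        # Read-only lookup: an incoming query is never stored here.
        return variant in self.stored


def federated_query(origin: Node, remotes: List[Node],
                    variant: Variant) -> Dict[str, bool]:
    """Store the submission at the origin only, then query each remote."""
    origin.submit(variant)
    return {node.name: node.has_variant(variant) for node in remotes}


def build_notification(results: Dict[str, bool]) -> str:
    """Build the message text reporting presence or absence of a match."""
    matched = sorted(name for name, hit in results.items() if hit)
    if matched:
        return "Match found in: " + ", ".join(matched)
    return "No match found in: " + ", ".join(sorted(results))


if __name__ == "__main__":
    origin, remote_a, remote_b = Node("origin"), Node("A"), Node("B")
    remote_a.submit(("1", 12345, "A", "G"))
    results = federated_query(origin, [remote_a, remote_b],
                              ("1", 12345, "A", "G"))
    print(build_notification(results))
```

In this sketch, periodic resubmission would amount to calling `federated_query` again on a schedule with the same variant.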
Variant information in the form of a genomic location is the minimal requirement to start a query among the connected nodes. Matching on variant features such as zygosity, or on phenotypic features, in addition to the required variant will also be supported by some of the databases, such as VariantMatcher. However, even if some of the databases match only on the variant information, we expect that users querying the databases through the Data Connect API will also submit zygosity information and detailed phenotypic information, so that this information can be shared in the email notification and facilitate further communication among the users who matched. To enhance the likelihood of a match on a pathogenic variant, databases such as VariantMatcher and Geno2MP harbor only rare coding variants.
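As an illustration of the matching criteria above, the following sketch compares two submissions on genomic location and alleles, with zygosity as an optional additional criterion. The class `VariantQuery`, its field names, and the functions `variant_key` and `is_match` are hypothetical and do not reflect the schema of any of the databases mentioned; the sketch also assumes that both records use the same reference assembly and the same allele representation.

```python
from dataclasses import dataclass
from typing import FrozenSet, Optional, Tuple


@dataclass(frozen=True)
class VariantQuery:
    """Hypothetical submission: location is required, the rest is optional."""
    chrom: str
    pos: int
    ref: str
    alt: str
    zygosity: Optional[str] = None           # e.g. "heterozygous"
    phenotypes: FrozenSet[str] = frozenset()  # e.g. phenotype term identifiers


def normalize_chrom(chrom: str) -> str:
    """Treat 'chr1' and '1' as the same chromosome name."""
    name = chrom.strip()
    if name.lower().startswith("chr"):
        name = name[3:]
    return name.upper()


def variant_key(q: VariantQuery) -> Tuple[str, int, str, str]:
    """Minimal matching key: chromosome, position, and alleles."""
    return (normalize_chrom(q.chrom), q.pos, q.ref.upper(), q.alt.upper())


def is_match(query: VariantQuery, record: VariantQuery,
             require_zygosity: bool = False) -> bool:
    """Match on the variant; optionally also require equal zygosity.

    Zygosity is compared only when both sides provide it, so a record
    without zygosity can still match on the variant alone.
    """
    if variant_key(query) != variant_key(record):
        return False
    if require_zygosity and query.zygosity and record.zygosity:
        return query.zygosity.lower() == record.zygosity.lower()
    return True
```

Phenotypic features are carried along in `phenotypes` but are not used for matching here; in this sketch they would simply be included in the notification so that matched users can compare cases.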
We will follow the recommendations of the Consent Task Team from the GA4GH Regulatory and Ethics Working Group, and individual written informed consent will typically be required since variant-level data and/or phenotypic data will be provided. Each database will manage the security and privacy of the data it harbors.
By connecting variant-level databases that also facilitate phenotypic data access, we expect to improve the variant classification process in research and clinical settings and also to increase the discovery rate of novel disease-causing variants by increasing the specificity of matches. Nevertheless, incomplete penetrance, variable expressivity of the phenotype, age of onset, and zygosity are some of the factors that should be considered when the variants and phenotypes are being compared before the final classification of a candidate variant.