Data Connect
Data Connect is a standard from the Global Alliance for Genomics and
Health (GA4GH) for the discovery and search of biomedical data
(https://github.com/ga4gh-discovery/data-connect/blob/master/SPEC.md).
It provides data custodians with a mechanism to organize and
semantically describe their data and its data model, and data consumers
with a mechanism to construct flexible queries and search the described
data. Unlike other data-sharing technologies, Data Connect does not
prescribe a data model, thus allowing arbitrary data to be discovered
and searched “as is”, without potentially expensive transformations.
It relies on the JSON Schema standard (https://json-schema.org/) for
describing data models, and the SQL standard for querying.
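
As a rough illustration of the two standards working together, the sketch below describes a table with a JSON Schema document and pairs it with a SQL query over that table. The table name "variants", its column names, and the helper row_matches_model are hypothetical and are not taken from the Data Connect specification. The helper performs only a shallow type check and is not a full JSON Schema validator.

```python
# Hypothetical data model for a table of variants, expressed as JSON Schema.
# The column names are illustrative; a custodian would describe its own data.
VARIANT_DATA_MODEL = {
    "$schema": "http://json-schema.org/draft-07/schema#",
    "description": "Hypothetical variant table",
    "type": "object",
    "properties": {
        "chromosome": {"type": "string"},
        "position": {"type": "integer"},
        "reference_allele": {"type": "string"},
        "alternate_allele": {"type": "string"},
        "gene_symbol": {"type": "string"},
    },
}

# A SQL query a data consumer might write against the table described above.
EXAMPLE_QUERY = (
    "SELECT chromosome, position, alternate_allele "
    "FROM variants WHERE gene_symbol = 'BRCA2'"
)

# Mapping from a few JSON Schema type names to Python types (assumed subset).
_PY_TYPES = {"string": str, "integer": int, "number": (int, float), "boolean": bool}


def _has_type(value, type_name):
    # bool is a subclass of int in Python, so reject it for numeric columns.
    if type_name in ("integer", "number") and isinstance(value, bool):
        return False
    return isinstance(value, _PY_TYPES[type_name])


def row_matches_model(row, model):
    """Shallow check: each described column present in the row has the declared type."""
    for name, spec in model["properties"].items():
        if name in row and not _has_type(row[name], spec["type"]):
            return False
    return True
```

Because the data model travels with the data, a consumer can inspect the column names and types before writing a query, without the custodian reshaping the data first.
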
Through Data Connect, databases harboring variant-level data with or
without phenotypic feature data will be able to connect in a federated
network, answer more complex questions, and communicate while preserving
their respective data models.
Databases can connect in the network by implementing the Data Connect
application programming interface (API). The API consists of three
parts:
1) Table API, through which each database describes its data models to
enable their discovery and the fetching of associated data;
2) Service Info API, for discovery of metadata about the database; and
3) Search API, which allows other databases to search the database for
similar variants using rich and flexible queries.
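
The three parts listed above can be sketched as a handful of HTTP requests. The endpoint paths below (/tables, /table/{name}/info, /table/{name}/data, /service-info, /search) and the {"query": ...} request body reflect our reading of the linked specification and should be checked against it. The base URL is a placeholder, and the function names are hypothetical. The requests are only constructed here, not sent.

```python
import json
from urllib.parse import quote
from urllib.request import Request

BASE_URL = "https://data-connect.example.org"  # placeholder, not a real service


def list_tables_request(base_url=BASE_URL):
    # Table API: discover which tables the database exposes.
    return Request(f"{base_url}/tables", method="GET")


def table_info_request(table_name, base_url=BASE_URL):
    # Table API: fetch the data model (JSON Schema) of one table.
    return Request(f"{base_url}/table/{quote(table_name, safe='')}/info", method="GET")


def table_data_request(table_name, base_url=BASE_URL):
    # Table API: fetch the rows of one table.
    return Request(f"{base_url}/table/{quote(table_name, safe='')}/data", method="GET")


def service_info_request(base_url=BASE_URL):
    # Service Info API: fetch metadata about the database itself.
    return Request(f"{base_url}/service-info", method="GET")


def search_request(sql, base_url=BASE_URL):
    # Search API: submit a SQL query as a JSON document.
    body = json.dumps({"query": sql}).encode("utf-8")
    return Request(
        f"{base_url}/search",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
```

A client would pass any of these Request objects to urllib.request.urlopen and decode the JSON response.
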
The algorithm that decides similarity is defined by the database being
queried. The database evaluates the query, applies its matching
function, and replies with a list of the similar cases it hosts.
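
Since each database chooses its own matching algorithm, no single implementation can be shown. The sketch below is one possible matching function, assuming each case is reduced to a set of feature identifiers (for example variant or phenotype codes) and that similarity is the Jaccard index. The function names, the case identifiers, and the threshold are all hypothetical.

```python
def jaccard(a, b):
    """Jaccard index of two collections: |intersection| / |union|."""
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def similar_cases(query_features, hosted_cases, threshold=0.5):
    """Return (case_id, score) pairs scoring at or above the threshold.

    hosted_cases maps a case identifier to that case's set of features.
    Results are ordered by descending score, then by case identifier.
    """
    scored = [
        (case_id, jaccard(query_features, features))
        for case_id, features in hosted_cases.items()
    ]
    matches = [(case_id, score) for case_id, score in scored if score >= threshold]
    return sorted(matches, key=lambda pair: (-pair[1], pair[0]))
```

A database could swap in any other scoring function without changing what the querying side sees, which is a ranked list of matching cases.
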
We plan to establish a peer-to-peer federated network based on Data
Connect, where each database connects to one or more databases within
the network. Because of the sensitivity of the information being shared,
most databases will require requests from other databases to be
authenticated with a pre-shared key (PSK). These keys are usually shared
via encrypted email messages. Connecting to other databases in this way
can be time-consuming, but it gives each database full control over
whom it shares data with.
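
The pre-shared key check described above can be sketched as follows. This is a minimal illustration, not a mechanism defined by Data Connect: the header name X-Auth-Token, the peer names, and the key values are assumptions. Keys are compared with hmac.compare_digest so that the comparison time does not reveal how much of a guessed key was correct.

```python
import hmac

# Hypothetical registry of peers and the key exchanged with each of them.
PEER_KEYS = {
    "peer-db-a": "example-key-a",
    "peer-db-b": "example-key-b",
}


def authenticate(headers, peer_keys=PEER_KEYS, header_name="X-Auth-Token"):
    """Return the name of the peer whose key was presented, or None."""
    presented = headers.get(header_name, "")
    if not presented:
        return None
    for peer, key in peer_keys.items():
        # Constant-time comparison of the presented key with the stored one.
        if hmac.compare_digest(presented.encode("utf-8"), key.encode("utf-8")):
            return peer
    return None
```

Keeping one key per peer lets a database revoke access for a single partner without affecting the others.
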