Advancing the allergenicity assessment of new proteins using a
text mining resource
Jorge Novoa1, Antonio Fernandez-Dumont2, E.N. Clare Mills3,
F. Javier Moreno4* and Florencio Pazos5*
1 Computational Systems Biology Group, National Centre for Biotechnology (CNB-CSIC), 28049 Madrid, Spain.
2 European Food Safety Authority (EFSA), 43126 Parma, Italy.
3 School of Biosciences and Medicine, The University of Surrey, Guildford GU2 7XH, UK.
4 Instituto de Investigación en Ciencias de la Alimentación (CIAL), CSIC-UAM, CEI (UAM+CSIC), 28049 Madrid, Spain.
CONTACT INFORMATION:
* Correspondence to: javier.moreno@csic.es; pazos@cnb.csic.es
Abstract
BACKGROUND: With a society increasingly demanding alternative protein
food sources, new strategies for evaluating protein safety issues, such
as their allergenic potential, are needed. Large-scale and systemic
studies on allergenic proteins are hindered by the limited and
non-harmonized clinical information available for these substances in
dedicated databases. A key piece of missing information is the
symptomatology of the allergens, particularly when expressed in terms of
standard vocabularies, which would allow these data to be connected with
other biomedical resources for studies related to human health. In this
work, we have generated the first resource with a
comprehensive annotation of allergens’ symptomatology, using a
text-mining approach that extracts significant co-mentions between these
entities from the scientific literature.
METHODS: The main resource of biomedical literature (PubMed,
~36 million abstracts) was mined to automatically
extract relationships between allergens and clinical symptoms. The
annotations are given in terms of standard vocabularies in widely used
biomedical databases. The method identifies statistically significant
co-mentions between the textual descriptions of the two types of
entities in the literature as an indication of a relationship.
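As a rough illustration of what "statistically significant co-mention" can mean, the sketch below scores one allergen-symptom pair with an upper-tail hypergeometric test. This is an assumption made for illustration only: the abstract does not state which statistical test is used, and the function name, its parameters, and the toy counts are hypothetical.

```python
from math import comb


def comention_pvalue(n_total, n_allergen, n_symptom, n_both):
    """Upper-tail hypergeometric probability P(X >= n_both).

    n_total    -- abstracts in the corpus
    n_allergen -- abstracts mentioning the allergen
    n_symptom  -- abstracts mentioning the symptom
    n_both     -- abstracts mentioning both

    Assumed test for illustration; not necessarily the one used
    to build the resource.
    """
    max_overlap = min(n_allergen, n_symptom)
    # Number of ways to choose which abstracts mention the symptom.
    denom = comb(n_total, n_symptom)
    # Sum over every overlap at least as large as the observed one.
    tail = sum(
        comb(n_allergen, k) * comb(n_total - n_allergen, n_symptom - k)
        for k in range(n_both, max_overlap + 1)
    )
    return tail / denom


# Toy corpus (hypothetical counts): 1,000 abstracts, 50 mention the
# allergen, 40 mention the symptom, 10 mention both.
p = comention_pvalue(1000, 50, 40, 10)
print(f"p-value for the toy pair: {p:.3g}")
```

Exact binomial coefficients keep the sketch dependency-free, but they become slow at the scale of PubMed; a log-space or library implementation would be the more practical choice there, and testing many pairs would also call for a multiple-testing correction.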
RESULTS: A total of 1,180 clinical signs, drawn from the Human Phenotype
Ontology (HPO) and the Medical Subject Headings (MeSH) terms of PubMed
together with other allergen-specific symptoms, were linked to 1,036
unique allergens annotated in the two main allergen-related public
databases via 14,009 relationships.
CONCLUSIONS: This resource could serve as a starting point for a future
manually curated compilation of allergen symptomatology. The annotations
are publicly available through an interactive web interface at https://csbg.cnb.csic.es/CoMent_allergen/.
Keywords: allergen databases; allergen symptomatology; clinical
relevance; risk assessment; text mining