Discussion
Considering the clinical relevance of individual food allergens is key
to developing new strategies and tools for improving the allergenicity
RA of novel proteins[24,25]. However, this is challenging
because there is a lack of consensus lists of clinically relevant
allergens with demonstrable potency in eliciting an allergic
reaction[9]. Here, we have
generated the first resource with
a comprehensive annotation of allergens’ symptomatology using a
text-mining approach that extracts significant co-mentions between
allergens and clinical symptoms from scientific literature. The
annotations are given in terms of standard vocabularies in widely used
biomedical databases, which allows connecting information on allergens
with a plethora of other resources.
An obvious drawback of this approach is that it detects generic
relationships that carry no information about the underlying cause, the
“direction”, or the positive/negative nature of the association. This is
illustrated by the relationship between latex allergy and spina bifida
discussed previously. Another example is the relationship found between many
allergens and “decreased IgE levels” since reducing the levels of that
immunoglobulin is a treatment for some allergies (not a symptom of
them). Some incorrect associations also arise from recurrent co-mentions
with terms mentioned in a general “introductory” context. This is the
case, for example, for the appreciable number of incorrect linkages with
“anaphylaxis” discussed in the Results. Despite these examples, in the
case of allergens, most of the co-mentions with symptoms and
pathological terms are expected to be correct and due to the
cause-effect relationships we are interested in. Although it is not
possible to carry out an exhaustive evaluation of all the relationships,
as there are no curated resources to check against, the indirect
evaluation using the exposure route shows that our results are globally
correct (Figure 3). Future improvements of the system, as well
as eventual manual curation of these allergen-symptom relationships,
could lead to a fully curated resource.
Our approach could be a valid starting point for the future grading of
allergens based on clinical symptoms and contribute to refining the
current, oversimplistic RA view in which proteins are categorized as
allergens or
non-allergens[9,21]. In this work, we mainly focus
on identifying allergens that cause life-threatening symptoms, and we
provide an initial list of 222 allergens with significant co-mentions
(Supplementary Table 1). The main routes of allergen exposure are
considered, with ingestion (foods) being the most common (54.5%),
followed at a distance by injection (insect bites) (23.0%), inhalation
(19.8%), and skin (2.7%). This list has been further manually refined
to a selection of 137 allergens of potential food relevance for RA
purposes (Table 1). Notably, this refined list could serve
as a starting point for proposing a selection of the most relevant
allergens as a potential set of reference proteins for validating
predictive models of allergenicity; no such set exists today. In this
context, a potential match of a novel protein with any of these
allergens, based on criteria such as similarity in amino acid
sequence and physicochemical properties, and/or in secondary-structure
motifs and 3D structure, could indicate a high risk of allergenicity.
Moreover, a potential match with allergens statistically co-mentioned
with less severe symptoms (814 allergens are classified in this
category) or with none of them (1,143) could suggest a medium or low
risk of IgE allergenicity, respectively. This type of information,
linking the allergen hit with the clinical symptomatology, is crucial for
developing targeted and proportionate follow-up RA steps depending on the
risk level and accompanying uncertainties[25].
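The three-tier reading of an allergen match described above can be
sketched in a few lines of Python. This is only an illustration of the
logic, not the implementation behind the resource: the function name,
the tier labels, and the example symptom terms are hypothetical, and in
practice the set of life-threatening terms would come from the curated
symptom vocabulary.

```python
from __future__ import annotations

# Hypothetical, illustrative set of terms treated as life-threatening.
ILLUSTRATIVE_SEVERE_TERMS = frozenset({"anaphylaxis"})


def classify_match(symptoms: set[str],
                   severe_terms: frozenset[str] = ILLUSTRATIVE_SEVERE_TERMS) -> str:
    """Map the symptom profile of a matched allergen to a risk tier.

    symptoms: symptom terms significantly co-mentioned with the allergen
    that the novel protein matched (empty if none were found).
    """
    if symptoms & severe_terms:
        # At least one life-threatening symptom is co-mentioned.
        return "high"
    if symptoms:
        # Only less severe symptoms are co-mentioned.
        return "medium"
    # No symptom is significantly co-mentioned with the allergen.
    return "low"
```

For example, a match against an allergen whose profile contains
"anaphylaxis" would fall in the high tier, whereas a match against an
allergen with no significant co-mentions would fall in the low tier.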
This resource, available to the community through a free web interface,
will make it possible to connect the hitherto isolated data on allergenic
substances with other biomedical databases (on genes, diseases, molecular
processes, etc.). For example, as HPO (symptom) profiles for diseases are
available, it would be possible to compare them with those for allergens
in order to alert practitioners to possible allergy-disease misdiagnoses.
In general, this wealth of information would pave the way for systemic
studies on the complex phenomenon of allergenicity. Systemic and
network-based studies are now widely used in biomedicine[26], but they
need massive amounts of interconnected data, which are currently scarce
for the particular case of allergens.
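As a rough illustration of the profile comparison mentioned above, the
following Python sketch ranks diseases by the overlap between their
symptom-term sets and that of an allergen. The Jaccard index is used
here purely as an assumed, simple similarity measure (ontology-aware
semantic similarity would be a natural alternative); the function names
are hypothetical and the term identifiers passed in are expected to be
HPO term IDs supplied by the user.

```python
from __future__ import annotations


def jaccard(a: frozenset[str], b: frozenset[str]) -> float:
    """Jaccard index of two term sets; defined as 0.0 when both are empty."""
    union = a | b
    if not union:
        return 0.0
    return len(a & b) / len(union)


def rank_diseases(allergen_profile: frozenset[str],
                  disease_profiles: dict[str, frozenset[str]]) -> list[tuple[str, float]]:
    """Return (disease, similarity) pairs sorted by decreasing similarity."""
    scored = [(name, jaccard(allergen_profile, profile))
              for name, profile in disease_profiles.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)
```

Diseases at the top of such a ranking share the most symptoms with the
allergen and would be the first candidates to flag for a possible
allergy-disease confusion.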
This resource can also be useful for the similarity-based prediction of
allergenic proteins in new food sources, as a match against a known
allergen can now be enriched with a profile of possible symptoms and,
eventually, a severity inference, as discussed above. This
straightforward information will be of great utility for the
allergenicity RA process.
As a next step, this approach could also enable data-mining studies
aimed at gaining new insights into the molecular basis of other
hazardous properties (e.g., adjuvanticity, toxicity), advancing
the overall protein safety assessment.