Introduction
Society currently demands healthier food and more sustainable food
systems, aiming to reduce global warming potential and make more
effective use of water and land. Moreover, there is a need to secure
our food supply, since the global population is expected to reach nearly
10 billion by 2050[1]. Therefore, the increasing
demand for protein in an ever-growing world is a global food system
challenge[2]. Alternative proteins and innovative
foods are proposed as an answer, receiving extensive attention
worldwide[3]. Before any innovative food can be
introduced into the market, a pre-market safety assessment is undertaken
to ensure the wholesomeness of the product. The allergenicity evaluation
of new food proteins is a key step in the risk assessment (RA) process.
Food proteins can cause life-threatening conditions (e.g., anaphylactic reactions) and chronic pathologies (e.g., celiac
disease), although what makes a protein an allergen is not yet fully
understood. Food allergy shows a substantial geographical
variation in prevalence and causative foods in both children and
adults[4,5], and represents a major public
health problem for which no effective cure
exists[6]. According to recent epidemiological
data, food allergy prevalence is rising, the severity of allergic
reaction symptoms is high, and there are significant unmet needs for
those living with food allergies[7].
Allergenicity prediction is challenging because an allergic reaction to
a protein depends on a complex interplay among an individual’s immune
system, the protein, and other environmental and lifestyle
factors (microbiome, environmental pollutants, diet, etc.). Current
strategies and tools used for the RA and management of food allergies
are considered rudimentary[8] and present
limitations[9]. Bioinformatic analysis is a
cornerstone in the allergenicity assessment, where the amino acid
sequence of the new protein is compared with those of known allergens
available in public databases. While these databases contain relevant
information on thousands of allergens, their inclusion criteria
often differ and they lack systematic information on the symptomatology
associated with each allergen[10,11]. Such information is crucial
for interpreting the potential risk associated with a new protein, as
allergens are very diverse in terms of prevalence, potency and severity
of clinical symptoms, ranging from very mild clinical manifestations
(e.g., hives) to severe and even fatal ones (e.g., anaphylaxis)[12,13].
Information on clinical symptoms caused by allergens is dispersed in the
scientific literature, expressed as free text and not linked to
structured biomedical databases via standard identifiers and
vocabularies. This hinders simple tasks such as the automatic assessment
of the symptomatology/severity of a new allergen by sequence matching
against others, as well as large-scale studies aimed at gaining insight
into this phenomenon.
In this work, we mined the main resource of biomedical literature
(PubMed, ~36 million abstracts) to automatically extract
relationships between allergens and clinical symptoms. The method
identifies statistically significant co-mentions of the textual
descriptions of the two types of entities in the literature as an
indication of a relationship. With this approach, we generated the first
comprehensive structured database that associates allergens with
symptoms, expressed in terms of standard vocabularies. These
associations are available to the community through an interactive web
interface.
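As a rough illustration of what "statistically significant co-mention" can mean, the sketch below scores one allergen-symptom pair with a one-sided hypergeometric test (equivalent to a one-sided Fisher exact test) on abstract counts. The choice of test is an assumption made here for illustration; this section does not specify which statistic is used. The function names and all counts in the usage lines are likewise hypothetical.

```python
import math


def log_comb(n: int, k: int) -> float:
    """Natural log of the binomial coefficient C(n, k), via log-gamma."""
    return math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)


def comention_p_value(n_total: int, n_allergen: int,
                      n_symptom: int, n_both: int) -> float:
    """P(X >= n_both) if mentions were independent (hypergeometric null).

    n_total    -- abstracts in the corpus
    n_allergen -- abstracts mentioning the allergen
    n_symptom  -- abstracts mentioning the symptom
    n_both     -- abstracts mentioning both
    Assumption: a hypergeometric null model, chosen for illustration only.
    """
    # Support of the hypergeometric distribution for the overlap count.
    lowest = max(0, n_allergen + n_symptom - n_total)
    highest = min(n_allergen, n_symptom)
    start = max(n_both, lowest)
    if start > highest:
        return 0.0

    # Work in log space so corpus sizes in the tens of millions are safe.
    log_denominator = log_comb(n_total, n_symptom)
    log_terms = [
        log_comb(n_allergen, k)
        + log_comb(n_total - n_allergen, n_symptom - k)
        - log_denominator
        for k in range(start, highest + 1)
    ]
    largest = max(log_terms)
    tail = math.exp(largest) * sum(math.exp(t - largest) for t in log_terms)
    return min(1.0, tail)


# Hypothetical counts: a corpus of 36 million abstracts, an allergen named
# in 2,000 of them, a symptom named in 50,000, and 40 abstracts naming both.
p = comention_p_value(36_000_000, 2_000, 50_000, 40)
is_candidate_association = p < 1e-6  # threshold is illustrative only
```

Because many allergen-symptom pairs would be tested at once, any practical use of such a score would also need a multiple-testing correction, which this sketch leaves out.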