Introduction
Society currently demands healthier food and more sustainable food systems, with the aim of reducing global warming potential and using water and land more effectively. Moreover, the food supply needs to be secured, since the global population is expected to reach nearly 10 billion by 2050[1]. The increasing demand for protein from a growing population is therefore a global food system challenge[2]. Alternative proteins and innovative foods have been proposed as an answer and are receiving extensive attention worldwide[3]. Before any innovative food can be introduced into the market, a pre-market safety assessment is undertaken to ensure the wholesomeness of the product. The allergenicity evaluation of new food proteins is a key step in the risk assessment (RA) process. Food proteins can cause life-threatening conditions (e.g., anaphylactic reactions) and chronic pathologies (e.g., celiac disease), although what makes a protein an allergen is not yet fully understood. Food allergy shows substantial geographical variation in prevalence and causative foods in both children and adults[4,5], and represents a major public health problem for which no effective cure exists[6]. According to recent epidemiological data, food allergy prevalence is rising, the severity of allergic reaction symptoms is high, and there are significant unmet needs for those living with food allergies[7].
Allergenicity prediction is challenging because an allergic reaction to a protein depends on a complex interplay between an individual’s immune system, the protein itself, and other environmental and lifestyle factors (microbiome, environmental pollutants, diet, etc.). Current strategies and tools used for the RA and management of food allergies are considered rudimentary[8] and present limitations[9]. Bioinformatic analysis is a cornerstone of allergenicity assessment: the amino acid sequence of the new protein is compared with those of known allergens available in public databases. While these databases contain relevant information on thousands of allergens, their inclusion criteria often differ and they lack systematic information on the associated symptomatology[10,11]. Such information is crucial for interpreting the potential risk associated with a new protein, as allergens are very diverse in terms of prevalence, potency and severity of clinical symptoms, which range from very mild manifestations (e.g., hives) to severe and even fatal ones (e.g., anaphylaxis)[12,13].
Information on the clinical symptoms caused by allergens is dispersed across the scientific literature, expressed as free text and not linked to structured biomedical databases via standard identifiers and vocabularies. This hinders simple tasks, such as automatically assessing the symptomatology and severity of a new allergen by sequence matching against known ones, as well as large-scale studies aimed at gaining insight into this phenomenon.
In this work, we mined the main resource of biomedical literature (PubMed, ~36 million abstracts) to automatically extract relationships between allergens and clinical symptoms. The method identifies statistically significant co-mentions of the textual descriptions of the two types of entities in the literature and takes them as an indication of a relationship. With this approach, we generated the first comprehensive structured database that associates allergens with symptoms, expressed in terms of standard vocabularies. These associations are available to the community through an interactive web interface.
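To make the idea of a statistically significant co-mention concrete, the following is a minimal sketch of one way such significance could be scored. It assumes a one-sided hypergeometric test (equivalent to a one-sided Fisher's exact test) on abstract counts; the text above does not state which test is used, so the choice of test, the function name comention_p_value and the counts in the usage lines are illustrative assumptions only.

```python
from math import comb


def comention_p_value(n_total, n_allergen, n_symptom, n_both):
    """Probability of seeing at least n_both co-mentions by chance.

    Hypothetical helper (not taken from the paper). Arguments:
      n_total    -- abstracts in the corpus
      n_allergen -- abstracts mentioning the allergen
      n_symptom  -- abstracts mentioning the symptom
      n_both     -- abstracts mentioning both
    Assumes a hypergeometric null model: the symptom abstracts are drawn
    at random from the corpus, independently of the allergen mentions.
    """
    max_overlap = min(n_allergen, n_symptom)
    # Number of ways to choose the symptom abstracts from the corpus.
    denominator = comb(n_total, n_symptom)
    # Sum the upper tail: overlaps of size n_both or larger.
    tail = sum(
        comb(n_allergen, k) * comb(n_total - n_allergen, n_symptom - k)
        for k in range(n_both, max_overlap + 1)
    )
    return tail / denominator


# Toy counts, invented for illustration (not real PubMed figures).
p = comention_p_value(n_total=1000, n_allergen=40, n_symptom=50, n_both=12)
print(f"p-value for the toy allergen-symptom pair: {p:.3g}")
```

A small p-value would suggest that the two terms co-occur more often than expected by chance. The exact integer arithmetic used here is only practical for small counts; at the scale of the full corpus, a log-space implementation or a statistics library would be needed, together with a correction for multiple testing across all allergen-symptom pairs.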