Discussion
Considering the clinical relevance of individual food allergens is important for developing new strategies and tools to improve the allergenicity RA of novel proteins[24,25]. However, this is challenging because there are no consensus lists of clinically relevant allergens with demonstrable potency in eliciting an allergic reaction[9]. Here, we have generated the first resource with a comprehensive annotation of allergen symptomatology, using a text-mining approach that extracts significant co-mentions between allergens and clinical symptoms from the scientific literature. The annotations are expressed in the standard vocabularies of widely used biomedical databases, which allows information on allergens to be connected with a plethora of other resources.
An obvious drawback of this approach is that it detects generic relationships that carry no information about the underlying cause, the “direction” of the relationship, or its positive or negative nature. This is illustrated by the relationship between latex allergy and spina bifida discussed above. Another example is the relationship found between many allergens and “decreased IgE levels”: reducing the levels of that immunoglobulin is a treatment for some allergies, not a symptom of them. Some wrong associations are also due to recurrent co-mentions with terms that appear in a general “introductory” context. This is, for example, the case for the appreciable number of wrong linkages with “anaphylaxis” discussed in the Results. Despite these examples, in the case of allergens, most of the co-mentions with symptoms and pathological terms are expected to be correct and to reflect the cause-effect relationships we are interested in. Although an exhaustive evaluation of all the relationships is not possible, as there are no curated resources to check against, the indirect evaluation using the exposure route shows that our results are globally correct (Figure 3). Future improvements of the system, together with eventual manual curation of these allergen-symptom relationships, could lead to a fully curated resource.
Our approach could be a valid starting point for the future grading of allergens based on clinical symptoms and could contribute to refining the current, oversimplistic RA view in which proteins are categorized as either allergens or non-allergens[9,21]. In this work, we mainly focus on identifying allergens that cause life-threatening symptoms, and we provide an initial list of 222 allergens with significant co-mentions (Supplementary Table 1). The main routes of allergen exposure are considered, with ingestion (foods) being the most common (54.5%), followed distantly by injection (insect bites) (23.0%), inhalation (19.8%), and skin (2.7%). This list has been further manually refined to a selection of 137 allergens of potential food relevance for RA purposes (Table 1). Remarkably, this refined list could serve as a starting point for proposing a selection of the most relevant allergens as a set of reference proteins to validate predictive models of allergenicity, a set that does not exist today. In this context, a potential match of a novel protein with any of these allergens, based on criteria such as amino acid sequence similarity, physicochemical properties, secondary structure motifs and/or 3D structure, could indicate a high risk of allergenicity. Moreover, a potential match with allergens statistically co-mentioned with less severe symptoms (814 allergens are classified in this category) or with none of them (1,143) could suggest a medium or low risk of IgE allergenicity, respectively. This type of information, linking the allergen hit with the clinical symptomatology, is crucial for developing targeted and proportionate follow-up RA steps that depend on the risk level and the accompanying uncertainties[25].
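As an illustration only, the three-tier reading described above could be sketched as follows. The category labels, the function name and the rule of taking the highest tier when a novel protein matches several allergens are assumptions made for this sketch and are not part of the resource; the case of a protein with no allergen hit at all is left undefined, as it is not addressed here.

```python
from enum import IntEnum
from typing import Iterable, Optional


class Risk(IntEnum):
    """Ordered risk tiers; a higher value means a higher risk."""
    LOW = 1
    MEDIUM = 2
    HIGH = 3


# Hypothetical labels for the three groups of allergens described in the text.
CATEGORY_TO_RISK = {
    "life_threatening": Risk.HIGH,   # co-mentioned with life-threatening symptoms
    "less_severe": Risk.MEDIUM,      # co-mentioned with less severe symptoms only
    "no_symptoms": Risk.LOW,         # no significant symptom co-mentions
}


def risk_for_hits(hit_categories: Iterable[str]) -> Optional[Risk]:
    """Return the highest tier among the matched allergens.

    Taking the maximum over several hits is an assumption of this sketch.
    Returns None when there are no hits.
    """
    tiers = [CATEGORY_TO_RISK[c] for c in hit_categories]
    return max(tiers) if tiers else None


if __name__ == "__main__":
    # A novel protein matching one "less severe" and one "no symptoms" allergen.
    print(risk_for_hits(["less_severe", "no_symptoms"]).name)
```

In practice, the matching step itself (sequence, physicochemical or structural similarity) would sit upstream of this function and determine which allergens count as hits.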
This resource, available to the community through a free web interface, will allow the hitherto isolated data on allergenic substances to be connected with other biomedical databases (on genes, diseases, molecular processes, etc.). For example, as HPO (symptom) profiles for diseases are available, it would be possible to compare them with those for allergens in order to inform practitioners of possible allergy-disease misdiagnoses. In general, this large body of information would pave the way for systemic studies of the complex phenomenon of allergenicity. Systemic and network-based studies are now widely used in biomedicine[26], but they need massive amounts of interconnected data, which are currently scarce in the particular case of allergens.
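A minimal sketch of such a profile comparison is given below, assuming that each allergen and each disease is represented simply as a set of HPO term identifiers and using the Jaccard index as the similarity measure. The measure, the function names and the term identifiers (deliberately written as placeholders, not real HPO codes) are illustrative assumptions; the resource itself does not prescribe a comparison method.

```python
from typing import Dict, List, Set, Tuple


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Jaccard index of two term sets; defined as 0.0 when both are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def rank_diseases(
    allergen_profile: Set[str],
    disease_profiles: Dict[str, Set[str]],
) -> List[Tuple[str, float]]:
    """Rank diseases by the overlap of their symptom profile with the allergen's."""
    scores = [
        (name, jaccard(allergen_profile, terms))
        for name, terms in disease_profiles.items()
    ]
    # Most similar first; ties are broken alphabetically for reproducibility.
    return sorted(scores, key=lambda item: (-item[1], item[0]))


if __name__ == "__main__":
    # Placeholder identifiers standing in for HPO terms.
    allergen = {"HP:term_a", "HP:term_b", "HP:term_c"}
    diseases = {
        "disease_1": {"HP:term_b", "HP:term_c", "HP:term_d"},
        "disease_2": {"HP:term_x"},
    }
    for name, score in rank_diseases(allergen, diseases):
        print(f"{name}\t{score:.2f}")
```

A set-based overlap ignores the HPO hierarchy; an ontology-aware semantic similarity would be a natural refinement, at the cost of additional assumptions.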
This resource can also be useful for the similarity-based prediction of allergenic proteins in new food sources, as a match against a known allergen can now be enriched with a profile of possible symptoms and, eventually, a severity inference, as discussed above. This readily available information will be of great utility for the allergenicity RA process.
As a next step, this approach could also enable data-mining studies aimed at gaining new insights into the molecular basis of other hazardous properties (e.g., adjuvanticity, toxicity), advancing the overall protein safety assessment.