Literature searches
For each allergen, we query NCBI’s Entrez API to retrieve the list of articles (PubMed entries) mentioning it. To do so, we construct a search string from the allergen’s synonyms (combined with OR operators) and its species name(s) (joined to the synonyms with AND). For example, for an allergen with three synonyms, coming from a species with common name “spcC” and scientific name “spcS”, the search string is:
(“syn1” OR “syn2” OR “syn3”) AND (“spcC” OR “spcS”)
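As an illustration, the search string can be assembled and submitted to the PubMed ESearch endpoint of the Entrez E-utilities roughly as follows. This is a minimal sketch, not the code used in this work: the function names (`or_group`, `build_query`, `search_pubmed`) and the `retmax` value are assumptions made for the example, and paging over large result sets is omitted.

```python
import json
import urllib.parse
import urllib.request

# Public ESearch endpoint of the NCBI Entrez E-utilities.
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def or_group(terms):
    """Quote each term and join the terms with OR inside parentheses."""
    return "(" + " OR ".join(f'"{t}"' for t in terms) + ")"


def build_query(synonyms, species_names):
    """Combine the synonym group and the species-name group with AND."""
    return f"{or_group(synonyms)} AND {or_group(species_names)}"


def search_pubmed(query, retmax=10000):
    """Return the PubMed IDs matching `query` (performs a network request).

    `retmax` is an assumed cap for this sketch; a single call returns at
    most that many IDs.
    """
    params = urllib.parse.urlencode(
        {"db": "pubmed", "term": query, "retmode": "json", "retmax": retmax}
    )
    with urllib.request.urlopen(f"{ESEARCH_URL}?{params}") as response:
        payload = json.load(response)
    return payload["esearchresult"]["idlist"]


# Placeholder names taken from the example in the text.
query = build_query(["syn1", "syn2", "syn3"], ["spcC", "spcS"])
print(query)
```

Calling `search_pubmed(query)` would then return the matching PubMed IDs as a list of strings.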
At this point, we make the list of allergens non-redundant by merging entries from the two databases when they share the same allergen identifier in the WHO/IUIS Allergen Nomenclature system[17]. Because the list of retrieved articles can differ slightly for the same allergen in each database, owing to the different synonyms, the unified entry is assigned the union of the articles of the two original entries. This led to a final list of 2,179 unique allergens.
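The merging step can be sketched as follows. The record layout (dictionaries with `iuis_id`, `synonyms` and `pmids` keys) and the function name `merge_allergens` are hypothetical, chosen only for this example. Entries without a WHO/IUIS identifier are assumed to be kept as separate entries, which the text does not specify.

```python
def merge_allergens(entries_a, entries_b):
    """Merge allergen entries from two sources by WHO/IUIS identifier.

    Entries sharing an identifier are unified, taking the union of their
    synonyms and of their PubMed IDs. Entries with no identifier
    (iuis_id is None) are kept unmerged (an assumption of this sketch).
    """
    merged = {}
    unmatched = []
    for entry in list(entries_a) + list(entries_b):
        key = entry["iuis_id"]
        if key is None:
            unmatched.append(entry)
            continue
        if key not in merged:
            merged[key] = {"iuis_id": key, "synonyms": set(), "pmids": set()}
        merged[key]["synonyms"] |= set(entry["synonyms"])
        merged[key]["pmids"] |= set(entry["pmids"])
    return list(merged.values()) + unmatched


# Toy records with made-up PubMed IDs, for illustration only.
db_a = [{"iuis_id": "Ara h 1", "synonyms": {"syn1"}, "pmids": {1, 2}}]
db_b = [
    {"iuis_id": "Ara h 1", "synonyms": {"syn2"}, "pmids": {2, 3}},
    {"iuis_id": "Bet v 1", "synonyms": {"syn3"}, "pmids": {4}},
]
unified = merge_allergens(db_a, db_b)
```

Note that, as written, the sketch also collapses duplicate identifiers within a single source.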
From the HPO[18] we retrieved all terms under “phenotypic abnormality” (HP:0000118) in the HPO hierarchy, thereby excluding HPO terms not related to phenotypes or clinical signs. This led to a final list of 16,481 clinical signs and symptoms associated with human pathologies, which includes the different synonyms used to name them. We retrieved the list of PubMed entries mentioning a given HPO term by querying the Entrez API with the term’s synonyms combined with “OR”.
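Selecting the terms under HP:0000118 amounts to collecting all descendants of that node in the ontology graph. A minimal sketch, assuming the hierarchy has already been loaded into a parent-to-children mapping: the `children` dictionary and the `TOY:` identifiers below are invented for illustration, and excluding the root term from the result is an assumption.

```python
from collections import deque


def descendants(children, root):
    """Return every term reachable from `root`, excluding `root` itself.

    `children` maps a term ID to the IDs of its direct child terms. The
    ontology is a directed acyclic graph, so a term can have several
    parents; the `seen` set ensures each term is collected once.
    """
    seen = set()
    queue = deque([root])
    while queue:
        term = queue.popleft()
        for child in children.get(term, ()):
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return seen


# Toy hierarchy: only HP:0000118 is a real identifier (from the text).
children = {
    "TOY:root": ["HP:0000118", "TOY:not_a_phenotype"],
    "HP:0000118": ["TOY:a", "TOY:b"],
    "TOY:a": ["TOY:c"],
    "TOY:b": ["TOY:c"],
}
phenotype_terms = descendants(children, "HP:0000118")
```

Terms outside the selected branch, such as `TOY:not_a_phenotype` here, are never visited and are therefore excluded.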
As an additional vocabulary related to pathology, we included a list of 4,791 MeSH terms[19] labelled with MeSH “semantic types” related to diseases and symptoms (T046, T047, T048, T049, T184, T019, T020, T190 and T191). We retrieved the lists of PubMed entries associated with these terms by querying the Entrez API for articles annotated with these terms in their MeSH fields.
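Unlike the free-text searches above, a query restricted to MeSH annotations uses a PubMed field tag. A minimal sketch of how such a query string could be built: the helper name `mesh_query` and its `explode` flag are assumptions for this example. By default, PubMed also matches narrower terms in the MeSH tree, and the `:noexp` qualifier turns this off; which behaviour was used here is not stated in the text.

```python
def mesh_query(term, explode=True):
    """Build a PubMed query matching articles annotated with a MeSH term.

    With explode=True the [MeSH Terms] tag is used, which also matches
    narrower terms; with explode=False the [MeSH Terms:noexp] tag
    restricts the match to the term itself.
    """
    tag = "MeSH Terms" if explode else "MeSH Terms:noexp"
    return f'"{term}"[{tag}]'


# "Asthma" is used purely as an illustrative MeSH heading.
print(mesh_query("Asthma"))
print(mesh_query("Asthma", explode=False))
```

The resulting string can be passed as the `term` parameter of an ESearch request against the `pubmed` database.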
Finally, as these two vocabularies are generic and might not have the optimal level of detail for describing allergy-related symptomatology, we retrieved an additional list of 108 clinical signs associated with allergy[13], and followed a similar procedure to obtain the PubMed entries mentioning them.