Literature searches
For each allergen, we query NCBI’s Entrez API to retrieve the list of
articles (PubMed entries) mentioning it. To do so, we construct a search
string from the allergen’s synonyms (combined with OR operators) and its
species name(s) (added with AND). For example, for an allergen with 3
synonyms, coming from a species with common name “spcC” and scientific
name “spcS”, the search string is:
(“syn1” OR “syn2” OR “syn3”) AND (“spcC” OR “spcS”)
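As an illustration, the query construction described above can be sketched in Python as follows. This is a minimal sketch, not the code used in this work: the function names, the `retmax` value and the JSON return mode are assumptions, and only the public E-utilities `esearch` endpoint is taken as given. The sketch builds the request URL without sending it.

```python
from urllib.parse import urlencode

# Public NCBI E-utilities search endpoint.
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def or_group(terms):
    """Quote each term and join the terms with OR inside parentheses."""
    return "(" + " OR ".join(f'"{t}"' for t in terms) + ")"


def build_allergen_query(synonyms, species_names):
    """Combine allergen synonyms and species names as described in the text."""
    return f"{or_group(synonyms)} AND {or_group(species_names)}"


def esearch_url(query, retmax=10000):
    """Build (but do not send) an esearch request against PubMed.

    retmax and retmode are illustrative choices, not values from the text.
    """
    params = {"db": "pubmed", "term": query, "retmax": retmax, "retmode": "json"}
    return f"{ESEARCH_URL}?{urlencode(params)}"


query = build_allergen_query(["syn1", "syn2", "syn3"], ["spcC", "spcS"])
print(query)
print(esearch_url(query))
```

Quoting each synonym forces phrase matching, so multi-word names are not split into independent search terms.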
At this point, we make the list of allergens non-redundant by merging
entries from the two databases when they share the same allergen
identifier in the WHO/IUIS Allergen Nomenclature
system[17]. Because the list of retrieved articles
can differ slightly for the same allergen in each database, owing
to the different synonyms, the unified entry takes the union of the
articles of the two original entries. This led to a final list of 2,179
unique allergens.
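The merging step can be illustrated with a short Python sketch. It assumes that each database has already been reduced to a mapping from WHO/IUIS identifier to a set of PubMed IDs; the function name, this data layout and the PubMed IDs below are placeholders, not data from this work.

```python
def merge_by_iuis_id(entries_a, entries_b):
    """Merge two {IUIS identifier: set of PubMed IDs} mappings.

    Entries sharing an identifier are unified, and the unified entry
    keeps the union of the articles of the original entries.
    """
    merged = {}
    for entries in (entries_a, entries_b):
        for iuis_id, pmids in entries.items():
            merged.setdefault(iuis_id, set()).update(pmids)
    return merged


# Placeholder PubMed IDs, for illustration only.
db_a = {"Ara h 1": {101, 102}, "Bet v 1": {201}}
db_b = {"Ara h 1": {102, 103}, "Der p 1": {301}}

merged = merge_by_iuis_id(db_a, db_b)
print(len(merged), sorted(merged["Ara h 1"]))
```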
From the HPO[18] we retrieved all terms under
“phenotypic abnormality” (HP:0000118) in the HPO hierarchy, to exclude
HPO terms unrelated to phenotypes or clinical signs. This led to a
final list of 16,481 clinical signs and symptoms associated with human
pathologies, including the different synonyms used to name them. We
retrieved the list of PubMed entries mentioning a given HPO term by
querying the Entrez API with the term’s synonyms combined with “OR”.
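The selection of terms under HP:0000118 amounts to collecting all descendants of that node in the ontology graph. The sketch below shows one way to do this, under two assumptions: the hierarchy is available as a parent-to-children mapping, and the root term itself is excluded. The toy hierarchy uses placeholder identifiers (`TOY:…`), and the function names are hypothetical.

```python
from collections import deque


def descendants(root, children):
    """Return every term reachable below `root` (root itself excluded).

    `children` maps a term ID to the IDs of its direct subclasses. A term
    with several parents is visited only once.
    """
    seen = set()
    queue = deque([root])
    while queue:
        term = queue.popleft()
        for child in children.get(term, ()):
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return seen


def synonym_query(synonyms):
    """Combine the synonyms of one term with OR, each quoted as a phrase."""
    return " OR ".join(f'"{s}"' for s in synonyms)


# Toy hierarchy; TOY:* identifiers are placeholders, not real HPO terms.
children = {
    "HP:0000001": ["HP:0000118", "TOY:other_branch"],
    "HP:0000118": ["TOY:1", "TOY:2"],
    "TOY:1": ["TOY:3"],
    "TOY:2": ["TOY:3"],  # TOY:3 has two parents
}

selected = descendants("HP:0000118", children)
print(sorted(selected))
print(synonym_query(["name A", "name B"]))
```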
As an additional vocabulary related to pathology, we included a list of
4,791 MeSH terms[19] labelled with MeSH “semantic
types” related to diseases and symptoms (T046, T047, T048, T049, T184,
T019, T020, T190 and T191). We retrieved the lists of PubMed entries
associated with these terms by querying the Entrez API for articles
annotated with these terms in their MeSH fields.
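The following Python sketch illustrates the two steps above: filtering terms by semantic type, and restricting the search to the MeSH annotation field. The exact field tag used in this work is not stated; the `[MeSH Terms]` PubMed tag is assumed here. The function names and the term-to-type assignments in the example are illustrative only.

```python
# Semantic types listed in the text as related to diseases and symptoms.
DISEASE_SEMANTIC_TYPES = {
    "T046", "T047", "T048", "T049", "T184",
    "T019", "T020", "T190", "T191",
}


def select_terms(term_types):
    """Keep terms labelled with at least one of the selected semantic types.

    `term_types` maps a MeSH term to the list of its semantic types.
    """
    return sorted(
        term
        for term, types in term_types.items()
        if DISEASE_SEMANTIC_TYPES & set(types)
    )


def mesh_query(term):
    """Restrict the search to articles annotated with `term` (assumed tag)."""
    return f'"{term}"[MeSH Terms]'


# Illustrative assignments, not taken from the MeSH release used here.
term_types = {
    "Asthma": ["T047"],
    "Urticaria": ["T047"],
    "Aspirin": ["T109", "T121"],
}

kept = select_terms(term_types)
print(kept)
print([mesh_query(t) for t in kept])
```

Unlike the free-text queries used for allergens and HPO terms, a field-restricted query matches curated annotations only, so it does not depend on how a term is worded in the article text.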
Finally, as these two vocabularies are generic and might not have the
optimal level of detail for describing allergy-related symptomatology,
we retrieved an additional list of 108 clinical signs associated with
allergy[13], and followed a similar procedure to
obtain the PubMed entries mentioning them.