HP-PPI Data collection and cleaning
The HP-PPI data for the database were collected from several HP-PPI
databases, namely BioGRID(36), PHISTO(37), HPIDB(38), MINT(39),
IntAct(40), MPIDB(41), UniProt(42), VirHostNet(43), MatrixDB(44),
I2D(45), DIP(46) and InnateDB(47). The
data obtained from these sources included information about i) UniProt
accession numbers, ii) Gene symbols, iii) UniProt entry names, iv) Gene
symbols for the interacting proteins of pathogen and human host, v)
Corresponding pathogen names for all pathogen proteins, vi) Pathogen
taxon IDs, and vii) Experimental method of interaction detection for
each unique interaction.
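As an illustration only, the fields listed above could be held in a single record type per interaction. The sketch below is not the schema used for the database: the class name, the field names, and the split of each protein attribute into a pathogen and a human field are assumptions made for this example.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class InteractionRecord:
    """One pathogen-human interaction (hypothetical field layout)."""

    pathogen_accession: str   # UniProt accession of the pathogen protein
    human_accession: str      # UniProt accession of the human protein
    pathogen_gene: str        # gene symbol of the pathogen protein
    human_gene: str           # gene symbol of the human protein
    pathogen_entry_name: str  # UniProt entry name of the pathogen protein
    human_entry_name: str     # UniProt entry name of the human protein
    pathogen_name: str        # pathogen name as reported by the source
    pathogen_taxon_id: int    # pathogen taxon ID
    detection_method: str     # experimental method of interaction detection

    def pair(self) -> Tuple[str, str]:
        """Return the accession pair that identifies the interaction."""
        return (self.pathogen_accession, self.human_accession)
```

Making the record immutable (`frozen=True`) also makes it hashable, which is convenient if records are later placed in a set for comparison.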
The UniProt accession number was used as the unique identifier for
proteins extracted from the different sources to keep the data uniform.
Pathogen names from the different databases were also checked for
variations in syntax and nomenclature and were converted to a single
uniform name using the UniProt taxon identifier. Duplicate entries were
removed to avoid redundancy, and obsolete entries were either removed or
converted to the secondary UniProt accession, if available.
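The cleaning steps described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the pipeline actually used: the function name, the dictionary keys, the lookup tables (`taxon_names`, `replacements`), and the choice of (pathogen accession, human accession, detection method) as the duplicate key are all hypothetical.

```python
from typing import Dict, Iterable, List, Optional

Row = Dict[str, str]


def clean_interactions(
    rows: Iterable[Row],
    taxon_names: Dict[str, str],
    replacements: Dict[str, Optional[str]],
) -> List[Row]:
    """Harmonise names, resolve obsolete accessions, drop duplicates.

    taxon_names  maps a taxon ID to the single name to use (assumed lookup).
    replacements maps an obsolete accession to its replacement accession,
                 or to None when no replacement is available (assumed lookup).
    """
    seen = set()
    cleaned: List[Row] = []
    for original in rows:
        row = dict(original)  # copy so the caller's data is not modified
        drop = False
        for field in ("pathogen_acc", "human_acc"):
            acc = row[field]
            if acc in replacements:
                new_acc = replacements[acc]
                if new_acc is None:
                    drop = True  # obsolete and no replacement: remove entry
                    break
                row[field] = new_acc
        if drop:
            continue
        # Use one uniform pathogen name per taxon ID when a name is known.
        row["pathogen_name"] = taxon_names.get(
            row["taxon_id"], row["pathogen_name"]
        )
        # Assumed duplicate key: interacting pair plus detection method.
        key = (row["pathogen_acc"], row["human_acc"], row["method"])
        if key in seen:
            continue
        seen.add(key)
        cleaned.append(row)
    return cleaned
```

Resolving accessions and names before checking for duplicates means that two source records differing only in an outdated accession or a name variant collapse into one entry.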