Protein sequence networks
Sets of representative protein sequences were formed by clustering with
CD-HIT to reduce the sample size and thus the computational effort for
pairwise sequence alignments. Pairwise sequence identity or similarity
was calculated with the Needleman-Wunsch algorithm as implemented in
EMBOSS (version 6.6.0), using the default gap opening and gap extension
penalties of 10 and 0.5, respectively, and the BLOSUM62 substitution
matrix [24,25].
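As an illustration of the alignment step, the following minimal sketch builds a call to the EMBOSS `needle` program with the penalties and matrix stated above and reads identity and similarity from its report. It assumes the default pair output format of `needle`, in which the report contains lines such as `# Identity: 65/149 (43.6%)`. The function names and file paths are hypothetical, and this is not necessarily the script used for the present work.

```python
import re


def needle_command(a_path, b_path, out_path):
    """Build the argument list for one global pairwise alignment.

    The paths are placeholders; the list can be passed to subprocess.run
    on a system where EMBOSS is installed.
    """
    return [
        "needle",
        "-asequence", a_path,
        "-bsequence", b_path,
        "-gapopen", "10",          # gap opening penalty from the text
        "-gapextend", "0.5",       # gap extension penalty from the text
        "-datafile", "EBLOSUM62",  # EMBOSS name of the BLOSUM62 matrix
        "-outfile", out_path,
    ]


# Matches report lines such as "# Identity:      65/149 (43.6%)".
_REPORT_LINE = re.compile(r"^# (Identity|Similarity):\s+(\d+)/(\d+)", re.MULTILINE)


def parse_needle_report(text):
    """Return identity and similarity as fractions of the alignment length."""
    values = {}
    for label, matches, length in _REPORT_LINE.findall(text):
        values[label.lower()] = int(matches) / int(length)
    return values
```

The parsed fractions can then serve as edge weights for the networks described below.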
Collections of protein sequences were represented as protein sequence
networks, in which sequences are depicted as nodes connected by edges
(lines). The edges in a protein sequence network were weighted by the
pairwise sequence identity or similarity of the two connected sequences.
A threshold on the edge weights was applied to select the subset of
edges retained in the network.
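The edge selection can be sketched as a simple filter over the pairwise values. The example below is a minimal illustration with made-up sequence names and weights. It assumes that an edge is kept when its weight is greater than or equal to the threshold; the text does not state whether the comparison is inclusive.

```python
def select_edges(pairwise, threshold):
    """Keep only the pairs whose weight reaches the threshold.

    `pairwise` maps (sequence_a, sequence_b) tuples to an identity or
    similarity value. The inclusive comparison (>=) is an assumption.
    """
    return {pair: w for pair, w in pairwise.items() if w >= threshold}


# Hypothetical pairwise identities for three sequences.
pairwise = {
    ("seqA", "seqB"): 0.82,
    ("seqA", "seqC"): 0.35,
    ("seqB", "seqC"): 0.41,
}

edges = select_edges(pairwise, threshold=0.40)
# Only the seqA-seqB and seqB-seqC edges remain; all three sequences
# are still nodes of the network.
```

Raising the threshold removes more edges and tends to split the network into smaller groups of closely related sequences.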
Protein sequence networks were visualized in Cytoscape (version 3.8.2)
with the prefuse force-directed layout algorithm, taking the edge
weights into account [26]: sequences connected by edges of higher
sequence identity or similarity were preferably placed closer to each
other. The Python NetworkX package (version 1.11) was used to store
the metadata of protein sequence networks in GraphML format, available
for download at https://doi.org/10.18419/darus-205427.
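To show what a weighted network looks like in GraphML, the following sketch writes a small graph using only the Python standard library. It is not the NetworkX call used for the published files; the function name, node names, and the attribute name `weight` are assumptions chosen for illustration.

```python
import xml.etree.ElementTree as ET

GRAPHML_NS = "http://graphml.graphdrawing.org/xmlns"


def to_graphml(nodes, weighted_edges):
    """Serialize an undirected, edge-weighted graph as a GraphML string."""
    root = ET.Element("graphml", {"xmlns": GRAPHML_NS})
    # Declare the edge attribute that carries identity or similarity.
    ET.SubElement(root, "key", {
        "id": "weight",
        "for": "edge",
        "attr.name": "weight",
        "attr.type": "double",
    })
    graph = ET.SubElement(root, "graph", {"edgedefault": "undirected"})
    for name in nodes:
        ET.SubElement(graph, "node", {"id": name})
    for (source, target), weight in weighted_edges.items():
        edge = ET.SubElement(graph, "edge", {"source": source, "target": target})
        data = ET.SubElement(edge, "data", {"key": "weight"})
        data.text = repr(weight)
    return ET.tostring(root, encoding="unicode")


# Hypothetical three-node network with one retained edge.
xml_text = to_graphml(["seqA", "seqB", "seqC"], {("seqA", "seqB"): 0.82})
```

A file written this way stores each edge weight as a `data` element, which tools that read GraphML, such as Cytoscape, can import as an edge attribute.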