Discussion
During CASP14, our group used well-established methodologies for
structural modeling of protein complexes: template-based modeling and
rigid-body docking. We did not use any deep learning-based inter-chain
contact prediction or refinement using extensive molecular dynamics
simulations.
Our main aim was accurate prediction of protein-protein interfaces, even
if this meant lower global accuracy of models. As the interface accuracy
of our models was the best among CASP14 groups predicting protein
assemblies (Fig. 2, Fig. 3A,B), it appears that we have coped with this
task quite successfully. The main reasons for the successful modeling
were probably (1) effective multimeric template identification by
sequence- and structure-based methods, (2) a model selection procedure
involving improved VoroMQA scoring with more emphasis on the interaction
interface, and (3) short molecular dynamics simulations aimed at
removing unrealistic geometry and clashes in docking models.
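The idea behind reason (2), giving the interface more weight during model selection, can be illustrated with a minimal sketch. The score names, the weight value, and the linear combination below are assumptions made purely for illustration; they are not the modified VoroMQA scoring that was actually used.

```python
from dataclasses import dataclass


@dataclass
class ModelScores:
    name: str
    global_score: float     # hypothetical whole-model quality estimate in [0, 1]
    interface_score: float  # hypothetical interface-only quality estimate in [0, 1]


def combined_score(model, interface_weight=0.7):
    """Linear mix of the two scores; a higher interface_weight favors the interface."""
    return (interface_weight * model.interface_score
            + (1.0 - interface_weight) * model.global_score)


def rank_models(models, interface_weight=0.7):
    """Return the models sorted from best to worst combined score."""
    return sorted(models,
                  key=lambda m: combined_score(m, interface_weight),
                  reverse=True)


# Illustrative values only, not CASP14 data.
models = [
    ModelScores("model_a", global_score=0.80, interface_score=0.40),
    ModelScores("model_b", global_score=0.65, interface_score=0.70),
]
ranked = rank_models(models)
```

With the interface weighted more heavily, a model with a better interface can outrank one with a better global score, which mirrors the trade-off described above.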
Unlike the interaction interface accuracy, the global accuracy of our
models was not the highest (Fig. 2, Fig. 3C,D). This is particularly
evident from the close-to-average values of lDDT, a score that
considers all atoms. Most of these lower scores came from template-based
models generated using MODELLER/AltMod. When we used CASP server models
for docking, lDDT scores were typically higher. This suggests that had
we used more advanced modeling techniques [43], our
template-based models might have been of higher global accuracy.
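To make concrete why lDDT is sensitive to all-atom accuracy, the following is a simplified sketch of the score. It assumes the commonly used settings (a 15 Å inclusion radius and tolerance thresholds of 0.5, 1, 2 and 4 Å) and pools all atom pairs into a single fraction. It omits the stereochemistry checks and the handling of symmetric side-chain atoms found in full implementations, so it should not be read as the official CASP evaluation code.

```python
import math

THRESHOLDS = (0.5, 1.0, 2.0, 4.0)  # distance-difference tolerances, in angstroms
INCLUSION_RADIUS = 15.0            # only reference distances below this are scored


def simplified_lddt(ref_atoms, model_atoms):
    """Fraction of local reference distances preserved in the model.

    Both arguments are lists of (residue_id, (x, y, z)) tuples listing the
    same atoms in the same order. Pairs within one residue are skipped.
    """
    preserved = 0
    total = 0
    n = len(ref_atoms)
    for i in range(n):
        res_i, xyz_i = ref_atoms[i]
        for j in range(i + 1, n):
            res_j, xyz_j = ref_atoms[j]
            if res_i == res_j:
                continue
            d_ref = math.dist(xyz_i, xyz_j)
            if d_ref >= INCLUSION_RADIUS:
                continue
            d_model = math.dist(model_atoms[i][1], model_atoms[j][1])
            diff = abs(d_ref - d_model)
            # Each pair is tested against every tolerance threshold.
            for threshold in THRESHOLDS:
                total += 1
                if diff < threshold:
                    preserved += 1
    return preserved / total if total else 0.0


# Toy example: three atoms in three residues; the model misplaces the last one.
reference = [(1, (0.0, 0.0, 0.0)), (2, (3.0, 0.0, 0.0)), (3, (6.0, 0.0, 0.0))]
model = [(1, (0.0, 0.0, 0.0)), (2, (3.0, 0.0, 0.0)), (3, (9.0, 0.0, 0.0))]
score = simplified_lddt(reference, model)
```

Because every atom contributes distance pairs, errors in side chains and loops lower the score even when the interface itself is modeled well.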
Template-based modeling remains the most accurate method to predict
the structures of protein complexes, but the limiting factor for this
approach is the detection of structural templates. Typically, templates
are identified by sequence-based search methods such as BLAST,
PSI-BLAST [44] or HHpred [20]. In CASP14,
aiming to expand the set of available templates, we additionally
employed structure-based searches. The efficiency of the structure-based
approach has been greatly increased by the recent advances in protein
structure prediction [3, 4]. The availability of more
accurate models for monomers may be the reason why our structure-based
template searches successfully complemented sequence-based searches in
CASP14, but less so in CASP13 [17]. It is possible that
structure-based template identification for protein complexes may
play an even more prominent role in the future.
The template-based modeling of protein complexes represents a more
complex problem than the homology modeling of individual proteins.
Unlike for monomeric proteins, the modeling of complexes has to deal
with additional complications such as the presence of alternative
interaction interfaces and differences in stoichiometry of homologous
protein complexes [15–17, 36]. In CASP14, the modeling
of evolutionarily non-conserved antibody-antigen interactions was yet
another example of a more complex problem. In other cases, such as
host-pathogen interactions that do not always emerge from long
coexistence of species, it might also be hard to apply either
template-based or co-evolution-based modeling methods.
When there are no templates and other constraints are lacking, free
docking is the only feasible approach to predict the structures of
protein complexes. Our CASP14 results support previous observations that
docking may be successful only when subunits are sufficiently
accurate [11, 17]. Thus, the recent breakthrough in protein
structure prediction might help not only to detect templates for
multimeric structures through structure-based searches, but also to
expand the applicability of protein docking. However, even if fairly
accurate structures of monomers are available, docking is much
better at predicting protein-protein binding sites [45, 46] than the
exact mutual arrangement and interface contacts. We have observed this
both in previous studies [17] and in CASP14.
CASP14 results showed that our docking workflow still has a lot of room
for improvement. With more time and more computational resources devoted
to every target, some improvements could be made even while staying in
the realm of rigid-body docking and keeping our current, admittedly
imperfect, scoring function: (1) using a more diverse set of input
monomers, and generating even more monomers by modeling domain motions
and by remodeling flexible loops and tails; (2) ensuring that the docking
software always performs a sufficiently exhaustive sampling of
conformations; (3) producing structural variations of each docking
solution using molecular dynamics or other sampling techniques. These
enhancements would allow the conformational space to be explored more
thoroughly, possibly leading to better results [47].
Despite the limitations of our CASP14 modeling protocol, our strong
performance suggests that the prediction of inter-chain contacts using
co-evolution and deep learning methods still has little impact on
modeling of protein-protein interactions. Why is that? Apparently, there
are multiple reasons why inter-chain contact prediction is harder than
intra-chain contact prediction. For example, contact prediction for
heteromeric protein complexes requires generating joint multiple
sequence alignments. The interacting proteins in the alignment are
inferred from genomic distances or phylogeny [48, 49], or are selected
using automated sequence comparison procedures [50]. However,
this significantly reduces the number of sequences in the alignment and
does not guarantee correct pairing of proteins. The alignment joining
problem is not present for homo-multimers, yet in this case the problem
is to distinguish intra-subunit from inter-subunit
contacts [49]. So far, this problem has been addressed by
including the monomer structures in the prediction
pipeline [51].
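One simple way in which a monomer structure can help separate the two kinds of contacts in a homo-multimer is sketched below. The function name, the single distance cutoff of 8 Å, and the use of one representative atom per residue are assumptions for illustration; this is not the procedure of any specific published pipeline. The sketch also ignores residue pairs that are in contact both within and between subunits.

```python
import math


def split_contacts(predicted_contacts, monomer_coords, cutoff=8.0):
    """Split predicted residue-residue contacts using a monomer structure.

    predicted_contacts: iterable of (i, j) residue index pairs.
    monomer_coords: dict mapping a residue index to the (x, y, z) of one
        representative atom (for example C-beta) in the monomer structure.
    A contact already satisfied within the monomer is labeled intra-subunit;
    any other contact becomes a candidate inter-subunit contact.
    """
    intra, inter_candidates = [], []
    for i, j in predicted_contacts:
        distance = math.dist(monomer_coords[i], monomer_coords[j])
        if distance <= cutoff:
            intra.append((i, j))
        else:
            inter_candidates.append((i, j))
    return intra, inter_candidates


# Toy example: residues 1 and 2 are close in the monomer, residue 3 is far away.
coords = {1: (0.0, 0.0, 0.0), 2: (5.0, 0.0, 0.0), 3: (20.0, 0.0, 0.0)}
intra, inter_candidates = split_contacts([(1, 2), (1, 3)], coords)
```

In this toy case the pair (1, 3) cannot be explained by the monomer and is therefore kept as a possible inter-subunit contact.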
In addition to the issues related to obtaining and analyzing the
multiple sequence alignments, training of supervised learning-based
methods for contact prediction using the structures of protein complexes
may be limited by the availability of experimental structural data. The
number of possible protein complexes is believed to exceed the number
of possible protein folds [7, 8], and it is not clear
whether known structures represent a significant part of all interaction
types [8, 52]. Moreover, there are examples of
protein-protein interactions, such as antibody-antigen or host-pathogen
protein interactions, for which principles of co-evolution are hardly
applicable.
To conclude, the progress in monomeric protein structure prediction has
not yet translated into a similar breakthrough in structural modeling of
protein complexes. A number of issues of both technical and fundamental
nature have to be solved to make a leap in producing reliable structural
models of protein interactions, and it will be exciting to see what
developments will occur in this research area in the near future.