Discussion

During CASP14 our group used well-established methodologies for structural modeling of protein complexes: template-based modeling and rigid-body docking. We used neither deep learning-based prediction of inter-chain contacts nor refinement by extensive molecular dynamics simulations. Our main aim was accurate prediction of protein-protein interfaces, even if this meant lower global accuracy of the models. As the interface accuracy of our models was the best among CASP14 groups predicting protein assemblies (Fig. 2, Fig. 3A,B), it appears that this aim was largely achieved. The main reasons for the successful modeling were probably (1) effective identification of multimeric templates by sequence- and structure-based methods, (2) a model selection procedure involving improved VoroMQA scoring that places more emphasis on the interaction interface, and (3) short molecular dynamics simulations aimed at removing unrealistic geometry and clashes in docking models.
Unlike the interface accuracy, the global accuracy of our models was not the highest (Fig. 2, Fig. 3C,D). This is particularly evident from the close-to-average values of lDDT, a score that considers all atoms. Most of the lower scores came from template-based models generated using MODELLER/AltMod. When we used CASP server models for docking, lDDT scores were typically higher. This suggests that, had we used more advanced modeling techniques [43], our template-based models might have reached higher global accuracy.
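To make concrete what an lDDT-type score rewards, the following is a minimal sketch rather than the assessors' implementation. It assumes one representative point per residue (for example the C-alpha atom), whereas the actual lDDT considers all atoms, excludes pairs within the same residue and applies stereochemical checks. The function name `simplified_lddt` and its parameters are hypothetical; the 15 Å inclusion radius and the 0.5, 1, 2 and 4 Å tolerances follow the commonly used lDDT defaults.

```python
import math


def simplified_lddt(ref_coords, model_coords, inclusion_radius=15.0,
                    thresholds=(0.5, 1.0, 2.0, 4.0)):
    """Fraction of reference distances preserved in the model.

    Simplification (assumption): one point per residue instead of all atoms.
    ref_coords and model_coords are equally long lists of (x, y, z) tuples,
    ordered so that index i refers to the same residue in both structures.
    """
    if len(ref_coords) != len(model_coords):
        raise ValueError("reference and model must have the same length")

    preserved = 0
    total = 0
    n = len(ref_coords)
    for i in range(n):
        for j in range(i + 1, n):
            d_ref = math.dist(ref_coords[i], ref_coords[j])
            # Only distances that are short in the reference are scored.
            if d_ref >= inclusion_radius:
                continue
            d_model = math.dist(model_coords[i], model_coords[j])
            diff = abs(d_ref - d_model)
            # Each tolerance threshold contributes one check per distance.
            for t in thresholds:
                total += 1
                if diff < t:
                    preserved += 1
    return preserved / total if total else 0.0
```

Because the score is built from local distances only, it needs no superposition, and a model with a well-placed interface but imprecise regions elsewhere can still receive an average global value.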
Template-based modeling remains the most accurate method for predicting the structures of protein complexes, but the limiting factor for this approach is the detection of structural templates. Typically, templates are identified by sequence-based search methods such as BLAST, PSI-BLAST [44] or HHpred [20]. In CASP14, aiming to expand the set of available templates, we additionally employed structure-based searches. The efficiency of the structure-based approach has been greatly increased by recent advances in protein structure prediction [3, 4]. The availability of more accurate models for monomers may be the reason why our structure-based template searches successfully complemented sequence-based searches in CASP14, but less so in CASP13 [17]. Structure-based template identification for protein complexes may play an even more prominent role in the future.
Template-based modeling of protein complexes is a harder problem than homology modeling of individual proteins. Unlike the modeling of monomeric proteins, the modeling of complexes has to deal with additional complications such as alternative interaction interfaces and differences in the stoichiometry of homologous protein complexes [15–17, 36]. In CASP14, the modeling of evolutionarily non-conserved antibody-antigen interactions was yet another example of such a complication. In other cases, such as host-pathogen interactions that do not always emerge from long coexistence of species, it might also be hard to apply either template-based or co-evolution-based modeling methods.
When there are no templates and other constraints are lacking, free docking is the only feasible approach to predicting the structures of protein complexes. Our CASP14 results support previous observations that docking may be successful only when the subunits are sufficiently accurate [11, 17]. Thus, the recent breakthrough in protein structure prediction might help not only to detect templates for multimeric structures through structure-based searches, but also to expand the applicability of protein docking. However, even if fairly accurate structures of monomers are available, docking is much better at predicting protein-protein binding sites [45, 46] than the exact mutual arrangement and interface contacts. We have observed this both in previous studies [17] and in CASP14.
The CASP14 results showed that our docking workflow still has a lot of room for improvement. With more time and more computational resources devoted to each target, some improvements could be made even while staying in the realm of rigid-body docking and keeping our current, admittedly imperfect scoring function: (1) using a more diverse set of input monomers, and generating even more monomers by modeling domain motions and by remodeling flexible loops and tails; (2) ensuring that the docking software always performs a sufficiently exhaustive sampling of conformations; (3) producing structural variations of each docking solution using molecular dynamics or other sampling techniques. These enhancements would allow the conformational space to be explored more thoroughly, possibly leading to better results [47].
Despite the limitations of our CASP14 modeling protocol, our strong performance suggests that the prediction of inter-chain contacts using co-evolution and deep learning methods still has little impact on the modeling of protein-protein interactions. Why is that? Apparently, there are multiple reasons why inter-chain contact prediction is harder than intra-chain contact prediction. For example, contact prediction for heteromeric protein complexes requires generating joined multiple sequence alignments. The interacting protein pairs in the alignment are inferred from genomic distances or phylogeny [48, 49], or selected using automated sequence comparison procedures [50]. However, this significantly reduces the number of sequences in the alignment and does not guarantee correct pairing of proteins. The alignment joining problem is not present for homo-multimers, yet in this case the difficulty is to distinguish intra-subunit from inter-subunit contacts [49]. So far this has been addressed by including the monomer structures in the prediction pipeline [51].
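The loss of alignment depth caused by joining can be illustrated with a minimal sketch. It assumes a deliberately naive pairing rule that is not taken from the cited methods: two sequences are paired only if they come from the same species and that species contributes exactly one sequence to each alignment. The function name `join_alignments` and the input layout, a list of (species, aligned sequence) tuples, are hypothetical.

```python
from collections import defaultdict


def join_alignments(msa_a, msa_b):
    """Concatenate rows of two alignments that can be paired unambiguously.

    msa_a and msa_b are lists of (species, aligned_sequence) tuples.
    Assumed pairing rule: same species, exactly one sequence per species
    in each alignment. Ambiguous or unmatched species are dropped.
    """
    by_species_a = defaultdict(list)
    by_species_b = defaultdict(list)
    for species, seq in msa_a:
        by_species_a[species].append(seq)
    for species, seq in msa_b:
        by_species_b[species].append(seq)

    joined = []
    for species, seqs_a in by_species_a.items():
        seqs_b = by_species_b.get(species, [])
        # Paralogs (more than one hit per species) make the pairing ambiguous.
        if len(seqs_a) == 1 and len(seqs_b) == 1:
            joined.append((species, seqs_a[0] + seqs_b[0]))
    return joined


# Toy input: four sequences for subunit A, three for subunit B.
msa_a = [("E. coli", "MKV-"), ("B. subtilis", "MRV-"),
         ("B. subtilis", "MKI-"), ("H. sapiens", "MKVA")]
msa_b = [("E. coli", "GT-L"), ("B. subtilis", "GTAL"),
         ("M. musculus", "GSAL")]
paired = join_alignments(msa_a, msa_b)
```

In this toy case only a single row survives the pairing, and even that row is not guaranteed to represent a true interacting pair, which mirrors the two problems noted above.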
In addition to the issues related to obtaining and analyzing multiple sequence alignments, the training of supervised learning-based methods for contact prediction on the structures of protein complexes may be limited by the availability of experimental structural data. The number of possible protein complexes is believed to exceed the number of possible protein folds [7, 8], and it is not clear whether known structures represent a significant part of all interaction types [8, 52]. Moreover, there are protein-protein interactions, such as antibody-antigen or host-pathogen interactions, to which the principles of co-evolution are hardly applicable.
To conclude, the progress in monomeric protein structure prediction has not yet translated into a similar breakthrough in structural modeling of protein complexes. A number of issues of both technical and fundamental nature have to be solved to make a leap in producing reliable structural models of protein interactions, and it will be exciting to see what developments occur in this research area in the near future.