INTRODUCTION
The AlphaFold2 algorithm, developed by DeepMind, demonstrated
remarkable performance in protein structure prediction in the 14th
Critical Assessment of Structure Prediction (CASP14) [1-3]. This
was followed by the development of the AlphaFold-Multimer
algorithm [4], which can predict multimeric structures
with high accuracy. (Hereafter, I call both AlphaFold2 and
AlphaFold-Multimer AF2 unless there is a specific need to differentiate
them.) Other protein structure prediction programs have emerged
following the success of AF2 [5-7]. However, AF2 has
demonstrated performance comparable to, or even better than, that of the
newer programs. Hence, optimizing AF2 is considered one of the most promising
strategies for achieving the highest accuracy in protein structure
prediction tasks.
Therefore, the challenges in CASP15 were twofold: (1) collecting a
sufficient number of evolutionarily related sequences for input into
AF2, and (2) improving the structures generated by AF2.
Protein structure prediction tools are known to exhibit poor performance
when there is a limited number of evolutionarily related sequences.
Although AF2 is less sensitive to this problem, it remains a
concern [1,8]. As a result, the collection of
evolutionarily related sequences is a crucial step in the process.
Utilizing large metagenomic databases is a prominent strategy for
addressing this challenge [9]. Therefore, in addition to
the databases employed in the official AF2 pipeline, I used
PZLAST [10,11] to collect more metagenomic sequences.
Furthermore, an in-house database was constructed using NCBI
assembly [12] data to obtain sequences with taxonomic information,
because such information was considered necessary for predicting
multimeric structures [4,13]. The nr database [14], a
widely used extensive collection of sequences, was also included and
searched using a customized version of PSI-BLAST [15,16].
To accomplish the second objective, a deep learning model was
constructed to improve the accuracy of the predicted structures.
Additionally, it was assumed that AF2 (and other structure-prediction
software using multiple sequence alignments [MSAs]) requires MSAs
for high-quality prediction but can, at the same time, be misled by
them. For example, antibody complementarity-determining
regions are sequence-specific; therefore, the amino acids aligned to
these regions in the MSA should not be considered. The details of this
model are described in a separate paper [17]. Although the model
was primarily designed to refine multimeric structures, it was
also expected to be useful for refining monomeric structures because
the underlying principles should be similar.
For the CASP15 project, I devised a semi-automatic pipeline that had to
address several issues. For example, AF2 could handle only up to
approximately 2200 amino acids (aa) in my environment. Therefore,
longer sequences were cut into smaller pieces for prediction. In
addition, when conserved domains had many hits, the number of hits
covering other regions was relatively small. In such cases, sequences
were sampled to flatten the MSA depth. Furthermore, many target-specific
interventions were needed because the targets were diverse, including
mutated proteins and targets that required the prediction of ensemble
structures.
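The two workarounds mentioned above (cutting long sequences and
flattening the MSA depth) can be illustrated with a minimal sketch. It
is not the code used in the pipeline: the function names, the overlap
between windows, the representation of each hit as a (start, end)
interval of query columns, and the greedy "keep a hit only if it covers
an under-sampled column" rule are all assumptions made for
illustration.

```python
import random
from typing import List, Tuple


def split_into_windows(seq: str, max_len: int = 2200,
                       overlap: int = 200) -> List[Tuple[int, str]]:
    """Cut `seq` into windows of at most `max_len` residues.

    Returns (start_index, subsequence) pairs. Consecutive windows share
    `overlap` residues; the overlap is an assumption, not something the
    text specifies.
    """
    if overlap >= max_len:
        raise ValueError("overlap must be smaller than max_len")
    if len(seq) <= max_len:
        return [(0, seq)]
    step = max_len - overlap
    windows = []
    start = 0
    while True:
        end = min(start + max_len, len(seq))
        windows.append((start, seq[start:end]))
        if end == len(seq):
            break
        start += step
    return windows


def flatten_msa_depth(hits: List[Tuple[int, int]], query_len: int,
                      target_depth: int, seed: int = 0) -> List[int]:
    """Subsample hits so that heavily covered regions are thinned.

    `hits` holds half-open (start, end) intervals of query columns
    covered by each aligned sequence (hypothetical representation).
    Hits are visited in random order; a hit is kept only if at least
    one column it covers is still below `target_depth`.
    Returns the sorted indices of the kept hits.
    """
    rng = random.Random(seed)
    order = list(range(len(hits)))
    rng.shuffle(order)
    depth = [0] * query_len
    kept = []
    for i in order:
        start, end = hits[i]
        if any(depth[c] < target_depth for c in range(start, end)):
            kept.append(i)
            for c in range(start, end):
                depth[c] += 1
    return sorted(kept)
```

With this rule, a conserved domain covered by hundreds of hits is
reduced to roughly the target depth, whereas hits covering sparsely
sampled regions are all retained, so the depth profile along the query
becomes flatter.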
As a result, my team ranked third by GDT-TS and first by the
Assessor’s formulae in the single-domain category, and tenth in
the multimer category, which shows that my approach can achieve
state-of-the-art performance. However, several problems resulted in
poor predictions, as described in this manuscript.