Bioimpacts. 11(4):271279. doi: 10.34172/bi.2021.37Original Research
GADPalign: A genetic algorithm and dynamic programmingbased method for structural alignment of proteins
Soraya Mirzaei^{ 1}^{}, Jafar Razmara^{ 1, }^{ *}^{}, Shahriar Lotfi^{ 1}
^{1}Department of Computer Science, Faculty of Mathematics, Statistics, and Computer Science, University of Tabriz, Tabriz, Iran
*Corresponding author: Jafar Razmara, Email: razmara@tabrizu.ac.ir
Abstract
Introduction:
Similarity analysis of protein structure is considered as a fundamental step to give insight into the relationships between proteins. The primary step in structural alignment is looking for the optimal correspondence between residues of two structures to optimize the scoring function. An exhaustive search for finding such a correspondence between two structures is intractable.
Methods:
In this paper, a hybrid method is proposed, namely GADPalign, for pairwise protein structure alignment. The proposed method looks for an optimal alignment using a hybrid method based on a genetic algorithm and an iterative dynamic programming technique. To this end, the method first creates an initial map of correspondence between secondary structure elements (SSEs) of two proteins. Then, a genetic algorithm combined with an iterative dynamic programming algorithm is employed to optimize the alignment.
Results:
The GADPalign algorithm was employed to align 10 ‘difficult to align’ protein pairs in order to evaluate its performance. The experimental study shows that the proposed hybrid method produces highly accurate alignments in comparison with the methods using exactly the dynamic programming technique. Furthermore, the proposed method prevents the local optimal traps caused by the unsuitable initial guess of the corresponding residues.
Conclusion:
The findings of this paper demonstrate that employing the genetic algorithm along with the dynamic programming technique yields highly accurate alignments between a protein pair by exploring the global alignment and avoiding trapping in local alignments.
Keywords: Bioinformatics, Protein structure alignment, Genetic algorithm, Dynamic programming
Copyright
© 2021 The Author(s)
This work is published by BioImpacts as an open access article distributed under the terms of the Creative Commons Attribution License (
http://creativecommons.org/licenses/bync/4.0/). Noncommercial uses of the work are permitted, provided the original work is properly cited.
Introduction
Structural alignment of proteins plays a fundamental role in understanding their functional similarity and evolutionary relationship. Generally, since protein structure is more conserved than protein sequence, structure comparison is more evident to identify high distantly proteins than sequence comparison. As a result, the prediction of the new protein’s function is carried out through the detection of local or global structural similarity between a new protein and a protein with a known function.^{
1
} The methods are used to measure the structural similarity between two proteins and then produce a pairwise structural alignment between them.
Many algorithms have been developed to find the optimal structural alignment. These methods deal with two significant problems: a search algorithm for determining an optimal alignment between two structures and a scoring function to evaluate the created alignment.^{
2
} The structural alignment problem can be formulated as a search algorithm for finding the optimal set of correspondences minimizing the spatial distance between pairs of residues.^{
3,4
} The algorithm does not have any initial knowledge about correspondence between residues. Accordingly, the problem is computationally intractable and known as NPhard.^{
5
} Though many similarity scoring functions have been proposed in polynomial time.^{
6,7
} no procedure (of any running time) has been proposed to optimize structural alignment.^{
8,9
} Taylor and coworkers write: ‘In structure comparison, we do not even have an algorithm that guarantees an optimal answer for pairs of structures’.^{
10
}
Many efforts have been made to explore efficient pairwise structural alignment. In this regard, SSAP^{
11
} employs a double dynamic programming technique based on atomtoatom vectors in the structure space. SPalign^{
12
} is another method for pairwise protein structure alignment based on SPscore as a sizeindependent scoring function to compare protein sequences. Alternatively, several methods use the distance geometry to represent each protein by a pairwise distance matrix between all C_{α} atoms. DALI^{
13
} is a popular method that uses the Monte Carlo algorithm to obtain the optimal superposition of matrices. Other methods use additional heuristics to produce faster or more accurate alignment, such as CE.^{
14
} MMLigne^{
15
} uses statistical inference and CLICK^{
16
} matches the clique of residues to find onetoone equivalence residues. Several strategies have been proposed to optimize the previously obtained equivalence set further. These include the Monte Carlo algorithm or simulated annealing,^{
13
} dynamic programming,^{
8
} the incremental combinatorial extension of the optimal path,^{
14,17
} and genetic algorithm.^{
1820
} Recently, the multicriteria protein structure comparison tool is developed for combining methods and generating consensus structural similarity scores.^{
21
} The tools additionally are used to generate the training set for a templatebased protein structure prediction based on the threading strategy. These methods commonly employ machine learning techniques^{
22,23
} for implementation of their proposed strategy. Recently, some approaches have been proposed to solve the protein classification problem based on twodimensional multiview images of the protein structure and learning methods.^{
32,33
}
Several methods use iterative dynamic programming over an initial map of corresponding residues to find a transformation and choose the final alignment after some heuristic iterations. STRUCTAL^{
24
} uses five sets of equivalent residues, including start, middle, and end of a chain, sequence identity, and similar C_{a} torsion angles, to produce the final alignment based on the highest score. TMalign^{
9
} obtains an initial alignment by aligning the secondary structures, and gapless or gapped matching of two structures. SPalign^{
25
} uses similar heuristics as TMalign and gapless threading, secondary structure fitting, and fragment with 20 size matching are used as initial alignment. TSAMIR^{
26
} uses a text modelingbased technique to match secondary structure elements (SSEs) and then find the starting and ending residues of these elements as the initial alignment. These strategies for optimizing the initial correspondence would be impractical in some conditions like when sequence identity is low or when the SSEs have different sizes. In this paper, we introduce GADPalign as a hybrid algorithm for the optimal alignment of proteins by combining genetic algorithm with iterative dynamic programming technique. The genetic algorithm is a heuristic algorithm that is highly adaptable with dynamic programming. In the proposed method, in addition to the leading operators of genetic algorithm, we add the shift operator for enhancing the genetic algorithm’s global exploring along two protein chains. The main objective of combining these two optimization methods is to explore the global alignment and to avoid trapping in local alignment originated by improper initial alignment.
The paper is organized as follows. Section 2 describes the combination of GA with iterative dynamic programming algorithm in detail. Section 3 shows the evaluated genetic algorithm performance and makes a comparison between TMalign, MMLigne, SPalign, and CLICK methods. Section 4 concludes the proposed algorithm.
Materials and Methods
The GADPalign algorithm
The proposed GADPalign algorithm is introduced in this section. represents the steps of the algorithm in summary. The algorithm first looks for a match between SSEs of two proteins to produce an initial match between their secondary structures. Then, the algorithm employs a procedure as shown in a flowchart to align two structures at the residue level.
Fig. 1.
The GADPalign algorithm in summary.
Matching of the SSE sequences
To find an initial corresponding element between the secondary structures of two proteins, the algorithm encodes each protein secondary structure in an SSE sequence including two letters of the alphabet (H for αhelix and S for βstrand). Then, it employs the NeedlemanWunsch dynamic programming algorithm^{
27
} to create an initial correspondence between two sequences. The algorithm considers a score of +2 for identical SSEs and 1 for nonidentical SSEs as well as a score of 2 for the gapopening penalty. Then, the initial match is submitted to the genetic algorithm to search heuristically for initial corresponding residues within each matched SSEs.
The genetic algorithm
The algorithm starts with a set of chromosomes as the initial map of SSE matching. represents the structure of a sample chromosome. Each chromosome contains a sequence of matched SSE pairs, and each pair is represented by the SSE type (0 for αhelix and 1 for βstand), the number of corresponding residues within the SSE, and the list of their position along with the peptide. Coil and loop structures are highly irregular and are ignored in both protein structures. The population size was set to N=100 by default.
Fig. 2.
The structure of a sample chromosome for aligning 1BGE:B (HHHHH) and 2GMF:A (HHSHHHSH) proteins. The rectangles with diagonal lines represent αhelices and the rest boxes are βstrands. The first SSE of 2GMF:A protein is matched with the second SSE of 1BGE:B using the NeedlemanWunsch algorithm and seven residues are selected randomly as the initial correspondence of these matched SSEs.
Initial population
To produce an initial population, the SSE sequences of two proteins are compared and matched using NeedlemanWunsch dynamic programming, and a set of chromosomes is defined based on pairs of matched SSEs. For each pair of matched SSEs in the individuals, an initial list of corresponding residues is randomly selected, ranging between 20% and 100% of the shorter SSE length.
Fitness function
For evaluating each alignment, the TMScore is used via the formula^{
9
}:
Where L_{Target} is the length of query protein which is the protein with a lower number of SSEs, L_{ali} is the number of aligned residues, d_{i} is the distance between i^{th} pair of aligned residues, and.
$${d}_{0}({L}_{T\mathrm{arg}et})=1.24\sqrt[3]{{L}_{t\mathrm{arg}et}15}1.8$$
. It can be seen that the formula is independent of the protein size.
Selection strategy
The selection process chooses the highfitness chromosome by using the tournament selection. This means that the chromosome with a higher fitness value has a higher probability of being placed in the intermediate population. The tournament selection randomly selects kindividuals through the substitution of the current population. The best chromosome having the highest score is entered into the middle population. Thisprocess is repeated N (=100)times. In the experiments, the value of kwas set to 3.
Crossover operator
The crossover operator is used to combine the genetic information of two individuals. In the proposed method, the crossover is done at two points along the chromosome with P_{c}=0.75 as the crossover probability. These two points are chosen randomly, and then, the first and third segments are swapped, as shown in .
Fig. 3.
The crossover operator. The SSEs alongside diagonal lines are swapped.
Mutation operator
The mutation operator randomly mutates an individual. It allows the algorithm to search within the solution space and converge the population to the global maximum while it prevents the algorithm from falling into a local optimal trap. In the proposed method, the number of aligned residues within a pair of matched SSEs in the individual may be increased or decreased by 1 with P_{m}=0.04 as the mutation probability.
Shift operator
The procedure utilizes the shift operator to find an optimal matching between the SSE sequences of two proteins. The operator generates a new matching between SSEs and tries to converge to the global optimal matching instead of a local matching. For each individual in the population, the SSEs are shifted left or right with the shifting probability of P_{s}=0.45. shows the shift operator for the proteins 3HHR and 1TEN PDBID, whereas the SSEs of 1TEN is shifted right or left along with the 3HHR protein.
Fig. 4.
The shift operator. Dashed boxes represent the matched SSEs (before applying the shift operator). After two shifts to the left, a new SSE matching has been generated as represented in dark gray.
Replacement
The generational replacement strategy was used to replace the entire population of the current generation with the generated population after applying the crossover, mutation, and shift operations. To prevent missing the individual with the highest score, the best chromosome is passed to the next generation as an elitism.
Termination
Since the genetic algorithm is a random search approach, the algorithm does not generate an exact solution. Two different criteria were used for the termination of the algorithm. The first criterion is the unchanged maximum score (elitism) for 30 numbers of generation, and the second criterion is the termination of the algorithm after the production of 100 numbers of generation.
Dynamic programming
In each iteration, the algorithm submits the chromosome to a dynamic programming procedure to compute and apply the Kabsch rotation matrix.^{
28
} As a result, a new alignment is created by running the dynamic programming on the score similarity matrix, which is defined as:
Eq. (2)
$$S(i,j)=\frac{1}{1+{d}_{ij}^{2}/{d}_{0}{({L}_{\mathrm{min}})}^{2}}$$
where d_{ij} is the distance of the i and j denote residues from the query and target proteins and
$${d}_{0}({L}_{\mathrm{min}})=1.24\sqrt[3]{{L}_{\mathrm{min}}15}1.8$$
with L_{min} is the length of the smaller protein. Furthermore, the gap opening penalty of dynamic programming is defined 0.6. Then, the new rotation matrix and score matrix are computed based on newly aligned residues. This procedure is repeated for n(=5) times to obtain the alignment with the highest score. Algorithm 1 shows the pseudocode implemented for the above iterative dynamic programming algorithm.
Algorithm 1. Iterative dynamic programming
Input
: a chromosome
Output
: the alignment with the highest score
1: AR ←aligned residues in chromosome k
2: for i=1 to n do
3: Compute Kabsch rotation matrix based on AR
4: Rotate the target protein using the rotation matrix
5: Compute the score similarity matrix using formula 2
6: Run dynamic programming on the score similarity matrix
7: AR←new aligned residue pairs
8: end for
Results
The proposed method was implemented using the C++ programming language within visual studio 2013 on a personal computer having 2.60 GHz Core i5 and 6 GB RAM. The genetic algorithm was evaluated forits convergence and stability. The performance of the proposed method was examined on different datasets and compared with similar stateoftheart methods, including TMalign, MMLigne, SPalign, and CLICK. In this section, the results have been presented and discussed.
Convergence analysis
The implemented genetic algorithm has been investigated for its convergence to an optimal or nearoptimal solution. represents the convergence graph drawn for three different protein pairs. In each graph, the highest fitness function values are shown by red curves, while the blue curves represent the average fitness values of the generations. As it can be seen from the figure, the fitness curves are swinging during iterations indicating that different initial correspondence causes different values of the scoring function.
Fig. 5.
The convergence of the algorithm to an optimal solution for (A) 1A8Y & 1A81, (B) 2RHE & 3HLA:B, and (C) 2GMF:A & 1BGE:B. The blue swinging curve shows the average fitness of generations, and the red steps curve represents the best fitness value in the generation.
Stability analysis
Regarding that genetic algorithms are stochastic search methods, it is necessary to run the algorithm multiple times to examine its stability. A standard deviation of less than average concludes the stability of the algorithm. shows the stability diagram for three different protein pairs. The TMscore for alignment of 2GMF:A & 1BGE:B is the same for all 20 runs. However, for 1TEN & 3HHR:B and 1A8Y & 1A81 pairs of proteins, the standard deviation differs for different runs. Additionally, for analyzing the stability and effectiveness of randomized algorithms like genetic algorithm, it is essential to use statistical tests such as t test and the Wilcoxon signedrank test, which are parametric and nonparametric methods, respectively. If the data is approximately normally distributed, the t test would be more reliable to be used. However, the Wilcoxon signedrank test does not assume the distribution of the data. We use ShapiroWilk and KolmogorovSmirnov test to assess whether the data is normal or not. Table 1 shows the ShapiroWilk and KolmogorovSmirnov normality tests obtained by the proposed method. From the results, both tests have a pvalue lower than 0.05, which indicates data are not normally distributed. Thus, it is recommended to use a nonparametric statistical test method. The Wilcoxon signedrank test is used to compare the average of two related samples and assess their difference. A null hypothesis H_{o} is typically defined to state that there is no median difference between pairs of samples and H_{1} is considered as an alternative hypothesis whether the median difference is not equal to zero. The P value of the statistical test denotes whether we accept H_{o} or not. The significant level of α=0.05 is considered for rejecting H_{o}. The last column in Table 1 shows the results of the Wilcoxon signedrank test. The asymp. sig. (2tailed) the value represents the P value of the test. The results indicate that the P value is better than the significance level α (0.05 by default). This observation reveals that there is no statistically significant difference between the first and second 15 runs of the genetic algorithms. As a result, the stability of the genetic algorithm is proven.
Fig. 6.
GADPalign stability diagram of (A) 1TEN & 3HHR:B, (B) 1A8Y & 1A81, and (C) 2GMF:A &1BGE:B
Table 1.
The results of the normality test of four instances in 30 runs of the genetic algorithm
Query Protein

Target Protein

KolmogorovSmirnov

ShapiroWilk

Wilcoxon signedrank

Statistic

df*

Sig.

Statistic

df

Sig.

Z

Asymp. Sig (2 tailed)

3HHR  1TEN  0.389  30  0.000  0.624  30  0.000  0.707  0.480 
1A02  1A8Y  0.211  30  0.002  0.859  30  0.001  0.712  0.477 
1ACZ  1A81  0.194  30  0.005  0.811  30  0.000  1.068  0.286 
1A8Y  1A81  0.249  30  0.000  0.812  30  0.000  1.120  0.263 
*df: The degree of freedom.
Experimental result
To evaluate the performance of the GADPalign algorithm, 10 ‘difficult to align’ protein pairs were used.^{
29
} The results were compared with those of the TMalign algorithm as a stateoftheart alignment method. In general, the quality of alignment depends on the contradictory necessities of earning a lower RMSD and a higher length of the alignment. TMscore is a reasonable particular score to evaluate the alignment quality by making a balance between the alignment length and its accuracy. As shown in Table 2, GADPalign yields a higher TMscore than TMalign in four cases among ten protein pairs. TMalign could not align significantly with two protein pairs, including 1TEN & 3HHR:B and 2RHE & 3HLA:B whereas the TMscore is less than the threshold of 0.5. This is while GADPalign produces a more significant alignment in terms of TMscore for these two protein pairs. Besides, the results of GADPalign were compared with those of TMalign, MMLigne, SPalign, and CLICK as four online structure alignment tools. Table 3 shows the results in terms of root mean square deviation (RMSD) and length of alignment (N_{ali}). As shown in the table, the proposed method produces a higher length of alignment in all 10 cases than MMLigne, SPalign, and CLICK. The method also produces alignments better than or equal to those of TMalign for nine protein pairs in terms of length of the alignment.
Table 2.
Comparison of structure alignments for 10 ‘difficult’ structures obtained by TMalign and GADPalign methods
Query Protein

Target Protein

TMalign

GADPalign

TM1

TM2

TM1

TM2

1UBQ  1FXI:A  0.56841  0.4793 
0.5864

0.49831

1TEN  3HHR:B  0.2685  0.16816 
0.81947

0.40575

2RHE  3HLA:B  0.46138  0.23485 
0.53542

0.48751

1PAZ  2AZA:A  0.55969  0.52731 
0.55984

0.52735

1MOL:A  1CEW:I  0.68484  0.61127  0.68484  0.61127 
2RHE  1CID  0.67452  0.46537  0.67452  0.46537 
1EDE  1CRL  0.58509  0.36486  0.58509  0.36486 
1NSB:A  2SIM  0.66308  0.67683  0.66247  0.67621 
2GMF:A  1BGE:B  0.59183  0.48049 
0.62034

0.50198

4FGF  1TIE  0.72732  0.56967  0.72732  0.56967 
TM1: TMScore which is normalized by the length of Chain 1.
TM2: TMScore which is normalized by the length of Chain 2.
Table 3.
Comparison of structure alignments for 10 ‘difficult’ structures obtained by TMalign, MMLigner, SPalign, and CLICK methods
Query Protein

Target Protein

TMalign
^{
9
}

MMLigner
^{
15
}

SPalign
^{
25
}

CLICK
^{
16
}

GADPalign

RMSD

N
_{ali}

RMSD

N
_{ali}

RMSD

N
_{ali}

RMSD

N
_{ali}

RMSD

N
_{ali}

1UBQ  1FXI:A  2.63  63  2.80077  57  2.48  62  1.91  58 
2.63

63

1TEN  3HHR:B  5.14  51  1.65336  84 
1.74

86
 1.47  82 
1.74

86

2RHE  3HLA:B  3.39  79  3.13553  79  2.73  77  2.22  66 
3.68

86

1PAZ  2AZA:A  2.82  86  2.12664  77  2.39  83  1.92  79 
3.00

87

1MOL:A  1CEW:I  2.25  82  1.91291  77  2.12  81  1.57  72 
2.25

82

2RHE  1CID  2.90  100  2.50037  90  2.56  97  1.71  81 
2.90

100

1EDE  1CRL  4.32  235  3.05788  161  2.86  202  2.32  213 
4.32

235

1NSB:A  2SIM 
3.82

312
 2.80825  273  2.86  286  2.25  242  3.79  311 
2GMF:A  1BGE:B  3.44  103  3.01005  96  2.70  94  1.89  76 
3.80

110

4FGF  1TIE 
2.82

117
 2.73193  111  2.75  115  2.03  95 
2.82

117

Regarding the low differences between the numbers of SSEs of protein pairs in the Fischer set, GADPalign has not remarkably produced different TMscore values except for two cases. Therefore, another experiment was organized using a set of 200 nonhomologous protein chains, which were collected by Zhang and Skolnick.^{
9
} Ten pairs of proteins were randomly chosen from the set and GADPalign as well as the above four alignment tools were employed for their alignment. The alignment results are represented in Table 4, including RMSD and the length of the alignment. Besides, the value of TMScore was calculated for both GADPalign and TMalign methods. The results for GADPalign in Table 4 are the average values calculated based on ten runs of the developed genetic algorithm. As shown in Table 4, the alignment quality in terms of TMscore by GADPalign is higher than that of TMalign for three protein pairs, while both methods obtained a similar TMscore for four cases. In the case of nonhomologous protein pair 1ACZ & 1A02N, the TMScore value obtained by GADPalign (=0.27) is considerably lower than that of TMalign (=0.49 that is very near to similarity threshold of 0.5) indicating that GADPalign precisely identifies nonhomologous proteins by using an appropriate initial correspondence map between a protein pair. From Table 4, it can be seen that the alignment quality in terms of RMSD and length of alignment by GADPalign is higher than MMLigne, SPalign, and CLICK methods for the nonhomologous protein pairs.
Table 4.
Comparison of structure alignments of selecting randomly five proteins of a set of 200 nonhomologous protein chains
Query Protein

Target Protein

MMLigner
^{
15
}

SPalign
^{
25
}

CLICK
^{
16
}

TMalign
^{
9
}

GADPalign

RMSD

N
_{ali}

RMSD

N
_{ali}

RMSD

N
_{ali}

RMSD

N
_{ali}

TMscore 1

RMSD

N
_{ali}

TMscore 1

TMscore 2

TMscore 2

1A02:N  1AOE:A  2.05  22  4.26  67  2.25  102  5.75  108 
0.23 0.30
 5.54  111 
0.26
0.33

1A81  1A8Y 



 4.47  99  2.11  88  6.43  119 
0.25 0.20
 6.59  139 
0.29
0.24

1A02:N  1A81 



 4.71  90  2.38  137  6.20  124 
0.26
0.27
 6.22  125 
0.26
0.28

1ACZ  1A02:N  3.52  79  3.01  76  2.39  122  3.29  79 
0.49
0.23
 5.49  114 
0.27 0.16

1A81  1ACZ 



 3.51  48  2.29  122  4.55  96 
0.18
0.33
 6.54  120 
0.14 0.25

1ACZ  1A8Y  4.39  49  3.72  49  2.49  122  4.96  68 
0.32
0.14
 6.58  124 
0.21 0.15

1ACZ  1AOE:A  1.73  23  3.69  52  2.43  110  5.42  67 
0.28 0.19
 6.23  120 
0.32
0.15

1AOEA  1A81 

   3.79  71  2.41  84  5.33  105 
0.32
0.26
 5.53  106 
0.31
0.25

1A8Y  1A02:N 



 4.12  92  2.34  124  6.06  123 
0.23
0.26
 6.52  131 
0.23
0.26

1AOE:A  1A8Y  3.80  56  3.51  66  2.22  75  6.09  118 
0.32
0.21
 6.56  118 
0.32
0.21

Nali: number of aligned residues, RMSD: rootmeansquare deviation
TMscore 1:TMScore which is normalized by length of Chain 1
TMscore 2: TMScore which is normalized by length of Chain 2
Discussion
According to the previous comparative analysis,^{
30,31
} conventional methods are not powerful enough for protein structure alignment. Despite the proposition of different methods based on effective biological insights, they mostly employ an unsuitable scoring scheme to assess the quality of alignment. The scoring schemes are commonly work based on two contradictory criteria including RMSD and the length of the alignment. TMscore makes a balance between these two scores in order to provide a reasonable measure for assessing the quality of alignment.
The analysis of the experimental results conducted in this study indicates the effectiveness of the GADPalign method for protein structure alignment. The overall results show the method aligns difficult to align pairs of proteins with a quality better than or equal to other stateoftheart tools. In addition, examining the methods on a set of 200 nonhomologous protein chains demonstrates the high applicability of the method in comparison with other similar tools.
GADPalign combines the advantages of the global exploring ability of the genetic algorithms and fast convergence mechanism of dynamic programming. This combination makes the method free from the typical requirement of the usersupplied initial guess to achieve the optimal alignment. In this way, the method automatically generates a set of initial residueresidue equivalences using a genetic algorithm, and then, searches between sets of residueresidue correspondence maximizing the scoring function. Furthermore, the method iteratively looks for alternative SSE matching instead of relying on the SSE matching initially produced by the NeedlemanWunsch algorithm. GADPalign looks for the optimal alignment of residues within the matched SSEs through a procedure of randomly choosing the aligned residues. The results in Table 4 depict that the methods, which exclusively use the iterative dynamic programming algorithm with an arbitrary initial alignment, converge to the nearest local minimum RMSD.
Conclusion
The developed genetic algorithm utilizes the novel shift operator, particularly in structures with high differences in size to avoid trapping of the search in a local optimal score. Besides, the results demonstrate that a relevant initial guess of corresponding residues is essential to obtain alignment with a high score. Since the protein structure alignment is a discrete optimization problem, other efficient evolutionary algorithms which are suitable for discrete optimization can be employed instead of genetic algorithms.
Funding sources
None.
Ethical statement
The authors declare no ethical issue to be considered.
Competing interests
The authors declare no conflicts of interest.
Authors’ contribution
SM, JR, SL: conceptualization, writing and reviewing; SM, JR; data handling; SM, JR, SL: experiments design; SM, JR: data analysis; SM, JR, SL: provision of study materials and equipment; SM, JR, SL: study validation; SM: data presentation and draft preparation; JR: supervision and project administration; SL: study consultation.
Research Highlights
What is the current knowledge?
simple
√ Threedimensional structure of query and target proteins
√ Optimization by dynamic programming approach
√ Optimization by genetic algorithms
What is new here?
simple
√ Combining genetic algorithm and dynamic programming helps to explore the global alignment and avoid trapping in local alignments
√ The shift operator helps to find an optimal matching between the secondary structure elements.
References
 Roy A, Yang J, Zhang Y. COFACTOR: an accurate comparative algorithm for structurebased protein function annotation. Nucleic Acids Res 2012; 40:W4717. doi: 10.1093/nar/gks372 [Crossref]
 Joung I, Kim JY, Joo K, Lee J. Nonsequential protein structure alignment by conformational space annealing and local refinement. PLoS One 2019; 14:e0210177. doi: 10.1371/journal.pone.0210177 [Crossref]
 Razmara J, Deris SB. A novel text modelling approach for structural comparison and alignment of biomolecules. WSEAS Transactions on Computers 2010; 9:67585.
 Aslam N, Nadeem A, Babar ME, Pervez MT, Aslam M, Naveed N. The accuracy of protein structure alignment servers. Electronic Journal of Biotechnology 2016; 20:913. doi: 10.1016/j.ejbt.2016.01.005 [Crossref]
 Godzik A. The structural alignment between two proteins: is there a unique answer?. PROTEIN SCI 1996; 5:132538. doi: 10.1002/pro.5560050711 [Crossref]
 Kolodny R, Linial N. Approximate protein structural alignment in polynomial time. Proc Natl Acad Sci U S A 2004; 101:122016. doi: 10.1073/pnas.0404383101 [Crossref]
 Xu J, Jiao F, Berger B. A parameterized algorithm for protein structure alignment. J Comput Biol 2007; 14:56477. doi: 10.1089/cmb.2007.R003 [Crossref]
 Poleksic A. Algorithms for optimal protein structure alignment. Bioinformatics 2009; 25:27516. doi: 10.1093/bioinformatics/btp530 [Crossref]
 Zhang Y, Skolnick J. TMalign: a protein structure alignment algorithm based on the TMscore. Nucleic Acids Res 2005; 33:23029. doi: 10.1093/nar/gki524 [Crossref]
 Eidhammer I, Jonassen I, Taylor WR. Structure comparison and structure patterns. J Comput Biol 2000; 7:685716. doi: 10.1089/106652701446152 [Crossref]
 Taylor WR, Flores TP, Orengo CA. Multiple protein structure alignment. Protein Science 1994; 3:185870. doi: 10.1002/pro.5560031025 [Crossref]
 Yang Y, Zhan J, Zhao H, Zhou Y. A new sizeindependent score for pairwise protein structure alignment and its application to structure classification and nucleicacid binding prediction. Proteins 2012; 80:20808. doi: 10.1002/prot.24100 [Crossref]
 Holm L, Sander C. Protein structure comparison by alignment of distance matrices. J Mol Biol 1993; 233:12338. doi: 10.1006/jmbi.1993.1489 [Crossref]
 Shindyalov IN, Bourne PE. Protein structure alignment by incremental combinatorial extension (CE) of the optimal path. Protein Eng 1998; 11:73947. doi: 10.1093/protein/11.9.739 [Crossref]
 Collier JH, Allison L, Lesk AM, Stuckey PJ, Garcia de la Banda M, Konagurthu AS. Statistical inference of protein structural alignments using information and compression. Bioinformatics 2017; 33:100513. doi: 10.1093/bioinformatics/btw757 [Crossref]
 Nguyen MN, Tan KP, Madhusudhan MS. CLICKtopologyindependent comparison of biomolecular 3D structures. Nucleic Acids Res 2011; 39:W248. doi: 10.1093/nar/gkr393 [Crossref]
 Razmara J, Deris S, Parvizpour S. A rapid protein structure alignment algorithm based on a text modeling technique. Bioinformation 2011; 6:344. doi: 10.6026/97320630006344 [Crossref]
 Kolbeck B, May P, SchmidtGoenner T, Steinke T, Knapp EW. Connectivity independent proteinstructure alignment: a hierarchical approach. BMC Bioinformatics 2006; 7:510. doi: 10.1186/147121057510 [Crossref]
 Guerler A, Knapp EW. Novel protein folds and their nonsequential structural analogs. Protein Sci 2008; 17:137482. doi: 10.1110/ps.035469.108 [Crossref]
 Szustakowski JD, Weng Z. Protein structure alignment using a genetic algorithm. Proteins 2000; 38:42840. doi: 10.1002/(sici)10970134 [Crossref]
 Sharma A, Manolakos ES. Multicriteria protein structure comparison and structural similarities analysis using pyMCPSC. PloS one 2018; 13:e0204587. doi: 10.1371/journal.pone.0204587 [Crossref]
 Ma J, Peng J, Wang S, Xu J. A conditional neural fields model for protein threading. Bioinformatics 2012; 28:i5966. doi: 10.1093/bioinformatics/bts213 [Crossref]
 Ma J, Wang S, Zhao F, Xu J. Protein threading using contextspecific alignment potential. Bioinformatics 2013; 29:i257i65. doi: 10.1093/bioinformatics/btt210 [Crossref]
 Subbiah S, Laurents D, Levitt M. Structural similarity of DNAbinding domains of bacteriophage repressors and the globin core. CURR BIOL 1993; 3:1418. doi: 10.1016/09609822(93)90255m [Crossref]
 Yang Y, Zhan J, Zhao H, Zhou Y. A new size‐independent score for pairwise protein structure alignment and its application to structure classification and nucleic‐acid binding prediction. Proteins 2012; 80:20808. doi: 10.1002/prot.24100 [Crossref]
 Razmara J, Deris S, Parvizpour S. TSAMIR: a topology string alignment method for intensive rapid protein structure comparison. Algorithm Mol Biol 2012; 7:4. doi: 10.1186/1748718874 [Crossref]
 Needleman SB, Wunsch CD. A general method applicable to the search for similarities in the amino acid sequence of two proteins. J Mol Biol 1970; 48:44353. doi: 10.1016/00222836(70)900574 [Crossref]
 Kabsch W. A discussion of the solution for the best rotation to relate two sets of vectors. Acta Cryst 1978; 34:8278. doi: 10.1107/S0567739478001680 [Crossref]
 Fischer D, Elofsson A, Rice D, Eisenberg D. Assessing the performance of fold recognition methods by means of a comprehensive benchmark. Pac Symp Biocomput 1996:30018.
 Ma J, Wang S. Algorithms, applications, and challenges of protein structure alignment. Adv Protein Chem Struct Biol 2014; 94:12175. doi: 10.1016/B9780128001684.000056 [Crossref]
 Slater AW, Castellanos JI, Sippl MJ, Melo F. Towards the development of standardized methods for comparison, ranking and evaluation of structure alignments. Bioinformation 2013; 29:4753. doi: 10.1093/bioinformatics/bts600 [Crossref]
 Suryanto CH, Saigo H, Fukui K. Structural class classification of 3d protein structure based on multiview 2d images. IEEE/ACM transactions on computational biology and bioinformatics 2016; 15:28699.
 Nanni L, Lumini A, Pasquali F, Brahnam S. iProStruct2D: Identifying protein structural classes by deep learning via 2D representations. Expert Systems with Applications 2020; 142:113019.