Protein phosphorylation is an important reversible post-translational modification that affects many essential cellular processes. Because of its importance in cellular control, many schemes and models have been proposed to predict kinase-specific phosphorylation sites. Most methods are based on consensus sequences built from position probabilities, as was our previous version, KinasePhos 1.0, a web server that is also consensus-based. The known phosphorylation sites from public domain data sources are categorized by their annotated protein kinases. In the previous version, features were based on profile hidden Markov models, and computational models were learned from the kinase-specific groups of phosphorylation sites. After the learned models were evaluated, the model with the highest accuracy in each kinase-specific group was selected for use in a web-based tool for identifying protein phosphorylation sites. That tool predicts kinase-specific phosphorylation sites with both high sensitivity and high specificity. The current release, KinasePhos 2.0, adopts sequence-based amino acid coupling-pattern analysis and solvent accessibility as new features for a support vector machine (SVM) to characterize phosphorylation sites. The coupling pattern [XdZ] denotes amino acid types X and Z that are separated by d amino acids. We use the coupling strength $C_{XdZ}$ defined by coupling-pattern analysis and compute its difference between the positive and negative sets of phosphorylation proteins. The 250 patterns with the largest differences in $C_{XdZ}$ are selected as features; SVM models are then built on these features and assessed by cross-validation. The resulting models reach a prediction accuracy of about 95%, roughly 7% higher than the previous version. Compared with other tools, the features chosen for SVM model building give the best prediction performance so far.



Let [XdZ] denote the amino acid coupling pattern of amino acid types X and Z that are separated by d amino acids. Since a protein sequence is directional, the sign of d is determined by the relative positions of X and Z: d is positive if X is closer to the N-terminal side and negative if X is closer to the C-terminal side. Let N(XdZ) be the number of occurrences of the pattern [XdZ]. We define the conditional probability P(XdZ) as:

$$P(XdZ) = \frac{N(XdZ)}{\sum_{Z'} N(XdZ')}$$

where the sum runs over all amino acid types Z'. The coupling strength between X and Z of the pattern is given by

$$C_{XdZ} = \log \frac{P(XdZ)}{P(Z)}$$

where P(Z) is the probability of the occurrence of amino acid Z. If $C_{XdZ} > 0$, then X and Z are positively correlated with respect to the distance d, and if $C_{XdZ} < 0$, they are negatively correlated.
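
As an illustration only, the coupling-strength calculation can be sketched in Python. The sketch is not the KinasePhos implementation and rests on several assumptions: the conditional probability is taken as N(XdZ) normalized over all Z for a fixed X and d, the coupling strength is the logarithm of that probability divided by P(Z), the logarithm is base 2, P(Z) is estimated from the same sequences, and "separated by d" means Z sits at offset d from X. The names `coupling_strengths`, `top_k_differences`, and `max_d` are hypothetical. The second function ranks patterns by the absolute difference in coupling strength between a positive and a negative set, with unseen patterns treated as 0, which is one possible reading of selecting the patterns with the largest differences.

```python
import math
from collections import Counter


def coupling_strengths(sequences, max_d=3):
    """Return {(X, d, Z): C_XdZ} for every pattern observed in `sequences`.

    Assumptions (not from the source): log base 2, P(Z) estimated from the
    same sequences, and Z located at offset d from X (d > 0 means X is
    closer to the N-terminal side).
    """
    pair_counts = Counter()     # N(XdZ)
    residue_counts = Counter()  # occurrences of each amino acid, for P(Z)

    for seq in sequences:
        residue_counts.update(seq)
        for i, x in enumerate(seq):
            for d in range(-max_d, max_d + 1):
                j = i + d
                if d == 0 or not 0 <= j < len(seq):
                    continue
                pair_counts[(x, d, seq[j])] += 1

    # Denominator of P(XdZ): N(XdZ) summed over all Z for fixed X and d.
    context_totals = Counter()
    for (x, d, _z), n in pair_counts.items():
        context_totals[(x, d)] += n

    total_residues = sum(residue_counts.values())
    strengths = {}
    for (x, d, z), n in pair_counts.items():
        p_cond = n / context_totals[(x, d)]
        p_z = residue_counts[z] / total_residues
        strengths[(x, d, z)] = math.log2(p_cond / p_z)
    return strengths


def top_k_differences(pos, neg, k=250):
    """Rank patterns by |C_pos - C_neg|; unseen patterns count as 0 (assumption)."""
    patterns = set(pos) | set(neg)
    return sorted(
        patterns,
        key=lambda p: abs(pos.get(p, 0.0) - neg.get(p, 0.0)),
        reverse=True,
    )[:k]


# Toy input: in "SPSP", S is always followed by P, while P makes up half
# of all residues, so the ratio is 1 / 0.5 = 2.
toy = coupling_strengths(["SPSP"], max_d=1)
print(toy[("S", 1, "P")])  # 1.0
```

Patterns that are never observed get no entry in this sketch. A production version would need a policy for them, such as pseudocounts, because the logarithm is undefined when N(XdZ) is zero.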