Project

Prediction of Biomineralization Proteins Using Machine Learning Models

Coccolithophores use biomineralization to produce shells made of calcium carbonate. Because the shells are transparent, light can still pass through them for photosynthesis. This project set out to build machine learning models for predicting biomineralization proteins. A set of biomineralization proteins and a set of random proteins were selected and processed with the cd-hit program to cluster similar sequences. After this processing, 138 biomineralization proteins and 1,000 random proteins were used to build the models. Three feature sets, totaling 248 features, were then computed and normalized, and prediction models were built with three machine learning algorithms: Radial Basis Function Support Vector Machine (RBF-SVM), Linear Support Vector Machine (L-SVM), and K-Nearest Neighbor (K-NN). The models were trained and tested using five-fold cross-validation, which gave a Matthews correlation coefficient (MCC) of 0.3834 for the K-NN, 0.3906 for the L-SVM, and 0.3992, the best result, for the RBF-SVM. The accuracy of the K-NN and RBF-SVM models was comparable to that of machine learning models developed for other types of proteins. The models were then applied to predict biomineralization proteins in Emiliania huxleyi. For the E. huxleyi dataset, the K-NN predicted 86 proteins, the RBF-SVM predicted 3,441 proteins, and the L-SVM predicted 11,671 proteins as possible biomineralization proteins, with 59 shared among all three algorithms, which suggests that K-NN might be the best model in this study.
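The abstract reports MCC alongside accuracy. The sketch below illustrates why MCC can be the more informative score on a dataset with 138 positive and 1,000 negative examples. It is not the project's code: the helper names `mcc` and `accuracy` and the all-negative classifier are hypothetical, chosen only for illustration, and returning 0 when the denominator is zero is a common convention assumed here.

```python
import math


def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient from confusion-matrix counts."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Assumed convention: report 0.0 when a row or column of the
    # confusion matrix is empty, which makes the denominator zero.
    if denom == 0:
        return 0.0
    return (tp * tn - fp * fn) / denom


def accuracy(tp, tn, fp, fn):
    """Fraction of all examples that were classified correctly."""
    return (tp + tn) / (tp + tn + fp + fn)


# Class sizes taken from the abstract.
N_POS, N_NEG = 138, 1000

# Hypothetical classifier that labels every protein as negative.
all_negative = dict(tp=0, tn=N_NEG, fp=0, fn=N_POS)

print(f"accuracy = {accuracy(**all_negative):.4f}")
print(f"MCC      = {mcc(**all_negative):.4f}")
```

Under these assumptions, the trivial classifier is correct on 1,000 of 1,138 proteins, so its accuracy is high, yet its MCC is 0 because it never identifies a positive example.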

Items in ScholarWorks are protected by copyright, with all rights reserved, unless otherwise indicated.