
Comparative analysis and prediction of nucleosome positioning using integrative feature representation and machine learning algorithms


Dataset descriptions

To compare the results of the predictors, the datasets used in this work were downloaded from two published papers [20, 21]. The first group of datasets covers H. sapiens, C. elegans and D. melanogaster and comes from the paper by Guo et al. [21]; the length of each DNA sequence is 147 bp. The second dataset covers the S. cerevisiae genome and comes from the paper by Chen et al. [20]; the length of each DNA sequence is 150 bp. Both datasets contain two types of samples: nucleosome-forming sequences (positive data) and nucleosome-inhibiting sequences (negative data). None of the included sequences shares ≥ 80% pairwise sequence identity with any other. The details of the datasets are shown in Table 15.

DNA sequence feature representation

Beyond the methods mentioned above, common DNA sequence representation methods include basic kmer (Kmer) [34] and reverse complementary kmer (RevKmer) [35], which are based on deoxyribonucleotide composition; dinucleotide-based autocovariance (DAC) and trinucleotide-based autocovariance (TAC) [29], which are based on correlations between physicochemical indices of nucleotides; and pseudo k-tuple nucleotide composition (PseKNC) [21], which is based on pseudo deoxyribonucleotide composition. These feature representation methods rely on specific calculation formulas and iterative functions, and some are computationally complex and time-consuming. This paper mainly uses a simple and intuitive feature representation.

Chaos game representation (CGR) is a graphical representation method for gene sequences based on chaos theory, proposed by Jeffrey in 1990 [36]. The method is as follows: the four nucleotides {A, T, G, C} are placed at the four vertices of the plane coordinate system, and the position in the plane of the i-th nucleotide of the DNA sequence is \(P_{i}\). The coordinate point of each nucleotide is drawn according to formula (2):

$$P_{i} = 0.5 \cdot (P_{i-1} + N_{i}), \quad i = 1, \ldots, L, \quad P_{0} = (0.5, 0.5)$$ (2)

Here, \(P_{0}\) is the given starting point, L is the length of the DNA sequence, and \(N_{i}\) denotes the vertex coordinate of the i-th nucleotide, where A = (0,0), T = (1,0), G = (1,1), C = (0,1). This method draws an image corresponding to a DNA sequence through the iterated function and puts the nucleotides of the sequence in one-to-one correspondence with the points of the image [36,37,38,39,40]. Figure 5 shows the CGR representation of the two types of sample sequences in the H. sapiens dataset.

Fig. 5 CGR of DNA sequences: a H. sapiens nucleosome-inhibiting sample and b H. sapiens nucleosome-forming sample
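To make the iteration of formula (2) concrete, the following minimal sketch computes the CGR coordinates of a sequence with NumPy; the toy sequence at the end is purely illustrative.

```python
import numpy as np

# Vertex coordinates from formula (2): A = (0,0), T = (1,0), G = (1,1), C = (0,1)
VERTICES = {"A": (0.0, 0.0), "T": (1.0, 0.0), "G": (1.0, 1.0), "C": (0.0, 1.0)}

def cgr_points(sequence):
    """Iterate P_i = 0.5 * (P_{i-1} + N_i), starting from P_0 = (0.5, 0.5)."""
    points = np.empty((len(sequence), 2))
    prev = np.array([0.5, 0.5])                      # starting point P_0
    for i, base in enumerate(sequence):
        prev = 0.5 * (prev + np.asarray(VERTICES[base]))
        points[i] = prev
    return points

# Toy example; the real inputs are the 147 bp / 150 bp sequences described above
print(cgr_points("ATGC"))
```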

By dividing the CGR image into \(2^{K} \times 2^{K}\) sub-blocks and counting the points that fall in each sub-block, we can determine the frequency of each K-nucleotide combination and thus convert the CGR image into a \(2^{K} \times 2^{K}\) matrix, called the frequency chaos game representation (FCGR) [39]. For example, we divided the CGR graph of Fig. 5a into a \(2^{3} \times 2^{3}\) grid and counted the number of points in each sub-block, obtaining the frequency matrix shown in Table 16.

Table 15 The quantity composition of the four species datasets
Table 16 The frequency matrix of the CGR image of the H. sapiens nucleosome-inhibiting sample
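One possible implementation of the counting step, reusing `cgr_points` from the sketch above; the grid orientation is our assumption, since the paper does not specify it.

```python
import numpy as np

def fcgr_matrix(points, k=3):
    """Count the CGR points falling in each cell of a 2^k x 2^k grid."""
    n = 2 ** k
    # Map coordinates in (0, 1) to grid indices in {0, ..., n - 1}
    idx = np.minimum((points * n).astype(int), n - 1)
    matrix = np.zeros((n, n), dtype=int)
    for x, y in idx:
        matrix[n - 1 - y, x] += 1        # row 0 at the top of the image (a choice)
    return matrix

# e.g. freq = fcgr_matrix(cgr_points(seq), k=3) for a sample sequence `seq`
```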

FCGR can be used not only as a numerical matrix but also as a grayscale image. The original CGR image is divided into \(4^{K}\) sub-blocks; the darker a sub-block, the more points it contains, while a lighter sub-block indicates fewer points, with pixel values ranging from 0 to 255 [39]. Figure 6 shows the FCGR images of the sample sequence with K = 3, 4 and 5.
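One plausible way to map counts to pixel values, assuming a simple linear scaling (the paper does not state its exact normalization):

```python
import numpy as np

def fcgr_to_grayscale(freq):
    """Linearly scale a frequency matrix to pixel values in [0, 255],
    with darker pixels (lower values) for denser sub-blocks."""
    scaled = freq / freq.max()                   # normalize counts to [0, 1]
    return (255 * (1.0 - scaled)).astype(np.uint8)
```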

Fig. 6 FCGR image of H. sapiens nucleosome-inhibiting sample with different K: a K = 3, b K = 4 and c K = 5

Support vector machine

Support vector machine (SVM) is a commonly used binary classification model. Compared with other classification algorithms, it achieves good classification performance and strong generalization ability on small datasets, and it can handle nonlinear classification problems through the kernel trick. Support vector machines have therefore been widely used in bioinformatics [19, 21, 23]. The basic idea is to map samples from the original low-dimensional space to a high-dimensional feature space, in which a separating hyperplane with the largest margin can be found to separate samples of different classes.

In this paper, we use the Python package Scikit-learn 0.23, which can be downloaded from https://scikit-learn.org/stable/index.html. This package contains an SVM module whose implementation is based on LIBSVM. We train the SVM with the radial basis function (RBF) kernel and consider two parameters: the penalty parameter C and the kernel coefficient gamma. During training, we used grid search to determine the best values of these two parameters.
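A minimal sketch of this grid search with Scikit-learn; the parameter ranges and the `X_train`/`y_train` arrays are illustrative assumptions, not values reported in the paper.

```python
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Exponentially spaced grids are a common default for C and gamma
param_grid = {
    "C":     [2 ** p for p in range(-5, 6)],
    "gamma": [2 ** p for p in range(-5, 6)],
}
search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=5, n_jobs=-1)
search.fit(X_train, y_train)   # X_train: FCGR feature vectors (hypothetical)
print(search.best_params_)
```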

Extreme learning machine

Extreme learning machine (ELM), proposed by Guang-Bin Huang, is a machine learning algorithm based on single-hidden-layer feedforward neural networks (SLFNs). Compared with traditional algorithms, ELM learns faster while maintaining accuracy. Its core idea is to choose the input-layer weights and hidden-layer biases of the network at random and compute the corresponding hidden-node outputs [41]. The network structure of the ELM model is shown in Fig. 7.

Fig. 7 Basic architecture of ELM

The experiments used David Lambert's Python implementation of ELM, linked from the ELM web portal (https://www.ntu.edu.sg/home/egbhuang/); the code is available at https://github.com/dclambert/Python-ELM.
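The training procedure can be summarized in a few lines of NumPy; this is a minimal sketch of the general ELM scheme, not David Lambert's implementation.

```python
import numpy as np

def elm_train(X, T, n_hidden=500, seed=0):
    """Minimal ELM: random input weights and biases, then a least-squares
    solution (Moore-Penrose pseudoinverse) for the output weights."""
    rng = np.random.default_rng(seed)
    W = rng.standard_normal((X.shape[1], n_hidden))  # random input weights
    b = rng.standard_normal(n_hidden)                # random hidden biases
    H = np.tanh(X @ W + b)                           # hidden-layer output
    beta = np.linalg.pinv(H) @ T                     # output weights, T: targets
    return W, b, beta

def elm_predict(X, W, b, beta):
    return np.tanh(X @ W + b) @ beta                 # class scores
```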

Extreme gradient boosting

Extreme gradient boosting (XGBoost) is an open-source machine learning project developed by Tianqi Chen et al. [42]. It is a boosting algorithm characterized by high efficiency, flexibility, high accuracy, and strong portability, and it has been applied in the biomedical field [43].

The idea of the XGBoost algorithm is to keep adding trees, performing feature splitting to grow each tree. Each newly added tree learns a new function that fits the residual of the previous prediction. When training is complete, K trees have been obtained. To predict the score of a sample, the sample's features send it to a corresponding leaf node in each tree, and each leaf node carries a score; the predicted value of the sample is simply the sum of the scores from all the trees.

In this experiment, we used the Python package xgboost 1.2.0, which can be downloaded from https://github.com/dmlc/xgboost.
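A minimal sketch of how the classifier described above might be trained with this package; the hyperparameter values and the `X_train`/`X_test` arrays are illustrative assumptions.

```python
from xgboost import XGBClassifier

clf = XGBClassifier(n_estimators=500,   # number of boosted trees (K in the text)
                    max_depth=6,
                    learning_rate=0.1)
clf.fit(X_train, y_train)               # hypothetical training data
scores = clf.predict_proba(X_test)[:, 1]  # summed leaf scores mapped to probabilities
```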

Multilayer perceptron

Multilayer perceptron (MLP) is also called a deep neural network (DNN) [44]. MLP extends the perceptron by introducing multiple hidden layers between the input layer and the output layer, with the neurons of adjacent layers fully connected. Thus, both the hidden layers and the output layer of an MLP are fully connected layers.

For the MLP, we used the AI Studio (https://aistudio.baidu.com/aistudio/index) experimental platform and the PaddlePaddle (https://www.paddlepaddle.org.cn/) deep learning framework provided by Baidu (https://www.baidu.com/) to implement the model in Python (https://www.python.org/). The MLP has three hidden layers with the ReLU activation function [45], each containing 50 neurons, and the output layer uses a softmax activation function. The MLP is trained for 5 epochs with the Adamax optimizer and a learning rate of 0.001. Adamax is a variant of Adam based on the infinity norm, which makes the learning-rate update more stable and simple [46]. We use cross entropy as the loss function.
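A sketch of this architecture in PaddlePaddle, assuming a flattened FCGR input of dimension 4^5 and a hypothetical `train_loader`; note that PaddlePaddle's cross-entropy loss applies softmax internally, so the model outputs logits.

```python
import paddle
from paddle import nn

# Three hidden layers of 50 ReLU units each, as described above
model = nn.Sequential(
    nn.Linear(4 ** 5, 50), nn.ReLU(),    # input size is an assumption
    nn.Linear(50, 50), nn.ReLU(),
    nn.Linear(50, 50), nn.ReLU(),
    nn.Linear(50, 2),                    # logits; loss applies softmax internally
)
loss_fn = nn.CrossEntropyLoss()
optimizer = paddle.optimizer.Adamax(learning_rate=0.001,
                                    parameters=model.parameters())

for epoch in range(5):                   # trained for 5 epochs
    for x_batch, y_batch in train_loader:  # hypothetical DataLoader
        loss = loss_fn(model(x_batch), y_batch)
        loss.backward()
        optimizer.step()
        optimizer.clear_grad()
```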

Convolutional neural network

Convolutional neural network (CNN) is a representative deep learning algorithm. It has demonstrated extraordinary advantages in computer vision and has also been widely used in bioinformatics [47, 48]. Convolutional neural networks automatically extract features from input data; compared with fully connected neural networks, they simplify model complexity and effectively reduce the number of model parameters [49]. Applied in the common image-processing setting, a convolutional neural network is mainly composed of convolutional layers, activation functions, pooling layers and fully connected layers [49, 50].

Owing to the limited amount of sample data, we need to guard against overfitting during training, so we add a batch normalization (BN) layer [51] after each convolutional layer and a dropout layer [52] after the fully connected layers. In our network, the convolutional layers use 3 × 3 convolution kernels; the first layer has 64 filters and the second has 32. The pooling layers use 2 × 2 max pooling with stride 2. The first fully connected layer has 100 neurons and the second has 50; the subsequent dropout layer has a dropout probability of 0.5. Except for the softmax activation function in the output layer, all layers use the ReLU activation function. The CNN is trained for 20 epochs with the Adamax optimizer and a learning rate of 0.001, and the loss function is cross entropy. As with the MLP, we used the AI Studio experimental platform and the PaddlePaddle deep learning framework provided by Baidu to implement the model in Python. The specific network structure is shown in Fig. 8.

Fig. 8 The architecture of our CNN model
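A sketch of the described network in PaddlePaddle, assuming a 1-channel 32 × 32 FCGR input (K = 5) and `padding=1` convolutions, neither of which is specified in the paper; the loss and optimizer setup would mirror the MLP sketch above.

```python
import paddle
from paddle import nn

model = nn.Sequential(
    # Conv (64 filters, 3x3) -> BN -> ReLU -> 2x2 max pool (32x32 -> 16x16)
    nn.Conv2D(1, 64, kernel_size=3, padding=1), nn.BatchNorm2D(64), nn.ReLU(),
    nn.MaxPool2D(kernel_size=2, stride=2),
    # Conv (32 filters, 3x3) -> BN -> ReLU -> 2x2 max pool (16x16 -> 8x8)
    nn.Conv2D(64, 32, kernel_size=3, padding=1), nn.BatchNorm2D(32), nn.ReLU(),
    nn.MaxPool2D(kernel_size=2, stride=2),
    nn.Flatten(),
    nn.Linear(32 * 8 * 8, 100), nn.ReLU(),   # first fully connected layer
    nn.Linear(100, 50), nn.ReLU(),           # second fully connected layer
    nn.Dropout(0.5),                         # dropout after the FC layers
    nn.Linear(50, 2),                        # logits; loss applies softmax internally
)
```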

