Machine learning is a branch of artificial intelligence, concerned with the design and development of algorithms that allow computers to evolve behaviours based on empirical data, such as from sensor data or databases. A learner can take advantage of examples (data) to capture characteristics of interest of their unknown underlying probability distribution. Data can be seen as examples that illustrate relations between observed variables. A major focus of machine learning research is to automatically learn to recognize complex patterns and make intelligent decisions based on data; the difficulty lies in the fact that the set of all possible behaviours given all possible inputs is too large to be covered by the set of observed examples (training data). Hence the learner must generalize from the given examples, so as to be able to produce a useful output in new cases. Machine learning requires cross-disciplinary proficiency in several areas, such as probability theory, statistics, pattern recognition, cognitive science, data mining, adaptive control, computational neuroscience and theoretical computer science for analyzing data.
2 Machine Learning Techniques and Their Applications
Machine learning techniques have been widely applied to biological sequences, e.g. to predict drug resistance in HIV-1 from the sequences of drug target proteins and to predict protein functional classes. Because deletions and insertions are frequent in biological sequences, a major limitation of current methods is their inability to handle varying sequence lengths. Machine learning is a promising alternative to existing methods, especially for protein sequences of variable length. Hence many scientists and biological workers apply machine learning methods to analyse biological data, often generating useful patterns.
Since 1992 or earlier, scientists have worked on the protein folding problem, which is still an unsolved puzzle. According to Chou (1994 and 2001) and Wu (2002), a protein can be characterized as a vector in a 20-D space whose 20 components are defined by the composition of its 20 amino acids. The similarity of two proteins is proportional to the mutual projection of their characteristic vectors and hence decreases as their correlation angle grows.
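The composition-vector idea described above can be illustrated with a minimal sketch. It assumes that the 20 components are simple residue fractions and that the correlation angle is the ordinary angle between the two vectors; the function names and the two toy peptides are hypothetical and are not taken from the cited papers.

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues, one-letter codes


def composition(seq):
    """Return the 20-D amino acid composition (residue fractions) of a sequence."""
    seq = seq.upper()
    counts = [seq.count(a) for a in AMINO_ACIDS]
    total = sum(counts)
    if total == 0:
        raise ValueError("sequence contains no standard amino acids")
    return [c / total for c in counts]


def correlation_angle(u, v):
    """Angle in radians between two composition vectors (0 means same direction)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    cos = max(-1.0, min(1.0, dot / norm))  # clamp against rounding error
    return math.acos(cos)


# Toy peptides of different lengths (made up for illustration only).
p1 = composition("MKTAYIAKQRQISFVK")
p2 = composition("MKTAYIAKQR")
print(correlation_angle(p1, p2))
```

Because the composition has a fixed length of 20 regardless of sequence length, two proteins of different lengths can be compared directly, which is one reason this representation is popular as classifier input.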
In this regard it is important to study the biochemical characteristics of proteins. Xu et al. (1997) studied the geometry of the hydrogen bonds across protein interfaces. The software HBPLUS is designed to analyze hydrogen bonds (McDonald et al., 1993; McDonald and Thornton, 1994). The program determines the positions of missing hydrogens in PDB structures and checks each donor–acceptor pair against the geometric criteria.
As research on antiretroviral therapy for HIV-AIDS has progressed, HIV databases have been created that contain various mutants and strains of HIV. Valadez-González et al. (2000) studied HIV-2 strains from the Los Alamos database, the largest and oldest HIV database, and compared their immunogenicity with that of HIV-1. The antigenicity profile obtained with the Surface Plot program for epitope II shows that this region has a group of exposed amino acids in its middle part with a high degree of immunogenic potential in both HIV-1 and HIV-2. Andersson (2001) studied the comparative immune response to HIV-1 and HIV-2 in population groups, especially during the asymptomatic phase of HIV-2 infection, because HIV-2 is less immunogenic.
With the development of protein sequence databases, Wu (2002) studied functionally annotated protein sequences. Annotation problems are addressed by a classification-driven and rule-based method with evidence attribution, coupled with an integrated knowledge base system. This approach allows sensitive identification, consistent and rich annotation, and systematic detection of annotation errors, as well as distinction between experimentally verified and computationally predicted features.
Medical practitioners identify HIV infection from the viral load and CD4 count in blood samples of infected persons. p24 is a capsid antigen of HIV, and an increase in its level indicates HIV infection. Schüpbach (2003) studied HIV progression with the p24 antigen test and the viral load test as markers of antiretroviral treatment success, which identify specific expression patterns shared by a group of well-characterized genes classified by relevant biological function. No cure for HIV infection is available to date. Researchers are trying to find potential drug target sites for drug design against HIV infection. Membrane proteins are the first to be involved in cell-cell contact and are therefore good candidate drug targets. To study membrane proteins, Mitaku et al. (2004) provided the SOSUI prediction software for transmembrane helices, which also predicts the type of transmembrane protein with 80% accuracy using hidden Markov models. As statistical methods developed, Tsuji and Mitaku (2004) studied membrane proteins and soluble proteins on the basis of principal component analysis, with verification by the jackknife test. Yang (2006) studied membrane protein types on the basis of amino acid and peptide composition. Dubey et al. (2009; 2010) did similar work with machine learning techniques.
Much research on HIV enzymes has aimed at rational drug design, and HIV-1 protease has shown the most promise. It is a retroviral aspartyl protease (retropepsin) that is essential for the life cycle of HIV. HIV protease cleaves newly synthesized polyproteins at the appropriate places to create the mature protein components of an infectious HIV virion. Thus, mutation of the active site of HIV protease or inhibition of its activity disrupts the ability of HIV to replicate and infect additional cells (Seelmeier et al., 1988), which has made HIV protease inhibition the subject of considerable pharmaceutical research (McPhee, 1996). Lumini and Nanni (2009) proposed a hierarchical classifier architecture that drastically reduces error relative to linear classifiers. This hierarchy is useful for predicting HIV-1 protease cleavage sites with greater accuracy.
Cleavage site prediction is important because proteinases play critical roles in both intracellular and extracellular processes by binding and cleaving their protein substrates. Predicting protease cleavage sites would help to identify potential drug target sites for HIV-1 protease. Bioinformatics techniques such as molecular docking are used together with machine learning techniques for classification. Glick and co-workers (2010) investigated the application of machine learning to improve the results of high-throughput docking against HIV-1 protease; a Naive Bayes classifier showed good results.
MicroRNAs (miRNAs) are very small RNA molecules with an important regulatory position in the cell. They can bind to RNA that codes for a protein and thereby repress that protein. MicroRNAs are involved in a variety of disease processes and play a role, for instance, in different types of cancer. Breast cancer is often caused by an error in the production of RNA. This regulatory role makes microRNA a very interesting drug target. Millar (2006) explored some of the similarities and differences between the miRNA systems of plants and animals and examined whether they are fundamentally different or simply variations on a theme. This gives insight into miRNA and its biological importance. Pant et al. (2009) did related work using a support vector machine to classify plant and animal miRNAs. The importance of miRNA also drew attention to RNAi (RNA interference). Aagaard and Rossi (2007) studied the therapeutic importance of RNAi and argued that it could be the next biological source for treating diseases. Söllner and Mayer (2006) studied machine learning approaches for the prediction of linear B-cell epitopes on proteins. Their approach combines several parameters previously associated with antigenicity and includes novel parameters based on frequencies of amino acids and amino acid neighborhood propensities. The machine learning classifiers clearly outperformed the reference classification systems on the HIV epitope validation set.
Hallett and co-workers (2006) studied the prediction of the subcellular localization of viral proteins within a mammalian host cell using the PSLT predictor, which considers the combinatorial presence of domains and targeting signals in human proteins to predict localization. Localization of proteins greatly helps to identify signature proteins as HIV drug target sites. Song and Shi (2010) used a K-Nearest Neighbor classifier and tested it on a known dataset of 317 apoptosis proteins; the total prediction accuracy of the method was 88.3%. These results indicate that the composition of dipeptide categories combined with a K-Nearest Neighbor classifier is very useful for predicting the subcellular location of apoptosis proteins. Harrison and Langdale (2006) used both amino acid and nucleotide data to generate phylogenies by distance-based and likelihood methods, and the results were further analyzed with a Bayesian algorithm. Using DNA data to generate the alignments is very likely to lead to alignments that sometimes do not reflect the actual mutational history. The protein sequence is under selective constraint for protein function and structure, and these are conserved over much longer periods than individual codon choices; hence amino acid sequences are important for studying phylogeny.
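The combination of dipeptide composition with a K-Nearest Neighbor classifier mentioned above can be sketched as follows. This is a simplified illustration, not the method of Song and Shi: it uses plain dipeptide frequencies rather than dipeptide categories, Euclidean distance, and a majority vote. The training sequences and location labels are invented toy data.

```python
import math
from collections import Counter


def dipeptide_composition(seq):
    """Map each adjacent residue pair in seq to its relative frequency."""
    pairs = [seq[i:i + 2] for i in range(len(seq) - 1)]
    total = len(pairs)
    return {pair: n / total for pair, n in Counter(pairs).items()}


def distance(f, g):
    """Euclidean distance between two sparse feature dictionaries."""
    keys = set(f) | set(g)
    return math.sqrt(sum((f.get(k, 0.0) - g.get(k, 0.0)) ** 2 for k in keys))


def knn_predict(train, query, k=3):
    """Predict a label for query by majority vote among its k nearest training sequences.

    train is a list of (sequence, label) pairs.
    """
    q = dipeptide_composition(query)
    ranked = sorted(train, key=lambda item: distance(dipeptide_composition(item[0]), q))
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


# Hypothetical toy data: sequences and labels are made up for illustration.
train = [
    ("AKAKAKAK", "cytoplasm"), ("KAKAKAKA", "cytoplasm"), ("AKAKKAKA", "cytoplasm"),
    ("LLVLLVLL", "membrane"), ("VLLVLLVL", "membrane"), ("LVLLVLLV", "membrane"),
]
print(knn_predict(train, "AKAKAKKA"))
```

On a real dataset the value of k and the distance measure would normally be chosen by cross-validation.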
Prosperi (2009) studied different machine learning and feature selection methods for classifying HIV treatment success based on viral genotype, therapy, and derived input features. HIV-positive persons with low CD4 counts can have retinal damage and visual field defects; Kozak et al. (2007) showed that support vector machine (SVM) and relevance vector machine (RVM) classifiers were sufficiently sensitive to distinguish these eyes from normal eyes. Nanni and his team (2009) proposed a protein classification method combining surface analysis and the primary structure of proteins. Su et al. (2007) proposed a hybrid prediction method for Gram-negative bacteria that combines a one-versus-one SVM model and a structural homology approach. The SVM model comprises a number of binary classifiers in which biological features derived from Gram-negative bacteria translocation pathways are incorporated, and structural homology shows the common amino acids of these bacteria.
G-protein coupled receptors (GPCRs), the seven-transmembrane domain receptors, comprise the largest family of proteins targeted by drug discovery. Together with structures of the prototypical GPCR rhodopsin, solved structures of other liganded GPCRs promise to provide insights into the structural basis of the superfamily's biochemical functions and to assist in the development of new therapeutic modalities and drugs. Schoenberg et al. (2007) proposed an evolutionary analysis of GPCRs based on DNA extraction methods. Evolutionary data from both sequenced genomes and targeted retrieved orthologs are increasingly used as a source of structural information. Recent success in sequencing and functionally expressing GPCRs from fossils opens the possibility of studying signaling pathways even in extinct species.
Steffen et al. (2008) predicted the outcome of a therapy attempt for a patient who carries an HIV variant with a set of observed genetic properties; such predictions need to be made for hundreds of possible combinations of drugs, some of which use similar biochemical mechanisms. In a paired t-test, distribution matching was significantly better than the reference methods. As the significance of machine learning techniques such as the support vector machine has increased, researchers have combined kernel methods with SVM. Boisvert et al. (2008) investigated how well the SVM can predict HIV-1 coreceptor usage when it is equipped with an appropriate string kernel. Data mining learning models and algorithms are also helpful in disease diagnosis, treatment and the targeting of potential drugs; Andreeva (2008) worked in this area. Such models and algorithms give biomedical scientists machine learning tools that can be used for medical applications.
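To make the string-kernel idea mentioned above concrete, the following minimal sketch implements the k-spectrum kernel, one of the simplest string kernels. It is swapped in here purely for illustration and is not necessarily the kernel used in the cited coreceptor study; the function names and toy strings are assumptions.

```python
import math
from collections import Counter


def kmer_counts(seq, k):
    """Count every contiguous substring of length k in seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def spectrum_kernel(s, t, k=3):
    """k-spectrum kernel: inner product of the k-mer count vectors of s and t."""
    cs, ct = kmer_counts(s, k), kmer_counts(t, k)
    return sum(n * ct[kmer] for kmer, n in cs.items() if kmer in ct)


def normalized_kernel(s, t, k=3):
    """Kernel value scaled to [0, 1] so that sequence length matters less."""
    denom = math.sqrt(spectrum_kernel(s, s, k) * spectrum_kernel(t, t, k))
    return spectrum_kernel(s, t, k) / denom if denom else 0.0


# Sequences of different lengths can be compared without an alignment.
print(spectrum_kernel("ABC", "ABCDE", k=2))
```

A kernel of this kind can be passed to an SVM as a precomputed Gram matrix, so the classifier never needs fixed-length feature vectors.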
Prosperi et al. (2009) studied the associations of whole HIV-1 envelope genetic features and clinical markers with viral tropism. Bootstrapped hierarchical clustering was used to assess mutational covariation. Different machine learning methods (logistic regression, SVM, decision trees and rule-based reasoning) and feature selection methods, along with loss functions (accuracy, ROC curves, and F-measure), were applied and compared for the classification of X4 variants. The logistic regression model achieved 92.7% accuracy. Rao (2009) applied machine learning approaches including SVM, K-nearest neighbour (K-NN), artificial neural network (ANN) and logistic regression (LR) to the classification of HIV-1 protease inhibitors from molecular structure. SVM showed better generalization ability and can be used as an alternative fast filter in the virtual screening of large chemical databases.
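The evaluation measures named above (accuracy, F-measure and the area under the ROC curve) can be computed as in the following minimal sketch for a binary task. The labels and scores are invented toy values, with 1 standing for the positive class (for example, an X4 variant); nothing here reproduces the cited results.

```python
def confusion_counts(y_true, y_pred):
    """Return (tp, fp, tn, fn) for binary labels coded as 0 and 1."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, fp, tn, fn


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    tp, fp, tn, fn = confusion_counts(y_true, y_pred)
    return (tp + tn) / (tp + fp + tn + fn)


def f_measure(y_true, y_pred):
    """Harmonic mean of precision and recall (F1)."""
    tp, fp, tn, fn = confusion_counts(y_true, y_pred)
    if tp == 0:
        return 0.0
    precision, recall = tp / (tp + fp), tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def roc_auc(y_true, scores):
    """Area under the ROC curve: probability that a positive outranks a negative."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy labels, hard predictions and classifier scores (made up for illustration).
y_true = [1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 0, 1, 1]
scores = [0.9, 0.4, 0.3, 0.2, 0.8, 0.6]
print(accuracy(y_true, y_pred), f_measure(y_true, y_pred), roc_auc(y_true, scores))
```

Accuracy alone can be misleading when one class is rare, which is why F-measure and ROC analysis are usually reported alongside it.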
Ozyilmaz (2009) studied the features of the HIV-1 genome using statistical data of R5X4, R5 and X4 viruses, analyzed with signal processing methods and ANNs. The results indicate that R5X4 viruses were successfully classified with high sensitivity and specificity in training and testing ROC analysis by the RBF network, which gave the best performance among the ANN structures. Szymczyna et al. (2009) demonstrated a synergistic combination of NMR spectroscopy, de novo structure prediction, and X-ray crystallography in an effective overall strategy for rapidly determining the structure of the coat protein C-terminal domain from the Sulfolobus islandicus rod-shaped virus (SIRV). This approach takes advantage of the most accessible aspects of each structural technique and may be widely applicable for structure determination.
Singh and Mars (2010) proposed a support vector machine classification model to predict the degree of CD4 count change in HIV-1 positive patients from genotype, viral load and time. The model achieved an accuracy of 83%. In 2011 they showed that a change in CD4 count can be forecast using machine learning without genome data; their neural network predicted virological response in HIV-positive patients with 73% accuracy. These analyses suggest that SVM is relatively well suited to analyzing biological data, which are high-dimensional.
Zazzi et al. (2012) proposed machine learning approaches to establish data-driven engines able to indicate the most effective treatments for any patient and virus combination. Because biological data are huge and difficult to manage, they need to be mined for further analysis, and data mining and machine learning algorithms have been developed for this purpose.
At present, scientists working on HIV-AIDS are trying to develop a vaccine to eradicate HIV, and machine learning techniques seem able to fulfill this promise. Machine learning has been used to design an HIV vaccine based on a cocktail of epitopes. The dataset was so enormous that a novel approach was adopted. The task is to look at the genotype of a controller and compare it with the epitopes (short chains of amino acids) in the virus they carry. Machine learning was able to whittle all the data down to a list of the first six epitopes that have the desired dormant mutation property. The vaccine consists of a cocktail of such epitopes. However, successful formulation of the vaccine requires tricky epitopes. According to Mike Szczys, if the vaccine passes clinical trials, it could reach the masses by 2017.
Machine learning methods are also used to identify and model associations between antibody features (IgG subclass and antigen specificity) and effector function activity. These antibody features are qualitatively and quantitatively useful in classification and regression, which provides a new objective approach to discovering and assessing immune correlates (Choi et al., 2015).
The above examples of machine learning methods for HIV clearly show that the present era needs computationally efficient methods to build, change and update intelligent systems. Because machine learning methods are fast and economical, they can complement wet lab techniques well. Machine learning provides methods, techniques and tools that can help to solve diagnostic and prognostic problems in a variety of medical domains. It also helps medical practitioners in the treatment of diseases such as HIV-AIDS. Ethical issues therefore arise as well. It is humans, not systems, who act as moral agents, and it falls to them to make systems efficient while ensuring that morality is maintained. Molecular biology is now seen as encouraging more "personalized medicine", the closer alignment of biological information and therapy selection. The evolution of molecular medicine, coupled with discovery and its clinical applications, will play a significant role in reshaping medicine and the treatment of HIV-AIDS.
In conclusion, several areas of machine learning seem to be of particular challenge and importance for future research. With the growing sophistication of learning programs, there is an interest in increasing the transfer of machine learning programs from university laboratories to the real world, where they can be applied to problems of practical significance. It is a challenge for researchers to test their work in the context of real-life problems such as HIV and other deadly diseases, which may lead to economical treatment that is available worldwide. Machine learning is a cutting-edge technology that continues to advance through its contributions to research and medicine.
Aagaard L., and Rossi J.J., 2007, RNAi therapeutics: principles, prospects and challenges, Advanced Drug Delivery Reviews, 59(2-3): 75-86
Andersson S., 2001, HIV-2 and the immune response, AIDS Reviews, 3(1): 112-120
Andreeva P., Dimitrova M., and Radeva P., 2008, Data Mining Learning Models and Algorithms for Medical Applications, 1-5
Boisvert S., Marchand M., Laviolette F., and Corbeil J., 2008, HIV-1 coreceptor usage prediction without multiple alignments: an application of string kernels, Retrovirology, 5(1): 115-123
Choi I., Chung A.W., Suscovich T.J., Rerks-Ngarm S., Pitisuttithum P., Nitayaphan S., Kaewkungwal J., O'Connell R.J., Francis D., Robb M.L., Michael N.L., Kim J.H., Alter G., Ackerman M.E., and Bailey-Kellogg C., 2015, Machine learning methods enable predictive modeling of antibody feature:function relationships in RV144 vaccinees, PLoS Computational Biology, 11(4): e1004185
Chou K.C., and Zhang C.T., 1994, Predicting protein folding types by distance functions that make allowances for amino acid interactions, Journal of Biological Chemistry, 269(35): 22014-20
Chou K.C., 2001, Prediction of protein cellular attributes using pseudo-amino acid composition, Proteins Structure Function & Bioinformatics, 43(3): 246-55
Dubey A., Pant B., and Adlakha N., 2010, SVM model for amino acid composition based classification of HIV-1 groups, International Conference on Bioinformatics and Biomedical Technology, pp.120-123
Dubey A., Pant B., and Chouhan U., 2011, Machine learning model for HIV1 and HIV2 enzyme secondary structure classification, J. Comput. Method. Mol. Design, 1 (2): 1-8
Harrison C.J., and Langdale J.A., 2006, A step by step guide to phylogeny reconstruction, Plant Journal for Cell & Molecular Biology, 45(4): 561–572
Klon A.E., Glick M., and Davies J.W., 2010, Applications of machine learning to improve the results of high-throughput docking against the HIV-1 protease, J Chem Inf Comput Sci, 4(6): 2216-2224
Kozak I., Sample P.A., Hao J., Freeman W.R., Weinreb R.N., Lee T.W., and Goldbaum M.H., 2007, Machine Learning Classifiers Detect Subtle Field Defects in Eyes of HIV Individuals, Trans Am Ophthalmol Soc., 105: 111-120
McPhee F., Good A.C., Kuntz I.D., and Craik C.S., 1996, Engineering human immunodeficiency virus 1 protease heterodimers as macromolecular inhibitors of viral maturation, Proc. Natl. Acad. Sci. U.S.A, 93(21): 11477-11481
Millar A.A., and Waterhouse P.M., 2006, Plant and animal MicroRNAs: similarities and differences, Functional & Integrative Genomics, 5(3): 129-135
Nanni L., Mazzara S., Pattini L., and Lumini A., 2009, Protein classification combining surface analysis and primary structure, Protein Engineering, 22(4): 267-272
Pant K., Pant B., and Pardasani K.R., 2007, Support Vector Machine for Classification of Plant and animal miRNA. International Conference on Advances in Computing, Control, and Telecommunication Technologies, (2009), In: Aagaard L. and Rossi J.J., RNAi Therapeutics: Principles, Prospects and Challenges, Elsevier Science
Prosperi M.C., 2009, Robust supervised and unsupervised statistical learning for HIV type 1 coreceptor usage analysis, AIDS Res Human retroviruses, 25(3): 305-314
Rao H., Yang G., Tan N., Li P., Li Z., and Li X., 2009, Prediction of HIV-1 Protease Inhibitors Using Machine Learning Approaches, QSAR & Combinatorial Science, 28(11-12): 1346–1357
Schoenberg T., Hofreiter M., Schultz A., and Rompler H., 2009, Learning from the past: evolution of GPCR functions, Trends in Pharmacological Sciences, 28(3): 59-64
Schüpbach J., 2003, Viral RNA and p24 antigen as markers of HIV disease and antiretroviral treatment success, International Archives of Allergy & Immunology, 132(3): 196-209
Scott M.S., Oomen R., Thomas D.Y., and Hallett M.T., 2006, Predicting the subcellular localization of viral proteins within a mammalian host cell, Virology Journal, 3(1): 1-8
Seelmeier S., Schmidt H., Turk V., and von der Helm K., 1988, Human immunodeficiency virus has an aspartic-type protease that can be inhibited by pepstatin A, Proc. Natl. Acad. Sci. U.S.A., 85(18): 6612–6616
Singh Y., and Mars M., eds, 2011, The use of Neural networks to predict virological response in HIV positive patients, In: The international eHealth, Telemedicine, and Health ICT forum of education, networking and business, Luxembourg, 1-25
Söllner J., and Mayer B., 2006, Machine learning approaches for prediction of linear B-cell epitopes on proteins, J Mol Recognition, 19(3): 200-208
Song C., and Shi F., 2010, Prediction of subcellular localization of apoptosis proteins by dipeptide composition, International Journal of Digital content Technology and its Applications, 4(1): 32-36
Su C.Y., Chiu H.S., Lo A., Hwang J.K., Sung T.Y., and Hsu W.L., 2007, Protein subcellular localization prediction based on compartment-specific features and structure conservation, BMC Bioinformatics, 8(1): 1-12
Szymczyna B.R., Taurog R.E., Young M.J., Snyder J.C., Johnson J.E., and Williamson J.R., 2009, Synergy of NMR, computation, and X-ray crystallography for structural biology, Structure, 17(4): 499-507
Tsuji T., and Mitaku S., 2004, Features of transmembrane helices useful for membrane protein prediction, Chem-bioinformatics Journal, 4(3): 110-120
Valadez-González N., Gevorkian G., and Soler C., 2000, Transmembrane glycoprotein cross reactive HIV-1/HIV-2 epitope, Rev. bioméd, 11(3): 155-160
White S.H., and Von H.G., 2008, How translocons select transmembrane helices, Biophysics, 37(37): 23-42
Wu C.H., Huang H., Yeh L.S., and Barker W.C., 2003, Protein family classification and functional annotation, Computational Biology & Chemistry, 27(1): 37-47
Xu D., Tsai C.J., and Nussinov R., 1997, Hydrogen bonds and salt bridges across protein-protein interfaces, Protein Engineering, Design and Selection, 10(9): 999-1012
Yang X.G., Luo R.Y., and Feng Z.P., 2006, Using amino acid and peptide composition to predict membrane protein types, Biochemical and Biophysical Research Communications, 353(1): 164-169
Yavuz O., and Ozyilmaz L., 2009, Analysis and Classification of HIV-1 Sub-Type Viruses by AR Model through Artificial Neural Networks, World Academy of Science, Engineering and Technology 49: 826-831
Zazzi M., Incardona F., Rosen-Zvi M., Prosperi M., Lengauer T., Altmann A., Sonnerborg A., Lavee T., Schülter E., and Kaiser R., 2012, Predicting response to antiretroviral treatment by machine learning: the euresist project, Intervirology, 55(2): 123-127