Revista Colombiana de Estadística
Print version ISSN 0120-1751
Rev.Colomb.Estad. vol.35 no.spe2 Bogotá June 2012
1 Universidad Nacional de Colombia, Escuela de Estadística, Medellín, Colombia. MSc student. Email: diasalazarbl@unal.edu.co
2 Universidad Nacional de Colombia, Grupo de Investigación en Estadística, Medellín, Colombia. Researcher. Email: jorgeivanvelez@gmail.com
3 Universidad Nacional de Colombia, Escuela de Estadística, Medellín, Colombia. Universidad Nacional de Colombia, Grupo de Investigación en Estadística, Medellín, Colombia. Associate professor. Email: jcsalaza@unal.edu.co
The classification of individuals is a common problem in applied statistics. If X is a data set corresponding to a sample from a specific population in which observations belong to g different categories, the goal of classification methods is to determine to which category a new observation belongs. When g=2, logistic regression (LR) is one of the most widely used classification methods. More recently, Support Vector Machines (SVM) have become an important alternative. In this paper, the fundamentals of LR and SVM are described, and the question of which one discriminates better is addressed using statistical simulation. An application with real data from a microarray experiment is presented as an illustration.
Key words: Classification, Genetics, Logistic regression, Simulation, Support vector machines.
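The kind of comparison described in the abstract can be illustrated with a minimal sketch for the case g=2. It is not the authors' simulation design, which is given only in the full-text PDF. Everything specific here is an assumption made for illustration: the two bivariate normal populations, the sample sizes, the gradient-based fitting routines, and the helper names `simulate`, `fit_logistic`, `fit_linear_svm` and `error_rate`. The SVM uses a linear kernel only.

```python
import math
import random


def simulate(n, delta, rng):
    """Draw n points per class from two bivariate normals (assumed design):
    class 0 centered at (-delta, -delta), class 1 at (+delta, +delta)."""
    data = []
    for label, mu in ((0, -delta), (1, delta)):
        for _ in range(n):
            data.append(([rng.gauss(mu, 1.0), rng.gauss(mu, 1.0)], label))
    rng.shuffle(data)
    return data


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def fit_logistic(data, epochs=200, lr=0.1):
    """Logistic regression fitted by batch gradient ascent on the log-likelihood."""
    w, b = [0.0, 0.0], 0.0
    n = len(data)
    for _ in range(epochs):
        gw, gb = [0.0, 0.0], 0.0
        for x, y in data:
            err = y - sigmoid(w[0] * x[0] + w[1] * x[1] + b)
            gw[0] += err * x[0]
            gw[1] += err * x[1]
            gb += err
        w = [w[j] + lr * gw[j] / n for j in range(2)]
        b += lr * gb / n
    return w, b


def fit_linear_svm(data, epochs=200, lr=0.1, lam=0.01):
    """Linear soft-margin SVM fitted by batch subgradient descent
    on the L2-regularized hinge loss (labels recoded to -1/+1)."""
    w, b = [0.0, 0.0], 0.0
    n = len(data)
    for _ in range(epochs):
        gw, gb = [lam * w[0], lam * w[1]], 0.0
        for x, y in data:
            t = 1.0 if y == 1 else -1.0
            if t * (w[0] * x[0] + w[1] * x[1] + b) < 1.0:
                gw[0] -= t * x[0] / n
                gw[1] -= t * x[1] / n
                gb -= t / n
        w = [w[j] - lr * gw[j] for j in range(2)]
        b -= lr * gb
    return w, b


def error_rate(model, data):
    """Proportion of observations assigned to the wrong class."""
    w, b = model
    wrong = 0
    for x, y in data:
        pred = 1 if w[0] * x[0] + w[1] * x[1] + b > 0 else 0
        wrong += pred != y
    return wrong / len(data)


rng = random.Random(2012)            # fixed seed so the run is reproducible
train = simulate(100, 1.0, rng)      # 200 training observations
test = simulate(500, 1.0, rng)       # 1000 independent test observations
lr_err = error_rate(fit_logistic(train), test)
svm_err = error_rate(fit_linear_svm(train), test)
print(f"LR misclassification rate:  {lr_err:.3f}")
print(f"SVM misclassification rate: {svm_err:.3f}")
```

A single run like this says little on its own. A simulation study would repeat it over many replicates and over different population shapes, sample sizes and kernels before drawing any conclusion about which method discriminates better.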
Full text available in PDF
This article can be cited in LaTeX using the following BibTeX reference:
@ARTICLE{RCEv35n2a03,
AUTHOR = {Salazar, Diego Alejandro and Vélez, Jorge Iván and Salazar, Juan Carlos},
TITLE = {{Comparison between SVM and Logistic Regression: Which One is Better to Discriminate?}},
JOURNAL = {Revista Colombiana de Estadística},
YEAR = {2012},
volume = {35},
number = {2},
pages = {223-237}
}