Comparison between SVM and Logistic Regression: Which One is Better to Discriminate?

SALAZAR, DIEGO ALEJANDRO; VÉLEZ, JORGE IVÁN; SALAZAR, JUAN CARLOS

Revista Colombiana de Estadística

Print version ISSN 0120-1751

Rev.Colomb.Estad. vol.35 no.spe2 Bogotá June 2012

Comparación entre SVM y regresión logística: ¿cuál es más recomendable para discriminar?

DIEGO ALEJANDRO SALAZAR¹, JORGE IVÁN VÉLEZ², JUAN CARLOS SALAZAR³

¹Universidad Nacional de Colombia, Escuela de Estadística, Medellín, Colombia. MSc student. Email: diasalazarbl@unal.edu.co
²Universidad Nacional de Colombia, Grupo de Investigación en Estadística, Medellín, Colombia. Researcher. Email: jorgeivanvelez@gmail.com
³Universidad Nacional de Colombia, Escuela de Estadística, Medellín, Colombia. Universidad Nacional de Colombia, Grupo de Investigación en Estadística, Medellín, Colombia. Associate professor. Email: jcsalaza@unal.edu.co

Abstract

The classification of individuals is a common problem in applied statistics. If X is a data set corresponding to a sample from a specific population in which observations belong to g different categories, the goal of classification methods is to determine to which category a new observation belongs. When g=2, logistic regression (LR) is one of the most widely used classification methods. More recently, Support Vector Machines (SVM) have become an important alternative. In this paper, the fundamentals of LR and SVM are described, and the question of which one discriminates better is addressed using statistical simulation. An application to real data from a microarray experiment is presented as an illustration.

Key words: Classification, Genetics, Logistic regression, Simulation, Support vector machines.
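
The kind of simulation-based comparison described in the abstract can be illustrated with a small sketch. It is not the authors' simulation design, which is only available in the full text. The sketch assumes two bivariate normal classes with unit variance and means at (-shift/2, -shift/2) and (shift/2, shift/2). It fits a logistic-loss and a hinge-loss (linear soft-margin SVM) classifier by plain gradient descent rather than with a dedicated solver, and averages the test misclassification rate over repeated samples. All function names and parameter values (`simulate`, `fit_linear`, `compare`, `shift`, `lam`, and so on) are hypothetical choices made for this illustration.

```python
import math
import random


def simulate(n, shift, rng):
    """Draw n points per class from two unit-variance bivariate normals.

    Class -1 is centred at (-shift/2, -shift/2), class +1 at (shift/2, shift/2).
    This design is an assumption made for the illustration.
    """
    data = []
    for label in (-1, 1):
        mu = label * shift / 2.0
        for _ in range(n):
            data.append(((rng.gauss(mu, 1.0), rng.gauss(mu, 1.0)), label))
    rng.shuffle(data)
    return data


def fit_linear(data, loss, epochs=200, lr=0.1, lam=0.01):
    """Full-batch (sub)gradient descent on an L2-penalised linear classifier.

    loss="logistic" gives a logistic regression fit; loss="hinge" gives a
    linear soft-margin SVM. Both share the same optimiser for comparability.
    """
    w = [0.0, 0.0]
    b = 0.0
    n = len(data)
    for _ in range(epochs):
        gw = [lam * w[0], lam * w[1]]  # gradient of the L2 penalty
        gb = 0.0
        for x, y in data:
            margin = y * (w[0] * x[0] + w[1] * x[1] + b)
            if loss == "logistic":
                # d/dm log(1 + exp(-m)); the clamp avoids overflow in exp
                g = -1.0 / (1.0 + math.exp(min(margin, 50.0)))
            else:
                # subgradient of max(0, 1 - m)
                g = -1.0 if margin < 1.0 else 0.0
            gw[0] += g * y * x[0] / n
            gw[1] += g * y * x[1] / n
            gb += g * y / n
        w = [w[0] - lr * gw[0], w[1] - lr * gw[1]]
        b -= lr * gb
    return w, b


def error_rate(model, data):
    """Fraction of points on the wrong side of the fitted hyperplane."""
    w, b = model
    wrong = sum(1 for x, y in data if y * (w[0] * x[0] + w[1] * x[1] + b) <= 0)
    return wrong / len(data)


def compare(n_rep=20, n=50, shift=2.0, seed=1):
    """Average test misclassification rate of both classifiers over n_rep runs."""
    rng = random.Random(seed)
    totals = {"logistic": 0.0, "hinge": 0.0}
    for _ in range(n_rep):
        train = simulate(n, shift, rng)
        test = simulate(n, shift, rng)
        for loss in totals:
            totals[loss] += error_rate(fit_linear(train, loss), test)
    return {loss: total / n_rep for loss, total in totals.items()}


if __name__ == "__main__":
    print(compare())
```

Calling `compare()` returns a dictionary with one average misclassification rate per method. Varying `shift` or `n` in this sketch changes how separable the classes are and how much training data each classifier sees, which are natural factors for a comparison of this type.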

Resumen

La clasificación de individuos es un problema muy común en el trabajo estadístico aplicado. Si X es un conjunto de datos de una población en la que sus elementos pertenecen a g clases, el objetivo de los métodos de clasificación es determinar a cuál de ellas pertenecerá una nueva observación. Cuando g=2, uno de los métodos más utilizados es la regresión logística. Recientemente, las Máquinas de Soporte Vectorial se han convertido en una alternativa importante. En este trabajo se exponen los principios básicos de ambos métodos y se da respuesta a la pregunta de cuál es más recomendable para discriminar, vía simulación. Finalmente, se presenta una aplicación con datos provenientes de un experimento con microarreglos.

Palabras clave: clasificación, genética, máquinas de soporte vectorial, regresión logística, simulación.

Full text available in PDF

[Received September 2011. Accepted February 2012]

This article can be cited in LaTeX using the following BibTeX reference:

@ARTICLE{RCEv35n2a03,
  AUTHOR  = {Salazar, Diego Alejandro and Vélez, Jorge Iván and Salazar, Juan Carlos},
  TITLE   = {{Comparison between SVM and Logistic Regression: Which One is Better to Discriminate?}},
  JOURNAL = {Revista Colombiana de Estadística},
  YEAR    = {2012},
  VOLUME  = {35},
  NUMBER  = {2},
  PAGES   = {223-237}
}