Supervised Learning for data cleaning in the coherence and completeness dimensions

Amézquita, Juan C; Eslava, Hermes J

doi:10.25100/iyc.v24i2.11361

Ingeniería y competitividad

Print version ISSN 0123-3033; On-line version ISSN 2027-8284

Abstract

AMEZQUITA, Juan C and ESLAVA, Hermes J. Supervised Learning for data cleaning in the coherence and completeness dimensions. Ing. compet. [online]. 2022, vol.24, n.2, e21011361. Epub May 26, 2022. ISSN 0123-3033. https://doi.org/10.25100/iyc.v24i2.11361.

Information has become an asset for companies because most strategic business decisions are based on data analysis; however, these analyses do not always yield the best results because of the low quality of the information. Information quality has several evaluation dimensions, which makes achieving an adequate level of quality a complex task. One of the main activities before proceeding with any type of analysis is the pre-processing of the data. This activity is among the most time-consuming, yet the expected levels of quality are not always reached, nor are the evaluation dimensions with the greatest impact always covered. This work presents the use of machine learning as a tool to clean data in the completeness and coherence dimensions; it is validated on a data set provided by a government entity in charge of protecting children’s rights at the national level. The work proceeds through the selection of the information processing tools, the descriptive analysis of the data, the identification of the specific problems to which machine learning techniques will be applied to improve data quality, the experimentation with and evaluation of the different models, and finally the implementation of the best-performing model. Among the results of this work, there is an improvement in the completeness dimension, decreasing the null data by 4.9%. In the coherence dimension, 2.6% of the records were identified as containing contradictions, thus validating machine learning for data cleaning.

Keywords: Quality; Data; Machine learning; Completeness; Coherence.
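
The general idea summarized in the abstract can be illustrated with a minimal sketch. It is not the authors' model or data: the field names (`age`, `group`), the toy records, and the choice of a k-nearest-neighbour classifier are all assumptions made for illustration. A classifier is trained on the records whose target field is present; its predictions fill the null values (completeness) and flag records whose stored value disagrees with the prediction (coherence).

```python
from collections import Counter

def knn_predict(train, query, k=3):
    """Majority vote among the k training rows closest to `query`."""
    nearest = sorted(
        train,
        key=lambda row: sum((a - b) ** 2 for a, b in zip(row[0], query)),
    )[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

def clean(records, feature_keys, target_key, k=3):
    """Return (records with nulls imputed, indices of suspect records)."""
    labeled = [r for r in records if r[target_key] is not None]
    train = [(tuple(r[f] for f in feature_keys), r[target_key]) for r in labeled]
    cleaned, suspects = [], []
    for i, r in enumerate(records):
        x = tuple(r[f] for f in feature_keys)
        if r[target_key] is None:
            # Completeness: fill the missing value with the model's prediction.
            cleaned.append({**r, target_key: knn_predict(train, x, k)})
        else:
            # Coherence: predict from all *other* labeled rows (leave-one-out)
            # and flag the record if its stored value contradicts the prediction.
            others = [
                (tuple(o[f] for f in feature_keys), o[target_key])
                for o in labeled
                if o is not r
            ]
            if knn_predict(others, x, k) != r[target_key]:
                suspects.append(i)
            cleaned.append(dict(r))
    return cleaned, suspects

# Hypothetical toy records; these are not taken from the paper's data set.
records = [
    {"age": 2, "group": "early"},
    {"age": 3, "group": "early"},
    {"age": 4, "group": "early"},
    {"age": 5, "group": "early"},
    {"age": 9, "group": "child"},
    {"age": 10, "group": "child"},
    {"age": 11, "group": "child"},
    {"age": 10, "group": None},     # missing value to impute
    {"age": 3, "group": "child"},   # contradicts its neighbours
]

cleaned, suspects = clean(records, ["age"], "group")
print(cleaned[7]["group"], suspects)
```

Flagged records are only candidates for review: a disagreement between the model and the stored value may reflect a model error rather than a genuine contradiction in the data.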
