Supervised Learning for data cleaning in the coherence and completeness dimensions

Amézquita, Juan C; Eslava, Hermes J

doi:10.25100/iyc.v24i2.11361


Ingeniería y competitividad

Print version ISSN 0123-3033; online version ISSN 2027-8284

Abstract

AMEZQUITA, Juan C and ESLAVA, Hermes J. Supervised Learning for data cleaning in the coherence and completeness dimensions. Ing. compet. [online]. 2022, vol.24, n.2, e21011361. Epub May 26, 2022. ISSN 0123-3033. https://doi.org/10.25100/iyc.v24i2.11361.

Information has become an asset for companies because most strategic business decisions are based on data analysis; however, these analyses do not always yield the best results because the underlying information is of low quality. Information quality has several evaluation dimensions, which makes achieving an adequate level of quality a complex task. One of the main activities before any type of analysis is data pre-processing. This activity is among the most time-consuming, yet the expected levels of quality are not always obtained, nor are the evaluation dimensions with the most significant impact always covered. This work presents the use of machine learning as a tool to clean data in the completeness and coherence dimensions; it is validated on a data set provided by a government entity in charge of protecting children’s rights at the national level. The work proceeds from the selection of the information processing tools, through the descriptive analysis of the data, the identification of the specific problems to which machine learning techniques will be applied to improve data quality, and the experimentation with and evaluation of different models, to the implementation of the best-performing model. The results include an improvement in the completeness dimension, with null data decreasing by 4.9%. In the coherence dimension, 2.6% of the records were identified as containing contradictions, thus validating machine learning for data cleaning.

Keywords: Quality; Data; Machine learning; Completeness; Coherence.
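
The general idea described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not name the models or tools used, so a small k-nearest-neighbour classifier written in plain Python stands in for them here. The column names (`age_group`, `school_level`, `program`) and the toy records are hypothetical. For completeness, a model trained on the records that have a value predicts the missing ones; for coherence, a record is flagged when a model trained on all the other records disagrees with its recorded value.

```python
from collections import Counter

def hamming(a, b):
    """Number of feature positions where two records differ."""
    return sum(x != y for x, y in zip(a, b))

def knn_predict(train, features, k=3):
    """Majority label among the k training rows closest to `features`.

    `train` is a list of (feature_tuple, label) pairs.
    """
    nearest = sorted(train, key=lambda row: hamming(row[0], features))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

def clean(records, target, k=3):
    """Return (cleaned_records, flagged_indices).

    Completeness: a missing `target` value is imputed from labelled rows.
    Coherence: a labelled row is flagged (not changed) when the model
    trained on all other labelled rows predicts a different value.
    """
    feature_keys = [key for key in records[0] if key != target]

    def to_features(record):
        return tuple(record[key] for key in feature_keys)

    labelled = [(i, to_features(r), r[target])
                for i, r in enumerate(records) if r[target] is not None]

    cleaned, flagged = [], []
    for i, record in enumerate(records):
        # Leave the current row out so it cannot vote for itself.
        train = [(f, y) for j, f, y in labelled if j != i]
        predicted = knn_predict(train, to_features(record), k)
        if record[target] is None:
            cleaned.append({**record, target: predicted})
        else:
            cleaned.append(dict(record))
            if predicted != record[target]:
                flagged.append(i)
    return cleaned, flagged

# Hypothetical toy data: row 6 has a null target, row 7 contradicts its peers.
RECORDS = [
    {"age_group": "infant", "school_level": "none", "program": "nursery"},
    {"age_group": "infant", "school_level": "none", "program": "nursery"},
    {"age_group": "infant", "school_level": "none", "program": "nursery"},
    {"age_group": "teen", "school_level": "secondary", "program": "youth"},
    {"age_group": "teen", "school_level": "secondary", "program": "youth"},
    {"age_group": "teen", "school_level": "secondary", "program": "youth"},
    {"age_group": "infant", "school_level": "none", "program": None},
    {"age_group": "teen", "school_level": "secondary", "program": "nursery"},
]

cleaned, flagged = clean(RECORDS, "program")
print(cleaned[6]["program"], flagged)
```

In this sketch flagged records are only reported, not corrected, which mirrors the abstract's wording that contradictory records were identified. A real pipeline would also hold out data to evaluate and compare candidate models before applying the best one, as the abstract describes.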
