SciELO - Scientific Electronic Library Online

 
Ingeniería y competitividad

Print version ISSN 0123-3033
On-line version ISSN 2027-8284

Abstract

AMEZQUITA, Juan C and ESLAVA, Hermes J. Supervised Learning for data cleaning in the coherence and completeness dimensions. Ing. compet. [online]. 2022, vol.24, n.2, e21011361. Epub May 26, 2022. ISSN 0123-3033. https://doi.org/10.25100/iyc.v24i2.11361.

Information has become an asset for companies because most strategic business decisions are based on data analysis; however, these analyses do not always yield the best results because of low information quality. Information quality has several evaluation dimensions, which makes achieving an adequate level of quality a complex task. One of the main activities before any type of analysis is data pre-processing. This activity is among the most time-consuming, yet the expected levels of quality are not always obtained, nor are the evaluation dimensions with the most significant impact always covered. This work presents the use of machine learning as a tool to clean data in the completeness and coherence dimensions; it is validated on a data set provided by a government entity in charge of protecting children’s rights at the national level. The work proceeds through the selection of the information processing tools, the descriptive analysis of the data, the identification of the specific problems to which machine learning techniques will be applied to improve data quality, the experimentation with and evaluation of different models, and finally the implementation of the best-performing model. Among the results, the completeness dimension improved, with null data decreasing by 4.9%. In the coherence dimension, 2.6% of the records were identified as containing contradictions, thus validating machine learning for data cleaning.
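As a rough illustration of the idea described in the abstract, the sketch below uses a small supervised classifier (a k-nearest-neighbours vote, chosen here only for brevity) to fill null values and to flag records whose recorded value contradicts what the other records predict. The abstract does not state which models, features, or tools the authors used, so the field names (`age`, `grade`, `group`), the sample records, and all function names are hypothetical assumptions.

```python
from collections import Counter

def completeness(records, field):
    """Share of records whose `field` is not None."""
    filled = sum(1 for r in records if r[field] is not None)
    return filled / len(records)

def knn_predict(train, features, target, row, k=3):
    """Majority vote among the k training rows closest to `row`."""
    def distance(a, b):
        return sum((a[f] - b[f]) ** 2 for f in features)
    nearest = sorted(train, key=lambda t: distance(t, row))[:k]
    votes = Counter(t[target] for t in nearest)
    return votes.most_common(1)[0][0]

def impute(records, features, target, k=3):
    """Completeness: fill null targets using a model trained on complete rows."""
    train = [r for r in records if r[target] is not None]
    cleaned = []
    for r in records:
        if r[target] is None:
            r = {**r, target: knn_predict(train, features, target, r, k)}
        cleaned.append(r)
    return cleaned

def flag_incoherent(records, features, target, k=3):
    """Coherence: indices of records whose recorded target disagrees with
    a leave-one-out prediction made from all other labelled records."""
    flagged = []
    for i, r in enumerate(records):
        if r[target] is None:
            continue
        others = [o for j, o in enumerate(records)
                  if j != i and o[target] is not None]
        if knn_predict(others, features, target, r, k) != r[target]:
            flagged.append(i)
    return flagged

# Hypothetical toy records (not the data set used in the paper).
records = [
    {"age": 4,  "grade": 0,  "group": "child"},
    {"age": 5,  "grade": 0,  "group": "child"},
    {"age": 6,  "grade": 1,  "group": "child"},
    {"age": 15, "grade": 9,  "group": "teen"},
    {"age": 16, "grade": 10, "group": "teen"},
    {"age": 17, "grade": 11, "group": "teen"},
    {"age": 5,  "grade": 0,  "group": None},      # null value to impute
    {"age": 16, "grade": 10, "group": "child"},   # contradictory label
]
features = ["age", "grade"]

before = completeness(records, "group")
cleaned = impute(records, features, "group")
after = completeness(cleaned, "group")
suspect = flag_incoherent(records, features, "group")
```

On this toy data the null `group` is filled from its nearest labelled neighbours and the contradictory record is the only one flagged. A real pipeline would presumably compare several models on held-out data before choosing one, as the abstract describes.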

Keywords: Quality; Data; Machine learning; Completeness; Coherence.
