SciELO - Scientific Electronic Library Online

 


Ingeniería e Investigación

Print version ISSN 0120-5609

Abstract

CADAVID RENGIFO, Héctor Fabio and GOMEZ PERDOMO, Jonatan. Web text corpus extraction system for linguistic tasks. Ing. Investig. [online]. 2009, vol.29, n.3, pp.54-60. ISSN 0120-5609.

Internet content, used as a text corpus for natural language learning, offers important advantages for that task: its huge volume, its permanently up-to-date coverage of linguistic variants, and its low cost in time and resources compared with the traditional way of building text for natural language machine learning tasks. This paper describes a system for automatically extracting large bodies of text from the Internet as a tool for such learning tasks. It presents a hardware-use optimisation strategy, based on concurrent programming, that significantly improves extraction performance and reduces extraction time by maximising the exploitation of hardware resources. It also presents the system's extendibility (support for digital-content formats) and adaptability (how the system cleanses content to obtain pure natural language samples). Experimental results obtained by processing one of the biggest Spanish-language domains on the Internet (es.wikipedia.org) are reported and used to draw initial conclusions about the validity and applicability of corpora extracted directly from the Internet as input for morphological or syntactic learning.

Keywords: web corpus; crawler; unsupervised language learning; concurrent programming.
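
The abstract names two mechanisms: concurrent extraction and content cleansing. The following is only a minimal sketch of how the two might fit together, not the authors' implementation. It assumes a thread pool for concurrency and tag stripping as the cleansing step; the names `TextExtractor`, `cleanse`, `extract_corpus` and the `fetch` callback are all hypothetical, and the example reads from an in-memory dictionary rather than from the network.

```python
from concurrent.futures import ThreadPoolExecutor
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects text nodes, skipping the contents of <script> and <style>."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self.parts.append(data)


def cleanse(html):
    """Hypothetical cleansing step: strip markup and collapse whitespace."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(" ".join(parser.parts).split())


def extract_corpus(urls, fetch, max_workers=4):
    """Fetch and cleanse pages concurrently.

    `fetch` is any callable mapping a URL to an HTML string (an assumption
    made here so the sketch stays independent of a particular HTTP client).
    """
    def process(url):
        return cleanse(fetch(url))

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves input order, so results line up with urls.
        return dict(zip(urls, pool.map(process, urls)))


# Usage with an in-memory stand-in for the web (illustrative data only).
PAGES = {
    "page-a": "<html><head><style>p {}</style></head>"
              "<body><p>Hola <b>mundo</b></p><script>var x = 1;</script></body></html>",
    "page-b": "<p>Texto   de\nprueba</p>",
}
corpus = extract_corpus(list(PAGES), PAGES.get)
```

Threads suit this kind of workload because page retrieval is dominated by waiting on I/O; a real crawler would also need link discovery, politeness limits and format-specific extractors, which this sketch leaves out.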


 

All content of this journal, except where otherwise noted, is licensed under a Creative Commons License.