Web text corpus extraction system for linguistic tasks

Cadavid Rengifo, Héctor Fabio; Gómez Perdomo, Jonatan

Ingeniería e Investigación

Print version ISSN 0120-5609

Abstract

CADAVID RENGIFO, Héctor Fabio and GOMEZ PERDOMO, Jonatan. Web text corpus extraction system for linguistic tasks. Ing. Investig. [online]. 2009, vol.29, n.3, pp.54-60. ISSN 0120-5609.

Internet content, used as a text corpus for natural language learning, offers important advantages for this task: its huge volume, its being permanently up to date with linguistic variants, and its low time and resource costs compared with the traditional way of building text corpora for natural language machine learning tasks. This paper describes a system for automatically extracting large bodies of text from the Internet as a tool for such learning tasks. A hardware-use optimisation strategy based on concurrent programming, which significantly improves extraction performance, is also presented. The strategies incorporated into the system for maximising hardware resource exploitation, and thereby reducing extraction time, are described, as are its extendibility (supporting digital-content formats) and adaptability (how the system cleanses content to obtain pure natural language samples). Experimental results obtained after processing one of the biggest Spanish-language domains on the Internet (es.wikipedia.org) are presented. These results are used to draw initial conclusions about the validity and applicability of corpora extracted directly from the Internet as input for morphological or syntactic learning.
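
To make the idea of concurrent extraction followed by content cleansing concrete, here is a minimal Python sketch. It is not the system described in the paper: the names `TextExtractor`, `cleanse`, `extract_corpus`, and the `fetch` parameter are hypothetical, a thread pool stands in for whatever concurrency strategy the authors used, and the cleansing step simply drops markup plus script and style content.

```python
from concurrent.futures import ThreadPoolExecutor
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects visible text, skipping <script> and <style> content."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text found outside skipped elements.
        if self._skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())


def cleanse(html):
    """Reduce an HTML page to plain text (assumed cleansing rule)."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


def extract_corpus(urls, fetch, workers=8):
    """Fetch and cleanse pages concurrently.

    `fetch` is any callable mapping a URL to its HTML source; it is a
    parameter so the sketch can run without network access.
    """
    def process(url):
        return cleanse(fetch(url))

    # Threads suit this workload because fetching is I/O-bound.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(process, urls))
```

With an in-memory stand-in for the network, `extract_corpus(["a"], {"a": "<p>Hola mundo</p>"}.__getitem__)` returns one cleansed text sample per URL, in input order. A real crawler would also need link discovery, politeness limits, and error handling, none of which are shown here.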

Keywords: web corpus; crawler; unsupervised language learning; concurrent programming.
