I. Introduction
Metaheuristic algorithms are applied daily across many fields of knowledge, so it is not unusual to find them used for Part-of-Speech Tagging (POST), also called part-of-speech identification. POST is a complex and important task in Natural Language Processing, given the challenges it poses: word ambiguity, the size of the tag set, and the tagging of unknown words [1, 2].
In the tagging problem, metaheuristic algorithms have been used to assign the best sequence of tags (roles) to the words of a sentence, drawing on both statistical information and transformation rules, and have obtained outstanding results compared with traditional approaches. Related work includes: 1) Alhasan and Al-taani [3], who represented the tagging problem as a graph whose nodes are the possible tags of a sentence and used Bee Colony Optimization (BCO) to find the best solution path. 2) Sierra Martínez, Cobos and Corrales [4], who proposed a memetic algorithm for tagging based on Global-Best Harmony Search (GBHS) that includes knowledge of the problem through a local optimization strategy based on the Hill Climbing algorithm. 3) Forsati and Shamsfard [5], who presented two improvements to the HSTAgger tagger based on the Harmony Search metaheuristic, called HSTAgger (I) and HSTAgger (II), which increase search efficiency and improve the selection of new solutions for the harmony memory. 4) Ekbal and Saha [6], who addressed the tagging problem using single-objective and multiobjective optimization based on the Simulated Annealing-Based Multiobjective Optimization Algorithm proposed in [7], exploiting the search capacity of simulated annealing.
These metaheuristic approaches have been applied to tagged corpora in English, such as the Brown Corpus [8] and the Penn Treebank Corpus [9], and in non-traditional languages such as Arabic with the KALIMAT corpus [10], Bengali (Bangladesh) [11], Hindi (India) [12], Telugu (India) [13], and Nasa Yuwe (an indigenous language of Colombia) [14]. Generally, these proposals use the Petrov tag set [15].
Metaheuristic algorithms solve problems through a search process (exploration and exploitation) for optimal solutions to a particular problem [16]. Memetic algorithms [17], in particular, use population-based search to explore solutions and neighborhood-based local search to exploit promising ones [18, 19]. They also incorporate knowledge of the problem being solved. Table 1 describes the metaheuristics studied in this research for their subsequent adaptation to the tagging problem.
Figure 1 shows the solution representation used in this research, which consists of: 1) a first vector with one position per word of the sentence, containing the tag assigned to each word, from position 0 to position n-1 (T0, T1,…,Tn−1); 2) a second vector containing the cumulative probability of each tagged word, taking into account its relationship with its predecessor and successor, and 3) a field that stores the value of the fitness function, calculated as shown in Figure 1, adapted from [5]. In GBHS Tagger, the context selected for the word to be tagged is a trigram (predecessor, word to tag, successor).
In the present work, several metaheuristic algorithms were adapted to the tagging problem using the solution representation proposed in [4], with two aims: to propose improvements to the memetic algorithm presented in that work, and to evaluate performance on the IULA (Spanish) [24], Brown (English) [8] and Nasa Yuwe [14] corpora.
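The three-part representation described above can be sketched as a small data structure. This is only an illustrative sketch: the names TaggingSolution, evaluate and word_prob are hypothetical, and the fitness is assumed here to be the sum of the per-word probabilities, which may differ from the exact function given in Figure 1.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class TaggingSolution:
    """One candidate tagging of a sentence (hypothetical names)."""
    tags: List[str]                                    # T0 ... Tn-1, one tag per word
    probs: List[float] = field(default_factory=list)   # per-word contextual probability
    fitness: float = 0.0                               # value of the fitness function


def evaluate(
    solution: TaggingSolution,
    word_prob: Callable[[Optional[str], str, Optional[str], int], float],
) -> TaggingSolution:
    """Fill `probs` and `fitness` using a trigram context.

    `word_prob(prev_tag, tag, next_tag, position)` is a hypothetical callback
    standing in for the statistics learned from the training corpus.
    ASSUMPTION: fitness is the sum of the per-word probabilities.
    """
    n = len(solution.tags)
    solution.probs = []
    for i, tag in enumerate(solution.tags):
        prev_tag = solution.tags[i - 1] if i > 0 else None      # predecessor
        next_tag = solution.tags[i + 1] if i < n - 1 else None  # successor
        solution.probs.append(word_prob(prev_tag, tag, next_tag, i))
    solution.fitness = sum(solution.probs)
    return solution
```

Keeping the per-word probabilities next to the tags lets a local search find the weakest position of a solution without re-evaluating the whole sentence.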
The rest of the article is organized as follows: Section 2 presents the methodology used; Section 3 details the adaptation of the selected metaheuristics to the tagging problem; Section 4 shows the results of the experiments carried out, and, finally, Section 5 presents the discussion, conclusions and future work.
II. Methodology
This section describes the datasets used to evaluate the algorithms, the activities carried out in each phase of the cycles of the Iterative Research Pattern (IRP) methodology [25] followed in this work, and the experimental setup.
A. Method Used
Two cycles were used for this research. The first cycle focused on the adaptation of the metaheuristic algorithms to the tagging problem and the selection of the best one; the second cycle focused on the adaptation of the selected metaheuristic algorithms to the tagging problem and the proposal of a new version of the memetic algorithm. Table 2 describes the activities carried out in each phase.
B. Dataset and Experimental Setup
As part of this work, the IULA (Spanish), Brown (English) and Nasa Yuwe tagged corpora were integrated into a single database designed and developed in SQL Server. The experiments, both preliminary and complete, were carried out on this database using a client-server model in which the clients (machines) request the tasks to be carried out. Each task receives the sentence and the algorithm that it must run and evaluate. Each task is executed 30 times (repetitions of the experiment) on the local machine and, once it has finished, the results are recorded in the cloud database.
III. Results
This section first presents the adaptation of the algorithms to the tagging problem and a new version of the memetic GBHS Tagger (GBHS4Tagger). It then presents the experiments and the results obtained with the proposed taggers. All the adapted algorithms use the solution representation presented in [4] and described in Figure 1.
A. Proposed JayaTagger
A discrete version of Jaya, called DJaya and proposed in [27], was used; it is free of algorithm-specific parameters. The adaptation consisted of moving toward the best-known solution and away from the worst solution, and the handling of the worst solution was varied through a parameter. The JayaTagger algorithm handles only three parameters: PopulationSize, MaxGenerations and P4. The latter governs the selection of a tag from the worst solution Xw when the new solution is built. With so few parameters, the algorithm is simple to implement and evaluate. Figure 2 presents the proposed JayaTagger pseudocode.
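As a rough illustration of the "toward the best, away from the worst" idea, the sketch below shows one possible reading of a discrete Jaya move. It is not the pseudocode of Figure 2: the function name jaya_move, the candidates argument (the allowed tags per word) and the interpretation of P4 as the probability of avoiding the worst solution's tag are all assumptions made for the example.

```python
import random


def jaya_move(best, worst, candidates, p4, rng=random):
    """Build a new tag sequence from the best and worst solutions.

    ASSUMPTION: each position first adopts the tag of the best solution;
    if that tag coincides with the tag of the worst solution Xw, then with
    probability p4 another allowed tag is drawn instead.
    """
    new_tags = []
    for i, best_tag in enumerate(best):
        proposal = best_tag                      # move toward the best solution
        if proposal == worst[i] and rng.random() < p4:
            # move away from the worst solution when an alternative exists
            alternatives = [t for t in candidates[i] if t != worst[i]]
            if alternatives:
                proposal = rng.choice(alternatives)
        new_tags.append(proposal)
    return new_tags
```

In a full tagger, the new sequence would replace the current solution only if its fitness improves, which is the greedy acceptance rule of Jaya.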
B. Proposed PSOTagger
A discrete version of PSO [26] was used, and the proposed adaptation relies on the following parameters: W, which selects a random tag for each dimension of a particle; C1, which selects the tag from the particle's own best historical position for that word; C2, which selects the tag from the global best of the swarm for each dimension of the particle, and P, which keeps the component of the current particle. PSOTagger also uses the PopulationSize and MaxGenerations parameters of the original version. The W, C1, C2 and P parameters were tuned experimentally using 5-fold cross-validation on a small sample of the evaluation dataset. The PSOTagger pseudocode is presented in Figure 3.
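The role of W, C1, C2 and P can be illustrated with a per-dimension roulette. This is a minimal sketch under the assumption that the four parameters act as selection weights for each word; the function name pso_update and the candidates argument are hypothetical and the actual procedure is the one in Figure 3.

```python
import random


def pso_update(particle, personal_best, global_best, candidates,
               w, c1, c2, p, rng=random):
    """Return the new position (tag sequence) of one particle.

    ASSUMPTION: for every word, one of four sources is chosen with
    probability proportional to w, c1, c2 and p respectively.
    """
    total = w + c1 + c2 + p
    if total <= 0:
        raise ValueError("at least one of w, c1, c2, p must be positive")
    new_tags = []
    for i, tag in enumerate(particle):
        r = rng.random() * total
        if r < w:
            new_tags.append(rng.choice(candidates[i]))  # W: random tag
        elif r < w + c1:
            new_tags.append(personal_best[i])           # C1: particle's best
        elif r < w + c1 + c2:
            new_tags.append(global_best[i])             # C2: swarm's best
        else:
            new_tags.append(tag)                        # P: keep current tag
    return new_tags
```

Treating the parameters as weights keeps the update purely discrete: no velocity vector is needed, only a choice among tags that already exist for the word.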
C. Proposed Random-Restart Hill Climbing (RRHC) Tagger
The RRHCTagger algorithm was adapted to the tagging problem as follows. 1) It uses these parameters: n_restart, which controls the number of restarts of solution S; Prob, a list that stores the probabilities of the possible tags of a word; ActiveIndex, a list that stores the positions of the words that have more than one tag, and StatusTrigram, a list that stores the words selected for a stochastic improvement. 2) The solution is improved stochastically. After a certain number of iterations without improvement, the algorithm saves the current result and restarts the solution (up to n_restart times), selecting another word from all the possibilities. 3) A tabu memory was implemented, which stores the words that were selected in the solution restarts. Figure 4 presents the proposed RRHCTagger pseudocode.
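The three steps above can be sketched as a generic random-restart hill climber over tag sequences. This is a simplified illustration, not the pseudocode of Figure 4: the names rrhc_tagger, max_stale and fitness are hypothetical, the Prob and StatusTrigram lists are omitted, and a restart is assumed to re-draw the tag of one word that is not yet in the tabu memory.

```python
import random


def rrhc_tagger(candidates, fitness, n_restart, max_stale, rng=random):
    """Random-restart hill climbing over tag sequences (simplified sketch).

    candidates[i] lists the allowed tags of word i (never empty);
    fitness(tags) returns a score to maximize.
    """
    # positions of words with more than one possible tag (ActiveIndex)
    active_index = [i for i, tags in enumerate(candidates) if len(tags) > 1]
    current = [tags[0] for tags in candidates]
    best, best_fit = list(current), fitness(current)
    tabu = set()  # words already used for a restart

    for _ in range(n_restart + 1):
        cur_fit, stale = fitness(current), 0
        # stochastic improvement until max_stale iterations bring no gain
        while active_index and stale < max_stale:
            i = rng.choice(active_index)
            neighbor = list(current)
            neighbor[i] = rng.choice(candidates[i])
            neighbor_fit = fitness(neighbor)
            if neighbor_fit > cur_fit:
                current, cur_fit, stale = neighbor, neighbor_fit, 0
            else:
                stale += 1
        if cur_fit > best_fit:  # save the current result
            best, best_fit = list(current), cur_fit
        # restart: perturb one word that is not in the tabu memory
        free = [i for i in active_index if i not in tabu]
        if not free:
            break
        i = rng.choice(free)
        tabu.add(i)
        current = list(current)
        current[i] = rng.choice(candidates[i])
    return best, best_fit
```

The tabu set prevents successive restarts from perturbing the same word, so each restart explores a different part of the sentence.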
D. Proposed GBHS4Tagger
The GBHS4Tagger algorithm is based on the GBHSTagger memetic algorithm proposed in [4], and its improvement consists of the following steps. 1) The Hill Climbing (HC) algorithm was adapted to the tagging problem using two neighborhoods. The first selects a random word, regardless of its condition, and the second selects the word with the lowest probability. The choice between these neighborhoods is controlled by the Prob parameter. 2) The proposed HCTagger was incorporated into GBHS Tagger 2 [4] as a local optimizer, giving rise to the new memetic version called GBHS4Tagger. Figure 5 shows the proposed HCTagger pseudocode and Figure 6 shows the proposed GBHS4Tagger.
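The two neighborhoods of the local optimizer can be illustrated as follows. This sketch assumes that Prob is the probability of using the first (random word) neighborhood; the source does not state which direction the parameter works in. The names pick_word and hc_step are hypothetical and the actual procedure is the one in Figure 5.

```python
import random


def pick_word(probs, prob, rng=random):
    """Choose the position of the word to retag.

    ASSUMPTION: with probability `prob` a random word is chosen
    (neighborhood 1); otherwise the word with the lowest probability
    is chosen (neighborhood 2).
    """
    if rng.random() < prob:
        return rng.randrange(len(probs))
    return min(range(len(probs)), key=probs.__getitem__)


def hc_step(tags, probs, candidates, fitness, prob, rng=random):
    """One hill-climbing step: retag one word, keep the change only if it helps."""
    i = pick_word(probs, prob, rng)
    neighbor = list(tags)
    neighbor[i] = rng.choice(candidates[i])
    return neighbor if fitness(neighbor) > fitness(tags) else list(tags)
```

Mixing the two neighborhoods balances a greedy repair of the weakest word against random moves that can escape a local optimum.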
E. Experiments with the proposed taggers
To carry out the experiments, the parameters of the taggers were first fine-tuned on a small sample dataset of 5000 sentences in order to select the best combination of parameters for each algorithm. Table 3 shows the distribution of the sentences in the test and training datasets for each fold used in the experiments on each complete corpus, whose results are given in Table 5. All the experiments were executed using 5-fold cross-validation, except for the Nasa Yuwe corpus, for which Leave-One-Out was used, since that dataset has only 175 sentences.
Table 4 presents the configuration of the algorithms for the experiments carried out with each corpus.
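The two validation schemes differ only in the number of folds, as the following minimal sketch shows. The function name k_fold_indices and the round-robin assignment of sentences to folds are assumptions for illustration; the actual distribution used in the experiments is the one in Table 3.

```python
def k_fold_indices(n_items, k):
    """Yield (train, test) index lists for k-fold cross-validation.

    Sentences are assigned to folds round-robin (an assumption).
    Setting k == n_items gives Leave-One-Out.
    """
    folds = [list(range(i, n_items, k)) for i in range(k)]
    for test in folds:
        held_out = set(test)
        train = [j for j in range(n_items) if j not in held_out]
        yield train, test
```

With this helper, k_fold_indices(n, 5) produces the five train/test splits of a corpus of n sentences, while k_fold_indices(175, 175) produces the 175 single-sentence test sets of Leave-One-Out for a corpus of the size of Nasa Yuwe.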
Table 5 presents the results of the experiments carried out on the three complete corpora. GBHS4Tagger surpassed the other algorithms in precision on the IULA and Brown corpora and was the second best on the Nasa Yuwe corpus. The adapted algorithms obtained very good results for this problem, but the precision values differ between algorithms, showing that some algorithms perform better than others on a given problem, which is consistent with the second of the No Free Lunch Theorems for Optimization (NFLT) [28].
Table 6 shows the ranking of each algorithm in the experiments carried out on each corpus after applying the Friedman NxN non-parametric statistical test. The test yielded a p-value smaller than 0.05, so the ranking is statistically significant, which complements the evaluation of the algorithms.
Additionally, the Wilcoxon test showed, at a 90% confidence level, that the results obtained by the winning algorithms (GBHS4Tagger for Spanish and English, and RRHCTagger for Nasa Yuwe) are better than those of the other proposed taggers.
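For readers unfamiliar with the Friedman test, the sketch below shows how mean ranks and the basic Friedman statistic can be computed from a table of scores. It is a simplified illustration with hypothetical function names: it applies no tie correction and does not compute the p-value, which is obtained by comparing the statistic with a chi-square distribution with k-1 degrees of freedom. It is not the software used to produce Table 6.

```python
def mean_ranks(scores):
    """scores[d][a] is the score of algorithm a on dataset (or fold) d.

    Rank 1 is the best (highest) score; tied scores share the average rank.
    """
    k = len(scores[0])
    totals = [0.0] * k
    for row in scores:
        order = sorted(range(k), key=lambda a: -row[a])
        pos = 0
        while pos < k:
            end = pos
            while end + 1 < k and row[order[end + 1]] == row[order[pos]]:
                end += 1
            average_rank = (pos + end) / 2 + 1
            for a in order[pos:end + 1]:
                totals[a] += average_rank
            pos = end + 1
    return [t / len(scores) for t in totals]


def friedman_statistic(scores):
    """Friedman chi-square statistic without tie correction."""
    n, k = len(scores), len(scores[0])
    ranks = mean_ranks(scores)
    return 12 * n / (k * (k + 1)) * (sum(r * r for r in ranks) - k * (k + 1) ** 2 / 4)
```

A lower mean rank indicates a better algorithm, and a larger statistic indicates stronger evidence that the algorithms do not all perform alike.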
V. Discussion and conclusions
This work adapted the metaheuristic algorithms PSO, Jaya and RRHC to the problem of part-of-speech tagging (POST), taking into account the characteristics of each algorithm and performing the parameter adjustment required for each algorithm on each corpus, and obtained competitive results with respect to one of the state-of-the-art algorithms. It also proposed an improvement to the state-of-the-art GBHS Tagger 2 memetic algorithm, which again showed that the performance of the tagger improves when knowledge of the problem is included, as seen on the IULA (Spanish) and Brown (English) corpora. Consequently, this research reinforces the idea that metaheuristic approaches are capable of performing tagging with good results and with acceptable resources and times. Future work should continue to apply metaheuristic algorithms to tagging in other traditional and non-traditional languages, and should seek new improvements to the proposed taggers in combination with other optimization techniques that improve the tagging results.