I. INTRODUCTION
Saber Pro is part of a toolset used by the Colombian State to assess the quality of higher education. One of its objectives is to assess the development level of specific competencies in students about to complete undergraduate degree programs offered by higher education institutions [1]. Saber Pro is a compulsory requirement for graduation according to Act 1324 of 2009, and it is applied once a year. This test evaluates general and specific competencies. The general competencies assessment is divided into sections: critical reading, written communication, quantitative reasoning, English, and citizenship competencies [1]. Specifically, the critical reading competency assesses the performance associated with reading, critical thinking, and interpersonal understanding [2].
According to Timarán et al. [3], the studies carried out so far at the national level [4,5,6] regarding the Saber Pro exam are based on information processed through statistical analysis, where mainly variables and primary relationships are considered. They do not consider the actual interrelations, which are usually hidden and can only be described by employing more complex data analyses, such as data mining.
The use of data mining in education is not a new topic. Its study and implementation have been very relevant in recent years, and its techniques can be used to explain or predict phenomena within the educational field [7]. For example, it is possible to predict the dropout probability of a student with a very high reliability rate through data mining techniques [7,8,9]. Furthermore, educational institutions can use data mining to analyze their students' characteristics or evaluation methods comprehensively and thus discover successful methodologies, fraud, or inconsistencies [9].
This article presents the application of the decision tree-based classification technique to predict the performance in the critical reading section of the Saber Pro exams taken by students from the Pontificia Universidad Javeriana Cali in 2017 and 2018.
II. MATERIALS AND METHODS
Descriptive studies are designed to describe the distribution of variables without testing causal or other hypotheses. Accordingly, this research was descriptive, with a quantitative approach applied to a non-experimental design. The results of the Colombian students who took the Saber Pro exams in 2017 and 2018, available in the databases of the Colombian Institute for Higher Education (ICFES), were used as the information source. Given that this research involves data mining, the CRISP-DM (Cross-Industry Standard Process for Data Mining) methodology was used. This model is used mainly in the academic and industrial fields, and it is the most widely used reference guide for developing this type of project [10, 11, 12]. It comprises six stages: problem analysis, data analysis, data preparation, modeling, evaluation, and deployment [12].
In the problem analysis stage, activities were carried out to gain a thorough understanding of the Saber Pro exam and the Critical Reading general competency, making it possible to collect the correct data and interpret the results adequately.
In the data analysis stage, the socioeconomic, academic, and institutional information available in the ICFES databases at the time of this research was identified, compiled, and studied. This information corresponded to the results obtained by the students of Universidad Javeriana who took the Critical Reading section of the Saber Pro exams in 2017 and 2018. After integrating the repositories of each year, the result was an initial data set named sbpro_lec_2052A101, which included 2052 records and 101 attributes.
Considering that high dimensionality is an issue for the discovery of patterns in data mining [12], the sbpro_lec_2052A101 set was cleaned and transformed during the data preparation stage to delete noisy, null, and atypical data, transform some attributes to obtain greater information gain, and delete the irrelevant attributes that would not help in the pattern detection process. The result was the data set called sbpro_lec_2052A23, comprising 2052 records and 23 attributes, which was the base for the modeling stage.
The decision tree classification model was selected during the modeling stage as the data mining technique most suitable for solving the research problem. This model is probably the most used and popular because it is simple and easy to understand [13,14,15]. The importance of decision trees comes from their capability to build interpretable models, which is a decisive factor in their use. Decision tree classification considers disjoint classes, so each instance follows a single path to exactly one leaf, which assigns a unique class to the prediction [16]. This technique has several advantages. First, the reasoning process behind the model is evident when the tree is examined. This is in contrast to black-box modeling techniques, where the internal logic may be difficult to determine. Second, the process automatically includes in its rules only the attributes that actually count for the decision making. The attributes that do not contribute to the tree's accuracy are omitted [14].
The process to test the model's quality and validity was established before building the model. To train and test a classification model, the data are divided into two sets, training and test [17]. The cross-validation method was used because it reduces the dependence of the experiment's results on the way this division is made [12]. For this particular case, the n-fold cross-validation evaluation method was used. In this method, the data set is randomly divided into n disjoint subsets of similar size called folds. In Weka, the number of subsets is entered in the Folds field. Subsequently, n iterations (equal to the number of subsets) are made. In each iteration, a different subset is reserved as the testing set and the remaining n-1 subsets are merged to build the model (training). The partial sampling error of the model is calculated in each iteration. Finally, the model is constructed with all the data, and its error is obtained by averaging the errors calculated in each iteration. Another advantage of cross-validation is that the variance of the n partial sampling errors allows estimating the learning method's variability regarding the data set. This research used 10-fold cross-validation, following the recommendation by Hernández et al. [12].
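The n-fold procedure described above can be sketched in a few lines of Python. This is only an illustrative sketch, not the Weka implementation used in the study: the function names, the fixed seed, and the placeholder majority-class learner are assumptions introduced here.

```python
import random
from statistics import mean, pvariance


def k_fold_indices(n_records, n_folds, seed=0):
    """Randomly split indices 0..n_records-1 into n_folds disjoint folds."""
    indices = list(range(n_records))
    random.Random(seed).shuffle(indices)
    # Striding over the shuffled list gives fold sizes that differ by at most one.
    return [indices[i::n_folds] for i in range(n_folds)]


def cross_validate(records, labels, train_fn, n_folds=10, seed=0):
    """Return (mean error, variance of the per-fold errors).

    train_fn(train_records, train_labels) must return a function that maps
    a record to a predicted label.
    """
    fold_errors = []
    for test_idx in k_fold_indices(len(records), n_folds, seed):
        held_out = set(test_idx)
        train_idx = [i for i in range(len(records)) if i not in held_out]
        model = train_fn([records[i] for i in train_idx],
                         [labels[i] for i in train_idx])
        wrong = sum(model(records[i]) != labels[i] for i in test_idx)
        fold_errors.append(wrong / len(test_idx))  # partial error of this fold
    return mean(fold_errors), pvariance(fold_errors)


def majority_class_trainer(train_records, train_labels):
    """Hypothetical placeholder learner: always predicts the most frequent label."""
    majority = max(set(train_labels), key=train_labels.count)
    return lambda record: majority
```

Any learner with the same `train_fn` signature could replace the placeholder; the variance returned alongside the mean error corresponds to the variability estimate mentioned in the text.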
The classifier cost for the sbpro_lec_2052A23 repository was estimated through a confusion matrix during the evaluation stage. The confusion matrix represents in detail the number of instances predicted by class. The sum of the records presented in each row i, i = 1...n constitutes the number of instances that genuinely belong to class i. Similarly, the summation of the examples or records in each column j, j = 1...n is the instances predicted by the algorithm for the j value of the class. The values on the diagonal are the correct matches, and the rest are the classification errors (examples that belonged to the class i of the row i and were classified incorrectly in another) [12].
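As a minimal sketch of this layout (rows for actual classes, columns for predicted classes), the matrix can be built from two label lists as follows. The function names and the toy labels in the usage note are hypothetical and are not data from the study.

```python
def confusion_matrix(actual, predicted, classes):
    """Build a matrix whose rows are actual classes and columns are predicted classes."""
    position = {c: k for k, c in enumerate(classes)}
    matrix = [[0] * len(classes) for _ in classes]
    for a, p in zip(actual, predicted):
        matrix[position[a]][position[p]] += 1
    return matrix


def correct_and_errors(matrix):
    """Diagonal cells are correct matches; every other cell is a classification error."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct, total - correct
```

For example, `confusion_matrix(["above", "below"], ["above", "above"], ["above", "below"])` places one record on the diagonal and one in the row of the class "below" under the column "above".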
Furthermore, the discovered patterns were evaluated to determine their validity, remove redundant or irrelevant patterns, and interpret the useful ones so that they are understandable for the user.
In the deployment stage, the discovered patterns were documented so that they can be incorporated into the existing knowledge on academic performance in the critical reading competency of the students of professional programs in Colombia. The directors of the Universidad Javeriana Cali are responsible for integrating this knowledge into their decision-making processes to improve the education quality of this institution.
III. RESULTS
The purpose of selecting the decision tree classification technique is to obtain a model that identifies the socioeconomic, academic, and institutional factors associated with good (above the mean) or low (below the mean) academic performance in the critical reading section of the Saber Pro exam, and that can predict this performance for new students of the Universidad Javeriana Cali, using the score obtained in this test as the class attribute. To achieve this, several decision tree algorithms were evaluated with the tool Weka, which made it possible to select the technique that classified the sbpro_lec_2052A23 data set with the greatest accuracy. Results are presented in Table 1.
Algorithm | Accuracy |
---|---|
Decision Stump (one-level decision tree) | 53.02% |
J48 | 68.85% |
LMT (Logistic Model Tree) | 62.37% |
Random Forest | 57.89% |
Random Tree | 53.89% |
RepTree | 55.50% |
According to Table 1, the most accurate algorithm was J48, so it was selected to build the decision tree classification models. Once the algorithm and the method for training and testing the models were selected, the decision trees were built with the J48 algorithm, which implements the C4.5 algorithm [18], in Weka version 3.9.4 [17]. J48 selects split attributes using the gain ratio criterion, a normalized form of information gain, so that attributes with a larger number of possible values are not favored in the selection. Additionally, the algorithm prunes the classification tree once it has been induced. The most important pruning parameter was the confidence factor C, which affects the size and predictive capability of the resulting tree. The lower this value, the larger the difference between the prediction errors before and after pruning must be for a branch not to be pruned. The default value of this factor is 25%; as it decreases, more pruning operations are allowed and smaller trees are obtained [19]. Another parameter used to vary the size of the tree was the factor M, which specifies the minimum number of instances or records per tree node. The global score obtained by the students in the Saber Pro exams was selected as the class, which was discretized into the values "above national mean" and "below national mean".
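C4.5, which J48 implements, is generally described as ranking candidate splits by gain ratio, that is, information gain divided by the entropy of the split itself. The sketch below illustrates that idea for nominal attributes only. It is a simplification and not Weka's code: it ignores numeric attributes, missing values, and the additional heuristics of C4.5, and the function names are assumptions.

```python
from collections import Counter
from math import log2


def entropy(values):
    """Shannon entropy (in bits) of a list of nominal values."""
    total = len(values)
    return -sum((n / total) * log2(n / total) for n in Counter(values).values())


def gain_ratio(attribute_values, labels):
    """Information gain of a nominal attribute divided by its split information."""
    total = len(labels)
    partitions = {}
    for value, label in zip(attribute_values, labels):
        partitions.setdefault(value, []).append(label)
    # Expected class entropy after splitting on the attribute.
    remainder = sum(len(part) / total * entropy(part) for part in partitions.values())
    information_gain = entropy(labels) - remainder
    # Split information grows with the number of distinct attribute values.
    split_information = entropy(attribute_values)
    if split_information == 0:
        return 0.0  # the attribute has a single value and cannot split the data
    return information_gain / split_information
```

With four records in two balanced classes, an attribute that separates the classes with two values and an identifier-like attribute with four distinct values have the same information gain, but the second one has twice the split information and therefore half the gain ratio. This is the sense in which many-valued attributes are not favored.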
Different decision tree models were generated to choose the tree that best classified the students and offered the highest interpretability of the patterns associated with academic performance in critical reading. To this end, two values were set for the confidence factor C, 25% and 50%, combined with two values for the factor M, 2.5% (52 examples) and 5% (104 examples). Furthermore, a post-pruning process was implemented to keep only the most representative branches, and therefore rules, which are those that exceed a minimum support of 5% and a confidence of 60%.
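The post-pruning filter can be illustrated with a short sketch. It assumes that the support of a rule is the fraction of all records that reach its leaf and that its confidence is the fraction of those records that belong to the predicted class; the text does not define either measure, so both definitions, the function name, and the inclusive thresholds are assumptions.

```python
def keep_rule(covered, correct, total_records, min_support=0.05, min_confidence=0.60):
    """Decide whether a tree branch (rule) survives post-pruning.

    covered: records that reach the leaf (assumed definition of coverage)
    correct: covered records that belong to the class predicted by the leaf
    """
    support = covered / total_records
    confidence = correct / covered if covered else 0.0
    return support >= min_support and confidence >= min_confidence
```

For a data set of 2052 records, a hypothetical leaf reached by 104 records, 70 of them correctly classified, would be kept, whereas a leaf reached by only 52 records would be discarded for insufficient support.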
The best tree was built with the parameters C=0.25 and M=104 for pre-pruning and a support greater than or equal to 5% for post-pruning. Figure 1 presents the classification tree obtained.
The confusion matrix (also called contingency table) was used to evaluate or estimate the cost of the classification model built. A confusion matrix is a tool that allows visualizing the performance of a supervised learning algorithm. This is shown in Figure 2.
IV. DISCUSSION AND CONCLUSIONS
After analyzing the results of the decision tree presented in Figure 1 for the performance in the critical reading competency of the students from the Universidad Javeriana Cali in the Saber Pro exams in 2017 and 2018, it can be observed that the tree classified 1231 instances correctly, which corresponds to an accuracy of 60%, and 821 instances incorrectly, which corresponds to 40%.
When the model is evaluated with the confusion matrix in Figure 2, obtained with the tool Weka, it correctly predicts 864 cases of students whose performance in critical reading is above the mean (true positives, TP) and 367 cases below the mean (true negatives, TN). On the other hand, the model incorrectly classifies as below the mean 272 cases whose performance is above the mean (false negatives, FN), and as above the mean 549 cases whose performance is below the mean (false positives, FP).
For the case where students are above the mean in critical reading performance, the model has a precision of 0.611, which means that, of the total cases predicted above the mean, 61.1% are correct. The model's sensitivity, also called recall or true positive rate (TPR), is 0.761, indicating that the model correctly classifies 76.1% of the students who are actually above the mean. On the other hand, the model's false positive rate is 0.599, meaning that 59.9% of the students below the mean were classified as above the mean. The F-measure is 0.678, which means that the harmonic mean of the precision and recall for those above the mean is 67.8%. When these measures are combined, the model shows a better performance for those above the mean.
For the case where students are below the mean in critical reading performance, the model has a precision of 0.574; that is, of the total cases predicted below the mean, 57.4% are correct. The model's specificity, or true negative rate (TNR), which is the recall for this class, is 0.401, indicating that the model correctly classifies 40.1% of the students who are actually below the mean. Furthermore, the false negative rate of the model is 0.239, meaning that 23.9% of the students above the mean were classified as below the mean. The F-measure is 0.472, which means that the harmonic mean of the precision and recall for those below the mean is 47.2%. When these measures are combined, the model shows a poorer performance for those below the mean.
The data set used to build the model that detects performance patterns in the critical reading competency of the Saber Pro exam of students from Universidad Javeriana Cali is not highly unbalanced. There is a difference of 220 cases (about 10% of the records) between those above the mean (55%) and those below the mean (45%). For this reason, accuracy is a meaningful summary measure, and from the evaluation metrics calculated previously it can be said that the model has an accuracy of 60% and is better at predicting students above the mean than those below. This is also noticeable in the relation between precision and recall summarized by the area under the precision-recall curve (PRC area), which is 0.631 for the students above the mean and 0.516 for those below the mean. Furthermore, the Matthews correlation coefficient of the model is 0.173, indicating a weak relationship between the predicted and the observed classes, that is, a low prediction quality. Finally, since the ROC area of the model is above 0.5, the model performs better than a random classifier when classifying the students from Universidad Javeriana Cali by their performance in the critical reading competency of the Saber Pro exams of 2017 and 2018.
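The per-class measures reported above follow directly from the four confusion-matrix counts. The sketch below recomputes them from the counts given in the text, taking "above the mean" as the positive class; the function and variable names are assumptions, and this is a plain recalculation rather than Weka's output.

```python
# Counts reported in the text, with "above the mean" as the positive class.
TP, FN, FP, TN = 864, 272, 549, 367


def class_metrics(tp, fn, fp, tn):
    """Return (precision, recall, false positive rate, F-measure) for one class."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    false_positive_rate = fp / (fp + tn)
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, false_positive_rate, f_measure


accuracy = (TP + TN) / (TP + FN + FP + TN)

above = class_metrics(TP, FN, FP, TN)
# For the class "below the mean", the roles of the two classes are swapped.
below = class_metrics(TN, FP, FN, TP)
```

Rounded to three decimals, `above` gives 0.611, 0.761, 0.599, and 0.678, and `below` gives 0.574, 0.401, 0.239, and 0.472, in agreement with the values discussed in the text. The third value of `below` is the rate the text calls the false negative rate of the model.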
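The class balance and the Matthews correlation coefficient can be checked from the confusion-matrix counts reported earlier. The following sketch uses the standard formula for the coefficient in the binary case; the function name is an assumption.

```python
from math import sqrt

# Counts reported in the text, with "above the mean" as the positive class.
TP, FN, FP, TN = 864, 272, 549, 367


def matthews_cc(tp, fn, fp, tn):
    """Matthews correlation coefficient for a binary confusion matrix."""
    numerator = tp * tn - fp * fn
    denominator = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return numerator / denominator if denominator else 0.0


actual_above = TP + FN  # students actually above the mean
actual_below = FP + TN  # students actually below the mean
mcc = matthews_cc(TP, FN, FP, TN)
```

With these counts, `actual_above - actual_below` is 220 and `round(mcc, 3)` is 0.173, matching the figures in the text. The coefficient ranges from -1 to 1, with 0 corresponding to a prediction no better than chance, which is why a value of 0.173 is read as a weak relationship.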