Introduction
In industrial processes, online quality measurements are critical to ensure suitable process parameters, automatic control of key variables, and timely decision-making. Measuring these variables online is, in some cases, difficult, costly, or time-consuming. When hardware sensors are unavailable or unsuitable, data-based inferential estimators, known as soft sensors or virtual sensors, can be used instead. A soft sensor indirectly estimates primary variables through inference from process observations. In the industrial sector, the estimation of the desired process variable is usually based on secondary, easy-to-measure variables such as temperature, flow rate, and level. Complete methodologies for soft-sensor design are presented, for instance, in [1], [2].
Each reported methodology addresses the problem that datasets contain many variables, which are challenging to handle since they often include redundant and irrelevant information. As a result, prediction performance is reduced by overfitting and high dimensionality. Several works have therefore been reported that reduce the number of variables by removing those carrying irrelevant information. Three main approaches are used for variable selection in machine learning. Filtering techniques use statistical metrics to score variables with respect to the target variable [3]. Wrapper methods use a learning model to select relevant variables through an error metric, which makes the selection better fitted to the particular model but more computationally expensive than filtering [4]. The third approach is hybrid: filtering and wrapper elements are combined to improve selection quality while reducing computational cost [5].
Several successful soft sensors have dealt with feature dimensionality problems through filtering approaches. For instance, [6] implemented a soft sensor based on artificial neural networks (ANN), [7] designed a support vector machine (SVM) to predict the melt index of a polymer, and [8] employed fuzzy systems to obtain methanol concentration from a simulated distillation column. These cases apply a filtering approach based on correlation analysis through Pearson's correlation coefficient. Although other metrics, such as information-theoretic measures, nonlinear correlation, or even expert knowledge, have been reported in soft-sensor design, Pearson's correlation analysis is the most widely used. Likewise, several soft sensors have been reported that use wrapper methods for variable selection. The wrapper approach uses the performance of the learning technique itself to choose the most representative subset from several candidates. For example, [9] implements a soft sensor for monitoring gas emissions, in which variable selection was carried out by training an ANN with different subsets of variables. In [10], a soft sensor based on ANN is implemented to estimate the kerosene endpoint; the work reports a wrapper selection of secondary variables using Mallows' Cp as the model performance metric. Also, [10] uses several ANN architectures and the mean squared error (MSE) to select the most relevant variables for predicting the target variable.
Recently, hybrid methods have drawn more attention from soft-sensor designers because combining filtering and wrapper methods achieves suitable prediction performance at a reduced computational cost. Filtering scores such as Pearson's correlation coefficient and the Spearman coefficient capture only linear or monotonic dependence between two variables. However, industrial processes present strongly non-linear relationships between their variables. Therefore, more powerful techniques using information-theoretic metrics, such as mutual information (MI) and information gain (IG), have been used. These are entropy-based metrics; thus, dependency is measured through the amount of information shared between variables. In [11]-[13], the authors present soft sensors that select variables using MI as the filtering metric together with regression techniques such as principal component analysis (PCA) or partial least squares (PLS) to obtain a suitable subset of variables. Although many hybrid algorithms have been proposed to detect the relevance of variables in a dataset, scarce attention has been paid to non-linear dependencies between variables of industrial datasets while also considering redundancy and model performance.
In [14], [15], the authors present a complete review of feature selection techniques for soft sensors in industrial processes. They consider application cases of industrial soft sensors where filtering, wrapper, and hybrid approaches were applied. Although many of these approaches have been used in successful soft-sensor applications, more recent works explore hybrid methods because of their balance between accuracy and computational cost. MI has also been drawing attention from researchers, since it handles nonlinear datasets at a low computational cost. Lastly, soft-sensor researchers have focused on performance metrics of the machine learning technique; however, prediction performance may be strongly affected by redundant variables in the dataset. Therefore, redundancy should be measured and incorporated into the feature selection algorithm.
This work proposes a hybrid variable selection algorithm based on the MI coefficient, followed by redundancy analysis and reduction, for soft sensors in industrial processes. The effectiveness of the approach lies in the variable selection, which narrows the search for suitable models that achieve consistent and accurate results when developing soft sensors for industrial applications. The proposed method starts by computing the relevance of each process variable using the MI coefficient. The variables are then scored by a redundancy coefficient, and finally the subset of suitable variables is reduced using Mallows' Cp coefficient as the performance metric. Moreover, a case study demonstrates the use and results of the algorithm on a distillation column process for water-ethanol mix separation. The soft sensor was designed using an adaptive neuro-fuzzy inference system (ANFIS) to estimate ethanol concentration at the top of the column. A comparative study shows the results of applying correlation analysis and the proposed method in developing a soft sensor for the distillation column dataset.
This paper is organized as follows: The second section introduces the mutual information coefficient and the mathematical basis of the model performance metric, and presents the proposed variable selection algorithm. The third section describes the simulation results of the case study, in which an ANFIS-based soft sensor is applied. Finally, the fourth section gives concluding remarks.
Materials and methods
Mutual information coefficient
Industrial processes often require finding the best subset of easy-to-measure variables to estimate a hard-to-measure variable from the available dataset. Several soft sensors have been proposed that include feature selection techniques based on correlation and collinearity, for instance [16]. These methods were designed for datasets with linear behavior; however, industrial systems are typically non-linear processes, so correlation metrics are often not a suitable choice for feature selection. Recently, information theory metrics such as entropy (E), MI, and information gain (IG) have been used to solve non-linear feature selection problems.
Mutual information is a non-linear dependency metric between two variables of a system. It can be calculated through information entropy as follows:
Entropy is a measure of the uncertainty of a random variable [17]. Equ. (1) shows the entropy calculation for a continuous random variable, where f_x is the probability density function of the random variable x. Equ. (2) shows the conditional entropy calculation for two continuous random variables, where f_{x,y} is the conditional probability density function of variable y given variable x. Although Equs. (1) and (2) allow entropy calculation for continuous variables, these probability density functions are hard to obtain.
Because of this, x and y are treated as discrete random variables, and the entropy is obtained as follows:
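For reference, the standard textbook forms of differential entropy and conditional differential entropy are written out below. The notation is an assumption made for this sketch and may differ from that of Equs. (1) and (2): here f(x) denotes the density of x, f(x, y) the joint density, and f(y | x) the conditional density of y given x.

```latex
\begin{align}
H(x) &= -\int f(x)\,\log f(x)\,dx \\
H(y \mid x) &= -\iint f(x, y)\,\log f(y \mid x)\,dx\,dy
\end{align}
```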
In this manner, MI can be calculated as
where i and j represent the input and output variables to analyze respectively.
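As a reminder, the usual discrete definitions can be sketched as follows. The summation indices a and b (running over the discrete values taken by the input x_i and the output y_j) are introduced here only for illustration, and the numbering of the original equations is not reproduced.

```latex
\begin{align}
H(x_i) &= -\sum_{a} p(a)\,\log p(a) \\
H(y_j \mid x_i) &= -\sum_{a}\sum_{b} p(a, b)\,\log p(b \mid a) \\
MI(x_i, y_j) &= H(y_j) - H(y_j \mid x_i)
             = \sum_{a}\sum_{b} p(a, b)\,\log\frac{p(a, b)}{p(a)\,p(b)}
\end{align}
```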
A greater MI indicates a stronger dependency between x and y. Unlike correlation, MI describes the relation between variables in both linear and non-linear cases.
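The difference between MI and linear correlation can be illustrated with a minimal sketch. It assumes a simple histogram (plug-in) estimator with 10 equal-width bins, which is not necessarily the estimator used in this work; the function name `mutual_information` and the synthetic quadratic data are hypothetical.

```python
import numpy as np

def mutual_information(x, y, bins=10):
    """Histogram (plug-in) estimate of MI between two 1-D samples, in nats."""
    joint, _, _ = np.histogram2d(x, y, bins=bins)
    pxy = joint / joint.sum()               # joint probabilities
    px = pxy.sum(axis=1, keepdims=True)     # marginal of x, shape (bins, 1)
    py = pxy.sum(axis=0, keepdims=True)     # marginal of y, shape (1, bins)
    nz = pxy > 0                            # skip empty cells to avoid log(0)
    return float(np.sum(pxy[nz] * np.log(pxy[nz] / (px @ py)[nz])))

# Synthetic example (assumption): a purely non-linear, symmetric relation.
rng = np.random.default_rng(0)
x = rng.uniform(-1.0, 1.0, 5000)
y = x ** 2 + 0.01 * rng.normal(size=x.size)

pearson = np.corrcoef(x, y)[0, 1]
mi = mutual_information(x, y)
print(f"Pearson r = {pearson:.3f}, MI = {mi:.3f} nats")
```

In this synthetic case the Pearson coefficient stays close to zero although y is almost fully determined by x, while the MI estimate is clearly positive. The estimate depends on the number of bins, so values are only comparable when the same binning is used for all variables.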
Measure for predicting performance
Filtering methods applied to industrial datasets may not be accurate by themselves, since variables are not considered as part of a whole model. Wrapper selection methods handle input variable selection by measuring model performance through metrics such as MSE, root mean squared error (RMSE), R², and Mallows' Cp. The predictive performance of the soft-sensor model is frequently measured through error metrics such as MSE or RMSE, expressed by
where y_j are the observations, ŷ_j are the predicted values of the variable, and n is the number of observations available for analysis. However, the minimal-error model is not necessarily optimal, since high-dimensional models might yield biased error metrics due to overfitting [2]. Therefore, it is common to use a measure that penalizes overfitting. For example, Mallows' Cp [18] determines the tradeoff between model size and model performance by penalizing overfitted models. Mallows' Cp can be calculated as
where y_j(k) are the outputs obtained using a subset of p variables, ŷ_j are the predicted values, n is the number of samples, and σ_d² is the residual variance for the model trained on the full dataset. Thus, the Cp value measures the relative bias and variance of a model with p variables. For an unbiased model, Cp is approximately p, so the optimal model is the one whose Mallows' Cp value is closest to p [1].
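The metrics above can be sketched in a few lines. The sketch assumes the common textbook form Cp = SSE_p / σ² - n + 2p, with σ² estimated from the residuals of the full model; an ordinary least-squares fit on synthetic data is used as a stand-in learner, and all function names are hypothetical.

```python
import numpy as np

def mse(y, y_hat):
    """Mean squared error between observations and predictions."""
    return float(np.mean((np.asarray(y) - np.asarray(y_hat)) ** 2))

def rmse(y, y_hat):
    """Root mean squared error."""
    return float(np.sqrt(mse(y, y_hat)))

def mallows_cp(y, y_hat_subset, sigma2_full, p):
    """Textbook Mallows' Cp: SSE_p / sigma2_full - n + 2p."""
    y = np.asarray(y)
    sse = float(np.sum((y - np.asarray(y_hat_subset)) ** 2))
    return sse / sigma2_full - y.size + 2 * p

def fit_predict(Xs, y):
    """Stand-in learner (assumption): least squares without intercept."""
    coef, *_ = np.linalg.lstsq(Xs, y, rcond=None)
    return Xs @ coef

# Synthetic data (assumption): only the first two of five inputs matter.
rng = np.random.default_rng(1)
n, k = 200, 5
X = rng.normal(size=(n, k))
y = 2.0 * X[:, 0] - 1.0 * X[:, 1] + 0.1 * rng.normal(size=n)

# Residual variance of the model trained with all k variables.
sigma2_full = np.sum((y - fit_predict(X, y)) ** 2) / (n - k)

# Cp for nested subsets made of the first p columns.
cp = {p: mallows_cp(y, fit_predict(X[:, :p], y), sigma2_full, p)
      for p in range(1, k + 1)}
for p, value in cp.items():
    print(p, round(value, 2))
```

With this definition the full model always yields Cp equal to its number of parameters, so Cp is informative only for comparing smaller subsets against it; a subset that omits a relevant input shows a Cp far above p.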
Proposed variable selection algorithm
This subsection presents a new variable selection algorithm based on the MI coefficient and a wrapper technique to obtain a suitable subset of variables for industry-oriented soft sensors. The MI-based variable selection method may be considered a filtering method, where the MI coefficient is used to score the relevance of each process variable [19], [20]. Industrial datasets contain a large number of variables, so some of these measurements might provide redundant information and degrade prediction performance. The redundancy between two random variables may be calculated as [21]
where R takes values between 0 and a maximum value. Therefore, the normalized R can be written as [21]
The stages of the proposed algorithm are as follows. First, the essential variables are selected using the mutual information criterion. Then, redundant variables are detected and excluded by penalizing the previously selected relevant variables. Finally, the subset of the most suitable variables is determined using the wrapper method, which assesses prediction performance with the Mallows' Cp metric as the selection criterion. To calculate MI, Equ. (6) is employed.
The proposed algorithm's detailed structure is presented below; at the final iteration, S is the selected set of variables for developing the soft sensor.
Algorithm 1 Variable Selection Algorithm
procedure
    // Filtering stage
    Given F = {f_i}, i = 1, 2, ..., k, the total variable set, and S an empty set
    Compute MI for each variable regarding the target values: MI(x_i, y_j)
    Add the first variable to S: s_1 = max{MI(x_i, y_j)}
    while Mallows' Cp decreases: // Wrapper stage and redundancy computing
        Compute the redundancy of each remaining variable regarding all s ∈ S
        Select the next variable by maximum relevance penalized by average redundancy
        Train and validate a machine learning technique with the obtained subset
        Compute Mallows' Cp as the prediction performance
    end while
end procedure
Variables with high redundancy are strongly penalized through their average redundancy and may be removed; otherwise, they are added to the subset S.
The hill-climbing method is used as the wrapper. It is an iterative trial-and-error technique that starts with an empty variable subset and adds one variable at a time until the best-performing subset is complete. Lastly, the Cp value is chosen as the prediction performance index: Mallows' Cp degrades if high dimensionality affects the soft sensor's prediction performance.
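The stages described above can be sketched as a greedy forward search. This is a simplified illustration under several assumptions, not the authors' implementation: MI is estimated with a histogram, redundancy is taken as the average MI between a candidate and the already selected variables (without the normalization of [21]), an ordinary least-squares model replaces ANFIS as the learner, Cp is computed on the training data, and all function names are hypothetical.

```python
import numpy as np

def mi(x, y, bins=10):
    """Histogram estimate of mutual information in nats."""
    joint, _, _ = np.histogram2d(x, y, bins=bins)
    pxy = joint / joint.sum()
    px = pxy.sum(axis=1, keepdims=True)
    py = pxy.sum(axis=0, keepdims=True)
    nz = pxy > 0
    return float(np.sum(pxy[nz] * np.log(pxy[nz] / (px @ py)[nz])))

def lstsq_sse(X, y):
    """Stand-in learner (assumption): least squares with intercept; returns SSE."""
    A = np.column_stack([np.ones(len(y)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return float(np.sum((y - A @ coef) ** 2))

def select_variables(X, y, bins=10):
    n, k = X.shape
    # Filtering stage: relevance of every variable regarding the target.
    relevance = np.array([mi(X[:, i], y, bins) for i in range(k)])
    sigma2_full = lstsq_sse(X, y) / (n - k - 1)

    def cp(subset):
        p = len(subset) + 1  # parameters, including the intercept
        return lstsq_sse(X[:, subset], y) / sigma2_full - n + 2 * p

    selected = [int(np.argmax(relevance))]
    remaining = [i for i in range(k) if i not in selected]
    best_cp = cp(selected)
    # Wrapper stage: add variables while Mallows' Cp keeps decreasing.
    while remaining:
        scores = [relevance[i]
                  - np.mean([mi(X[:, i], X[:, s], bins) for s in selected])
                  for i in remaining]
        candidate = remaining[int(np.argmax(scores))]
        new_cp = cp(selected + [candidate])
        if new_cp >= best_cp:
            break
        selected.append(candidate)
        remaining.remove(candidate)
        best_cp = new_cp
    return selected

# Synthetic data (assumption): column 1 is a near copy of column 0.
rng = np.random.default_rng(0)
n = 2000
x0, x2, x3, x4 = (rng.normal(size=n) for _ in range(4))
x1 = x0 + 0.01 * rng.normal(size=n)
X = np.column_stack([x0, x1, x2, x3, x4])
y = 3.0 * x0 + 2.0 * x2 + 0.1 * rng.normal(size=n)
selected = select_variables(X, y)
print(selected)
```

In this synthetic example the redundancy penalty pushes the near-duplicate column behind the independent informative one, which is the behavior the relevance-minus-redundancy score is intended to produce. A faithful reproduction would replace the least-squares stand-in with the ANFIS model and evaluate Cp on validation data.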
Results and discussion
The proposed case study is a distillation column process, which is broadly used in the petrochemical and food industries. Many works have reported successful applications of soft sensors for distillate products [10], [11], [22].
Simulation dataset
A distillation process consists of separating two or more compounds by applying energy to lead vapor toward the column's top. The remaining liquid is transported to the bottom of the column. This process is repeated until the separation is completed. Fig. 1 shows a distillation column scheme.
This work considers a non-linear simulated mathematical model of a binary distillation column with 12 trays to separate a water-ethanol mix. A dataset containing 60 input variables with 4000 observations (data points) was collected at a rate of 10 samples per hour, with no shutdown phases. The dataset was contaminated with noise of 10% amplitude, and a random 5% of the total observations are outliers, to represent common conditions of industrial environments. Ethanol distillate concentration is the dataset output, and the model inputs are shown in Table 1.
Outlier detection
The presence of outliers in industrial datasets is expected due to environmental issues; therefore, these observations should be detected and replaced to achieve suitable training and prediction performance. A Hampel filter is applied to the dataset because it is robust to strongly deviating outliers [1], [14], [23]. The Hampel filter uses the median absolute deviation (MAD), an outlier-resistant metric, and applies it through a moving window with two tuning parameters: threshold and width. MAD can be computed as [19]
where x_i are the values of the data sequence and x* is their median.
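A minimal sketch of such a filter is given below. It assumes the common formulation in which the MAD is scaled by 1.4826 (so that it approximates the standard deviation of Gaussian data) and flagged samples are replaced by the window median; the parameter names `half_width` and `threshold` and their default values are hypothetical, with the window width equal to 2 * half_width + 1.

```python
import numpy as np

def hampel_filter(x, half_width=5, threshold=3.0):
    """Replace samples that deviate from the local median by more than
    threshold * scaled MAD; returns the cleaned series and an outlier mask."""
    x = np.asarray(x, dtype=float)
    cleaned = x.copy()
    outlier = np.zeros(x.size, dtype=bool)
    scale = 1.4826  # makes MAD comparable to a standard deviation (Gaussian case)
    for i in range(x.size):
        lo, hi = max(0, i - half_width), min(x.size, i + half_width + 1)
        window = x[lo:hi]
        med = np.median(window)
        mad = scale * np.median(np.abs(window - med))
        if np.abs(x[i] - med) > threshold * mad:
            cleaned[i] = med
            outlier[i] = True
    return cleaned, outlier

# Synthetic example (assumption): a smooth signal with one injected spike.
t = np.linspace(0.0, 2.0 * np.pi, 200)
clean = np.sin(t)
signal = clean.copy()
signal[120] += 5.0
filtered, mask = hampel_filter(signal)
print(np.flatnonzero(mask))
```

Because both the center and the spread are medians, a single spike inside the window barely changes the detection limits, which is why the filter is less sensitive to extreme values than a mean and standard deviation rule.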
Training and validation
The soft sensor was developed using a five-layer ANFIS model with Gaussian membership functions, N input variables, and Takagi-Sugeno consequents. Fig. 2 shows the architecture considered. Training was carried out with a hybrid approach, using a back-propagation algorithm to tune the membership function parameters and least squares to adjust the Takagi-Sugeno function parameters [24], [25].
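The forward pass of a five-layer ANFIS of this kind can be sketched as follows. The sketch assumes one Gaussian membership function per input and per rule, a product T-norm for the firing strength, and first-order Takagi-Sugeno consequents; the function name, array shapes, and parameter values are hypothetical, and training is not shown.

```python
import numpy as np

def anfis_forward(x, centers, sigmas, consequents):
    """Forward pass for one sample.
    x: (n_inputs,); centers, sigmas: (n_rules, n_inputs);
    consequents: (n_rules, n_inputs + 1), last column is the bias term."""
    x = np.asarray(x, dtype=float)
    # Layer 1: Gaussian membership degree of each input in each rule.
    mu = np.exp(-0.5 * ((x - centers) / sigmas) ** 2)
    # Layer 2: rule firing strength (product of membership degrees).
    w = mu.prod(axis=1)
    # Layer 3: normalized firing strengths.
    w_norm = w / w.sum()
    # Layer 4: first-order Takagi-Sugeno consequent of each rule.
    f = consequents[:, :-1] @ x + consequents[:, -1]
    # Layer 5: weighted sum of the rule outputs.
    return float(np.dot(w_norm, f))

# Two rules on one input, symmetric around zero (illustrative values).
centers = np.array([[-1.0], [1.0]])
sigmas = np.array([[0.5], [0.5]])
consequents = np.array([[0.0, 1.0],   # rule 1: f = 1
                        [0.0, 3.0]])  # rule 2: f = 3
out = anfis_forward([0.0], centers, sigmas, consequents)
print(out)
```

In the hybrid training scheme, the centers and widths of layer 1 would be tuned by back-propagation, while the consequent coefficients of layer 4 are linear in the output and can therefore be adjusted by least squares.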
A first-order, non-linear autoregressive with exogenous input (NARX) model was considered for ethanol distillate concentration prediction. MSE and RMSE were used as training performance metrics.
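A first-order NARX structure can be prepared from the dataset by pairing each output sample with the previous output and the previous inputs. The sketch below assumes the form y(t) = f(y(t-1), u(t-1)); whether the exogenous inputs enter with or without a one-step delay is not stated here, so the delay on u is an assumption, and the function name is hypothetical.

```python
import numpy as np

def narx_regressors(u, y):
    """Build first-order NARX training pairs.
    u: (n_samples, n_inputs) exogenous inputs; y: (n_samples,) output.
    Returns X with rows [y(t-1), u(t-1)] and the target vector y(t)."""
    u = np.asarray(u, dtype=float)
    if u.ndim == 1:
        u = u[:, None]
    y = np.asarray(y, dtype=float)
    X = np.column_stack([y[:-1], u[:-1]])
    target = y[1:]
    return X, target

# Tiny illustrative series: two inputs, five samples.
u = np.arange(10).reshape(5, 2)
y = np.array([10.0, 11.0, 12.0, 13.0, 14.0])
X, target = narx_regressors(u, y)
print(X.shape, target.shape)
```

The resulting regressor matrix can be passed to any static learner, so the dynamic behavior is carried entirely by the lagged output column.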
The proposed algorithm was applied to the initial 60-variable dataset from Table 1 to obtain the most suitable subset for prediction. First, the ranking stage was performed to determine the dependence of ethanol concentration on the secondary variables of the distillation column. Table 2 shows the top ten variables obtained through the mutual information coefficient, according to which ethanol concentration depends on several flow rates and tray temperatures. A greater MI coefficient means a stronger dependence between the variable and the ethanol concentration.
Variable | Label | MI coefficient |
---|---|---|
Temperature plate | U30 | 3.42 |
Temperature plate | U31 | 2.47 |
Temperature plate | U33 | 2 |
Temperature plate | U35 | 1.52 |
Temperature plate | U36 | 1.38 |
Temperature plate | U41 | 0.98 |
Liquid flowrate | U15-U28 | 0.69 |
Bottom flowrate | U59 | 0.57 |
Feed flowrate | U60 | 0.23 |
Distillate flowrate | U58 | 0.17 |
Source: The authors
Table 3 shows the ten Pearson correlation coefficients of highest magnitude between the distillation column inputs and the ethanol concentration. According to this analysis, the estimated variable depends strongly only on the temperatures of the column plates. However, Pearson's coefficient does not capture non-linear relationships between variables, whereas the MI coefficient does.
Variable | Label | Pearson's coefficient |
---|---|---|
Temperature plate 1 | U30 | -0.9903 |
Temperature plate 2 | U31 | -0.9658 |
Temperature plate 3 | U32 | -0.9645 |
Temperature plate 4 | U33 | -0.9486 |
Temperature plate 5 | U34 | -0.9180 |
Temperature plate 6 | U35 | -0.8873 |
Temperature plate 7 | U36 | -0.8528 |
Temperature plate 8 | U37 | -0.8134 |
Temperature plate 9 | U38 | -0.7714 |
Temperature plate 10 | U39 | -0.7249 |
Source: The authors
The results of this first stage of the algorithm, after an extensive graphical study, indicate a high dependence between the tray temperatures and the ethanol concentration. However, these results also show redundancy between the variables, so the soft sensor will likely suffer performance degradation if all these temperatures are used in the ANFIS model. Additionally, the correlation method only considers linear relationships between variables; therefore, variables at the end of the list may still represent the ethanol concentration in a non-linear way. The MI coefficient, on the other hand, gives very different results regarding the highest-weighted variables. In this case, the first 10 variables include some temperatures but also some flows, which indicates that these variables carry information about the ethanol concentration regardless of whether the relationships are linear or non-linear.
Next, the proposed algorithm was applied to find the best prediction subset for estimating ethanol concentration. Fig. 3 presents the behavior of the algorithm with several groups of variables. Training and validation were performed for each variable subset to assess its prediction performance. Fig. 3 shows that MSE decreases as the subset size increases, although soft-sensor performance degrades for subsets larger than seven variables.
Table 4 shows several candidate groups of variables and their validation metrics. From Fig. 3 and Table 4 it is possible to conclude that subset 5 achieves the best balance between subset size and prediction accuracy.
Subset | Labels | Subset size | MSE (train) | MSE (validation) | Mallows' Cp |
---|---|---|---|---|---|
1 | U30 | 1 | 0.0964 | 0.1911 | 5.019 × 10^6 |
2 | U30, U31, U58 | 3 | 0.0484 | 0.0553 | 1.446 × 10^6 |
3 | U30, U31, U35, U58, U60 | 5 | 0.0052 | 0.0242 | 6.370 × 10^5 |
4 | U30, U31, U35, U41, U58, U59, U60 | 7 | 0.0001 | 0.0001856 | 4933 |
5 | U15, U30, U31, U33, U35, U41, U58, U60 | 8 | 0.0003 | 0.0135 | 3.637 × 10^5 |
6 | U15, U30, U31, U33, U35, U36, U41, U58, U59, U60 | 10 | 0.0001 | 0.0002491 | 6609 |
Source: The authors
From Table 4, an 8-variable subset is selected by the algorithm; the temperatures of trays 1, 2, 3, 5 and 18 are highly relevant for prediction. Likewise, the distillate and feed flowrates affect ethanol concentration. Expert knowledge confirms the results obtained from the proposed algorithm.
Subset | Labels | Subset size | MSE (train) | MSE (validation) | Mallows' Cp |
---|---|---|---|---|---|
1 | U30 | 1 | 0.0834 | 0.8312 | 8.021 × 10^6 |
2 | U30, U31, U32 | 3 | 0.0571 | 0.0761 | 3.381 × 10^6 |
3 | U30, U31, U32, U33, U34 | 5 | 0.0135 | 0.0517 | 7.631 × 10^5 |
4 | U30, U31, U32, U33, U34, U35, U36 | 7 | 0.0021 | 0.0128 | 4.131 × 10^5 |
5 | U30, U31, U32, U33, U34, U35, U36, U37 | 8 | 0.0015 | 0.00461 | 1.637 × 10^4 |
6 | U30, U31, U32, U33, U34, U35, U36, U37, U38, U39 | 10 | 0.0004 | 0.00137 | 0.983 × 10^4 |
Source: The authors
Table 5 presents the variable selection based on correlation analysis. The results show subset 6 as the most suitable regarding the Mallows' Cp value. After applying correlation analysis and the wrapper methodology, subset 6 presents the minimal validation metric. However, this MSE is higher than that of the proposed method because correlation only considers tray temperatures, among which information redundancy probably exists. In this case, at least six variables are necessary to obtain an acceptable performance of the proposed soft sensor.
After variable selection and the NARX model architecture were applied, an ANFIS-based soft sensor was obtained. It predicts ethanol concentration at the output of a distillation column for water-ethanol mix separation. Several steady-state operating points are considered over 120 hours, without shutdown phases. The data were collected with a sampling time of six minutes; noise of 10% amplitude was added, and 5% of the total samples were contaminated with outliers. These are typical conditions of industrial instrumentation systems.
Fig. 4.a shows the validation results of the trained ANFIS system with the subset selected by the proposed MI-based method. The vertical axis represents ethanol concentration at the top of the distillation column, and the horizontal axis is time. Fig. 4.b presents the residual error from the soft-sensor validation. The results show an error near 0 at each time instant.
Table 6 compares the soft sensor based on the proposed algorithm with other algorithms reported in the literature. The minimal redundancy maximal relevance technique (MRMR) [20] and the mutual information feature selection algorithm (MIFS) [19] have results comparable with the proposed algorithm; however, the proposed algorithm has a lower computational cost.
Feature Selection Algorithm | ANFIS MSE |
---|---|
Proposed | 0.000207 |
Correlation Analysis | 0.004989 |
LASSO | 0.002485 |
MRMR (Nonlinear) | 0.000358 |
MIFS (Nonlinear) | 0.000317 |
Source: The authors
Fig. 5 shows 50 hours of operation, where the red line represents the predicted data and the blue line the actual data. The MSE obtained after the validation stage indicates suitable accuracy for this type of measurement.
Conclusions
This paper has proposed a mutual-information (MI) based algorithm to select soft-sensor design variables in an industrial context. The main advantage of this algorithm over previously reported methods is that variables are ranked by relevance and redundancy before the wrapper method is applied. Thus, ranking the variables reduced the search over the space of subsets and provided relevant results faster.
The proposed algorithm was applied to a distillation column for water-ethanol mix separation, in a data-driven soft-sensor application based on ANFIS. A comparative analysis was carried out between correlation analysis and the proposed MI-based method, which improves prediction accuracy and shows suitable performance for industrial environments.
The case study shows that ethanol concentration depends on eight variables, including 4 tray temperatures, the liquid flowrate, and the distillate flowrate, among others. These results were compared with expert knowledge: most control systems act on these variables, which indicates that the variable selection result is suitable.