1. Introduction
Design of experiments (DOE) is a well-known methodology[1, 2] which can be applied to physical experiments without any difficulty. However, when the data comes from numerical experiments obtained from deterministic mathematical models, problems arise due to the impossibility to compute the pure error to evaluate the lack of fit. Although the literature deals with the issue of applying the DOE technique in numerical simulations, the information provided is not enough to be applied directly to hydrodynamic simulation because it does not specify how to replicate points with numerical experimentation[3-7].
To overcome this problem, the statistical theory of the lack of fit with no replications was used to generate the information for the neighbor points, treating them as the real points to the center[8, 9]. Additionally, the descending scaling methodology was used to identify the points of minima outside the initial experimental region[1, 2].
The numerical simulations in this work were performed by an Eulerian-Eulerian hydrodynamic model for the conservation of mass and momentum laws. It was applied to a two-phase flow of water and air, in order to find the interface surface between these two fluids, which is useful for the study of flow in open channels when the water surface is to be determined [10, 11]. Taking into account that it is necessary to compare predicted values with reference (experimental) ones for the calibration of the hydrodynamic model, experimental values reported in the literature for water depths in a curved open-channel flow were used in this case. Specifically, data reported by Han et al [12] were used for calibration, and data reported by Blanckaert, Rozovskii and Poggi, were used for validation[13-18].
As a result of this work, a methodology for calibration and validation of numerical models was developed. This methodology is valid to define the optimum configuration of the numerical parameters of the model. A new second order regression model was developed to predict the optimum point of the response variable with a reasonably high level of precision for hydraulic engineering applications.
2. Methodology
The methodology used for the hydrodynamic-numerical model calibration and validation is based on that of the response surface. It includes the analysis of lack of fit through the neighbor nodes and the down-scaling technique, which makes it appropriate to apply a numerical experimentation with deterministic mathematical models[1, 2]. The contribution of this methodology to the classical DOE methodology applied to physical experimentation is the inclusion of near neighbor points in the lack of fit test, instead of using exact points to the center. The developed methodology flow chart is shown in Figure 1, where it is compared to the classical DOE methodology presented by Gutiérrez and Vara[1] and Montgomery[2]. The difference is easy to detect if we perform the lack of fit test without replication. This methodology is applied to the simulation of a curved channel open flow.
2.1. Experimental data for curved channels
Water depths in a curved channel measured by Han et al.[12] were used for the calibration of the numerical parameters of the model (see channel geometry in Figure 2(a)); and water depths measured by Blanckaert, Rozovskii and Poggi [13-18] (see Figures 2(b), 2(c) and 2(d) for channels geometry) were used for validation. The flow characteristics and channel dimensions for each test case are presented in Table 1, (where B: channel width; h e: water depth at the channel entrance; h s: water depth at the channel exit; R m: mean radius of the channel; Q: discharge; (: development angle of the channel curve; S o: longitudinal slope; L e: straight zone of the channel upstream the entrance; L s: straight zone downstream the exit).
Data from Han´s horizontal curved channel include water levels for the internal and external Plexiglas walls, taken from a figure[12]. Data from Blanckaert´s horizontal channel include water levels at the center of the channel with a reported roughness coefficient of 5x10-6 m for the lateral walls and 0.003 m for the bottom[14]. Data from Rozovskii´s horizontal channel include water levels for the internal and external smooth walls and a roughness coefficient of 0.0004 m for the bottom[16, 17]. Finally, data from Poggi’s channel include water levels for the internal and external walls with longitudinal slopes of 5% and 10%[18].
2.2. The hydrodynamic model
The set of governing partial differential equations for the Eulerian-Eulerian model used in this work includes three conservation of momentum scalar equations and one conservation of volume equation [10, 11]. The complete set of equations for an incompressible permanent flow model is:
Conservation of mass for α and β phases in Eq. (1) and Eq. (2):
Conservation of momentum equation, in vector form, Eq. (3):
Conservation of volume equation in Eq. (4):
Where in Eq. (1) to Eq. (4), u is a 3-D velocity vector in x, y, z directions; r α is the volume fraction for the α phase; r ( is the volume fraction for the ( phase; ( and ( are density and the dynamic viscosity of the weighted mixture, given in terms of the volume fraction as: ( = ( α r α +( β r β and ( = ( α r α +( β r β, where ( α is the density of the α phase, ( β is the density of the ( phase, ( α is the dynamic viscosity of the α phase, and ( β is the dynamic viscosity of the ( phase; p is the pressure;(( represents the tensor product; ( is Kronecker’s operator; S M represents the source term in the conservation of momentum equation due to internal forces (buoyancy forces per unit volume computed as S M = ((-( ref)g, where ( ref is the reference density corresponding to the fluid of smallest density and g is the acceleration due to gravity). The closed mathematical model includes six dependent variables: pressure, three Cartesian velocity components, and the volume fractions for α and ( phases. Additionally, the RNG((-( turbulence model is coupled to the homogenous model. This turbulence model was chosen due to its simplicity in terms of its empirical parameters and due to its wide use in engineering applications[19].
2.3. Computational grid configuration
The numerical solution for the model described above requires the computational domain to be divided into a discrete domain of elements forming a grid of cubes as shown in Figure 2(a). Here the size of each element (Te) and the height of the domain (Hd), a height different from the water depth, are considered the numerical factors that define the geometry of the computational domain. An adaptive grid is also required by the numerical model during the intermedia time iterations so that the computational domain changes during this iterative solution process[10]. The new node distribution follows Eq. (5),
where S ( is the number of additional nodes for every iteration step; MNP is the maximum number of iteration steps; M is the number of new nodes to be distributed in the domain and is computed as M =NDi*NF, where NDi is the initial number of nodes which depends on the domain size defined by Hd and Te (see Figure 2(a)), and NF is the node factor that allows the user to define the total number of nodes to be added; NAP is a parameter related to how the nodes are added to the computational domain (if positive, nodes are added at the beginning of the simulation; if negative, nodes are added at the end of the simulation; and if it is zero, nodes are added evenly distributed during every grid adaptation step); and ( represents each one of the adaptation steps and varies between 1 and MNP. In addition to the previous geometric factors, the numerical model includes two numerical factors related to the iteration process: a) the MIPS factor that indicates the maximum number of iterations per computational step of the adaptive grid, and b) the NI factor that represents the maximum number of steps (iterations) for convergence. In Figure 3, a graphic representation of the MIPS, MNP, ( and NI factors when the RMS error is plotted as a function of the number of iterations is shown Figure 3 was obtained for MNP = 5 and MIPS = 20). Once the r α and r β fractions are known, the grid adaptation procedure is used for a region near the water-air interface, adding more nodes where the answer variable shows steeper gradients.
The purpose of Eq. (6) is to determine the effect of each factor on V R in terms of their corresponding levels that make V R a minimum.
2.4. Statistical methodology for calibration and validation
1) Statistic Indexes: In the literature, several statistical indexes have been used to evaluate the predictions obtained by numerical models[20-24], the root mean square error (RMSE) being one of the most widely used for calibration and validation[25-29]. For the case of water depths in curved channels, the literature reports RMSE values in the range of 0.0009 to 0.055 m[30]. The RMSE error is computed with Eq. (7) as:
where N is the number of data, O i is the observed (measured) values of the response variable and P i is the predicted (simulated) value of the answer variable. Both O i and P i occupy the same spatial position (simulated values may need to be interpolated to the same location of the measured point). The RMSE error is interpreted as a deviation of the simulated results from the measurements[29]. As the RMSE error is a dimensional variable, it is convenient to scale it to a reference value which, in most cases, is the mean value of the measurements.
2) Lack of fit testing: The lack of fit test, in our case, is used to verify the order of the suggested regression model, using the water depth as the response variable stated by a linear form of Eq. (6). The hypothesis to be verified by this test includes: a) Zero hypothesis Ho: when the regression model properly fits the RMSE values and b) the alternative hypothesis HA: when the regression model does not properly fit the RMSE values. When performing physical modelling, this test can be applied without any problems due to its repetitive characteristics. But this is not possible when the numerical modelling is performed because if the same level for the model factors is maintained, the model will provide the same results for the response variable, leading to a misinterpretation of the experimental error. Mathematically, this problem is stated as: Let Y 11, Y 12, …Y 1n(1) be the series of repetitive observations for X 1; let Y 21, Y 22, … Y 2n(1) be the series of repetitive observations for X 2; and so on until Y m1, Y m2, … Y mn(m) is the repetitive observations for X m. Then an estimate of the variance “pure” error (SS PE ) for these experiments is , where is the mean value for the group i. The number of different levels for X is m, and the total number of experimental arrangements is n, formed by m groups n(1), n(2),.. n(m), satisfying the relationship n = For hydrodynamic simulations, all yij values are identical (considering the computing machine intrinsic rounding error) so that all are identical to every y ij value, making SS PE = 0, which implies a test failure.
To solve this problem, a set of “points to the center” is generated using the concept of the lack of fit test when no replicas are available[8] [9]. To apply this method, it is necessary to generate levels for the factors near (located at a D ii´ 2 distance) to the “point to the center” so a different value for the answer variable is obtained for each group. The D ii´ 2 distance is computed with Eq. (8) as:
where B j represents the regression coefficients obtained with the real levels of the factors, k is the number of parameters of the regression model, MSE is the mean square error, X ij are the real levels of the “point to the center”, and X ij are the real levels of the neighbor points. All pairs of points located at the D ii´ 2 distance are the neighbor points which are used to calculate the pure error. Given that values of ( j, X ij and MSE are known, the levels of the neighbor points X ij are determined such that D ii´ 2 < 1.
The lack of fit test was performed according to Daniel and Wood's method[8, 9]. This method compares the standard deviation of the population with the root mean square of the total error (MSE); if there is no appreciable lack of fit. For samples of size two, there is a relationship between the range of a sample from a normal population and the population standard deviation, given as where E is the sum of the values of E u, , and where e i, and e i', respectively, are the residuals at points i and i', for the uth pair; with m’=4*Nv-10 and Nv, the number of near neighbor points.
In this research, we confirm that the near neighbor points can also be used to perform the lack of fit with the standard DOE procedure according to Gutiérrez and Vara[1] and Montgomery[2], using the near neighbor points as exact replica points.
3. Results and discussion
3.1 Experimental design and numerical simulations
Given that 6 factors were selected for the analysis (see Eq. (6)), a 27-2 fractioned factorial design was chosen, leading to 32 simulations (model runs) that allow the analysis of the main factors and their interactions with the secondary ones, and including three double hidden interactions. The corresponding factor levels were selected following recommendations from the specialized literature[10]. Once the runs were completed according to the experimental design, the RMSE error was scaled to the water depth at entrance (h e ), expressed in percentage. The analysis was performed with factors being scaled according to the formula(( = (X n-X o)/((X , where ( is the scaled factor, X n is the real factor level, X 0 is the level of the “point to the center”, and (X is the level increment of the real factor related to the level of the point to the center. A negative value for the scaled factor indicates that the factor has a smaller level than the “point to the center”.
3.2. Statistical analysis of the 2 7-2 fractioned factorial
In order to obtain the best ANOVA, a confidence interval of 5% was defined, including in the error some double interactions of smaller significance[1, 2]. For practical purposes, Eq. (6) was rewritten in the form RMSE P = f(A, B, C, D, E, F, G), where there is a correspondence between the factor in Eq. (6) and its rewritten form. The linear regression model obtained for the best ANOVA is: RMSE P = 2.92831 + 1.27684*A + 0.0741274*B - 0.32016*C - 0.52464*D + 0.247451*E - 0.104724*F + 0.160621*G + 0.227865*A*C - 0.322946*A*D + 0.291371*A*F - 0.228789*B*C - 0.657372*B*D + 0.233018*B*E - 0.458708*B*G - 0.205209*C*E - 0.209814*C*F - 0.427151*D*E + 0.228444*D*G, where the factors are in encoded values.
A MSE value of 0.1854 was obtained from this analysis with an adjusted determination coefficient R 2 adj of 90.33% for the regression model. Not all factors that prove to be non-significant must be eliminated; only the least significant effects are eliminated. This causes the R2 adj statistic to increase in value. If by eliminating an effect the R2 adj decreases by 3.0% or more, it means that possibly this effect should not be eliminated. Therefore, there could be non-significant factors that form the RLM model (1) The criterion for the factor sensitivity analysis is obtained from the absolute value of the magnitude of the regression coefficients, indicating the relative importance of each factor in the RMSE P . A graphical representation of these results is presented in the Pareto Diagram (Figure 4(a)) and Daniels Plot (Figure 4(b)); additionally it was confirmed with the ANOVA analysis “p-value”. From the highest to the lowest significance, the factors are organized as: A, D, C, E, G, F and B. It was possible to establish that the size of the grid elements (factor A in the list) has the highest influence on the RMSE P .This means that the grid configuration is the most sensitive factor in the analysis. This conclusion is ratified by Figure 4(c) and Figure 4(d) where it is clear that the size of the element must be as smallest as possible. It is also ratified when the double interactions AC, AD, and AF are analyzed.
The sign of the regression coefficients indicates that the corresponding level factor needs to be increased or decreased in order to reach the optimal value of the answer variable. In our case, the optimal point is reached for a minimum value for RMSEP, so the regression coefficients must change as indicated by the plus sign (increase) or minus sign (decrease): A(-), B(-), C(+), D(+), E(-), F(+) y G(-).
With the obtained regression coefficients, the MSE and the defined “point to the center”, Eq. (8) was used to check the near neighbor points in addition to four “points to the center” to guarantee an ANOVA test with at least eight degrees of freedom for the error. The result produced 17 degrees of freedom for the best ANOVA by using 36 data points and 18 regression coefficients. The near neighbor points were checked with the regression coefficients in real levels of the factors. With the center points, the lack of fit test was performed and as result a “p-value” of 0.7441 was obtained, meaning that a non-significant lack of fit was achieved. This result is in agreement with Daniel and Wood’s method, for which we obtained that with values of and with a difference of 0.56% with respect to , indicating that there is no appreciable lack of fit. This result implies that the H o hypothesis is accepted. As a conclusion, it is possible to state that the numerical experiments are governed by the regression model, which generates a non-curvature plane to explain the response variable (compared with a 3D topography where the response variable is the height and the factors are the east-west coordinates, forming a valley). This means that the region of experimentation is located on the slopes of the mountain, so it is necessary to continue the methodology to reach the lowest point of the valley.
In order to guarantee the successful application of the methodology, an error surface with a certain degree of curvature is required. In our case, to achieve this behavior, an adaptation to the descendent scaling method was developed according to Gutiérrez and Vara[1] and Montgomery[2]. Factors A (Te, element size), E (Hd, total domain depth at entrance) and G (MNP, maximum number of iteration steps) were disregarded from the analysis due to their limitation for their corresponding minimum possible levels, and their central levels were defined, instead. Factor A was disregarded due to computational limitations because very small values imply big computational efforts; so a mean value of 0.030 m was defined. Factor E has its limit value at the water surface and the water depth of 0.225 m was assigned. Finally, the G factor cannot take levels smaller than 2 so a level of 4 was defined. These restrictions led to a new scaling design for the factor with the highest significance out of the 4 factors left, from which factor D (the node factor that allows the user to define the total number of nodes to be added during the grid adaptation process) was selected as a base point, given its previously obtained high regression coefficient (( = 0.52464). Five values were taken for the descendent scaling with a discrete step of two (2) for the D factor, initiating from zero up to the central value, and the corresponding runs were performed. The RMSE values were plotted vs. the number of steps and it was observed that this error did not diminish after the fourth step of the scaling, and a new “point to the center” was obtained for the experimental region. Two extra simulations were performed and this result was validated.
3.3. 2 4 Complete factorial design and analysis
With the remaining four factors, a 24 complete factorial design was performed for this new numerical experiment, with eight center points as near neighbor points. The new experimental region was defined taking as a base point the “point to the center” obtained from the 4th step of the above-mentioned descending scaling process. With this information the best ANOVA, the new regression coefficients and a new MSE were obtained for the real level of factors. The corresponding model runs were executed for the “points to the center”. The ANOVA analysis and the lack of fit test were performed producing a result of a significant test for “p-value” equal to zero, leading to reject the H 0 hypothesis. This means that the experimental region has a curvature and a second order regression model is needed. The lack of fit test was also performed using Daniel and Wood' method, with eight points of near neighbors points, for a total of pairs of points of m=4*8-10=22. This resulted in with values of and , indicating that there is appreciable lack of fit, coinciding with the standard procedure proposed by Gutiérrez and Vara[1] and Montgomery[2].
3.4. Draper-Lin model: Design and analysis
In search for the optimal model, the Draper-Lin model, built out of the 24 factorial design, was chosen for the experiment design. In this case, the same “points to the center” used for the lack of fit test were added and the “star-points” were also added to guarantee an orthogonal and rotatable design. After the best ANOVA analysis, the new regression model, now a second order model, was obtained as: RMSE P = 1.15325 - 0.205049*D + 0.00933529*B - 0.0449354*C - 0.0446673*F + 0.365531*D^2 + 0.17816*D*F + 0.216632*B^2 - 0.161874*B*F + 0.0402612*C^2 + 0.199838*F^2, where the RMSE P is expressed in percentage and the level of factors are scaled as mentioned earlier. In this new equation, it is clear that the second order terms have high relevance, indicated by the relative high values of the regression coefficients. These results are validated by the Pareto diagram (Figure 5(a)), the Daniels plot (Figure 5(b)) and the “p-value” of the regression coefficients.
From the estimate response surface plot (Figure 5(c)), it is clear that the factor interaction DB shows a well-defined local minimum; and that the factor interaction DC (Figure 5(d)) shows a tendency to decrease as the C factor increases. Note that the upper limit for factor C (NI, number of iterations for convergence) will be the limit that guarantees the physical meaning of the experiment represented by the imbalances, the stability of the physical parameters and the precision at convergence quantified by RMS.
Factor levels at the optimum point were obtained by minimizing the second order regression model (taking the first derivative of RMSE P with respect to each factor and equating it to zero). Those dimension-dependent factors kept constant in the Draper-Lin design and were normalized by the hydraulic radius (R H) in m of the section at entrance with the purpose of future use for the validation channels, assuming a direct proportional behavior. The regression coefficients obtained for the optimal point are: A (Te) = 0.432948R H, B (NAP) = -1.80, C (NI) = 1311, D (NF) = 17.4, E (Hd) = 3.2468R H, F (MIPS) = 108 and G (MNP) = 4. The advantage of writing RMSE P as a function of the defined factors through a regression model is that a hydrodynamic model can be configured for a pre-defined RMSE P value, different from that of the optimal point, to meet the specific requirements of a given project that may not need the minimum value of RMSE P .
To validate these results, Han´s channel data were used for a model run with the following configuration: A (Te) = 0.432948R H, B = -1.82, C (NI) = 1696, D (NF) = 15.67, E (Hd) = 3.2468R H, F (MIPS) = 106 and G (MNP) = 4, for R H = 0.0693 m. The simulated RMSE P was 1.05%, a close value to the one predicted by the estimated response surface of 1.15%. These results confirm the validity of the second order regression model used to predict the RMSE P , at least inside our experimental region.
3.5. Model validation
During validation, the model configuration obtained during calibration is used to test it for different geometries and flow conditions looking for a low value for RMSE P , the response variable. In our case, Blanckaert´s, Rozovskii´s and Poggi´s channels (see Table 1 for channels characteristics) were tested with the model factors obtained with Han´s channel data.
After running the model for those geometries and flow conditions, we found a RMSE P = 0.30% for Blanckaert´s channel, values of RMSE P = 4.96% for Rozovskii´s channel and values of RMSE P = 5.13% and 24.04 % for Poggi´s channels with longitudinal slopes of 5% and 10%, respectively. These results indicate that the calibration factors were properly validated for horizontal channels (Blanckaert´s and Rozovskii´s channels) but they are not the proper ones for high slope channels for which high values of RMSE P , away from the optimal one (RMSE P = 1.15%), were obtained. This indicates that the longitudinal slope plays an important role in the factor configuration of the hydrodynamic model. Similar behavior has been reported by Montazeri et al.[18], who obtained a RMSE P = 18.59% for Poggi´s channel of 10% longitudinal slope. This factor, the longitudinal slope, should be added to the list of factors to be considered during the calibration process. Figure 6 shows the simulated water free surface for the Poggi´s channel of 10% longitudinal slope. The flow complexity is evident in this figure.
In the literature, other works related to water flow in curved channels are reported, which present data similar to those analyzed in this article corresponding to water levels. They can be used to analyze in detail the validation of the calibration in future works. Gholami et al.[31], [32] presents transverse water surface profiles on several sections of a curved open channel, with flat bottom and 90° of curvature, including sections on the straight portion of the channel before and after the curve.
The purpose of including an inclined bottom channel was to validate the calibration with a factor not included in the calibration, as is the case of the longitudinal slope. As major errors were obtained with its inclusion, the calibration cannot be extrapolated directly, and it would be necessary to perform a new calibration for sloped bottom curved open channels.
4. Conclusions
The results obtained in this work show that the developed methodology can be applied for calibration and validation of numerical hydrodynamic models which, in our case, was applied, for illustration purposes, to numerical simulation of the free surface flow in curved channels. In our particular case, the results showed that the longitudinal channel slope should be included in the list of factors to be considered for calibration. Without this factor, the obtained factor configuration in the regression model was not the optimal one.
We found that the method for generating points to the center using the theory of near neighbor and down-scaling technique can be applied to data obtained from numerical experimentation to perform the lack of fit test and to find the optimal value of the simulation. We confirm that Daniel and Wood's method, used to perform the lack of fit test, leads to the same conclusions as the standard procedure of the lack of fit test given by Gutiérrez and Vara[1] and Montgomery[2]. In the latter, the near neighbor points are the true points to the center.
The application of the DOE to numerical experiments showed that an acceptable calibration of the hydrodynamic model was obtained. This is a low value for the answer variable, the RMSE P in our case, performing a relative low number of model runs despite the high number of factors considered in the experimental design; especially if the classical 2n experimental design is used when no exact repetitions can be used given the nature of the experiment: a numerical experiment