1. INTRODUCTION
Advances in computational and analytical techniques allow for continuous monitoring of many processes, and new statistical methods are needed to analyze the large data sets arising from them. Functional data analysis (FDA) has emerged in recent decades as an alternative for the statistical modeling of large data volumes. FDA is a framework for analyzing data consisting of random functions (usually curves) rather than observations of a few variables or random vectors [1]. New challenges have arisen in extracting the meaningful information hidden in functional data [2]. As in classical statistics, data preprocessing, modeling, hypothesis testing, parameter estimation, and predictive analysis using parametric or nonparametric models are fields of interest in FDA, and many theoretical and applied contributions have been proposed in these areas [2], [3]. In the last decade, FDA has found applications in several areas of research, including ecology [4], epidemiology [5], remote sensing [4], outlier detection in environmental applications [6], and traffic volume forecasting [7].
To construct a functional observation X_ij(t) from discretely observed data, one can employ a standard smoothing technique such as cubic B-splines [8]. The fda package [9] implements these smoothing techniques in R [10].
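As a minimal illustration (the object names t.obs and y, the simulated data, the grid, and the number of basis functions are our own illustrative choices), discretely observed curves can be smoothed with cubic B-splines in R as follows:

```r
library(fda)

# Hypothetical inputs: t.obs is a vector of observation times and
# y is a (length(t.obs) x n.curves) matrix of discrete measurements.
t.obs <- seq(0, 10, length.out = 101)
y     <- replicate(20, sin(2 * pi * t.obs) + rnorm(length(t.obs), sd = 0.3))

# Cubic B-spline basis (norder = 4) on the observation interval
basis <- create.bspline.basis(rangeval = range(t.obs), nbasis = 25, norder = 4)

# Smooth the discrete data to obtain functional observations X_ij(t)
X.fd <- smooth.basis(argvals = t.obs, y = y, fdParobj = basis)$fd

plot(X.fd, xlab = "t", ylab = "X(t)")
```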
This work focuses mainly on proposing a methodology for comparing groups when the same functional variable has been observed in several individuals in each group. Specifically, a traditional nonparametric tool for the k-sample problem is adapted to the FDA scenario with a functional response. Let X_i1(t), X_i2(t), …, X_in_i(t), i = 1, 2, …, k, be a random set of functions defined over an interval T = [a, b] that come from Gaussian processes GP(μ_i(t), γ_i(s, t)) [8]. The hypothesis of interest is given in (1).
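In terms of the mean functions, the null hypothesis can be written as

H_0: \mu_1(t) = \mu_2(t) = \cdots = \mu_k(t) \quad \text{for all } t \in T, \qquad (1)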
against the alternative that at least two of the functional means differ. The hypothesis in (1) has been widely considered in the statistical literature, and the approaches proposed for it include pointwise t-tests, functional ANOVA, functional principal component analysis, and permutation tests.
The functional ANOVA problem has been studied extensively. For example, [9] introduced an asymptotic version of the ANOVA F-test, and [2] considered asymptotic or bootstrapped versions of an L2-norm-based test, an F-type statistic-based test, and a globalizing pointwise F-test. Furthermore, [1] introduced a method based on a basis-function representation, and [10] described a bootstrap procedure based on pointwise F-tests. Bayesian functional ANOVA has received less attention, although [11] introduced a Gaussian process ANOVA modeling approach under a Bayesian framework.
Other approaches were considered by [12], [9], and [13]. Furthermore, [14] proposed a method with a graphical interface based on the global rank test, implementing the functional ANOVA procedure through permutations. Other authors have used the Westfall-Young randomization to correct for multiple tests, although this method does not yield an overall p-value. Meanwhile, [15] divided the domain of interest into regions, with the disadvantage that the partition must be respected. Furthermore, [16] developed a multi-way functional ANOVA to determine rejection regions. Our interest is to provide an alternative for the case where the Gaussian assumption is unrealistic. Finally, [17] presented a unified methodology for performing computation-free permutation tests for the k-sample problem in commutative and noncommutative L_q spaces, which includes multivariate and functional data.
This work is organized as follows. Sections 2.1 and 2.2 review the Kruskal-Wallis test and random projections. Section 3 presents an extension of the Kruskal-Wallis test for functional data and shows its respective pseudocode. In Section 4.1, we present the simulation study and in Section 4.2, we present the application with real data. Finally, we present the discussion and some conclusions.
2. BACKGROUND
2.1 Kruskal-Wallis test
This section briefly reviews the main statistical technique used in the analysis. The Kruskal-Wallis test [18] is a non-parametric test that compares the medians of two or more independent samples. The null hypothesis is that all samples come from the same population, and the alternative hypothesis is that at least one sample comes from a population with a different median than the others. The test is based on the ranks of the observations and is an alternative to ANOVA when the normality assumption is unrealistic. The hypothesis of interest is shown in (2).
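Writing F_j for the distribution function of the j-th sample, the hypothesis can be stated as

H_0: F_1 = F_2 = \cdots = F_k \quad \text{versus} \quad H_1: F_i \neq F_j \ \text{for some } i \neq j. \qquad (2)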
This establishes that there are no significant differences in the effects of the treatments; that is, the null hypothesis states that the distributions F_1, F_2, …, F_k are equal. To calculate the Kruskal-Wallis statistic, all N observations from the k samples are combined and ordered from smallest to largest. Let r_ij be the rank of X_ij in this joint ranking, and let R_j be defined as in (3).
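In symbols (the group average rank \bar{R}_j is also recorded, since it is used below),

R_j = \sum_{i=1}^{n_j} r_{ij}, \qquad \bar{R}_j = \frac{R_j}{n_j}, \qquad j = 1, \ldots, k. \qquad (3)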
Thus, for example, R_1 is the sum of the ranks received by the observations of group 1, and R̄_1 is the average rank of these same observations. The Kruskal-Wallis statistic H is given by [18] as shown in (4).
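In its standard form, with N = n_1 + \cdots + n_k the total number of observations,

H = \frac{12}{N(N+1)} \sum_{j=1}^{k} \frac{R_j^{2}}{n_j} - 3(N+1). \qquad (4)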
At a significance level of α, H_0 is rejected if H ≥ h_α; otherwise, it is not rejected. The values of h_α are given in Table A.12 of [18]. When H_0 is true, the statistic H has, as min(n_1, …, n_k) tends to infinity, an asymptotic chi-square distribution with k − 1 degrees of freedom. Under this approximation, the rejection rule is:
Reject H_0 if H ≥ χ²_{k−1, α}; otherwise, do not reject.
When the null hypothesis is rejected and it is concluded that at least one sample comes from a population with a different median, some post-hoc tests (e.g., Dunn's test) can be used to identify which samples differ significantly.
2.2 Random Projections
The hypothesis of interest (see hypothesis in (1)) can be tested using projections of the functions. Random projections map high-dimensional data points into a lower-dimensional space using a randomly generated projection matrix [19]. By doing this, the dimensionality of the data is reduced while important information about the data structure is retained.
Random projections are often used when the dimensionality of the data makes it difficult to work with or analyze; in other words, they are a handy tool for reducing the complexity of the data without losing important information. Given a set of data or a distribution in a space of dimension greater than one, random projections consist of projecting the data, or computing the marginal of the distribution, on a randomly chosen lower-dimensional subspace [20]. Random projections preserve certain properties that are very important in FDA. One of them is that distances are preserved with high probability when the projection subspace is drawn uniformly at random; this result extends to the standard Gaussian distribution [10]. In this sense, [21] showed that if two distributions defined on a separable Hilbert space satisfy suitable moment conditions, then projecting them onto a randomly chosen one-dimensional subspace is sufficient to distinguish them with high probability whenever they differ. In other words, if the one-dimensional marginals of the two distributions along a random direction coincide, the distributions themselves coincide; if the distributions differ, their one-dimensional marginals along a random direction will differ, and the two can be distinguished with high probability.
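A small R sketch of the distance-preservation property in the multivariate case (all quantities are simulated for illustration): points in a high-dimensional space are mapped to a low-dimensional space with a random Gaussian projection matrix, and the pairwise distances are approximately preserved.

```r
set.seed(1)

n <- 50; d <- 1000; p <- 20             # sample size, original and projected dimension
X <- matrix(rnorm(n * d), nrow = n)     # high-dimensional data points (one per row)

# Random Gaussian projection matrix, scaled so squared distances are preserved in expectation
R <- matrix(rnorm(d * p, sd = 1 / sqrt(p)), nrow = d)
Z <- X %*% R                            # projected data (n x p)

# Compare pairwise distances before and after projection
cor(as.vector(dist(X)), as.vector(dist(Z)))
```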
Once the functional data have been projected onto a lower-dimensional space, a hypothesis test can be performed to determine whether the functional means are equal. The choice of hypothesis test depends on the specific application, but a common approach is to use a t-test or an ANOVA test. One advantage of using random projections to test the equality of functional means is that it can be computationally efficient, mainly when dealing with high-dimensional functional data. It can also be robust to noise and outliers in the data, as random projections can help filter out some of the noise.
3. KRUSKAL-WALLIS TEST FOR FUNCTIONAL DATA
This research presents an extension of the Kruskal-Wallis test for functional data based on random projections.
We propose extending the Kruskal-Wallis test to the case of functional data (the observation for each individual in the sample corresponds to a functional datum). As in the univariate case, in the context of functional data analysis, statistical tests require the fulfillment of some assumptions. When the samples are small and the curves do not arise from a Gaussian stochastic process, the functional ANOVA could be inappropriate, and a non-parametric method may be used as a valid alternative. Specifically, a Kruskal-Wallis test for functional data based on random projections (KWFD) is proposed as an alternative methodology to the one-way functional ANOVA when the Gaussianity assumption is unrealistic. The KWFD is a non-parametric alternative for comparing the medians of functional data from three or more groups. We extend the KW test by randomly projecting the functional data onto a low-dimensional subspace.
Let X_ij(t), i = 1, 2, …, n_j, j = 1, …, k, be a functional random sample of curves, where t ∈ [a, b] is the domain (generally time), i corresponds to an individual, and j is the index of the factor level. The functional random variables are considered independent trajectories of the stochastic processes SP(μ_j(t), γ(s, t)), j = 1, …, k, with a common covariance function γ(s, t). Let x_ij(t), i = 1, 2, …, n_j; j = 1, …, k, be the recorded set of curves under the k treatments. The following steps describe the procedure for calculating the H statistic to test the null hypothesis in (1).
1. Generate one Brownian motion υ(t) on the interval of interest T ⊂ ℝ.
2. Calculate the random projections x_ij = ∫_a^b x_ij(t) υ(t) dt, i = 1, …, n_j; j = 1, …, k.
3. Calculate the rank r_ij of each projected value in the combined sample.
4. Using these ranks, proceed in the usual way to calculate R_j and the statistic H in (4).
5. Reject the null hypothesis in (1) at level α if H ≥ χ²_{k−1, α}. An alternative is to calculate the p-value using a permutation test.
The Kruskal-Wallis test for functional data based on random projections is calculated similarly to the univariate Kruskal-Wallis test. It is based on the sum of the ranks of the projected curves within each group. The test assumes no specific distribution for the functional data and can be robust to atypical curves.
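A minimal R sketch of the procedure (the function name kw.fd.test, the common evaluation grid, and the trapezoidal approximation of the projection integral are our own illustrative choices, not the authors' published code):

```r
# Kruskal-Wallis test for functional data via one random projection (KWFD).
# Hypothetical inputs: X is a (length(t.grid) x N) matrix holding the curves
# x_ij(t) evaluated on a common grid t.grid, and group is a factor of length N
# giving the treatment of each curve. Uses kruskal.test() from the stats package.
kw.fd.test <- function(X, group, t.grid) {
  # Step 1: one Brownian motion on the grid (cumulative Gaussian increments)
  dt <- diff(t.grid)
  bm <- c(0, cumsum(rnorm(length(dt), sd = sqrt(dt))))

  # Step 2: random projections x_ij = integral of x_ij(t) * v(t) dt,
  # approximated here by the trapezoidal rule
  w    <- bm * c(dt[1] / 2, (dt[-1] + dt[-length(dt)]) / 2, dt[length(dt)] / 2)
  proj <- as.vector(t(X) %*% w)

  # Steps 3-5: ranks and the classical Kruskal-Wallis test on the projections
  kruskal.test(proj, group)
}
```

The call kw.fd.test(X, group, t.grid) returns the usual output of kruskal.test, from which the statistic and the p-value can be read.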
4. RESULTS AND DISCUSSION
Section 4.1 presents a simulation study based on a single Brownian motion simulation. Section 4.2 shows the p-values obtained by generating 1000 random projections.
4.1 Simulation study indicators
We assess the power of the test to detect differences between the medians of k samples of functional data. To establish its performance, we present the results of a simulation study, following the procedure given in [15]. For simplicity, only three groups of curves, generated from the models in (5), are considered.
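Based on the description that follows and on the caption of Figure 1, these models take the form

X_{i1}(t) = \mu(t) + \varepsilon_{i1}(t), \qquad X_{i2}(t) = \mu(t) + \varepsilon_{i2}(t), \qquad X_{i3}(t) = \mu(t) + \delta(t) + \varepsilon_{i3}(t), \qquad (5)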
where μ(t) = sin(2πt), t ∈ (0, 10), is the mean function and the errors ε_ij(t), j = 1, 2, 3, follow a uniform distribution on [−1, 1]. As an initial illustration, a Brownian motion and 120 curves simulated according to the equations in (5) are shown in Figure 1. The curves in red and green are very similar (these come from analogous models, rows 1 and 2 of the equations in (5)), while the curves in blue involve an additional parameter δ(t) = δ = 1.2 that makes them different from the previous ones. Notice in Figure 1 that the highest periodic peaks of the blue curves are close to 3, while in the other two cases (red and green curves) they are close to 2; that is, the null hypothesis should be rejected. The errors are assumed to be uniform on the interval (−1, 1). Performing a hypothesis test on the means of functional data assuming that the processes are Gaussian with data such as those presented in Figure 1 would be inappropriate.

Figure 1 Brownian motion v(t) = v(t − 1) + ϵ(t), ϵ(t) ∼ Normal(0, 0.5), t ∈ (0, 10) (above left) and curves simulated under the models X_i1(t) = μ(t) + ε_i(t) (above right), X_i2(t) = μ(t) + ε_i(t) (below left), and X_i3(t) = μ(t) + δ(t) + ε_i(t) (below right), with μ(t) = sin(2πt), δ(t) = 1.2, and ε(t) ∼ Uniform(−1, 1)
To evaluate the power of the test, we considered δ(t) = δ for all t ∈ [0, 10], with δ = 0.0, …, 0.7. Four sample-size scenarios are considered (n = 10, 30, 80, 120) for each sample group, and in each case 1000 realizations are generated. For each sample size, we performed the Kruskal-Wallis test defined in Section 3, and the power of the test is obtained as the percentage of p-values less than 0.05. We used the R libraries fda.usc and stats to perform the analysis [22]. Figure 2 shows the empirical power curves for each sample size n and each value of δ(t) = δ. Note that the power of the test increases as δ and n increase; that is, the simulation study provides evidence that the Kruskal-Wallis test for functional data is unbiased and consistent. The R code used is available at https://github.com/frajaroco/KWfdRP/blob/main/KWtest.R
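For reference, a condensed sketch of one power setting (this is not the published script; it reuses the kw.fd.test function sketched in Section 3, and the grid, seed, and number of replications are illustrative):

```r
set.seed(123)

t.grid <- seq(0, 10, length.out = 201)
mu     <- sin(2 * pi * t.grid)

# Empirical power for one (n, delta) setting; requires kw.fd.test() from Section 3
power.kwfd <- function(n, delta, n.sim = 1000, alpha = 0.05) {
  group <- factor(rep(1:3, each = n))
  pvals <- replicate(n.sim, {
    # three groups: two identical models and one shifted by delta (models in (5))
    X1 <- replicate(n, mu + runif(length(t.grid), -1, 1))
    X2 <- replicate(n, mu + runif(length(t.grid), -1, 1))
    X3 <- replicate(n, mu + delta + runif(length(t.grid), -1, 1))
    kw.fd.test(cbind(X1, X2, X3), group, t.grid)$p.value
  })
  mean(pvals < alpha)   # empirical power: proportion of rejections
}

# e.g., empirical power for n = 30 curves per group and delta = 0.4
power.kwfd(n = 30, delta = 0.4, n.sim = 200)
```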

Created by the authors
Figure 2 Empirical power curves of the Kruskal-Wallis test according to the variation function δ(t) = δ and the sample size n: n = 10 (blue line), n = 30 (green line), n = 80 (red line), and n = 100 (black line) for each sample group. The bottom dashed line corresponds to the significance level α = 5 %
4.2 Real data analysis: Temperature curves in Canada
We apply the Kruskal-Wallis test for functional data from Section 3 to a meteorological data set widely used in the FDA context [23]. It corresponds to the average daily temperature (30-year averages, in degrees Celsius) at each of 35 weather stations located in four climatic zones of Canada (the number of stations in each zone is given in parentheses): Arctic (4), Pacific (7), Continental (9), and Atlantic (15) (see Figure 3). The Pacific zone is located on the west coast of Canada, including British Columbia and parts of Yukon and the Northwest Territories; this area is defined by mild, rainy winters and cool, dry summers. The Continental region covers the central parts of Canada, including Manitoba, Saskatchewan, and parts of Alberta and Ontario; its climate is marked by cold winters and short, hot summers. The Atlantic zone covers the eastern parts of Canada, including Nova Scotia, New Brunswick, and Prince Edward Island, and has mild, wet winters and cool, moist summers. The Arctic region covers the northernmost parts of Canada, including Nunavut, the Northwest Territories, and parts of Yukon, Quebec, and Labrador; this zone has long, harsh winters and short, cool summers (see Canada's Climate Regions at https://sites.google.com/a/ocsb.ca/cgc-1d/a-unit-4-climate/1-canadas-climate-regions). The daily temperature data for the four climatic zones were smoothed using a Fourier basis; the curves obtained after smoothing are shown in Figure 3. The interest is to determine whether there are significant differences between the mean (median) curves of these zones. For this purpose, we apply the Kruskal-Wallis test presented in Section 3, generating random projections using (6), with i the index of the weather station within each of the four climatic zones (j = 1 (Arctic), 2 (Pacific), 3 (Continental), 4 (Atlantic)) and ν(t) a Brownian motion.
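A sketch of this analysis in R (the basis size, the seed, and the use of the trapezoidal rule are our own illustrative choices; the numerical p-value depends on the particular Brownian motion drawn, and the full script is linked below):

```r
library(fda)
set.seed(2023)

# Daily average temperatures (365 x 35) and climatic zone of each of the 35 stations
temp   <- CanadianWeather$dailyAv[, , "Temperature.C"]
region <- factor(CanadianWeather$region)

# Smooth with a Fourier basis and evaluate the smoothed curves on the daily grid
fbasis  <- create.fourier.basis(c(0, 365), nbasis = 65)
temp.fd <- smooth.basis(day.5, temp, fbasis)$fd
X       <- eval.fd(day.5, temp.fd)                      # 365 x 35 matrix of smoothed curves

# One Brownian motion on the daily grid and the projections in (6)
dt   <- diff(day.5)
bm   <- c(0, cumsum(rnorm(length(dt), sd = sqrt(dt))))
w    <- bm * c(dt[1] / 2, (dt[-1] + dt[-length(dt)]) / 2, dt[length(dt)] / 2)
proj <- as.vector(t(X) %*% w)                           # trapezoidal approximation of the integral

# Classical Kruskal-Wallis test on the projected values
kruskal.test(proj, region)
```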

Created by the authors.
Figure 3 Temperature curves (x ij (t)) for the Atlantic, Continental, Pacific, and Arctic climate zones obtained after daily data (averages of 30 years) are smoothed using Fourier basis functions
After obtaining the random projections, we conduct a classical Kruskal-Wallis test on these values. In this case, a p-value = 0.00361 was obtained and, consequently, in concordance with the climatic description of Canada given above, the null hypothesis is rejected. Note that there are some atypical curves in each panel of Figure 3; a classical ANOVA test based on random projections could be limited in this case, and a robust methodology such as the one proposed here could be more appropriate. Wilcoxon post-hoc tests [24] (Table 1) at a 10 % significance level show that the medians of the Atlantic and Pacific zones are significantly different from the median of the Arctic region. At the same level, there are differences between the medians of the Atlantic and Continental regions. A graphical comparison (Figure 3) indicates marked differences between the curves of these regions.
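One way to carry out such pairwise comparisons in R is, for example, with pairwise Wilcoxon rank-sum tests on the projected values (reusing proj and region from the sketch above; Holm's adjustment is an illustrative choice):

```r
# Post-hoc pairwise Wilcoxon rank-sum tests on the projections from the sketch above
pairwise.wilcox.test(proj, region, p.adjust.method = "holm")
```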
The results described above are based on random projections from one particular Brownian motion. The attached R code (https://github.com/frajaroco/KWfdRP/blob/main/KWCanadianWeather.R) shows the values found with 1000 Brownian motions; the general conclusion is the same.
4.3 Discussion
ANOVA for functional data has been widely discussed, and several approaches have been considered [1], [2], many of which are based on the Gaussianity assumption [8]-[10]. Here, we adapt a classical non-parametric test to this scenario. The strength of the Kruskal-Wallis test for functional data proposed here lies in its versatility: it does not depend on the assumption of Gaussianity, which extends its applicability to real-world scenarios where the data may deviate from a Gaussian distribution. The test is flexible and can be used with various types of functional data, including curves and time series, and it does not impose strict assumptions on the data distribution, making it suitable for analyzing diverse datasets. This approach is particularly advantageous when the data do not conform to normality or have unknown distributions. Like other statistical tests, the Kruskal-Wallis test assumes the independence of observations within and between groups; violations of this assumption could affect the accuracy of the results. If the test indicates significant differences between groups, post-hoc procedures can be conducted to identify which groups differ. Many non-parametric methods are available for post-hoc testing, each with strengths and limitations.
5. CONCLUSIONS
We propose a non-parametric method for the functional k-sample problem, which is useful when the sample size is small, the normality assumption is not reasonable, or there are atypical curves. The method uses one-dimensional random projections: after obtaining scalars from the functions via random projections, a classical Kruskal-Wallis test can be used to test the hypothesis. The results obtained from the simulated and real data show a good performance of the methodology. Figure 2 illustrates that the Kruskal-Wallis test extension performs well under the null hypothesis, and its power increases with larger sample sizes and a larger distance parameter; this plot allows us to validate that the proposed test is unbiased and consistent. Some authors consider pointwise test statistics for two-sample functional data problems and, similarly, for the k-sample problem, although these are not global tests. Our approach is a helpful alternative when the sample is small and the Gaussian assumption is inappropriate.