Introduction
This study is part of a research line which proposes calculation of earthquake hypocentral parameters applying artificial intelligence methods in order to develop an early warning system for the city of Bogota. In case of a destructive seismic event in this area, the entire country would face many harmful social and economic effects; this is why a seismic early warning system around Bogota is important and the earthquake arrival azimuth estimation is one of the main parameters in this system. Nearly a third of Colombia's population lives in Bogota's Savannah and surrounding which is the country's main economic center with around 40% of the gross domestic product (Ojeda et al., 2002). The density of seismological stations around the city is not high enough, this makes the required time for seismic events localization to be longer than the travel time to areas where the early warning is required. An alternative solution to this problem is employing seismological data from previous events recorded at one single station to estimate the earthquake hypocentral parameters (Ochoa et al., 2014). Automatic computation algorithms in a single broadband three-component station have been mainly developed for P and S waves onsets detection, allowing estimation of source location using the back-azimuth and the apparent surface speed measurements (Magotra et al., 1987; Roberts et al., 1989; Saita & Nakamura, 2003), or seismic moment estimation (Talandier et al., 1987; Reymond et al., 1991; Odaka et al., 2003; Horiuchi et al., 2005; Wu et al., 1998; Espinosa, 1995). Supervised machine learning techniques based on kernel methods have become a very powerful tool for mathematicians, scientist, and engineers, providing solution in areas like signal processing and pattern recognition; its implementation is quite simple and can be performed by applying mathematical functions that combine input variables as a combination of themselves, obtaining an enhanced new space with more dimensions where separation of classes can be achieved.
There are several methods to detect seismic wave and its arrival azimuth in a single three-component station (Magotra et al., 1987; Anant & Dowla, 1997); these authors employed algorithms that measure the level of linear polarization in the P wave's arrival. The methodology proposed in this research consists of applying SVMs along with a kernel function in order to estimate the arrival azimuth with minimal processing of data acquired at the station, similar to methodology applied to a fast determination of earthquake magnitude and epicenter distance using a single seismological station (Ochoa et al., 2017; Ochoa et al., 2018).
Data Sets And Methods
The dataset used in this research belongs to the "El Rosal" seismological station, located toward north-west Bogota as shows Figure 1. This station is part of the Colombian Seismic Network operated by The "Servicio Geológico Colombiano - SGC" (Colombian Geological Service).
The "El Rosal" station employs a Guralp CMG - T3E007 sensor in three components and a Nanometrics RD3-HRD24 digitizer, which provides simultaneous sampling of three channels with 24-bit of sampling rate (Bermudez & Rengifo, 2002). The data correspond to the three component raw waveforms recorded directly in this station and a seismic catalog with 2164 characterized events, selected between January 1st 1998 and October 27th 2008; all of them located less than 120 kilometers from the station.
The Colombian seismic network consists of 42 stations, with an average distance of 162 kilometers between them, which record and transmit seismic data in real time for the entire country as shown in Figure 2.
Before starting the processing related to SVMs, waveform files from "El Rosal" station were converted to the American standard code for information interchange (ASCII) format, using a Seisan package tool; earthquakes with magnitudes lower than 2.0 ML were ignored; therefore the followed processes were applied on the remaining 1011 events. Since the selected seismic records present variable levels of noise, it was necessary to filter them out with both high and low-frequency filters. Low frequencies correspond to instrumental noise that can be easily eliminated through implementation of a high-pass filter with cut-off frequency of 0.075 Hz (Wu & Zhao, 2006), while high frequencies were removed with a low-pass filter with cut-off frequency of 150 Hz.
The statistical distribution of azimuth values is presented in Figure 3, where the main distribution of the whole dataset is observed. This histogram shows a high frequency in 240 degrees (azimuth), indicating an active seismogenic zone in that direction. Although this is not a homogeneous distribution, it represents the regular seismic behavior of the area.
Descriptors - Input data set of the SVM
In the first stage, parameters that have been previously used by other authors to earthquake magnitude estimation were calculated and employed as input variables or descriptors for the SVM of this research. In this sense, the relationship between maximum P wave amplitudes and local earthquake magnitudes was considered (Wu & Kanamori, 2005), where a linear regression was performed for each one of the three components of the station. Three parameters were taken from this linear regression which corresponds to the slope (M), an independent term (B) and correlation coefficient (R). The maximum amplitude values (Mx) obtained for each component's time window were used as descriptors as well. Therefore, each event had 12 descriptors associated with this concept.
In the second place, 9 descriptors used for epicenter distance estimation were added adjusting a linear regression of an exponential function in time (t) by applying the expression "Bt exp (-At)"; this expression belongs to the envelope of the seismic record in logarithmic scale (Odaka et al., 2003) determined also by a linear regression and its respective correlation coefficient (R), for each of component in the seismic station. The correlation coefficient (R) along with the parameters (A) and (B) were calculated for each component; where (B) represents the slope of initial part of P waves and (A) is a parameter related to amplitude variations in time.
Maximum eigenvalues of two-dimensional covariance matrix were employed as input, calculated as described in Magotra et al., 1987, and Magotra et al., 1989. A windowing scheme with one second time windows was performed to obtain consecutive values for which a linear regression was calculated, also determining the slope (M), the independent term (B), the regression correlation factor (R), and this time with addition of the arithmetic mean of the eigenvalues (P). This last processing works with all components of the station at the same time, thus 4 descriptors were added as input related to this process.
In summary, the SVM of this study employs 25-time signal descriptors as input (Table 1); 12 of them related to works on magnitude calculation, 9 were associated with epicenter distance estimations and the last 4 were used in the back-azimuth determination. These descriptors were calculated for 5, 10 and 15 seconds signal of the 863 selected events.
The SVM Model
A SVM is a supervised classification technique that has its roots in statistical learning theory and has shown promising empirical results in many practical applications, from handwritten digit recognition to text categorization. SVM also works very well with high-dimensional data and avoids the curse of dimensionality problem (Tan et al., 2006). In geosciences e.g., it have been applied in earthquake characterization (Ochoa et al., 2017), automatic recognition of natural fractures (Leal et al., 2016), automatic indicator of lithologies in open hole logs (Leal et al., 2018), among other applications related to pattern recognition. The SVM model of this research was trained with the refined data set for each time window using the Waikato Environment for Knowledge Analysis WEKA 3.6 (Frank et al., 2016) and the 25 descriptors explained before. This algorithm has strong statistical support and can be easily implemented on the station by electronic processing cards.
After performing several processing tests, a standard normalized polynomial kernel was selected. In order to choose the kernel exponent and the complexity factor, correlation factors and minimum absolute error obtained by a 10 fold cross-validation process were compared. These processes were carried out testing multiple combinations of exponents and complexity factor for selected earthquake magnitudes and time signals. The correlation coefficient calculated for each partition corresponds to the Pearson's coefficient, which measure the linear relationship between two variables independently of their scales. This coefficient takes values between 1 and - 1; a value of zero means that a linear relationship between two variables could not be found. A positive value of this relation means that two variables change in the same way, i.e. high values of one variable correspond to high values of the other and vice versa. The closer this value is to one, the greater certainty that two variables have a linear relation. The Pearson's coefficient in this research is as high as 0.6, showing the relevance of SVM in the estimation of earthquake arrival azimuth; the lower value of this coefficient was 0.096 as shown in Table 4.
Results and discussion
Using the 25 descriptors and real magnitudes for each seismic event, a group of 12 datasets was evaluated (Table 2). Each dataset corresponds to combinations of 4 minimum magnitude filters (2.0, 2.5, 3.0 and 3.5 ML) and 3 signal length filters (5, 10 and 15 seconds), evaluating combinations of 7 values for kernel exponent (E = 1.5, 2, 4, 5, 10, 20 and 50) and 6 values for complexity factor (C = 1, 3, 5, 10, 20 and 50); testing 504 models of SVMs in order to find the combination of parameters with the best correlation factor in arrival azimuth determination. Table 2 shows values of correlation coefficients in each combination of cut-off magnitude and time signals where kernel exponents and complexity factors were calculated. According to Table 2, the best correlation coefficient is 0.6 for a time signal of 15 seconds (Magnitude > 2 ML).
Table 3 shows statistical summary for the best model of earthquake arrival azimuth in each combination of time signal (15, 10 and 5 seconds) and magnitude (2, 2.5, 3 and 3.5 ML). According to skewness values, all distributions present similar shape except samples from 5 seconds and 3.5ML (Skewness = -4.86), theses samples show a negative skew related to low number of samples for this set (count = 33). The mean arithmetic can be affected by extremely low and high values; therefore, any interpretation from this parameter might produce a wrong perception of the data. High kurtosis values are related to over-fitting in some combinations and lower kurtosis are related to time signals of 10 and 15 seconds with cut-off of 3.5 ML, which can be interpreted as the arrival azimuth estimation may be reliable for earthquake greater than 3.5 ML and time signal of 15 seconds; however, the objective of this research is to develop the best model with the lowest base magnitude; in consequence and according to Table 3 the best time signal should be 5 seconds and cut-off magnitude of 3.0 ML.
The choice of SVM final parameters is shown in Table 4, where Pearson's correlation coefficient and mean absolute error are presented for each combination of kernel exponent "E" and complexity factor "C", all of them for the time signal and the cut-off magnitude previously selected. These parameters were calculated using SVM algorithms in WEKA 3.6 (Frank, et al. 2016) with a standard normalized polynomial kernel and 10 fold cross-validations. According Table to 4, the earthquake arrival azimuth can be performed at "El Rosal" station using the SVM based model with a normalized polykernel of exponent 2 and complexity factor of 10 for a time signal of 5 seconds and 3.0 ML of earthquake cut-off magnitude, showing a standard deviation of 45.4 degrees.
Figure 4 shows the cross-plot with a relationship between the real arrival azimuth (X-axis) and the azimuth calculated by the model (Y-axis), where a normal statistical pattern can be observed in the distribution of residual values (histogram). The dashed blue line represents the linear behavior of predicted data, corresponding to the locus where prediction is equal to real values. This model tends to place the prediction further south to the real location, because of seismogenic zones toward east and west of the station producing more data in those directions and lower among of information from north and south; this is a normal operational condition of "El Rosal" station and therefore, it is a behavior implicit into the model.
Conclusions & Recommendations
This model is proposed and evaluated for fast earthquake arrival azimuth determination, based on support vector machine regression through pattern recognition and characterization of earthquake signals recorded on a three components seismic station in only 5 seconds, anticipating the arrival of earthquakes in the city of Bogota. Additionally, this model can be implemented directly in the electronic devices of a seismological station, where the main mathematical process corresponds to a simple matrix product, a kernel function of exponent 2 and complexity factor of 10 with an earthquake of 3.0 ML.
The accuracy reported in this research is lower compared with results reported by Lockman & Allen, 2005 and Eisermann et al., 2015; however it must to be consider that these authors works with data from several seismological station and not with a single station as this research does. Nevertheless, the result of 45.4 degrees in earthquake arrival azimuth estimation showed in this research is an improvement on that of Noda et al., 2012, who had standard deviation between 49.0 and 67.9 degrees working also with a single station.
The implementation of additional input variables such as predominant period, Fourier and wavelet frequency spectra should be considered in order to obtain higher correlation factors. Furthermore, the use of an updated dataset is recommended, adding information from October 27th 2008 to present; this new data along with additional input variables might improve the model performance and reaching better estimation of earthquake arrival azimuth.
It is important to find ways to improve the prediction accuracy based on further research, supported by computational intelligence and geophysics research groups as well as the seismological network in Bogota's Savannah and its surroundings managed by the Universidad Nacional de Colombia.