Hierarchical mining algorithm for high-dimensional spatio-temporal big data based on association rules

Traditional data mining algorithms focus too heavily on a single dimension of the data, either time or space, and ignore the association between the two. This leads to a large amount of computation and low processing efficiency, and makes it difficult to guarantee the final mining result. In response to these problems, a hierarchical mining algorithm based on association rules for high-dimensional spatio-temporal big data is proposed. Building on traditional association rules, spatio-temporal association rules are established, and redundant records are then cleaned from the data to be mined. After the local linear embedding algorithm is used to reduce the dimensionality of the data, a hierarchical mining strategy is developed that searches for frequent predicates to form a spatio-temporal transaction database, thereby realizing high-dimensional spatio-temporal big data mining. The simulation results verify that the algorithm has high complexity and can effectively reduce the processing volume, improving processing efficiency by at least 56.26% compared with the other algorithms.


1 Introduction
The essential function of spatio-temporal data is to reflect the quantitative and qualitative characteristics, spatial structure and spatial relationships of elements or phenomena in space-time, together with their changes over time; it is the basis for human cognition of the geographic world [1][2]. Spatio-temporal data reflects the spatio-temporal laws of human activities and is the basic spatio-temporal framework for all big data collection and aggregation. Spatio-temporal big data is the fusion of big data and spatio-temporal data, i.e., big data that takes the Earth as its object, is based on a unified spatio-temporal datum, and covers activities directly or indirectly associated with location in space-time [3]. With the continuous development of spatio-temporal detection technology, high-dimensional spatio-temporal big data has emerged as spatio-temporal big data of higher dimensionality and complexity. The prevalence of high-dimensional spatio-temporal big data makes the study of high-dimensional data mining of great importance. Traditional spatio-temporal big data mining algorithms often focus on temporal granularity or spatial density; when applied to high-dimensional spatio-temporal data, they suffer from low mining efficiency because of the high data dimensionality and the large computational load of the algorithm [4][5]. Based on the above analysis, in order to improve the efficiency of high-dimensional spatio-temporal big data mining and reduce the amount of processing in the mining process, this paper studies an association rule-based hierarchical mining algorithm for high-dimensional spatio-temporal big data.

2 Association rule-based hierarchical mining algorithm for high-dimensional spatio-temporal big data

2.1 Establishing spatio-temporal data association rules
To measure the reliability of association rules, two parameters, support and confidence, are introduced.
Support refers to the percentage of data transactions in the database $D$ that contain both $A$ and $B$; it indicates the prevalence of the rule $A \Rightarrow B$ in $D$. Confidence refers to the ratio of the number of transactions containing both $A$ and $B$ to the number of transactions containing $A$, and represents the degree of certainty that $A \Rightarrow B$ holds.
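Written out, the two measures described above take the following standard textbook form. This is a sketch of the usual definitions, not a transcription of the paper's typeset equations; $T$ denotes a single transaction and $|\cdot|$ counts transactions, both of which are symbols introduced here for illustration.

```latex
\[
\mathrm{support}(A \Rightarrow B) = P(A \cup B)
  = \frac{\left|\{\, T \in D : A \cup B \subseteq T \,\}\right|}{|D|}
\]
\[
\mathrm{confidence}(A \Rightarrow B) = P(B \mid A)
  = \frac{\mathrm{support}(A \cup B)}{\mathrm{support}(A)}
\]
```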
In data mining using association rules, the minimum support and minimum confidence must be set according to the size of the mining object and the purpose of the mining task [6]. A rule that satisfies both the minimum support and the minimum confidence is called a strong association rule. To facilitate the calculation, support and confidence are expressed as values between 0% and 100% instead of values between 0 and 1. From equation (1), the following relation holds [7].
In equation (1), confidence is the confidence of the association rule; support is the support of the association rule.
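The support and confidence measures, and the strong-rule test against the two minimum thresholds, can be illustrated with a minimal Python sketch. The function names, the toy database `D` and the choice to treat the thresholds as inclusive are assumptions made for illustration; the values are percentages between 0 and 100, as in the text.

```python
from typing import FrozenSet, List

Transaction = FrozenSet[str]


def support(transactions: List[Transaction], itemset: FrozenSet[str]) -> float:
    """Percentage (0-100) of transactions that contain every item in `itemset`."""
    if not transactions:
        return 0.0
    hits = sum(1 for t in transactions if itemset <= t)
    return 100.0 * hits / len(transactions)


def confidence(transactions: List[Transaction],
               antecedent: FrozenSet[str],
               consequent: FrozenSet[str]) -> float:
    """Percentage of transactions containing the antecedent that also contain the consequent."""
    base = support(transactions, antecedent)
    if base == 0.0:
        return 0.0
    return 100.0 * support(transactions, antecedent | consequent) / base


def is_strong(transactions: List[Transaction],
              antecedent: FrozenSet[str],
              consequent: FrozenSet[str],
              min_support: float,
              min_confidence: float) -> bool:
    """A rule is strong when it meets both thresholds (inclusive here, by assumption)."""
    s = support(transactions, antecedent | consequent)
    c = confidence(transactions, antecedent, consequent)
    return s >= min_support and c >= min_confidence


# Toy database D (hypothetical data, for illustration only).
D = [frozenset(t) for t in (["a", "b"], ["a", "b", "c"], ["a"], ["c"])]
A, B = frozenset({"a"}), frozenset({"b"})
```

In this toy database, two of the four transactions contain both items and three contain `a`, so the rule has a support of 50% and a confidence of about 66.7%.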
For spatio-temporal big data, the association rules between the data, i.e., spatio-temporal association rules, are established based on the traditional association rules with the introduction of spatial and temporal constraints.
The core of the spatio-temporal association rule lies in spatial association analysis and time segmentation. In the rule expression, the rule contains at least one spatial predicate, valid-time is the effective time, $s\%$ is the support of the spatio-temporal association rule, and $c\%$ is the confidence of the spatio-temporal association rule. Spatio-temporal association rules take the constraints of time and space into account, i.e., the knowledge discovered in the dataset by spatio-temporal association rules is valuable only within a certain valid time and a certain space. After the above association rules for spatio-temporal big data mining are determined, the high-dimensional spatio-temporal big data to be mined is pre-processed in order to reduce the complexity of the later dimensionality reduction.
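The effect of the time and space constraints on a spatio-temporal association rule can be illustrated with a small Python sketch. The `STTransaction` record, the use of a region label as the spatial constraint and the predicate names are all hypothetical; the sketch also assumes that support and confidence are measured only over the transactions that fall inside the valid time and the chosen region.

```python
from dataclasses import dataclass
from typing import FrozenSet, List, Tuple


@dataclass(frozen=True)
class STTransaction:
    time: float                 # timestamp of the observation
    region: str                 # spatial unit the observation falls in
    predicates: FrozenSet[str]  # predicates that hold for this observation


def st_rule_measures(transactions: List[STTransaction],
                     antecedent: FrozenSet[str],
                     consequent: FrozenSet[str],
                     valid_time: Tuple[float, float],
                     region: str) -> Tuple[float, float]:
    """Return (s%, c%) of antecedent => consequent within valid_time and region."""
    start, end = valid_time
    # Keep only transactions that satisfy the temporal and spatial constraints.
    scoped = [t for t in transactions
              if start <= t.time <= end and t.region == region]
    if not scoped:
        return 0.0, 0.0
    n_a = sum(1 for t in scoped if antecedent <= t.predicates)
    n_ab = sum(1 for t in scoped if (antecedent | consequent) <= t.predicates)
    s = 100.0 * n_ab / len(scoped)
    c = 100.0 * n_ab / n_a if n_a else 0.0
    return s, c


# Hypothetical observations: only the first three fall inside (0, 5) and region "r1".
txs = [
    STTransaction(1, "r1", frozenset({"near(river)", "flooded"})),
    STTransaction(2, "r1", frozenset({"flooded"})),
    STTransaction(3, "r1", frozenset({"near(river)", "flooded"})),
    STTransaction(9, "r1", frozenset({"near(river)", "flooded"})),
    STTransaction(2, "r2", frozenset({"near(river)", "flooded"})),
]
s, c = st_rule_measures(txs, frozenset({"near(river)"}), frozenset({"flooded"}),
                        valid_time=(0, 5), region="r1")
```

The same rule evaluated over a different valid time or region would in general yield different support and confidence values, which is the sense in which the discovered knowledge is tied to a particular time and space.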

2.2 High-dimensional spatio-temporal big data pre-processing
There may be duplicate data in the high-dimensional spatio-temporal big data to be processed. To prevent redundant data from reducing mining efficiency, the data is cleaned based on the clustering principle. The detection of duplicate data is divided into field detection and record detection, of which field detection is the core problem. If the values of attribute $A_i$ in two sets of spatio-temporal big data are $x$ and $y$, respectively, their attribute similarity is calculated by the following formula [7][8][9].
In the above equation, $d$ is the preset data attribute similarity threshold, $d_{\min}$ is the minimum distance between the spatio-temporal big data attribute values $x$ and $y$, and $d_{\max}$ is the maximum distance between them. The attribute similarity is used to calculate the matching degree of records: the weight of each attribute in the record is combined with its attribute similarity, and the record similarity is defined as follows.

 
$\mathrm{Sim}(x, y)$ is the record similarity of spatio-temporal big data, and $n$ is the total number of data attributes, such as the temporal and spatial dimensions, of the spatio-temporal big data. This formula must be used in combination with the following formula for calculating attribute weights.
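The idea of combining attribute weights with attribute similarities can be sketched in Python as a weighted sum. This is only an illustration under stated assumptions: the attribute similarity used here (a distance rescaled between `d_min` and `d_max`) and the normalization of the weights so that they sum to 1 are hypothetical choices, not the paper's formulas.

```python
from typing import Sequence, Tuple


def attribute_similarity(x: float, y: float, d_min: float, d_max: float) -> float:
    """Hypothetical attribute similarity: 1 at distance d_min, 0 at distance d_max."""
    if d_max <= d_min:
        raise ValueError("d_max must be greater than d_min")
    scaled = (abs(x - y) - d_min) / (d_max - d_min)
    return 1.0 - min(max(scaled, 0.0), 1.0)  # clamp to [0, 1]


def record_similarity(x: Sequence[float],
                      y: Sequence[float],
                      weights: Sequence[float],
                      bounds: Sequence[Tuple[float, float]]) -> float:
    """Weighted sum of per-attribute similarities over the n attributes of two records."""
    if not (len(x) == len(y) == len(weights) == len(bounds)):
        raise ValueError("all inputs must describe the same n attributes")
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must have a positive sum")
    return sum(
        (w / total) * attribute_similarity(xi, yi, lo, hi)  # weights normalized to sum to 1
        for xi, yi, w, (lo, hi) in zip(x, y, weights, bounds)
    )


# Two hypothetical records with n = 2 attributes; the second attribute weighs three times more.
sim = record_similarity([0.0, 10.0], [0.0, 20.0],
                        weights=[1.0, 3.0],
                        bounds=[(0.0, 10.0), (0.0, 20.0)])
```

With normalized weights, two identical records always receive a similarity of 1, which makes a fixed threshold easier to interpret across attribute sets of different sizes.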
The record similarity is therefore calculated by first computing the attribute similarity for each attribute $i$ and then applying formula (4) to obtain the similarity between records. After the similarity between data is determined, different types of data attributes are matched to achieve data cleaning. The principle of data field matching is that a string is viewed as an arrangement of characters. A simple definition of similarity follows from this view: the ratio of the number of identical characters to half of the combined length of the two fields is called the similarity. If two strings are identical, they must match. The specific steps of field matching are as follows.
In the first step, the strings in the original field are sorted. In the second step, each string is compared in order against the strings of the other records, and the number of matches is recorded. In the third step, the matching degree is calculated by the formula. For the high-dimensional spatio-temporal big data to be processed, records with matching fields are redundant, and data cleaning is completed by deleting them. After these pre-processing steps, the dimensionality of the high-dimensional spatio-temporal big data is reduced to improve the efficiency of subsequent data mining.
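The three field-matching steps and the removal of redundant records can be sketched in Python as follows. The sketch rests on several assumptions: identical characters are counted as a multiset intersection (so character order is ignored), the similarity is the number of shared characters divided by half the combined field length, and a field is treated as redundant when its similarity to an already kept field reaches a hypothetical `threshold`.

```python
from collections import Counter
from typing import List


def field_similarity(a: str, b: str) -> float:
    """Shared characters divided by half of the combined length of the two fields."""
    if not a and not b:
        return 1.0  # two empty fields are identical
    shared = sum((Counter(a) & Counter(b)).values())  # multiset intersection
    return shared / ((len(a) + len(b)) / 2)


def drop_redundant(fields: List[str], threshold: float = 1.0) -> List[str]:
    """Keep one representative of each group of matching fields."""
    kept: List[str] = []
    for field in sorted(fields):                      # step 1: sort the strings
        # Steps 2 and 3: compare against the kept strings and compute the matching degree.
        if any(field_similarity(field, k) >= threshold for k in kept):
            continue                                  # matching field: redundant, so skip it
        kept.append(field)
    return kept


# Hypothetical field values, for illustration only.
cleaned = drop_redundant(["beijing", "shanghai", "beijing"])
```

Lowering `threshold` below 1.0 would also merge near-duplicates such as fields that differ by a single character, at the cost of a higher risk of deleting distinct records.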

2.3 Dimensionality reduction of high-dimensional spatio-temporal big data
High-dimensional spatio-temporal big data contains a large amount of useful information, but the curse of dimensionality caused by its high dimensionality also makes mining such data exceptionally difficult. It is therefore necessary to reduce the dimensionality of the data while retaining the useful information it contains.
In this paper, the local linear embedding algorithm is chosen to reduce the dimensionality of high-dimensional spatio-temporal big data. The local linear embedding algorithm converts a global nonlinear problem into local linear problems by expressing the overall global structure through overlapping local neighborhoods. The global low-dimensional representation of the data points is obtained by recombining each sample point and its neighborhood after dimensionality reduction according to certain rules [10]. For a given high-dimensional spatio-temporal big data set, a local reconstruction weight matrix is calculated from the nearest neighbors of each sample point, and the following error function is defined.
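In the standard formulation of local linear embedding, the reconstruction error and the embedding loss are usually written as below. They are given here as the common textbook form, not as a transcription of equations (6) and (7); the symbols $x_i$ (sample points), $w_{ij}$ (entries of $W$), $y_i$ (rows of $Y$), $N$ (number of samples) and $\mathcal{N}(i)$ (neighbor indices of $x_i$) are assumptions introduced for illustration.

```latex
\[
\varepsilon(W) = \sum_{i=1}^{N} \Bigl\| x_i - \sum_{j \in \mathcal{N}(i)} w_{ij}\, x_j \Bigr\|^2,
\qquad \sum_{j \in \mathcal{N}(i)} w_{ij} = 1
\]
\[
\Phi(Y) = \sum_{i=1}^{N} \Bigl\| y_i - \sum_{j \in \mathcal{N}(i)} w_{ij}\, y_j \Bigr\|^2
        = \operatorname{tr}\bigl(Y^{\mathsf T} M\, Y\bigr),
\qquad M = (I - W)^{\mathsf T} (I - W)
\]
```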
In equation (7), $Y$ is the output vector matrix after dimensionality reduction of the data, $I$ is the unit matrix, and $W$ is the weight matrix. To minimize the loss function, $Y$ is taken as the eigenvectors corresponding to the smallest $d$ non-zero eigenvalues of $M$. Usually, the eigenvectors corresponding to the 2nd through $(d+1)$-th smallest eigenvalues are taken as the output. After the dimensionality reduction of high-dimensional spatio-temporal big data, the spatio-temporal big data hierarchical mining strategy is developed to complete the hierarchical mining of high-dimensional spatio-temporal big data.
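The dimensionality reduction step described above can be sketched with NumPy as follows. This is a minimal illustration of standard local linear embedding, not the paper's implementation: the function name `lle`, the regularization term `reg` and the brute-force neighbor search are assumptions, and the sketch assumes the neighborhood graph is connected so that $M$ has a single zero eigenvalue.

```python
import numpy as np


def lle(X, n_neighbors, d, reg=1e-3):
    """Return (Y, W): the N x d embedding and the N x N reconstruction weight matrix."""
    X = np.asarray(X, dtype=float)
    N = X.shape[0]
    if not 0 < n_neighbors < N:
        raise ValueError("n_neighbors must be between 1 and N - 1")
    if not 0 < d < N:
        raise ValueError("d must be between 1 and N - 1")

    # Brute-force squared distances; the diagonal is excluded from the neighbor search.
    sq = np.sum(X ** 2, axis=1)
    dist = sq[:, None] + sq[None, :] - 2.0 * (X @ X.T)
    np.fill_diagonal(dist, np.inf)
    neighbors = np.argsort(dist, axis=1)[:, :n_neighbors]

    # Local reconstruction weights: minimize |x_i - sum_j w_ij x_j|^2 with sum_j w_ij = 1.
    W = np.zeros((N, N))
    for i in range(N):
        Z = X[neighbors[i]] - X[i]                    # neighbors centered on x_i
        C = Z @ Z.T                                   # local Gram matrix
        C += np.eye(n_neighbors) * reg * max(np.trace(C), 1e-12)  # keep C well conditioned
        w = np.linalg.solve(C, np.ones(n_neighbors))
        W[i, neighbors[i]] = w / w.sum()              # enforce the sum-to-one constraint

    # Embedding: eigenvectors of M = (I - W)^T (I - W) for the 2nd to (d+1)-th smallest eigenvalues.
    I = np.eye(N)
    M = (I - W).T @ (I - W)
    _, eigvecs = np.linalg.eigh(M)                    # eigenvalues in ascending order
    Y = eigvecs[:, 1:d + 1]                           # skip the eigenvector of the (near-)zero eigenvalue
    return Y, W


# Hypothetical data: 40 samples with 6 attributes reduced to 2 dimensions.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(40, 6))
Y_demo, W_demo = lle(X_demo, n_neighbors=8, d=2)
```

The first eigenvector is skipped because every row of $W$ sums to 1, so the constant vector is always an eigenvector of $M$ with eigenvalue zero and carries no information about the data.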

3 Simulation experimental research
Spatio-temporal data contains a large amount of information, and the performance of a data mining algorithm directly determines how well the useful information in spatio-temporal data can be exploited. To address the limitations of traditional data mining methods in dealing with high-dimensional data, a hierarchical mining algorithm based on association rules for high-dimensional spatio-temporal big data was proposed above. In this section, the performance of this data mining algorithm is tested by simulation experiments.

3.1 Experiment content
This simulation experiment compares the performance of the data mining algorithm proposed in this paper with that of the spatio-temporal transaction-based mining algorithm and the clustering mining algorithm. The data mining algorithm proposed in this paper is used as the experimental group, and the spatio-temporal transaction-based mining algorithm and the clustering mining algorithm are used as comparison group 1 and comparison group 2, respectively. The performance of the algorithms is compared on three indexes measured while each algorithm performs data mining: algorithm complexity, hardware load, and processing efficiency.

3.2 Experiment Preparation
Test data for the experiment: the data set contains 1292567 records in total, all of which are high-dimensional spatio-temporal big data collected from various monitoring channels and none of which are spatial transactions that represent only binary items. The collected high-dimensional spatio-temporal big data are assembled into a data set, and the total number of data attributes in this data set is 13.
Experimental environment: Intel(R) Celeron(R) M CPU 964@3.720 GHz, 16 GB RAM, operating system Windows 8. All data mining algorithms in this simulation experiment are implemented on the Visual C# 2010 .NET development platform. The experimental data were processed with MATLAB 2012a. Before the experiments, a training set consisting of low-dimensional data was used to test the operation of the three algorithms on the computer simulation platform in order to avoid interference during the experiments.

3.3 Experimental Results
The experimental data were mined using the three data mining algorithms while the minimum support set in the mining process was varied. The processing complexity of the algorithms is compared through the change in their time and space complexity under different minimum supports. The algorithm complexity of the three data mining algorithms is shown in figure 2, where figure 2(a) shows the curve of time complexity against the minimum support and figure 2(b) shows the curve of space complexity against the minimum support. In the mining process, the larger the number of transaction items, the higher the complexity of the corresponding algorithm, and the less the algorithm repeats processing computations. The time complexity of comparison algorithm 1 increases and then stabilizes as the support decreases, while the time complexity of comparison algorithm 2 increases slowly and then decreases rapidly, indicating that the time complexity of comparison algorithm 2 is better than that of comparison algorithm 1 within a certain range, but overall, the time complexity of the comparison algorithm is higher. In terms of space complexity, the curve of the algorithm in the paper shows two large increases, but its overall trend is to increase as the support decreases. The space complexity of the two comparison algorithms increases slowly and then decreases. The above analysis shows that the algorithm in the paper has a higher time complexity and space complexity than the comparison algorithms when performing data mining, i.e., the algorithm has a better complexity.
The hardware load and processing time of the data mining algorithms when mining the experimental data are shown in table 1. The data in the experimental dataset are divided into subsets of different data volumes and labeled in ascending order of volume. The processing efficiency of the algorithms and the load they place on the hardware are then compared for spatio-temporal data mining at the different data volumes. The data in table 1 show that, as the data volume of the mining object increases, both the processing time and the hardware space occupation of the two comparison algorithms increase, which indicates that the load pressure these algorithms place on the hardware is also increasing. In comparison, the algorithm in the paper shows almost no increase in hardware load pressure when mining data of different volumes, and its mining time is significantly shorter than that of the two comparison algorithms. Further processing of the data in table 1 shows that the average processing time of the algorithm in the paper is 9.08 s, compared with 20.76 s for comparison algorithm 1 and 29.6 s for comparison algorithm 2. On these figures, the processing efficiency of the algorithm in the paper is improved by at least about 56.26%. In summary, the association rule-based hierarchical mining algorithm for high-dimensional spatio-temporal big data studied in this paper has a high complexity and significantly improved processing efficiency.
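The efficiency figure can be checked from the three average processing times quoted above. The short Python sketch below assumes that the improvement is measured as the relative reduction in average processing time with respect to each comparison algorithm; the function name `efficiency_gain` is introduced for illustration.

```python
def efficiency_gain(proposed_s: float, baseline_s: float) -> float:
    """Relative reduction in processing time, as a percentage of the baseline."""
    return 100.0 * (baseline_s - proposed_s) / baseline_s


# Average processing times in seconds, as reported in the text.
proposed = 9.08
baselines = {"comparison algorithm 1": 20.76, "comparison algorithm 2": 29.6}

gains = {name: efficiency_gain(proposed, t) for name, t in baselines.items()}
smallest_gain = min(gains.values())
```

Under this assumption the gain is about 56.26% relative to comparison algorithm 1 and about 69.32% relative to comparison algorithm 2, so the smaller of the two matches the "at least about 56.26%" stated in the text.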

4 Conclusion
With the development of modern technology, the importance of spatio-temporal big data applications is increasingly recognized. To address the problems of traditional data mining algorithms, this paper proposes a hierarchical mining algorithm for high-dimensional spatio-temporal big data based on association rules, and comparative simulation experiments show that the algorithm has a high spatio-temporal complexity and can effectively reduce the computation in the data mining process. In order to improve the complexity of the algorithm, this paper adopts the strategy of dimensionality reduction followed by hierarchical mining; to keep pace with the development of spatio-temporal data, future work should attempt to process the data directly in order to further improve the processing efficiency of the algorithm.