Features Clustering Around Latent Variables for High Dimensional Data

Clustering of variables is the task of grouping similar variables into groups. It may be useful in several situations, such as dimensionality reduction, feature selection, and redundancy detection. In the present study, we combine two feature clustering methods: the clustering of variables around latent variables (CLV) algorithm and the k-means based co-clustering (kCC) algorithm. Indeed, classical CLV cannot be applied to high dimensional data because this approach becomes tedious when the number of features increases.


Introduction
Cluster analysis is the process of partitioning data into subsets with similar characteristics, constructed in such a way that objects in the same group are similar to each other [1] [2]. Clustering techniques also cover the task of grouping similar variables into groups [3].
The clustering of variables is an alternative technique that arranges variables into homogeneous clusters [4,5]. Moreover, it may be used to describe objects. The classification of variables around latent variables (CLV) method [4] involves two stages: a hierarchical clustering algorithm is used to find an initial partition of the dataset, followed by a partitioning algorithm. The CLV method works well when applied to small datasets, but it is hard to apply to high dimensional datasets because of memory and analysis barriers.
Large datasets are difficult to manage, visualize, and analyze because of their sheer volume. There are two computational barriers to the analysis of large datasets: (i) the data can be too big to fit in a computer's memory; (ii) the computing task can take too long to wait for the results [5].
The CLV method shows good results when applied to several datasets; however, it struggles with high dimensional data [6]. Indeed, it is no longer able to group similar variables into a meaningful structure, and it becomes computationally expensive in terms of memory and execution time.
There is a need to manage such large volumes of data and to cluster them easily for data analysis [7], while choosing the number of clusters and maximizing the clustering criterion for large datasets. Therefore, CLV algorithms should be efficient, scalable, and highly accurate. There is a need to enhance them to suit large datasets.
* G.ezzarrad@gmail.com
To deal with this problem, we build on the k-means based co-clustering (kCC) algorithm [8], which proposes a better cluster initialization that deals with outliers. We propose a novel method that uses the cluster initialization algorithm of kCC in the CLV algorithm, instead of the hierarchical clustering algorithm originally proposed for CLV.
The rest of this paper is organized as follows. In Section 2, the background work related to the CLV and kCC algorithms is briefly described. The proposed solution is explained in Section 3. Section 4 presents our results, and we conclude this paper by discussing some limitations and prospects in Section 5.

Clustering of variables around latent variables
The objective of Classification of variables around Latent Variables (CLV) [4] is to find meaningful clusters of variables according to the criteria and algorithms on which the method is based. The aim of the CLV method is to find a couple $(P, C)$, formed by a partition $P = (G_1, \dots, G_K)$ of the variables and the associated latent variables $C = (c_1, \dots, c_K)$, that maximizes the internal coherence of the clusters according to the following clustering criterion:

$$\mathcal{L}(P, C) = \sum_{k=1}^{K} \sum_{x_j \in G_k} \Gamma(x_j, c_k) \qquad (1)$$

where $\Gamma$ measures the linear link, in each cluster $G_k$, between the variables $x_j$ in this cluster and $c_k$, the associated latent variable.
Two types of criteria $\mathcal{L}(P, C)$ are considered, which define two different types of clusters of variables: directional groups and local groups.

Directional groups
Positively or negatively correlated variables are merged together, regardless of the sign of their correlation coefficients (Figure 1), and the $\Gamma$ considered for maximization in (1) is:

$$\Gamma(x_j, c_k) = \mathrm{cor}^2(x_j, c_k) \qquad (2)$$

where $x_j$ ($j = 1, \dots, p$) are the variables to be clustered, $\mathrm{cor}^2(x_j, c_k)$ is the squared correlation measuring the link between the variable $x_j$ and the latent variable $c_k$, and the variance of $c_k$ is equal to 1.

Local groups
Only positively correlated variables are associated in the same group (Figure 2), and the $\Gamma$ considered for maximization in (1) is:

$$\Gamma(x_j, c_k) = \mathrm{cor}(x_j, c_k) \qquad (3)$$
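The difference between the two group types can be checked numerically. The following is a minimal sketch, assuming that the link is the squared correlation for directional groups and the signed correlation for local groups; the function names `standardize` and `clv_criterion` are hypothetical and not part of any CLV implementation.

```python
import numpy as np

def standardize(X):
    """Center each column and scale it to unit variance."""
    X = np.asarray(X, dtype=float)
    return (X - X.mean(axis=0)) / X.std(axis=0)

def clv_criterion(X, labels, latents, kind="directional"):
    """Sum of the link between every variable and its cluster's latent variable.

    X       : (n, p) data, variables in columns
    labels  : length-p integer array, labels[j] = cluster index of variable j
    latents : (n, K) array, column k is the latent variable of cluster k
    kind    : "directional" (squared correlation) or "local" (correlation)
    """
    Z = standardize(X)
    C = standardize(latents)
    n, p = Z.shape
    # cor[j, k] = correlation between variable j and latent variable k
    cor = Z.T @ C / n
    link = cor[np.arange(p), labels]
    return float(np.sum(link ** 2 if kind == "directional" else link))
```

With a variable and its negative in the same cluster, the directional criterion counts both as perfectly linked, whereas the local criterion penalizes the negatively correlated one.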

Latent Variable
In cluster $G_k$, the latent variable $c_k$ used in (2) and (3) is defined as the first standardized principal component of $X_k$, the matrix formed by the variables belonging to $G_k$, for directional groups, and as the standardized centroid variable of $G_k$ for local groups.
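The two definitions of the latent variable can be sketched as follows. This is an illustrative sketch only: it assumes the variables of the cluster are standardized before the principal component or the centroid is computed, and the function name `latent_variable` is hypothetical.

```python
import numpy as np

def standardize(X):
    """Center each column and scale it to unit variance."""
    X = np.asarray(X, dtype=float)
    return (X - X.mean(axis=0)) / X.std(axis=0)

def latent_variable(X_k, kind="directional"):
    """Latent variable of one cluster, given its (n, p_k) block of variables."""
    Z = standardize(X_k)
    if kind == "directional":
        # First principal component: scores on the leading right singular vector.
        _, _, vt = np.linalg.svd(Z, full_matrices=False)
        c = Z @ vt[0]
    else:
        # Local groups: centroid (row-wise mean) of the standardized variables.
        # Assumes the centroid is not constant, otherwise its std is zero.
        c = Z.mean(axis=1)
    # Standardize so that the latent variable has variance 1.
    return (c - c.mean()) / c.std()
```

The sign of a principal component is arbitrary, which is harmless for directional groups because the link is a squared correlation.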

Partitioning Algorithm
The aim of the partitioning algorithm is to find an optimal couple $(P, C)$; it is therefore used to optimize the clustering criterion $\mathcal{L}(P, C)$ given in (1).
It consists of two alternating steps: the estimation step, in which the latent variables $c_k$ are computed given the current partition $P$, and the allocation step, in which the variables are reassigned to clusters given the latent variables. More precisely, this algorithm is given as Algorithm 1.
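The alternation between the estimation and allocation steps can be sketched in a few lines. The code below is a minimal sketch for directional groups only, not the authors' implementation: it assumes standardized variables, a squared-correlation link, and that no cluster becomes empty; the names `clv_partition`, `first_pc` and `latents_for` are hypothetical.

```python
import numpy as np

def standardize(X):
    X = np.asarray(X, dtype=float)
    return (X - X.mean(axis=0)) / X.std(axis=0)

def first_pc(Z):
    """First principal component of Z, rescaled to unit variance."""
    _, _, vt = np.linalg.svd(Z, full_matrices=False)
    c = Z @ vt[0]
    return (c - c.mean()) / c.std()

def latents_for(Z, labels, K):
    """Estimation step: one latent variable per cluster (clusters must be non-empty)."""
    return np.column_stack([first_pc(Z[:, labels == k]) for k in range(K)])

def clv_partition(X, labels, max_iter=50):
    """Alternate estimation and allocation, starting from the partition `labels`.

    Returns the final cluster index of each variable and the criterion value.
    """
    Z = standardize(X)
    n, p = Z.shape
    labels = np.asarray(labels).copy()
    K = int(labels.max()) + 1
    for _ in range(max_iter):
        C = latents_for(Z, labels, K)
        # Allocation step: each variable joins the cluster whose latent
        # variable has the highest squared correlation with it.
        new_labels = ((Z.T @ C / n) ** 2).argmax(axis=1)
        if np.array_equal(new_labels, labels):
            break
        labels = new_labels
    # Criterion of the final partition with its own latent variables.
    cor2 = (Z.T @ latents_for(Z, labels, K) / n) ** 2
    return labels, float(cor2[np.arange(p), labels].sum())
```

Because the loop stops as soon as the allocation no longer changes, the quality of the result depends on the initial `labels`, which is the role played by the initialization phase.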

k-means based co-clustering algorithm
In the k-means algorithm, the optimization problem is to find a given number of homogeneous clusters. However, finding the initial k cluster centers is a big challenge: these initial centers are chosen at random, which can lead to poor starting points. This means that the final clusters of k-means are highly dependent on the initial clusters. To improve the selection of such cluster centers, the k-means based co-clustering (kCC) algorithm proposes a new cluster initialization algorithm based on a random-walk similarity measure.
Instead of a single cluster center, this method selects k data objects/variables to represent each cluster. This helps to decrease the chance of selecting a single outlier. The points selected for the same cluster have high intra-cluster similarity amongst themselves but low inter-cluster similarity with the points of all other clusters.
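The idea of representing a cluster by several points rather than one can be illustrated with a small sketch. This is not the kCC selection rule of [8]: it is a hypothetical example that assumes a precomputed similarity matrix `S` and already chosen representatives, and it simply assigns each item to the cluster whose representatives are most similar to it on average.

```python
import numpy as np

def assign_to_representatives(S, reps):
    """Assign every item to the cluster with the most similar representatives.

    S    : (m, m) similarity matrix between items
    reps : list of K lists, reps[c] = indices of the representatives of cluster c
    """
    S = np.asarray(S, dtype=float)
    # score[i, c] = mean similarity between item i and the representatives of c
    score = np.column_stack([S[:, r].mean(axis=1) for r in reps])
    return score.argmax(axis=1)
```

Averaging over several representatives means that one atypical representative has a limited effect on the assignment, which is the motivation given above for avoiding a single center.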
Given two documents $d_i$ and $d_j$, kCC defines the $t$-th order walk between them and, likewise, the $t$-th order walk between two words $w_i$ and $w_j$; both definitions are given in [8]. Using the resulting similarity measure (4), the $z$-th ($z \in 1..k$) cluster center of the $l$-th ($l \in 1..K$) cluster is then selected. The aim of the kCC algorithm is to find $M$ initial clusters $P^0 = (G_1^0, G_2^0, \dots, G_M^0)$, which are used as the initialization step of the CLV algorithm (Algorithm 1).

Proposed kCC+CLV
In the CLV method, the problem of initializing the partitioning algorithm is solved by a hierarchical clustering algorithm (HCA) [12] based on the maximization of the criterion $\mathcal{L}(P, C)$.
However, when the number of variables is large, the HCA is time consuming. For this reason, we replace the initialization phase with the kCC method. This method has the same parameters and options as the classic k-means [13] but performs only the partitioning procedure. In this case, the objective is to find a partition $P^0 = (G_1^0, G_2^0, \dots, G_K^0)$ of a set $X = (x_{ij})$. Indeed, the result of the algorithm depends not only on the choice of the number of clusters but especially on the initial cluster centers. The number of clusters, $K$, should be given as an input parameter.

Results
To evaluate the performance of the proposed method kCC+CLV, an experiment was run on artificial data to analyze its effectiveness compared to other methods. The results are evaluated using the accuracy measure, which describes the correctness of an algorithm with respect to the real class labels of the data.
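The exact definition of the accuracy measure is not stated here. One common choice, assumed in the sketch below, is the fraction of items that are correctly labeled under the best one-to-one mapping between predicted clusters and real classes. The function name `clustering_accuracy` is hypothetical, and the brute-force search over mappings is only practical for a small number of clusters.

```python
from itertools import permutations

def clustering_accuracy(true_labels, pred_labels):
    """Accuracy under the best one-to-one mapping of clusters to classes."""
    classes = sorted(set(true_labels))
    clusters = sorted(set(pred_labels))
    # Pad with None so that every cluster can be mapped to something.
    size = max(len(classes), len(clusters))
    classes += [None] * (size - len(classes))
    best = 0
    for perm in permutations(classes, len(clusters)):
        mapping = dict(zip(clusters, perm))
        hits = sum(mapping[p] == t for t, p in zip(true_labels, pred_labels))
        best = max(best, hits)
    return best / len(true_labels)
```

The mapping step is needed because cluster indices are arbitrary: a partition that is identical to the real classes up to a relabeling should score 1.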
In terms of the accuracy measure, kCC shows an improvement of 62% over the other methods, h-means and HCA.
In general, the running time of our proposed kCC is lower than that of the other approaches when either the number of variables or the number of clusters is large. We calculate the CPU time in seconds for p variables (from p=50 to p=50000) and a fixed number of objects (n=1000). Table 1 shows that kCC+CLV remains fast even as the number p of variables increases. In the same setting, CLV with HCA is the slowest.

Conclusion
Clustering methods partition data into clusters such that the objects within a cluster have similar properties, while differing from the objects of the other clusters.
The iterative procedure of the CLV algorithm supplies a partition into $K$ clusters, which maximizes the criterion $\mathcal{L}$.
However, this optimum is often local and can depend on the initial partition [14].
A solution to avoid this problem and to reduce the influence of the arbitrary choice of the initial partition is to consider the kCC initialization algorithm.
In this case, all phases of the algorithm (Algorithm 1) are repeated several times, and the partition with the highest value of $\mathcal{L}$ is retained as the best one.
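The repetition strategy can be written independently of the clustering method itself. The following is a minimal sketch under stated assumptions: `run_once` is a hypothetical callable that performs one full initialization and partitioning run with the random generator it receives and returns a pair (partition, criterion value).

```python
import random

def best_of_restarts(run_once, n_restarts=10, seed=0):
    """Repeat run_once(rng) and keep the run with the highest criterion."""
    rng = random.Random(seed)
    best_partition, best_value = None, float("-inf")
    for _ in range(n_restarts):
        partition, value = run_once(rng)
        if value > best_value:
            best_partition, best_value = partition, value
    return best_partition, best_value
```

Keeping only the best of several runs does not guarantee a global optimum, but it reduces the dependence of the result on any single initial partition.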