Contribution of Collaborative Filtering Approach on Environmental Big Data Analytics Context

In recent years, following the spread of internet use, both everyday life and the management of the main production sectors have shifted toward digitalization (e-commerce, e-surgery, road traffic management, etc.) to minimize the [...]. This change has challenged researchers and organizations to find ways to manage, maintain, and classify the gigantic flows of data created daily, and to make the best decisions from them, while reducing environmental damage (such as the consumption of energy resources). From this challenge the collaborative filtering approach was born, which facilitates decision making based on the feedback of internet users. This approach generally uses big data analytics algorithms to predict and classify data.


Introduction
Due to the rise of internet usage, users face an endless daily supply of information on various topics (news, product choices on e-commerce websites, etc.). These massive data flows put researchers and organizations in a difficult position: they must devise solutions that help internet users make the best decisions while filtering unnecessary data out of incoming flows to protect the environment. Hence the birth of the concept of "Collaborative Filtering" [1], a set of algorithms for building a recommendation system that makes suggestions to an individual based on the assessments of a group of people with a similar profile. The technique covers several approaches, such as the user model approach, which builds models of users from their information, and the item-centered approach, which uses a similarity measure between items to determine the K items most similar to a given item A. This paper presents a comparative study of collaborative filtering algorithms that rely on big data analytics algorithms. It is structured as follows: Section 2 describes the collaborative filtering concept and the main families of this method, Section 3 presents the decision tree algorithm [6], taking the CHAID method [5] as an example, Section 4 presents the K-NN algorithm [2-4], Section 5 gives an interpretation, and Section 6 concludes our work.

Presentation
In general, Collaborative Filtering [8] is a set of techniques for making automatic predictions (filtering) about a user's interests by collecting the preferences of many users (collaborating). Recommender systems use it to filter data sets through collaboration among multiple agents, viewpoints, and data sources, on different kinds of data (monitoring data, financial data, and electronic commerce data). The process is based on the following steps:
• First, the system collects user information to obtain preferences. This collection can be:
▪ Explicit: users assign ratings to products or indicate their appreciation, for example with a Like.
▪ Implicit: the collection is based on behavior (purchases, clicks, and time spent on a page).
• Second, the system compares the new user's information with that of similar users.
• Finally, the system recommends products to the user based on the similarities found.
Today, several websites and organizations (YouTube, Jumia, Netflix…) use this approach to help their users find the most suitable products.
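The three steps above (collect preferences, compare the new user with similar users, recommend) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the users, items, and ratings are made up, and the similarity measure (the inverse of one plus the Euclidean distance over commonly rated items) is one possible choice rather than the one used by any particular system.

```python
from math import sqrt

# Step 1 (explicit collection): hypothetical 1-5 ratings, invented for this sketch.
ratings = {
    "alice": {"film_a": 5, "film_b": 4, "film_c": 1},
    "bob": {"film_a": 4, "film_b": 5, "film_d": 5},
    "carol": {"film_a": 1, "film_b": 2, "film_c": 5, "film_e": 4},
}


def similarity(u, v):
    """Similarity in (0, 1], from the Euclidean distance over commonly rated items."""
    common = set(u) & set(v)
    if not common:
        return 0.0
    distance = sqrt(sum((u[i] - v[i]) ** 2 for i in common))
    return 1.0 / (1.0 + distance)


def recommend(new_user, ratings, top_n=1):
    """Steps 2 and 3: weight other users' ratings by similarity, then rank unseen items."""
    weighted, weights = {}, {}
    for other_ratings in ratings.values():
        sim = similarity(new_user, other_ratings)
        if sim == 0.0:
            continue
        for item, rating in other_ratings.items():
            if item not in new_user:
                weighted[item] = weighted.get(item, 0.0) + sim * rating
                weights[item] = weights.get(item, 0.0) + sim
    # Predicted rating = similarity-weighted average of the neighbors' ratings.
    predicted = {item: weighted[item] / weights[item] for item in weighted}
    return sorted(predicted, key=predicted.get, reverse=True)[:top_n]


new_user = {"film_a": 5, "film_b": 5}
suggestions = recommend(new_user, ratings)
```

With these made-up ratings, the new user's tastes are closest to those of "alice" and "bob", so the item that only "bob" rated highly comes out on top.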

Main families of Collaborative filtering approach
Nowadays, we can distinguish between three main families of Collaborative filtering approach:

Active Filtering System
Uses the explicit tastes of users, gathered through a rating system, on a range of products.

Passive Filtering System
Analyzes the history of a user's behavior to deduce their preferences.

Content-Based Filtering System
Relies on characterizing the content of the information to be filtered (i.e. information is retrieved only if it matches the topics of interest to the user).
Today, many big data analytics algorithms are used in the collaborative filtering approach. In our study, we used two types of algorithms, chosen for their power and reduced consumption of resources.

Decision Tree Algorithm

Presentation
The decision tree [6, 10] is a decision support tool that represents a set of decisions graphically as a tree. The final decisions are located at the ends of the branches (the leaves) and are reached through the choices made at each step.
This method makes it possible to distribute the elements of a heterogeneous group according to a set of discriminating variables and a fixed objective. It is used in various fields such as banking (processing credit requests from a set of customers), medicine (preventing adverse effects of medical care), and psychology (studying models of human behavior). The CHAID variant is a technique for detecting interactions between variables using the chi-square criterion.

Advantages and Disadvantages
The popularity of the decision tree method rests on the fact that it is easy to understand, allows existing trees to be extended with new options, helps choose the most appropriate option, and can easily be combined with other decision-making tools. Despite these advantages, the method has certain limitations: in some cases the trees become very complex (pruning procedures are used to remedy this overfitting problem), and certain concepts, such as XOR or parity, are difficult to express in decision trees.

Principle of the method
In this part, we put the decision tree method into practice by studying the CHAID decision tree method [5], which generates non-binary trees (i.e. a node can have several branches). This method can be applied to all types of inputs and accepts observation weights.
To build a CHAID decision tree, one must follow this order:
▪ Define the root of the decision tree.
▪ Choose the segmentation variables using the chi-square (KHI-2) formula.
▪ Identify the nodes and leaves of the tree.
The chi-square formula is a technique introduced by the statistician Karl Pearson in 1900 that makes it possible to test the fit of a series of data to a family of probability distributions, or to test the independence of two random variables.
Where:
• n_ij is the count in the cell at row i and column j.
• n is the total count.
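For reference, the standard form of Pearson's chi-square statistic for a contingency table can be written with the symbols defined above. The row and column totals, written here as n_{i.} and n_{.j}, are notation assumed for this sketch.

```latex
\chi^2 = \sum_{i} \sum_{j}
  \frac{\left( n_{ij} - \dfrac{n_{i\cdot}\, n_{\cdot j}}{n} \right)^{2}}
       {\dfrac{n_{i\cdot}\, n_{\cdot j}}{n}},
\qquad
n_{i\cdot} = \sum_{j} n_{ij},
\quad
n_{\cdot j} = \sum_{i} n_{ij}
```

The term n_{i.} n_{.j} / n is the count expected in cell (i, j) if the two variables were independent, so a large value of the statistic indicates a strong dependence between the candidate segmentation variable and the target.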
To illustrate the application of this formula, we use the following example, which predicts whether a tennis match can be played depending on the weather.
• The first step is to define the frequency of play as a function of the Sun factor.
• The second step is to calculate the T0 value for all cases.
In our study, we used the Sipina software to build the CHAID decision tree on a data set of 2942 rows, in order to predict whether there is a relation between the frequency of fire, high temperature, and the flow of water. We obtained a simple hierarchical result in 1200 ms.
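The chi-square computation used to choose a segmentation variable can be sketched as follows. This is a minimal illustration, not the Sipina implementation: the function name `chi_square` is hypothetical, and the weather counts are invented for the sketch rather than taken from the data set of the study.

```python
def chi_square(table):
    """Pearson chi-square statistic for a contingency table given as a list of rows."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Count expected in cell (i, j) if the two variables were independent.
            expected = row_totals[i] * col_totals[j] / n
            statistic += (observed - expected) ** 2 / expected
    return statistic


# Hypothetical counts: rows = weather (sunny, overcast, rain), columns = (play, no play).
weather_table = [[2, 3], [4, 0], [3, 2]]
statistic = chi_square(weather_table)
```

The sketch assumes every row and column total is non-zero, otherwise an expected count would be zero and the division would fail. A table whose rows all have the same proportions gives a statistic of 0, meaning the variable carries no information about the outcome.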
To conclude this section, the decision tree method is generally used in decision support and data mining. It makes it possible to distribute a population of individuals into homogeneous groups according to a set of discriminating variables and a fixed objective.

K-NN Algorithm

Presentation
The

Advantages and Disadvantages
The popularity of the K-NN method rests on the fact that it is easy to understand and well suited to fields where each class is represented by several prototypes and class borders are irregular (such as the recognition of handwritten digits or satellite images). Despite these advantages, the method has certain limitations: prediction is slow because every query reviews all of the stored examples, and the method takes up a lot of memory.

Principle of the K-NN method
The principle of the K-NN algorithm is based on the following steps:
• Define the input data (the learning base D, the distance function d, the parameter k, and the data point x of unknown class).
• Compare the data point x of unknown class with all the data stored in the learning base D.
Several distance calculation methods can be distinguished; this table presents the most used:

Manhattan Distance
The distance covered by a taxi moving from one node of a grid network to another using only horizontal and vertical moves.
It is calculated as the sum of the absolute values of the differences between the coordinates of two points.

Euclidean distance
The shortest (straight-line) distance between two points, calculated as the square root of the sum of the squared differences between their coordinates. It is used for quantitative data of the same type.

Minkowski Distance
A distance in a normed vector space that can be considered a generalization of both the Euclidean distance and the Manhattan distance.
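The three distances described above can be written compactly in Python. The function names and the sample points are chosen for this sketch only. The Minkowski distance of order p reduces to the Manhattan distance for p = 1 and to the Euclidean distance for p = 2.

```python
def manhattan(a, b):
    """Sum of the absolute coordinate differences."""
    return sum(abs(x - y) for x, y in zip(a, b))


def euclidean(a, b):
    """Square root of the sum of the squared coordinate differences."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5


def minkowski(a, b, p):
    """Generalization of order p: p=1 gives Manhattan, p=2 gives Euclidean."""
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1.0 / p)


# Illustrative points, not taken from the data set of the study.
point_a, point_b = (1, 2), (4, 6)
d_manhattan = manhattan(point_a, point_b)
d_euclidean = euclidean(point_a, point_b)
```

For these two points the coordinate differences are 3 and 4, which makes the relationship between the three measures easy to check by hand.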
To illustrate the distance calculations, we use the following example, which predicts whether 'Sebastian' is loyal to a bank.
• The first step is to choose the parameter k; in our case we choose the value 3.
We can say that Sebastian is a loyal client.
In our study, we used the TANAGRA software to process a data set of 3442 rows, in order to predict whether there is a relation between the frequency of fire, high temperature, and the flow of water. The computation time was 797 ms and the allocated memory was 252 KB.
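A K-NN prediction with k = 3 and a majority vote can be sketched as follows. The learning base below is entirely hypothetical: the two features (age, number of products held), the labels, and the values for the new client are assumptions made for illustration, not the data of the bank example or of the TANAGRA study.

```python
from collections import Counter
from math import sqrt

# Hypothetical learning base D: (age, number of products held) -> loyalty label.
learning_base = [
    ((25, 1), "not loyal"),
    ((30, 1), "not loyal"),
    ((45, 4), "loyal"),
    ((50, 3), "loyal"),
    ((40, 3), "loyal"),
]


def euclidean(a, b):
    """Distance function d: straight-line distance between two points."""
    return sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def knn_predict(x, learning_base, k=3, distance=euclidean):
    """Compare x with every stored example, keep the k nearest, take a majority vote."""
    neighbors = sorted(learning_base, key=lambda example: distance(x, example[0]))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]


new_client = (42, 3)  # made-up feature values for the client of unknown class
prediction = knn_predict(new_client, learning_base, k=3)
```

Because every prediction sorts the whole learning base, the sketch also makes the cost noted earlier visible: the time and memory needed grow with the number of stored examples. In practice the features would usually be rescaled first so that no single one dominates the distance.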

Interpretation:
Following our study, we can say that two main algorithms of the collaborative filtering approach are decision trees and the K-NN algorithm. The decision tree method makes it possible to distribute the elements of a heterogeneous group into homogeneous groups according to a set of discriminating variables and a fixed objective.
The K-NN method, in turn, is a collaborative filtering technique that aims to find the k closest neighbors of each object according to a distance calculation method. Once the neighborhood closest to the object is found, a majority vote is taken to make the recommendations.

Conclusion
This article presents a synthesis of the comparative study between the two main algorithms of the collaborative filtering approach, namely decision trees and the K-NN algorithm. Given their characteristics (the K-NN algorithm took 797 ms to process the data, while the CHAID algorithm took about 1200 ms to process the same data) and their results, we can say that the most suitable algorithm for developing active recommendation systems is the K-NN algorithm.
Our goal is to improve the K-NN algorithm to reduce processing time and resource consumption, and to ensure data security by implementing security algorithms.