Ant Colony Algorithm for Determining Dynamic Travel Routes Based on Traffic Information from Twitter

Combining a search method for fire-suppression routes based on ant colony algorithms with a method for analyzing Twitter reports of highway events is the basis of the problems studied here. Features extracted from the Twitter data are classified with a Support Vector Machine and then used in the Simple Additive Weighting (SAW) method to calculate path weights from the criteria of distance, congestion, number of branches, and number of holes. The path weights are also used as initial pheromone values. The C-Means method is used to group the paths by weight and distance so that the path with the lowest weight and the shortest distance is simulated with the ant colony algorithm. Cross-fold validation of the SVM with a linear kernel yields the greatest accuracy, 97.93%, for a training:test data split of 6:4. Simulating the selection of a fire-truck (damkar) route from Feather to Pleburan with the plain ant colony algorithm required 50 seconds of computation, whereas ant colony with clustering required 15 seconds, a reduction of 35 seconds. Ant colony with MinMax optimization gives the best computation time of 14.47 seconds with 100 iterations and 10 nodes.


Background
The highway is one of the infrastructures that connects one place to another. In the modern era, with high demand for roads and a large number of road users, highway management is needed to respond to and detect quickly and precisely the dynamic phenomena that occur [1]. Road management includes road maintenance management and traffic management for rapid and appropriate handling of traffic jams [2]. Roads are dynamic because road events such as traffic jams caused by traffic signs, road construction, traffic accidents, and drivers' driving styles change over time; traffic management is therefore needed to solve these problems [3].
One traffic management task is the management of fire engine routes. Such a management system must be able to present the fastest route from the location of the fire engine pool based on distance, congestion value, number of road holes, and number of branches, so the calculation of a weight value is needed to obtain the optimal solution [4]. Algorithms for determining the fastest route include ant colony, Dijkstra, Tabu search, and Warshall; of the four, the ant colony algorithm obtains a near-optimal solution [5]. Because the ant colony algorithm has considerable complexity, its computation time is quite long, so path clustering is required for optimization [6].
Visibility values can be used to direct ants in choosing the desired route [7]. The SAW weight values can be used to determine the visibility value. Accurate congestion data are therefore needed so that fire trucks can avoid traffic jams [8]. One provider of congestion data that is quite accurate and real-time in nature is Twitter. In 2017 the number of active Twitter users reached 330 million, 80% of them active on mobile devices [9]. Twitter is flexible: its users can create, consume, promote, distribute, obtain, and share information with their communities, so large data clusters are needed to reduce errors in data processing [10]. A previous classification analysis of Twitter data with various methods, such as naïve Bayes, Maximum Entropy, and Support Vector Machine, concluded that the Support Vector Machine (SVM) method gives the best results compared to the other methods, with accuracy up to 82.2% [11]. SVM can be applied to handwriting recognition and text categorization, and it avoids the difficulties of dimensionality problems [12].
The combination of a search method for fire-truck travel routes, using the ant colony with clustering algorithm, with a method of analyzing tweet data about events on the road forms the basis of the problems studied here. Twitter data detected using a predetermined keyword table are extracted based on syllables and the time the tweet was made. The system then classifies the event with the Support Vector Machine algorithm; the result is used in the simple additive weighting method to calculate the criteria of distance, congestion, number of branches, and number of holes. The ant colony with clustering algorithm is used to obtain the optimal path under real-time traffic conditions from Twitter data, displayed with a mobile-based interface.

Literature review
The Twitter data mining process consists of information extraction, information summarization, and document classification [13]. Previous sentiment analysis studies on Twitter, using methods such as Naïve Bayes classification, Maximum Entropy, and Support Vector Machine, concluded that the Support Vector Machine (SVM) method gives the best results compared to the other methods, with accuracy up to 82.2% [11]. Research on real-time traffic congestion in Chicago and New York (USA) is presented by Jorge et al. [14], who extract information from the user's geotag location, time, date, and tweet text content.
The use of Twitter media as a data source for the case of land occupancy based on weather conditions, with the naïve Bayes classifier and SVM algorithms, is presented by Marchel et al. [15]; the SVM method showed better performance than naïve Bayes, with an accuracy difference of 0.0015, due to naïve Bayes's inability to handle outliers in the data. Another study that used the two algorithms, Naïve Bayes Classifier and SVM, is presented by Hassan et al. [16]. That research was carried out with text enrichment through Wikitology using data from 20 newsgroups with 1000 categories; the results showed that the more training data, the greater the accuracy of SVM, while the more features and external enrichment for labeling, the greater the accuracy of Naïve Bayes.
The ant colony with clustering algorithm is presented by Gao et al. [17], where immigrant schemes are proposed to address the dynamic location routing problem (DLRP). The DLRP is divided into two parts, the location-allocation problem (LAP) and the vehicle routing problem (VRP), in a dynamic environment. To handle the LAP, the C-Means clustering algorithm is used to cluster the node locations and surrounding cities into classes. Immigrant schemes are used together with ant colony with clustering (KACO) to address the DLRP; the results show that, compared with algorithms such as AS, EAS, MMAS, ACS, P-ACO, SA, and GA, KACO performs better.
This study uses the approach taken by Gao et al. [17] to address the problem of dynamic travel route selection, and detects traffic events from social media data, as done by Hassan et al. [16], using Twitter data about events on the road with the Support Vector Machine method. The time-based data are weighted by the SAW method to obtain the visibility value, the basis for calculating the node probability to determine the fastest route in real time with the ant colony algorithm, applied to the problem of fire-truck transportation.

Twitter Application Programming Interface
To facilitate interaction between developers and its services, Twitter provides an API that can be accessed publicly via the internet.

Support Vector Machine
Support Vector Machine (SVM) is a supervised machine-learning classification method that predicts classes based on a model, or pattern, learned during training. Classification is done by finding a decision boundary that separates one class from another; here the line separates tweets labeled positive (+) from tweets labeled negative (-) [18]. The first step in SVM is training (learning), which learns the text patterns in tweets into a model according to the given reference; the second step is prediction (testing), which uses the model to classify text and outputs the predicted class. In the design, the tweets and labels are divided into training and testing data as shown in Figure 1. After training, the prediction process follows and yields the system accuracy. Figure 2 shows the hyperplane formed between classes -1 and +1, defined by $\vec{w} \cdot \vec{x} + b = 0$, where $\vec{w}$ is the weight vector and $b$ is the bias.
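As a concrete illustration of the training and prediction steps above, the following pure-Python sketch trains a toy linear SVM with the Pegasos sub-gradient method; the paper does not specify its solver, and the feature vectors, labels, and parameter values below are hypothetical.

```python
import random

def train_linear_svm(X, y, lam=0.01, epochs=200):
    """Toy linear SVM trained with the Pegasos sub-gradient method.
    X: list of feature vectors, y: labels in {-1, +1}."""
    random.seed(0)
    w = [0.0] * len(X[0])
    t = 0
    for _ in range(epochs):
        for i in random.sample(range(len(X)), len(X)):
            t += 1
            eta = 1.0 / (lam * t)  # decaying learning rate
            margin = y[i] * sum(wj * xj for wj, xj in zip(w, X[i]))
            w = [(1 - eta * lam) * wj for wj in w]  # regularization shrink
            if margin < 1:  # hinge-loss violation: push w toward y[i] * X[i]
                w = [wj + eta * y[i] * xj for wj, xj in zip(w, X[i])]
    return w

def predict(w, x):
    """Classify by the sign of the decision function w . x."""
    return 1 if sum(wj * xj for wj, xj in zip(w, x)) >= 0 else -1

# Hypothetical word-frequency features: "jammed" tweets (+1) vs "clear" (-1)
X = [[0.9, 0.1], [0.8, 0.2], [0.1, 0.9], [0.2, 0.8]]
y = [1, 1, -1, -1]
w = train_linear_svm(X, y)
```

After training, `predict` assigns an unseen feature vector to the (+) or (-) side of the learned hyperplane.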

SAW Method
The Simple Additive Weighting (SAW) method is often also known as the weighted sum method. The basic concept of the SAW method is to find a weighted sum of performance ratings on each alternative on all attributes. The SAW method requires the process of normalizing the decision matrix (X) to a scale that can be compared with all available alternative ratings. The formula for normalizing is shown in equation 2 as follows:

Fig. 2. Hyperplane formed between classes -1 and +1 [19]

$$r_{ij} = \begin{cases} \dfrac{x_{ij}}{\max_i x_{ij}} & \text{if } C_j \text{ is a benefit attribute} \\[4pt] \dfrac{\min_i x_{ij}}{x_{ij}} & \text{if } C_j \text{ is a cost attribute} \end{cases} \quad (2)$$

where $r_{ij}$ is the normalized performance rating of alternative $A_i$ on attribute $C_j$; $i = 1, 2, \ldots, m$ and $j = 1, 2, \ldots, n$. The preference value for each alternative, $V_i$, is given by equation (3):

$$V_i = \sum_{j=1}^{n} w_j \, r_{ij} \quad (3)$$

A larger $V_i$ value indicates that alternative $A_i$ is preferred.
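Equations (2) and (3) can be sketched directly in Python. The route data and weight vector below are illustrative, not the paper's Table 1; all four criteria are treated as cost attributes (smaller is better).

```python
def saw_rank(matrix, weights, criteria):
    """Simple Additive Weighting: normalize (eq. 2) then weight-sum (eq. 3).
    matrix: rows = alternatives (routes), columns = criteria values.
    criteria: 'benefit' or 'cost' per column."""
    cols = list(zip(*matrix))
    scores = []
    for row in matrix:
        r = []
        for j, x in enumerate(row):
            if criteria[j] == 'benefit':
                r.append(x / max(cols[j]))
            else:  # cost attribute: smaller raw value -> larger rating
                r.append(min(cols[j]) / x)
        scores.append(sum(w * rij for w, rij in zip(weights, r)))
    return scores

# Hypothetical routes: [distance_km, congestion, branches, holes]
routes = [[19.9, 2, 3, 1], [22.5, 1, 2, 4], [25.0, 4, 5, 2]]
weights = [0.6, 0.3, 0.1, 0.1]  # illustrative weights, not the paper's exact set
scores = saw_rank(routes, weights, ['cost'] * 4)
best = scores.index(max(scores))  # index of the preferred alternative
```

A larger score marks the preferred route, matching the statement that a larger $V_i$ is more chosen.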

Ant Colony algorithm
The ant system, by Dorigo and Colorni, was first applied to the traveling salesman problem (TSP) in 1991 [20]. In general, the ACO algorithm for the TSP follows this scheme [21]:

procedure ACO for TSP
    set parameters, initialize pheromone trails
    while (termination condition not met) do
        ConstructAntSolutions
        ApplyLocalSearch (optional)
        UpdatePheromones
    end while
end procedure

Ant $k$ moves from node $i$ to node $j$ with probability

$$p_{ij}^{k} = \frac{(\tau_{ij})^{\alpha} (\eta_{ij})^{\beta}}{\sum_{l \in N_i^k} (\tau_{il})^{\alpha} (\eta_{il})^{\beta}} \quad (4)$$

where $\tau_{ij}$ is the amount of pheromone on the edge between points $i$ and $j$, $\eta_{ij}$ is the visibility, $\alpha$ is a parameter that controls the relative weight of the pheromone, and $\beta$ is the distance-controlling parameter ($\alpha > 0$ and $\beta > 0$). Equation (5) is used to select the next node $s$. The local pheromone intensity on all arcs is updated using equation (6):

$$\tau_{ij} \leftarrow (1 - \xi)\,\tau_{ij} + \xi\,\tau_{0} \quad (6)$$

where $\xi$ is the local pheromone evaporation rate and $n$ is the ant cycle. The global pheromone intensity on all arcs is then updated using equation (7):

$$\tau_{ij} \leftarrow (1 - \rho)\,\tau_{ij} + \sum_{k} \Delta\tau_{ij}^{k}, \qquad \Delta\tau_{ij}^{k} = \frac{Q}{L_k} \quad (7)$$

where $\rho$ is the global pheromone evaporation rate, $\Delta\tau_{ij}^{k}$ is the pheromone deposited by ant $k$, $L_k$ is the length of ant $k$'s tour, and $Q$ is a constant.
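The scheme and update rules above can be sketched as a minimal Ant System in Python. All parameter values here are illustrative, and the variants the paper later compares (Elitist, MinMax, ACS) would change only the pheromone-update step.

```python
import math
import random

def aco_tsp(dist, n_ants=10, n_iters=50, alpha=1.0, beta=2.0, rho=0.5, Q=1.0):
    """Minimal Ant System for the TSP; dist is a symmetric distance matrix."""
    random.seed(1)
    n = len(dist)
    tau = [[1.0] * n for _ in range(n)]  # initial pheromone on every edge
    eta = [[0 if i == j else 1.0 / dist[i][j] for j in range(n)] for i in range(n)]
    best_tour, best_len = None, float('inf')
    for _ in range(n_iters):
        tours = []
        for _ant in range(n_ants):
            start = random.randrange(n)
            tour, visited = [start], {start}
            while len(tour) < n:
                i = tour[-1]
                cand = [j for j in range(n) if j not in visited]
                # transition probability, eq. (4)
                wts = [(tau[i][j] ** alpha) * (eta[i][j] ** beta) for j in cand]
                j = random.choices(cand, weights=wts)[0]
                tour.append(j)
                visited.add(j)
            length = sum(dist[tour[k]][tour[(k + 1) % n]] for k in range(n))
            tours.append((tour, length))
            if length < best_len:
                best_tour, best_len = tour, length
        # evaporation and deposit, eq. (7): tau <- (1-rho)*tau + sum(Q/Lk)
        for i in range(n):
            for j in range(n):
                tau[i][j] *= (1 - rho)
        for tour, length in tours:
            for k in range(n):
                i, j = tour[k], tour[(k + 1) % n]
                tau[i][j] += Q / length
                tau[j][i] += Q / length
    return best_tour, best_len
```

On a unit square of four cities, the algorithm quickly settles on the perimeter tour of length 4.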

Elitist Ant System (EAS)
In EAS, additional pheromone intensity is deposited on the best tour $T^{bs}$, where the parameter $e$ defines the weight given to the best tour and $C^{bs}$ is the best tour's length. The pheromone change is defined as

$$\Delta\tau_{ij}^{bs} = \begin{cases} e / C^{bs} & \text{if edge } (i,j) \in T^{bs} \\ 0 & \text{otherwise} \end{cases} \quad (8)$$

The advantages of EAS are as follows:
• Fewer iterations are needed, because the edges on the best tour receive more pheromone, which narrows the search space.
• A narrower solution space makes it faster to find solutions. Although the search space is narrower, EAS does not stagnate: it continues to look for a possibly better new tour.

MAX -MIN Ant System (MMAS)
The MAX-MIN Ant System (MMAS) is a development of the AS (Ant System) algorithm, with several main changes, as follows. • First, pheromone can be added to the edges that are part of the best tour found since the start of the algorithm (best-so-far tour) or of the best tour found in the current iteration (iteration-best tour); pheromone may also be added to both at the same time. • Second, to counter the concentration caused by the first change, MMAS limits the pheromone values to the interval [τmin, τmax]. • Third, pheromone is initialized to the upper limit τmax, which, together with a small pheromone evaporation rate, increases tour exploration from the start of the search. • Fourth, pheromone is re-initialized when stagnation occurs or when no tour matching the desired iteration is found.
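A sketch of the MMAS update described above: deposit only on the best tour, then clamp every value into [τmin, τmax]. The parameter values are illustrative, not taken from the paper.

```python
def mmas_update(tau, best_tour, best_len, rho=0.02, tau_min=0.01, tau_max=5.0):
    """One MMAS pheromone update on matrix tau for the given best tour."""
    n = len(tau)
    for i in range(n):
        for j in range(n):
            tau[i][j] *= (1 - rho)  # evaporation
    for k in range(len(best_tour)):  # deposit only on the best tour's edges
        i, j = best_tour[k], best_tour[(k + 1) % len(best_tour)]
        tau[i][j] += 1.0 / best_len
        tau[j][i] += 1.0 / best_len
    for i in range(n):  # clamp to [tau_min, tau_max]
        for j in range(n):
            tau[i][j] = min(max(tau[i][j], tau_min), tau_max)
    return tau
```

Starting from the upper bound τmax, as MMAS prescribes, any deposit that would exceed the bound is cut off while non-best edges slowly evaporate toward τmin.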

Fire Truck Route Determination
Data were collected on road names and the distance (km) of each road that can be traversed, assumed to be arterial and provincial roads, together with a survey of the fire-truck (damkar) homebase coordinates. Distance and one-way data for each street name were obtained from the Semarang City Transportation Office, while the number of branches and holes was obtained from field observations on January 1, 2019. Table 1 shows examples of main and alternative fire-truck routes where the initial location is Jalan Madukoro, the location of the fire engine pool, and the destination is Jalan Imam Barjo, Pleburan, Semarang.

Research Procedure for Twitter Data Classification
Tweet data are obtained through the Twitter API with a token and API key. Data are taken based on the keyword used to provide traffic information for the City of Semarang, namely #LalinSMG. The training data comprise 500 tweets, while 50 test tweets are taken automatically when the page is reloaded. Data validation is done by inspecting the location data (geotag); only tweets located in Semarang City are stored in the database. The pre-processing steps can be seen in Figure 4.

Fig. 4. Text Pre-processing
The Twitter API is used for the Twitter data retrieval process. The results of Twitter data collection can be seen in Figure 5.

Fig. 5. Twitter data
Then the stemming process is carried out by splitting each word and removing the @ and # symbols and numbers. Stop-word removal then eliminates stop words and conjunctions so that feature extraction can be performed. Next, the feature extraction process obtains the data needed for classification.

Fig. 6. Results of the feature extraction process

Labeling is then done to define the classes. The first label indicates a highway event; the second label indicates the condition of the highway. Figure 7 shows the labeling criteria obtained from the extraction of the 500 training tweets: a value of (+) indicates an obstruction on the highway, while (-) indicates no obstruction. Table 2 shows an example of the labeling results.
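The pre-processing steps above can be sketched as follows. This is a simplified stand-in: real Indonesian stemming (e.g. with a dedicated stemmer library) is omitted, and the stop-word list is a small illustrative sample, not the paper's full list.

```python
import re

# Illustrative Indonesian stop words / conjunctions (not the paper's full list)
STOPWORDS = {"di", "ke", "yang", "dan", "ada"}

def preprocess(tweet):
    """Lowercase, strip @/# symbols and digits, drop stop words."""
    text = re.sub(r"[@#\d]+", " ", tweet.lower())
    return [t for t in re.findall(r"[a-z]+", text) if t not in STOPWORDS]

def keyword_counts(tokens, keywords):
    """Frequency of event/condition keywords: the features fed to the SVM."""
    return {k: tokens.count(k) for k in keywords}
```

For example, `preprocess("Macet di Jl Pandanaran #LalinSMG 10 menit")` removes the stop word, hashtag symbol, and number before the keyword frequencies are counted.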

Fig. 8. Linear classification with Support Vector Machine
The purpose of the validation is to obtain the composition of training and test data that gives the most optimal accuracy, called the fold value, as shown in Table 3. The congestion value $nk$ of street $x$ is calculated as the number of class (+) tweets for that street ($TP_x$) divided by the number of (+) tweets over all streets:

$$nk_x = \frac{TP_x}{\sum_x TP_x} \quad (9)$$

Table 4 shows the NK values used as input for the SAW calculation, using the example of the main and alternative fire-truck routes from Jalan Madukoro to Jalan Imam Barjo, where there are two (+) tweets, at Tugumuda and Jl Pandanaran. In the road-mapping scenario, the SAW method weights the roads by calculating the distance, congestion, and position criteria.
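Equation (9) is straightforward to compute. Using the example above of two (+) tweets, one at Tugumuda and one at Jl Pandanaran:

```python
def congestion_values(positive_tweets):
    """nk for street x = TPx / sum of TPx over all streets, eq. (9).
    positive_tweets: {street name: number of (+) tweets}."""
    total = sum(positive_tweets.values())
    return {street: n / total for street, n in positive_tweets.items()}

nk = congestion_values({"Tugumuda": 1, "Jl Pandanaran": 1})
```

Each of the two streets receives nk = 0.5, and any street without (+) tweets would receive nk = 0.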

Table 4. Congestion Value (NK) Damkar Travel Path
In general, the system in this study finds the best route based on four criteria: distance, number of holes, number of branches, and congestion value (NK). The system works as follows: • The initial node is the DAMKAR homebase, and a destination node is chosen. • Paths are ranked with SAW based on the four criteria, and clustered with C-Means into 2 cluster centers, with distance as the X axis and the SAW weight ranking as the Y axis. • The paths passed to the next process (ant colony) are those close to the cluster with the shortest distance and lowest weight rank. • The SAW weight value is also used as the visibility value, the basis for calculating node probabilities to determine the fastest route in real time with the ant colony algorithm.
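The text does not give the exact formula for turning the SAW weight into the visibility η. One plausible mapping, shown here purely as an assumption, normalizes each candidate path's SAW score so that better-weighted paths receive higher visibility:

```python
def visibility_from_saw(saw_scores):
    """Hypothetical mapping from SAW preference values to ant visibility eta.
    saw_scores: {path_id: SAW preference value}, higher value = preferred."""
    total = sum(saw_scores.values())
    return {path: s / total for path, s in saw_scores.items()}

# Illustrative SAW scores for three candidate paths
eta = visibility_from_saw({"A": 0.92, "B": 0.96, "C": 0.64})
```

The normalized values sum to 1, so they plug directly into the numerator and denominator of the transition probability in equation (4).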

Clusterization calculation of the final matrix preference value
Clustering is based on the distance of each node, to optimize computation time. The preference-matrix calculation uses the C-Means algorithm as follows:

C-Means Algorithm
01: input the clustering value k = 1 and the n data points
02: while k ≤ √n
03:     for i = 1, 2, ..., k
04:         randomly initialize the cluster centers
05:     end for
06:     while termination criteria not satisfied do
07:         compute each point's distance and class, compute the average distance in each class, and set the new cluster centers
08:     end while
09:     compute the distance function F and save it
10:     k = k + 1
11: end while
12: select the optimal value of k based on the minimum F value
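The inner loop of the algorithm above (steps 04-08 for a fixed k) can be sketched as a plain hard C-Means in Python; the 2-D points here are hypothetical (distance on the X axis, SAW weight on the Y axis).

```python
import math
import random

def c_means(points, k, iters=50):
    """Hard C-Means clustering of 2-D points for a fixed k.
    Returns the final cluster centers and the point assignments."""
    random.seed(0)
    centers = random.sample(points, k)  # random initial cluster centers
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:  # assign each point to its nearest center
            idx = min(range(k), key=lambda c: math.dist(p, centers[c]))
            clusters[idx].append(p)
        # move each center to the mean of its assigned points
        centers = [
            (sum(x for x, _ in cl) / len(cl), sum(y for _, y in cl) / len(cl))
            if cl else centers[i]
            for i, cl in enumerate(clusters)
        ]
    return centers, clusters
```

Wrapping this in the outer loop over k = 1 ... √n and keeping the k with the minimum distance function F reproduces the full algorithm.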

Research result
Twitter data retrieval is done by initializing the twitterOAuth library, followed by setting the keywords and the Twitter API authentication. The training data comprise 500 tweets, while 20 test tweets are taken automatically when the page is reloaded, as shown in Figure 9. The data stored in the database are feature-extracted for SVM classification.

The feature extraction process begins by retrieving data from the database and activating the stemmer to split each word, convert it to lowercase, and perform stop-word removal to delete symbols, conjunctions, and prepositions. The Sastrawi library is used to carry out the stemming process. The process continues with feature extraction and labeling. Feature extraction is done by counting the frequency of words related to highway events, such as fire, accident, and crowds, and to road conditions, such as congestion and quiet. Since the words have been reduced to base words by the stemming process, finding their frequencies is easier. Table 5 shows the frequency of word occurrences in the 500 tweets. Tweets related to highway events are symbolized as k1, k2, ..., k6; tweets related to road conditions are symbolized as ko1, ko2, ko3.

The extracted features are used for labeling with preg_match(). The data obtained in the labeling process show the number of congestion-related tweets for 23 street names according to the criteria in Table 2. There is also a process to handle tweets that mention places: for example, Sampookong is mapped to Jl Kaligarang, Krapyak to Jl Siliwangi, Fatmawati to Jl Majapahit, and the Central Java Regional Police to Jl Pahlawan, as shown in Table 6.

Table 6. Frequency of Labeling Values on Each Road

Table 6 shows that the congestion value (NK) can be determined, where NK is based on the frequency of (+) (jammed) tweets.
The classification with SVM starts by assigning values to the event and road conditions as in Table 2, giving each word a value using an interval of 0.01 to obtain a good data density, as shown in Table 7. The table data are then plotted into a graph so that the data distribution is obtained with the two classes determined in the labeling process, namely not jammed and jammed. The distribution forms a pattern in which the data of the two classes are separated by a distance that admits a separating line; the line forms the hyperplane dividing the two classes, as shown in Figure 10.

Table 7. Classification data with SVM

Fig. 10. Classification with SVM Linear
Evaluation of the classification results is done using the cross-fold method, yielding a confusion matrix. A confusion matrix is a tool used to evaluate classification models by counting correctly and incorrectly predicted objects. The process begins by reading the *.csv data file; the data are then divided into training and test data to find the predicted values and accuracy. In this study there are 2 folds, so a 2x2 confusion matrix is obtained, giving a different accuracy value for each division of training and test data, in accordance with Table 8.

Table 8. Accuracy for Each Training/Test Data Split

Table 8 shows that the greatest accuracy is 97.93%, for a training:test split of 6:4, with the confusion matrix values as in Table 9. Table 9 shows that there are 2 prediction errors of class +1 into class -1, so the 6:4 training:test split can be considered optimal.

Table 4 shows that there are 4 paths a fire truck (DAMKAR) can take from the pool to Jl Imam Barjo. Adding up the parameters on each lane, with weights of 60% for distance, 10% for branching, 10% for holes, and 30% for congestion, gives the weight values shown in Table 10. Table 10 can be plotted as a graph with 2 clusters using C-Means, as shown in Figure 11. Figure 11 shows that lines A, B, C, D, E are chosen as the candidate paths for the ant colony.

Fig. 11. Results of clustering with C-Means Clustering

Figure 12 shows the interface page, which consists of four sub-interfaces: the first shows the Semarang city map, the second is the ant colony parameter controller, the third is the real-time tweet test data, and the fourth shows the stemming results and the number of (+) and (-) (not jammed) tweets for each road.
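The evaluation step can be sketched with a small helper that builds the 2x2 confusion matrix and computes accuracy from it. The labels and predictions below are illustrative, not the paper's actual 6:4 split.

```python
def confusion_matrix(y_true, y_pred, labels=(1, -1)):
    """m[(actual, predicted)] counts for a 2-class confusion matrix."""
    m = {(a, p): 0 for a in labels for p in labels}
    for t, p in zip(y_true, y_pred):
        m[(t, p)] += 1
    return m

def accuracy(m, labels=(1, -1)):
    """Correct predictions (diagonal) divided by all predictions."""
    return sum(m[(l, l)] for l in labels) / sum(m.values())

# Illustrative example: one jammed (+1) tweet misclassified as not jammed (-1)
m = confusion_matrix([1, 1, 1, -1, -1], [1, -1, 1, -1, -1])
```

Here 4 of 5 predictions are correct, so the accuracy is 0.8; repeating this over each training/test split gives a table like Table 8.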
For example, if the blue dot in the Pleburan area is pressed, a route and alternative routes appear as shown in Figure 13, with 50 seconds of computation time; the C-Means algorithm then selects the path with the shortest distance and the lowest weight value, as shown in Figure 14. Figure 15 shows the results of the simulation with the ant colony: the blue line shows the best tour, while the green lines show the results of unfinished tours (where pheromone remains). With 100 iterations, it can be seen that in ACS there is still pheromone connecting Feather, Pendrikan Kidul, and Sekayu, and in the Elitist variant there is still pheromone in Sekayu and Pendrikan Kidul, while MinMax produces the best tour. Figure 16 shows the timing of the simulation process: the ant colony with MinMax optimization gives the best computation time, 14.47 seconds, with 100 iterations and 10 nodes.

Conclusion
From the discussion in the previous sections, conclusions can be drawn about the Ant Colony Clustering algorithm with SAW weighting on Twitter data. Of the 207 tweets whose features were extracted, taken from May 2, 2019 to June 30, 2019 with the #lalinSMG hashtag, the most frequent condition is "smooth", with 111 tweets. The SVM classification of the Twitter data generates a linear hyperplane, and the cross-fold validation produces a greatest accuracy of 97.9% for a training:test split of 6:4, with 2 jammed (+1) predictions misclassified as not jammed (-1) in the confusion matrix. The simulation of fire-truck path selection from Feather to Pleburan with the plain ant colony obtained a shortest distance of 19.9 km and 50 seconds of computation time, while ant colony with clustering reduced the computation time to 15 seconds and the number of nodes from 25 to 10, a reduction in computation of 35 seconds, or 51%. Ant colony with MinMax optimization gives the best computation time of 14 seconds with 100 iterations and 10 nodes.
For future work: the data obtained here are the sampling data provided by Twitter. Getting all tweets at national scale requires access to the Firehose API, which is costly, but would allow more keywords for events and road conditions. In addition, in this study the number of ant colony iterations is an input to the system, so a calculation that determines the number of iterations automatically once the system reaches the best route is expected.