Sentiment Analysis of Public Opinion Towards Tourism in Bangkalan Regency Using Naïve Bayes Method

Sentiment analysis is a branch of natural language processing (NLP) that uses text analysis to recognize and extract opinions in text. It converts unstructured information into structured information and determines whether the opinion about an object tends to be positive, negative, or neutral, which helps tourism managers make decisions when developing tourist attractions. In this study, sentiment analysis was conducted on reviews of tourism in Bangkalan using the Naïve Bayes method, a machine learning algorithm for text classification that gives accurate results with high efficiency, especially on large data, but is sensitive to feature selection. A feature selection process is therefore needed to improve classification efficiency by reducing the number of features analyzed; here the Information Gain method is used. Words are weighted with TF-IDF, and the data come from Google Maps reviews collected through web scraping, in which visitors give reviews and ratings of the places they have visited. The large number of reviews can make them difficult for managers of tourist attractions to handle. Labeling the sentiment class of the review data yielded 3649 reviews, with 2583 positive, 275 negative, and 457 neutral.
Testing with Information Gain thresholds of 0.0001, 0.0003, and 0.0007 shows that feature selection can improve the accuracy of the Naïve Bayes model. The best test, at threshold 0.0007, gave an accuracy of 78.68%, precision of 80.44%, recall of 82.59%, and f1-score of 82.53%. These results show that Information Gain feature selection combined with the SMOTE technique performs fairly well in classifying public opinion sentiment on tourism in Bangkalan Regency, and the predominance of positive visitor sentiment suggests that tourism management is good.


Introduction
Tourism is a sector that should be developed in every region, because its growth can create employment and business opportunities for the surrounding community, which in turn encourages regional development and raises regional income [1]. Bangkalan Regency is one of the areas in Madura with considerable tourism potential [2] [3], whether natural, religious, or artificial. The Bangkalan Regency Government has adopted a tourism policy that emphasizes the arrangement of tourist attractions in order to attract tourists and improve the attractions that already exist in the area. Under this policy, existing tourist attractions need to be developed [3], and the success of tourism in an area can be measured by the increase in the number of tourists visiting its attractions [4]. However, tourism managers must know what affects the satisfaction of visitors. A study is therefore needed, and one approach is to examine visitors' opinions of these tourist attractions.
Sentiment analysis is a way to find out how satisfied users are with a tourist attraction or a system, and it can also be used to measure whether visitors like a tourist attraction or not [5]. Sentiment analysis is a part of natural language processing (NLP) that uses text analysis to recognize and extract opinions from text. It converts unstructured information into structured information and determines whether the opinion about an object tends to be positive, negative, or neutral [6]. There are many ways to perform sentiment analysis, one of which is machine learning [7].
Machine learning is a technology that allows machines to learn patterns in data and perform certain tasks independently [7]. In sentiment analysis, machine learning is used to find sentiment patterns in text, so that text can be classified as positive, negative, or neutral. The Naïve Bayes method is a popular machine learning algorithm for text classification that provides accurate results with high efficiency, but it still has drawbacks, one of which is its sensitivity to feature selection. Classification performance can be reduced by a very large number of features. A feature selection process is therefore required to make classification more efficient by reducing the number of features to be analyzed. The novelty of this study is the use of the Information Gain feature selection method to overcome this weakness of the Naïve Bayes method. Information Gain works by selecting the features with the highest weight according to the desired number of features. The method uses entropy to find the best terms: the greater the Information Gain value of a term, the more significant the feature. Meanwhile, Term Frequency-Inverse Document Frequency (TF-IDF) word weighting is used to measure how often a word appears in a document and to count the number of documents in which the word appears [8].
The Naïve Bayes method in this study is enhanced with TF-IDF word weighting and Information Gain feature selection, a combination reported to give very good results with a high level of accuracy [9] [10]. The data in this study come from reviews written by visitors to tourist attractions in Bangkalan Regency on Google Maps, because Google Maps reviews show both visitors' opinions of an attraction and the rating of the places they have visited [11]. Many visitors to tourist attractions in Bangkalan Regency have left reviews in the Google Maps application, which can help tourism managers gauge the level of visitor satisfaction [12]. However, the large number of reviews can make it difficult for tourism managers to handle them efficiently. Therefore, Naïve Bayes is used to calculate class probabilities for the visitor reviews taken from Google Maps. One advantage of the Naïve Bayes algorithm over other machine learning algorithms is its high level of accuracy, which makes it suitable for very large data [13].
The review data obtained from Google Maps have an unbalanced number of classes, so a data balancing process is needed. A further addition in this research is therefore the use of a sampling technique to balance the data. A balanced dataset can be classified directly without extra steps. When the classes in a dataset are unbalanced, however, sampling techniques are needed so that the classification model continues to function properly. The second novelty of this research is thus the use of SMOTE (Synthetic Minority Oversampling Technique) to overcome the class imbalance in the dataset [14].
The ideas proposed in this research can be compared with several similar studies. One study analyzed sentiment toward the hacker Bjorka on Twitter by comparing the accuracy of the Multinomial, Bernoulli, and Gaussian Naïve Bayes algorithms on 1000 data items with TF-IDF word weighting: Multinomial Naïve Bayes achieved 73% accuracy, 73% precision, and 100% recall; Bernoulli Naïve Bayes achieved 72% accuracy, 73% precision, and 98% recall; and Gaussian Naïve Bayes achieved 55% accuracy, 75% precision, and 63% recall [15]. In similar research, the Support Vector Machine (SVM) method was used for sentiment analysis of Madura tourism, with data drawn from various social media platforms that discuss Madura tourism. Public opinion was divided into three categories (positive, negative, and neutral), and testing with 5-fold cross-validation yielded 192 positive tweets and an accuracy of 92.592% based on the confusion matrix [16]. Another study compared the Naïve Bayes algorithm with SVM on Twitter data. Its results report that Naïve Bayes differs from the SVM-based sentiment analysis model by 3.45% in accuracy, 0.02 in precision, 0.04 in recall, and 0.03 in f1-score. The data in that study amounted to 2030 items, divided into 1624 training items and 406 test items [17]. A further study used the SMOTE technique to overcome unbalanced data. Its secondary data came from the national socioeconomic survey and consisted of 494 samples with 8 independent variables and 1 dependent variable, and the CART (Classification and Regression Trees) method was used [18]. That study shows that the model with SMOTE is more accurate than the model without SMOTE: the model with SMOTE produced a sensitivity of 67.05%, compared to the previous value of only 36.36% [19] [20].
Based on the above, this study, Sentiment Analysis of Public Opinion Towards Tourism in Bangkalan Regency Using the Naïve Bayes Method, addresses the weaknesses of the Naïve Bayes method by adding Information Gain for feature selection, TF-IDF for word weighting, and SMOTE to overcome class imbalance in the dataset. It is hoped that these additions will give a high level of accuracy in classifying tourism visitor reviews in Bangkalan Regency.

Methods
The research design for the sentiment analysis of public opinion towards tourism in Bangkalan Regency using the Naïve Bayes method is shown in Figure 1, which gives an overview of the process carried out in this study.

Data collection and labeling
The data used in this research are secondary data derived from tourist visitor reviews on Google Maps. The selected destinations are 10 recommended attractions in Bangkalan Regency, namely Bukit Pelalangan Arosbaya, Bukit Geger, Bukit Jaddih, Siring Kemuning Beach, Labuhan Mangrove Education Park, Rongkang Beach, Sambilangan Lighthouse, Syaikhona Kholil Tomb, Beramah Lantern Hill, and Sumber Pocong. Data collection was done by web scraping with the WebHarvy tool, giving a total of 3649 reviews from 2021 to 2023. Reviews are labeled as positive, neutral, or negative based on the stars given by visitors: 1 and 2 stars are categorized as negative sentiment, 3 stars as neutral sentiment, and 4 and 5 stars as positive sentiment [21] [22].
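The star-based labeling rule described above can be sketched in Python as follows. This is a minimal illustration rather than the code used in the study; the function name `label_from_stars` and the sample rows are hypothetical.

```python
def label_from_stars(stars: int) -> str:
    """Map a 1-5 star rating to a sentiment label using the rule stated above."""
    if stars not in (1, 2, 3, 4, 5):
        raise ValueError(f"star rating must be 1-5, got {stars!r}")
    if stars <= 2:
        return "negative"   # 1 and 2 stars
    if stars == 3:
        return "neutral"    # 3 stars
    return "positive"       # 4 and 5 stars


# Hypothetical scraped rows: (review text, stars)
reviews = [
    ("The view is beautiful", 5),
    ("The access road is damaged", 2),
    ("Quite ordinary", 3),
]
labeled = [(text, label_from_stars(stars)) for text, stars in reviews]
```

Rejecting ratings outside 1 to 5 makes scraping errors visible at labeling time instead of silently assigning them a class.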

Data preprocessing
Preprocessing produces clean data through the following stages:
a) Data Cleaning is intended to ensure that the data in the dataset have the best accuracy, consistency, and usability.
b) Case Folding converts all letters from "a" to "z" in the dataset into lowercase.
c) Normalization unifies words that are written in different ways but have the same meaning, and converts nonstandard words into standard words. This is done using a slang-word dictionary from the Indonesian Colloquial Lexicon, which can be accessed on GitHub, together with a slang-word dictionary created by the author based on the needs of the dataset.
d) Tokenization breaks each sentence into words or tokens. This stage is performed with the split() method in the Python programming language.
e) Filtering removes unimportant words using the stopword list from the nlp_id library.
f) Stemming converts words with affixes into their base words. Stemming requires a dictionary of standard words; this study uses the StemmerFactory module of the Sastrawi library.
Preprocessing yielded a final total of 3326 clean reviews. After the data are cleaned, TF-IDF word weighting is carried out, a method that combines two concepts, namely Term Frequency and Document Frequency.
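The six stages above can be sketched as a single Python function. This is a simplified illustration under stated assumptions, not the pipeline used in the study: the `SLANG` and `STOPWORDS` entries are tiny hypothetical stand-ins for the Indonesian Colloquial Lexicon and the nlp_id stopword list, the cleaning rules (dropping URLs and non-letter characters) are assumed, and stemming is passed in as a callable that defaults to the identity so the sketch runs without external libraries.

```python
import re

# Hypothetical stand-ins for the slang dictionary and the stopword list.
SLANG = {"bgt": "banget", "tdk": "tidak", "yg": "yang"}
STOPWORDS = {"yang", "di", "dan", "ini"}


def preprocess(text, stem=lambda word: word):
    """Return the cleaned, stemmed tokens of one review.

    `stem` is any callable mapping a word to its base form. With Sastrawi
    installed, StemmerFactory().create_stemmer().stem could be passed here.
    """
    text = re.sub(r"http\S+", " ", text)                    # a) cleaning: URLs
    text = re.sub(r"[^A-Za-z\s]", " ", text)                # a) cleaning: digits, punctuation, emoji
    text = text.lower()                                     # b) case folding
    text = " ".join(SLANG.get(w, w) for w in text.split())  # c) normalization
    tokens = text.split()                                   # d) tokenization with split()
    tokens = [t for t in tokens if t not in STOPWORDS]      # e) filtering
    return [stem(t) for t in tokens]                        # f) stemming


tokens = preprocess("Tempatnya bagus bgt, yg penting bersih!!! 10/10")
```

Keeping the stemmer as a parameter lets the same function be tested without Sastrawi and then run with the real stemmer.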

TF-IDF word weighting
After preprocessing, features must be extracted from the dataset. The tourist visitor reviews are text (categorical) data, and machine learning algorithms cannot accept categorical data as input, so feature extraction is needed to convert the text into numerical data, here with the TF-IDF (Term Frequency-Inverse Document Frequency) word weighting algorithm [23] [24]. TF-IDF gives a weight to each word by combining two concepts, Term Frequency and Document Frequency. Term Frequency measures how often a word appears in a document, while Document Frequency counts the number of documents in which the word appears; the less frequently a word occurs, the lower its weight [25], according to the following formulas:

$$tf_{t,d} = \frac{f_{t,d}}{n_d}, \qquad idf_t = \log\frac{N}{df_t}, \qquad w_{t,d} = tf_{t,d} \times idf_t$$

where $tf_{t,d}$ is the term frequency value of term $t$ in document $d$, $f_{t,d}$ is the number of appearances of term $t$ in document $d$, $n_d$ is the total number of terms in document $d$, $idf_t$ is the idf value of term $t$, $N$ is the number of documents in the collection, $df_t$ is the number of documents that contain term $t$, and $w_{t,d}$ is the resulting TF-IDF weight.
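As a worked illustration of TF-IDF weighting, the sketch below computes the weight of each term as its relative frequency in the document multiplied by the logarithm of the number of documents divided by the number of documents containing the term. It assumes the natural logarithm and uses a hypothetical three-document corpus. The values will generally differ from those of scikit-learn's TfidfVectorizer, whose defaults use a smoothed idf and L2 normalization.

```python
import math
from collections import Counter


def tf_idf(docs):
    """docs: list of token lists. Returns one {term: weight} dict per document."""
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)   # appearances of each term in this document
        total = len(doc)        # total terms in this document
        weights.append({
            term: (count / total) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights


# Hypothetical tokenized reviews.
docs = [["bukit", "bagus", "bagus"], ["pantai", "bagus"], ["jalan", "rusak"]]
w = tf_idf(docs)
```

In this corpus "bagus" appears in two of the three documents, so its idf is lower than that of "bukit", which appears in only one.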

Oversampling data
The next stage is oversampling, which deals with the class imbalance in the dataset, since the classes contain different numbers of reviews [26]. The data are balanced using SMOTE (Synthetic Minority Over-sampling Technique) oversampling. SMOTE generates new synthetic samples for the minority classes by interpolating between an existing minority sample and a nearby minority sample, using a random factor. Applying SMOTE to unbalanced data can improve the performance of the model in predicting minority classes [27]. The following equations are used:

$$Dist(X, Y) = \sqrt{\sum_{n} (X_n - Y_n)^2}$$

$$X_{syn} = X_i + (X_{knn} - X_i) \times \delta$$

where $Dist$ is the Euclidean distance, $X_n$ is the $n$th attribute value, $X_{syn}$ is the synthetic data created as new data, $X_i$ is the data to be replicated, $X_{knn}$ is the data with the closest distance to the data to be replicated, and $\delta$ is a random number between 0 and 1.
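The SMOTE procedure can be sketched as follows: each synthetic point is placed at a random position on the line segment between a minority sample and one of its nearest minority neighbours. This is an illustrative sketch only. The parameter `k` (number of neighbours considered) and the fixed `seed` are assumptions, and a maintained implementation such as the one in the imbalanced-learn package would normally be preferred in practice.

```python
import math
import random


def euclidean(x, y):
    """Euclidean distance between two equal-length numeric vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def smote(minority, n_new, k=5, seed=0):
    """Create n_new synthetic points from a list of minority-class vectors.

    Requires at least two minority samples.
    """
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        x_i = rng.choice(minority)  # the sample to be replicated
        # k nearest minority neighbours of x_i, excluding x_i itself
        neighbours = sorted(
            (p for p in minority if p is not x_i),
            key=lambda p: euclidean(x_i, p),
        )[:k]
        x_knn = rng.choice(neighbours)
        delta = rng.random()  # random number in [0, 1)
        synthetic.append([a + (b - a) * delta for a, b in zip(x_i, x_knn)])
    return synthetic


# Hypothetical minority-class feature vectors.
minority = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
new_points = smote(minority, n_new=4, k=2)
```

Because every synthetic point is an interpolation between two existing minority samples, it always lies between them in feature space.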

Feature selection
The next stage is feature selection using Information Gain, which is very important in text classification because it forms a vector space that improves scalability, efficiency, and accuracy in the text clustering process. Information Gain works by selecting the features that have the highest weight according to the desired number of features. It uses entropy to find the best terms: the greater the Information Gain value of a term, the more significant and important the feature is considered [28].
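The text above does not give the Information Gain formula, so the sketch below assumes the common definition for text features: the entropy of the class labels minus the expected entropy after splitting the documents by whether they contain the term. The function names, the toy corpus, and the threshold of 0.5 are hypothetical; terms whose score falls below the threshold are discarded, mirroring the threshold-based selection used in this study.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (base 2) of a list of class labels."""
    total = len(labels)
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())


def information_gain(docs, labels, term):
    """Entropy of the labels minus the expected entropy after splitting on term presence."""
    with_t = [y for d, y in zip(docs, labels) if term in d]
    without_t = [y for d, y in zip(docs, labels) if term not in d]
    n = len(labels)
    conditional = (len(with_t) / n) * entropy(with_t) + (len(without_t) / n) * entropy(without_t)
    return entropy(labels) - conditional


def select_features(docs, labels, threshold):
    """Keep the terms whose information gain is at least the threshold."""
    vocab = {t for d in docs for t in d}
    return sorted(t for t in vocab if information_gain(docs, labels, t) >= threshold)


# Hypothetical tokenized reviews and labels.
docs = [["bagus", "tempat"], ["bagus", "foto"], ["rusak", "jalan"], ["rusak", "tempat"]]
labels = ["positive", "positive", "negative", "negative"]
selected = select_features(docs, labels, threshold=0.5)
```

Here "bagus" and "rusak" separate the two classes perfectly and are kept, while "tempat" appears equally in both classes, has zero gain, and is dropped.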

Naïve bayes implementation
In the next stage of sentiment analysis, the training data are used to train a classification model with the Naïve Bayes method [29]. The Naïve Bayes Classifier is a probabilistic algorithm used to predict a situation [18]. It applies probability theory to predict the likelihood of a future event based on data collected in the past. As a probabilistic method, the Naïve Bayes Classifier can classify text documents into classes with high accuracy and is able to process large amounts of data [12]. Classifying text documents involves three steps: finding the probability of each document category, finding the probability of occurrence of each word in each document category, and determining the category of the document to be classified based on the results of the first two steps.
The Naïve Bayes Classifier assigns a target value to new data using the $V_{MAP}$ value, which is the most probable value among all members of the domain set $V$ [30]. Each review is represented by the attributes $a_1, a_2, a_3, \ldots, a_n$, where $a_1$ is the first word, $a_2$ is the second word, and so on, while $V$ is the set of sentiment categories [31]. During classification, the algorithm looks for the highest probability among all tested categories ($V_{MAP}$), using the following equation:

$$V_{MAP} = \underset{v_j \in V}{\arg\max}\; P(v_j) \prod_{i=1}^{n} P(a_i \mid v_j)$$

where $v_j$ is the review category, $j = 1, 2, 3, \ldots, n$. In this study, $v_1$ is the positive review category, $v_2$ is the negative review category, and $v_3$ is the neutral review category. $P(a_i \mid v_j)$ is the probability of $a_i$ in category $v_j$, and $P(v_j)$ is the probability of $v_j$.
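The three classification steps above (category probabilities, per-category word probabilities, and choosing the most probable category) can be sketched with word counts as follows. This is a minimal multinomial-style illustration rather than the implementation used in the study. It works in log space to avoid numerical underflow and applies add-one (Laplace) smoothing, which is an assumption not stated in the text; the toy training data are hypothetical.

```python
import math
from collections import Counter, defaultdict


def train_nb(docs, labels):
    """Estimate category priors and per-category word counts from tokenized documents."""
    class_counts = Counter(labels)
    word_counts = defaultdict(Counter)
    for doc, y in zip(docs, labels):
        word_counts[y].update(doc)
    vocab = {w for doc in docs for w in doc}
    priors = {c: n / len(labels) for c, n in class_counts.items()}
    return priors, word_counts, vocab


def predict_nb(model, doc):
    """Return the category with the highest prior times product of word probabilities."""
    priors, word_counts, vocab = model
    scores = {}
    for c, prior in priors.items():
        total = sum(word_counts[c].values())
        score = math.log(prior)
        for w in doc:
            if w in vocab:  # words unseen in training are skipped
                # Add-one smoothing (assumed) keeps unseen word/category pairs non-zero.
                score += math.log((word_counts[c][w] + 1) / (total + len(vocab)))
        scores[c] = score
    return max(scores, key=scores.get)


# Hypothetical tokenized training reviews.
train_docs = [["bagus", "indah"], ["bagus", "bersih"], ["kotor", "rusak"], ["biasa", "saja"]]
train_labels = ["positive", "positive", "negative", "neutral"]
model = train_nb(train_docs, train_labels)
prediction = predict_nb(model, ["bagus"])
```

With raw counts replaced by TF-IDF weights, the same argmax rule corresponds to what a multinomial Naïve Bayes classifier computes over weighted features.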

Model evaluation
After the model is built, the last stage is testing and evaluation. This stage measures performance using a confusion matrix that compares actual categories with predicted categories, from which the accuracy, precision, recall, and f1-score values are calculated. In multi-class classification evaluation, three metrics are commonly used, namely accuracy, precision, and recall. Accuracy is the ratio of correct predictions to the total data evaluated. Precision is the proportion of the data assigned to a class by the system that actually belongs to that class. Recall is the proportion of the data that should belong to a class that is correctly classified [32]. At this stage, the model is tested by evaluating the performance of Naïve Bayes on the tourist visitor review dataset using Information Gain feature selection with threshold values of 0.0001, 0.0003, and 0.0007, in order to explore how the threshold used to select features affects model performance. Features whose weight is below the threshold are not used. Information Gain plays an important role as a feature selector that can improve the accuracy of the system. The selected features are those with a non-zero Information Gain weight, and the threshold is then applied to produce the best features as output [33].
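The four measures can be derived from the confusion matrix as sketched below. The text does not state how the per-class precision, recall, and f1-score are combined into a single value, so this sketch assumes macro averaging (the unweighted mean over classes); the function name and the sample labels are hypothetical.

```python
from collections import Counter


def evaluate(y_true, y_pred):
    """Accuracy plus macro-averaged precision, recall, and f1 from a confusion matrix."""
    classes = sorted(set(y_true) | set(y_pred))
    confusion = Counter(zip(y_true, y_pred))  # (actual, predicted) -> count
    precisions, recalls, f1s = [], [], []
    for c in classes:
        tp = confusion[(c, c)]
        predicted = sum(confusion[(a, c)] for a in classes)  # column total
        actual = sum(confusion[(c, p)] for p in classes)     # row total
        prec = tp / predicted if predicted else 0.0
        rec = tp / actual if actual else 0.0
        f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        precisions.append(prec)
        recalls.append(rec)
        f1s.append(f1)
    n = len(classes)
    return {
        "accuracy": sum(confusion[(c, c)] for c in classes) / len(y_true),
        "precision": sum(precisions) / n,
        "recall": sum(recalls) / n,
        "f1": sum(f1s) / n,
    }


# Hypothetical actual and predicted labels.
y_true = ["positive", "positive", "negative", "neutral"]
y_pred = ["positive", "negative", "negative", "neutral"]
scores = evaluate(y_true, y_pred)
```

Macro averaging weights every class equally, which matters on imbalanced review data where the positive class dominates.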
The stages carried out in this research are data collection, data labeling, preprocessing, TF-IDF word weighting, SMOTE oversampling, feature selection, Naïve Bayes implementation, and model evaluation. This process yields the best accuracy, precision, recall, and f1-score values based on the results of the test scenarios using the confusion matrix.

Results and discussion
The data used amounted to 3649 items, consisting of 2583 positive reviews, 275 negative reviews, and 457 neutral reviews. The preprocessing stage was then carried out, yielding 3326 clean reviews. After that, TF-IDF word weighting was performed. The sentiment classes were then balanced using the SMOTE technique, after which the positive, negative, and neutral classes each contain 2583 reviews. The data were divided into training data and testing data, and feature selection was carried out using Information Gain. In the sentiment analysis stage, the training data were used to train a classification model with the multinomial Naïve Bayes method. Once the model was ready, testing and evaluation were done using the testing data to measure accuracy.

Data collection
To collect data for this research, we used data scraping techniques on reviews of tourist attractions in Bangkalan Regency [11]. The WebHarvy software was used to retrieve data from Google Maps, such as the name, time, stars, and review text shown on the website.

Data labeling
The tourist review data are labeled based on the stars given, which determine whether each visitor review belongs to the positive, negative, or neutral sentiment class. The labeling results are shown in Table 1.

Table 1. Number of labeling results.
Label       Number of reviews
Positive    2583
Negative    275
Neutral     457

Based on the table above, there are 2583 positive labels, 275 negative labels, and 457 neutral labels.

Reviews at each preprocessing stage (before and after):
a. Data Cleaning
Before: Exotic former limestone mining is pretty good for eye wash and a good spot for photos; it's just that the road access to the location is still not good.
After: Exotic former limestone mining is pretty good for eye wash good spot for it's just that the access road to the location is still not good
b. Casefolding
Before: Exotic former limestone mining is pretty good for eye wash good spot for it's just that the access road to the location is still not good
After: exotic former limestone mining is pretty good for eye wash good spot for it's just that the access road to the location is still not good.
c. Normalization
Before: exotic former limestone mining is pretty good for eye wash good spot for it's just that the access road to the location is still not good.
After: exotic former limestone mining is pretty good for eye wash good spot for it's just that the access road to the location is still not good.
d. Tokenization
just that the access road to the location is still not good.

TF-IDF word weighting
After the data are divided, the TF-IDF algorithm calculates the importance of each word in a document by assigning it a weight. In this way each word in a text document is converted into a numerical value. TF-IDF word weighting can be done in Python with the TfidfVectorizer() class of the scikit-learn library. The results are shown in Table 3.

Naïve bayes implementation
The Naïve Bayes classification process is carried out on data that have gone through TF-IDF weighting and Information Gain feature selection, training on the training data and predicting on the testing data. The classification results give the accuracy, precision, recall, and f1-score values. The library used for Naïve Bayes classification is scikit-learn, specifically its MultinomialNB class. The implementation of Naïve Bayes is shown in Source Code 1.

Model evaluation
In this scenario, the Naïve Bayes algorithm is tested with Information Gain feature selection and the SMOTE technique, both of which are applied before classification with the Naïve Bayes algorithm. Evaluation uses a confusion matrix to obtain the accuracy of the Naïve Bayes algorithm with Information Gain at thresholds of 0.0001, 0.0003, and 0.0007. The results of the testing scenario are shown in Table 4, where NB denotes Naïve Bayes and IG denotes Information Gain. The table shows the accuracy obtained by each model when SMOTE oversampling is applied. With Information Gain at a threshold of 0.0001, the Naïve Bayes method used 4094 features with a time of 0.1242 and achieved an accuracy of 82.62%, precision of 83.06%, recall of 82.20%, and f1-score of 82.14%. At a threshold of 0.0003, it used 2349 features with a time of 0.0402 and achieved an accuracy of 82.68%, precision of 82.51%, recall of 82.49%, and f1-score of 82.44%. At a threshold of 0.0007, it used 1906 features with a time of 0.0226 and achieved an accuracy of 80.89%, precision of 80.65%, recall of 80.78%, and f1-score of 80.70%. Processing the tourist visitor reviews in Bangkalan Regency with SMOTE thus gives the best accuracy for Naïve Bayes with an Information Gain threshold of 0.0003. The positive sentiment class totals 2597 reviews, and its most frequent words are "good", "tour", "hill", "photo", and "place". This shows that visitors have a good and pleasant experience when visiting tourist attractions in Bangkalan Regency. The negative sentiment class totals 274 reviews, and its most frequent words are "enter", "extortion", "parakeet", "not", and "road". The neutral sentiment class totals 455 reviews, and its most frequent words are "place", "good", "beautiful", "photo", and "cool".

Analysis result
The comparison of the accuracy, precision, recall, and f1-score of the Naïve Bayes model and Information Gain with SMOTE is shown in the figure below. The test results in this study show that using SMOTE increases the accuracy of the Naïve Bayes model to 82.81%, with balanced precision and recall. Applying Information Gain with thresholds of 0.0001, 0.0003, and 0.0007 also increases the accuracy of the Naïve Bayes model, to 78.37%, 78.67%, and 78.67%, respectively. Combining the Naïve Bayes algorithm with an Information Gain threshold of 0.0003 and SMOTE gives an accuracy of 82.68%, the best among the tested thresholds. This method helps classify sentiment more precisely and efficiently, providing insight into visitor satisfaction with tourism in Bangkalan Regency. Information Gain with a suitable threshold, combined with SMOTE, is therefore a good choice for analyzing visitor reviews in Bangkalan Regency.

Conclusion
Evaluating the Naïve Bayes method with SMOTE and Information Gain feature selection, a threshold of 0.0001 resulted in an accuracy of 82.62%, precision of 83.06%, recall of 82.20%, and f1-score of 82.14%; a threshold of 0.0003 resulted in an accuracy of 82.68%, precision of 82.51%, recall of 82.49%, and f1-score of 82.44%; and a threshold of 0.0007 resulted in an accuracy of 80.89%, precision of 80.65%, recall of 80.78%, and f1-score of 80.70%. The test results show that the threshold value of 0.0003 gives the best accuracy of the three thresholds. In other words, optimizing Naïve Bayes with Information Gain feature selection and SMOTE gives better results in analyzing tourist visitor reviews.
Based on this research, it can be concluded that Information Gain with a suitable threshold value, combined with SMOTE, produces good accuracy. With high accuracy and good precision, recall, and f1-score values, this method helps classify sentiments more precisely and efficiently. The Naïve Bayes model optimized with SMOTE and Information Gain is therefore a better choice for analyzing tourist visitor reviews in Bangkalan Regency and can provide deeper insight into visitor satisfaction with tourism there.

Table 2. Changes in text preprocessing word count.

Table 3. Results of TF-IDF word weighting.

Table 4. Comparison of Naïve Bayes test results with SMOTE.