Feasible Trend Prediction for 2019 Indian General Elections

The popularity and accessibility of social media applications, such as Twitter, have increased at a rapid pace over the past few years. Users from different parts of the world use them to share thoughts with each other. Politicians use these platforms at every viable opportunity to increase support and following for themselves and their parties. Sentiment analysis has become a key methodology for gaining insight from social networks. We perform sentiment analysis using a lexicon-based sentiment analyser, together with sentiment trend prediction over a short interval of time, to model the sentiment trend towards the top contenders of the Indian General Election 2019 on Twitter. Sentiment analysis is performed on the tweets posted during the campaign period of the Indian General Elections of 2019 using VADER, a pre-built sentiment analyser available in the NLTK library. VADER is optimized for social media data and can produce useful results. The dataset used for this paper was created using web scraping modules written in Python. In addition, sentiment trend prediction was done for a period of 10 days from the day of result announcement using Linear Regression, to achieve sustainable reports.


Introduction
Twitter is a widely used social media platform that enables its users to compose and broadcast brief, 280-character messages, known as "tweets," to their followers. These messages may contain a variety of content types, such as text, images, and videos. Twitter has gained significant popularity as a tool for sharing information, expressing viewpoints, and disseminating news and updates to a broad audience [1]. The use of Twitter data in conjunction with traditional polling methods can provide a more comprehensive understanding of voter behaviour and preferences. By monitoring the sentiment and frequency of tweets related to political candidates or issues, researchers and analysts can gain valuable insights into public opinion and attitudes.
Sentiment analysis, also known as opinion mining, is a natural language processing technique used to determine the emotional tone or sentiment of a text document, such as a tweet, review, or news [2], Ramkumar and Sanjeeva [14]. The goal of sentiment analysis is to identify whether the sentiment expressed in the text is positive, negative, or neutral. This is typically done using machine learning algorithms that are trained on a large corpus of labelled data. Sentiment analysis has a wide range of applications, such as monitoring brand reputation, analysing customer feedback, and predicting stock market trends [3], Sharma and Jain [13]. Sentiment analysis can be performed using three different approaches, namely lexicon-based approaches, machine learning based approaches, and hybrid approaches.
Sentiment analysis can be particularly useful for election prediction when analysing Twitter data [4]. It can be used to evaluate the sentiment towards each candidate by analysing tweets that mention their name or handle. This can provide insights into how positively or negatively each candidate is perceived on social media. It can also be used to analyse the sentiment towards the different issues that are being discussed during the election campaign. Finally, it can be used to track the sentiment of different voter groups, such as millennials or seniors, by analysing tweets that are likely to be posted by those groups. This can provide insights into how different demographics feel about the election and can help political campaigns tailor their messaging accordingly.
Linear Regression is a supervised Machine Learning technique used to model the relationship between a dependent variable and one or more independent variables. A Linear Regression model learns the patterns present in the data and formulates a linear equation. This linear equation enables the model to predict the output for values that are not present in the training dataset. Linear Regression helps one to understand the strength and the degree of correlation between a given pair of variables. It is a useful technique in a wide variety of areas such as financial markets, health and sports. It helps predict the performance of a company's stock in the markets, how age and body weight can impact blood pressure and glucose levels, or a player's performance in an upcoming cricket match. Linear Regression is very attractive because the representation of the result is very simple and intuitive.
Linear Regression can be useful in forecasting a trend regarding topics on social media such as COVID-19, the war between Russia and Ukraine, or the state elections. In the case of elections, this technique, coupled with sentiment analysis, aids in understanding the sentiments of the user base of a social media site towards their leaders, parties and policies. The trend in the sentiment change can be modelled using Linear Regression.
The objective of this paper is to perform sentiment analysis on the tweets posted on Twitter about the top contenders of the 2019 Indian General Elections during the election campaign period and to predict the sentiment trend towards these candidates during the period of declaration of the election result. Finally, the prediction results are compared with the real-world results and conclusions are drawn, in order to generate the sustainable report.
The rest of the paper is organized as follows. Section II gives an account of the previous research on sentiment analysis using both machine learning based approaches and lexicon-based approaches, and on sentiment trend prediction. Section III describes the methodology used to fulfil our objective. Section IV presents the results obtained through our methodology, and Section V states the conclusions and future enhancements.

Related work
Sentiment analysis as a research topic has been growing steadily as the availability of data has increased. It has been regarded as a prime methodology for gaining insights from data. Early works on sentiment analysis-based poll prediction relied on software such as LIWC2007 (Linguistic Inquiry and Word Count), which Tumasjan and team used to predict the German General Elections of 2009. LIWC2007 is a text analysis software created to evaluate the emotional, cognitive, and structural features of text samples using an internal dictionary that has undergone psychometric validation. It examines how frequently various emotions appear in the provided text. The software assigned each tweet to one of 12 categories, each of which represented a different sentiment. The authors examined more than 100,000 texts with references to either a political party or a politician using LIWC. Their findings demonstrate that Twitter is in fact widely utilised for political discussion and that the simple volume of messages referencing a certain party accurately predicts election outcomes. However, their data was not fully representative of the German electorate and the methodology was based on word count alone.
Almatrafi and team applied location-based sentiment analysis to identify the sentiment trend towards the contesting parties of the Indian general elections 2014. The authors used the Naive Bayes Classifier to perform the sentiment analysis. They collected a corpus of 650,000 tweets using the tweepy package in conjunction with a Twitter developer API. The authors manually assigned the location values of the tweets by extracting the location information provided by the users in their profiles. In addition, they manually performed pre-processing of the data, such as removal of whitespaces, replacing keywords preceded by the '#' character with only the keywords, converting @ into AT_USER, replacing entire URL addresses with the word URL, etc. They also replaced major emoticons, slang words, and abbreviations with their custom set of words which denote the corresponding emotion. The authors obtained 70% accuracy in classifying tweets as either positive or negative using the Naive Bayes Classifier. The drawbacks are that a huge amount of manual pre-processing was performed to prepare the dataset and that neutral sentiment was not included.
Gupta and team carried out a feature-based sentiment analysis of Twitter data with negation handling. Often, the presence of negation in the text hinders the effectiveness of sentiment analysis algorithms because negations can invert the meaning of words and phrases. The authors developed a combination of heuristics and a machine learning approach with a negation handling feature, which resulted in improved performance and significant results. Out of all the classifiers implemented, SVM was found to be the most accurate in the experimental results. According to the authors, lexicon-based features and n-grams had the biggest effects on how well SVM performed. For tweets in which the use of a negative term does not automatically imply negation, the authors provided a negation exception mechanism. The model proposed by the authors is not suitable for handling short texts with nuances and emoticons, such as tweets. While these techniques have shown promise, there is a need for further research to improve the efficiency and accuracy of sentiment analysis algorithms in handling negation in short texts. Authors [15] highlighted the significance of ML in prediction, pattern recognition and error reduction across diverse fields, emphasizing the impact of AI in a broad domain.
In recent years, the use of slang, emoticons, and acronyms has grown manifold. Such elements have an effect on the overall sentiment of the text. Popular sentiment analysis models (machine learning approaches) ignore such elements or substitute them with other notations that may not always emphasize or convey the emotions as effectively as the original elements. To take them into account, lexicon-based sentiment analysers such as VADER have been developed for sentiment analysis on text data. Ali and team performed sentiment analysis of Twitter data related to the 2020 US presidential elections using VADER. They segregated the tweets based on the mentioned candidate, tweet status, and user status. Their analysis spanned ten days, from October 31st until November 9th, 2020, over which they collected and recorded nearly 8 million tweets that mentioned either candidate. They aimed to analyse the change in the sentiments of the users based on the above-mentioned parameters. With the exception of a few instances when sarcasm could not be discerned, the authors claim that VADER was able to classify sentiments accurately. Hutto and team discussed the creation, verification, and assessment of VADER (Valence Aware Dictionary and sEntiment Reasoner). VADER is specifically tuned for sentiments expressed in social media posts. According to the authors, VADER (F1 = 0.96) performs better than individual human judges (F1 = 0.84) at correctly categorizing tweet sentiment into positive, neutral, and negative categories. According to the authors' research, VADER in most cases outperforms machine learning algorithms even in the domains for which they were trained. The results of VADER's straightforward rule-based methodology are thus on par with such sophisticated benchmarks. So far, most of the approaches have been based on word count and the sentiment of individual words. VADER helps in implementing sentiment analysis on a sentence of text, which is an advancement in the methodology.
Bonta and team performed an exhaustive survey on lexicon-based approaches for sentiment analysis. They applied three lexicon-based analysers, namely NLTK, TextBlob and VADER, to perform sentiment analysis on a movie reviews dataset provided by Cornell University. The movie reviews were collected from the Rotten Tomatoes website. After conducting the experiments, the authors found VADER to be the most accurate among the chosen analysers, with an accuracy of 77%. The authors endorsed VADER as the most preferable lexicon-based approach to sentiment analysis. They also argued that VADER is preferable to the prevalent machine learning approaches.
Kodirekka and team members performed sentiment analysis of tweets regarding India's new Goods and Service Tax (GST) policy of 2017 and predicted the sentiment on the same topic for the period from 01/10/2017 to 10/10/2017 using machine learning techniques. The authors classified over 200,000 tweets regarding GST, posted between 01/09/2017 and 30/09/2017, as positive, negative, or neutral. The sentiments were classified using the Naive Bayes supervised machine learning algorithm. The authors ascertained that the predictions can be accurately made using the linear regression technique. Authors [16] discussed the importance of text summarization in online shopping and surveyed various techniques, highlighting the use of seq2seq models with LSTM and attention mechanisms for improved accuracy. The approach in [17] utilized Advanced Deep Learning with a global threshold to improve E-commerce product classification, achieving high accuracy and challenging existing technology. Authors [18] presented text classification algorithms for various applications and explored the use of machine learning in detecting phishing attacks. Authors [19]

Proposed Work
A five-step process was implemented to fulfil the objective of this paper. The architecture diagram of the proposed method is illustrated in figure 1. First, the data was collected from Twitter using the web scraping module called snscrape. The data was collected keeping two of the top contenders of the 2019 election in mind, i.e., Narendra Modi and Rahul Gandhi. Hence, two datasets were obtained, one for each candidate. Both datasets were pre-processed using Python modules such as regex and string to remove unwanted elements such as URLs, the # symbol in hashtags, etc. In the next step, the sentiment scores were calculated, and polarity labels were assigned to each record of each dataset based on the criteria suggested by Hutto and team. The proportion of the sentiment polarities towards each candidate was both visualized separately and contrasted against each other to get a clear understanding of the scenario before the prediction of the sentiment trend towards them. Finally, the sentiment trend towards each of the candidates was predicted and visualized individually. Each step is discussed in detail in the sub-sections ahead.

Data collection
Data can be obtained from various sources, and millions of datasets are available on the internet. However, the dataset required for this paper was not yet available on the internet when it was time to implement the proposed method. Therefore, two datasets, one each for Narendra Modi and Rahul Gandhi, had to be obtained using web scraping techniques. Many web scraping modules are available in Python, such as Beautiful Soup, Scrapy, Requests, etc. We found the module snscrape to be the most suitable one for our purpose.
Snscrape is a powerful and versatile Python library used for scraping social media data from various platforms, including Twitter, Reddit, and Instagram [12]. It provides a convenient way to access public data, including posts, comments, user profiles, and other related information. The snscrape module provides various methods to scrape data from social media sites. In our case, tweets were scraped using the TwitterSearchScraper(query).get_items() method, and they were returned as tweet objects. This method takes a query string as a parameter, which contains the criteria to be used by the scraper while scraping tweets. For example, query = "python" would be the query parameter if a user wants to scrape the latest tweets on the topic of python.
Each tweet object possesses a set of attributes which contain the information regarding the individual tweet. Some of the attributes of the tweet object and their corresponding datatypes are mentioned in table 2. The code snippet in figure 2 demonstrates how a user can access various attributes of a tweet object and append them to a Python list named tweets. A limit of 20,000 was set on the number of tweets to be collected. From the date 31st May 2019 to 1st April 2019, the first 20,000 tweets were collected for each of the two candidates. The date range is represented backwards because snscrape returns the most recently posted tweets first for a given period of time. We accessed the date, lang and rawContent attributes, which represent the date of creation of the tweet, the language of the tweet and the text content of the tweet, respectively. For each candidate, all the tweets were stored in a list.
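The collection step described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' script: the helper names collect_tweets and scrape are hypothetical, the stand-in tweet objects are fabricated placeholders used only so the sketch runs offline, and the snscrape import is deferred because the package (and network access) may not be available.

```python
from itertools import islice
from types import SimpleNamespace


def collect_tweets(items, limit=20_000):
    """Keep the date, lang and rawContent of the first `limit` tweet objects."""
    tweets = []
    for tweet in islice(items, limit):
        tweets.append([tweet.date, tweet.lang, tweet.rawContent])
    return tweets


def scrape(query, limit=20_000):
    """Scrape tweets matching `query` (requires the third-party snscrape package)."""
    # Imported lazily so collect_tweets can be used without snscrape installed.
    import snscrape.modules.twitter as sntwitter

    return collect_tweets(sntwitter.TwitterSearchScraper(query).get_items(), limit)


# Offline demonstration with stand-in tweet objects (placeholders, not real data).
fake_items = (
    SimpleNamespace(date=f"2019-05-{31 - i:02d}", lang="en", rawContent=f"tweet {i}")
    for i in range(5)
)
sample = collect_tweets(fake_items, limit=3)
print(sample)
```

With snscrape installed, a call such as scrape("#SomeHashtag since:2019-04-01 until:2019-05-31") would apply the same logic to live results; the hashtag here is a placeholder, since the actual query strings are not listed in the text.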

Data pre-processing
Once the data was collected and stored, it was pre-processed in order to make it suitable for sentiment analysis. We made a few observations regarding the scraped data and pre-processed it accordingly. In some cases, the value of the lang attribute and the actual language of the text content of the tweet did not match. Since the majority of the tweets were posted by Indian citizens, a significant number of tweets did not have their text content in English. Such tweets had to be removed using libraries like regex and string. Besides these measures, common pre-processing operations were performed, such as removal of whitespaces, URLs and irrelevant numbers, and replacing keywords preceded by the '#' character with only the keywords. This concluded the pre-processing of data. Slang words, emoticons, abbreviations, capitalizations, and extra punctuation marks were not removed, because VADER takes these elements into account while calculating the sentiment score for the text. According to Hutto and team, such elements greatly influence the sentiment of the text.
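The cleaning operations listed above could look roughly like the sketch below. It is an illustration only: the function names, the regular expressions, the decision to drop every standalone number, and the ASCII-ratio heuristic for discarding non-English text are all assumptions, since the exact rules are not given in the text. Slang, emoticons, capitalization and punctuation are deliberately left untouched.

```python
import re

URL_RE = re.compile(r"https?://\S+|www\.\S+")   # web links
HASHTAG_RE = re.compile(r"#(\w+)")              # '#keyword' -> 'keyword'
NUMBER_RE = re.compile(r"\b\d+\b")              # standalone numbers only
SPACE_RE = re.compile(r"\s+")                   # runs of whitespace


def clean_tweet(text):
    """Remove URLs and standalone numbers, unwrap hashtags, squeeze whitespace."""
    text = URL_RE.sub(" ", text)
    text = HASHTAG_RE.sub(r"\1", text)
    text = NUMBER_RE.sub(" ", text)
    return SPACE_RE.sub(" ", text).strip()


def looks_english(text, threshold=0.9):
    """Crude heuristic: treat text as English if most characters are ASCII."""
    if not text:
        return False
    ascii_chars = sum(1 for ch in text if ord(ch) < 128)
    return ascii_chars / len(text) >= threshold


raw = "Great rally!!! #Elections2019  https://t.co/abc 123 :)"
cleaned = clean_tweet(raw)
print(cleaned)
```

The ASCII heuristic only filters tweets written in non-Latin scripts; romanized Hindi, for example, would still pass, so a dedicated language detector may be preferable in practice.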

Sentiment analysis
Sentiment analysis was performed using the VADER module of the NLTK library. VADER stands for Valence Aware Dictionary and sEntiment Reasoner. It is a lexicon-based sentiment analyser that is widely used to perform sentiment analysis. It is built on a gold-standard sentiment lexicon, which ensures that it is optimized to handle the nuances of microblog content such as tweets. The tool can efficiently and accurately deal with the peculiar linguistic features of social media text, such as slang, punctuation, capitalization, casual language, and emoticons. The workflow of the VADER tool is illustrated in figure 3. The pre-processed data was loaded into a Pandas dataframe. The VADER sentiment analysis object was initialized and the compound sentiment scores were calculated. The compound sentiment scores lie within a range of -1 to +1, with -1 indicating highly negative sentiment and +1 indicating highly positive sentiment. Once the sentiment scores were obtained, a sentiment polarity was assigned to each sentiment score based on the criteria suggested by the creators of VADER. The criteria are illustrated in Table 3. After the assignment of the sentiment polarities, the proportion of the three sentiments towards the candidates was represented. In order to support the sustainable reports, data visualization techniques were used (Figure 3). This is discussed in the results section.
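The scoring and labelling steps can be sketched as below. This is a hedged illustration rather than the authors' code: the helper names are hypothetical, and the cut-offs of +0.05 and -0.05 on the compound score are assumed from the convention commonly given in the VADER documentation, since the contents of Table 3 are not reproduced in the text. The scoring helper needs NLTK and its vader_lexicon resource, so it is defined but not called here.

```python
from collections import Counter


def polarity_label(compound, threshold=0.05):
    """Map a compound score in [-1, 1] to a polarity label (assumed cut-offs)."""
    if compound >= threshold:
        return "positive"
    if compound <= -threshold:
        return "negative"
    return "neutral"


def score_texts(texts):
    """Compound scores via NLTK's VADER (needs nltk and the vader_lexicon data)."""
    # Imported lazily so the labelling helpers work without NLTK installed.
    from nltk.sentiment.vader import SentimentIntensityAnalyzer

    analyzer = SentimentIntensityAnalyzer()
    return [analyzer.polarity_scores(text)["compound"] for text in texts]


def polarity_proportions(compounds):
    """Share of positive, neutral and negative labels among the given scores."""
    counts = Counter(polarity_label(c) for c in compounds)
    total = sum(counts.values())
    labels = ("positive", "neutral", "negative")
    if total == 0:
        return {label: 0.0 for label in labels}
    return {label: counts[label] / total for label in labels}


# Illustrative compound scores only; not values from the paper's datasets.
example_scores = [0.62, -0.41, 0.0, 0.03, 0.05]
proportions = polarity_proportions(example_scores)
print(proportions)
```

Keeping the labelling separate from the scoring makes it easy to change the cut-offs without re-running the analyser.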

Prediction of sentiment trend
After the sentiment analysis process was completed, the sentiments towards the candidates further into the campaign period were predicted using Linear Regression. Kodirekka and team had carried out trend forecasting using Linear Regression on the topic of GST. The topic of GST was widely discussed on social media sites in its initial days because of its widespread, heavy impact. They performed sentiment analysis on tweets regarding GST using Naive Bayes and forecasted the sentiments of the public on GST. The authors experimented with multiple regression models and concluded that Linear Regression performed better and was more sustainable than any of the other models considered. We trained the Linear Regression model using the labelled training dataset of tweets from 1st April 2019 to 20th May 2019. Then, the model was used to predict the sentiment trend towards each of the candidates from 21st May 2019 to 31st May 2019.
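A trend line of this kind can be sketched with ordinary least squares on a single feature. The example below is an assumption-laden illustration: the text does not state which quantity was regressed, so a daily share of positive tweets against a day index is assumed, the numbers are synthetic, and the closed-form fit is written out in plain Python instead of using a library estimator.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept (one feature)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def predict(slope, intercept, xs):
    """Evaluate the fitted line at new x values."""
    return [slope * x + intercept for x in xs]


# Day 0 is 1st April 2019, so 1st April to 20th May covers indices 0..49.
train_days = list(range(50))
# Synthetic daily share of positive tweets (illustrative, not the paper's data).
positive_share = [0.30 + 0.001 * day for day in train_days]

slope, intercept = fit_line(train_days, positive_share)

# 21st to 31st May 2019 corresponds to day indices 50..60.
forecast_days = list(range(50, 61))
forecast = predict(slope, intercept, forecast_days)
print(round(slope, 4), round(intercept, 4), round(forecast[-1], 4))
```

One such line would be fitted per sentiment class and per candidate; a positive slope then reads as a rising share of that sentiment over the forecast window.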

Results
Our proposed method consists of two phases, namely the sentiment analysis and the sentiment trend prediction. The results of each phase are discussed in this section.

Results of sentiment analysis
The results of the sentiment analysis phase contained the proportions of the three sentiments expressed towards each of the two candidates. These numbers are visualized using pie charts. It is evident from the charts that the candidate Narendra Modi attracted more positive sentiment and less negative and neutral sentiment from the public than the candidate Rahul Gandhi. Although the differences are minute, this information can be used to understand the preference of the public. The sentiment polarity counts are illustrated in figure 6. This bar graph helps the viewer understand the situation at a glance and illustrates the type of response the candidates have attracted from the public. This is usually a reflection of their own actions and statements on social media sites. Users respond to the activity carried out by the leaders on such platforms by expressing their sentiments and opinions. Evidently, Narendra Modi attracted more positive tweets and fewer negative and neutral tweets than Rahul Gandhi. These results show that the system generates a sustainable report.

Results of sentiment trend prediction
The sentiment trend towards each of the two candidates was predicted using Linear Regression. The output was a trend line for each of the three sentiments, showing the most likely trend for a period of ten days. The outputs are illustrated in figures 7 and 8. While predicting the trend lines for Narendra Modi and Rahul Gandhi, the model achieved an accuracy of nearly 97% and 93% respectively. This is represented in table 4. From these results, it can be inferred that Narendra Modi was more likely to receive continued positive feedback from the public, whereas the same was not the case for Rahul Gandhi. This insight can be used by the candidates to understand the support of the public and to tailor their social media activities such that sustainable outcomes are achieved in the future.

Conclusion
In conclusion, we used a linear regression model to predict the sentiment towards Narendra Modi and Rahul Gandhi, with an accuracy of nearly 97% and 93% respectively, during the election campaign period of May 2019. The results show that the sentiment of tweets about Narendra Modi and Rahul Gandhi was mostly neutral in May 2019. Overall, more positive sentiment was expressed towards Narendra Modi than towards his counterpart Rahul Gandhi, who mostly garnered neutral and negative sentiment. This is indicative of a preference for, and greater support towards, Narendra Modi over Rahul Gandhi. Further improvements to our method include analysis of tweets in regional languages of India, such as Hindi, Urdu, etc., location-based poll prediction, and exploratory data analysis of the extracted data to understand key insights within the data. Another crucial improvement could be using a larger dataset, containing data from multiple social media platforms and offline poll data.

Fig. 2. Code snippet to scrape a tweet using snscrape. The query strings in our case contained hashtags relevant to each of the two candidates during the election campaign period. The query_list is a Python list containing multiple query strings.

Fig. 6. A contrast of sentiment polarity counts towards the two candidates.

Table 1. Summary of the existing approaches.
emphasized the significance of feature selection in classification for accuracy and efficiency. It investigates combining features from different

Table 2. Tweet object attributes and their datatypes.

Table 3. Criteria for sentiment polarity assignment.

Table 4. Prediction accuracy table for each candidate.