Automated Abstractive Text Summarization using Deep Learning

Recurrent Neural Networks (RNNs) are a class of neural networks capable of processing sequential input such as time series and natural language. Summarized information is very helpful for finding relevant and useful content in the shortest possible time, but manually reading and summarizing lengthy texts demands considerable effort, dedication, patience, and attention to detail. Automated abstractive text summarization addresses this by using deep learning to select the most significant information in a text and condense it while maintaining its underlying meaning. The objective of this work is to construct an abstractive text summarizer using deep learning: it takes a large, meaningful text as input and produces a summary of that text as output. The algorithm used here is the Long Short-Term Memory (LSTM) model, a type of RNN, combined with the Sequence-to-Sequence (Seq2Seq) model. Seq2Seq learning trains a model to transform an input sequence into a different output sequence. The dataset used is the News Summary dataset taken from Kaggle. Experimental results on this dataset show that the proposed method produces appropriate summaries and is competitive with other methods.


Introduction
Text refers to written or printed words, symbols, or other forms of communication that convey meaning. It can take the form of books, articles, documents, newspapers, web pages, messages, and more, and it is used to communicate information and ideas through language. Data can be conveyed as text, photos, videos, audio files, and other formats, but most data and knowledge today is represented as text. This means the volume of textual data on any topic can be exceptionally large, as text is the most widely used format.
Recent years have seen a substantial increase in the volume of textual data, making it an excellent resource for knowledge extraction and analysis. Humans find it incredibly time-consuming and difficult to manually summarize lengthy texts; reading copious amounts of data and extracting the information by hand is a slow task. A summary of the text saves time: instead of reading all the data, we read the summary, and only if it relates to what we need do we explore the main text. This motivates building an automated text summarizer. Text summarization is a method of gathering the most important facts and main ideas from a lengthy text source to produce a concise, accurate, and elegant summary. It is accomplished using various algorithms and other NLP tools. Text summarization has been studied extensively in the field of NLP for a long time, but recently deep learning approaches have shown strong results. Text summarization can be broadly divided into two categories: (a) extractive and (b) abstractive text summarization.
Extractive text summarization takes key sentences and phrases directly from the original text and presents them as the summary; all text in the summary comes from the input itself. Automated Abstractive Text Summarization (AATS), by contrast, converts the provided text into a brief, streamlined summary by understanding the whole set of information and capturing the main concept of the original text. The text in an abstractive summary may or may not appear in the source text. For example, given a news story about a cricket final, an extractive system would return sentences copied from the article, while an abstractive system might generate a new sentence such as "India won the final".

Literature survey
An Extended Transformer was the paradigm for abstractive document summarization proposed by Yongjian, Weijia, Tianyi, and Wenmian [2]. The goal was to employ a Transformer-based encoder-decoder system with two original enhancements for abstractive document summarization. The effectiveness of their approach is assessed using ROUGE scores, and the network successfully lessens the effect that secondary information has on the generated summaries. The dataset consists of 287,226 training pairs, 13,368 validation pairs, and 11,490 test pairs. Nafees Muneera and Sriramya [3] suggested a hybrid technique for automated abstractive text summarization, a Multilayered Attentional RNN with Stacked LSTM (MASta-LSTM-RNN). The objective was to produce an abstractive summary using this hybrid technique; the dataset contains training, validation, and testing pairs, and the study showed that it performs better than the other algorithms. Shengli, Haitao, and Tongxiao [4] suggested an LSTM-CNN based ATS framework, with the objective of building an LSTM-CNN model for automated abstractive text summarization. The CNN dataset is composed of 92k texts with their summaries, while the Daily Mail dataset is composed of 219k texts with their summaries. The framework outperforms the other models and achieves highly competitive results on non-automatic linguistic quality evaluation.
Mahnaz and William Yang [5] evaluated the performance of existing ATS approaches. The objective was to assess existing text summarization techniques and present their challenges. The techniques evaluated were the TextRank extractive method, a seq2seq model with attention, a pointer-generator abstractive system, and a pointer-generator with coverage abstractive system. The dataset used is the WikiHow dataset, a new large-scale dataset assembled as an alternative to news-based datasets like CNN/Daily Mail and New York Times. The study concludes that the new features and aspects inherent in the WikiHow dataset can be used to further improve summarization systems. Tian Cai, Mengjun, Huailiang, Lei Jiang, and Qiong Dai [6] proposed a new model named the RC-Transformer (RCT). The objective was to extend the Transformer with a supplemental RNN-based encoder to capture sequential context representations. The datasets used are Gigaword, a sentence summarization dataset, and DUC-2004. The study concluded that the RCT model produces higher-quality summaries than the standard Transformer. Yong Zhang, Dan Li, Yuheng Wang, Yang Fang, and Weidong Xiao [7] proposed a generative convolutional seq2seq architecture. The objective of that paper was to apply a convolutional seq2seq model to text summarization to make the model more effective. The full-length F1 variant of ROUGE is used to estimate the performance of the model, and two real-life datasets, Gigaword and the DUC corpus, are used. Yixin and Pengfei [8] proposed a simple yet powerful framework for abstractive summarization. The objective of that paper was to use contrastive learning for automated abstractive text summarization. ROUGE-1, ROUGE-2, and ROUGE-L are used as the main evaluation metrics, and the recently developed semantic similarity metrics BERTScore and MoverScore are also used to evaluate the model. The CNN/DM and XSum datasets are used. The study concludes that current abstractive approaches have the capacity to produce candidate summaries that are significantly better than their initial outputs. A paradigm for improving automated abstractive text summarization was put forward by Panagiotis Kouris, Georgios Alexandridis, and Andreas Stafylopatis [9], based on the integration of deep learning methods with semantic data transformations. The potential of the model was evaluated using ROUGE scores, and the model makes use of the Gigaword dataset.
Nikola Nikolov and Richard Hahnloser [10] proposed abstractive document summarization without parallel data. The goal was to create an abstractive summarization system that uses only extensive databases of example summaries and unrelated articles. Systems are evaluated using METEOR and ROUGE-1, ROUGE-2, and ROUGE-L with F1 metrics, with CNN/DM as the benchmark dataset. A further paradigm for improving automated abstractive text summarization was put forward by Panagiotis Kouris, Georgios Alexandridis, and Andreas Stafylopatis [11], based on coupling deep learning methods with semantic transformations of the data. The goal was to create a generalised summary using an extensive encoder-decoder architecture and an interdisciplinary framework for semantic-based text generalisation. The success of the model is assessed using ROUGE scores, and the model makes use of the Gigaword dataset. The suggested framework combines DL methods with semantic-based content approaches to produce abstractive summaries in a broader sense.
The paper [16] discusses the importance of text summarization in online shopping and surveys various techniques, highlighting the use of seq2seq models with LSTM and attention mechanisms for improved accuracy. The authors of [17] highlighted the significance of ML in prediction, pattern recognition, and error reduction across diverse fields, emphasizing the impact of AI in a broad range of domains. The approach in [18] used advanced deep learning with a global threshold to improve e-commerce product classification, achieving high accuracy and challenging existing technology. The author of [19] presented text classification algorithms for various applications and explored the use of machine learning in detecting phishing attacks.
Problem statement and objectives

Problem statement
Most knowledge transfer is done through text data in the form of webpages, books, newspapers, etc. We need a short summary of large text data so that, instead of reading the whole text, we can read only the summary and save time.

Objectives
The objectives of the paper are: • To build an automated abstractive text summarizer for summarizing text data.
• To use deep learning techniques for this process.
The proposed architecture has two components: (a) an encoder network, typically composed of multiple LSTM layers, and (b) a decoder network, also typically composed of multiple LSTM layers. These layers are used to generate the summary.

Modules of the proposed work
The paper can be divided into four modules: data preprocessing, building the model, training the model, and evaluation. Let us discuss the modules in detail.

Data preprocessing
In this module, the raw data from the News Summary dataset, available on Kaggle, is preprocessed and converted into new data that can be used for training the model.
The first phase in every machine learning or deep learning project is preparing the data and then preprocessing it. Data preparation is the process of assembling raw data from various sources. The data may be unclean, and unnecessary data may accompany the required data, which is why we need data preprocessing. Data preprocessing is an essential step in training any model, and the specific steps depend on the type of model being trained. Without data preprocessing, the model may not train as expected and can give inaccurate and inconsistent results. The training of the model depends on the data provided for training, so it is important that the data is clean and converted into the right form [14]. The first step is to divide the entire dataset into train and test datasets; these are used to train the model and then assess its performance. The next step is to remove unnecessary data such as the 'id' column: of the three attributes in the dataset, namely id, articles, and highlights, the id attribute is not required for training the model. A minimal sketch of this preparation step follows.
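The sketch below assumes the dataset has been downloaded from Kaggle as a local CSV file; the file name, encoding, and split ratio are illustrative assumptions, not the paper's exact settings.

import pandas as pd
from sklearn.model_selection import train_test_split

# Load the raw data; the local file name is a hypothetical path.
df = pd.read_csv("news_summary_more.csv", encoding="latin-1")

# Drop identifier-like columns: they carry no signal for summarization.
df = df.drop(columns=["id"], errors="ignore")

# Hold out a test split used later for evaluation.
train_df, test_df = train_test_split(df, test_size=0.1, random_state=42)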
All the further preprocessing steps can be considered as purifying the data; a sketch of these cleaning steps follows the list. The steps are as follows: • Lower case: if the text has upper-case characters, they are converted into lowercase. Most text is in lowercase, with the first letter generally capitalized at the beginning of a sentence. For example, "Hello" and "hello" are the same word with the same meaning, but the model treats them as different words, so we convert all words in the dataset to lowercase.
• Split text: to apply further preprocessing steps such as expanding contractions and lemmatization, we cannot work on whole sentences, so the text is split into words. Once the preprocessing is done, the words are combined again to form sentences.
• Expand contractions: contractions are shortened word forms in the English language. For the model to train more accurately, we replace each contraction with its original words. Table 1 shows a few examples of common contractions.
• Stop word removal: articles and pronouns are generally considered stop words. These words can draw importance away from the main content, and the summary may otherwise be produced with these unwanted words. Removing stop words lets the model focus on useful information.
• Lemmatization: lemmatization converts words into their root word. For example, "walked", "walking", and "walks" are all converted into "walk".
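The following is a small sketch of the cleaning steps above using the SpaCy library mentioned later in this section; the contraction map is a tiny illustrative subset, and the model name en_core_web_sm is an assumption.

import spacy

nlp = spacy.load("en_core_web_sm")  # assumes this SpaCy model is installed

CONTRACTIONS = {"don't": "do not", "can't": "cannot", "it's": "it is"}

def clean_text(text):
    text = text.lower()                                  # lower case
    words = text.split()                                 # split text into words
    words = [CONTRACTIONS.get(w, w) for w in words]      # expand contractions
    doc = nlp(" ".join(words))
    # remove stop words and punctuation, keep lemmas (root words)
    lemmas = [tok.lemma_ for tok in doc if not tok.is_stop and tok.is_alpha]
    return " ".join(lemmas)                              # re-join into a sentence

print(clean_text("He walked and can't stop Walking"))    # e.g. "walk stop walk"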
After the text is purified, the split words are combined back into sentences. Special tokens such as <sostok> and <eostok>, which have a special meaning for the model, are added around each summary: <sostok> denotes the start-of-summary token and <eostok> denotes the end-of-summary token.
For training the model, the data must be in the form of vectors, so we apply word embeddings to the text data. The vectors of real numbers are derived from the probability distribution of each word occurring before or after another. The preprocessing steps involving lemmatization, word embedding, etc. are done using the SpaCy library in Python; hence SpaCy is used in place of GloVe. A short sketch of this embedding step is shown below.
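In the sketch below, the summaries are wrapped with the special tokens and an embedding matrix is filled from SpaCy word vectors; the model name en_core_web_md (a SpaCy model that ships with vectors) and the Keras Tokenizer usage are assumptions for illustration.

import numpy as np
import spacy
from tensorflow.keras.preprocessing.text import Tokenizer

nlp = spacy.load("en_core_web_md")           # a SpaCy model that includes vectors

summaries = ["india win the match"]          # toy cleaned summary
summaries = ["<sostok> " + s + " <eostok>" for s in summaries]

tokenizer = Tokenizer(filters="")            # empty filters keep <sostok>/<eostok>
tokenizer.fit_on_texts(summaries)

vocab_size = len(tokenizer.word_index) + 1
dim = nlp.vocab.vectors_length               # 300 for the md/lg models
embedding_matrix = np.zeros((vocab_size, dim))
for word, i in tokenizer.word_index.items():
    embedding_matrix[i] = nlp.vocab[word].vector   # stays zero for unknown words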

Building the model
Here we build a Recurrent Neural Network (RNN) as our model. We choose an RNN because it is well suited to sequential data, and text is sequential data. In an RNN, connections among the nodes can form a cycle, allowing the output of some nodes to affect subsequent input to the same nodes [14]. Among RNN architectures, Long Short-Term Memory (LSTM) is the one most commonly used for building such models. LSTM is a form of RNN architecture created to solve the vanishing gradient problem that plagues conventional RNNs: when the gradients used to update the network weights become very small, the network may be unable to learn long-term relationships. LSTM networks have a more complex structure than classic RNNs and are able to selectively keep and forget information over long periods of time. They are made up of three kinds of gates: input, forget, and output. Figure 4 gives a basic comparison between a traditional RNN and an LSTM.
Figure 4 shows the difference between the working of a traditional RNN and an LSTM: the LSTM has an additional long-term memory, whereas the RNN has only the working memory. At each step, the LSTM cell takes in an input vector together with the previous hidden state and cell state. The input vector is transformed by a linear layer and then passed through a series of activation functions to compute the input gate, output gate, and forget gate values. The encoder network is responsible for taking an input text and encoding it into a fixed-length vector representation that the decoder network can use to generate the summary. The encoder typically consists of multiple LSTM layers, which capture the contextual details of the text [15]. The LSTM layers in the encoder process the sequence of word embeddings generated from the input text: at every step, an LSTM layer takes the embedding of the current word and the previous hidden state, and produces a new hidden state and cell state as output. The final hidden state and cell state are used as input for the decoder.
The decoder network also typically consists of multiple LSTM layers, which are used to generate the summary. At each time step, the decoder takes in the previously generated word and the encoded vector from the encoder network. The previous word is typically represented as a one-hot vector, which is fed into a linear layer to produce an embedding vector. This embedding vector is then concatenated with the encoded vector from the encoder network, which contains the contextual details of the input sequence. The concatenated vector is passed through multiple LSTM layers, with the output of each layer serving as the input to the next. The output of the final LSTM layer is passed through a linear layer and a softmax activation to produce a probability distribution over the possible words that can be generated at that time step. Figure 5 shows the working of the encoder and decoder layers: it represents how a sentence is taken in by the encoder layer and how the decoder arranges the words from the encoder's output into a meaningful sentence.
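The following is an illustrative Keras sketch of the encoder-decoder just described; it is a sketch under stated assumptions rather than the paper's exact configuration, with the latent dimension, sequence length, and vocabulary sizes invented for illustration.

from tensorflow.keras.layers import Input, LSTM, Embedding, Dense
from tensorflow.keras.models import Model

latent_dim, max_text_len = 256, 100          # assumed sizes
x_vocab, y_vocab = 20000, 8000               # assumed vocabulary sizes

# Encoder: stacked LSTM layers that read the article.
enc_in = Input(shape=(max_text_len,))
enc_emb = Embedding(x_vocab, latent_dim)(enc_in)
enc_seq, _, _ = LSTM(latent_dim, return_sequences=True, return_state=True)(enc_emb)
enc_out, state_h, state_c = LSTM(latent_dim, return_sequences=True,
                                 return_state=True)(enc_seq)

# Decoder: generates the summary, initialised with the encoder's final states.
dec_in = Input(shape=(None,))
dec_emb = Embedding(y_vocab, latent_dim)(dec_in)
dec_out, _, _ = LSTM(latent_dim, return_sequences=True, return_state=True)(
    dec_emb, initial_state=[state_h, state_c])

# Softmax over the summary vocabulary at every time step.
out = Dense(y_vocab, activation="softmax")(dec_out)
model = Model([enc_in, dec_in], out)
model.summary()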

Training the model
After establishing the model, we need to train it. The model built with the seq2seq architecture is compiled with a loss function and an optimizer: the sparse_categorical_crossentropy loss and the Adam optimizer. The next step is to train the model on the preprocessed news summary data. This involves feeding the training data into the model and adjusting the weights of the neural network to minimize the loss function. The model is trained for multiple epochs (here, 50 epochs).
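A sketch of this compile-and-fit step is given below, assuming the model object from the previous section and preprocessed integer-encoded arrays; the variable names and batch size are assumptions.

# Compile with the loss function and optimizer named above.
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")

# Teacher forcing: the decoder sees the summary shifted by one token,
# and the targets are the next tokens of the same summary.
model.fit(
    [x_train, y_train[:, :-1]],
    y_train[:, 1:],
    epochs=50,
    batch_size=128,
    validation_data=([x_val, y_val[:, :-1]], y_val[:, 1:]),
)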

Evaluation
The next step after training the model is evaluation. This step measures the performance of the constructed model on the test dataset. We choose the intrinsic method for our evaluation, and the metric used is the ROUGE score, where ROUGE stands for Recall-Oriented Understudy for Gisting Evaluation. The evaluation is done by measuring three ROUGE metrics, namely ROUGE-1, ROUGE-2, and ROUGE-L.
• ROUGE-1 measures the overlap between the predicted summary and the reference summary in terms of unigrams (single words). It calculates the precision, recall, and F1-score of the unigrams in the predicted summary compared to those in the reference summary. • ROUGE-2 is similar to ROUGE-1, but it measures the overlap of bigrams (pairs of words) instead of unigrams. • ROUGE-L measures the longest common subsequence between the predicted summary and the reference summary, and calculates the precision, recall, and F1-score of that longest common subsequence. • The metrics range between 0 and 1; the higher the ROUGE value, the better the model.
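As a small sketch of computing these three metrics, the snippet below uses the third-party rouge-score package; the package choice and the toy sentences are assumptions, as the paper does not state which ROUGE implementation it used.

from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"],
                                  use_stemmer=True)
scores = scorer.score(
    "india won the cricket match on sunday",   # reference summary
    "india won the match",                     # predicted summary
)
for name, s in scores.items():
    print(name, round(s.precision, 2), round(s.recall, 2), round(s.fmeasure, 2))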

Description of the dataset
The dataset used for training our sequence-to-sequence model is the News Summary dataset, available on Kaggle [12]. To train our deep learning model we require a dataset comprising the input data and the actual output; the model then learns by comparing its predicted output with the actual output. In a text summarizer, the input is a large text and the output is its summary. Newspapers are a good example: headlines can serve as summaries of the original news, and since news articles include a wide variety of words and syntax, the model can learn more. So we choose the News Summary dataset. It has two files, news_summary.csv and news_summary_more.csv. The news_summary file consists of 4,515 records containing the author name, headline, URL of the article, short text, and complete article. The news_summary_more file has 98,280 records containing headlines and text.
The text is the main data and the headline is its summary. The news in the dataset was collected from Inshorts, which scraped news articles from newspapers in the country such as The Hindu, the Guardian, and the Indian Times. The news articles cover February 2017 to August 2017.
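A short sketch of loading the two files, assuming local copies of the Kaggle CSVs; the latin-1 encoding and the column selection are assumptions based on the descriptions above.

import pandas as pd

summary = pd.read_csv("news_summary.csv", encoding="latin-1")            # 4,515 rows
summary_more = pd.read_csv("news_summary_more.csv", encoding="latin-1")  # 98,280 rows

# Keep the article text as input and the headline as its summary.
data = summary_more[["text", "headlines"]].rename(columns={"headlines": "summary"})
print(data.shape)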

Experimental results
After the model has been trained, it is tested on a validation dataset to see how closely the generated outputs match the original, human-written summaries. Figure 6 displays the review, the original summary, and the model-predicted summary.
The metric used to measure accuracy is the ROUGE score. Table 2 shows the recall, precision, and F1-score of ROUGE-1, ROUGE-2, and ROUGE-L for the proposed method.

Conclusion and future enhancements
In this paper, we worked on building a model that can generate a good summary for the text given as input. Building an abstractive text summarizer is a challenging task: when text is summarized abstractively, the summary is generated in the model's own words. We created an abstractive summarization system trained on a sizable database of articles and their sample summaries. For building an abstractive text summarizer, one can choose traditional NLP approaches or deep learning approaches; the RNN is best suited for text data. The proposed method is based on LSTM with the Seq2Seq model. LSTM is a special type of RNN with a long-term memory that traditional RNNs lack. For training the model, we used the news_summary dataset. The method entails data preprocessing, model architecture development, model training, and performance evaluation using metrics like the ROUGE score. Using LSTM and seq2seq, the model is trained on the dataset to provide an abstractive text summarizer: the input sequence is passed through the encoder LSTM, which generates a hidden-state representation of the input sequence, and the decoder LSTM receives this hidden state and generates the output summary. The dataset's information quality is good, but it is constrained by the quantity of news summaries it contains; furthermore, the model has only a few layers and is rather simple. To boost the model's performance and produce more precise results, the suggested approach can be improved in several ways: using a pre-trained model, increasing the size of the training dataset, incorporating attention mechanisms and transformer-based models, and evaluating the model in distinct domains.

Fig. 3. Architecture diagram. The architecture typically consists of an encoder network, which takes in the input text and encodes it into a fixed-length vector, and a decoder network, which takes in the encoded vectors and generates the summary.

Fig. 6. Review, original summary, and predicted summary by the model.

Table 1. Examples of a few common contractions.

Table 2. ROUGE score of the proposed method.