Human Action Recognition by Learning Spatio-Temporal Features with Deep Neural Networks

. Human action recognition plays a crucial role in various applications, including video surveillance, human-computer interaction, and activity analysis. This paper presents a study on human action recognition by leveraging CNN-LSTM architecture with an attention model. The proposed approach aims to capture both spatial and temporal information from videos in order to recognize human actions. We utilize the UCF-101 and UCF-50 datasets, which are widely used benchmark datasets for action recognition. The UCF-101 dataset consists of 101 action classes, while the UCF-50 dataset comprises 50 action classes, both encompassing diverse human activities. Our CNN-LSTM model integrates a CNN as the feature extractor to capture spatial information from video frames. Subsequently, the extracted features are fed into an LSTM network to capture temporal dependencies and sequence information. To enhance the discriminative power of the model, an attention model is incorporated to improve the activation patterns and highlight relevant features. Furthermore, the study provides insights into the importance of leveraging both spatial and temporal information for accurate action recognition. The findings highlight the efficacy of the CNN-LSTM architecture with an attention model in capturing meaningful patterns in video sequences and improving action recognition accuracy. You should leave 8 mm of space above the abstract and 10 mm after the abstract. The heading Abstract should be typed in bold 9-point Arial. The body of the abstract should be typed in normal 9-point Times in a single paragraph, immediately


INTRODUCTION
Human action recognition is a fundamental task in computer vision that involves automatically identifying and categorizing human activities from video sequences.CNN-LSTM architectures have been widely used for human action recognition tasks as they can capture both spatial and temporal information from videos.However, the challenge lies in effectively learning discriminative features and patterns from video data, particularly in complex and cluttered scenes.
To address this challenge, various approaches have been proposed, such as incorporating attention mechanisms, using different types of CNNs, and designing novel loss functions.In this research, we propose a CNN-LSTM architecture with an attention model to enhance the discriminative power of the model and improve action recognition accuracy.
The attention model aims to highlight relevant features and improve the activation patterns of the model.It achieves this by incorporating a set of activation functions to learn the nonlinear relationships between features and actions.Traditional approaches to action recognition relied on handcrafted features and shallow learning algorithms, which often struggled to capture the complex spatio-temporal dynamics present in video data.CNNs excel at capturing spatial information from images, while LSTMs are well-suited for modeling temporal dependencies in sequential data.By combining these two powerful architectures, researchers have developed hybrid models that effectively capture both spatial and temporal cues in video sequences, leading to improved action recognition performance.
In this study, we propose a CNN-LSTM model with an attention mechanism for human action recognition.The model aims to leverage the strengths of CNNs in capturing spatial information and LSTMs in modelling temporal dependencies.It also incorporates an attention model to enhance the discriminative power of the learned features.The activation model helps to emphasize relevant features and suppress irrelevant ones, leading to improved action recognition accuracy.
To evaluate the proposed approach, we employ two widely used benchmark datasets: UCF-101 and UCF-50.The UCF-101 dataset consists of videos belonging to 101 action classes, including activities such as basketball, biking, and golf swinging.The UCF-50 dataset, on the other hand, comprises 50 action classes, covering a diverse range of human activities.These datasets provide a challenging and diverse set of video sequences, enabling a comprehensive evaluation of our model's performance.
Demonstration of the effectiveness of the attention model in improving the discriminative power of the learned features.The remainder of this paper is organized as follows: Section 2 provides an overview of related works in the field of human action recognition using deep learning approaches.Section 3 describes the proposed CNN-LSTM model with the attention mechanism in detail.Section 4 presents the experimental setup, including dataset descriptions, evaluation metrics, and implementation details.Section 5 discusses and analyzes the experimental results.Finally, Section 6 concludes the study and discusses potential future directions in the field of human action recognition.

UCF-101 dataset
The UCF-101 dataset is a widely used benchmark dataset for human action recognition tasks in computer vision research.It was introduced in 2012 by Khurram Soomro, Amir Roshan Zamir, and Mubarak Shah at the University of Central Florida.The UCF-101 dataset consists of 13,320 videos, each depicting an action from one of 101 action classes.These action workshops include a wide range of human activities, including playing the guitar, horseback riding, and shooting hoops.Due to the numerous action classes and the high level of variability in the films, the UCF-101 dataset presents a challenge.The right action class is assigned to each video, and the dataset also includes a list of action characteristics that give more detailed information about the actions.A variety of human action recognition approaches, including traditional methods and deep learning methods have been rigorously tested on the UCF-101 dataset.It has evolved into a de facto industry benchmark dataset, and the usage of it has significantly advanced the study of human action recognition.

UCF-50 dataset
The UCF-50 dataset is a widely used benchmark dataset for human action recognition tasks.It was created by collecting video clips from YouTube and other online sources, and it comprises 50 action classes with approximately 25 clips per class.Both simple and complicated behaviours, such as leaping, waving, slamming the ball into the air, juggling a football, and riding a horse, are included in the UCF-50 dataset.There are 6,618 video segments totalling around 27 hours in the UCF-50 collection.The performance of various human action detection algorithms, notably deep learning methods like CNNs and LSTMs, has been extensively assessed using the UCF-50 dataset.The dataset provides a challenging benchmark for action recognition due to the high variability in actions, viewpoints, and environments.To evaluate the performance of action recognition algorithms on the UCF-50 dataset, researchers often split the dataset into training and testing sets, with a commonly used split ratio of 80:20.Metrics including accuracy, precision, recall, and F1-score are used to assess the algorithms' performance.The UCF-50 dataset, which offers a difficult benchmark for assessing the performance of various algorithms, is generally a useful tool for researchers working in the field of human action recognition.Attention mechanisms have also gained attention in human action recognition.By focusing on relevant spatial or temporal regions, attention mechanisms enhance the discriminative power of models.They have been applied to CNN-LSTM architectures, where they selectively attend to salient frames or regions within video sequences.Transfer learning has also been explored for action recognition.By leveraging the learned visual representations, this approach improves performance on limited action recognition datasets.Datasets such as UCF-101, HMDB-51, and Kinetics have played a crucial role in evaluating and comparing action recognition models [1].These datasets contain a wide range of action categories and provide a benchmark for assessing model performance.Researchers have achieved significant advancements by training and evaluating their models on these datasets.In summary, human action recognition has witnessed remarkable progress due to advancements in deep learning.CNNs, LSTMs, attention mechanisms, and transfer learning techniques have greatly improved the accuracy of action recognition models.Furthermore, benchmark datasets have facilitated fair comparisons and benchmarking of different approaches [3].The field continues to evolve, with ongoing research focusing on developing robust models that can handle complex action scenarios, occlusions, and largescale datasets.Graph Convolutional Networks (GCNs) have been employed to capture the structural relationships between human body joints or key points.By modelling the spatial dependencies and interactions between body parts, graph-based models have shown promising results in recognizing actions that involve fine-grained motions and body configurations.
Another direction of research involves incorporating spatial-temporal attention mechanisms.These mechanisms dynamically allocate attention to different spatial regions or temporal segments within videos, focusing on the most informative parts for action recognition.
Action recognition with fewer and fewer shots has also gained attention.In scenarios where labelled data is scarce or completely absent for certain action categories, these approaches aim to recognize actions with limited or no training samples.Few-shot learning leverages a few labelled examples from one action category to generalize and recognize unseen instances, while zero-shot learning transfers knowledge from seen action categories to recognize unseen ones using semantic representations.Video-based action recognition has extended beyond single-stream models [4].Two-stream models utilize separate streams of appearance and motion information, extracting features from RGB frames and optical flow fields, respectively.By combining the two streams, these models capture both appearance and motion cues, leading to improved performance in action recognition tasks.
Domain adaptation and transfer learning techniques have been explored to address the domain shift problem in action recognition [5].By leveraging knowledge from a source domain with sufficient labelled data to improve performance in a target domain with limited labelled data, these approaches aim to enhance the generalization capabilities of action recognition models across different settings or data distributions.End-to-end learning frameworks have also gained popularity, allowing the joint optimization of feature extraction and action classification.These frameworks eliminate the need for handcrafted feature engineering and enable the network to learn discriminative features directly from raw video data.Furthermore, the emergence of large-scale video datasets, such as Moments in Time and Something-Something, has provided researchers with more diverse and challenging data for action recognition [6].These datasets encompass a wide range of daily-life actions, offering opportunities to explore and develop models that can recognize actions in real-world, dynamic environments.
In conclusion, human action recognition has witnessed significant progress with advancements in deep learning, attention mechanisms, graph-based models, few-shot learning, domain adaptation, and large-scale datasets.These developments have paved the way for more accurate and robust action recognition systems, opening up possibilities for various applications such as surveillance, sports analysis, and human-computer interaction.The field continues to evolve, driven by the exploration of novel architectures, techniques, and datasets to improve the performance and generalization capabilities of action recognition models.Temporal modeling techniques have been a focus of research in action recognition.TCNs utilize dilated convolutions to capture long-range temporal dependencies efficiently, while Transformer models leverage self-attention mechanisms to model temporal relationships across video frames.Multi-modal action recognition has gained attention as well.Instead of relying solely on visual information, researchers have incorporated other modalities such as depth data from depth sensors or skeleton data obtained from pose estimation techniques [2,9].These additional modalities provide complementary cues and improve the discriminative power of action recognition models.
Weakly supervised learning methods have been explored to alleviate the need for framelevel or segment-level annotations.These approaches aim to learn action representations using only video-level labels, where the temporal boundaries or locations of actions are unknown.Weakly supervised methods, such as Multiple Instance Learning (MIL) or attention-based pooling, enable models to discover discriminative temporal segments or frames without precise annotations.
Action anticipation has emerged as a related research area, focusing on predicting actions before they occur.These models aim to forecast future actions based on observed video frames, enabling proactive decision-making in real-time applications.Action anticipation requires understanding the temporal dynamics and cues that precede an action, and it has found applications in video surveillance, autonomous systems, and human-robot interaction.
The interpretability of action recognition models has also garnered attention.Researchers have explored methods to visualize and interpret learned representations and attention mechanisms within deep learning models.These techniques provide insights into which regions or frames contribute most to action recognition decisions, enhancing the transparency and trustworthiness of the models [8,10].
Meta-learning or learning to learn has been applied to action recognition, enabling models to adapt quickly to new action categories or unseen scenarios with minimal training samples.Meta-learning frameworks leverage prior knowledge from multiple tasks to generalize and recognize new actions efficiently, making them suitable for scenarios with limited labeled data.Real-time action recognition has been an active research direction, focusing on developing lightweight and efficient models that can process video streams in real-time In summary, human action recognition research has seen advancements in various areas such as temporal modelling, multi-modal fusion, weakly supervised learning, action anticipation, interpretability, meta-learning, and real-time processing The field continues to evolve with ongoing research to address new challenges and explore emerging techniques for better understanding and recognition of human actions.
Lightweight Action Recognition Architectures: The study focuses on developing a lightweight architecture for human action recognition using deep neural networks.The aim is to design an efficient model that can process RGB data and achieve accurate recognition results [11].The emphasis on lightweight architectures addresses the need for real-time applications and resource-constrained environments.CNNs are powerful tools for extracting spatial features from input images, enabling the model to distinguish objects from the background.The use of CNNs highlights the effectiveness of deep learning in capturing discriminative features for action recognition tasks.Long Short-Term Memory (LSTM): LSTM units are incorporated into the architecture to capture temporal motion features [12].By leveraging the temporal dependencies within action sequences, LSTM networks enable the model to understand the sequential nature of actions.The integration of LSTM units emphasizes the importance of considering temporal information for accurate action recognition.
Temporal-Wise Attention Model: A temporal-wise attention model is introduced to identify the significant parts within frames that contribute to action recognition.This attention mechanism enhances the discriminative power of the model by focusing on the most informative regions or frames.The temporal-wise attention model allows the architecture to dynamically allocate attention and adaptively learn important temporal features.Joint Optimization Module: The proposed architecture includes a joint optimization module that explores the intrinsic relations between the different LSTM features extracted from different CNN layers.This module aims to enhance the integration of spatial and temporal information for improved action recognition performance.By jointly optimizing the LSTM features, the model can effectively capture both the local and semantic characteristics of actions.
Experimental Evaluation: The effectiveness of the proposed method is validated through extensive experimental evaluations.The results demonstrate the efficiency and accuracy of the architecture in recognizing human actions.The evaluation highlights the advantages of leveraging spatial and temporal information, as well as the effectiveness of the attention mechanism and joint optimization module.In summary, the literature work focuses on addressing the challenge of human action recognition by proposing a lightweight architecture based on deep neural networks [15].The architecture combines CNNs for spatial feature extraction, LSTM units for temporal motion feature extraction, a temporalwise attention model for focusing on informative frames, and a joint optimization module for integrating spatial and temporal information.The experimental results demonstrate the efficiency and effectiveness of the proposed method in accurately recognizing human actions.
Action Recognition in Robotics Systems: The study emphasizes the importance of action recognition in robotics systems.Accurate recognition of human actions is crucial for robots to understand and respond appropriately to human behaviors in various interactive scenarios [16].By developing an efficient architecture specifically for robotics applications, the study addresses the need for real-time and lightweight action recognition models that can be deployed on robotic platforms.
Spatial and Temporal Feature Fusion: The proposed architecture leverages both spatial and temporal features for robust action recognition.The CNN component captures spatial information by extracting local and semantic characteristics from the input RGB data.The LSTM units, on the other hand, capture temporal motion features by modeling the sequential dependencies within action sequences [17].The fusion of spatial and temporal features allows the model to capture both appearance and motion cues, improving the discriminative power of the architecture.
Attention Mechanisms in Action Recognition: The introduction of the temporal-wise attention model highlights the significance of attention mechanisms in action recognition [19].By dynamically allocating attention to informative parts within frames, the attention model enhances the model's ability to focus on relevant regions and frames.This attention mechanism enables the architecture to selectively attend to crucial temporal cues, leading to improved recognition accuracy.
Joint Optimization for Feature Integration: The inclusion of the joint optimization module demonstrates the importance of integrating different LSTM features extracted from different CNN layers.By jointly optimizing the features, the model can effectively capture both low-level and high-level representations of actions.This integration allows the architecture to leverage complementary information from multiple layers, enhancing the overall action recognition performance.
Efficiency and Effectiveness of the Proposed Method: The experimental evaluation demonstrates the efficiency and effectiveness of the proposed architecture.The lightweight design enables real-time processing and efficient deployment on resource-constrained platforms [23].The results show that the architecture achieves high recognition accuracy, validating the effectiveness of the spatial and temporal feature fusion, attention mechanism, and joint optimization module in improving action recognition performance.
Comparison with Existing Methods: The study compares the proposed architecture with existing approaches for human action recognition.This comparison highlights the advantages and contributions of the proposed method, such as its lightweight nature, incorporation of attention mechanisms, and joint optimization for feature integration.The comparison showcases the superiority of the proposed architecture in terms of accuracy, efficiency, and robustness.In conclusion, the literature work focuses on developing a lightweight action recognition architecture for robotics systems [20].By combining spatial and temporal features, incorporating attention mechanisms, and applying joint optimization, the proposed method achieves accurate and efficient recognition of human actions.The study highlights the significance of action recognition in robotics and demonstrates the effectiveness of the proposed architecture through extensive experimental evaluations and comparisons with existing methods.

METHODOLOGY
In this section, we present the detailed methodology for human action recognition using a CNN-LSTM architecture with an attention model.The proposed methodology consists of several steps, including data preprocessing, feature extraction, model architecture design, training, and inference.In this section, we provide a comprehensive and detailed methodology for human action recognition using a CNN-LSTM architecture with an attention model [25].Our proposed methodology encompasses various essential steps, such as data preprocessing, feature extraction, model architecture design, training, and inference.By following this methodology, we aim to achieve accurate and robust action recognition results.
The first step in our methodology is data preprocessing.We begin by selecting the UCF-101 dataset, which is widely recognized as a benchmark dataset for action recognition.This dataset contains videos of 101 action categories captured in diverse settings and view points [27].To ensure consistency and optimal performance, we apply preprocessing techniques such as resizing the videos to a standardized resolution and normalizing pixel values.Additionally, we divide the dataset into training and testing subsets to facilitate model evaluation.
By utilizing a pre-trained CNN, such as VGG or ResNet, we can capture discriminative visual information from the input frames.This process involves passing each frame through the CNN and obtaining high-dimensional feature vectors that encode spatial characteristics.By utilizing these two types of LSTM networks, we effectively capture both local and global temporal dependencies within action sequences [1].
We design the overall model architecture by fusing the outputs of the two LSTM networks, resulting in spatial-temporal features that comprehensively represent the actions [28].We then integrate the attention model into the CNN-LSTM architecture, placing it after the LSTM layers.This integration allows the attention model to refine the extracted features and capture action-specific cues, further improving the models discriminative capabilities.
Moving on to the training phase, we select an appropriate loss function, such as categorical cross-entropy, to optimize the model during training.This loss function measures the discrepancy between the predicted action labels and the ground truth labels, guiding the model towards better performance.We employ optimization algorithms, such as stochastic gradient descent (SGD) or Adam, to update the model's parameters and ensure efficient convergence during training.To prevent overfitting and enhance generalization, we also apply regularization techniques like dropout or weight decay [5].
During inference, we evaluate the trained model on the testing subset of the UCF-101 dataset.We feed video sequences into the model and obtain predicted action labels.By comparing these predictions with the ground truth labels, we assess the model's performance in terms of accuracy, precision, recall, and F1 score [32].To further improve the model's performance, we explore fine-tuning techniques and transfer learning.Finetuning involves training the model on additional labeled data or adjusting specific model layers to adapt to the target action recognition task.Transfer learning leverages knowledge learned from pre-trained models on similar tasks or datasets to enhance the model's performance.
By following this comprehensive methodology, we aim to achieve accurate and robust human action recognition using a CNN-LSTM architecture with an attention model.Each step in the methodology has been carefully designed to ensure the efficient extraction of spatial and temporal features, incorporation of action-specific cues, and effective training and inference [34].The proposed methodology serves as a guideline for researchers and practitioners interested in developing advanced models for human action recognition.The dataset is curated to include videos captured in various settings and viewpoints, representing real-world scenarios.This diversity introduces challenges such as variations in lighting conditions, camera angles, and background clutter, making the dataset a suitable testbed for evaluating the robustness and generalization capabilities of action recognition models.The dataset provides a rich and representative sample of the types of actions encountered in real-world applications.The UCF-101 dataset offers a considerable number of videos for each action category, ensuring sufficient data for training, validation, and testing.This abundance of data allows for comprehensive analysis and evaluation of different action recognition approaches [15].
Moreover, the dataset provides ground truth labels for each video, indicating the corresponding action category.These labels serve as a reference for evaluating the performance of action recognition models.Researchers can compare the predicted action labels from their models against these ground truth labels to measure accuracy, precision, recall, and other evaluation metrics.The popularity and widespread usage of the UCF-101 dataset have contributed to the development of numerous state-of-the-art action recognition models.Researchers can benchmark their models against existing approaches and evaluate their performance on a standardized dataset [31].This facilitates fair comparisons and fosters advancements in the field by building upon previous work.The UCF-101 dataset is an ideal choice for our research on human action recognition.Its extensive collection of videos covering 101 action categories, diverse settings, and viewpoints provides a realistic and challenging dataset for evaluating the performance of action recognition models.The availability of ground truth labels ensures accurate evaluation and comparison of different approaches.By utilizing the UCF-101 dataset, we aim to contribute to the advancement of action recognition techniques and improve the understanding of human actions in various domains [12].

VIDEO PREPROCESSING
Video preprocessing plays a crucial role in preparing the dataset for human action recognition.In this section, we describe the various steps involved in video preprocessing, which include resizing videos, and extracting the frames then normalize the pixels.Additionally, we discuss the extraction of optical flow fields or the use of precomputed flow fields to capture motion information [26].

Resizing Videos:
To ensure consistency and optimize computational efficiency, we resize the videos in the dataset to a consistent resolution.By standardizing the video size, we create a level playing field for subsequent processing steps.This resizing step helps to alleviate the potential computational burden associated with varying video resolutions and facilitates efficient feature extraction [23].Normalization of pixel values is another important preprocessing step.By applying normalization techniques, such as mean subtraction or min-max scaling, we bring the pixel values of the videos to a common scale.This normalization process enhances the model's ability to learn meaningful features by reducing the impact of variations in lighting conditions and overall video brightness [27].

Splitting the Dataset
Splitting the data set into train and test validation is more importent processes.The splitting process ensures that the model is tested on unseen data, providing a reliable estimation of its performance [18].

Extraction of Optical Flow Fields
Motion information is vital for accurate action recognition.To capture motion cues, we extract optical flow fields from the videos.Optical flow represents the apparent motion of objects in consecutive frames.By calculating the displacement of pixels between frames, we obtain a dense optical flow field that encodes motion information.This additional information aids the model in distinguishing different actions based on their dynamic properties [17].

Use of Precomputed Flow Fields
Alternatively, we can utilize precomputed flow fields for capturing motion information.Precomputed flow fields are computed offline using specialized algorithms, such as Dense Optical Flow or Flow Net.These flow fields are then stored and used as inputs during training and testing.Using precomputed flow fields reduces the computational overhead during training and inference, enabling faster processing and alleviating the need for online optical flow computation.The choice between extracting optical flow fields and using precomputed flow fields depends on the specific requirements of the action recognition task and the available computational resources.Both methods enable the model to capture temporal dynamics and incorporate motion information into the recognition process [34].
By performing video preprocessing steps like resizing, pixel value normalization, and dataset splitting, we create a standardized and optimized dataset for human action recognition.Additionally, the extraction of optical flow fields or the use of precomputed flow fields enhances the model's ability to capture and leverage motion information.These preprocessing steps lay the foundation for subsequent feature extraction and model training stages, contributing to the overall effectiveness and accuracy of the action recognition system.

CNN-LSTM Fusion
We combine the outputs of the two LSTM networks from the previous step, resulting in fused spatial-temporal features.This fusion enables the model to capture spatial and temporal dynamics simultaneously, improving the overall representation of actions.The model architecture for human action recognition involves a crucial step called CNN-LSTM fusion.In this step, we combine the outputs of the two LSTM networks, which were previously applied to capture temporal information from different levels of spatial features.The fusion of these outputs results in fused spatial-temporal features, which are essential for effectively representing actions.By combining spatial and temporal dynamics, the model gains a comprehensive understanding of actions.being performed in the video sequences.The CNN component extracts spatial features from individual frames, using a pre-trained network such as VGG or ResNet [28].These spatial features capture visual information and provide a representation of the objects and scenes present in the video frames.The LSTM networks, on the other hand, focus on capturing temporal dynamics by modeling sequential dependencies within action sequences.This fusion is crucial as it allows the model to leverage complementary information from the two LSTM networks and capture the complete context of the actions.By combining spatial and temporal features, the model architecture achieves a more robust and comprehensive representation of actions.It becomes capable of understanding not only the appearance of objects and scenes but also the dynamic aspects and temporal variations within action sequences.This fused representation of spatial-temporal features enables the model to better discriminate between different actions and improves overall recognition performance.It enhances the model's ability to capture complex and subtle variations in actions, leading to more accurate predictions.
The CNN-LSTM fusion in the model architecture demonstrates the importance of integrating spatial and temporal information for effective action recognition.By leveraging both spatial and temporal dynamics, the model can capture the nuanced characteristics of actions and make informed decisions about the actions being performed in the video sequences.The integration of the attention model into the CNN-LSTM architecture is achieved by connecting it after the LSTM layers.The output from the LSTM layers serves as input to the attention model, allowing it to analyze and refine the extracted features.This integration facilitates the fusion of the temporal information captured by the LSTM layers with the action-specific cues provided by the attention model.The refined features obtained from the attention model are then used for the final classification task.The fused features, enriched with action-specific cues and refined temporal dynamics, enhance the discriminative power of the architecture in distinguishing between different action categories accurately.To train the integrated model, we employ appropriate loss functions such as categorical crossentropy, which measure the discrepancy between the predicted action labels and the ground truth labels.Optimization algorithms like stochastic gradient descent (SGD) or Adam are utilized to update the model's parameters during the training process.Regularization techniques, such as dropout or weight decay, may be applied to prevent over fitting and improve generalization.During the training phase of the proposed human action recognition model, several key steps are undertaken to optimize the model's performance and improve its generalization capabilities.One crucial aspect is the selection of an appropriate loss function.We choose the categorical cross-entropy loss function, which is commonly used for multi-class classification tasks.This loss function quantifies the difference between the predicted action labels and the ground truth labels associated with each video sequence.By minimizing this discrepancy, the model learns to accurately classify actions and improves its overall performance.
To mitigate the risk of overfitting, we incorporate regularization techniques during training.Dropout, a widely adopted regularization method, is applied to randomly deactivate a fraction of the neurons in the network during each training iteration [23].This prevents the model from relying too heavily on specific neurons and encourages the learning of more robust and generalized features.Another regularization technique we employ is weight decay, which adds a penalty term to the loss function to discourage large weight values.This helps prevent the model from becoming overly sensitive to individual data samples and improves its ability to generalize to unseen data.This process aims to find the optimal parameter values that minimize loss, and improve the model's ability to accurately classify actions.
During training, we carefully monitor the model's performance using validation data.This enables us to track training progress and make adjustments as necessary.If the model's performance plateaus or starts to degrade, we may employ techniques such as learning rate

RESULTS
In this section, we present the results of our proposed method for human action recognition using the CNN-LSTM with attention model.We conducted extensive experiments on a benchmark dataset, evaluating the performance in terms of recognition accuracy and comparing it with baseline models.The experiments were performed on a highperformance computing system, ensuring consistent and reliable computational resources.
Evaluation Metrics: We employed the standard metrics for action recognition evaluation, including accuracy, precision, recall, and F1 score.Accuracy measures the overall correctness of action recognition predictions, while Precision and recall assess performance in individual action categories.The F1 score provides a balanced measure of precision and recall, considering both false positives and false negatives.When comparing the performance of the proposed CNN-LSTM with attention model on the UCF50 and UCF101 datasets, we observe the following results: The model achieved an accuracy of 0.7724 on the UCF50 dataset, indicating that it correctly classified 77.24% of the action samples.The precision, which measures the model's ability to correctly predict positive instances, was found to be 0.7792.The recall, which assesses the model's ability to correctly identify positive instances, matched the accuracy value of 0.7724.The F1-score, which combines precision and recall into a single metric, was calculated to be 0.7698.
When evaluating the model on the UCF101 dataset, it achieved a significantly higher accuracy of 0.9083, indicating a more accurate classification of 90.83% of the action samples.Precision, measuring the model's ability to correctly identify positive instances, reached a value of 0.9124.Recall, representing the model's ability to correctly detect positive instances, was found to be 0.9083.The F1-score, which balances precision and recall, yielded a value of 0.9053.
The model achieved higher accuracy, precision, recall, and F1-score values on the UCF101 dataset compared to the UCF50 dataset.This improvement can be attributed to the larger and more diverse nature of the UCF101 dataset, which includes a broader range of action categories and variations in video samples.
The superior performance on the UCF101 dataset demonstrates the effectiveness of the proposed CNN-LSTM with attention model in handling complex action recognition tasks.The higher accuracy, precision, recall, and F1-score achieved on the UCF101 dataset suggest that the model has a stronger capability to generalize and recognize actions across different contexts and variations.

CONCLUSION
In conclusion, this study focused on human action recognition using a CNN-LSTM architecture with an attention model.By integrating spatial and temporal information, the proposed approach achieved accurate recognition of human actions on benchmark datasets, namely UCF-101 and UCF-50.The CNN-LSTM model effectively captured spatial features through a CNN backbone, while the LSTM network captured temporal dependencies and sequence information.The inclusion of attention model enhanced the discriminative power of the model by highlighting relevant features and improving activation patterns.The findings of this study emphasize the significance of considering both spatial and temporal information for accurate action recognition.By leveraging the complementary nature of these two modalities, the proposed CNN-LSTM architecture with an attention model demonstrated its ability to capture meaningful patterns in video sequences and improve action recognition accuracy.This highlights the importance of designing robust models that can effectively integrate both spatial and temporal cues in order to achieve accurate and reliable action recognition results.The utilization of benchmark datasets, such as UCF-101 and UCF-50, provided a comprehensive evaluation of the proposed approach's performance.These datasets encompass a wide range of human activities and action classes, allowing for a thorough assessment of the model's ability to generalize and recognize diverse actions.The proposed CNN-LSTM architecture with an attention model offers a promising approach for accurately recognizing human actions in various applications, including video surveillance, human-computer interaction, and activity analysais.

Fig
Fig 4 Model Architecture Fig 5 graph of accuracy vs val_accuracy of the model

Fig 6
Fig 6 Evaluation Metrics for UCF101 dataset

Fig 8 Fig 8 Fig 9 Fig 10
Fig 8 Predicting the diving image