E3S Web Conf., Volume 430, 2023
15th International Conference on Materials Processing and Characterization (ICMPC 2023)
Article Number: 01154
Number of pages: 21
DOI: https://doi.org/10.1051/e3sconf/202343001154
Published online: 06 October 2023
Human Action Recognition by Learning Spatio-Temporal Features with Deep Neural Networks
1 Dept. of CSE- Data Science, KG Reddy College of Engineering & Technology, Hyderabad, Telangana, India.
2 Dept. of CSE-AI&ML, CMR Technical Campus, Kandlakoya, Hyderabad, Telangana, India
3,4 Dept. of CSE- Data Science, CMR Technical Campus, Kandlakoya, Hyderabad, Telangana, India
5 Professor, Department of Computer Science and Engineering, GRIET, Bachupally, Hyderabad, Telangana, India
6 Uttaranchal Institute of Technology, Uttaranchal University, Dehradun, 248007, India
* Corresponding author: veerender57@gmail.com
Human action recognition plays a crucial role in various applications, including video surveillance, human-computer interaction, and activity analysis. This paper presents a study on human action recognition that leverages a CNN-LSTM architecture with an attention model. The proposed approach captures both spatial and temporal information from videos to recognize human actions. We use the UCF-101 and UCF-50 datasets, which are widely used benchmarks for action recognition: UCF-101 contains 101 action classes and UCF-50 contains 50, both covering diverse human activities. Our CNN-LSTM model uses a CNN as the feature extractor to capture spatial information from individual video frames; the extracted features are then fed into an LSTM network to capture temporal dependencies and sequence information. To enhance the discriminative power of the model, an attention model is incorporated to refine the activation patterns and highlight relevant features. The study also provides insights into the importance of combining spatial and temporal information for accurate action recognition. The findings highlight the efficacy of the CNN-LSTM architecture with an attention model in capturing meaningful patterns in video sequences and improving action recognition accuracy.
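As a rough illustration of the pipeline the abstract describes (a per-frame CNN feeding an LSTM, with attention over time steps), the following PyTorch sketch shows one possible realization. The ResNet-18 backbone, hidden size, clip length, and frame resolution are assumptions chosen for illustration, not the authors' reported configuration.

```python
# Illustrative sketch of a CNN-LSTM model with temporal attention for action
# recognition. Backbone, layer sizes, and shapes are assumed, not taken from
# the paper.
import torch
import torch.nn as nn
import torchvision.models as models

class CNNLSTMAttention(nn.Module):
    def __init__(self, num_classes=101, hidden_size=256):
        super().__init__()
        # CNN feature extractor: spatial features per frame (assumed ResNet-18 backbone).
        backbone = models.resnet18(weights=None)
        self.cnn = nn.Sequential(*list(backbone.children())[:-1])  # -> (B*T, 512, 1, 1)
        # LSTM: temporal dependencies across the frame sequence.
        self.lstm = nn.LSTM(512, hidden_size, batch_first=True)
        # Additive attention that scores each time step.
        self.attn = nn.Linear(hidden_size, 1)
        self.fc = nn.Linear(hidden_size, num_classes)

    def forward(self, clips):                          # clips: (B, T, 3, H, W)
        b, t = clips.shape[:2]
        feats = self.cnn(clips.flatten(0, 1))          # (B*T, 512, 1, 1)
        feats = feats.flatten(1).view(b, t, -1)        # (B, T, 512)
        hidden, _ = self.lstm(feats)                   # (B, T, hidden_size)
        weights = torch.softmax(self.attn(hidden), dim=1)  # attention over T
        context = (weights * hidden).sum(dim=1)        # weighted temporal pooling
        return self.fc(context)                        # class logits

# Usage example: two 16-frame clips at 112x112 resolution (illustrative shapes).
model = CNNLSTMAttention(num_classes=101)
logits = model(torch.randn(2, 16, 3, 112, 112))        # -> (2, 101)
```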
Key words: CNN-LSTM / Deep Learning / Human action recognition
© The Authors, published by EDP Sciences, 2023
This is an Open Access article distributed under the terms of the Creative Commons Attribution License 4.0, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.