Ambient Sound Recognition using Convolutional Neural Networks

Sound recognition has attracted considerable attention due to its many uses in areas including voice recognition, music analysis, and security systems. Convolutional neural networks (CNNs) have become a powerful tool for sound recognition, producing state-of-the-art results across a variety of tasks. In this study, we examine the architecture of CNNs, several training methods used to enhance their performance, and accuracy testing. The performance of the proposed sound recognition technique has been tested using 1000 audio files from the UrbanSounds8K dataset. The accuracy results obtained using CNN and Support Vector Machine (SVM) models were 95.6% and 93% respectively. These results demonstrate the efficiency of using an advanced CNN architecture with five convolutional layers together with a versatile dataset such as UrbanSounds8K.


Introduction
Sound recognition is the process of identifying sounds and, in the case of speech, further converting spoken language into written text. It is widely used in applications such as voice assistants and language translation. To achieve sound recognition, the trained program must go through a series of steps, which can be grouped into three main stages: pre-processing of the signals, extraction of specific features, and their classification [1]. In the pre-processing stage, the input audio is converted into a format suitable for the trained model. Features are then extracted from the audio; these include Mel Frequency Cepstral Coefficients (MFCCs), which are fed into the trained model for classification.
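As an illustration of the pre-processing and feature-extraction stages, the sketch below computes a 40-value MFCC vector for a single clip, assuming the librosa library; the file path and the averaging of MFCCs over time frames are illustrative choices, not necessarily those of the original pipeline.

import numpy as np
import librosa

def extract_mfcc(path, n_mfcc=40):
    # Load the audio file; sr=None keeps the original sampling rate.
    signal, sr = librosa.load(path, sr=None)
    # Compute the MFCC matrix (n_mfcc x time frames).
    mfcc = librosa.feature.mfcc(y=signal, sr=sr, n_mfcc=n_mfcc)
    # Average over time frames to obtain one fixed-length vector per clip.
    return np.mean(mfcc, axis=1)

# Hypothetical path following the UrbanSound8K folder layout.
features = extract_mfcc("UrbanSound8K/audio/fold1/example_clip.wav")
print(features.shape)  # (40,)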
The main goal of speech recognition is to allow machines to recognize sounds and act on them [2]. Traditional approaches to sound recognition, such as Hidden Markov Models (HMMs), which rely on hand-crafted feature extraction techniques, face limitations in handling variations in speech patterns such as different accents and speaking styles. In contrast, convolutional neural networks (CNNs) have proven to be a powerful tool for sound recognition because they can extract relevant features directly from audio data and can handle large datasets while eliminating the need for manual feature extraction. CNNs have also shown impressive performance in various speech recognition tasks, such as speaker identification, speech recognition, and language modelling.
In this paper, we review the literature on sound recognition using CNNs. A CNN is designed to process data that comes in the form of arrays: 2D spectrograms for images or audio, and 1D signals for speech and biological signals [3]. CNNs have many applications depending on how they are trained; a review of the literature reveals intensive research over the last few years to utilise deep learning and CNNs even to detect cracks in highways and tunnels [4]. Convolutional neural networks take the lead in this field because of their adaptability and capacity to handle various modifying parameters [5]. We discuss the architecture of CNNs, their advantages over traditional methods, and their application to speech recognition. We also evaluate the performance and accuracy of CNNs in various sound recognition tasks and discuss the current challenges and future directions for research in sound recognition by feature extraction using CNNs.
Sound recognition, being a widely researched area, comprises different methods that authors have used to build efficient, accurate sound recognition systems targeted at particular goals. It is therefore up to practitioners to choose a suitable classification model, such as SVM or CNN, along with a specific methodology, to obtain accurate results that suit their use case and their dataset.
In this work, a CNN-based model is proposed for the sound recognition of different types of audio signals from the benchmark UrbanSounds8K dataset. This approach is observed to perform consistently better than other state-of-the-art methods using the same dataset. We also use a Support Vector Machine (SVM) classifier for comparison with the accuracy results obtained using the CNN. In addition, we compare other neural networks used for sound recognition and classification, such as Recurrent Neural Networks (RNNs) in [6][7][8] and k-Nearest Neighbor (KNN) in [9][10][11][12], with the CNN and SVM classifiers used in this project, to analyse their differences and their respective performance on the task of sound recognition. This is accompanied by a comparison of the CNN and SVM classifiers themselves on the same task.

UrbanSounds8K dataset
The UrbanSounds8K dataset is a collection of 8732 labelled sound clips, each with an average duration of 4 seconds. The clips contain urban sounds that are common in everyday life, divided into ten classes: air conditioning, car horns, children playing, dog barking, drilling, engine idling, gunshots, jackhammers, sirens, and street music, with each recording labelled with its class. The recordings were collected from different locations under different environmental conditions, which makes the UrbanSounds8K dataset well suited to training and testing machine learning models for sound recognition and classification: there is enough data to train on, and a separate part of the remaining data can be used for testing and validating the model. In this project we use only part of this dataset because of hardware limitations, but it is sufficient to obtain satisfactory and accurate results for the purposes of this sound recognition model. Table 1 shows a summary of the dataset used in the present study.
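A minimal sketch of how a 1000-clip subset could be drawn from the dataset's metadata, assuming the standard UrbanSound8K.csv layout (with columns such as slice_file_name, fold, classID and class); the path and the random-sampling strategy are illustrative assumptions.

import pandas as pd

# Load the dataset's metadata file (path assumed).
meta = pd.read_csv("UrbanSound8K/metadata/UrbanSound8K.csv")

# Draw a random 1000-clip subset to match the hardware constraints above.
subset = meta.sample(n=1000, random_state=42)
print(subset["class"].value_counts())  # check the class balance of the subset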

Convolutional neural networks
Convolutional Neural Networks (CNNs) are a type of artificial neural network commonly used for the classification of two-dimensional data, such as image recognition tasks, but they can also be applied to one-dimensional data such as audio files for speech and sound recognition. In this context, CNNs are frequently used to extract characteristics from sound recordings, such as spectrograms and Mel-frequency cepstral coefficients (MFCCs), which are then fed into a classifier to recognize the content of the audio, easing the classification of the data and the training of the model for sound recognition tasks. A spectrogram is a 2D representation of the sound signal that shows how the signal's frequency content evolves over time. Fig. 1 shows the waveforms and spectrograms of some audio files from the UrbanSounds8K dataset.
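Views of the kind shown in Fig. 1 can be produced with a short script along the following lines, assuming librosa and matplotlib; the file path is hypothetical.

import numpy as np
import librosa
import librosa.display
import matplotlib.pyplot as plt

y, sr = librosa.load("UrbanSound8K/audio/fold1/example_clip.wav", sr=None)

fig, (ax1, ax2) = plt.subplots(2, 1, figsize=(8, 6))
librosa.display.waveshow(y, sr=sr, ax=ax1)          # time-domain waveform
ax1.set_title("Waveform")

# Short-time Fourier transform, converted to decibels for display.
S_db = librosa.amplitude_to_db(np.abs(librosa.stft(y)), ref=np.max)
librosa.display.specshow(S_db, sr=sr, x_axis="time", y_axis="hz", ax=ax2)
ax2.set_title("Spectrogram (dB)")
plt.tight_layout()
plt.show()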

Architecture of CNNs for speech recognition
The architecture of a CNN has many layers, each of which performs a different set of operations on the input data, with each layer depending on the previous ones to produce a reliable classification. The input to the network is typically a sequence of audio samples, which is transformed into a spectrogram using a Fourier transform; this representation then goes into the input layer. Below is a brief explanation of the architecture of the proposed CNN model used in this project and of what happens in each layer:

Input layer
This layer specifies the shape of the input data, which in this case is a 1D array of 40 Mel-frequency cepstral coefficient (MFCC) values representing a single audio sample.

Convolutional layers
The convolutional layers use filters to extract local features from the audio signal received from the input layer. The model proposed in this project includes five convolutional layers, each with a kernel size of 3 and a ReLU activation function. The number of filters increases from layer to layer, going from 64 to 1024, and each layer preserves the spatial dimensions of its input with the help of padding.

Max pooling layers
These layers reduce the dimensionality of the output of the convolutional layers by down-sampling it, taking the maximum value within a pooling window. Our model includes five max pooling layers, each with a pool size of 2, which reduces the spatial dimension of the output by a factor of 2.
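A minimal sketch of the input, convolutional and max-pooling stages described above, assuming the Keras API; the intermediate filter counts between 64 and 1024 (here doubling at each layer) are an assumption.

from tensorflow import keras
from tensorflow.keras import layers

model = keras.Sequential()
model.add(keras.Input(shape=(40, 1)))  # 40 MFCC values, one channel

# Five Conv1D + MaxPooling1D blocks; filters grow from 64 to 1024, and
# 'same' padding preserves the length before each pooling step halves it.
for filters in (64, 128, 256, 512, 1024):
    model.add(layers.Conv1D(filters, kernel_size=3, padding="same",
                            activation="relu"))
    model.add(layers.MaxPooling1D(pool_size=2))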

Flatten layer
This layer converts the output of the final max pooling layer into a one-dimensional vector that is used as the input to the fully connected layers.

Fully connected layers
The fully connected layers use the extracted features to classify the test audio signal into one of the 10 possible sound classes. Our model consists of three dense layers, each with a ReLU activation function, followed by a dropout layer to prevent overfitting. The final output layer then produces a probability distribution over the sound classes using a softmax activation function.
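Continuing the same sketch, the flatten and fully connected stages described above might look as follows; the dense layer widths and the dropout rate are assumptions, since only the number of dense layers and the use of ReLU and dropout are stated.

model.add(layers.Flatten())   # flatten the pooled features into a 1D vector

# Three dense layers with ReLU, each followed here by dropout (rate assumed).
for units in (256, 128, 64):
    model.add(layers.Dense(units, activation="relu"))
    model.add(layers.Dropout(0.3))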

Output layer
This layer specifies the number of output units, which correspond to the possible sound classes; in this project there are ten. After passing through all these layers, the model is built using the categorical cross-entropy loss function and the Adam optimizer. Table 2 shows the architecture summary of the CNN model used in this project, as per the explanations above.
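The output layer and the compilation step described above could then be added along these lines; this remains a sketch of the stated design (softmax output, categorical cross-entropy, Adam) rather than the exact original code.

model.add(layers.Dense(10, activation="softmax"))   # one unit per sound class

model.compile(loss="categorical_crossentropy",
              optimizer="adam",
              metrics=["accuracy"])
model.summary()   # layer-by-layer overview, as summarised in Table 2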

Results and discussion
The performance and overall accuracy of the CNN-based model for sound recognition were obtained on the training set using 40 epochs and a batch size of 32; the results below show the epoch numbers with their corresponding loss, accuracy, validation loss and validation accuracy. From the results obtained, the model fluctuates at first, but at epoch 11/40 it achieved a training accuracy of 0.9953 while the validation accuracy remained constant at 0.9500. Towards the end, the training accuracy increased while the validation accuracy remained constant, which indicates some signs of overfitting in the model. Overall, the model performed well on the validation data, reaching an accuracy of 0.9563 at epoch 18/40 and maintaining this value until the end of training.
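A training call matching the settings above (40 epochs, batch size 32) might look as follows, assuming X_train/X_val hold the MFCC features reshaped to (samples, 40, 1) and y_train/y_val the one-hot encoded labels; these variable names are placeholders.

history = model.fit(X_train, y_train,
                    epochs=40,
                    batch_size=32,
                    validation_data=(X_val, y_val))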
The model's performance described above can be visualized in the loss and validation graphs shown in Fig. 2. To analyse the accuracy of the model in predicting the target variables, graphs of target variables against predicted variables were also plotted, as shown in Figs. 3 and 4.
Fig. 5 shows the confusion matrix of the model's results. From this confusion matrix, the model performed very well in most classes, with the highest values on the diagonal of the matrix. However, there are a few instances where the model produced incorrect classifications. For example, in two instances the CNN model predicted class 3 when the actual class was class 10; similarly, there is one instance where the model predicted class 10 when the true class was 3. The CNN model has an accuracy of 95.6%, while the SVM classifier scored 93%. This is a good result, and the model is expected to perform well.
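The accuracy figures and the confusion matrix in Fig. 5, along with the SVM baseline, could be reproduced roughly as follows, assuming scikit-learn and the same placeholder arrays as above (one-hot labels for the CNN, flattened MFCC vectors for the SVM); the SVM kernel choice is an assumption.

import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.svm import SVC

# CNN predictions: take the most probable class from the softmax output.
y_pred = np.argmax(model.predict(X_test), axis=1)
y_true = np.argmax(y_test, axis=1)            # y_test assumed one-hot encoded
print("CNN accuracy:", accuracy_score(y_true, y_pred))
print(confusion_matrix(y_true, y_pred))

# SVM baseline trained on the flattened MFCC vectors with integer labels.
svm = SVC(kernel="rbf")
svm.fit(X_train.reshape(len(X_train), -1), np.argmax(y_train, axis=1))
print("SVM accuracy:", svm.score(X_test.reshape(len(X_test), -1), y_true))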
Table 4 shows a comparison between the accuracy of the model used in this project and that of models in similar projects in recent years. From this comparison of previous papers, based on the model used and the accuracies obtained, it can be seen that research using a CNN model obtained high accuracy, and the results were consistent across most of the studies that used this model. For instance, comparing the accuracy results obtained in [2], [13], [16] and [17], which are 97.06%, 96.48%, 92% and 92.02% respectively, we can conclude that the results of these models are consistent, which shows how a CNN-based classifier can be relied upon for the task of sound recognition, as it delivers results that are both consistent and highly accurate. When we compare the CNN models to those that used an SVM classification model, it is apparent that the SVM classifier obtained satisfactory accuracy, but these results were lower than those of the CNN-based classifiers. For example, in this paper the CNN model obtained an accuracy of 95.6% while the SVM classifier reached 93%.
On the other hand, other neural networks have also been used; in [8] and [10] the research is based on RNN and KNN models respectively. Looking at the accuracy results of these models, we can first conclude that they are not widely used for this task. We can also conclude that these classification models obtained fluctuating accuracy results, high in some papers and low in others. For instance, in [8], where a Recurrent Neural Network (RNN) model was used, the accuracy was low, at 66.31%. This is significantly lower even than the lowest result obtained with a CNN-based model in [1], where it was 73%. For the k-Nearest Neighbor (KNN) model, the results obtained were satisfactory but still not as high as those of the CNN models.

Conclusion
In this study, a convolutional neural network-based sound recognition model is proposed to classify different types of audio signals. The findings of this project were verified on a benchmark dataset for audio classification. The method enables network training on devices with constrained computing resources, which, for example, allows ongoing development by incorporating newly acquired data. As future work, other types of deep learning could be explored, which could be utilised to remove the need to understand which patterns of simple events generate which complex occurrences.

Fig. 2. A graph of training and validation loss against epoch number.

Fig. 5. A confusion matrix of the results obtained.

Table 1. A table displaying the first ten audio files of the dataset and the available columns.
Fig. 1. Waveforms and spectrograms of sample audio files in the dataset.

Table 2. Summary of the convolutional neural network used in the present study.

Table 3. Training and validation loss and accuracy for each epoch.

Table 4. Comparison of the accuracy of the model used in this project with models from other similar projects in recent years.