Air Quality Index Prediction

In recent years, air pollution has become a life-threatening concern in many nations as a result of human activity, industrialisation, and urbanisation. These activities emit sulphur oxides, carbon dioxide (CO2), nitrogen oxides, carbon monoxide (CO), chlorofluorocarbons (CFCs), lead, mercury, and other pollutants into the atmosphere. At the same time, estimating air quality is difficult because pollutant and particle concentrations evolve and vary unpredictably over space and time. In this project we compare two machine learning algorithms for predicting the Air Quality Index (AQI) and its predominant pollutant. Support Vector Machine (SVM) is a prominent machine learning method for forecasting pollutant and particle levels and predicting the AQI, and Random Forest Regression is another. We work with data from India's Open Government Data (OGD) Platform, which publishes Air Quality Index readings from across India together with concentrations of contaminants such as Sulphur Dioxide (SO2), Nitrogen Dioxide (NO2), Particulate Matter (PM10 and PM2.5), and Carbon Monoxide (CO). The output of the project is the predicted Air Quality Index from the two algorithms and a comparison of the models using several error metrics.


Introduction
Air pollution refers to the presence of pollutants in the air that are harmful to humans and to the planet as a whole. It can be described as one of the most dangerous threats humanity has ever faced. By affecting the human respiratory and cardiovascular systems, these pollutants increase mortality and the risk of disease in the population. To address this problem, there is a need to predict air quality from pollutant data using machine learning techniques. Consequently, air quality index prediction and evaluation have become a widely studied research area. The aim of this work is to investigate machine learning based techniques for air quality prediction. Accurate forecasting of air quality, that is, of the concentration of ambient contaminants such as Ozone, Nitrogen Dioxide, Sulphur Dioxide, Carbon Monoxide, Particulate Matter less than 10 microns in diameter, and Particulate Matter less than 2.5 microns in diameter, is vital because the effect of these pollutants on human health can be severe.

Literature Review
Multiple Linear Regression (MLR) and Principal Component Regression (PCR) techniques have been used to analyse past AQI and meteorological data. Records from previous years are used to predict the daily AQI of the current year. The value predicted by Multiple Linear Regression is then compared with the observed AQI for the current year's summer, monsoon, post-monsoon, and winter seasons. Principal Component Analysis is used to detect collinearity among the predictor variables, and the resulting principal components are used in Multiple Linear Regression to eliminate that collinearity and reduce the number of predictors. Principal Component Regression forecasts the AQI better in winter than in the other seasons. Only meteorological parameters were used in this study to forecast future AQI; ambient air pollutants that may cause harmful health consequences were not considered.
A Bayesian network model has also been used to forecast the air quality index. The model's input factors are SO2, NO2, O3, CO, PM2.5, and PM10, and its output is the AQI value. Once the Bayesian network is constructed, it is used to forecast air quality, and the predicted value is compared with the actual value. The results show a prediction accuracy of over 80%, with the forecast close to the real value in most cases, indicating that the Bayesian network model has practical use as an air quality prediction method. All six factors (PM2.5, PM10, SO2, NO2, CO, and O3) influence the air quality index: the higher their values, the greater the air pollution and the greater the harm to human health.
The accuracy of single-step and multi-step-ahead forecasts of ground-level ozone (O3), nitrogen dioxide (NO2), and sulphur dioxide (SO2) concentrations has been investigated using three machine learning (ML) techniques: support vector machines, M5P model trees, and artificial neural networks (ANN). Two types of modelling are used: 1) univariate and 2) multivariate. Performance is evaluated using prediction trend accuracy and root mean square error (RMSE). The results show that the M5P algorithm gives reliable estimates when different features are used in multivariate modelling. Air quality is a significant issue that directly affects human health. In related work, wireless monitoring motes collect large amounts of air quality data. These data are processed and used to forecast pollutant concentration values on a machine-to-machine platform. The platform employs machine learning based algorithms to build forecasting models from the collected data.

Random Forest Model:
Random Forest (RF) is an ensemble learning technique for classification and regression based on decision trees. The Random Forest algorithm builds many different decision trees, each from a random selection of the dataset's features and a random subsample of its records. The outputs of all trees are averaged to improve accuracy and prevent overfitting. This strategy reduces high variance at the cost of a slight increase in bias, a tradeoff that usually makes the model more robust when predicting on unseen inputs.
Random Forest is a popular choice because it trains quickly, does not require the input data to be normalised, and has hyperparameters that need little tuning. It uses randomly selected features to split individual trees, while random sampling with replacement is used to generate the subset of data for each decision tree. At each split in a decision tree, a random subset of the attributes is examined. If the target is categorical, Random Forest chooses the most frequent class among the trees' predictions. If the target is numerical, the mean of all the trees' estimates is chosen. To make predictions, every test data point is run through each decision tree in the forest. For classification, the trees vote and the prediction follows the majority vote; for regression, the estimates are averaged, so the ensemble is stronger than any single learner. The averaged prediction tends to be close to the true value, which allows the random forest to overcome the variance in prediction that each individual decision tree has. Scikit-learn's Random Forest implementation is used to generate the RF models. Fitting with these settings results in several different decision trees across the different models for each dataset.
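The averaging behaviour described above can be illustrated with a minimal sketch. It assumes scikit-learn's RandomForestRegressor and uses synthetic data; the two feature columns, the toy target, and the hyperparameter values are hypothetical and are not the settings used in this work.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic stand-in for pollutant readings (hypothetical, not the OGD data).
rng = np.random.default_rng(0)
X = rng.uniform(0, 300, size=(200, 2))  # assumed columns: PM2.5, PM10
y = np.maximum(X[:, 0], X[:, 1]) + rng.normal(0, 5, size=200)  # toy AQI-like target

# bootstrap=True gives each tree a random sample of the rows (with replacement);
# max_features=1 restricts each split to one randomly chosen feature.
forest = RandomForestRegressor(
    n_estimators=50, max_features=1, bootstrap=True, random_state=0
)
forest.fit(X, y)

X_new = np.array([[80.0, 120.0], [200.0, 150.0]])
forest_pred = forest.predict(X_new)

# For regression, the forest output is the mean of the per-tree predictions.
tree_preds = np.stack([tree.predict(X_new) for tree in forest.estimators_])
manual_mean = tree_preds.mean(axis=0)

print(forest_pred)
print(manual_mean)
```

The two printed arrays should agree, which shows that the regression forest is simply averaging its trees.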

Support Vector Model:
Support Vector Machines (SVM) are a supervised learning method for classification, regression, and anomaly detection. SVM was applied in two different ways. In the classification problem shown in figure a, support vectors are the data points that lie on the edge of the margin, closest to the hyperplane. The decision boundary lies midway between the two margin edges. The hyperplane separates the classes in the dataset, and the output for new data is estimated according to which class the new data most resembles.
In the regression problem, a linear function is fitted so that the hyperplane approximates the nonlinear target function with maximum margin. An ε-insensitive loss is added as an additional parameter to allow some deviation within a tube around the fitted function, which is here assumed to be a straight line. For data points that fall more than ε above or below the fitted function, the SVR applies a penalty weighted by the parameter C (set to an appropriate value). Data points inside the tube, however, are not penalised. Since the support vectors are the points on or outside the tube boundaries, the number of support vectors decreases as the tube widens and increases as it narrows.
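The role of the tube width and the penalty can be sketched as follows. This is an illustration under stated assumptions, not the configuration used in this work: it uses scikit-learn's SVR with an RBF kernel on synthetic data, and the helper names eps_insensitive_loss and fit_svr, as well as the values of C and epsilon, are hypothetical.

```python
import numpy as np
from sklearn.svm import SVR

def eps_insensitive_loss(y_true, y_pred, epsilon):
    """Zero inside the tube, linear in the excess deviation outside it."""
    return np.maximum(0.0, np.abs(y_true - y_pred) - epsilon)

# Synthetic one-dimensional regression problem (hypothetical data).
rng = np.random.default_rng(0)
X = np.sort(rng.uniform(0, 5, size=(80, 1)), axis=0)
y = np.sin(X).ravel() + rng.normal(0, 0.1, size=80)

def fit_svr(epsilon, C=10.0):
    # C weights the penalty for points outside the tube of half-width epsilon.
    model = SVR(kernel="rbf", C=C, epsilon=epsilon)
    model.fit(X, y)
    return model

narrow = fit_svr(epsilon=0.01)  # thin tube: most points fall outside it
wide = fit_svr(epsilon=0.5)     # wide tube: most points fall inside it

n_sv_narrow = len(narrow.support_)
n_sv_wide = len(wide.support_)
print("support vectors (epsilon=0.01):", n_sv_narrow)
print("support vectors (epsilon=0.5): ", n_sv_wide)
```

Comparing the two counts shows how widening the tube reduces the number of support vectors.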

Model Evaluation Methods:
Several statistical scores were used to evaluate the performance of the O3, NO2 and PM2.5 models, including the coefficient of determination (R^2), mean absolute error (MAE), root mean squared error (RMSE), and root mean squared logarithmic error (RMSLE).
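For reference, the four scores can be written out directly from their standard definitions. The sketch below uses only the Python standard library; the function names and the sample values are illustrative, and RMSLE is assumed to use the common log(1 + x) convention.

```python
import math

def mae(y_true, y_pred):
    """Mean absolute error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def rmse(y_true, y_pred):
    """Root mean squared error."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

def rmsle(y_true, y_pred):
    """Root mean squared logarithmic error; inputs must be greater than -1."""
    sq = [(math.log1p(p) - math.log1p(t)) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(sq) / len(sq))

def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

# Illustrative values only (not results from this work).
y_true = [1.0, 2.0, 3.0, 4.0]
y_pred = [1.0, 2.0, 3.0, 6.0]
print(mae(y_true, y_pred), rmse(y_true, y_pred), r2(y_true, y_pred), rmsle(y_true, y_pred))
```

Lower MAE, RMSE, and RMSLE indicate a better fit, whereas a higher R^2 does.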

Dataset Description
In this paper, the data were obtained from the archive provided by the Open Government Data (OGD) Platform India. The OGD data include daily concentrations of particulate matter of diameter less than or equal to 2.5 microns (PM2.5) and of diameter less than or equal to 10 microns (PM10). The Air Quality Index is derived from these values. The data are provided entirely in XML format.
The raw data for this analysis are in XML format, as shown in Figure, and are first converted to .csv. Several data preprocessing steps are then applied: preparing a more accurate and complete dataset by imputing missing data, ensuring that the data are uniformly scaled through normalisation and standardisation, and creating a smaller, more compact dataset through feature extraction and selection.
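The conversion and preprocessing steps can be sketched with the Python standard library. The XML layout, tag names, and helper functions below are hypothetical, since the real OGD schema is not reproduced here; mean imputation and min-max scaling are used as simple examples of the imputation and normalisation steps.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Hypothetical XML layout; the real OGD schema may differ.
SAMPLE_XML = """
<records>
  <record><station>A</station><pm25>40</pm25><pm10>90</pm10></record>
  <record><station>B</station><pm25></pm25><pm10>110</pm10></record>
  <record><station>C</station><pm25>80</pm25><pm10>150</pm10></record>
</records>
"""
FIELDS = ["pm25", "pm10"]

def parse_records(xml_text):
    """Read each <record> into a dict; empty readings become None."""
    root = ET.fromstring(xml_text.strip())
    rows = []
    for rec in root.findall("record"):
        row = {"station": rec.findtext("station")}
        for field in FIELDS:
            text = (rec.findtext(field) or "").strip()
            row[field] = float(text) if text else None
        rows.append(row)
    return rows

def impute_mean(rows):
    """Replace missing values with the mean of the observed values."""
    for field in FIELDS:
        known = [r[field] for r in rows if r[field] is not None]
        mean = sum(known) / len(known)
        for r in rows:
            if r[field] is None:
                r[field] = mean
    return rows

def min_max_normalise(rows):
    """Scale each field to the range [0, 1]."""
    for field in FIELDS:
        lo = min(r[field] for r in rows)
        span = max(r[field] for r in rows) - lo
        for r in rows:
            r[field] = (r[field] - lo) / span if span else 0.0
    return rows

def to_csv(rows):
    """Serialise the rows as CSV text."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=["station"] + FIELDS, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()

rows = impute_mean(parse_records(SAMPLE_XML))
imputed_pm25_b = rows[1]["pm25"]  # value filled in for station B
rows = min_max_normalise(rows)
csv_text = to_csv(rows)
print(csv_text)
```

In practice the scaling parameters would be computed on the training split only, so that no information from the test data leaks into the model.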

Output Data Flowchart:
After all the data is processed, it is fed to the Support Vector Machine and Random Forest Regression models. The predicted results are compared using error metrics, and finally the most accurate model is visualised.
• Keras is a high-level neural networks API for building network models.
• In this paper, Keras is the front-end interface used for running the model.
• The code is entered in Visual Studio Code and saved as aqi.py.
• It is compiled and run through the options available in the VS Code app.
• On successful completion, the code opens a pop-up graph of the models.
• TensorFlow is an open-source software library for machine learning applications such as neural networks.
• Here, TensorFlow acts as the backend.
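The comparison step of this workflow can be sketched as below. This is a minimal illustration assuming scikit-learn for both models; the synthetic data, the train/test split, and all hyperparameter values are hypothetical and do not reproduce the results reported in this paper.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# Synthetic stand-in for the preprocessed dataset (hypothetical).
rng = np.random.default_rng(42)
X = rng.uniform(0, 300, size=(300, 2))  # assumed columns: PM2.5, PM10
y = np.maximum(X[:, 0], X[:, 1]) + rng.normal(0, 5, size=300)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)

models = {
    # SVR is sensitive to feature scale, so it is wrapped with a scaler.
    "SVR": make_pipeline(StandardScaler(), SVR(kernel="rbf", C=100.0, epsilon=1.0)),
    "RandomForest": RandomForestRegressor(n_estimators=100, random_state=42),
}

scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    pred = model.predict(X_test)
    scores[name] = {
        "MAE": float(mean_absolute_error(y_test, pred)),
        "RMSE": float(np.sqrt(mean_squared_error(y_test, pred))),
        "R2": float(r2_score(y_test, pred)),
    }

# Select the model with the lowest test RMSE.
best = min(scores, key=lambda name: scores[name]["RMSE"])
for name, s in scores.items():
    print(name, {k: round(v, 3) for k, v in s.items()})
print("Lowest RMSE:", best)
```

Which model scores better on this toy data says nothing about the real dataset; the sketch only shows how the two sets of predictions are placed on the same error metrics.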

Results
The output of this model identifies the more accurate model by comparing the outputs of SVM and Random Forest using error metrics.
• It helps people assess the purity of the air, which affects their health, and reduces the chance of health issues by allowing a moderate ambience to be maintained as required.
• It is implemented entirely in software to predict the Air Quality Index; no hardware is required, which reduces the cost.
• It helps people monitor the presence of pollutants, resulting in better environmental conditions for humans to live in.
• It is used to determine the quality of air.

Disadvantages:
• The model may not give accurate predictions if the dataset is too complex.
• It doesn't support sensor-based prediction.

Conclusion
In this project, we created a comparative framework in which the models are compared using various error metrics to identify the more accurate one. Based on the error metrics and the graphs, we conclude that the Random Forest model performs better than the Support Vector model.
It is essential to know about AQI because unless and until people understand the worst impacts and risks of air pollution, they will not become sufficiently aware of it or attempt to reduce it. According to this review, most researchers have worked on AQI and pollutant concentration level forecasting in order to provide a realistic picture of AQI. Our method started out from data

Fig 1: Architecture of the proposed model

Fig 8: Error metrics of the model