Roman Sulymka, Lviv Polytechnic National University, Lviv, Ukraine
Dmytro Fedasyuk, Lviv Polytechnic National University, Lviv, Ukraine
Abstract. The paper presents the results of pollution forecasting with two models: ARIMA, which is based on time series analysis, and LSTM, which is based on neural networks. The features of applying the ARIMA and LSTM models to air pollution forecasting are described. The LSTM model was found to give better pollution predictions on data sets with a larger number of records. The outcome of the research is a comparative analysis of the ARIMA and LSTM forecasting methods and software developed on the basis of time series and neural networks.
Keywords: pollution forecasting, time series, neural networks, ARIMA, LSTM.
I. Introduction
Pollution is one of the most widespread environmental problems. Many factories emit harmful substances that settle in the soil and pollute the environment, yet there are still no software tools for predicting such pollution [1].
However, there are many algorithms and models that can be used to construct a pollution forecast.
Accordingly, this study examines pollution forecasting models based on time series and neural networks, namely the ARIMA and LSTM methods.
II. The relevance of environmental pollution forecasting
Currently, many enterprises in Ukraine emit large amounts of harmful substances into the air and wastewater as a result of their activities. The main sources of chemical soil pollution include chemicals used in agriculture, atmospheric precipitation in the vicinity of industrial enterprises (especially chemical and metallurgical ones), mining, mineral fertilizers, etc. [2]. A significant share of the sources of soil pollution has a local effect, but some of them act on a regional and even global scale, especially when pollution is caused by precipitation or by the use of fertilizers on large areas of land [2].
A detailed analysis of existing systems found many software tools for forecasting air and water pollution, but no applications for analyzing and forecasting soil pollution.
The main tasks of this research are the following:
- conduct a review of existing information tools and literature on pollution forecasting;
- conduct an analysis of existing methods that are popular in pollution forecasting;
- develop software for predicting soil contamination using time series and neural networks;
- provide the neural network with real data;
- train a neural network and determine the best model for pollution prediction.
III. Analysis of the literature
The ARIMA method is based on time series analysis, which comprises methods for analyzing data in order to extract meaningful statistics and other characteristics of the data [2]. Time series forecasting is the use of a model to predict future values based on previously observed values [3].
The ARIMA method is an autoregressive integrated moving average model, a form of regression analysis that measures the strength of one dependent variable relative to other variables. Its purpose is to predict future data by examining the differences between the values in the series instead of the actual values [2].
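For reference, an ARIMA(p, d, q) model can be written in its standard textbook form. The notation below (lag operator L, coefficients \phi_i and \theta_j, constant c, noise term \varepsilon_t) is introduced here for illustration and is not taken from the cited sources:

```latex
\left(1 - \sum_{i=1}^{p} \phi_i L^i\right) (1 - L)^d \, y_t
  = c + \left(1 + \sum_{j=1}^{q} \theta_j L^j\right) \varepsilon_t,
\qquad L^k y_t = y_{t-k}
```

Here p is the autoregressive order, d is the order of differencing, and q is the moving average order. For d = 1 the factor (1 - L) y_t equals y_t - y_{t-1}, which is the "difference between the values in the series" mentioned above.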
ARIMA algorithm
- Load input data.
- Check whether the time series is stationary. If it is stationary, set d = 0; if it is not, take its difference of order d, choosing d so that the differenced series is stationary [3].
- Plot the autocorrelation function and the partial autocorrelation function of the stationary series.
- Using the plots of the autocorrelation and partial autocorrelation functions, determine the values of p and q for the ARIMA model.
- Fit the ARIMA model with the chosen values of p, d and q.
- Forecast the values of the test time series.
- Calculate the root mean square error (RMSE) to compare the predictions with the actual values.
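The steps above can be illustrated with a minimal, standard-library-only sketch. It is not the software developed in this work: the orders are fixed by assumption to ARIMA(1, 1, 0) instead of being read off the correlation plots, the helper names are hypothetical, and the input is just the ten pollution values from Table 1. In practice a statistical library would normally be used for the fitting step.

```python
import math

def difference(series, d=1):
    """Apply first-order differencing d times."""
    for _ in range(d):
        series = [b - a for a, b in zip(series, series[1:])]
    return series

def autocorrelation(series, lag):
    """Sample autocorrelation of the series at the given lag."""
    n = len(series)
    mean = sum(series) / n
    var = sum((x - mean) ** 2 for x in series)
    cov = sum((series[t] - mean) * (series[t - lag] - mean) for t in range(lag, n))
    return cov / var

def fit_ar1(series):
    """Least-squares estimate of phi in x_t = phi * x_(t-1) + e_t (no intercept)."""
    num = sum(x1 * x0 for x0, x1 in zip(series, series[1:]))
    den = sum(x0 * x0 for x0 in series[:-1])
    return num / den

def rmse(actual, predicted):
    """Root mean square error between two equally long sequences."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))

def forecast_arima_110(train, steps):
    """Forecast `steps` values with a hand-rolled ARIMA(1, 1, 0) (assumed orders)."""
    diffs = difference(train, d=1)        # d = 1: work on first differences
    phi = fit_ar1(diffs)                  # p = 1: one autoregressive coefficient
    last_value, last_diff = train[-1], diffs[-1]
    forecast = []
    for _ in range(steps):
        last_diff = phi * last_diff       # predict the next difference
        last_value = last_value + last_diff  # integrate back to the original scale
        forecast.append(last_value)
    return forecast

# Pollution column of Table 1; the 8/2 train/test split is an arbitrary choice.
pollution = [129, 148, 159, 181, 138, 109, 105, 124, 120, 132]
train, test = pollution[:8], pollution[8:]
predicted = forecast_arima_110(train, steps=len(test))
print("lag-1 autocorrelation of differences:", autocorrelation(difference(train), 1))
print("forecast:", predicted)
print("RMSE:", rmse(test, predicted))
```

With only ten points the numbers carry no statistical weight; the sketch only shows how differencing, fitting, forecasting and the RMSE comparison fit together.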
The long short-term memory (LSTM) neural network model is one of the most popular neural network architectures for time series forecasting; it is designed to address the problem of long-term dependencies.
The structure of a long short-term memory model resembles a chain: its repeating module contains four neural network layers that interact with each other in a special way [4]. A distinctive feature of the LSTM model is that it stores information over long periods of time. Its advantages are that it imposes fewer restrictions and assumptions, can handle complex nonlinear dependencies in a time series, and offers high forecast accuracy and the possibility of automation [5]. Its disadvantages are low interpretability and the large amount of data required for an accurate prediction.
Algorithm of the LSTM model for constructing a pollution forecast
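The four interacting layers are commonly described as the forget gate, the input gate, the candidate layer and the output gate. The following sketch shows one time step of a single-unit LSTM cell in plain Python under that standard formulation; the weights are hypothetical placeholders, not trained values, and the function name `lstm_step` is introduced here only for illustration.

```python
import math

def sigmoid(z):
    """Logistic function used by the three gates."""
    return 1.0 / (1.0 + math.exp(-z))

def lstm_step(x, h_prev, c_prev, params):
    """One time step of a single-unit LSTM cell.

    params maps a layer name to (input weight, recurrent weight, bias).
    Returns the new hidden state h and the new cell state c.
    """
    def layer(name, activation):
        w, u, b = params[name]
        return activation(w * x + u * h_prev + b)

    f = layer("forget", sigmoid)        # how much of the old cell state to keep
    i = layer("input", sigmoid)         # how much new information to admit
    g = layer("candidate", math.tanh)   # the new candidate information
    o = layer("output", sigmoid)        # how much of the cell state to expose
    c = f * c_prev + i * g              # long-term memory update
    h = o * math.tanh(c)                # short-term (hidden) state
    return h, c

# Hypothetical weights, identical for every layer, just to make the sketch run.
params = {name: (0.5, 0.1, 0.0) for name in ("forget", "input", "candidate", "output")}
h, c = 0.0, 0.0
for x in [0.129, 0.148, 0.159]:  # first Table 1 pollution values, scaled by 1/1000
    h, c = lstm_step(x, h, c, params)
print(h, c)
```

The cell state `c` is what lets the model carry information across many time steps, since it is only rescaled by the forget gate and incremented by the gated candidate.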
- Define Network: We will construct an LSTM neural network with 1 input time step and 1 input feature in the visible layer, 10 memory units in the LSTM hidden layer, and 1 neuron in the fully connected output layer with a linear (default) activation function.
- Compile Network: We will use the efficient ADAM optimization algorithm with the default configuration and the mean squared error loss function, because this is a regression problem.
- Fit Network: We will fit the network for 1,000 epochs and use a batch size equal to the number of patterns in the training set. We will also turn off all verbose output.
- Evaluate Network: We will evaluate the network on the training dataset. Typically, the model would be evaluated on a test or validation set.
- Make Predictions: We will make predictions for the training input data. Again, predictions would typically be made on data for which the right answer is not known.
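The five steps above can be sketched with the Keras API as follows. This is an illustrative sketch, not the software developed in this work: it assumes that TensorFlow and NumPy are installed (they are imported lazily inside the function), and it assumes that the series is framed so that each value is used to predict the next one, which the text does not specify. The function names are hypothetical.

```python
def make_supervised(series):
    """Frame a series as (previous value -> next value) pairs.

    X has the layout (samples, 1 time step, 1 feature) expected by an LSTM layer.
    """
    X = [[[prev]] for prev in series[:-1]]
    y = list(series[1:])
    return X, y

def build_and_fit(series, epochs=1000):
    """Define, compile, fit, evaluate and predict, following the five steps."""
    import numpy as np
    from tensorflow import keras

    X, y = make_supervised(series)
    X = np.array(X, dtype="float32")
    y = np.array(y, dtype="float32")

    # 1. Define: 1 time step, 1 feature, 10 LSTM units, 1 linear output neuron.
    model = keras.Sequential([
        keras.Input(shape=(1, 1)),
        keras.layers.LSTM(10),
        keras.layers.Dense(1),
    ])
    # 2. Compile: Adam optimizer with defaults, mean squared error loss.
    model.compile(optimizer="adam", loss="mean_squared_error")
    # 3. Fit: batch size equal to the number of training patterns, no output.
    model.fit(X, y, epochs=epochs, batch_size=len(X), verbose=0)
    # 4. Evaluate on the training data (a held-out set would normally be used).
    loss = model.evaluate(X, y, verbose=0)
    # 5. Predict for the training inputs.
    predictions = model.predict(X, verbose=0)
    return model, loss, predictions

X, y = make_supervised([129, 148, 159, 181])
print(X, y)
```

Calling `build_and_fit(pollution_values)` on a list of pollution readings would run all five steps. Input values are usually rescaled (for example to the range 0 to 1) before training, a step omitted here for brevity.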
IV. Implementation of pollution forecasting using ARIMA and LSTM
The dataset contains 1000 records of atmospheric air pollution from 2010 to 2014, with data collected hourly (Table 1) [6].
Table 1.
Air pollution input data [4].
Date | Pollution | Dew | Temperature | Atmospheric pressure | Wind speed |
02/01/2010 00:00 | 129 | -16 | -4 | 1020 | 1.79 |
02/01/2010 01:00 | 148 | -15 | -4 | 1020 | 2.68 |
02/01/2010 02:00 | 159 | -11 | -5 | 1021 | 3.57 |
02/01/2010 03:00 | 181 | -7 | -5 | 1022 | 5.36 |
02/01/2010 04:00 | 138 | -7 | -5 | 1022 | 6.25 |
02/01/2010 05:00 | 109 | -7 | -6 | 1022 | 7.14 |
02/01/2010 06:00 | 105 | -7 | -6 | 1023 | 8.93 |
02/01/2010 07:00 | 124 | -7 | -5 | 1024 | 10.72 |
02/01/2010 08:00 | 120 | -8 | -6 | 1024 | 12.51 |
02/01/2010 09:00 | 132 | -7 | -5 | 1025 | 14.3 |
The forecast constructed for 1000 days using the ARIMA model is shown in Fig. 1.
The result of forecasting pollution using the LSTM model is shown in Fig. 2.
An excerpt of the forecast results is shown in Table 2.
Table 2.
The results of the air pollution forecast using ARIMA and LSTM models.
Date | Real values | ARIMA forecast | LSTM forecast | ARIMA error, % | LSTM error, % |
02/01/2010 00:00 | 138 | 190.463533 | 180.791779 | 38% | 31% |
02/01/2010 01:00 | 109 | 142.876833 | 134.663025 | 31% | 24% |
02/01/2010 02:00 | 105 | 113.50899 | 112.399124 | 8% | 7% |
02/01/2010 03:00 | 124 | 107.3276 | 107.894363 | 13% | 13% |
02/01/2010 04:00 | 120 | 126.869533 | 122.34301 | 6% | 2% |
02/01/2010 05:00 | 132 | 123.029785 | 118.099701 | 7% | 11% |
02/01/2010 06:00 | 140 | 135.957285 | 128.783752 | 3% | 8% |
02/01/2010 07:00 | 152 | 144.881011 | 135.095505 | 5% | 11% |
02/01/2010 08:00 | 148 | 158.088652 | 145.482468 | 7% | 2% |
02/01/2010 09:00 | 164 | 153.853202 | 141.957169 | 6% | 13% |
02/01/2010 10:00 | 158 | 171.703913 | 158.602753 | 9% | 0% |
… | … | … | … | … | … |
The mean error | | | | 26% | 20% |
As can be seen in Table 2, on a dataset of 1000 records the LSTM model performed better, with a mean error of 20%, while the ARIMA model had a mean error of 26%. A likely reason is that neural networks make better use of large data sets.
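The formula behind the percentage columns of Table 2 is not stated. The sketch below assumes that they are absolute percentage errors, |forecast - real| / real, which reproduces the first rows of the table. The function names are hypothetical.

```python
def percentage_error(real, forecast):
    """Absolute percentage error of one forecast (assumed formula)."""
    return abs(forecast - real) / real * 100.0

def mean_percentage_error(reals, forecasts):
    """Mean of the absolute percentage errors over all rows."""
    errors = [percentage_error(r, f) for r, f in zip(reals, forecasts)]
    return sum(errors) / len(errors)

# First three rows of Table 2.
real = [138, 109, 105]
arima = [190.463533, 142.876833, 113.50899]
lstm = [180.791779, 134.663025, 112.399124]

for r, a, l in zip(real, arima, lstm):
    print(round(percentage_error(r, a)), round(percentage_error(r, l)))
# prints:
# 38 31
# 31 24
# 8 7
```

Applying `mean_percentage_error` to the excerpt alone would not be expected to return the 26% and 20% reported in the last row, since those means are taken over the full forecast and not only over the rows shown.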
Based on the obtained forecast results, it can be concluded that the ARIMA model is better suited for short-term forecasts, or forecasts with a small amount of input data. The LSTM model should be used in cases where the input data set consists of a large number of records and when the prediction is long-term because neural network-based models require a large amount of input data for training.
V. Conclusion
On the air pollution data set, the LSTM model gave the more accurate forecast: its mean error was 20%, while the mean error of the ARIMA model was 26%.
The performed forecasting analysis showed that for data with a large sample, the LSTM model based on neural networks shows better forecasting results than the ARIMA model.
References
- Omran E.-S. E. Environmental modelling of heavy metals using pollution indices and multivariate techniques in the soils of Bahr El Baqar, Egypt. Modeling earth systems and environment. 2016. Vol. 2, no. 3. URL: https://doi.org/10.1007/s40808-016-0178-7 (date of access: 28.11.2022).
- Statistical approaches for forecasting primary air pollutants: a review / K. Liao et al. Atmosphere. 2021. Vol. 12, no. 6. P. 686. URL: https://doi.org/10.3390/atmos12060686 (date of access: 28.11.2022).
- Ye Z. Air pollutants prediction in Shenzhen based on ARIMA and Prophet method. E3S web of conferences. 2019. Vol. 136. P. 05001. URL: https://doi.org/10.1051/e3sconf/201913605001 (date of access: 28.11.2022).
- Spatiotemporal prediction of air quality based on LSTM neural network / D. Seng et al. Alexandria engineering journal. 2021. Vol. 60, no. 2. P. 2021–2032. URL: https://doi.org/10.1016/j.aej.2020.12.009 (date of access: 02.11.2022).
- Statistical approaches for forecasting primary air pollutants: a review / K. Liao et al. Atmosphere. 2021. Vol. 12, no. 6. P. 686. URL: https://doi.org/10.3390/atmos12060686 (date of access: 28.11.2022).
- Toxicity criteria database – catalog. Dataset – Catalog. URL: https://catalog.data.gov/dataset/toxicity-criteria-database-22828 (date of access: 28.11.2022).