Application of Data Mining Methods to Assess the Impact of Meteorological Conditions on Episodes of High Concentrations of PM10 along the Polish-Czech Border

The paper presents an attempt to use selected data mining methods to determine the influence of a complex of meteorological conditions on the concentrations of PM10 (PM2.5), using the regions of Silesia and Northern Moravia as an example. The collection of standard meteorological data has been supplemented by increments and derivatives of measurable weather elements, such as the vertical pseudo-gradient of air temperature. The main objective was to develop a universal methodology for the assessment of these impacts, i.e. one that would be independent of the analysed pollutant. The probability of occurrence (at a given location) of a concentration level exceeding the value of a specified quantile of the distribution was adopted as the discriminant of an episode. As a result of the analyses, episodes of elevated concentrations of particulate matter PM10 have been identified, and the types of weather responsible for such situations have been determined.


Introduction and Aim of the Study
In spite of continuous measures undertaken by the European Union (EU) to improve air quality (overarching strategies: "EU Thematic Strategy on Air Pollution", "EU Clean Air Policy Package") and despite the implementation of many legal instruments (such as the "Air Quality Directive" or the "National Emission Ceilings Directive"), air pollution remains one of the most pressing problems in Central Europe [1]. Scientific research provides increasingly abundant evidence of the negative impact of air pollution on the health of the population, both through direct exposure and indirectly through exposure to pollutants accumulating in plants, soil or water [2]. Air quality with respect to pollutants such as lead (Pb), sulphur dioxide (SO2) or benzene (C6H6) has been systematically improving. On the other hand, a significant percentage of the population, especially in highly urbanised areas, is still affected by harmful levels of pollutants such as suspended particulate matter (PM), ozone in the planetary boundary layer (O3), nitrogen dioxide (NO2) or polycyclic aromatic hydrocarbons such as benzo(a)pyrene (BaP). Poor air quality that adversely affects health is caused by the emission of substances from transport, industry, agriculture or households. Reducing air pollution with substances such as particulate matter requires actions to reduce natural or anthropogenic emissions from different types of sources (point, area, linear).
It is generally known that meteorological conditions determine the spreading of pollutants from their sources and their dispersion in the atmosphere. The most important and decisive meteorological factors here are the direction and speed of the wind, as well as air temperature and precipitation. The wind affects the horizontal and vertical spreading of air pollution. The influence of air temperature on the level of pollution is both direct and indirect. Changes in air temperature with altitude, the so-called vertical thermal structure, shape the stability of the atmosphere. The more stable the thermal structure is, i.e. when the temperature falls only slowly, does not change (isothermy) or even increases (inversion) with altitude, the more unfavourable are the conditions for horizontal scattering of pollutants. Additionally, in wintertime the air temperature "controls" the amount of pollutant emissions, thus indirectly influencing the concentration levels. In turn, atmospheric precipitation causes dilution, reducing the level of the substance in the air. Insolation also has a strong influence on photochemical pollution.
The aim of the study is to comprehensively assess the influence of the meteorological factors on the formation of episodes of high levels of particulate matter in areas with a high degree of anthropopressure.
A review of the reference literature identifies many works on the influence of meteorological conditions on the formation of high concentrations of air pollutants; however, most of them either describe individual cases [3] or statistically analyse the influence of particular meteorological elements on air quality [4]. Since episodes of high concentrations of pollutants do not depend linearly on particular meteorological elements but result from many factors, such assessments must be deemed incomplete. Hence, in this paper an attempt has been made to use data mining methods that allow a more comprehensive assessment of the influence of many meteorological elements on air quality.
This article presents the results of quantifying the above-mentioned relations between suspended particulate matter PM10 in the planetary boundary layer and the weather in the cross-border area of Silesia (PL) and Moravia (CZ). A collection of standard meteorological data has been supplemented with increments and derivatives of measurable weather elements, such as the vertical pseudogradient of air temperature.

Data and the Scope of the Study
The analysis of weather conditions responsible for episodes of high concentrations of air pollutants (suspended particulate matter and ground-level ozone) was carried out for the Polish-Czech border area in the Silesian and Moravian regions. On the Czech side, the area included the Moravian-Silesian region comprising the Ostrava-Karviná agglomeration as well as the Bruntál, Frýdek-Místek, Nový Jičín and Opava counties. On the Polish side, it was the part of the Silesian Voivodeship encompassing the Upper Silesian agglomeration, the rural counties of Bielsko, Cieszyn, Racibórz, Rybnik, Wodzisław and Żywiec, and the town counties of Bielsko-Biała, Jastrzębie-Zdrój, Rybnik and Żory. The total study area amounted to 10 370 km², of which 52.5% was Czech territory, inhabited by 38.0% of the total population of the area. In turn, the population density on the Polish side of the border, 648.9 persons/km², was twice that on the Czech side [5,6]. The described parts of the Silesian Voivodeship and the Moravian-Silesian region are among the most urbanised and industrialised in Poland and in the Czech Republic. The area is very diverse in terms of physical, geographical and socio-economic conditions. It is located in four provinces: the Central European Lowlands, the Polish Highlands, the Czech Massif and the Western Carpathians [7,8]. In terms of climate, it lies in the temperate zone with a characteristic annual course of four seasons. As part of the study, a collection of measurement data covering the period from October 2006 to March 2011 was analysed. The meteorological data were obtained from the measurement networks of the national meteorological services, four of which belong to the Czech Hydrometeorological Institute (ČHMÚ): Ostrava-Mošnov, Ostrava-Poruba, Lučina, Červená, Lysá hora, and three to the Polish Institute of Meteorology and Water Management, National Research Institute (IMGW-PIB): Bielsko-Biała, Katowice, Racibórz. Table 1 shows the air quality monitoring stations included in the analyses.

Methodological Principles of Data Analysis
Although meteorological conditions determine the formation of high concentrations of particulate matter, the existing dependencies are not linear. Taking this into consideration, data mining methods were used to evaluate the influence of meteorological factors on air quality [9,10]. The main aim was to develop a universal methodology, i.e. one that would be independent of the analysed pollutant. The probability of occurrence of a concentration level exceeding the value of a particular quantile of the distribution was assumed as the discriminant of an episode.

Identification of Episodes of High Concentrations of PM10
First, the measurement data were unified by replacing the absolute values of PM10 with the probability of occurrence of a given concentration at the station. For sequences $x_1, \ldots, x_n$ representing average daily PM10 concentrations, the probabilities were calculated from the moving averages according to the following formula:
$$p_i = \frac{|\{j : x_j \ge x_i\}|}{n} \qquad (1)$$
where the symbol $|\cdot|$ denotes the cardinality (number of elements) of a set and $i = 1, \ldots, n$. Thus, each concentration was characterised by the frequency of occurrence of values equal to or higher than it. Subsequently, the average probability for each day was calculated over all stations actively measuring on that day, resulting in an averaged (area-average) sanitary air quality, in which a lower probability corresponds to a higher level of the analysed pollution:
$$\bar{p}_i = \frac{1}{L_s} \sum_{s=1}^{L_s} p_{i,s} \qquad (2)$$
where $p_{i,s}$ is the probability (1) on day $i$ at station $s$, and $L_s$ is the number of measuring stations operating on the given day.
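The conversion of concentrations into probabilities can be illustrated with a minimal sketch. It assumes that the probability assigned to a daily value is the fraction of values in the station's series that are equal to or higher than it, which is one reading of the description above; the function names and the sample values are hypothetical and are not taken from the study.

```python
from bisect import bisect_left


def exceedance_probabilities(x):
    """Return, for each x_i, the fraction of values in x that are >= x_i.

    Assumption: this is the probability described in the text
    (frequency of occurrence of a value together with the higher values).
    """
    n = len(x)
    ordered = sorted(x)
    # bisect_left gives the count of values strictly below x_i,
    # so n minus that count is the number of values >= x_i.
    return [(n - bisect_left(ordered, xi)) / n for xi in x]


def area_average(station_probs):
    """Average the probabilities of the stations reporting on one day.

    None marks a station that did not measure on that day.
    """
    active = [p for p in station_probs if p is not None]
    return sum(active) / len(active) if active else None


# Hypothetical daily mean PM10 concentrations at one station (ug/m3).
pm10 = [20.0, 35.0, 50.0, 120.0]
probs = exceedance_probabilities(pm10)

# Hypothetical probabilities for one day at three stations, one inactive.
day_average = area_average([0.25, None, 0.75])
```

In this sketch the highest concentration receives the lowest probability, in line with the statement that a lower probability corresponds to a higher pollution level.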

Identification of Meteorological Situations Associated with High Concentrations of PM10
The characteristics of the daily meteorological situation were obtained by averaging a numerical vector consisting of 59 hourly elements, together with the daily precipitation sum. The weather for each day was represented by the vector p, as per the following formula:
$$p = [p_1, \ldots, p_{59}] \qquad (3)$$
Thus, for the period from October 2006 to March 2011 (N), a collection of 2008 cases was obtained for PM. In order to standardise the data, a transformation was performed for each coordinate of the meteorological vectors. In this way, the data were prepared in the form of vectors $p = [p_1, \ldots, p_{59}]$ with coordinates in the [0,1] interval, which were used for further analyses.
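The exact transformation is not reproduced here. As an illustration only, the sketch below uses min-max scaling, a common way of mapping every coordinate onto the [0,1] interval; whether the study used this particular transformation is an assumption. The function name and the three-element sample vectors (instead of 59 elements) are hypothetical.

```python
def min_max_scale(vectors):
    """Rescale each coordinate of the daily vectors to the [0, 1] interval.

    Assumption: min-max scaling stands in for the transformation
    used in the study, which is not specified in the text.
    """
    columns = list(zip(*vectors))
    lows = [min(col) for col in columns]
    highs = [max(col) for col in columns]
    scaled = []
    for vector in vectors:
        row = []
        for value, lo, hi in zip(vector, lows, highs):
            # A constant coordinate has no spread; map it to 0.0.
            row.append((value - lo) / (hi - lo) if hi > lo else 0.0)
        scaled.append(row)
    return scaled


# Hypothetical daily vectors: temperature, wind speed, precipitation sum.
days = [[-12.0, 1.0, 0.0], [-4.0, 3.0, 0.0], [0.0, 2.0, 0.0]]
scaled = min_max_scale(days)
```

Scaling each coordinate separately keeps elements measured in different units (for example °C and m/s) comparable when distances between weather vectors are computed later.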

Determining the Weather Patterns Responsible for the Occurrence of Smog Episodes
The sequence $P := \{p^{(1)}, \ldots, p^{(N)}\}$ of daily probabilities of occurrence of a given pollution level was sorted in ascending order and assigned the sequence of weather vectors corresponding to the given sanitary air quality. In the next stage, the elements of the set were grouped. In order to allow clustering of the most similar meteorological situations, a limit distance r > 0 was determined. The value of the parameter r was chosen on the basis of a clustering index known as the Davies-Bouldin index. To begin with, the weather vector for which the worst sanitary air quality occurred was taken as the pattern, and this weather was the first to be included in the k-th group. Examining the sequence P from the second position onwards, the k-th group was then enlarged by those $p^{(l)}$ whose Manhattan distance from $p^{(1)}$ was lower than the set value of r; each such $p^{(l)}$ was afterwards removed from the sequence P. In this way, groups ranging in size from several to dozens of vectors were designated. However, there were cases, especially for smog situations with the highest concentrations of pollutants (unusual weather), in which the resultant group had only one element. A certain imperfection of the Davies-Bouldin index is that it attains its minimum of zero for one-element groups. Therefore, the modification of this index proposed for the purposes of these analyses entailed inserting a component dependent on $S_i$, the number of elements of the i-th group. In the modified index, $\sigma_x$ is the average distance of the elements of the x-th cluster from its centroid, x = 1, …, n (n being the number of all clusters), and $d(c_i, c_j)$ is the distance between the centroids. Grouping was performed many times for different values of r, and the value $r_{min}$ for which $DB_{mod}(r_{min})$ reached its lowest value was taken as appropriate. Due to the large variability of meteorological conditions, which resulted in small groups containing the meteorological situations responsible for the highest concentrations of pollutants, it was decided to cluster similar groups into larger aggregates.
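The grouping step described above can be sketched as follows. The sketch assumes that the weather vectors are already ordered from the worst to the best air quality, and that once a group is closed the same procedure is repeated on the vectors that remain, with the first remaining vector as the new pattern; this repetition is an interpretation of the description, and the function names and sample data are hypothetical.

```python
def manhattan(a, b):
    """Manhattan (city-block) distance between two vectors."""
    return sum(abs(u - v) for u, v in zip(a, b))


def group_by_threshold(vectors, r):
    """Group vectors around successive patterns using a limit distance r.

    vectors: weather vectors sorted from worst to best air quality.
    Returns a list of groups, each a list of indices into vectors.
    """
    remaining = list(range(len(vectors)))
    groups = []
    while remaining:
        pattern = remaining[0]          # worst remaining day is the pattern
        group = [pattern]
        kept = []
        for idx in remaining[1:]:
            if manhattan(vectors[idx], vectors[pattern]) < r:
                group.append(idx)       # close enough: joins the group
            else:
                kept.append(idx)        # stays in the sequence
        groups.append(group)
        remaining = kept
    return groups


# Hypothetical scaled weather vectors with two coordinates instead of 59.
weather = [[0.0, 0.0], [0.1, 0.1], [0.9, 0.9], [1.0, 0.8], [0.5, 0.0]]
groups = group_by_threshold(weather, r=0.3)
```

With these sample data the last vector ends up in a one-element group, which is the situation the text describes for unusual weather.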
For this purpose, a centroid c was calculated for each group, i.e. a vector of 59 coordinates formed by averaging the corresponding coordinates of the weather vectors belonging to the given group. Then, proceeding as described above, the resulting clusters were combined into groups, which were referred to as weather patterns. As a result, for each pollutant, three weather patterns (types) were assigned for each PM level.
The research resulted in a grouping of the features (meteorological elements) shared by situations with high area-mean concentrations of PM, followed by the determination of the types of weather responsible for such situations.
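The centroid defined above is also the quantity on which the Davies-Bouldin index is built. The sketch below computes group centroids and the standard, unmodified Davies-Bouldin index, i.e. the mean over clusters of the largest ratio $(\sigma_i + \sigma_j) / d(c_i, c_j)$; the modification used in the study is not reproduced. Using the Manhattan distance inside the index is an assumption, and all names and sample data are hypothetical.

```python
def manhattan(a, b):
    """Manhattan (city-block) distance between two vectors."""
    return sum(abs(u - v) for u, v in zip(a, b))


def centroid(group):
    """Average the corresponding coordinates of the vectors in a group."""
    n = len(group)
    return [sum(col) / n for col in zip(*group)]


def davies_bouldin(groups):
    """Standard Davies-Bouldin index for at least two groups of vectors.

    Assumes distinct centroids; Manhattan distance is an assumption here.
    """
    cents = [centroid(g) for g in groups]
    # Scatter: mean distance of a group's members from its centroid.
    scatter = [
        sum(manhattan(v, c) for v in g) / len(g)
        for g, c in zip(groups, cents)
    ]
    k = len(groups)
    total = 0.0
    for i in range(k):
        total += max(
            (scatter[i] + scatter[j]) / manhattan(cents[i], cents[j])
            for j in range(k)
            if j != i
        )
    return total / k


# Two hypothetical groups of two vectors each.
pairs = [[[0.0, 0.0], [0.0, 2.0]], [[4.0, 0.0], [4.0, 2.0]]]
db_pairs = davies_bouldin(pairs)

# One-element groups have zero scatter, so the index drops to zero.
singles = [[[0.0, 0.0]], [[4.0, 0.0]]]
db_singles = davies_bouldin(singles)
```

The second call shows why the unmodified index favours one-element groups: with zero scatter in every group the index reaches its minimum regardless of how the groups are arranged.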

Particulate Matter PM
Table 2 shows, for selected weather stations, the characteristics of the basic meteorological elements that distinguish the separate types of weather responsible for particulate matter (PM) episodes. An analysis of the results (Table 2) showed that the highest concentrations of particulate matter PM10 were observed in weather type no. I, which is characterised by the lowest probability of occurrence and occurs at very low mean daily air temperatures (< -10 °C at all stations). Such low temperatures are accompanied by a very low wind speed (except on plateaus) and a long windless period. In weak wind conditions, inflow from the southern sector dominates. Compared with the other weather patterns, this one is characterised by the lowest relative air humidity, the highest insolation, no precipitation or only a small daily precipitation sum, and a 24-hour temperature inversion. Weather type no. II, which accounts for the highest number of dust episodes but has the lowest mean regional concentration values, occurs at slightly negative air temperatures, with wind velocities that are relatively high for dust episode conditions and the lowest number of calm periods among all the differentiated weather patterns. A clear influx of air masses from the north and northeast dominates here, with high humidity and low insolation. A slightly positive vertical pseudogradient of temperature indicates a slight temperature inversion during these episodes. Atmospheric precipitation occurs relatively frequently in episodes of this type. Weather type no. III, which accounts for less than 30% of the dust episodes, is characterised by a mean daily air temperature of about -4.0 °C, the highest wind speeds among all the weather types (excluding the plateaus), an influx of air masses from the east, and relatively high total radiation with no precipitation. Its characteristic feature is isothermy. As can be seen from the results shown in Table 1, with weather type no. I the probability of occurrence of an episode of at least regional scale is almost total. In type no. II, the probability of occurrence of an episode amounts to about 60%. In turn, in type no. III, the probability is estimated at 75%.

Summary
Areas that have been strongly transformed by human activity, and are thus dominated by industrial and urban infrastructure, are particularly vulnerable to the occurrence of high concentrations of particulate matter. The identification and subsequent study of the development phases of such episodes have hitherto been performed using classical synoptic analysis methods or descriptive statistics. In contrast, this paper proposes a comprehensive approach to the subject: the use of data mining methods, which, according to the authors, allows for a more complete indication of the group of meteorological factors that influence high concentrations of particulate matter. The applied methodology helped delineate several groups of meteorological situations responsible for the formation of episodes of high concentrations of PM10. It seems that this methodology, despite the fact that it does not refer directly to the origin of such situations, does allow for an objective identification (i.e. one that is independent of the researcher's intuition) of a set of meteorological factors that are conducive to their formation. This method may therefore constitute an important auxiliary tool in short-term air quality forecasting and, what is perhaps even more important, may become an element of an air quality management system in areas of particular vulnerability.
The study was carried out within the framework of the project entitled "Air Quality Information System in the Polish-Czech Border Area in the Region in Silesia and Moravia", which was funded by the Operational Program for Cross-Border Cooperation between the Czech Republic and the Republic of Poland (POWT RCz-RP 2007-2013) (registration number CZ.3.22/1.2.00/09.01610) and co-financed by the ERDF (European Regional Development Fund).

Table 1. PM10 air quality at the monitoring stations used in the analysis.

Table 2. Meteorological characteristics of the types of weather responsible for situations of high PM10 concentrations.