Analysis of environmental problems based on social media data (on the example of atmospheric air quality)

The article discusses the state and prospects of two new approaches to studying environmental issues: Internet ecology (iEcology) and conservation culturomics. The two approaches are very similar: both rely on the analysis of big data that were not originally intended for studying or solving environmental issues (publications in social networks, Internet search queries, photos and videos posted on Internet platforms, etc.). The authors propose a methodology for studying environmental issues (exemplified by atmospheric air quality) based on data from the VK social network and machine learning algorithms. For the content analysis we used the PolyAnalyst software. We present the results of an analysis of publications on atmospheric air quality in the city of Magnitogorsk for 2020-2022. We identified 433 messages characterizing the air condition in Magnitogorsk. Our research demonstrates that the methods of conservation culturomics can contribute to the analysis of the environmental situation. Our results suggest that atmospheric air quality is a very important issue for the residents of Magnitogorsk. Social network data can be used as an additional source of information for the subjective assessment of atmospheric air quality.


Introduction
Digitalization has penetrated industry, business, state administration and everyday life; inevitably, it affects scientific cognition, too. Changes in society make us change the way we perceive society. The new epistemology is supported by computer technologies, which are used to analyze large volumes of data. The use of big data and intelligent data analysis in various areas of expertise has been widely discussed for about fifteen years. In particular, big data have good prospects in ecology [1]. Eco-informatics, which applies computer technologies to environmental data, has been developing for several decades [2]. Apart from environmental data proper (the data generated and collected by professional environmental experts to solve specific environmental tasks), other open data sources can be used. For example, data retrieved from social media, generated by users for completely different purposes, can be used to study biodiversity, to monitor the environment, to analyze environmental practices, and to solve many other tasks.
In this article we examine two interrelated approaches in ecology based on online data: Internet ecology and conservation culturomics. We also present the results of our own empirical research based on data from the social network VKontakte (VK). We took the city of Magnitogorsk as the object of investigation. It belongs to the category of monocities and is the location of the largest metallurgic factory, Magnitogorsk Metallurgic Factory. Magnitogorsk is a big city in terms of population (410 thousand people as of 01.01.2021). Thus, we suppose that the volume of data generated in VKontakte by the users living in this city is large enough for automatic data analysis. As of the beginning of 2021, there were 185 thousand registered VK users from Magnitogorsk, so the level of VK penetration in the city is about 45% (185 thousand out of 410 thousand).
Magnitogorsk is frequently mentioned in publications as a city with a difficult environmental situation, especially in terms of air pollution [3][4]. The city is included in the list of participants of the federal project 'Clean Air', the goal of which is to improve the environmental situation and to reduce air contamination. Thus, we assumed that the environmental agenda can be important for the population of this city and, accordingly, that there might be many publications on it in the social network. In this article we focus on the search for and analysis of messages that touch upon the important issue of atmospheric air quality.
It must be mentioned here that air quality is not just a significant environmental issue but an important social issue as well, as it directly influences people's overall quality of life. Studies show that air pollution is one of the factors that impair the health of the local population and the environmental component of their quality of life. Apart from the direct negative consequences for health, air pollution can cause discomfort and a deterioration of the subjective quality of life because of the unpleasant smell. This issue has never been properly investigated; however, in certain cases it is very urgent, too.

Literature review
The analysis of digital data related to the conservation of the environment, performed by digital methods, became known as 'conservation culturomics'. The term 'culturomics' came into common use after the article by Michel et al. was published in Science in 2011, where the authors studied cultural phenomena by means of the quantitative analysis of digitized texts [5]. As a field of research, culturomics arose as a combination of the possibilities offered by information technologies (especially in terms of collection and analysis of digital data) and studies of human culture. Thus, culturomics has a lot in common with the ideas of the fourth paradigm of science.
Within the last decade the principles of culturomics went beyond the analysis of purely cultural phenomena. In 2016 Ladle et al. published an article in which conservation culturomics was defined as the field that studies the interaction between people and nature. The article proposed several directions for the future development of this new research field [6]:
- identification of the population groups that tend to protect and conserve nature;
- study of the public interest in nature;
- identification of environment protection symbols;
- application of new metrics and tools to monitor the environment online and to support solutions for environment protection;
- assessment of the way environment protection activities impact society;
- description of environmental problems and helping the public to understand these problems.
In terms of the applied methods and the data sources, conservation culturomics is very similar to Internet ecology as a scientific project. The difference lies in their declared goals. According to [7], Internet ecology covers a wider scope of phenomena and processes than culturomics, which is focused on the interaction between humans and nature. For example, Internet ecology includes studying biological processes that take place in the wild without human participation (such as the invasion of non-indigenous species into an ecosystem). In this regard conservation culturomics can be interpreted as an element of Internet ecology.
The approach based on digital environmental data is especially interesting for urban ecology, because cities generate large volumes of data that can be used for the analysis of the environment. Data from wireless sensor networks, remote sensing, social media and the city administration can provide information on the condition of urban ecosystems with high spatial and temporal resolution [8]. Social media data (SMD) are of special interest to us. These data offer new ways to indirectly measure the risks of the environmental impact on society and to make decisions on improving the environmental situation in the city [9][10][11].
According to [12], all issues related to the use of digital data sources in ecology are technical and not conceptual. The authors of [13] performed a systematic review of 169 publications that analyze interactions between humans and nature using social media data. They concluded that, even though social media data offer unprecedented opportunities in terms of data volume, scale of analysis and online monitoring, scientists have only started dealing with the issues of data heterogeneity and noise level, possible biases, and the ethics of data collection and utilization. Besides, it is unclear how available all those data will remain in the future.
In this article we present the results of an environmental analysis of the Russian monocity of Magnitogorsk performed according to the principles of conservation culturomics. We analyzed the texts of messages posted in 'urban communities', i.e. communities in the social network VKontakte where people discuss various aspects of life in a particular city [14]. This approach is new for ecology; it has not been widely recognized yet, but it is considered promising [15]. Online data, and those generated by social media users in particular, can be utilized as an additional source of information on the condition of the environment [16].

Data collection methodology
At the first stage we ran a full-text search for the name of the selected city, in all possible word forms, in the titles and descriptions of VKontakte (VK) communities. We chose this social network because it is the largest one in the Russian Federation and it has an open API. We built a database of 490 thousand communities. Then we filtered the communities to exclude the irrelevant ones. First, we filtered them by the number of subscribers and excluded the communities with fewer than 500 subscribers. Second, we filtered out the communities that had posted nothing within a month of the date on which we downloaded their content. Next, we excluded the communities that have 'stop-words' in their titles or descriptions (we made a list of stop-words based on the analysis of the most frequent words in the so-called 'garbage' communities: advertising, entertainment, creative, humor, commercial, etc.). In the end, we had 147 communities posting various information related to Magnitogorsk.
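The three filters described above can be sketched in Python as follows. This is a minimal illustration and not the authors' code: the `Community` record, its field names and the stop-word list are hypothetical, and the one-month activity window is assumed to mean the 30 days preceding the download date.

```python
import re
from dataclasses import dataclass
from datetime import date, timedelta
from typing import List, Optional


@dataclass
class Community:
    # Hypothetical record; field names are illustrative, not the authors' schema.
    title: str
    description: str
    subscribers: int
    last_post: Optional[date]  # date of the most recent post, None if there are no posts


MIN_SUBSCRIBERS = 500                  # threshold stated in the text
ACTIVITY_WINDOW = timedelta(days=30)   # "within a month", approximated as 30 days (assumption)
STOP_WORDS = {"sale", "discount", "memes"}  # placeholder list; the real list is not published


def is_relevant(c: Community, download_date: date) -> bool:
    """Apply the subscriber, activity and stop-word filters in turn."""
    if c.subscribers < MIN_SUBSCRIBERS:
        return False
    if c.last_post is None or download_date - c.last_post > ACTIVITY_WINDOW:
        return False
    words = set(re.findall(r"\w+", f"{c.title} {c.description}".lower()))
    return not (words & STOP_WORDS)


def filter_communities(items: List[Community], download_date: date) -> List[Community]:
    """Keep only the communities that pass all three filters."""
    return [c for c in items if is_relevant(c, download_date)]
```

Applying the cheap numeric filters before the text-based one keeps the stop-word check off communities that would be discarded anyway.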
At the next stage we downloaded the content of the selected communities for the period between 01.01.2020 and 31.10.2022 (192 992 messages from the communities in Magnitogorsk) and automatically classified the content into 8 themes (education, healthcare, ecology, availability of goods and services, housing and utilities sector and infrastructure, safety and security, politics, social guarantees). For this purpose we used an automatic classifier that we had developed prior to this study [14]. As a result, we got 2621 messages in 54 communities. These messages formed the basis for the further analysis.
Each object in the VK database has a numerical identifier, which is used through the API (application programming interface) to retrieve information about the object and the objects related to it. Thus, the identifiers of the communities make it possible to download publications, the comments on them (with the ID of the author of each comment) and the lists of users who liked the publications. The data collection software is implemented in Python 3. It consists of modules, including those for interaction with the VK API, for writing the results to the storage, and for parallel downloading. To store the downloaded data we use PostgreSQL.
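The paged, parallel download described above might look roughly like the following sketch. It is not the authors' software: the function names are hypothetical, the VK API method `wall.get` and version string `5.131` are assumptions about which endpoint is used, and the PostgreSQL storage step is omitted. The page-fetching function is passed in as a parameter so that the paging logic can be tried without network access or an access token.

```python
import json
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List
from urllib.parse import urlencode
from urllib.request import urlopen

PAGE_SIZE = 100  # assumed page size; wall.get returns at most 100 posts per call


def vk_wall_page(owner_id: int, offset: int, token: str) -> List[dict]:
    """Fetch one page of posts with wall.get (assumed endpoint; needs a valid token).

    Community walls are addressed with a negative owner_id.
    An API error response has no "response" key and raises KeyError here.
    """
    params = urlencode({"owner_id": owner_id, "offset": offset, "count": PAGE_SIZE,
                        "access_token": token, "v": "5.131"})
    with urlopen(f"https://api.vk.com/method/wall.get?{params}") as resp:
        return json.load(resp)["response"]["items"]


def download_wall(owner_id: int,
                  fetch_page: Callable[[int, int], List[dict]]) -> List[dict]:
    """Page through one community wall until an empty page is returned."""
    posts: List[dict] = []
    offset = 0
    while True:
        page = fetch_page(owner_id, offset)
        if not page:
            return posts
        posts.extend(page)
        offset += len(page)


def download_all(owner_ids: List[int],
                 fetch_page: Callable[[int, int], List[dict]],
                 workers: int = 4) -> Dict[int, List[dict]]:
    """Download several walls in parallel, one thread per community at a time."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        walls = pool.map(lambda oid: download_wall(oid, fetch_page), owner_ids)
        return dict(zip(owner_ids, walls))
```

With a real token, `fetch_page` could be `lambda oid, off: vk_wall_page(oid, off, token)`; a production version would also need rate limiting and retry handling.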

Data analysis methodology
The data analysis procedure had two stages. At the first stage we automatically clustered the texts by means of topic modeling. We performed several experiments, in which we applied different algorithms and libraries (Top2Vec, BigARTM, Gensim, etc.) and various hyperparameters. Next, we validated the clustering models. To do so, we merged some topical clusters (topics) and removed the irrelevant topics.
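The validation step (merging some topics and removing irrelevant ones) can be illustrated with a short Python sketch. It assumes that a topic model has already assigned one topic id to each message; the topic ids, the merge mapping and the set of irrelevant topics are hypothetical and only show the mechanics.

```python
from typing import Dict, List, Optional, Set


def validate_topics(doc_topics: List[int],
                    merge_into: Dict[int, int],
                    irrelevant: Set[int]) -> List[Optional[int]]:
    """Relabel merged topics and mark messages from irrelevant topics as None."""
    result: List[Optional[int]] = []
    for topic in doc_topics:
        topic = merge_into.get(topic, topic)   # fold the topic into its merge target, if any
        result.append(None if topic in irrelevant else topic)
    return result


# Hypothetical example: topics 3 and 5 are merged into topic 1, topic 4 is irrelevant.
labels = validate_topics([1, 3, 4, 5, 2], merge_into={3: 1, 5: 1}, irrelevant={4})
print(labels)  # [1, 1, None, 1, 2]
```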
At the next stage of the data analysis we applied the method of content analysis. We created a manual for the assessors on the basis of the topics identified by the topic modeling. The assessors were asked to identify the messages containing information on the 'dark sky mode' in the city; on recorded atmospheric discharge; on smog caused by unfavorable meteorological conditions; on the reduction or increase of the discharge; on the monitoring processes, successful or not, carried out by different enterprises; on possible methods to reduce the discharge; on the smoke caused by forest fires; on the federal project 'Clean Air'; on the development of dust-suppression systems; on the transition of public transport to gas as fuel; on the opening of carbon farms; and on the development of a system of eco-vehicles and the use of eco-fuel in cities. In the end, the assessors identified 433 messages characterizing the air condition in Magnitogorsk.
For the further content analysis we used the PolyAnalyst software. We preprocessed the data as follows:
- we removed hashtags, links, references and emails;
- we removed stop-words;
- we filtered all texts by length (we kept only those with more than 50 characters);
- we checked and corrected spelling.
From the preprocessed texts we retrieved the key word combinations and phrases typical for the collection of texts using the 'Retrieval of key words' functional node of PolyAnalyst.
The key phrases help to understand the content and the topics of the collection of texts. We also retrieved the named entities with the 'Entities retrieval' node. In particular, we retrieved the entities that are objects of the real world, such as companies and institutions.
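The preprocessing steps listed above were performed in PolyAnalyst; the following Python sketch only illustrates comparable operations and is not the PolyAnalyst implementation. The regular expressions, the VK-style reference markup (`[id123|Name]`) and the stop-word list are assumptions, and spelling correction is left out.

```python
import re
from typing import Iterable, List

URL_RE = re.compile(r"https?://\S+|www\.\S+")
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")
HASHTAG_RE = re.compile(r"#\w+")
# Assumed form of user/community references: [id123|Name], [club45|Name] or @name.
REFERENCE_RE = re.compile(r"\[(?:id|club)\d+\|[^\]]*\]|@\w+")

STOP_WORDS = {"the", "a", "in", "is", "and", "of"}  # placeholder list
MIN_LENGTH = 50  # texts must be longer than 50 characters, as stated in the text


def clean(text: str) -> str:
    """Remove links, emails, hashtags and references, then drop stop-words."""
    # Emails are removed before references so that '@domain' is not left behind.
    for pattern in (URL_RE, EMAIL_RE, HASHTAG_RE, REFERENCE_RE):
        text = pattern.sub(" ", text)
    words = [w for w in text.split()
             if w.lower().strip(".,!?;:") not in STOP_WORDS]
    return " ".join(words)


def preprocess(texts: Iterable[str]) -> List[str]:
    """Clean every text and keep only those longer than MIN_LENGTH characters."""
    cleaned = (clean(t) for t in texts)
    return [t for t in cleaned if len(t) > MIN_LENGTH]
```

Here the length filter is applied after cleaning, so a message that consists mostly of hashtags and links is discarded; applying it before cleaning would be an equally plausible reading of the procedure.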

Results and discussions
Figure 1 presents the dynamics of the digital footprints left by the users from Magnitogorsk in response to the messages on atmospheric air quality. In this case the digital footprints are the reactions of the users: comments, reposts and likes. For Magnitogorsk the most resonant message was the following:
1. Well, I believe that the morning fog was caused by the temperature difference. But I will never believe that the fog at 15.30 in the apartments is natural! It seems they contaminate the air on purpose! (published on 31.01.2021)
This message had 189 comments, 635 likes, 208 reposts and over 40000 views.
Several other messages were popular, too. These examples help us understand the general idea of the messages. The final results of the key words and entities retrieval are presented in an interactive dashboard built with PolyAnalyst. Below we analyze several results that we believe are significant.
In the collection of texts for Magnitogorsk, 168 key word combinations were identified, which were mentioned 675 times in total (Table 1). As the list shows, the most frequent key word combinations are related to air contamination, atmospheric discharge, bad meteorological conditions, etc.
With the 'Entities retrieval' node, 192 mentions of various companies were identified in the dataset for Magnitogorsk. Magnitogorsk Metallurgic Factory was mentioned in 135 cases, while the other companies were mentioned once or twice.
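The dynamics of digital footprints discussed in this section can be computed by summing the reactions per period. The sketch below assumes monthly aggregation and a hypothetical post structure with `date`, `comments`, `likes` and `reposts` fields; the first record uses the figures reported for the most resonant message, while the second record is invented purely for illustration.

```python
from collections import defaultdict
from datetime import date
from typing import Dict, Iterable


def monthly_footprints(posts: Iterable[dict]) -> Dict[str, int]:
    """Sum comments, likes and reposts per calendar month (keys look like '2021-01')."""
    totals: Dict[str, int] = defaultdict(int)
    for post in posts:
        month = post["date"].strftime("%Y-%m")
        totals[month] += post["comments"] + post["likes"] + post["reposts"]
    return dict(sorted(totals.items()))


posts = [
    # Figures reported in the text for the message published on 31.01.2021.
    {"date": date(2021, 1, 31), "comments": 189, "likes": 635, "reposts": 208},
    # Invented record, for illustration only.
    {"date": date(2021, 2, 3), "comments": 1, "likes": 2, "reposts": 0},
]
print(monthly_footprints(posts))  # {'2021-01': 1032, '2021-02': 3}
```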

Conclusion
Our research demonstrates that the methods of conservation culturomics can contribute to the analysis of the environmental situation. Our results suggest that atmospheric air quality is a very important issue for the residents of Magnitogorsk. Social network data can be used as an additional source of information for the subjective assessment of atmospheric air quality. Social networks can be a good tool for monitoring the interest in environmental issues, as they play the leading role in shaping the environmental agenda, ahead of television and the other sources of information. Generally, in Russia air contamination is one of the most urgent issues, together with water pollution. This source of information obviously has its advantages and some limitations as well. The tools of conservation culturomics can be used to analyze many important environmental issues (for example, contamination of water bodies, garbage and unauthorized waste deposits, etc.), as well as to study a broader range of issues related to ecology (for example, the efficiency assessment of environmental campaigns and activities, studies of the 'ecological' way of life, etc.).

2. Chelyabinsk became the dirtiest city in Russia. Magnitogorsk was also included in the rating of the cities with the dirtiest air. As the press service of the vice prime minister reports, in 2020 the dirtiest cities were Chelyabinsk, Nizhni Tagil and Magnitogorsk. (203 likes)
3. ✅ The aspiration assembly of the oxygen converter plant was launched. The plant was formally launched and the symbolic 'start' button was pushed by the Chairman of the Federation Council Valentina Matvienko, the Governor of the Chelyabinsk Region Alexey Teksler, the Plenipotentiary Representative of the President in the Urals Federal District Vladimir Yakushev and the chairman of the board of directors of MMK PJSC Viktor Rashnikov. The renovation result will be the more efficient purification of dust and gases, as well as the complete exclusion of the unauthorized discharge from all converters in all modes of operation. The gross discharge of dust will be reduced by at least 500 tons per year. (Here we cite only a part of the long message.) (164 likes)
4. Anonymous. Very dangerous air in Magnitogorsk! I appeal to the city administration! Do you follow the situation with the air contamination at all? What have you done within the last two days? How should we breathe?! (162 likes)

Table 1. List of Top-25 key words.