Direct geocoding of street intersections in text message analysis tasks

Discussing various kinds of events on social networks is an integral part of everyday life. Users often publish and discuss information about traffic accidents, emergencies, utility accidents, and situations that do not require emergency intervention. Such information can be used to identify situations of a certain type, for example, to compile statistics or to involve the relevant services in a timely manner to resolve emergency situations. The paper proposes a way to solve the problem of geocoding intersections when processing and analyzing text messages.


Introduction
The most common geocoding task is to convert one or more full addresses, which usually contain the name of a city and sometimes an administrative unit or country, into geographic coordinates. However, many geocoding systems are capable of handling other types of location information. City names, postal codes, and the names of regions or countries can also be entered into the system. Apart from entering coordinates directly, the most accurate result is achieved with a full address. However, users of social networks rarely give full addresses in a form convenient for geocoding: they write house numbers in free form and mention landmarks along with street names. For example, "a movie theater on Bolshoy Prospekt" may be understandable to other users living in the same city or area, but may not be suitable for the geocoding procedure.
In some situations, a street intersection may be indicated without any mention of the nearest houses. Direct geocoding is the operation of finding coordinates for an exact address, so it is not possible when only two streets are specified. For intersections, reverse geocoding is most often performed, where the user places a pointer on the map to obtain the coordinates.
This paper presents an approach to automated geocoding of intersections to indicate the occurrence of emergency situations.

Related works
One of the main problems with direct geocoding is that the accuracy of the result depends directly on the data entered into the system. Paper [1] explores various methods for correcting entered addresses in order to increase the percentage of correctly identified addresses, showing that 90% accuracy can be achieved using string similarity methods or machine learning methods. Paper [2] describes a method for retrieving and storing data on intersections, which simplifies the search for road intersections in most geocoding services when the user interacts with the service directly and manually enters street names and, if possible, addresses. However, when analyzing large volumes of text, addresses and parts of addresses must be extracted automatically to determine a possible location. In [3], a system of syntactic rules was proposed for identifying the parts of social network posts that describe accidents, but this does not solve the problem of determining the exact location and geocoding it. In [4], a large-scale system for geocoding social media posts was proposed, consisting of two layers of geoparsers followed by several layers of geocoders, which allows further processing of posts that did not yield results in previous steps. The proposed solution is designed to search for information on the social network Twitter, and the authors note that the system can detect mentions of traffic accidents on major highways.
The review showed that the problem of detecting traffic accidents in an array of text messages [5] published on social networks and automatically displaying them on a map is not fully solved in any of the developments considered.

Proposed approach
The tools we previously proposed for automatically collecting and processing information from social networks involve processing text messages and isolating named entities for subsequent geocoding. When only a street name is specified, the entire street is highlighted, and if two streets are indicated in one message, both streets are highlighted in their entirety (Figure 1). However, when it comes to intersections, it is necessary to highlight only the intersection point, and not the streets along their entire length. To solve this problem, the following solution is proposed.
When mentioning events that occurred at intersections, words such as "crossroads," "intersection," "corner," and "circle" (for roundabouts) are usually used in Russian oral and written speech. If a message contains any of these words and two named entities representing street names, it is proposed to use the following approach.
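The trigger condition described above can be illustrated with a minimal sketch. It assumes that street names have already been extracted by an earlier named-entity step and are passed in as a list; the function name `is_intersection_mention`, the constant `INTERSECTION_KEYWORDS`, and the use of English glosses instead of Russian word forms are all illustrative assumptions, not part of the described system.

```python
import re

# Keywords taken from the text as English glosses. A real system would list
# the corresponding Russian word forms; this tuple is an illustrative assumption.
INTERSECTION_KEYWORDS = ("crossroads", "intersection", "corner", "circle")


def is_intersection_mention(message, street_entities):
    """Return True if the message contains an intersection keyword
    and exactly two distinct street names were extracted from it.

    `street_entities` is assumed to come from a separate NER step.
    """
    # Tokenize into lowercase words so that keywords match whole words only.
    words = set(re.findall(r"\w+", message.lower()))
    has_keyword = any(keyword in words for keyword in INTERSECTION_KEYWORDS)
    # Normalize street names so that repeated mentions count once.
    distinct_streets = {name.strip().lower() for name in street_entities}
    return has_keyword and len(distinct_streets) == 2


msg = "Accident at the corner of Nevsky Prospekt and Sadovaya Street"
print(is_intersection_mention(msg, ["Nevsky Prospekt", "Sadovaya Street"]))
```

Matching on whole words keeps the check cheap, but it does not handle inflected forms; for Russian text, lemmatization before matching would likely be needed.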
Any street can be represented as an ordered set of buildings located on it:

$$STR_i = (B_{i1}, B_{i2}, \dots, B_{in}),$$

where $STR_i$ is the street and $B_{ij}$ is the $j$-th building located on this street. Here $B_{i1}$ denotes building 1 on this street, and $B_{in}$ corresponds to the maximum possible address on this street.
Each building can be described by its coordinates $NE$. Let $NE(B_{ij})$ be the coordinates of building $B_{ij}$.
Therefore, any street can be described using an ordered set of coordinates:

$$STR_i = (NE(B_{i1}), NE(B_{i2}), \dots, NE(B_{in})).$$

When two streets are mentioned in the text, we get two ordered sets of coordinates. Let us denote the second street as:

$$STR_k = (NE(B_{k1}), NE(B_{k2}), \dots, NE(B_{km})).$$

For the intersection of these two streets, it is necessary to find coordinates that coincide in both sets or have values as close as possible (it is proposed to use an acceptable discrepancy threshold $\delta$).
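One possible way to write this matching condition is given below. It assumes a distance function $d$ between two coordinate pairs (the text does not fix a particular one) and, as an assumed notation, writes the buildings of the second street as $B_{kl}$, $l = 1, \dots, m$, with $\delta$ the acceptable discrepancy threshold.

```latex
(j^{*}, l^{*}) = \arg\min_{\substack{1 \le j \le n \\ 1 \le l \le m}}
    d\bigl(NE(B_{ij}),\, NE(B_{kl})\bigr),
\qquad
\text{accepted if } d\bigl(NE(B_{ij^{*}}),\, NE(B_{kl^{*}})\bigr) \le \delta .
```

If the minimum distance exceeds $\delta$, no intersection is reported for the pair of streets under this formulation.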
The algorithm for searching for matching coordinates is proposed as follows (Figure 2). An acceptable discrepancy threshold is necessary in most cases, since only in a small number of cases does a building have a dual address on both streets, which would give exactly coinciding coordinates. Otherwise, the comparison is made between the closest addresses.
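The pairwise search can be sketched as follows. This is a minimal illustration under stated assumptions rather than the algorithm of Figure 2 itself: coordinates are assumed to be (latitude, longitude) pairs in degrees, distance is assumed to be the haversine great-circle distance, and the names `haversine_m`, `closest_pair`, and the default threshold `delta_m=100.0` metres are hypothetical.

```python
import math

EARTH_RADIUS_M = 6371000.0  # mean Earth radius, metres


def haversine_m(p, q):
    """Great-circle distance in metres between two (lat, lon) points in degrees."""
    lat1, lon1 = map(math.radians, p)
    lat2, lon2 = map(math.radians, q)
    a = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


def closest_pair(street_a, street_b, delta_m=100.0):
    """Compare every building of street_a with every building of street_b.

    Returns (point_a, point_b, distance_m) for the closest pair whose
    distance does not exceed delta_m, or None if no pair is close enough.
    delta_m is an assumed threshold value, not one taken from the paper.
    """
    best = None
    for p in street_a:
        for q in street_b:
            d = haversine_m(p, q)
            if d <= delta_m and (best is None or d < best[2]):
                best = (p, q, d)
    return best


# Made-up coordinates for two streets, used only to exercise the function.
street_a = [(59.9340, 30.3300), (59.9345, 30.3350), (59.9350, 30.3400)]
street_b = [(59.9300, 30.3352), (59.9344, 30.3351), (59.9400, 30.3350)]
print(closest_pair(street_a, street_b))
```

The nested loop performs one comparison per pair of buildings, so the cost grows with the product of the two street lengths.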
Marking an intersection can be done in various ways: using the coordinates of the house on the first or the second street, or calculating the arithmetic mean of the most similar coordinates.
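The three marking options can be illustrated with a short sketch. The function name `intersection_marker` and the `strategy` parameter are hypothetical, and the inputs are assumed to be the two closest coordinate pairs already found for the two streets.

```python
def intersection_marker(point_a, point_b, strategy="mean"):
    """Choose the map marker for an intersection from two matched points.

    point_a and point_b are (lat, lon) pairs for the closest buildings on
    the first and second street. The strategy names are assumptions made
    for this illustration.
    """
    if strategy == "first":
        return point_a
    if strategy == "second":
        return point_b
    if strategy == "mean":
        # Component-wise arithmetic mean of the two coordinate pairs.
        return ((point_a[0] + point_b[0]) / 2, (point_a[1] + point_b[1]) / 2)
    raise ValueError(f"unknown strategy: {strategy}")


print(intersection_marker((59.0, 30.0), (60.0, 31.0)))
```

Averaging latitude and longitude component-wise is a reasonable approximation only because the two points are close together, as they are for buildings near the same intersection.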

Results
As a result of the experiments, a set of posts mentioning road accidents in St. Petersburg was obtained from the social network VK. To mark intersections, the arithmetic mean of the coordinates was calculated.
Figures 3 and 4 show examples of displaying events on the map. It is noteworthy that Figure 3 illustrates an event that occurred on an embankment, with the embankment located on both sides of the river. The location of the incident is the point for which a coincidence of coordinates was detected, but the event could have occurred on the opposite bank. In this case, the point is located exactly at the intersection. Figure 4 shows that in some cases the incident location may not be displayed correctly. In this example, the street consists of two roadways separated by a small park, and the traffic accident is marked in the green zone, which is hardly possible. However, the intersection is recognized correctly, and this information may be sufficient to inform special services or other road users. It should be noted that the task of marking intersections is relevant not only for road accidents but also for mentions of other events. Therefore, it is necessary to take into account the context of the event, since it may not always occur directly on the roadway.

Conclusion
The experiments showed that the proposed approach is workable and makes it possible to mark intersections of two streets on the map when processing and analyzing text messages. Further research may address increasing the accuracy of determining the coordinates of the intersection and analyzing the nature of the situation that occurred. The performance of the algorithm on a large amount of data should also be evaluated, and possible optimizations carried out, in order to reduce the number of iterations in the pairwise comparison of coordinates and increase the response speed.

Fig. 1. Designation on the street map if the text mentions an intersection.

Fig. 2. Algorithm for finding the coordinates of the intersection of two streets.

Fig. 4. Geocoding of an event on a large street.