Models of the APIEd system under development to collect data on different levels of educational institutions

The article discusses an approach to developing a web resource that aggregates information about educational organizations of various levels, from kindergartens to universities. The technology stack is described and the development stages of the main system components are demonstrated. The article presents the system architecture and describes how individual modules work, including the modules that determine the geolocation of an educational institution. It shows how information about more than 150,000 educational institutions in Russia is collected from open data.


Introduction
Choosing an educational organization can be a difficult task. A lot of information can be found on the Internet or in special reference books, but this information is often scattered. Individual tax codes and other legal information may be stored in one database, the manager's name and email in a second, and the rating of the educational institution in a third. This arrangement may seem logical, since a specific user often needs only one kind of information. But in today's digital world, data is fundamental: it must be possible to collect, store and process data and to provide it to a user or a third-party business system [1][2][3]. For example, an applicant choosing a university may want to check the education license, view the social networks of the future alma mater, or check the number of university tenders. Today this requires visiting three different resources: the register of licenses of educational institutions, the website of the university and the website of the federal tax service. Similarly, parents selecting a kindergarten may be interested in the same information as the applicant, but they will also want to see the available institutions near their home on a map, which requires coordinate data [4].
Thus, collecting data on educational institutions of different levels (kindergartens, schools, technical schools, universities) in one resource is an urgent task. The rest of the article considers the main stages of development of APIEd, a system for collecting and providing various data about educational organizations of different levels.

Data collection
Let us look at the main data sources. There are several ways to collect data. One of the most popular methods is parsing the pages of Internet resources. Parsing is the process of collecting data with subsequent processing and analysis: information posted on certain sites is collected and organized by special programs that automate the process [5]. This method is used when the amount of information to be processed is too large to handle manually. It has several disadvantages. First, in some countries web scraping is essentially an illegal way to obtain data. Second, although collection is automated, any change to the page can make certain fields inaccessible until the parser code is changed. This work relies on open data resources, and none of the data obtained relates to trade secrets or personal data [6].
The next way is to get data from open sources. Many resources in Russia publish so-called "open data", which is now available on the websites of practically all ministries, including the Ministry of Science and Education of the Russian Federation. Such files are usually provided in .csv, .xml or other formats. The main difficulty is that not all data files come with a description of their structure or markup. In that case, working with the data is complicated by the need to mark it up and to search for the useful lines [8].
A more correct approach is to use an API. An API (application programming interface) is a description of the ways in which one computer program can interact with another [9]. With this method, the developer of a program, including a resource with data, provides a toolkit for obtaining data through special calls to the target resource. The disadvantages of this approach are that the API may not always provide up-to-date information and that developers have to keep track of API updates. In addition, an API is often a paid solution with a limited number of requests and a limit on the amount of data that can be received [10].
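As a minimal sketch of working with such files, the example below reads a delimited export both when it carries a header row and when it does not. The sample rows, the semicolon delimiter and the column names `tax_code` and `name` are assumptions made for illustration and are not taken from any real dataset.

```python
import csv
import io

# Made-up sample rows; real exports differ in columns, delimiter and encoding.
WITH_HEADER = "tax_code;name\n5500000001;School No. 1\n"
WITHOUT_HEADER = "5500000002;Kindergarten No. 7\n"


def read_rows(text, fieldnames=None, delimiter=";"):
    """Parse delimited text into a list of dicts.

    If fieldnames is None, the first row is used as the header;
    otherwise every row is treated as data and labelled with fieldnames.
    """
    reader = csv.DictReader(
        io.StringIO(text), fieldnames=fieldnames, delimiter=delimiter
    )
    return [dict(row) for row in reader]


described = read_rows(WITH_HEADER)
# For a file without a description, the column labels must be supplied by hand.
undescribed = read_rows(WITHOUT_HEADER, fieldnames=["tax_code", "name"])
```

When a file has no description, the manual step is deciding what each column means; the code only applies the labels once they are known.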
Another way is to buy a ready-made database or several such databases with subsequent connection to your own system [11].
This work analyzed methods of collecting data from different sources. The main sources considered and the possible types of data collected are presented in Table 1. After analyzing the possible sources of information, it was decided to use as much open data as possible, obtained through APIs and data files, for example CSV. At the same time, an integrated approach involving various sources was applied; the main ones are shown in Figure 1. In the test version, parsing is used to a limited extent and only where no other way to obtain the data was found, for example on the websites of university and school ratings [12,13]. RAEX lists the top 100 universities and top 100 schools, so such data can even be transferred manually [14]. Basic data is obtained from open sources.

Applied technology stack and system architecture development
The development of the APIEd prototype was divided into the visual part (the frontend) and the server part (the backend).
The server side is based on the Spring Boot framework, the OSS (Open Source Software) stack and the Java 11 programming language in the Liberica distribution. Spring Boot makes it possible to build an application from loosely coupled components. The Spring framework is also aimed at enterprise-level applications and makes it easy to manage an application on the Spring stack in a cloud infrastructure [15].
The OSS stack is a set of frameworks and libraries from MovieCompany that addresses the challenges of distributed, scalable applications. The OSS technology stack is the de facto standard for microservice architecture [16]. OSS is gradually being included in the Spring Cloud distribution and makes it possible to implement patterns such as service discovery, load balancing, fault tolerance, circuit breaker and routing.
The choice of the OSS and Spring Cloud stack is driven by the need to interact with the target system in a microservice environment. The service acts as a data source for the target application for educational programs, and a significant load on the service is expected [17].
PostgreSQL was chosen as the storage, but for the MVP implementation, which stores semi-structured data, it seems more logical to use MongoDB [18]. This reduces the cost of processing the data and preparing it for storage in relational form, and significantly reduces the latency of the system. MongoDB is a distributed, high-performance NoSQL document database. Redis, a NoSQL key-value store, is used as the cache service.
To obtain the coordinates of organizations, we use our own geoserver, Nominatim, which is built on OpenStreetMap. It is an open source solution. Running our own geoserver significantly reduces costs and removes the dependence on external geodata providers.
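The role of the cache in front of the document store can be illustrated with a cache-aside lookup. This is only a sketch under stated assumptions: plain dictionaries stand in for the document database and for the key-value cache, and the class name `CachedStore` and the sample record are hypothetical. The article does not describe how the cache is actually wired into the backend.

```python
class CachedStore:
    """Cache-aside lookup: try the cache first, fall back to the primary store."""

    def __init__(self, primary):
        self.primary = primary    # stands in for the document database
        self.cache = {}           # stands in for the key-value cache
        self.primary_reads = 0    # counts how often the primary store is hit

    def get(self, key):
        if key in self.cache:
            return self.cache[key]
        self.primary_reads += 1
        value = self.primary.get(key)
        if value is not None:
            # Remember the value so the next lookup skips the primary store.
            self.cache[key] = value
        return value


store = CachedStore({"5500000001": {"name": "School No. 1"}})
first = store.get("5500000001")    # read from the primary store
second = store.get("5500000001")   # served from the cache
```

Under a significant read load, repeated requests for the same institution are answered from the cache, which is the effect the cache service is meant to provide.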
To obtain data, parsing of open resources is used. The parsers are written in Python. The collected data then passes through the following stages:
- Normalization. The data is brought to the format in which it is stored at the application level.
- Aggregation. This layer compares data from different sources and stores it in the application storage.
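The two stages above can be sketched as follows. The field names, the sample records and the rule that later sources overwrite earlier ones are assumptions made for illustration; the article does not specify how conflicts between sources are resolved. All values are assumed to be strings.

```python
def normalize(record):
    """Bring a raw record to a common shape: lower-case keys, trimmed values,
    empty fields dropped."""
    return {
        key.strip().lower(): value.strip()
        for key, value in record.items()
        if value and value.strip()
    }


def aggregate(sources, key="tax_code"):
    """Merge normalized records from several sources by a shared key.

    Fields from later sources overwrite fields from earlier ones.
    Records without the key are skipped.
    """
    merged = {}
    for records in sources:
        for raw in records:
            record = normalize(raw)
            if key not in record:
                continue
            merged.setdefault(record[key], {}).update(record)
    return merged


# Hypothetical records from two sources describing the same institution.
licenses = [{"Tax_Code": "5500000001 ", "License": "valid"}]
contacts = [{"tax_code": "5500000001", "email": "school1@example.org", "License": ""}]
result = aggregate([licenses, contacts])
```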
The client application is built with the Vue.js framework. Vue.js is an open source JavaScript framework for building user interfaces. It integrates easily into projects that use other JavaScript libraries and can serve as a web framework for developing reactive single-page applications.
The main input was group 85 of the All-Russian Classifier of Economic Activities, including all of its subgroups. These codes denote educational activities. They were used to export all institutions involved in education: kindergartens, schools, colleges, technical schools, universities, institutions with additional vocational education programs, etc. The resulting export provides the tax code of each institution, which is unique. This code was then used to search the other open data sources.
An XML file was downloaded from the educational supervision resource, which contained information about the licenses of educational organizations, as well as the composition of educational programs. The file was about 1.5 GB in size and contained about 2 million records, which had to be marked up, deduplicated and filtered down to the organizations of interest.
From the DaData resource, data was collected on the contact information of each institution, its head and its status. Missing address data was also filled in. With the resource limit of 10,000, filling in the data for educational institutions alone required running the system for several days in a row.
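Selecting institutions by classifier group can be sketched as a prefix check on the activity code. The registry rows and the field names `tax_code` and `activity_code` are hypothetical, and dotted codes such as `85.14` are assumed as the input format.

```python
def is_education_code(code):
    """Return True for group 85 itself or any of its subgroups (85.x, 85.xx, ...)."""
    code = code.strip()
    return code == "85" or code.startswith("85.")


# Hypothetical registry rows; only the code prefix matters here.
registry = [
    {"tax_code": "5500000001", "activity_code": "85.14"},
    {"tax_code": "5500000003", "activity_code": "86.10"},
    {"tax_code": "5500000004", "activity_code": "85.22"},
]

# Unique tax codes of education institutions, used as keys for later lookups.
education_tax_codes = sorted(
    {row["tax_code"] for row in registry if is_education_code(row["activity_code"])}
)
```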
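A file of this size is usually processed as a stream rather than loaded into memory at once. The sketch below uses `xml.etree.ElementTree.iterparse` from the Python standard library. The element names `license`, `tax_code` and `number` and the sample document are assumptions, since the real structure of the file is not given in the article.

```python
import io
import xml.etree.ElementTree as ET

# Made-up sample with one duplicate and one organization that is not of interest.
SAMPLE_XML = b"""<licenses>
  <license><tax_code>5500000001</tax_code><number>L-1</number></license>
  <license><tax_code>5500000001</tax_code><number>L-1</number></license>
  <license><tax_code>7700000009</tax_code><number>L-2</number></license>
</licenses>"""


def stream_licenses(source, wanted_tax_codes):
    """Yield unique license records for the wanted organizations only."""
    seen = set()
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag != "license":
            continue
        tax_code = elem.findtext("tax_code")
        number = elem.findtext("number")
        elem.clear()  # drop the record's children and text once they are read
        key = (tax_code, number)
        if tax_code in wanted_tax_codes and key not in seen:
            seen.add(key)
            yield {"tax_code": tax_code, "number": number}


licenses = list(stream_licenses(io.BytesIO(SAMPLE_XML), {"5500000001"}))
```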
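Working within a request limit amounts to splitting the list of institutions into batches that fit the quota. The sketch below assumes that the limit applies per day, which the article implies but does not state explicitly; the function name `plan_batches` and the sample keys are hypothetical.

```python
def plan_batches(keys, daily_limit):
    """Split keys into consecutive batches of at most daily_limit items,
    one batch per day."""
    if daily_limit <= 0:
        raise ValueError("daily_limit must be positive")
    return [keys[i:i + daily_limit] for i in range(0, len(keys), daily_limit)]


# Small numbers for illustration: 25 lookups with a limit of 10 per day.
batches = plan_batches([f"org-{n}" for n in range(25)], daily_limit=10)
days_needed = len(batches)
```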

Determining the coordinates of educational institutions and mapping
The initial screen of APIEd is an interactive map that shows the number of educational institutions by region (the darker the color, the more institutions there are). The user can also zoom in and see institutions of the type of interest near the required location. The Leaflet library is used to display the map. Leaflet is the leading open source JavaScript library for mobile interactive maps. Weighing only about 39 KB, it has all the mapping features that most developers would ever need.
APIEd uses the Nominatim API because it is free and accurate enough. Nominatim uses OpenStreetMap data to find places on Earth by name and address (geocoding). It can also do the reverse, that is, find an address for any place on the planet [19,20].
The data that Nominatim works with is map data for Russia from geofabrik.de, a German mapping data company that works in partnership with OpenStreetMap.
Nominatim imports the data into its own format, builds indexes and optimizes them for search. For a territory as large as Russia, this process takes several hours and requires a lot of computing resources.
A person can easily understand the address of a house, institution or location in almost any form, even if the address is modified. For example, the variants "Omsk region, Cherlaksky district, Elizvatenka village, Beregovaya street, house 44" and "Omsk region, Cherlaksky district, s.Elizavetinka, Beregovaya 44" are both understandable to humans, but the geocoder cannot search for an address by fuzzy matches; that is the task of another service.
As described above, APIEd uses an integrated approach to collect, among other things, the addresses of educational institutions, but these addresses are in human-readable form. The resulting address base had to be brought to a form that the geocoder could understand. It was found experimentally that Nominatim handles addresses best in the form "Omsk Cherlaksky Elizavetinka Beregovaya 44".
All marker words such as "district", "house", "highway" and "street" had to be removed. The addresses were therefore prepared with a regular expression before being passed to the server; otherwise most of them (95%) would not have been recognized.
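The rule "the darker the color, the more institutions" can be sketched as a mapping from per-region counts to a small number of shade levels. The linear scale, the five levels and the sample region names are assumptions; the article does not describe the actual color scale used on the map.

```python
from collections import Counter


def region_shades(regions, levels=5):
    """Map each region to a shade index from 0 to levels - 1.

    A higher index means a darker color; the region with the most
    institutions always receives the darkest shade.
    """
    counts = Counter(regions)
    if not counts:
        return {}
    top = max(counts.values())
    return {region: count * (levels - 1) // top for region, count in counts.items()}


# One entry per institution, labelled with its region (made-up data).
shades = region_shades(["Omsk"] * 4 + ["Tomsk"] * 2 + ["Altai"])
```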
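A geocoding request to a self-hosted instance can be sketched as building a `/search` URL and reading the first result. The base URL `http://localhost:8080` is a hypothetical address of such an instance, and the sample response is made up; `q`, `format` and `limit` are query parameters of the Nominatim search endpoint, which returns `lat` and `lon` as strings in its JSON output. No network call is made here.

```python
from urllib.parse import urlencode

# Hypothetical address of a self-hosted Nominatim instance.
BASE_URL = "http://localhost:8080"


def build_search_url(query, base_url=BASE_URL, limit=1):
    """Build a free-form search URL that asks for JSON output."""
    params = {"q": query, "format": "jsonv2", "limit": limit}
    return f"{base_url}/search?{urlencode(params)}"


def first_coordinates(results):
    """Return (lat, lon) as floats from a decoded search response, or None."""
    if not results:
        return None
    return float(results[0]["lat"]), float(results[0]["lon"])


url = build_search_url("Omsk Cherlaksky Elizavetinka Beregovaya 44")
# Made-up response fragment in the shape of the JSON search output.
coords = first_coordinates([{"lat": "54.1", "lon": "74.8"}])
```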
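The expression used in APIEd is not reproduced here. As a minimal sketch of the idea, the example below strips marker words from the English renderings of the addresses quoted above. The word list is an assumption based on those examples; the production list would contain the corresponding Russian words and abbreviations.

```python
import re

# Illustrative marker words taken from the English examples in the text,
# plus the abbreviation "s." used for "village".
MARKERS = r"\b(?:region|district|village|street|house|highway)\b|\bs\."


def simplify_address(address):
    """Remove marker words and punctuation, leaving space-separated name parts."""
    cleaned = re.sub(MARKERS, " ", address, flags=re.IGNORECASE)
    cleaned = re.sub(r"[,\s]+", " ", cleaned)  # collapse commas and whitespace
    return cleaned.strip()


full = simplify_address(
    "Omsk region, Cherlaksky district, Elizavetinka village, Beregovaya street, house 44"
)
short = simplify_address("Omsk region, Cherlaksky district, s.Elizavetinka, Beregovaya 44")
```

Both variants reduce to the same string, which is the form that the text reports Nominatim handles best. Misspelled names are not corrected by this step, since fuzzy matching is outside the scope of the geocoder.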

Conclusion
The prototype of the APIEd system uses more than 10 heterogeneous open data sources to collect and aggregate information about educational institutions of various levels in Russia. APIEd offers various filters and supports sorting by region, city, type of institution, etc. An interactive map was also developed that shows nearby educational institutions. APIEd is based on a modern technology stack, including OpenAPI, which makes it easy to embed the system into various educational resources.