Urban area analysis using satellite image segmentation and deep neural networks

The goal of our research was to develop convolutional neural network methods for automatically extracting the locations of buildings from high-resolution aerial images. To assess the quality of the developed deep learning algorithms, the Sorensen-Dice coefficient of similarity was used to compare the algorithms' output with ground-truth masks. These masks were generated automatically from JSON files and, together with the corresponding aerial photos, sliced into smaller tiles before training the developed convolutional neural networks. This approach allows us to cope with the problem of segmenting high-resolution satellite images. Overall, we show how deep neural networks implemented and launched on the modern GPUs of the NVIDIA DGX-1 high-performance supercomputer can be used to efficiently learn and detect the objects of interest. Building detection on satellite images can be put into practice for urban planning, construction control of municipal objects, search for the best locations for future outlets, etc.


Introduction
Nowadays, the problem of object detection on satellite images is in the focus of researchers. Automatic image segmentation makes it possible to extract areas of interest such as vehicles or buildings. Most approaches to this problem rely on deep learning algorithms. In machine learning, image segmentation is usually reformulated as the classification of each pixel. The simplest and slowest way of solving the problem is manual segmentation of images by experts; however, this is a time-consuming process that is subject to human error. Automatic image segmentation makes it possible to process images immediately after they are received. Satellite image segmentation finds application in urban planning, construction control, forest management and meteorology.
This article presents the developed convolutional neural networks (CNNs). The main advantage of CNNs is that they can detect and classify objects in real time while being computationally less expensive and superior in performance compared with other machine learning methods. Essentially, the mathematical structure of CNNs is parallel and fits the architecture of graphics processing units (GPUs), which consist of thousands of cores performing many tasks simultaneously [1]. CNNs have become ubiquitous in computer vision since AlexNet [2] won the ImageNet Challenge ILSVRC 2012 [3].
The problem of satellite image segmentation is challenging. Most approaches to this problem employ deep learning algorithms, in which features are formed automatically during training. In recent years, several CNNs aimed at satellite image segmentation have been developed.
One of the most successful algorithms is based on a pyramid architecture of CNNs: Feature Pyramid Networks (FPNs). FPNs showed acceptable results for object detection on satellite images [4]; their usage yields a Jaccard index of approximately 0.49 on satellite images from DeepGlobe [5].
Paper [6] presents the U-Net architecture, a specific type of FPN that has received a lot of interest for the segmentation of biomedical images. Later this model was applied to pixel-wise classification of satellite images [7]. U-Net uses skip connections to combine low-level and higher-level feature maps.
In paper [8] the authors present LinkNet, a special CNN architecture which, like U-Net, consists of an encoder and a decoder. It efficiently shares the information learnt by the encoder with the decoder after each downsampling block. This feature-transfer technique allows high accuracy values for object detection on the CamVid dataset [9].
This article consists of six parts. The first part is devoted to CNNs as a machine learning approach and to the peculiarities of satellite image segmentation; it also contains an overview of papers on object detection in aerial photos. The second part describes the available databases of satellite images. The third section describes the developed CNN architectures for building detection on aerial photos and some peculiarities of model training; it also mentions the Keras framework and the TensorFlow library. The fourth part presents the results of numerical experiments for the developed models. The conclusion summarizes the research, and the last section lists the references.

Databases of satellite images
A standard image database is an important part of training, efficiency evaluation and comparative analysis of different machine learning algorithms. Several datasets of satellite images are currently available. The SuperView-1 database [10] contains high-resolution panchromatic and multichannel color images from the satellite of the same name, launched on January 9, 2018 from the Taiyuan Satellite Launch Center in China. The SuperView-1 satellite offers images with 0.5 m panchromatic and 2 m multispectral resolution. The spacecraft is very agile, providing multiple collection modes including long strip, multiple strip, multiple point target and stereo imaging. Aerial photos from the SuperView-1 satellite cover a total area of 900,000 km². Examples of images from the SuperView-1 dataset are shown in Fig. 1.
The GeoEye-1 database [11] includes aerial panchromatic and four-channel (blue, green, red, near-IR) images. GeoEye-1 was successfully launched on September 6, 2008 from Vandenberg Air Force Base in the USA. GeoEye-1 is capable of acquiring image data at 0.46 m panchromatic and 1.84 m multispectral resolution. High-resolution aerial photos from the GeoEye-1 satellite are used for various mapping applications such as environmental monitoring, mining, engineering, archaeology and agriculture. Examples of images from the GeoEye-1 dataset are shown in Fig. 2.

Deep Learning Algorithm
This article presents the developed CNNs. This type of neural network has a special architecture aimed at quick and high-quality segmentation [12]. In this research a comparative analysis of U-Net [7] and LinkNet [8] is carried out, continuing the research presented in papers [13,14]. All networks were developed using the Keras library with TensorFlow as a backend. Keras is an open-source library written in Python. It is built on top of TensorFlow and contains numerous implementations of commonly used neural network building blocks, as well as ready-made tools for preprocessing image and text data. Keras offers a higher-level, more intuitive set of abstractions for developing deep learning models [15]. Moreover, this library allows networks to be trained on GPUs.
TensorFlow is an open-source software library for high-performance numerical computation. It is a symbolic math library and is also used for machine learning applications such as neural networks. In this field TensorFlow aims at fast detection and classification of images, approaching the quality of human perception [16].

Fig. 4. U-Net architecture
As shown in Fig. 4, the U-Net network consists of two parts: an encoder (left) and a decoder (right). The encoder is a neural network with a typical CNN architecture comprising four blocks. Each block consists of two convolutional layers with a 3 × 3 filter, two ReLU activation functions and a max-pooling operation with a 2 × 2 filter and stride 2. The decoder contains the same number of blocks. Each decoder block consists of an upsampling operation with a 2 × 2 filter merged with the corresponding feature map from the encoder, two convolutional layers with a 3 × 3 filter and one ReLU activation function. The last layer of the network is a convolution with a 1 × 1 filter, which classifies every pixel. As a result, the network has 19 convolutional layers, 18 ReLU activation functions, 4 max-pooling operations, 4 upsampling operations and 4 merge operations.
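The encoder/decoder structure described above can be sketched in tf.keras. This is a minimal illustration, not the authors' exact implementation: the filter counts (64 to 1024) are common U-Net defaults assumed here, not taken from the paper.

```python
# Minimal U-Net sketch in tf.keras. Filter counts are illustrative
# assumptions; the layer counts match the description in the text
# (19 convolutions, 4 poolings, 4 upsamplings, 4 merges).
import tensorflow as tf
from tensorflow.keras import layers, Model

def conv_block(x, filters):
    # two 3x3 convolutions, each followed by ReLU
    x = layers.Conv2D(filters, 3, padding="same", activation="relu")(x)
    x = layers.Conv2D(filters, 3, padding="same", activation="relu")(x)
    return x

def build_unet(input_shape=(512, 512, 3)):
    inputs = layers.Input(input_shape)
    skips, x = [], inputs
    # encoder: four blocks, each followed by 2x2 max pooling with stride 2
    for f in (64, 128, 256, 512):
        x = conv_block(x, f)
        skips.append(x)
        x = layers.MaxPooling2D(2)(x)
    x = conv_block(x, 1024)  # bottleneck
    # decoder: 2x2 upsampling, merge with the matching encoder feature map
    for f, skip in zip((512, 256, 128, 64), reversed(skips)):
        x = layers.UpSampling2D(2)(x)
        x = layers.Concatenate()([x, skip])
        x = conv_block(x, f)
    # final 1x1 convolution classifies every pixel (building / not building)
    outputs = layers.Conv2D(1, 1, activation="sigmoid")(x)
    return Model(inputs, outputs)

model = build_unet()
print(model.output_shape)  # (None, 512, 512, 1)
```

The 512 × 512 input matches the tile size used later in the experiments; any input whose sides are divisible by 16 would work with this sketch.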
Like U-Net, LinkNet has an encoder and a decoder. According to the network architecture shown in Fig. 5, both subnets consist of 4 blocks. Each encoder block contains four convolutional layers, two merge operations and one max-pooling operation with a 2 × 2 filter and stride 2. Each decoder block has a similar architecture, except for the merge operations and the max-pooling operation, which is replaced with an upsampling operation with a 2 × 2 filter. Before the feature map of the input data is sent to the first encoder block, two batch-normalization operations, one ReLU activation function, one convolution with a 2 × 2 filter and one max-pooling operation with a 2 × 2 filter are applied. After the last decoder block, a sequence of one upsampling operation with a 2 × 2 filter, two batch-normalization operations, one ReLU activation function and one convolution with a 2 × 2 filter is performed.
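The defining difference from U-Net is how the encoder shares information with the decoder: LinkNet adds the encoder feature map to the decoder output instead of concatenating it. A minimal sketch of one encoder/decoder block pair, with illustrative (assumed) filter counts:

```python
# LinkNet-style feature transfer sketch in tf.keras: the encoder output
# is *added* to the corresponding decoder output (cf. U-Net's
# concatenation). Filter counts and block internals are illustrative
# assumptions, not the paper's exact configuration.
import tensorflow as tf
from tensorflow.keras import layers

def encoder_block(x, filters):
    # residual-style block: two convolutions merged with a projected
    # shortcut, then 2x2 max pooling with stride 2
    shortcut = layers.Conv2D(filters, 1, padding="same")(x)
    y = layers.Conv2D(filters, 3, padding="same", activation="relu")(x)
    y = layers.Conv2D(filters, 3, padding="same")(y)
    y = layers.Add()([y, shortcut])
    y = layers.Activation("relu")(y)
    return layers.MaxPooling2D(2)(y), y  # pooled output, skip features

def decoder_block(x, skip, filters):
    # 2x2 upsampling, then add the encoder features (the LinkNet link)
    x = layers.UpSampling2D(2)(x)
    x = layers.Conv2D(filters, 3, padding="same", activation="relu")(x)
    return layers.Add()([x, skip])
```

Because the link is an addition rather than a concatenation, the decoder works with fewer channels than U-Net's, which is what makes LinkNet comparatively lightweight.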

Fig. 5. LinkNet architecture
Adam with a learning rate of 1e-3 was chosen as the numerical optimization algorithm. This optimizer combines the best features of gradient descent and momentum methods and shows fast convergence for most machine learning tasks [17]. Binary cross-entropy was chosen as the loss function. The batch size is 18 samples, and training finishes after 96 epochs.
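In Keras, this training configuration reduces to a `compile`/`fit` call. The sketch below uses a one-layer placeholder model (an assumption for brevity) where the paper uses U-Net or LinkNet; the optimizer, loss, batch size and epoch count are those stated above.

```python
# Training configuration sketch with tf.keras: Adam (lr 1e-3),
# binary cross-entropy, batch size 18, 96 epochs.
import tensorflow as tf

# Placeholder model; in the paper this would be U-Net or LinkNet.
model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(512, 512, 3)),
    tf.keras.layers.Conv2D(1, 1, activation="sigmoid"),
])

model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=1e-3),
    loss="binary_crossentropy",
    metrics=["accuracy"],
)

# images: (N, 512, 512, 3) tiles; masks: (N, 512, 512, 1) binary masks
# model.fit(images, masks, batch_size=18, epochs=96)
```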

Numerical Results
Numerical experiments for the developed algorithms were performed on satellite images from a private database. Information about the locations of buildings was extracted from JSON files and rendered as black-and-white masks, in which a pixel is colored white if it belongs to a building. The traditional approach to segmentation uses parts of satellite images, which are fed to the input of a CNN. The developed models require images of 512 × 512 pixels, so before launching the deep learning algorithms each image and mask in the dataset was cut into parts of the appropriate size by data windowing.
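The windowing step can be sketched with NumPy as non-overlapping 512 × 512 crops; the same function is applied to an image and its mask. Discarding partial border tiles is an assumption here, as the paper does not specify how borders were handled.

```python
# Data windowing sketch: slice a large image (or mask) into
# non-overlapping 512x512 tiles. Partial tiles at the right/bottom
# borders are discarded (an assumption).
import numpy as np

def slice_tiles(image, tile=512):
    h, w = image.shape[:2]
    tiles = []
    for y in range(0, h - tile + 1, tile):
        for x in range(0, w - tile + 1, tile):
            tiles.append(image[y:y + tile, x:x + tile])
    return tiles

big = np.zeros((1024, 1536, 3), dtype=np.uint8)
print(len(slice_tiles(big)))  # 2 rows x 3 columns = 6 tiles
```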
The prepared dataset contained 3264 images of size 512 × 512. For the numerical experiments it was shuffled and split into training and test sets in the ratio 80/20. In our research, only 2 classes were taken into account: "buildings" and "not buildings".
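The shuffled 80/20 split can be sketched as follows; images and masks are permuted with the same index order so that pairs stay aligned. The fixed seed is an assumption added for reproducibility.

```python
# Shuffled 80/20 train/test split of paired image/mask arrays (sketch).
import numpy as np

def train_test_split(images, masks, ratio=0.8, seed=0):
    # one permutation keeps each image aligned with its mask
    idx = np.random.default_rng(seed).permutation(len(images))
    cut = int(len(images) * ratio)
    tr, te = idx[:cut], idx[cut:]
    return images[tr], masks[tr], images[te], masks[te]
```

For the 3264-tile dataset this yields 2611 training and 653 test samples.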
The developed CNNs were launched on an NVIDIA DGX-1 supercomputer provided by the Artificial Intelligence Center of P.G. Demidov Yaroslavl State University.
As a rule, the quality of image segmentation algorithms is evaluated by special coefficients that compare the similarity of the predicted and true masks. To evaluate the developed models, the Sorensen-Dice coefficient (DSC) was used. This index is a binary measure of similarity taking values in [0, 1] and can be calculated by the formula DSC = 2|X ∩ Y| / (|X| + |Y|), where |X ∩ Y| is the cardinality of the intersection and |X| + |Y| is the sum of the cardinalities of the real mask X and the prediction Y. In our task the numerator and denominator can be calculated as 2|X ∩ Y| = 2 Σ_i x_i y_i and |X| + |Y| = Σ_i x_i + Σ_i y_i, where x_i, y_i are the pixel values of the expert markup and the predicted mask, respectively. Dependencies of DSC values on the number of epochs (E) for the developed algorithms on the test subset are shown in Fig. 6. According to the results presented in Fig. 6, the worst segmentation results were shown by LinkNet (0.72), while the best results were obtained with U-Net (0.77). Some test results of U-Net are shown in Fig. 7.
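For binary masks the DSC reduces to simple array arithmetic; a minimal NumPy sketch (the small epsilon guarding against empty masks is an assumption added here):

```python
# Sorensen-Dice coefficient between a binary ground-truth mask and a
# binary predicted mask: DSC = 2|X ∩ Y| / (|X| + |Y|).
import numpy as np

def dice(y_true, y_pred, eps=1e-7):
    y_true = y_true.astype(bool)
    y_pred = y_pred.astype(bool)
    inter = np.logical_and(y_true, y_pred).sum()
    # eps avoids division by zero when both masks are empty (assumption)
    return 2.0 * inter / (y_true.sum() + y_pred.sum() + eps)

a = np.array([[1, 1], [0, 0]])  # expert markup
b = np.array([[1, 0], [0, 0]])  # prediction
print(round(dice(a, b), 3))  # 2*1 / (2 + 1) -> 0.667
```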

Conclusion
Numerical experiments comparing the developed algorithms were performed on aerial photos from a private database. For these experiments, smaller images and masks, generated automatically from JSON files, were extracted. Using a special metric of similarity between expert markup and predicted masks, it was shown that U-Net obtained better results than LinkNet: for U-Net the Sorensen-Dice coefficient (DSC) equals 0.77. All created networks are simple to implement.
This article was prepared with the financial support of the Ministry of Science and Education of the Russian Federation under agreement No. 075-15-2019-249 of 04.06.2019 (project identifier RFMEFI57517X0167). The authors are also grateful to the AI Center of P.G. Demidov Yaroslavl State University for providing access to the NVIDIA DGX-1 supercomputer.