CNN Based Monocular Depth Estimation

In several applications, such as scene interpretation and reconstruction, precise depth measurement from images is a significant challenge. Current depth estimation techniques frequently produce blurry, low-resolution estimates. Using transfer learning, this work implements a convolutional neural network that generates a high-resolution depth map from a single RGB image. With a standard encoder-decoder architecture, we initialize the encoder with features extracted from high-performing pre-trained networks, together with augmentation and training procedures that lead to more accurate results. We demonstrate that, even with a very simple decoder, our approach can produce complete high-resolution depth maps. A wide range of deep learning approaches has recently been proposed, and they have shown significant promise for this classically ill-posed problem. Experiments are carried out on KITTI and NYU Depth v2, two widely used public datasets, on which the proposed approach achieves viable performance. We also examine the errors produced by different models in order to expose the shortcomings of current approaches.


Introduction
Commercial depth sensors, such as LiDAR systems and the Kinect, are often used to acquire depth information. Besides their high cost and the operating expertise they require, these sensors suffer from low resolution and limited perception distance, which restricts their range of application. Because RGB images are widely available, estimating a depth map from monocular images has gained popularity. However, the problem is intrinsically scale-ambiguous: an infinite number of plausible depth maps may be consistent with a single image, making it a difficult ill-posed problem. Monocular depth cues have been used as hints during depth estimation. Although such strategies have been shown to work well in experiments, they are not widely used because depth cues are not always present in natural images. Motivated by the remarkable performance of deep learning on image classification, object detection, and semantic segmentation, several researchers have addressed monocular depth prediction with deep learning approaches. Early techniques depended solely on hand-crafted features and probabilistic models, whereas current approaches rely largely on convolutional neural networks (CNNs) because of their superior performance. Many CNN-based approaches have been proposed to estimate depth from a monocular image. Saxena et al. [1] initially proposed a Markov Random Field to predict depth from a monocular, horizontally-aligned image, which later evolved into the Make3D system. This method fails when the setting is uncontrolled, particularly when the horizontal-alignment assumption does not hold. Eigen et al. [2] were the first to use a CNN to estimate the depth of a single image; they trained a multi-scale convolutional neural network to regress depth for this type of challenge. Xiaobai Ma et al. [3] observed that CNN-based depth estimation methods have recently begun to dominate. Since the task is closely related to semantic labeling, most works build upon networks from the ImageNet Large Scale Visual Recognition Challenge (ILSVRC), often initializing with AlexNet or the deeper VGG. The authors addressed the task with two stacked deep networks: the first predicts the depth of the whole image, and the second refines this prediction locally. This idea was later extended to three stacked CNNs that additionally predict surface normals and labels together with depth. Zhao Chen et al. [4] proposed overcoming the limits of monocular depth estimation by giving a depth model a sparse set of recorded depths together with an RGB image to predict the full depth map. The resulting model provides performance comparable to a modern depth sensor despite observing only a small fraction of the depth map, which favors the design of smaller and more energy-efficient depth sensor hardware. For this reason, many computer vision systems use RGB images obtained from a video stream for camera pose estimation. Subsequently, Liu et al. [5] presented a number of alternative CNN-based techniques, and Roy et al. [6] suggested a depth estimation technique based on Neural Regression Forests. The approaches outlined above, however, are specific to the context in which they were trained and are therefore not domain-independent. Monocular depth estimation relies on image processing techniques [7][8][9][10][11][12][13][14][15][16][17][18][19][20][21][22][23][24][25] and is helpful in communication applications [26][27][28][29][30]. The rest of the paper is organized as follows. Section 2 describes the proposed methodology for monocular depth estimation. Section 3 presents simulation results and analysis with different performance metrics on two publicly available datasets. The conclusion is given in Section 4.

Proposed Methodology for Monocular Depth Estimation
We take a set of images of size 640×480 from the dataset and resize them to 256×256. When these input images are passed through the encoder, it downsamples them through a series of operations to a feature map of size 8×8. The encoder output is then passed to the decoder, which upsamples these feature maps through a few further operations to obtain the final predicted depth maps of size 256×256. In this project ResNet-50 is used as the encoder and bilinear upsampling as the decoder. Experiments were also conducted with EfficientNet and DenseNet architectures as encoders, and the results are compared with ResNet. The following flowchart demonstrates the process in detail.
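The resolutions above pin down the network geometry: reducing a 256×256 input to an 8×8 bottleneck implies five stride-2 downsampling stages (256 / 2⁵ = 8), which the decoder mirrors with five 2× bilinear upsamplings. A minimal sketch of this dimension bookkeeping (not the authors' code):

```python
def encoder_sizes(input_size=256, num_stages=5):
    """Spatial size after each stride-2 encoder stage."""
    sizes = [input_size]
    for _ in range(num_stages):
        sizes.append(sizes[-1] // 2)
    return sizes

def decoder_sizes(bottleneck_size=8, num_stages=5):
    """Spatial size after each 2x bilinear upsampling stage."""
    sizes = [bottleneck_size]
    for _ in range(num_stages):
        sizes.append(sizes[-1] * 2)
    return sizes

print(encoder_sizes())  # [256, 128, 64, 32, 16, 8]
print(decoder_sizes())  # [8, 16, 32, 64, 128, 256]
```

The number of stages (five) is inferred from the stated 256→8 reduction; the actual ResNet-50 and EfficientNet-B0 stage boundaries may group convolutions differently.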

Fig 1. Flow Chart
• The database is the first block of the encoder; it is where the input data is stored. For this project we used a database comprising 5000 training images and 500 testing images.
• Image normalization is performed to map the input images into a fixed range of pixel values.
• Data augmentation is performed on the dataset to increase diversity. AutoAugment, which is pre-learned with sub-policies on ImageNet, is the technique used in this project.
• The image is then resized and converted to a double-precision matrix. These operations are performed in the second block.
• The third block contains the neural network model: ResNet-50 or EfficientNet-B0.
• The EfficientNet-B0 and ResNet-50 models are defined using TensorFlow and Keras. A transfer learning approach is used in this project, with pretraining performed on ImageNet.
• These encoder blocks downsample the input images and produce feature maps of size 8×8.
• In the decoder, a custom upsampling block is defined, consisting of a bilinear upsampling layer followed by convolutional and activation layers. The convolutional layers receive skip connections from the encoder; these connections provide feature-rich correspondences.
• Here the feature maps are upsampled and we obtain the depth map images.
• The model is trained in Google Colab with GPU support for 25 epochs; each epoch takes around 5-6 minutes.
• Finally, the depth maps are predicted and the output is obtained.
• For performance evaluation we used the Mean Absolute Error (MAE), the Structural Similarity Index Measure (SSIM), the Jaccard index, and the F1-score. For validation loss measurement, the test dataset is used as the validation dataset.
• These are the algorithmic steps followed to obtain the required results.
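The custom decoder block described in the steps above (bilinear 2× upsampling plus an encoder skip connection) can be sketched in plain NumPy. This is an illustrative reconstruction, not the authors' TensorFlow code; the convolutional and activation layers that follow the concatenation are omitted, and `decoder_block` is a hypothetical name.

```python
import numpy as np

def bilinear_upsample_2x(x):
    """Upsample an (H, W, C) feature map by a factor of 2 using
    bilinear interpolation with half-pixel sample centers."""
    h, w, c = x.shape
    out_h, out_w = 2 * h, 2 * w
    # Source coordinates of each output pixel in the input grid.
    ys = (np.arange(out_h) + 0.5) / 2 - 0.5
    xs = (np.arange(out_w) + 0.5) / 2 - 0.5
    y0 = np.clip(np.floor(ys).astype(int), 0, h - 1)
    x0 = np.clip(np.floor(xs).astype(int), 0, w - 1)
    y1 = np.clip(y0 + 1, 0, h - 1)
    x1 = np.clip(x0 + 1, 0, w - 1)
    wy = np.clip(ys - y0, 0, 1)[:, None, None]  # vertical weights
    wx = np.clip(xs - x0, 0, 1)[None, :, None]  # horizontal weights
    top = x[y0][:, x0] * (1 - wx) + x[y0][:, x1] * wx
    bot = x[y1][:, x0] * (1 - wx) + x[y1][:, x1] * wx
    return top * (1 - wy) + bot * wy

def decoder_block(features, skip):
    """Upsample decoder features 2x and concatenate the encoder skip
    connection along the channel axis (convolutions omitted)."""
    up = bilinear_upsample_2x(features)
    return np.concatenate([up, skip], axis=-1)
```

For example, an 8×8 bottleneck with 64 channels combined with a 16×16 skip tensor of 32 channels yields a 16×16 tensor with 96 channels, which the subsequent convolutions would then reduce.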

KITTI Dataset
The KITTI dataset was introduced in IJRR in 2013. It comprises outdoor scenes captured by cameras and depth sensors mounted on a driving car, with an image size of 375 × 1242. The dataset has two versions, encompassing RGB stereo pairs and corresponding ground-truth depth maps. KITTI contains real photographs from the "city", "residential", and "road" categories. The 56 image scenes in the KITTI dataset are separated into two portions: 28 for training and 28 for testing. Each scene is created from stereo image pairs with a resolution of 1224 × 368. We resize the images to 256 × 256 during training. The dataset is further divided into RD: KITTI Raw Depth; CD: KITTI Continuous Depth; SD: KITTI Semi-Dense Depth; ES: Eigen Split; and ID: KITTI Improved Depth. It is customary to use the Eigen split instead of the complete KITTI dataset; this split contains 23,488 images for training and 697 images for testing from 32 scenes.
A rotating LiDAR sensor sparsely samples the depth corresponding to each RGB image. We evaluate only the bottom half of the image space, since the LiDAR cannot provide depth information for the upper half. The dataset is also commonly used to assess deep learning-based visual odometry (VO) algorithms because it includes ground-truth poses for 11 odometry sequences. It is frequently used in computer vision subtasks such as optical flow, visual odometry, depth estimation, object detection, semantic segmentation, and tracking. Because this dataset is more complicated to use with supervised and semi-supervised algorithms, we have used the NYU dataset.

NYU-Depth V2
Different from the KITTI dataset, the NYU-Depth V2 dataset is composed of monocular video sequences of a variety of indoor scenes, captured by both the RGB and depth cameras of the Microsoft Kinect. Nathan Silberman and Rob Fergus introduced the NYU Depth dataset at ECCV 2012. It contains 1449 RGB images densely labelled with depth images. The dataset comprises 407K frames of 464 indoor scenes captured in 3 different cities. The official split contains 249 training sequences and 215 testing sequences.

Fig 3 NYU Depth v2 dataset
It is the common benchmark and the main training dataset for supervised monocular depth estimation. We randomly selected frames from each training sequence, yielding about 12k distinct images. After offline augmentation, our dataset includes around 59k RGB-D image pairs. The original image resolution is 480 × 640; we downsample the images to 256 × 256 as network input. Because of the different frame rates of the RGB and depth cameras, there is no one-to-one correspondence between depth maps and RGB images. To align them, each depth map is first paired with the RGB image closest to it in time. The camera projections are then used to align the depth and RGB pairs using the geometric relationship supplied by the dataset. Because the projection is discrete, not all pixels have a depth value, so pixels with no depth value are masked out during the experiments. We conducted our studies using these downsampled images as input.
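The nearest-in-time pairing step described above can be sketched as follows. This is an illustrative sketch, not the official NYU toolbox code; `pair_depth_to_rgb` is a hypothetical helper, and the RGB timestamps are assumed to be sorted in ascending order.

```python
import bisect

def pair_depth_to_rgb(depth_timestamps, rgb_timestamps):
    """For each depth frame, return the index of the RGB frame whose
    timestamp is closest. The two cameras run at different frame rates,
    so there is no one-to-one correspondence between the streams.
    rgb_timestamps must be sorted in ascending order."""
    pairs = []
    for t in depth_timestamps:
        i = bisect.bisect_left(rgb_timestamps, t)
        # Compare the neighbours on either side of the insertion point.
        candidates = [j for j in (i - 1, i) if 0 <= j < len(rgb_timestamps)]
        best = min(candidates, key=lambda j: abs(rgb_timestamps[j] - t))
        pairs.append(best)
    return pairs

# Depth frames at 0.10s and 0.25s match RGB frames at 0.12s and 0.24s.
print(pair_depth_to_rgb([0.10, 0.25], [0.0, 0.12, 0.24, 0.36]))  # [1, 2]
```

After this temporal pairing, the geometric alignment via the camera projections would be applied to each matched pair.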

Evaluation metrics
Evaluation metrics comprise error and accuracy measures, expressed through five evaluation indicators: absolute relative error (ARE), root mean square error (RMSE), logarithmic root mean square error (RMSE log), square relative error (SRE), and accuracy with threshold. These parameters are defined as follows:

ARE = (1/K) Σᵢ |yᵢ − ŷᵢ| / yᵢ  (1)

RMSE = √( (1/K) Σᵢ (yᵢ − ŷᵢ)² )  (2)

RMSE log = √( (1/K) Σᵢ (log yᵢ − log ŷᵢ)² )  (3)

SRE = (1/K) Σᵢ (yᵢ − ŷᵢ)² / yᵢ  (4)
Accuracy with threshold (δ < th): the percentage of pixels i such that max(yᵢ/ŷᵢ, ŷᵢ/yᵢ) = δ < th, where th = 1.25, 1.25², 1.25³. Here yᵢ and ŷᵢ are the ground-truth and predicted depth at pixel i, K is the total number of pixels with valid ground-truth depth, and th denotes the threshold. These evaluation metrics were used to measure the quality of our model after the learning process. Table 1 depicts the model parameters and their respective values used when training the proposed framework. Table 2 provides the Mean Absolute Error (MAE) for different epochs with only 400 homogeneous input images, while Table 3 provides the MAE for different epochs with 4600 heterogeneous input images. The average error metrics over the entire test dataset for each architecture are provided in Table 4. From these tables, it is found that the EfficientNet architecture has lower loss values and is more accurate than the ResNet architecture. Fig. 6 shows the training and validation loss of EfficientNet-B0 for 400 homogeneous input images, whereas Fig. 7 depicts the training and validation loss of EfficientNet-B0 for 4600 heterogeneous input images.
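The error and accuracy indicators above can be computed with NumPy as in the following sketch (an illustration, not the authors' implementation); the inputs are assumed to contain only pixels with valid positive ground-truth depth, i.e. invalid pixels have already been masked out.

```python
import numpy as np

def depth_metrics(y_true, y_pred):
    """Compute ARE (1), RMSE (2), RMSE log (3), SRE (4), and the
    delta accuracies for thresholds 1.25, 1.25**2, 1.25**3."""
    y = np.asarray(y_true, dtype=float)
    p = np.asarray(y_pred, dtype=float)
    are = np.mean(np.abs(y - p) / y)                            # eq. (1)
    rmse = np.sqrt(np.mean((y - p) ** 2))                       # eq. (2)
    rmse_log = np.sqrt(np.mean((np.log(y) - np.log(p)) ** 2))   # eq. (3)
    sre = np.mean((y - p) ** 2 / y)                             # eq. (4)
    ratio = np.maximum(y / p, p / y)                            # per-pixel delta
    deltas = [float(np.mean(ratio < 1.25 ** n)) for n in (1, 2, 3)]
    return are, rmse, rmse_log, sre, deltas
```

For a perfect prediction all four errors are zero and every delta accuracy is 1.0; larger thresholds always yield accuracies at least as high as smaller ones.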

Results Analysis
During execution, we used both 400 homogeneous images and 4600 heterogeneous images as input to the encoder, and compared the ResNet and EfficientNet encoders to determine which performs best. Fig. 8 shows the original images and their corresponding ground truth and depth map results. Fig. 10 shows the original images and their corresponding ground truth and depth map results obtained with the EfficientNet model for 4600 images, while Fig. 11 shows the same for the ResNet model. On the basis of training and validation losses and MAE, we compared the outcomes and observed that, whether the input is 400 homogeneous or 4600 heterogeneous images, the EfficientNet design produces more accurate results.

Conclusion
Monocular depth estimation plays a critical role in scene understanding. From training approaches to task categories, this work introduces the relevant deep learning frameworks. A monocular depth estimation method using an encoder-decoder architecture is proposed. The encoder, EfficientNet-B0 or ResNet-50, is a lightweight network, and bilinear interpolation forms the decoder; this decoder helps keep the number of model parameters low. Since resolution scaling is one of the essential features of EfficientNet-B0, the results confirm that EfficientNet-B0 is more competent for high-resolution images. When tested on 5000 images, we found that EfficientNet-B0 yields lower validation loss than the other frameworks due to its compound scaling. When the input resolution is higher, EfficientNet has a lower loss than ResNet-50.