From Auto-encoders to Capsule Networks: A Survey

Abstract: Convolutional Neural Networks are a very powerful Deep Learning algorithm used in image processing, object classification and segmentation. They are very robust in extracting features from data and are widely used in several domains. Nonetheless, they require a large amount of training data, and relations between features get lost in the Max-pooling step, which can lead to wrong classifications. Capsule Networks (CapsNets) were introduced to overcome these limitations by extracting features and their pose using capsules instead of neurons. This technique shows impressive performance on one-dimensional, two-dimensional and three-dimensional datasets as well as on sparse datasets. In this paper, we present an initial understanding of CapsNets, their concept, structure and learning algorithm. We introduce the progress made by CapsNets from their introduction in 2011 until 2020. We compare different CapsNets series to demonstrate strengths and challenges. Finally, we cite different implementations of Capsule Networks and show their robustness in a variety of domains. This survey provides the state of the art of Capsule Networks and allows other researchers to get a clear view of this new field. In addition, we discuss the open issues and the promising directions of future research, which may lead to a new generation of CapsNets.


INTRODUCTION
Imitating the human brain used to be a dream for scientists until the creation of Artificial Neural Networks (ANNs). ANNs are the artificial version of the Biological Neural Networks that constitute our nervous system. Simulating the human brain's ability in object classification was the goal of Convolutional Neural Networks (CNNs). This type of neural network shows high performance in object classification and image processing. CNNs extract the most significant features from images and use them for classification. However, CNNs are unable to detect object deformation and relationships among object entities. These limitations may lead to incorrect classification and thus degrade model performance.
Capsule Networks have been introduced to adjust CNNs and overcome their shortcomings. These networks are a combination of Auto-encoders and capsules. Auto-encoders (AE) are simple neural networks consisting of an encoder, a latent space representation and a decoder. The encoder compresses the input to the latent space representation, then the decoder reconstructs the input based on this representation only. The network is trained by updating weights using backpropagation with a gradient-based optimizer. This type of network is used for data denoising, dimensionality reduction and as a generative model. Several variants were developed to extract more features while keeping the capacity of generalization: Denoising AE (Vincent et al., 2008), Sparse AE (Lee et al., 2008), Variational AE (Pu et al., 2016) and Transforming AE (Hinton et al., 2011).
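The encoder, latent representation and decoder described above can be illustrated with a minimal sketch. It assumes a purely linear auto-encoder trained with plain gradient descent on the mean squared error; the function name, layer sizes, learning rate and toy data are hypothetical choices made for illustration and are not taken from any of the cited works.

```python
import numpy as np

def train_linear_autoencoder(x, latent_dim=2, lr=0.05, epochs=300, seed=0):
    """Train a linear auto-encoder with gradient descent on the MSE loss."""
    rng = np.random.default_rng(seed)
    n_features = x.shape[1]
    w_enc = rng.normal(scale=0.1, size=(n_features, latent_dim))  # encoder weights
    w_dec = rng.normal(scale=0.1, size=(latent_dim, n_features))  # decoder weights
    losses = []
    for _ in range(epochs):
        z = x @ w_enc        # encoder: compress the input to the latent space
        x_hat = z @ w_dec    # decoder: reconstruct from the latent code only
        err = x_hat - x
        losses.append(float(np.mean(err ** 2)))
        # Backpropagation of the MSE loss through the two linear layers.
        grad_out = 2.0 * err / err.size
        grad_dec = z.T @ grad_out
        grad_enc = x.T @ (grad_out @ w_dec.T)
        w_dec -= lr * grad_dec
        w_enc -= lr * grad_enc
    return w_enc, w_dec, losses

# Toy data: 200 samples with 8 features that lie in a 2-dimensional subspace.
data_rng = np.random.default_rng(1)
x = data_rng.normal(size=(200, 2)) @ data_rng.normal(size=(2, 8))
w_enc, w_dec, losses = train_linear_autoencoder(x)
```

The reconstruction error recorded in `losses` should shrink during training, because the 2-dimensional latent code is large enough to represent this toy data.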
Capsule Networks were introduced in 2011. They were presented as Transforming AE by (Hinton et al., 2011), who argued that Convolutional Neural Networks are misguided in what they are trying to achieve. CNNs lose meaningful information like object entities' poses and relationships between features in the Max-pooling layer. Transforming AE proposed capsules instead of neurons to keep the maximum information, e.g. pose and velocity. However, the idea did not work efficiently until the introduction of the Routing by Agreement algorithm in 2017 (Sabour et al., 2017), which outperforms CNNs on some datasets and shows impressive results. This paper highlights the limitations of CNNs and the high performance of CapsNets in diverse implementations. We present a selection of the best performing works on CapsNets from various viewpoints. We compare different CapsNets models, and we discuss their benefits and challenges. This survey was written after consulting other similar papers. We believe that our review presents the most recent works in this field. It gives a clear view of the CapsNets series and updates, and it explores a possible future scope of research.
This paper is organized as follows: In Section 2, we introduce CNNs and their limitations. Then, we detail Capsule Networks architecture and its progress in Section 3. Furthermore, we present implementations' domains and fields of this Deep Learning (DL) network in Section 4. After that, we describe CapsNets updates in Section 5. The series and shortcomings of Capsule Networks are described in Section 6. Finally, we conclude in Section 7.

CONVOLUTIONAL NEURAL NETWORKS
Convolutional Neural Networks (CNNs) are very powerful in image classification and processing (Q. Zhang et al., 2016). They are considered state-of-the-art in computer vision and are widely used in object recognition systems (Maturana & Scherer, 2015) and self-driving cars (Jung et al., 2016).

Overview of CNNs
CNNs process an input image with four kinds of layers: convolutional layers, pooling layers, flattening layers and fully connected layers. Convolutional layers apply multiple kernels to the input and activate the output with the rectified linear activation function (ReLU) (He et al., 2015) to generate a feature map (equation 1). The pooling layer generates a pooled feature map using Max-pooling (equation 2), which keeps only the largest value in each window and passes it to the next layer. Therefore, it reduces the dimension of images. These two layers are repeated several times to refine feature extraction. Next, the flattening layer flattens the pooled feature map into a column vector. This vector is passed to a Fully Connected (FC) artificial neural network that consists of an input layer, hidden layers and an output layer. Figure 1 shows the CNNs' structure.
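The layer sequence described above (convolution, ReLU, Max-pooling, flattening) can be traced on a tiny input. The following is a minimal sketch with a single hand-written kernel; the image, the kernel values and the function names are hypothetical and only illustrate the data flow, not a trained network.

```python
import numpy as np

def conv2d_valid(image, kernel):
    """Slide one kernel over a 2D image (no padding, stride 1).

    As in most Deep Learning libraries, this is a cross-correlation.
    """
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for r in range(out_h):
        for c in range(out_w):
            out[r, c] = np.sum(image[r:r + kh, c:c + kw] * kernel)
    return out

def relu(x):
    """Rectified linear activation: max(0, x) element-wise."""
    return np.maximum(x, 0.0)

def max_pool(feature_map, size=2):
    """Keep the largest value of each non-overlapping size x size window."""
    h, w = feature_map.shape
    h, w = h - h % size, w - w % size  # drop rows/columns that do not fill a window
    blocks = feature_map[:h, :w].reshape(h // size, size, w // size, size)
    return blocks.max(axis=(1, 3))

image = np.arange(36, dtype=float).reshape(6, 6)   # toy 6x6 "image"
kernel = np.array([[-1.0, 0.0], [0.0, 1.0]])       # toy diagonal-difference kernel
feature_map = relu(conv2d_valid(image, kernel))    # 5x5 feature map
pooled = max_pool(feature_map)                     # 2x2 pooled feature map
flat = pooled.reshape(-1, 1)                       # 4x1 column vector for the FC part
```

In a real CNN, many kernels are learned and the convolution and pooling steps are stacked several times before flattening.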

CNNs Shortcomings
Convolutional Neural Networks were introduced two decades ago. Through all these years, CNNs were widely developed and adjusted. However, they still have some shortcomings:
-Inability to understand data structure (Hosseini et al., 2017): CNNs do not capture position properties and hierarchical structures, i.e. relations between objects' parts. Max-pooling reduces the dimension of images and causes a loss of some useful features.
-Inability to be spatially invariant: CNNs are only invariant to translation, but if the input images have been reversed, rotated or tilted, the performance decreases drastically. They are unable to detect deformation, pose and texture of an image (Sabour et al., 2017).
-Viewpoint variance: different viewpoints of an object lead to changes in neural activities. Hence, to recognize objects, the network should learn different variations of the images. That requires a lot of training data and a long training time.
-Overfitting: when the camera or the illumination of the image is changed, CNNs cannot perform well (Ahmadvand et al., 2016).
-Sensitivity to adversarial attacks: CNNs can easily be fooled by adding some carefully constructed noise to the input image.
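The loss of spatial relations in the Max-pooling step can be made concrete with a small example. In this minimal sketch, the two feature maps `a` and `b` are hypothetical: they contain the same four activations, spread towards the corners in `a` and clustered in the centre in `b`, yet 2x2 Max-pooling maps both to the same output.

```python
import numpy as np

def max_pool(feature_map, size=2):
    """Keep the largest value of each non-overlapping size x size window."""
    h, w = feature_map.shape
    blocks = feature_map.reshape(h // size, size, w // size, size)
    return blocks.max(axis=(1, 3))

# Same activations, different spatial arrangement.
a = np.array([[9, 0, 0, 5],
              [0, 0, 0, 0],
              [0, 0, 0, 0],
              [3, 0, 0, 7]], dtype=float)
b = np.array([[0, 0, 0, 0],
              [0, 9, 5, 0],
              [0, 3, 7, 0],
              [0, 0, 0, 0]], dtype=float)

pooled_a = max_pool(a)
pooled_b = max_pool(b)
same_after_pooling = np.array_equal(pooled_a, pooled_b)
```

After pooling, the following layers can no longer tell whether the detected parts were far apart or adjacent, which is the kind of part-to-part relation that capsules aim to preserve.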

CAPSULE NETWORKS PROGRESS
The idea of Capsule Networks was introduced in 2011 to overcome the shortcomings of CNNs regarding robustness. It has been tested on different types of datasets and showed high performance. The following subsections describe the main milestones in the progress of CapsNets.

Transforming Auto-encoders
Transforming Auto-encoders (TAEs) (Hinton et al., 2011) were the first seed of capsule networks. TAEs are Auto-encoders that apply a transformation matrix to the extracted pose features, so the network can be trained to predict transformations like rotation, scaling and translation. Unlike CNNs that are only invariant to translation, TAEs are equivariant. This property makes them understand proportion change and adjust themselves accordingly to keep the pose features information. Equivariance is achieved in these Auto-encoders by using vectors to represent objects, where each vector contains scalar values that represent the instantiation parameters of the object.
TAEs consist of several capsules, where each capsule is a group of neurons that represent an object or a part of an object in a specific location using inverse rendering. They extract instantiation parameters from the image to draw it again.
A TAE capsule is composed of recognition units and generative units. The output of each capsule represents its contribution to the reconstruction of the output image. Figure 2 details the structure of the TAEs.
Recognition units (blue circles in Figure 2) detect pose parameters represented by matrix A and compute P, the probability that the capsule's feature is present in the image. Then, the capsule will transfer these values to the generative units layer.
Generative units (red circles in Figure 2) are fed with TA, where T is the transformation matrix. These units compute the capsule's contribution to the transformed image and multiply it by the probability P. Finally, all capsules' contributions are combined to reconstruct the output image. However, this architecture could not work properly in 2011, because of computer hardware limitations and the absence of efficient algorithms.
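The data flow through a single TAE capsule can be sketched as follows. This is a minimal, untrained illustration under several assumptions: the pose A is taken to be a 3x3 matrix, recognition and generative units are small sigmoid layers with random weights, and the class name, layer sizes and the example transformation T are hypothetical rather than taken from (Hinton et al., 2011).

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class TransformingCapsule:
    """One capsule: recognition units -> (A, p); generative units -> p * contribution."""

    def __init__(self, n_pixels, n_hidden=10):
        self.w_rec = rng.normal(scale=0.1, size=(n_pixels, n_hidden))
        self.w_pose = rng.normal(scale=0.1, size=(n_hidden, 9))   # 3x3 pose matrix A
        self.w_prob = rng.normal(scale=0.1, size=(n_hidden, 1))   # presence probability p
        self.w_gen = rng.normal(scale=0.1, size=(9, n_hidden))
        self.w_out = rng.normal(scale=0.1, size=(n_hidden, n_pixels))

    def forward(self, image, transform):
        h = sigmoid(image @ self.w_rec)                 # recognition units
        pose = (h @ self.w_pose).reshape(3, 3)          # A
        prob = sigmoid(h @ self.w_prob).item()          # p
        transformed = transform @ pose                  # TA, fed to the generative units
        g = sigmoid(transformed.reshape(-1) @ self.w_gen)
        return prob * (g @ self.w_out)                  # contribution gated by p

def reconstruct(capsules, image, transform):
    """Sum the gated contributions of all capsules to form the output image."""
    return sum(c.forward(image, transform) for c in capsules)

capsules = [TransformingCapsule(n_pixels=64) for _ in range(3)]
image = rng.random(64)                                   # flattened 8x8 input
shift = np.array([[1.0, 0.0, 2.0],                       # T: shift by (2, -1) in
                  [0.0, 1.0, -1.0],                      # homogeneous coordinates
                  [0.0, 0.0, 1.0]])
output = reconstruct(capsules, image, shift)
```

Training would adjust all weights so that `output` matches the input image after the known transformation T has been applied to it.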

Dynamic Routing Between Capsules
In 2017, (Sabour et al., 2017) succeeded in implementing an efficient algorithm to relate capsules, which showed better performance than CNNs on the MNIST dataset. It is called Dynamic Routing Between Capsules or Routing by Agreement between capsules (RBA). This paper (Sabour et al., 2017) was the official definition of CapsNets as a network of capsules. The output of a capsule is called the activation or instantiation vector. The length of this vector represents the probability that the feature actually exists. The orientation of the vector encodes the feature's instantiation parameters, e.g. thickness, localization, width and so on.
The CapsNets Encoder consists of three main parts: Convolutional layer, PrimaryCaps layer and ClassCaps layer (also called DigitCaps) (Figure 3). The Convolutional layer extracts image features through convolution kernels to construct a feature map. Then, a ReLU function is applied to provide non-linearity and to activate the feature map values. The output feature map is scanned another time by kernels, which generates a new feature map. PrimaryCaps group the generated features into vectors to create primary capsules. Finally, the PrimaryCaps are routed to the ClassCaps layer by Dynamic Routing Between Capsules (Algorithm 1). The original CapsNets were used to classify the MNIST dataset, so the ClassCaps layer consists of 10 capsules, one per class.
The contribution of each capsule i in PrimaryCaps to each capsule j in ClassCaps is computed as follows:
$$\hat{u}_{j|i} = W_{ij} u_i \qquad (3)$$
where $u_i$ is the output of capsule i, $\hat{u}_{j|i}$ is a prediction vector and $W_{ij}$ is a weight matrix.
Each capsule j in ClassCaps computes the total prediction vector $s_j$ (equation 4). To ensure that the vector length is between 0 and 1, a squashing function is applied (equation 5), which preserves the orientation of the vector and therefore does not affect the instantiation parameters.
$$s_j = \sum_i c_{ij} \hat{u}_{j|i} \qquad (4)$$
$$v_j = \frac{\|s_j\|^2}{1 + \|s_j\|^2} \frac{s_j}{\|s_j\|} \qquad (5)$$
$c_{ij}$ is the coupling coefficient determined by a SoftMax function over the routing logits $b_{ij}$ (equation 6). This coefficient is used by Dynamic Routing to determine the relation between low-level and high-level capsules through repetitive routing.
$$c_{ij} = \frac{\exp(b_{ij})}{\sum_k \exp(b_{ik})} \qquad (6)$$
The agreement between capsules is measured by the scalar product of the prediction vector $\hat{u}_{j|i}$ and the output $v_j$ of the high-level capsule. If the agreement is high, the low-level capsule and the high-level capsule are related to each other and the coupling coefficient increases; otherwise, it decreases. Notice that $c_{ij}$ is updated in this step by updating $b_{ij}$ (equation 7), unlike $W_{ij}$, which is updated by backpropagation.
$$b_{ij} \leftarrow b_{ij} + \hat{u}_{j|i} \cdot v_j \qquad (7)$$
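The routing loop described by equations 3 to 7 can be sketched in a few lines of NumPy. This is a minimal illustration following Algorithm 1 of (Sabour et al., 2017) with three routing iterations; the function names, the tensor layout and the small random inputs are assumptions made for this example, and the weights W are random rather than learned.

```python
import numpy as np

def squash(s, axis=-1, eps=1e-9):
    """Equation 5: shrink the length into [0, 1) while keeping the orientation."""
    sq_norm = np.sum(s ** 2, axis=axis, keepdims=True)
    return (sq_norm / (1.0 + sq_norm)) * s / np.sqrt(sq_norm + eps)

def softmax(b, axis):
    e = np.exp(b - b.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def dynamic_routing(u, w, n_iter=3):
    """Route low-level capsules to high-level capsules.

    u: (n_in, d_in) outputs of the low-level capsules
    w: (n_in, n_out, d_out, d_in) weight matrices W_ij
    Returns v: (n_out, d_out) and the last coupling coefficients c: (n_in, n_out).
    """
    u_hat = np.einsum('ijkl,il->ijk', w, u)        # equation 3: prediction vectors
    b = np.zeros(u_hat.shape[:2])                  # routing logits b_ij
    for _ in range(n_iter):
        c = softmax(b, axis=1)                     # equation 6: over high-level capsules j
        s = np.einsum('ij,ijk->jk', c, u_hat)      # equation 4: weighted sum
        v = squash(s)                              # equation 5
        b = b + np.einsum('ijk,jk->ij', u_hat, v)  # equation 7: agreement update
    return v, c

rng = np.random.default_rng(0)
u = rng.normal(size=(6, 8))                        # 6 low-level capsules of dimension 8
w = rng.normal(size=(6, 10, 16, 8))                # 10 high-level capsules of dimension 16
v, c = dynamic_routing(u, w)
class_probabilities = np.linalg.norm(v, axis=1)    # vector lengths act as probabilities
```

Only the logits `b` and coefficients `c` change inside the loop; in a full model the matrices `w` would be updated separately by backpropagation.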
The Decoder part (Figure 4) reconstructs the input image. It is made up of three FC layers using ReLU and Sigmoid activation functions to generate the output, which is reshaped to a grayscale image.
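A forward pass through such a decoder can be sketched as follows. The layer widths (512, 1024 and 784 units), the 10 x 16 shape of the ClassCaps output and the masking of all capsules except the target class are taken from the MNIST setup in (Sabour et al., 2017) and are assumptions with respect to this survey; the weights are random, so the output is not a meaningful image.

```python
import numpy as np

def decode(class_caps, target, weights):
    """Reconstruct a 28x28 image from the ClassCaps output.

    class_caps: (10, 16) capsule outputs, target: index of the class to decode,
    weights: list of three (W, b) pairs for the fully connected layers.
    """
    mask = np.zeros((class_caps.shape[0], 1))
    mask[target] = 1.0
    x = (class_caps * mask).reshape(-1)              # keep only the target capsule, 160 values
    (w1, b1), (w2, b2), (w3, b3) = weights
    x = np.maximum(x @ w1 + b1, 0.0)                 # FC + ReLU
    x = np.maximum(x @ w2 + b2, 0.0)                 # FC + ReLU
    x = 1.0 / (1.0 + np.exp(-(x @ w3 + b3)))         # FC + Sigmoid, pixel values in [0, 1]
    return x.reshape(28, 28)                         # grayscale image

rng = np.random.default_rng(0)
sizes = [160, 512, 1024, 784]
weights = [(rng.normal(scale=0.01, size=(n_in, n_out)), np.zeros(n_out))
           for n_in, n_out in zip(sizes[:-1], sizes[1:])]
reconstruction = decode(rng.normal(size=(10, 16)), target=3, weights=weights)
```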
Since CapsNets consist of a classification part and a reconstruction part, the total loss TL is calculated in two parts: (i) the first one punishes incorrect classifications (encoder part), and (ii) the second punishes the reconstruction error D (decoder part) through the mean squared error.
The following equation represents the margin loss of classification:
$$L_k = E_k \max(0, m^+ - \|v_k\|)^2 + \lambda (1 - E_k) \max(0, \|v_k\| - m^-)^2 \qquad (8)$$
where the first term is active if an object of class k is present ($E_k = 1$), and the second term is active in the opposite case ($E_k = 0$). $m^+ = 0.9$ and $m^- = 0.1$ are set to prevent the vector length from maxing out or collapsing, and λ is set to 0.5 to down-weight the loss for absent classes, so that initial learning does not shrink the vector lengths of all class capsules. This entity loss ($L_k$) is then summed with the reconstruction loss (equation 9) to compute the total loss (equation 10), which is used to evaluate the model performance.
$$D = \mathrm{MSELoss}(y, y') \qquad (9)$$
where y is the input image and y' is the reconstructed image.
$$TL = \sum_k L_k + \alpha D \qquad (10)$$
α is the down-scaling factor (taken as 0.0005) used to prevent the D loss from dominating over the $L_k$ loss.
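The margin loss and the total loss can be checked numerically. The sketch below assumes the squared form of the margin loss from (Sabour et al., 2017) and uses the constants stated above (0.9, 0.1, λ = 0.5 and α = 0.0005); the function names and the three-class example values are hypothetical.

```python
import numpy as np

def margin_loss(lengths, labels, m_pos=0.9, m_neg=0.1, lam=0.5):
    """Per-class margin loss L_k from capsule lengths and one-hot labels E_k."""
    present = labels * np.maximum(0.0, m_pos - lengths) ** 2
    absent = lam * (1.0 - labels) * np.maximum(0.0, lengths - m_neg) ** 2
    return present + absent

def total_loss(lengths, labels, image, reconstruction, alpha=0.0005):
    """Summed margin loss plus the down-scaled MSE reconstruction loss D."""
    d = np.mean((image - reconstruction) ** 2)
    return margin_loss(lengths, labels).sum() + alpha * d

# Three classes: class 0 is present and confidently detected,
# class 1 is absent and correctly short, class 2 is absent but too long.
lengths = np.array([0.95, 0.05, 0.5])
labels = np.array([1.0, 0.0, 0.0])
per_class = margin_loss(lengths, labels)   # only class 2 is penalised: 0.5 * (0.5 - 0.1)^2
tl = total_loss(lengths, labels, image=np.zeros(4), reconstruction=np.ones(4))
```

Only the wrongly long capsule of the absent class contributes to the margin loss in this example, and the reconstruction term adds a comparatively small amount because of α.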

Matrix Capsules with EM Routing
In (Hinton et al., 2018), another algorithm was proposed for routing between capsules called Expectation Maximization Routing (EMR). Unlike RBA capsules, which use the elements of a vector to represent the pose of an object and the length of the vector to represent the probability of existence, EMR capsules use a pose matrix and an activation probability separately. Expectation Maximization is a clustering algorithm that fits data points to a mixture of Gaussian distributions, with each cluster defined by its mean μ and standard deviation σ. In a capsule network, EMR groups child capsules into a parent capsule. The high-level capsule is activated if there is an agreement among votes from low-level capsules. Each low-level capsule casts votes on the pose matrices of its potential parent capsules. The vote $V_{ij}$ of capsule i for capsule j is calculated by multiplying the pose matrix $M_i$ of capsule i by a weight matrix $W_{ij}$: $V_{ij} = M_i W_{ij}$.
[Algorithm 1: Dynamic Routing (Sabour et al., 2017). Algorithm 2: EM Routing (Hinton et al., 2018). Table 1: RBA versus EMR. Notation: i is a capsule in layer L, j is a capsule in layer L+1, and ΩL denotes the capsules of layer L.]
In EMR, the inputs and outputs of a capsule are represented by matrices instead of vectors. Moreover, the likelihood of the existence of an entity is represented by the activation probability a instead of the length of a vector. The probability is computed without using a squashing function, which is considered "not objective and sensible" (Hinton et al., 2018). Table 1 clarifies the difference between RBA (Algorithm 1) and EMR (Algorithm 2).
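The clustering view of EM Routing can be sketched as follows. This is a simplified illustration loosely following Algorithm 2 of (Hinton et al., 2018): votes are 4x4 pose matrices flattened to 16 values, each parent capsule is a diagonal Gaussian, and the M-step and E-step alternate for a fixed number of iterations. The parameters `beta_u`, `beta_a` and `lam` are learned or scheduled in the original work and are fixed to hypothetical values here, as are the function name and input sizes.

```python
import numpy as np

def em_routing(poses, a_in, w, n_iter=3, beta_u=0.0, beta_a=0.0, lam=0.01, eps=1e-8):
    """Simplified EM routing.

    poses: (n_in, 4, 4) pose matrices M_i, a_in: (n_in,) activations,
    w: (n_in, n_out, 4, 4) weight matrices W_ij.
    """
    n_in, n_out = w.shape[:2]
    # Votes V_ij = M_i W_ij, flattened to 16 values each.
    votes = np.einsum('iab,ijbc->ijac', poses, w).reshape(n_in, n_out, 16)
    r = np.full((n_in, n_out), 1.0 / n_out)            # assignment probabilities R_ij
    for _ in range(n_iter):
        # M-step: weighted mean and variance of the votes for each parent capsule.
        rw = r * a_in[:, None]
        r_sum = rw.sum(axis=0) + eps
        mu = np.einsum('ij,ijh->jh', rw, votes) / r_sum[:, None]
        var = np.einsum('ij,ijh->jh', rw, (votes - mu) ** 2) / r_sum[:, None] + eps
        cost = (beta_u + 0.5 * np.log(var)) * r_sum[:, None]
        a_out = 1.0 / (1.0 + np.exp(-lam * (beta_a - cost.sum(axis=1))))
        # E-step: reassign each child to parents in proportion to a_j * N(V_ij; mu_j, var_j).
        log_p = -0.5 * np.sum(np.log(2.0 * np.pi * var) + (votes - mu) ** 2 / var, axis=2)
        log_post = np.log(a_out + eps) + log_p
        log_post -= log_post.max(axis=1, keepdims=True)  # numerical stability
        r = np.exp(log_post)
        r /= r.sum(axis=1, keepdims=True)
    return mu.reshape(n_out, 4, 4), a_out, r

rng = np.random.default_rng(0)
poses = rng.normal(size=(8, 4, 4))          # 8 child capsules
a_in = rng.random(8)
w = rng.normal(size=(8, 3, 4, 4))           # 3 parent capsules
parent_poses, parent_activations, assignments = em_routing(poses, a_in, w)
```

A parent capsule whose votes agree (small variance) receives a low cost and therefore a high activation, which is the agreement criterion described above.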

Stacked Capsule Auto-encoders
In 2019, (Kosiorek et al., 2019) introduced an unsupervised capsule Auto-encoder called Stacked Capsule Auto-encoders (SCAEs). This capsule network uses objects to predict parts, in contrast to EM Routing and Routing by Agreement, which use a part-whole relationship to predict the presence of the object. The iterative inference used in both previous works is inefficient, an issue discussed in further research (Li et al., 2018; S. Zhang et al., 2018), while SCAEs amortize this inference.
The SCAEs consist of two stages: i) Part Capsule Auto-encoder (PCAE) predicts presences and poses of part templates directly from the image and tries to reconstruct the image by appropriately arranging the templates, ii) Object Capsule Auto-encoder (OCAE) organizes discovered parts and their poses into a smaller set of objects. These objects reconstruct the part poses using a separate mixture of predictions for each part.
SCAEs are the only method that achieves competitive results in unsupervised object classification without relying on mutual information.

IMPLEMENTATIONS
CapsNets showed their performance in various fields such as medical or chemical image recognition, audio and video processing and many others.
Capsule networks perform strongly in detecting spoof attacks. (Nguyen et al., 2019) applied a capsule network to the forensics task. It is used to detect various kinds of spoofs, from replay attacks using printed images or recorded videos to computer-generated videos. Furthermore, the RBA algorithm improves detection performance on complex and almost perfectly forged images and videos. The model showed great performance and had perfect accuracy at both the frame level and the video level.
Capsule networks have also proven their efficiency in the 3D domain. In (Y. Zhao et al., 2019), they are used to process sparse 3D point clouds. They preserve spatial arrangements of the input data with good learning ability and generalization properties. The model performs well under rotation, part segmentation and 3D reconstruction, and it has a low reconstruction error. (Duarte et al., 2018) introduced a 3D CapsNets for action detection in videos, by introducing capsule-pooling with skip connections in the convolutional layer to decrease the routing computation.
In the medical domain (Mobiny & Nguyen, 2018), capsules have also been developed to handle characteristics of 3D lung nodule classification, and speed up CapsNets by a consistent RBA mechanism. The proposed dynamic routing mechanism consists of enforcing all capsules in the PrimaryCaps layer referring to the same pixels to have the same coupling coefficient, which reduces the number of routing coefficients and speeds up the model while keeping the accuracy of the original CapsNets.
1D-CapsNet (Butun et al., 2020) has been introduced for automated detection of coronary artery disease (CAD) from electrocardiography (ECG) signals. Even though the model achieved a high accuracy, its long training time still needs to be reduced. Furthermore, the model needs a large dataset for training. This issue could be addressed by few-shot learning (Ren et al., 2020).
CapsCarcino (Y.-W. Wang et al., 2020) has been introduced to distinguish between carcinogens and noncarcinogens. This capsule network is very helpful for carcinogen risk assessment in drugs. CapsCarcino is very robust for small-sized sparse datasets: with just 20% of the dataset, it performs comparably to the other methods using the full training dataset.
WB-Caps (Baydilli & Atila, 2020) is a capsule network architecture that classifies white blood cells (WBCs) into five categories. WB-Caps can help to interpret the patient's condition by performing blood tests with little cost, based on some characteristics of WBCs like ratio or shape. The model obtained a high accuracy without over-fitting.
CapsNet-static-routing (Kim et al., 2020) is a CapsNets model used for text classification. It shows high performance and stable results: even after adding random noise to the dataset, the result does not change and sentences keep their meaning. The experimental results of the classification indicate that the accuracy of static routing is higher than that of dynamic routing. Moreover, the model has a shorter training time than the original CapsNets. On the other hand, due to the high variability in text, CapsNet-static-routing is not robust enough for document classification, in contrast to CapsNets for image classification. It needs to be flexible with respect to text modifications, like word order shuffling.
(Lei et al., 2020) introduced the Attention-Based Capsule Network (ACN) for tag recommendation. The model is based on CapsNets with RBA plus an attention mechanism. The model is flexible enough to be applied to image and video tagging, too. Moreover, ACN could be improved by using Expectation Maximization routing, where the pose matrix might extract more information and give better tag results.
For intelligent fault diagnosis, Capsule Auto-encoder (CaAE) (Ren et al., 2020) has been proposed to resolve the problems of traditional and modern intelligent fault diagnosis: the need for a large set of samples of faults and the need for diagnosis models to possess the ability of quick updating. The ability of CaAE to extract and fuse features reduces the dependence on the number of samples and training time, which makes CaAE suitable for few-shot learning without overfitting. The model is very robust under noisy datasets and it shows higher accuracy, less training time and a smaller number of epochs compared to the methods in (J. Wang et al., 2019) and (Jia et al., 2016).
(Nguyen et al., 2019) proposed CAPSULE-FORENSICS to improve the algorithm of (Sabour et al., 2017). Gaussian random noise has been added to the weight tensor to reduce over-fitting, and an additional squash has been applied before the routing iterations to keep the network more stable.
(Kim et al., 2020) suggest a static routing method instead of dynamic routing and ELU-gate (Dauphin et al., 2017) instead of pooling. Static routing reduces the computational complexity of dynamic routing. The ELU-gate method selects which neurons are activated without losing spatial information.
(Rajasegaran et al., 2019) have gone deep with capsule networks (DeepCaps) using the concept of skip connections and 3D convolutions to build a 3D convolution system based on the dynamic routing algorithm. Skip connections within a capsule cell allow good gradient flow in backpropagation, and 3D convolution reduces the number of parameters. The original CapsNets decoder (Sabour et al., 2017) has been replaced by a Deconvolutional decoder, which strengthens the use of reconstruction loss as a regularization term. This decoder is better at reconstructing spatial relationships and at regularizing capsules.
(Phong & Ribeiro, 2019) introduced two advanced models (Capsule 32 V1 for images of 32*32 pixels and Capsule 32 V2 for images of 64*64 pixels) to improve CapsNets by adding more pooling layers to filter image backgrounds and more reconstruction layers to allow better image restoration. Both models showed good performance, but they are more sensitive to changes.

CAPSNETS UPDATES & IMPROVEMENTS
To reduce epistemic and homoscedastic uncertainty, (Ramírez et al., 2020) presented a Bayesian formulation of Capsule networks (BCN). They hybridized Deep Bayesian Neural Networks (DBNN) (Zhu & Zabaras, 2018) with Capsule Networks. The model attained good results with less uncertainty and less error due to performing dropout and including the homoscedastic uncertainty in the loss function and using a regularization term over the linear transformations in the inverse graphics.
As introduced in the RBA algorithm, the SoftMax activation function is used to compute the coupling coefficient $c_{ij}$. (Z. Zhao et al., 2019a) demonstrated that SoftMax prevents CapsNets from finding the optimal coupling to route between low-level and high-level capsules: after multiple routing iterations, it often leads to uniform probabilities. For that reason, SoftMax has been replaced by Max-Min normalization. This normalization reduces the test error to 0.17% on MNIST and allows increasing the number of routing iterations without overfitting.
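The difference between the two normalizations can be illustrated on a single row of routing logits. In this minimal sketch, the Max-Min normalization is written as a linear rescaling of the logits to a range [lower, upper]; the choice of bounds, the axis over which the minimum and maximum are taken, and the function names are assumptions for illustration and should be checked against (Z. Zhao et al., 2019a).

```python
import numpy as np

def softmax(b, axis=1):
    e = np.exp(b - b.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def max_min_normalize(b, axis=1, lower=0.0, upper=1.0, eps=1e-9):
    """Rescale logits linearly so that each row spans [lower, upper]."""
    b_min = b.min(axis=axis, keepdims=True)
    b_max = b.max(axis=axis, keepdims=True)
    return lower + (b - b_min) / (b_max - b_min + eps) * (upper - lower)

# Routing logits of one low-level capsule towards four high-level capsules.
b = np.array([[0.1, 0.2, 0.3, 0.0]])
c_softmax = softmax(b)                 # close to uniform, all values near 0.25
c_max_min = max_min_normalize(b)       # spans the full range from 0 to 1
softmax_spread = float(c_softmax.max() - c_softmax.min())
max_min_spread = float(c_max_min.max() - c_max_min.min())
```

With small logits, SoftMax yields nearly uniform coefficients, whereas the rescaled coefficients keep a clear separation between the preferred and the least preferred parent capsule.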
To reduce the number of CapsNets parameters, (Yi et al., 2019) designed the CapsNetPr network, which uses a pooling method together with decomposition and sharing of the transformation matrix. As a result, the CapsNets parameters have been reduced significantly across different datasets while keeping the performance of CapsNets.

CAPSULE NETWORKS SERIES, ADVANTAGES AND SHORTCOMINGS
Capsule Networks are used for treating various kinds of data such as images, text and videos. The variety of data requires some modifications to the original network structure. Table 2 summarizes the CapsNets series. The majority of CapsNets research papers worked on the RBA algorithm, either in its original implementation or in an improved version, while EMR and SCAE did not get the same attention from researchers. Similarly, only a few works have focused on the use of CapsNets in the 3D domain (Y. Zhao et al., 2019; Duarte et al., 2018; Weiler et al., 2018; Jiménez-Sánchez et al., 2018; Mobiny & Nguyen, 2018).

Advantages
CapsNets are a very promising Deep Learning model, which possesses the capacity of learning relationships among image objects. This architecture has many positive aspects:
-Viewpoint invariance (Hinton et al., 2011).
-The dynamic routing algorithm extracts more meaningful features compared to CNNs (Sabour et al., 2017).
These characteristics make CapsNets more powerful compared to other DL approaches in terms of generalization capability, accuracy, required training time and robustness to viewpoint changes.

Shortcomings
From RBA to Stacked Capsule Auto-encoder, CapsNets have shown good performance in different domains like in image classification, signal treatment, pose extraction, text classification and many other tasks. They are applicable to various kinds of datasets by adapting the architecture or the learning algorithm to the specificity of the data. Nevertheless, Capsule Networks suffer some drawbacks. Routing by agreement is not optimal for document classification, unlike for image classification, due to the high variability in a text (Kim et al., 2020).
Although CapsNets showed impressive results on the MNIST dataset and did well on SVHN, they still perform poorly on CIFAR10, even when going deep in the capsule network with DeepCaps (Rajasegaran et al., 2019), which achieves an error of 8.99%, higher than the current state-of-the-art error rate of 3.47%. The high error rate can be explained by the complexity of the background and the intra-class variation of CIFAR10.
A downside of CapsNets is the high number of parameters to be trained (School of Computing, Northwestern Polytechnical University, Xi'an 710072, Shaanxi, P.R. China et al., 2019). With a small input image of 28x28, the original CapsNets architecture needs 8.2 M trainable parameters. More than half of these parameters come from the PrimaryCaps layer that executes reshaping and dynamic routing operations. The larger the images to be processed, the greater the number of parameters to be trained. DeepCaps (Rajasegaran et al., 2019) managed to reduce the number of parameters by 68%, while (Xiong et al., 2019; Yi et al., 2019) used a pooling method, which loses meaningful information.
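The 8.2 M figure can be reproduced with a short parameter count. The layer dimensions below (a 9x9 convolution with 256 kernels, a 9x9 PrimaryCaps convolution producing 32 maps of 8-dimensional capsules on a 6x6 grid, 10 class capsules of dimension 16, and a 512-1024-784 decoder) are taken from the MNIST architecture in (Sabour et al., 2017) and are assumptions with respect to this survey; the helper function names are hypothetical.

```python
def conv_params(kernel_size, channels_in, channels_out):
    """Weights plus biases of a square 2D convolution."""
    return kernel_size * kernel_size * channels_in * channels_out + channels_out

def dense_params(n_in, n_out):
    """Weights plus biases of a fully connected layer."""
    return n_in * n_out + n_out

conv1 = conv_params(9, 1, 256)                     # first convolutional layer
primary_caps = conv_params(9, 256, 32 * 8)         # 32 maps of 8D capsules
n_primary = 6 * 6 * 32                             # 1152 primary capsules
class_caps = n_primary * 10 * 8 * 16               # one 8x16 matrix W_ij per capsule pair
decoder = (dense_params(10 * 16, 512)
           + dense_params(512, 1024)
           + dense_params(1024, 28 * 28))

total = conv1 + primary_caps + class_caps + decoder
primary_share = primary_caps / total
print(total, round(primary_share, 3))
```

Under these assumptions the PrimaryCaps convolution alone accounts for roughly two thirds of the total, which is consistent with the statement above.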
The learning process of RBA is slow due to the routing process, which requires a loop to refine the coupling coefficient (Z. Zhao et al., 2019b). Moreover, CapsNets require more computational resources since the outputs of primary capsules are activity vectors rather than scalars, which requires more memory.

CONCLUSION AND DIRECTIONS FOR FUTURE WORK
In this paper, Capsule networks have been introduced with their main progress steps: Transforming Auto-encoders, Routing by Agreement Between Capsules, Matrix capsules with EM routing and Stacked Capsule Auto-encoders. The advantages of grouping extracted features into capsules to keep all input information have been explained, as well as the learning algorithms, architecture and CapsNets series. Capsule networks guarantee equivariant properties, which make the network robust when the input undergoes a transformation. Furthermore, CapsNets achieved very promising results with a small training dataset and without overfitting. However, they need to be improved to perform well with multi-class data and complex data such as CIFAR10. These Deep Learning networks need more experiments, research and tests to explore their maximum capacity. In addition, more attention to EM Routing and SCAE is necessary to make them more powerful and applicable to different datasets and to realize the full potential of CapsNets. New insights could come from going deep with EM routing and Stacked Capsule Auto-encoders as advanced CapsNets, as well as from reducing the complexity of these models and combining Capsule networks with other Deep Learning methods. Furthermore, self-driving cars could take advantage of the CapsNets' accuracy and robustness against transformations made on inputs to trick the network.