Searching for Efficient Multi-Scale Architectures for Dense Image Prediction
Introduction
The design of neural network architectures is an important component of the success of machine learning and data science projects. In recent years, the field of Neural Architecture Search (NAS) has emerged, which studies how to automatically find an optimal neural architecture for a given task in a well-defined architecture space. The resulting architectures have often outperformed networks designed by human experts on tasks such as image classification and natural language processing [2,3,4,5]. This paper presents a method for finding a neural architecture that performs well on the task of dense image segmentation.
Motivation
The success of deep neural networks (DNNs) is largely due to the fact that they greatly reduce the work of feature engineering, since a DNN can automatically extract useful features from the raw input. However, this created a new type of engineering work: network engineering. To extract features successfully, you need a suitable network architecture. So the engineering work has in effect shifted from feature engineering to designing a network that can better abstract useful features.
The motivation for NAS is that, since there is no guiding theory on how to design the optimal network architecture, and given abundant computational resources, one intuitive solution is to define a finite search space and let computers do the dirty work of searching for structures and hyperparameters.
NAS Overview
NAS essentially turns a design problem into a search problem. As a search problem in general, we need a clear definition of three things:
 Search space
 Search strategy
 Performance Estimation Strategy
The search space is intuitive to understand: it is the hyperparameter space in which we look for our optimal solution. In the field of NAS, the search space depends heavily on the assumptions we make about the neural architecture. The search strategy details how to explore the search space. The performance estimation strategy specifies how to evaluate a model once we have found a set of hyperparameters. In the field of NAS, the goal is typically to find architectures that achieve high predictive performance on unseen data [6].
We will take a deep dive into the above three dimensions of NAS in the following sections.
Search Space
There are typically three ways of defining the search space.
1) Chain-structured neural networks
[pic] [6] A chain-structured network can be viewed as a sequence of n layers, where layer [math] i[/math] receives its input from layer [math] i-1[/math] and its output serves as the input to layer [math] i+1[/math].
The search space is then parametrized by: 1) the number of layers n, 2) the type of operation that can be executed at each layer, and 3) the hyperparameters associated with each layer.
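As a rough illustration of this parametrization, the sketch below draws one random chain-structured architecture. The layer types, hyperparameter ranges, and the `max_layers` bound are all hypothetical choices made for the example; they do not come from the paper or from [6].

```python
import random

# Hypothetical layer types and per-layer hyperparameter choices (illustration only).
LAYER_TYPES = {
    "conv": {"kernel": [1, 3, 5], "filters": [16, 32, 64]},
    "pool": {"size": [2, 3]},
    "dense": {"units": [64, 128]},
}

def sample_chain(rng, max_layers=6):
    """Sample a chain: the number of layers n, then an operation and its hyperparameters per layer."""
    n = rng.randint(1, max_layers)              # 1) number of layers
    layers = []
    for _ in range(n):
        op = rng.choice(sorted(LAYER_TYPES))    # 2) type of operation
        hparams = {name: rng.choice(values)     # 3) hyperparameters of that operation
                   for name, values in LAYER_TYPES[op].items()}
        layers.append((op, hparams))
    return layers

architecture = sample_chain(random.Random(0))
```

Each element of `architecture` describes layer i, which implicitly takes the output of layer i-1 as its input.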
2) Multi-branch networks
[pic] [6] This architecture allows significantly more degrees of freedom: it allows shortcuts and parallel branches. Some of the ideas are inspired by ResNet [7] https://arxiv.org/pdf/1512.03385.pdf
The search space includes that of chain-structured networks, with the additional freedom of adding shortcut connections and allowing parallel branches.
3) Cell/Block
[pic] [6] This architecture defines a cell which is used as the building block of the neural network. The search space includes the internal structure of the cell and how to combine these blocks to form the resulting architecture.
4) What they used in this paper
[pic] [1] This paper's approach is very close to number 3 above.
The paper defines two components: the "network backbone" and a cell unit called the "DPC". The network backbone's job is to take an input image as a tensor and return a feature map f that is supposed to be a good abstraction of the image. The DPC, short for Dense Prediction Cell, is what this paper introduces. The search space consists of the choice of network backbone and the internal structure of the DPC.
For the network backbone, they simply choose from existing mature architectures, such as MobileNet-v2 and InceptionNet. For the structure of the DPC, they define a smaller unit called a branch. A branch is a triple (Xi, OP, Yi), where Xi is an input tensor, OP is the operation applied to that tensor, and Yi is the resulting output tensor.
In the paper, each DPC consists of 5 branches, to balance expressivity and computational tractability.
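One possible way to represent a DPC in code is sketched below. It assumes that branch i reads either the backbone feature map or the output of an earlier branch, which is what the count of i × 81 options for branch i suggests. The `Branch` class, the `validate_dpc` helper, and the tuple encoding of operations are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass(frozen=True)
class Branch:
    # Which tensor the branch reads (X_i): 0 is the backbone feature map,
    # k >= 1 is the output Y_k of the k-th branch. (Assumed encoding.)
    input_index: int
    # The operation OP, encoded as a tuple such as ("conv1x1",) or
    # ("atrous_sep_conv3x3", 6, 3). (Hypothetical encoding.)
    op: Tuple

def validate_dpc(branches: List[Branch]) -> None:
    """Check that every branch only reads a tensor that already exists."""
    for i, branch in enumerate(branches, start=1):
        # Branch i may read the backbone (0) or any of Y_1 .. Y_{i-1}: i choices.
        if not 0 <= branch.input_index < i:
            raise ValueError(f"branch {i} reads undefined input {branch.input_index}")

dpc = [
    Branch(0, ("conv1x1",)),
    Branch(1, ("atrous_sep_conv3x3", 6, 3)),
    Branch(0, ("avg_spp", 2, 2)),
]
validate_dpc(dpc)  # passes: every input index refers to an earlier tensor
```

The output tensor Yi is left implicit here: it is whatever `op` produces when applied to the selected input.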
The operator space, OP, is defined as the following set of functions:
 Convolution with a 1 × 1 kernel.
 3×3 atrous separable convolution with rate rh×rw, where rh and rw ∈ {1, 3, 6, 9, . . . , 21}.
 Average spatial pyramid pooling with grid size gh × gw, where gh and gw ∈ {1, 2, 4, 8}.
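To keep the search affordable, the proxy task used to score each candidate (see Performance Evaluation Strategy) relies on three measures:
 Use a smaller backbone for the proxy task
 Cache the feature maps produced by the network backbone on the training set and build a single DPC directly on top of them
 Stop training early: train for 30k iterations with a batch size of 8
The floor-prediction method of Parnandi et al. discussed under Related Work uses three classifiers:
 Classifier to predict whether the user is indoors or outdoors
 Classifier to identify the activity of the user, i.e. walking, standing still, etc.
 Classifier to measure the displacement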

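Counting the operators listed above (one 1 × 1 convolution, 8 × 8 atrous rate pairs, and 4 × 4 pooling grids), the operator space has 1 + 8 × 8 + 4 × 4 = 81 functions. The i-th branch can take any of i tensors as its input, which gives i × 81 possible options for that branch. Therefore, for B = 5 branches, the search space size is B! × 81^B ≈ 4.2 × 10^11 configurations.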
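The arithmetic above can be checked with a short enumeration. The sketch assumes the elided atrous rates are 12, 15, and 18, so that the set {1, 3, 6, 9, ..., 21} has the 8 values implied by the 8 × 8 term. The operator names are hypothetical labels.

```python
import math
from itertools import product

# Assumption: the "..." in {1, 3, 6, 9, ..., 21} stands for 12, 15, 18 (8 rates in total).
ATROUS_RATES = [1, 3, 6, 9, 12, 15, 18, 21]
POOL_GRIDS = [1, 2, 4, 8]

def operator_space():
    """Enumerate every operator as a tuple (hypothetical labels)."""
    ops = [("conv1x1",)]
    ops += [("atrous_sep_conv3x3", rh, rw) for rh, rw in product(ATROUS_RATES, repeat=2)]
    ops += [("avg_spp", gh, gw) for gh, gw in product(POOL_GRIDS, repeat=2)]
    return ops

def search_space_size(num_branches):
    """Branch i has i input choices and len(operator_space()) operators: B! * 81^B in total."""
    n_ops = len(operator_space())
    return math.factorial(num_branches) * n_ops ** num_branches

print(len(operator_space()))   # 81
print(search_space_size(5))    # 418414128120, i.e. about 4.2e11
```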
Search Strategy
Common search strategies in the field of NAS include reinforcement learning, random search, and evolutionary algorithms. The one used in the paper is random search. It samples points from the search space uniformly at random, and also samples some points close to the best point observed so far. The authors cite another paper claiming that random search is competitive with reinforcement learning and other learning techniques [8]. [J. Long, E. Shelhamer, and T. Darrell. Fully convolutional networks for semantic segmentation. In CVPR, 2015.] For the implementation, they used Google Vizier, a search tool for black-box optimization. [D. Golovin, B. Solnik, S. Moitra, G. Kochanski, J. Karro, and D. Sculley. Google vizier: A service for blackbox optimization. In SIGKDD, 2017.] It is not open source, but there is an open-source implementation of it: https://github.com/tobegit3hub/advisor.
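The idea of mixing uniform samples with samples near the current best point can be sketched as follows. This is not Google Vizier's algorithm: the function names, the `explore_prob` mixing parameter, the single-slot perturbation, and the toy scoring function are all assumptions made for illustration.

```python
import random

def random_search(sample_fn, perturb_fn, score_fn, n_trials, explore_prob=0.5, seed=0):
    """Random search mixing uniform samples with samples near the best point so far."""
    rng = random.Random(seed)
    best, best_score = None, float("-inf")
    history = []
    for _ in range(n_trials):
        if best is None or rng.random() < explore_prob:
            candidate = sample_fn(rng)           # uniform draw from the search space
        else:
            candidate = perturb_fn(best, rng)    # draw close to the current best point
        score = score_fn(candidate)
        history.append((candidate, score))
        if score > best_score:
            best, best_score = candidate, score
    return best, best_score, history

# Toy search space: 5 slots, each holding one of 81 operator indices.
def sample_uniform(rng):
    return tuple(rng.randrange(81) for _ in range(5))

def perturb(point, rng):
    # Resample a single slot and keep the other four unchanged.
    slot = rng.randrange(len(point))
    return point[:slot] + (rng.randrange(81),) + point[slot + 1:]

def toy_score(point):
    # Stand-in for the proxy-task score; the real score requires training a network.
    return -sum(abs(v - 40) for v in point)

best, best_score, history = random_search(sample_uniform, perturb, toy_score, n_trials=200)
```

In the real setting `score_fn` is by far the most expensive call, which is why the cost of evaluating a single candidate matters so much.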
Performance Evaluation Strategy
Evaluation in this particular task is tricky, because what we are evaluating is a neural network: to evaluate it, we first need to train it. Since we are doing pixel-level classification on high-resolution images, the naive approach would require a tremendous amount of computational resources.
The paper solves this by defining a proxy task: a task that requires significantly fewer computational resources while still giving a good estimate of the network's performance. In most NAS work on image classification, the proxy task is to train the network on lower-resolution images. The assumption is that if the network performs well on lower-resolution images, it should perform reasonably well on higher-resolution images.
However, this approach does not work in this case, because dense prediction tasks innately require high-resolution images as training data. The paper instead makes the proxy task cheap in three ways: a smaller backbone, cached backbone feature maps, and early stopping.
Training with the large-scale backbone without fixing its weights would take about one week per network on a P100 GPU, whereas the proxy task runs in 90 minutes. They then rank the architectures by their proxy-task score, choose the top 50, and fully evaluate those.
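The rank-then-re-evaluate procedure can be sketched as a two-stage selection. The function and parameter names are hypothetical, and the two scoring functions are toy stand-ins: in the paper the cheap score comes from the 90-minute proxy task and the expensive one from full training.

```python
def two_stage_selection(candidates, proxy_score, full_score, top_k=50):
    """Rank all candidates with the cheap score, then fully evaluate only the top_k."""
    ranked = sorted(((proxy_score(c), c) for c in candidates),
                    key=lambda pair: pair[0], reverse=True)
    shortlist = [c for _, c in ranked[:top_k]]
    # The expensive evaluation runs only on the shortlist.
    best_score, best = max(((full_score(c), c) for c in shortlist),
                           key=lambda pair: pair[0])
    return best, best_score, shortlist

# Toy example: the proxy score is a slightly biased estimate of the full score.
candidates = list(range(1000))
best, best_score, shortlist = two_stage_selection(
    candidates,
    proxy_score=lambda c: -abs(c - 510),
    full_score=lambda c: -abs(c - 500),
)
```

The scheme only works if the proxy ranking correlates well with the full ranking; a candidate that the proxy task ranks below the cutoff is never fully evaluated.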
The evaluation metric they used is mIOU, the mean intersection over union, computed at the pixel level. For each class, it is the area of the intersection of the ground truth and the prediction divided by the area of their union, and these per-class values are averaged.
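A minimal sketch of the metric on flattened label maps is given below. Skipping classes that appear in neither the ground truth nor the prediction is one common convention and is an assumption here, not something stated in the paper.

```python
def mean_iou(y_true, y_pred, num_classes):
    """Mean intersection over union over classes, for flat sequences of pixel labels."""
    ious = []
    for c in range(num_classes):
        pairs = list(zip(y_true, y_pred))
        intersection = sum(1 for t, p in pairs if t == c and p == c)
        union = sum(1 for t, p in pairs if t == c or p == c)
        if union > 0:                 # assumed convention: ignore absent classes
            ious.append(intersection / union)
    return sum(ious) / len(ious) if ious else 0.0

# Four pixels, two classes.
# Class 0: intersection 1, union 2, IoU 1/2. Class 1: intersection 2, union 3, IoU 2/3.
print(mean_iou([0, 0, 1, 1], [0, 1, 1, 1], num_classes=2))  # (1/2 + 2/3) / 2 = 7/12
```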
Result
This method achieves state-of-the-art performance on many datasets. The following table quantifies the performance gains.
[pic] They chose a modified Xception network as the backbone, and the following is the resulting architecture of the DPC.
[pic]
Related Work
In general, previous work falls under two categories. The first category consists of classification methods based on the user's activity. Some current methods leverage the user's activity, which is inferred from the offset in their movement, to make the prediction [2]. These activities include running, walking, and moving through the elevator. The second set of methods focuses on the use of a barometer, which measures atmospheric pressure and can therefore capture changes in altitude.
Avinash Parnandi and his coauthors used multiple classifiers to predict the floor level [2]. Their algorithmic process chains classifiers for indoor/outdoor detection, the user's activity, and displacement.
One downside of this work is that achieving high accuracy requires the user's step size, so the method relies heavily on pretraining for the specific user. This would not be practical in a real-world application.
Song and his colleagues model the cause of ascent, that is, whether the ascent resulted from taking the elevator, stairs, or escalator [3]. Then, using infrastructure support from the buildings as well as additional tuning, they are able to predict the floor level.
This method also suffers from relying on data specific to the building.
Overall, these methods suffer from relying on pretraining to a specific user, needing additional infrastructure support, or needing data specific to the building. The method proposed in this paper aims to predict floor level without these constraints.
Future work and real-world applications
The authors suggest that increasing the number of branches in the DPC might yield a further performance gain on the image segmentation task. However, random search in an exponentially growing space may become more challenging, so a more intelligent search strategy may be needed.
There are some real-world applications that already deploy NAS techniques in production. Two good examples are Google AutoML and Microsoft Custom Vision AI [9, 10]: https://cloud.google.com/automl/ https://azure.microsoft.com/enus/services/cognitiveservices/customvisionservice/
The search technique described in this paper may be deployed in production if the cost can be driven down.
Critique
1. Rich man's game
The technique described in the paper can only be applied by parties with abundant computational resources, such as Google, Facebook, and Microsoft. For small research groups and companies, this method is not that useful, because they lack the computational power it requires. Future improvement is needed in designing an even more efficient proxy task, one that can tell whether a network will perform well while requiring fewer computations. But here is the irony: if we could tell whether a network will perform well without training it, we would not need a search technique in the first place. So everything comes back to the fact that there is no guiding theory of deep learning.
2. Benefit/Cost ratio
The technique here does outperform human-designed networks in many cases, but the gain is not huge. On the Cityscapes dataset, the performance gain is 0.7%, whereas on the PASCAL-Person-Part dataset the gain is 3.7%, and on the PASCAL VOC 2012 dataset it does not outperform human experts (all measured by mIOU). Pushing the state of the art is always worth celebrating, but one could argue that after spending so many resources on the search, the computer should achieve a superhuman performance level (like a chess engine versus a chess grandmaster). In practice, one may simply go with the current state-of-the-art model to avoid the expensive search cost.
=References=