Adacompress: Adaptive compression for online computer vision services

Presented by

Ahmed Hussein Salamah

Introduction

Big data and deep learning have combined to drive the success of artificial intelligence, which in turn increases the burden on network speed, computational complexity, and storage in many applications. Image classification is one of the most important computer vision tasks, and it depends heavily on Deep Neural Networks (DNNs) for its performance. Recently, image classification models have increasingly been hosted in the cloud so that computational power can be shared among users, as mentioned in this paper (e.g., SenseTime, Baidu Vision, and Google Vision). Most researchers in the literature work on improving the structure and increasing the depth of DNNs to achieve better performance, focusing on how features are represented and crafted using Convolutional Neural Networks (CNNs). The most well-known image classification datasets (e.g., ImageNet) are compressed using JPEG, a compression technique that is optimized for the Human Visual System (HVS) rather than for machines (i.e., DNNs). To align the compression with DNNs instead of the HVS, the authors reconfigure JPEG while maintaining the same classification accuracy.

Methodology

One of the major parameters that can be changed in the JPEG pipeline is the quantization table, which is the main source of the artifacts introduced into the image and is what makes JPEG a lossy compression scheme. The authors were motivated to change the JPEG configuration to optimize the upload rate for different cloud computer vision services without requiring prior knowledge of the original model or dataset. They use Deep Reinforcement Learning (DRL) in an online manner to choose the quantization level at which an image is uploaded to the cloud computer vision model; this is described as the only approach that designs an adaptive JPEG configuration based on an RL mechanism.

The approach is built on an interactive training environment that can represent any cloud computer vision service. The authors needed a tool to evaluate and predict how a quantization level performs on an uploaded image, so they used a deep Q neural network agent. The agent is trained with a reward function that considers two optimization objectives, accuracy and image size, and it interacts with the environment iteratively. The environment is exposed to different images with different amounts of redundant information, so each image needs an adaptive choice of compression level suited to the model. The authors therefore designed an explore-exploit mechanism to train the agent on different scenery, implemented in the deep Q agent as an inference-estimate-retrain mechanism that controls when the training procedure is restarted. The authors support their approach with analysis and insight from Grad-CAM, showing patterns in each image together with its corresponding quality factor. The deep model responds differently to each image: images with large smooth areas are more sensitive to compression, while images with complex textures are more robust to compression.

Problem Formulation

The authors formulate the problem by referring to the cloud deep learning service as $\vec{y}_i = M(x_i)$, which predicts a list of results $\vec{y}_i$ for an input image $x_i$. For a reference input $x_{\rm ref} \in X_{\rm ref}$ the output is $\vec{y}_{\rm ref} = M(x_{\rm ref})$, which is treated as the ground truth label, and $\vec{y}_c = M(x_c)$ is the output for the compressed image $x_{c}$ with quality factor $c$.

\begin{align} \tag{1} \label{eq:accuracy} \mathcal{A} =& \sum_{k}\min_jd(l_j, g_k) \\ & l_j \in \vec{y}_c, \quad j=1,...,5 \nonumber \\ & g_k \in \vec{y}_{\rm ref}, \quad k=1, ..., {\rm length}(\vec{y}_{\rm ref}) \nonumber \\ & d(x, y) = 1 \ \text{if} \ x=y \ \text{else} \ 0 \nonumber \end{align}
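As a rough illustration of Eq. \eqref{eq:accuracy}, the sketch below reads the metric as a count of reference labels $g_k$ that also appear among the five labels $l_j$ predicted for the compressed image. This reading is an assumption: it takes the indicator $d$ as defined (1 on a match) and scores a reference label once if any of the five predictions matches it. The function name and the example labels are hypothetical.

```python
def top5_match_accuracy(compressed_labels, reference_labels):
    """Count reference labels that also appear in the compressed top-5.

    Assumed reading of Eq. (1): d(l_j, g_k) = 1 when the labels are equal,
    and each reference label g_k scores once if any l_j matches it.
    """
    top5 = compressed_labels[:5]  # l_j, j = 1..5
    return sum(1 for g in reference_labels if any(l == g for l in top5))


# Hypothetical label lists standing in for M(x_ref) and M(x_c).
y_ref = ["tabby", "tiger cat", "lynx", "egyptian cat", "fox"]
y_c = ["tabby", "lynx", "fox", "dog", "wolf"]

score = top5_match_accuracy(y_c, y_ref)
print(score)  # 3 ("tabby", "lynx", and "fox" survive compression)
```

A higher count means the compressed image preserves more of the predictions the service returned for the reference image.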

The authors divide the datasets into contextual groups $X$ according to [*], and they compare results using the compression ratio $\Delta s = \frac{s_c}{s_{\rm ref}}$, where $s_{c}$ is the compressed size and $s_{\rm ref}$ is the original size, together with the accuracy metric $\mathcal{A}_c$, which is calculated from the Hamming distance between the Top-5 softmax outputs for the original and compressed images, as shown in Eq. \eqref{eq:accuracy}.

In the RL design stage, continuous numerical vectors serve as the input features to the DRL agent, which is a Deep Q Network (DQN). This approach poses two challenges: (1) the state space of the RL problem is too large to cover, so the neural network would need more layers and nodes, which makes the DRL agent hard to converge and time-consuming to train; (2) the DRL agent always starts from a random initial state and has to begin converging toward high reward before the DQN can be trained effectively. The authors address these challenges by using a small pre-trained model, MobileNetV2, as a feature extractor $\mathcal{E}$ because it is lightweight and well suited to image classification; it is kept fixed while the Q Network $\phi$ is trained. The output of the last convolution layer of $\mathcal{E}$ is fed as the input to the Q Network $\phi$, so the RL agent's policy is updated by optimizing the parameters of $\phi$.

Reinforcement learning framework

This paper describes the reinforcement learning problem with $\{\mathcal{X}, M\}$ as the emulator environment, where $\mathcal{X}$ is the contextual information created from the user's input $x$ and $M$ is the backend cloud model. Each RL framework must define its actions and states: the action is one of 10 discrete quality levels ranging from 5 to 95 in steps of 10, and the state is the feature extractor's output $\mathcal{E}(J(\mathcal{X}, c))$, where $J(\cdot)$ is the JPEG output at a specific quantization level $c$. The optimal quantization level at time $t$ is $c_t = {\rm argmax}_c Q(\phi(\mathcal{E}(f_t)), c; \theta)$, where $Q(\phi(\mathcal{E}(f_t)), c; \theta)$ is the action-value function and $\theta$ denotes the parameters of the Q network $\phi$. In the training stage of RL, the goal is to minimize a loss function $L_i(\theta_i) = \mathbb{E}_{s, c \sim \rho (\cdot)}\Big[\big(y_i - Q(s, c; \theta_i)\big)^2 \Big]$ that changes at each iteration $i$, where $s = \mathcal{E}(f_t)$ and $f_t$ is the output of the JPEG. The target is $y_i = \mathbb{E}_{s' \sim \{\mathcal{X}, M\}} \big[ r + \gamma \max_{c'} Q(s', c'; \theta_{i-1}) \mid s, c \big]$, where $\rho(s, c)$ is a probability distribution over states $s$ and quality levels $c$ at iteration $i$, and $r$ is the feedback reward.
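The greedy choice of $c_t$ and the target $y_i$ can be illustrated for a single transition with a minimal sketch. It is not the authors' implementation: the Q values are plain Python lists rather than network outputs, the ten levels 5, 15, ..., 95 are an assumed spacing that yields ten actions between 5 and 95, and the default discount $\gamma = 0.9$ and all function names are hypothetical.

```python
# Assumed action set: ten JPEG quality levels 5, 15, ..., 95.
QUALITY_LEVELS = list(range(5, 100, 10))


def select_quality(q_values):
    """Greedy action: c_t = argmax_c Q(s, c), mapped to a quality level."""
    best_action = max(range(len(q_values)), key=lambda a: q_values[a])
    return QUALITY_LEVELS[best_action]


def td_target(reward, next_q_values, gamma=0.9):
    """Single-sample target: y = r + gamma * max_c' Q(s', c')."""
    return reward + gamma * max(next_q_values)


def squared_td_loss(target, q_value):
    """Single-sample loss term: (y - Q(s, c))^2."""
    return (target - q_value) ** 2


# Hypothetical Q values for one state; the action at index 3 scores highest.
q_s = [0.1, 0.2, 0.3, 0.9, 0.4, 0.0, 0.0, 0.0, 0.0, 0.0]
chosen = select_quality(q_s)              # quality level 35
y = td_target(1.0, [0.0, 2.0], gamma=0.5)  # 1.0 + 0.5 * 2.0 = 2.0
loss = squared_td_loss(y, 1.5)             # (2.0 - 1.5)^2 = 0.25
```

In the full method the expectation in $L_i(\theta_i)$ is taken over many sampled transitions, so this single-sample term would be averaged over a minibatch.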

The framework obtains a more accurate estimate for a selected action when the distance between the target and the action-value function's output $Q(\cdot)$ is minimized. Because no feedback signal indicates that an episode has finished, a condition value $T_{\rm start}$ is used instead: training begins only once $t \geq T_{\rm start}$, which guarantees that enough transitions are stored in the memory buffer $\mathcal{D}$ to train on. To create these transitions for the RL agent, random trials are collected to observe how the environment reacts. After some trials and their corresponding rewards have been fetched from the environment, this randomness is decreased as the agent is trained to minimize the loss function $L$, as shown in the algorithm in \cite{li2019adacompress}. The agent thus optimizes its actions on a minibatch from $\mathcal{D}$, drawing on historically optimal experience to train the compression level predictor $\phi$. When the trained predictor $\phi$ is deployed, the RL agent drives the compression engine with the adaptive quality factor $c$ corresponding to the input image $x_{i}$.
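The buffer, the $t \geq T_{\rm start}$ condition, and the decreasing randomness can be sketched with standard DQN building blocks. This is a generic illustration under stated assumptions, not the algorithm from the paper: the class and function names, the multiplicative decay schedule, and the values of `rate` and `floor` are all hypothetical.

```python
import random
from collections import deque


class ReplayBuffer:
    """Fixed-capacity memory of (state, action, reward, next_state) tuples."""

    def __init__(self, capacity):
        self.transitions = deque(maxlen=capacity)  # oldest entries drop out

    def store(self, state, action, reward, next_state):
        self.transitions.append((state, action, reward, next_state))

    def sample(self, batch_size):
        # Uniform minibatch without replacement.
        return random.sample(list(self.transitions), batch_size)

    def __len__(self):
        return len(self.transitions)


def epsilon_greedy(q_values, epsilon):
    """Pick a random action with probability epsilon, else the greedy one."""
    if random.random() < epsilon:
        return random.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])


def ready_to_train(step, t_start):
    """Training starts only once t >= T_start."""
    return step >= t_start


def decay_epsilon(epsilon, rate=0.99, floor=0.05):
    """Assumed schedule: shrink epsilon multiplicatively down to a floor."""
    return max(floor, epsilon * rate)
```

A training loop would call `store` after each interaction with the environment, check `ready_to_train` before sampling a minibatch, and call `decay_epsilon` after each update so that exploration gradually gives way to exploitation.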

The reward function, which evaluates the interaction between the agent and the environment $\{\mathcal{X}, M\}$, should make the reward for a selected quality factor $c$ directly proportional to the accuracy metric $\mathcal{A}_c$ and inversely related to the compression ratio $\Delta s = \frac{s_c}{s_{\rm ref}}$. This gives $R(\Delta s, \mathcal{A}) = \alpha \mathcal{A} - \Delta s + \beta$, where $\alpha$ and $\beta$ are the coefficients that form the linear combination.
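The reward formula translates directly into code. In the short sketch below, the default values of `alpha` and `beta`, the function names, and the example sizes are assumptions for illustration only, since the summary does not state the values the authors used.

```python
def compression_ratio(compressed_size, reference_size):
    """Delta s = s_c / s_ref (smaller means stronger compression)."""
    return compressed_size / reference_size


def reward(delta_s, accuracy, alpha=1.0, beta=0.0):
    """R(delta_s, A) = alpha * A - delta_s + beta."""
    return alpha * accuracy - delta_s + beta


# Hypothetical example: a 100 kB reference image compressed to 25 kB,
# with an accuracy score of 4 for the compressed image.
delta_s = compression_ratio(25_000, 100_000)   # 0.25
r = reward(delta_s, 4, alpha=0.5, beta=1.0)    # 0.5 * 4 - 0.25 + 1.0 = 2.75
```

Raising `alpha` makes the agent favor accuracy over size, while a smaller `alpha` lets the $-\Delta s$ term push it toward more aggressive compression.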