statwiki - User contributions [US]

One-Shot Imitation Learning

2018-04-21T04:27:46Z

F7xia: /* Temporal Dropout */

= Introduction =
We are interested in robotic systems that can perform a variety of complex, useful tasks. To be truly useful in such applications, a robotic system must overcome two challenges: the intent of the task at hand must be communicated to it, and it must be able to perform the manipulations necessary to complete the task. Demonstration is preferable to natural language for teaching robotic systems, as natural language often fails to convey the details and intricacies required for the task. However, current work on learning from demonstrations succeeds only with large amounts of feature engineering or a large number of demonstrations. The proposed model aims to achieve 'one-shot' imitation learning, i.e. learning to complete a new task from just a single demonstration of it, without any other supervision. As input, the proposed model takes the observation of the current instance of a task and a demonstration of successfully solving a different instance of the same task. Strong generalization was achieved by using a soft attention mechanism both on the sequence of actions and states that the demonstration consists of and on the vector of element locations within the environment. The proposed model's performance on a series of block stacking tasks can be viewed at http://bit.ly/nips2017-oneshot.

= Related Work =
While one-shot imitation learning is a novel combination of ideas, each of the components has previously been studied.
* Imitation Learning:
** Behavioural cloning uses supervised learning to map from observations to actions (e.g. [https://papers.nips.cc/paper/95-alvinn-an-autonomous-land-vehicle-in-a-neural-network.pdf (Pomerleau 1988)], [https://arxiv.org/pdf/1011.0686.pdf (Ross et al. 2011)])
** Inverse reinforcement learning estimates a reward function that considers demonstrations as optimal behavior (e.g. [http://ai.stanford.edu/~ang/papers/icml00-irl.pdf (Ng et al. 2000)])
* One-Shot Learning: an object categorization problem in computer vision. Whereas most machine-learning-based object categorization algorithms require training on hundreds or thousands of images and very large datasets, one-shot learning aims to learn information about object categories from one, or only a few, training images.
** Typically a form of meta-learning
** Previously used for a variety of tasks, but all domain-specific
** [https://arxiv.org/abs/1703.03400 (Finn et al. 2017)] proposed a generic solution but excluded imitation learning
* Reinforcement Learning:
** Demonstrated to work on a variety of tasks and environments, in particular on games and robotic control
** Requires a large number of trials and a user-specified reward function
* Multi-task/Transfer Learning:
** Shown to be particularly effective at computer vision tasks
** Not meant for one-shot learning
* Attention Modelling:
** The proposed model makes use of the attention model from [https://arxiv.org/abs/1409.0473 (Bahdanau et al. 2014)]
** The attention modelling over the demonstration is similar in nature to the seq2seq models from the well-known [https://papers.nips.cc/paper/5346-sequence-to-sequence-learning-with-neural-networks.pdf (Sutskever et al. 2014)]

= One-Shot Imitation Learning =

[[File:oneshot1.jpg|1000px]]

The figure above shows the differences between traditional and one-shot imitation learning. In a), the traditional method may require training a separate policy for each of several similar tasks, for example stacking blocks to a height of 2 and to a height of 3. In b), one-shot imitation learning allows the same policy to be used for these tasks given a single demonstration, achieving good performance without any additional system interactions. In c), the policy is trained on a set of different training tasks, with enough examples that the learned results generalize to other similar tasks. Each task has a set of successful demonstrations. Each training iteration uses two demonstrations of the same task: the policy is conditioned on the first demonstration and on an observation taken from the second, and is trained to produce the action that the second demonstration took at that observation.

== Problem Formalization ==
The problem is briefly formalized with the authors describing a distribution of tasks, an individual task, a distribution of demonstrations for this task, and a single demonstration respectively as \[T, \: t\sim T, \: D(t), \: d\sim D(t)\]
In addition, an action, an observation, parameters, and a policy are respectively defined as \[a, o, \theta, \pi_\theta(a|o,d)\]
In particular, a demonstration is a sequence of observation and action pairs \[d = [(o_1, a_1),(o_2, a_2), \ldots ,(o_H , a_H )]\]
Assuming that <math>H </math>, the length or horizon of a demonstration, and some evaluation function $$R_t(d): R^H \rightarrow R$$ are given, and that successful demonstrations are available for each task, the objective is to maximize the expected policy performance over \[t\sim T, d\sim D(t)\].

== Block Stacking Tasks ==
The task family that the authors focus on is block stacking. A user specifies the final configuration in which cubic blocks should be stacked, and the goal is to use a 7-DOF Fetch robotic arm to arrange the blocks in this configuration. The number of blocks and their desired configuration (i.e. the number of towers, the height of each tower, and the order of blocks within each tower) can be varied and encoded as a string. For example, 'abc def' would signify 2 towers of height 3, with block A on block B on block C in one tower, and block D on block E on block F in a second tower. To add complexity, the initial configuration of the blocks can vary and is encoded as a set of 3-dimensional vectors describing the position of each block relative to the robotic arm.
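The string encoding described above can be illustrated with a minimal sketch. The function names <code>parse_layout</code> and <code>stacking_subtasks</code> are hypothetical and not from the paper, and the bottom-up build order is an assumption about how such a layout would naturally be assembled.

```python
def parse_layout(layout):
    """Split a layout string such as 'abc def' into towers, each listed top to bottom."""
    return [list(tower.upper()) for tower in layout.split()]


def stacking_subtasks(layout):
    """Return (source, target) pairs, meaning 'stack source on target'.

    Assumption: every tower is built bottom-up, one block at a time.
    """
    subtasks = []
    for tower in parse_layout(layout):
        bottom_up = tower[::-1]  # the last letter of a tower is its bottom block
        for target, source in zip(bottom_up, bottom_up[1:]):
            subtasks.append((source, target))
    return subtasks


towers = parse_layout("abc def")
subtasks = stacking_subtasks("abc def")
print(towers)    # [['A', 'B', 'C'], ['D', 'E', 'F']]
print(subtasks)  # [('B', 'C'), ('A', 'B'), ('E', 'F'), ('D', 'E')]
```

Each pair corresponds to one stacking subtask of the kind handled by the manipulation network.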

== Algorithm ==
To avoid needing to specify a reward function, the authors train with behavioral cloning and DAGGER, two imitation learning methods that require only demonstrations. In each training step, a list of tasks is sampled, and for each task a demonstration with injected noise is sampled along with some observation-action pairs. Given the current observation and the demonstration as input, the policy is trained against the sampled actions by minimizing an L2 loss for continuous actions and a cross-entropy loss for discrete ones. Adamax is used as the optimizer with a learning rate of 0.001.
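The two loss terms mentioned above can be sketched as follows. This is only an illustration: the squared form of the L2 loss, the unweighted sum of the two terms, and all function names are assumptions rather than details confirmed by the paper.

```python
import math


def l2_loss(predicted, target):
    """Squared L2 distance between a predicted and a demonstrated continuous action."""
    return sum((p - t) ** 2 for p, t in zip(predicted, target))


def cross_entropy_loss(logits, target_index):
    """Negative log-probability of the demonstrated discrete action under softmax(logits)."""
    top = max(logits)  # subtract the maximum for numerical stability
    log_normalizer = top + math.log(sum(math.exp(x - top) for x in logits))
    return log_normalizer - logits[target_index]


def imitation_loss(pred_continuous, demo_continuous, logits, demo_discrete):
    """Assumed combination: plain sum of the continuous and discrete terms."""
    return l2_loss(pred_continuous, demo_continuous) + cross_entropy_loss(logits, demo_discrete)


loss = imitation_loss([1.0, 2.0], [1.0, 4.0], [0.0, 0.0], 0)
```

With two equal logits the discrete term equals log 2, and the continuous term here is 4, so <code>loss</code> is 4 + log 2.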

= Architecture =
The authors propose a novel architecture for imitation learning, consisting of three networks.

While, in principle, a generic neural network could learn the mapping from demonstration and current observation to the appropriate action, the authors propose the following architecture, which they claim as one of the main contributions of the paper and believe will be useful for complex tasks in the future.
Its three modules are the demonstration network, the context network, and the manipulation network.

[[File:oneshot2.jpg|1000px|center]]

== Demonstration Network ==
This network takes a demonstration as input and produces an embedding whose size grows linearly with the number of blocks and with the length of the demonstration.
=== Temporal Dropout ===
For block stacking, demonstrations can span hundreds to thousands of time steps, and training with such long sequences can be demanding in both time and memory usage. Hence, the authors randomly discard a subset of time steps during training, an operation they call 'temporal dropout'. Let p denote the proportion of time steps that are thrown away (in this case p = 95%). The reduced size of the demonstrations also allows multiple downsampled trajectories to be explored during testing to calculate an ensemble estimate. Dilated temporal convolutions and neighborhood attention are then repeatedly applied to the downsampled demonstrations.
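A minimal sketch of temporal dropout is given below. It keeps a fixed fraction 1 - p of the time steps, chosen uniformly at random, and preserves their order. Whether the paper drops a fixed count or drops each step independently is not stated here, so the fixed-count choice, the guarantee of keeping at least one step, and the function name are assumptions.

```python
import random


def temporal_dropout(demonstration, p=0.95, rng=None):
    """Randomly discard a proportion p of time steps, keeping the rest in order.

    Assumption: exactly round(len * (1 - p)) steps are kept, and at least one.
    """
    rng = rng or random.Random()
    num_keep = max(1, round(len(demonstration) * (1 - p)))
    kept_indices = sorted(rng.sample(range(len(demonstration)), num_keep))
    return [demonstration[i] for i in kept_indices]


# A hypothetical demonstration of 1000 (observation, action) pairs.
demo = [("obs%d" % t, "act%d" % t) for t in range(1000)]
short_demo = temporal_dropout(demo, p=0.95, rng=random.Random(0))
print(len(short_demo))  # 50
```

Calling the function several times with different random generators yields several downsampled versions of the same demonstration, which is what makes an ensemble estimate at test time possible.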

=== Neighborhood Attention ===
Since demonstration sizes can vary, a mechanism is needed that is not restricted to fixed-length inputs. Soft attention is one such mechanism, but if it is used to map longer demonstrations to the same fixed length as shorter ones, increasingly large amounts of information may be lost. As a solution, the authors propose having the same number of outputs as inputs, with attention performed on the other inputs relative to the current input.

A query <math>q</math>, a list of context vectors <math>\{c_j\}</math>, and a list of memory vectors <math>\{m_j\}</math> are given as input to soft attention. Each unnormalized attention weight is the inner product of a learned weight vector <math>v</math> with a tanh nonlinearity applied to the sum of the query and the corresponding context vector. The output of the soft attention is the sum of the memory vectors, weighted by the softmax of these weights.

\[\text{Inputs: } q, \{c_j\}, \{m_j\}\]
\[\text{Weights: } w_i \leftarrow v^T\tanh(q+c_i)\]
\[\text{Output: } \sum_i{m_i\frac{\exp(w_i)}{\sum_j{\exp(w_j)}}}\]
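The soft attention equations above can be written out directly. The sketch below uses plain Python lists; the function name and the choice to also return the normalized weights are illustrative assumptions.

```python
import math


def soft_attention(query, contexts, memories, v):
    """Soft attention: w_i = v^T tanh(q + c_i), output = sum_i softmax(w)_i * m_i."""
    scores = [
        sum(vk * math.tanh(qk + ck) for vk, qk, ck in zip(v, query, context))
        for context in contexts
    ]
    top = max(scores)  # subtract the maximum so the exponentials cannot overflow
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(memories[0])
    output = [sum(w * m[k] for w, m in zip(weights, memories)) for k in range(dim)]
    return output, weights


# Two identical context vectors receive equal weight, so the output is the mean memory.
output, weights = soft_attention(
    query=[0.0, 0.0],
    contexts=[[0.0, 0.0], [0.0, 0.0]],
    memories=[[1.0, 0.0], [0.0, 1.0]],
    v=[1.0, 1.0],
)
print(weights)  # [0.5, 0.5]
print(output)   # [0.5, 0.5]
```

Note that the number of memory vectors may vary while the output dimension stays fixed, which is why soft attention can handle variable-length inputs.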

A list of same-length embeddings, coming from a previous neighbourhood attention layer or from a projection of the list of block coordinates, is given as input to neighborhood attention. For each block, two separate linear layers produce a query vector and a context vector, while the memory vector is the position of that block joined with the input embedding for that block. Soft attention is then performed for each block, using its query together with the context vectors and memory vectors of all blocks. The authors claim that the intuition behind this process is to allow each block to provide information about itself relative to the other blocks in the environment. Finally, for each block, a linear transformation is applied to the vector formed by concatenating the input embedding, the result of the soft attention for that block, the block's position, and the robot's state.

For an environment with B blocks:
\[\text{State: } s\]
\[\text{Block}_i\text{: } b_i \leftarrow (x_i, y_i, z_i)\]
\[\text{Embeddings: } h_1^{in}, \ldots, h_B^{in}\]
\[\text{Query}_i\text{: } q_i \leftarrow \text{Linear}(h_i^{in})\]
\[\text{Context}_i\text{: } c_i \leftarrow \text{Linear}(h_i^{in})\]
\[\text{Memory}_i\text{: } m_i \leftarrow (b_i, h_i^{in}) \]
\[\text{Result}_i\text{: } \text{result}_i \leftarrow \text{SoftAttn}(q_i, \{c_j\}_{j=1}^B, \{m_k\}_{k=1}^B)\]
\[\text{Output}_i\text{: } \text{output}_i \leftarrow \text{Linear}(\text{concat}(h_i^{in}, \text{result}_i, b_i, s))\]
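The neighborhood attention steps listed above can be sketched as a single layer in plain Python. The dimensions, the random initialization, and the helper names <code>linear</code> and <code>make_linear</code> are hypothetical choices made only so the example runs; they are not taken from the paper.

```python
import math
import random


def linear(layer, x):
    """Apply an affine map; layer is a (weights, bias) pair."""
    weights, bias = layer
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weights, bias)]


def make_linear(n_out, n_in, rng):
    """Hypothetical initializer: small random weights and zero bias."""
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out


def soft_attention(query, contexts, memories, v):
    """w_i = v^T tanh(q + c_i); returns the softmax-weighted sum of the memories."""
    scores = [sum(vk * math.tanh(qk + ck) for vk, qk, ck in zip(v, query, c))
              for c in contexts]
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [sum(e / total * m[k] for e, m in zip(exps, memories))
            for k in range(len(memories[0]))]


def neighborhood_attention(state, positions, embeddings, params):
    """One layer: every block attends over all blocks and emits a new embedding."""
    queries = [linear(params["query"], h) for h in embeddings]
    contexts = [linear(params["context"], h) for h in embeddings]
    memories = [list(b) + list(h) for b, h in zip(positions, embeddings)]
    outputs = []
    for q, h, b in zip(queries, embeddings, positions):
        result = soft_attention(q, contexts, memories, params["v"])
        features = list(h) + result + list(b) + list(state)  # concat(h_i, result_i, b_i, s)
        outputs.append(linear(params["output"], features))
    return outputs


rng = random.Random(0)
num_blocks, embed_dim, attn_dim, state_dim = 4, 8, 6, 5  # illustrative sizes
params = {
    "query": make_linear(attn_dim, embed_dim, rng),
    "context": make_linear(attn_dim, embed_dim, rng),
    "v": [rng.uniform(-0.1, 0.1) for _ in range(attn_dim)],
    "output": make_linear(embed_dim, embed_dim + (3 + embed_dim) + 3 + state_dim, rng),
}
state = [0.0] * state_dim
positions = [(float(i), 0.0, 0.0) for i in range(num_blocks)]
embeddings = [[rng.uniform(-1, 1) for _ in range(embed_dim)] for _ in range(num_blocks)]
outputs = neighborhood_attention(state, positions, embeddings, params)
```

The layer returns one embedding per block, so it can be stacked and applied to environments with any number of blocks using the same parameters.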

== Context network ==
This network takes the current state and the embedding produced by the demonstration network as inputs and outputs a fixed-length "context embedding" which captures only the information relevant for the manipulation network at this particular step.
=== Attention over demonstration ===
The current state is used to compute a query vector, which is then used to attend over all the time steps of the demonstration embedding. Since there are multiple blocks at each time step, the weights for the individual blocks are summed together to produce a single scalar for each time step. Neighbourhood attention is then applied several times, using an LSTM with untied weights, since the information at each time step needs to be propagated to each block's embedding.
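The summation of per-block weights into one scalar per time step can be sketched as follows. The per-block scores are taken as given inputs, and applying the softmax after the summation is an assumption, since the text does not specify the order; the function name is also hypothetical.

```python
import math


def attend_over_demonstration(scores, demo_embedding):
    """Temporal attention over a demonstration embedding.

    scores[t][b]: unnormalized weight of block b at time step t (assumed given).
    demo_embedding[t][b]: embedding vector of block b at time step t.
    Returns one summary vector per block, plus the weight of each time step.
    """
    per_step = [sum(block_scores) for block_scores in scores]  # one scalar per step
    top = max(per_step)
    exps = [math.exp(s - top) for s in per_step]
    total = sum(exps)
    weights = [e / total for e in exps]
    num_blocks = len(demo_embedding[0])
    dim = len(demo_embedding[0][0])
    summary = [
        [sum(w * step[b][k] for w, step in zip(weights, demo_embedding)) for k in range(dim)]
        for b in range(num_blocks)
    ]
    return summary, weights


# Three time steps, two blocks, one-dimensional embeddings, uniform scores.
scores = [[0.0, 0.0], [0.0, 0.0], [0.0, 0.0]]
demo_embedding = [[[float(t)], [float(t) + 10.0]] for t in range(3)]
summary, weights = attend_over_demonstration(scores, demo_embedding)
```

The summary has one vector per block regardless of how many time steps the demonstration contains, which matches the observation that the result is independent of the demonstration length but still depends on the number of blocks.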

Performing attention over the demonstration yields a vector whose size is independent of the demonstration size; however, it is still dependent on the number of blocks in the environment, so it is natural to now attend over the state in order to get a fixed-length vector.
=== Attention over current state ===
The authors propose that in general, within each subtask, only a limited number of blocks are relevant for performing the subtask. If the subtask is to stack A on B, then intuitively, one would suppose that only blocks A and B are relevant, and perhaps any blocks that may be blocking access to either A or B. This is not enforced during training, but once soft attention is applied to the current state to produce a fixed-length context embedding, the authors believe that the model does indeed learn in this way.

== Manipulation network ==
Given the context embedding as input, this simple feedforward network, an MLP, decides on the particular action needed to complete the subtask of stacking one particular 'source' block on top of another 'target' block. Since the network in the paper can only take into account the source and target blocks, it may take suboptimal paths. For example, changing [ABC, D] to [C, ABD] could be done in one motion if it were possible to manipulate two blocks at once. The manipulation network is the simplest part of the architecture and leaves room for expansion in future work.

= Experiments =
The proposed model was tested on the block stacking tasks. The experiments were designed to answer the following questions:
* How does training with behavioral cloning compare with DAGGER?
* How does conditioning on the entire demonstration compare to conditioning on the final state?
* How does conditioning on the entire demonstration compare to conditioning on a “snapshot” of the trajectory?
* Can the authors' framework generalize to tasks that it has never seen during training?
For the experiments, 140 training tasks and 43 testing tasks were collected, each with between 2 and 10 blocks and a different desired final layout. Over 1000 demonstrations for each task were collected using a hard-coded policy rather than a human user. The authors compare four different variants in these experiments:
* Behavioural cloning used to train the proposed model
* DAGGER used to train the proposed model
* The proposed model, trained with DAGGER, but conditioned on the desired final state rather than an entire demonstration
* The proposed model, trained with DAGGER, but conditioned on a 'snapshot' of the environment at the end of each subtask (i.e. every time a block is stacked on another block)

== Performance Evaluation ==
[[File:oneshot3.jpg|1000px]]

At each time step the most confident action is chosen, in 100 different task configurations, and results are averaged over tasks that had the same number of blocks. The results suggest that the performance of each of the architectures is comparable to that of the hard-coded policy which they aim to imitate. Performance degrades similarly across all architectures and the hard-coded policy as the number of blocks increases. On the harder tasks, conditioning on the entire demonstration led to better performance than conditioning on snapshots or on the final state. The authors believe this may be due to two factors: the final state alone carries less information, and temporal dropout acts as a regularizer through data augmentation when conditioning on the full demonstration, an effect that is absent when conditioning only on the snapshots or the final state. DAGGER and behavioral cloning performed comparably well. As mentioned above, noise injection was used in training to improve performance; in practice, additional noise can still be injected, but some may already come from other sources.

== Visualization ==
The authors visualize the attention mechanisms underlying the main policy architecture to better understand how it operates. They are mainly interested in two kinds of attention: one where the policy attends to different time steps in the demonstration, and the other where the policy attends to different blocks in the current state. The figures below show some of the policy attention heatmaps over time.

[[File:paper6_Visualization.png|800px]]

= Conclusions =
The proposed model successfully learns to complete new instances of a new task from just a single demonstration. The model was demonstrated to work on a series of block stacking tasks. The authors propose several extensions including enabling few-shot learning when one demonstration is insufficient, using image data as the demonstrations, and attempting many other tasks aside from block stacking.

= Criticisms =
While the paper shows an impressive result, the ability to learn a new task from just a single demonstration, a few points need clearing up.
First, the authors use a hard-coded policy in their experiments rather than a human. It is clear that the performance of this policy begins to degrade quickly as the complexity of the task increases. It would be useful to know what this hard-coded policy actually was, and whether the proposed model could still achieve comparable performance given a more successful demonstration, perhaps one by a human user. Given the current popularity of adversarial examples, it would also be interesting to see the performance when conditioned on an "adversarial" demonstration that achieves the correct final state but intentionally performs complex or obfuscated steps to get there.
Second, it would be useful to see the model's performance on a more complex family of tasks than block stacking. Although each block stacking task is slightly different, the differences may turn out to be insignificant compared to other tasks that this model should work on if it is to be a general imitation learning architecture; intuitively, the space of all possible moves and configurations is not large for this task. The 'one-shot' label is also somewhat misleading: many demonstrations seem to be needed first to train a generic policy that can generalize, and only then is a single demonstration of a new task expected to suffice. In effect, some form of pre-training is involved. Regardless, this work is a big step forward for imitation learning, allowing a wider range of tasks, for which there is little training data and no reward function available, to be solved successfully.

= Illustrative Example: Particle Reaching =

[[File:f1.png]]

Figure 1: [Left] Agent, [Middle] Orange square is target, [Right] Green triangle is target [2].

Another simple yet insightful example of one-shot imitation learning is the particle reaching problem, which provides a relatively simple suite of tasks from which the network must solve an arbitrary one. For each task, there is an agent that moves according to a 2D force vector and n landmarks at varying 2D locations (n varies from task to task); the goal is to move the agent to the specific landmark reached in the demonstration. This is illustrated in Figure 1.

[[File:f2.png|450px]]

Figure 2: Experimental results [2].

Some insight comes from the use of different network architectures to solve this problem. The three architectures compared (described below) are plain LSTM, LSTM with attention, and final state with attention. The key insight is that the architectures go from generic to specific, with the best generalization performance achieved by the most specific architecture, final state with attention, as seen in Figure 2. It is important to note that this conclusion does not carry forward to more complicated tasks such as the block stacking task.
*Plain LSTM: 512 hidden units, with the demonstration trajectory as input (the position of the agent changes over time and approaches one of the targets). The output of the LSTM, together with the current state (from the task to be solved), is the input to a multi-layer perceptron (MLP) that finds the solution.
*LSTM with attention: The output of the LSTM is now a set of weights for the different targets during training. These weights and the test state are used in the test task. The output, now 2D, is the input to an MLP as before.
*Final state with attention: Looks only at the final state of the demonstration, since it suffices to identify which target to reach (the trajectory is not required). Like the previous architecture, it produces weights used by the MLP.

= Source =
# Bahdanau, Dzmitry, Kyunghyun Cho, and Yoshua Bengio. "Neural machine translation by jointly learning to align and translate." arXiv preprint arXiv:1409.0473 (2014).
# Duan, Yan, Marcin Andrychowicz, Bradly Stadie, Jonathan Ho, Jonas Schneider, Ilya Sutskever, Pieter Abbeel, and Wojciech Zaremba. "One-shot imitation learning." In Advances in neural information processing systems, pp. 1087-1098. 2017.
# Duan, Yan, Marcin Andrychowicz, Bradly Stadie, Jonathan Ho, Jonas Schneider, Ilya Sutskever, Pieter Abbeel, and Wojciech Zaremba. "One-shot imitation learning." arXiv preprint arXiv:1703.07326 (2017). (Newer revision)
# Finn, Chelsea, Pieter Abbeel, and Sergey Levine. "Model-agnostic meta-learning for fast adaptation of deep networks." arXiv preprint arXiv:1703.03400 (2017).
# Sutskever, Ilya, Oriol Vinyals, and Quoc V. Le. "Sequence to sequence learning with neural networks." Advances in neural information processing systems. 2014.

One-Shot Imitation Learning

2018-04-21T04:26:52Z

F7xia: /* Algorithm */

= Introduction =
We are interested in robotic systems that are able to perform a variety of complex useful tasks. Robotic systems can be used for many applications, but to truly be useful for complex applications, they need to overcome 2 challenges: having the intent of the task at hand communicated to them, and being able to perform the manipulations necessary to complete this task. It is preferable to use demonstration to teach the robotic systems rather than natural language, as natural language may often fail to convey the details and intricacies required for the task. However, current work on learning from demonstrations is only successful with large amounts of feature engineering or a large number of demonstrations. The proposed model aims to achieve 'one-shot' imitation learning, ie. learning to complete a new task from just a single demonstration of it without any other supervision. As input, the proposed model takes the observation of the current instance of a task, and a demonstration of successfully solving a different instance of the same task. Strong generalization was achieved by using a soft attention mechanism on both the sequence of actions and states that the demonstration consists of, as well as on the vector of element locations within the environment. The success of this proposed model at completing a series of block stacking tasks can be viewed at http://bit.ly/nips2017-oneshot.

= Related Work =
While one-shot imitation learning is a novel combination of ideas, each of the components has previously been studied.
* Imitation Learning:
** Behavioural learning uses supervised learning to map from observations to actions (e.g. [https://papers.nips.cc/paper/95-alvinn-an-autonomous-land-vehicle-in-a-neural-network.pdf (Pomerleau 1988)], [https://arxiv.org/pdf/1011.0686.pdf (Ross et. al 2011)])
** Inverse reinforcement learning estimates a reward function that considers demonstrations as optimal behavior (e.g. [http://ai.stanford.edu/~ang/papers/icml00-irl.pdf (Ng et. al 2000)])
* One-Shot Learning: is an object categorization problem in computer vision. Whereas most machine learning based object categorization algorithms require training on hundreds or thousands of images and very large datasets, one-shot learning aims to learn information about object categories from one, or only a few , training images.
** Typically a form of meta-learning
** Previously used for variety of tasks but all domain-specific
** [https://arxiv.org/abs/1703.03400 (Finn et al. 2017)] proposed a generic solution but excluded imitation learning
* Reinforcement Learning:
** Demonstrated to work on variety of tasks and environments, in particular on games and robotic control
** Requires large amount of trials and a user-specified reward function
* Multi-task/Transfer Learning:
** Shown to be particularly effective at computer vision tasks
** Not meant for one-shot learning
* Attention Modelling:
** The proposed model makes use of the attention model from [https://arxiv.org/abs/1409.0473 (Bahdanau et al. 2016)]
** The attention modelling over demonstration is similar in nature to the seq2seq models from the well known [https://papers.nips.cc/paper/5346-sequence-to-sequence-learning-with-neural-networks.pdf (Sutskever et al. 2014)]

= One-Shot Imitation Learning =

[[File:oneshot1.jpg|1000px]]

The figure above shows the differences between traditional and one-shot imitation learning. In a), the traditional method may require training different policies for performing similar tasks that are similar in nature. For example, stacking blocks to a height of 2 and to a height of 3. In b), the one-shot imitation learning allows the same policy to be used for these tasks given a single demonstration, achieving good performance without any additional system interactions. In c), the policy is trained by using a set of different training tasks, with enough examples so that the learned results can be generalized to other similar tasks. Each task has a set of successful demonstrations. Each iteration of training uses two demonstrations from a task, one is used as the input passing into the algorithm and the other is used at the output, the results from the two are then conditioned to produce the correct action.

== Problem Formalization ==
The problem is briefly formalized with the authors describing a distribution of tasks, an individual task, a distribution of demonstrations for this task, and a single demonstration respectively as \[T, \: t\sim T, \: D(t), \: d\sim D(t)\]
In addition, an action, an observation, parameters, and a policy are respectively defined as \[a, o, \theta, \pi_\theta(a|o,d)\]
In particular, a demonstration is a sequence of observation and action pairs \[d = [(o_1, a_1),(o_2, a_2), . . . ,(o_H , a_H )]\]
Assuming that <math>H </math>, the length or horizon of a demonstration, and some evaluation function $$R_t(d): R^H \rightarrow R$$ are given, and that succesful demonstrations are available for each task, then the objective is to maximize expectation of the policy performance over \[t\sim T, d\sim D(t)\].

== Block Stacking Tasks ==
The tasks that the authors focus on is block stacking. A user specifies in what final configuration cubic blocks should be stacked, and the goal is to use a 7-DOF Fetch robotic arm to arrange the blocks in this configuration. The number of blocks, and their desired configuration (ie. number of towers, the height of each tower, and order of blocks within each tower) can be varied and encoded as a string. For example, 'abc def' would signify 2 towers of height 3, with block A on block B on block C in one tower, and block D on block E on block F in a second tower. To add complexity, the initial configuration of the blocks can vary and is encoded as a set of 3-dimensional vectors describing the position of each block relative to the robotic arm.

== Algorithm ==
To avoid needing to specify a reward function, the authors use behavioral cloning and DAGGER, 2 imitation learning methods that require only demonstrations, for training. In each training step, a list of tasks is sampled, and for each, a demonstration with injected noise along with some observation-action pairs are sampled. Given the current observation and demonstration as input, the policy is trained against the sampled actions by minimizing L2 norm for continuous actions, and cross-entropy for discrete ones. Adamax is used as the optimizer with a learning rate of 0.001.

= Architecture =
The authors propose a novel architecture for imitation learning, consisting of 3 networks.

While, in principle, a generic neural network could learn the mapping from demonstration and current observation to appropriate action, the authors propose the following architecture which they claim as one of the main contributions of this paper, and believe it would be useful for complex tasks in the future.
The proposed architecture consists of three modules: the demonstration network, the context network, and the manipulation network.

[[File:oneshot2.jpg|1000px|center]]

== Demonstration Network ==
This network takes a demonstration as input and produces an embedding with size linearly proportional to the number of blocks and the size of the demonstration.
=== Temporal Dropout ===
Since a demonstration for block stacking can be very long, the authors randomly discard 95% of the time steps, a process they call 'temporal dropout'. The reduced size of the demonstrations allows multiple trajectories to be explored during testing to calculate an ensemble estimate. Dilated temporal convolutions and neighborhood attention are then repeatedly applied to the downsampled demonstrations. For block stacking project, the demonstrations can span hundreds to thousands of time
steps, and training with such long sequences can be demanding in both time and memory usage. Hence, the author randomly discard a subset of time steps during training, such operation is called "temporal dropout". Denote p as the proportion of time steps that are thrown away (in this case p = 95%).

=== Neighborhood Attention ===
Since demonstration sizes can vary, a mechanism is needed that is not restricted to fixed-length inputs. While soft attention is one such mechanism, the problem with it is that there may be increasingly large amounts of information lost if soft attention is used to map longer demonstrations to the same fixed length as shorter demonstrations. As a solution, the authors propose having the same number of outputs as inputs, but with attention performed on other inputs relative to the current input.

A query <math>q</math>, a list of context vectors <math>\{c_j\}</math>, and a list of memory vectors <math>\{m_j\}</math> are given as input to soft attention. Each attention weight is given by the product of a learned weight vector and a nonlinearity applied to the sum of the query and corresponding context vector. Softmaxed weights applied to the corresponding memory vector form the output of the soft attention.

\[Inputs: q, \{c_j\}, \{m_j\}\]
\[Weights: w_i \leftarrow v^Ttanh(q+c_i)\]
\[Output: \sum_i{m_i\frac{\exp(w_i)}{\sum_j{\exp(w_j)}}}\]

A list of same-length embeddings, coming from a previous neighbourhood attention layer or a projection from the list of block coordinates, is given as input to neighborhood attention. For each block, two separate linear layers produce a query vector and a context vector, while a memory vector is a list of tuples that describe the position of each block joined with the input embedding for that block. Soft attention is then performed on this query, context vector, and memory vector. The authors claim that the intuition behind this process is to allow each block to provide information about itself relative to the other blocks in the environment. Finally, for each block, a linear transformation is performed on the vector composed by concatenating the input embedding, the result of the soft attention for that block, and the robot's state.

For an environment with B blocks:
\[State: s\]
\[Block_i: b_i \leftarrow (x_i, y_i, z_i)\]
\[Embeddings: h_1^{in}, ..., h_B^{in}\]
\[Query_i: q_i \leftarrow Linear(h_i^{in})\]
\[Context_i: c_i \leftarrow Linear(h_i^{in})\]
\[Memory_i: m_i \leftarrow (b_i, h_i^{in}) \]
\[Result_i: result_i \leftarrow SoftAttn(q_i, \{c_j\}_{j=1}^B, \{m_k\}_{k=1}^B)\]
\[Output_i: output_i \leftarrow Linear(concat(h_i^{in}, result_i, b_i, s))\]

== Context network ==
This network takes the current state and the embedding produced by the demonstration network as inputs and outputs a fixed-length "context embedding" which captures only the information relevant for the manipulation network at this particular step.
=== Attention over demonstration ===
The current state is used to compute a query vector which is then used for attending over all the steps of the embedding. Since at each time step there are multiple blocks, the weights for each are summed together to produce a scalar for each time step. Neighbourhood attention is then applied several times, using an LSTM with untied weights, since the information at each time steps needs to be propagated to each block's embedding.

Performing attention over the demonstration yields a vector whose size is independent of the demonstration size; however, it is still dependent on the number of blocks in the environment, so it is natural to now attend over the state in order to get a fixed-length vector.
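
As a rough sketch of the size argument above, the snippet below attends over the time axis of a demonstration embedding of shape (T, B, D). The per-block scores are summed into one scalar per time step, as described, and the result has shape (B, D) regardless of T. Applying the softmax after the sum, the additive tanh scoring, and all names and dimensions are assumptions for illustration, not details confirmed by the paper summary.

```python
import numpy as np

def attend_over_demonstration(query, demo_embedding, W_c, v):
    """Collapse the time axis of a (T, B, D) demonstration embedding."""
    contexts = demo_embedding @ W_c                  # (T, B, A)
    block_scores = np.tanh(query + contexts) @ v     # (T, B): one score per block per step
    step_scores = block_scores.sum(axis=1)           # (T,): summed over blocks
    step_scores = step_scores - step_scores.max()    # stabilise the softmax
    weights = np.exp(step_scores) / np.exp(step_scores).sum()
    attended = np.tensordot(weights, demo_embedding, axes=1)  # (B, D)
    return attended, weights

rng = np.random.default_rng(0)
B, D, A = 4, 8, 16                                   # hypothetical sizes
W_c = rng.normal(size=(D, A))
v = rng.normal(size=A)
query = rng.normal(size=A)                           # stands in for the query from the current state

short_out, short_w = attend_over_demonstration(query, rng.normal(size=(5, B, D)), W_c, v)
long_out, long_w = attend_over_demonstration(query, rng.normal(size=(50, B, D)), W_c, v)
print(short_out.shape, long_out.shape)               # both (4, 8)
```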
=== Attention over current state ===
The authors propose that in general, within each subtask, only a limited number of blocks are relevant for performing the subtask. If the subtask is to stack A on B, then intuitively, one would suppose that only blocks A and B are relevant, and perhaps any blocks that may be blocking access to either A or B. This is not enforced during training, but once soft attention is applied to the current state to produce a fixed-length context embedding, the authors believe that the model does indeed learn in this way.
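
A minimal sketch of this last step, assuming the same additive soft attention as elsewhere in the architecture: a per-block embedding of shape (B, D) is reduced to a single D-dimensional vector, so the context embedding no longer depends on the number of blocks. The function name and all sizes are hypothetical.

```python
import numpy as np

def attend_over_blocks(query, block_embeddings, W_c, v):
    """Reduce a (B, D) per-block embedding to one fixed-length vector."""
    contexts = block_embeddings @ W_c                # (B, A)
    scores = np.tanh(query + contexts) @ v           # (B,)
    scores = scores - scores.max()                   # stabilise the softmax
    weights = np.exp(scores) / np.exp(scores).sum()  # one weight per block
    return weights @ block_embeddings, weights       # (D,), (B,)

rng = np.random.default_rng(0)
D, A = 8, 16                                         # hypothetical sizes
W_c = rng.normal(size=(D, A))
v = rng.normal(size=A)
query = rng.normal(size=A)

two_blocks, w2 = attend_over_blocks(query, rng.normal(size=(2, D)), W_c, v)
ten_blocks, w10 = attend_over_blocks(query, rng.normal(size=(10, D)), W_c, v)
print(two_blocks.shape, ten_blocks.shape)            # both (8,)
```

The learned weights are what the visualizations later in the summary inspect: if the intuition holds, most of the weight should fall on the few blocks relevant to the current subtask.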

== Manipulation network ==
Given the context embedding as input, this simple feedforward network decides on the particular action needed to complete the subtask of stacking one particular 'source' block on top of another 'target' block. The manipulation network is an MLP. Since the network in the paper can only take into account the source and target blocks, it may take suboptimal paths. For example, changing [ABC, D] to [C, ABD] could be done in one motion if it were possible to manipulate two blocks at once. The manipulation network is the simplest part of the architecture and leaves room to expand upon in future work.
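
For concreteness, a feedforward network of this kind can be sketched as follows. The layer sizes, the ReLU nonlinearity, the 7-dimensional output, and the helper names are assumptions chosen for the example; the summary does not specify them.

```python
import numpy as np

def init_layers(sizes, rng):
    """Create (weight, bias) pairs for consecutive layer sizes."""
    return [(rng.normal(scale=0.1, size=(m, n)), np.zeros(n))
            for m, n in zip(sizes[:-1], sizes[1:])]

def mlp_forward(context_embedding, layers):
    """Map a fixed-length context embedding to unnormalized action outputs."""
    x = context_embedding
    for W, b in layers[:-1]:
        x = np.maximum(0.0, x @ W + b)               # hidden layers with ReLU
    W, b = layers[-1]
    return x @ W + b                                 # final linear layer

rng = np.random.default_rng(0)
layers = init_layers([64, 32, 32, 7], rng)           # hypothetical sizes
action_outputs = mlp_forward(rng.normal(size=64), layers)
print(action_outputs.shape)                          # (7,)
```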

= Experiments =
The proposed model was tested on the block stacking tasks. The experiments were designed to answer the following questions:
* How does training with behavioral cloning compare with DAGGER?
* How does conditioning on the entire demonstration compare to conditioning on the final state?
* How does conditioning on the entire demonstration compare to conditioning on a “snapshot” of the trajectory?
* Can the authors' framework generalize to tasks that it has never seen during training?
For the experiments, 140 training tasks and 43 testing tasks were collected, each with between 2 and 10 blocks and a different desired final layout. Over 1000 demonstrations for each task were collected using a hard-coded policy rather than a human user. The authors compare 4 different architectures in these experiments:
* Behavioural cloning used to train the proposed model
* DAGGER used to train the proposed model
* The proposed model, trained with DAGGER, but conditioned on the desired final state rather than an entire demonstration
* The proposed model, trained with DAGGER, but conditioned on a 'snapshot' of the environment at the end of each subtask (i.e. every time a block is stacked on another block)
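
The difference between the first two training schemes can be illustrated with a toy example. The sketch below is not the authors' setup: it uses a hypothetical 1D world, a hard-coded expert that steps toward position 0, a lookup table in place of a neural network, and random action "slips" to push the learner off the expert's path. Behavioral cloning fits the policy only to states the expert visits, while DAGGER also rolls out the learned policy and asks the expert to label the states the learner actually reaches.

```python
import random

def expert_action(state):
    """Hard-coded expert: always step toward the goal at position 0."""
    return (state < 0) - (state > 0)

def fit_policy(dataset):
    """Tabular stand-in for supervised learning: memorise the expert's labels."""
    table = dict(dataset)
    return lambda state: table.get(state, 0)         # unseen states: stay put

def rollout(policy, start, horizon, slip_prob, rng):
    """Run a policy and return the visited states; actions may randomly slip."""
    visited, state = [], start
    for _ in range(horizon):
        visited.append(state)
        action = policy(state)
        if rng.random() < slip_prob:
            action = rng.choice((-1, 0, 1))
        state = max(-10, min(10, state + action))
    return visited

def behavioral_cloning(starts, horizon, rng):
    """Train only on states visited by the expert itself."""
    data = []
    for start in starts:
        for state in rollout(expert_action, start, horizon, 0.0, rng):
            data.append((state, expert_action(state)))
    return fit_policy(data), data

def dagger(starts, horizon, iterations, slip_prob, rng):
    """Aggregate expert labels on states visited by the learner's own policy."""
    policy, data = behavioral_cloning(starts, horizon, rng)
    data = list(data)
    for _ in range(iterations):
        for start in starts:
            for state in rollout(policy, start, horizon, slip_prob, rng):
                data.append((state, expert_action(state)))  # expert relabels
        policy = fit_policy(data)                    # retrain on the aggregate
    return policy, data

rng = random.Random(0)
bc_policy, bc_data = behavioral_cloning([3, -2], horizon=8, rng=rng)
dagger_policy, dagger_data = dagger([3, -2], horizon=8, iterations=5,
                                    slip_prob=0.3, rng=rng)
print(len(set(bc_data)), len(set(dagger_data)))      # distinct labelled states
```

In this toy setting the DAGGER dataset always contains the behavioral cloning dataset, plus any additional states the learner drifted into.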

== Performance Evaluation ==
[[File:oneshot3.jpg|1000px]]

The most confident action at each timestep is chosen in 100 different task configurations, and results are averaged over tasks that had the same number of blocks. The results suggest that the performance of each of the architectures is comparable to that of the hard-coded policy which they aim to imitate. Performance degrades similarly across all architectures and the hard-coded policy as the number of blocks increases. On the harder tasks, conditioning on the entire demonstration led to better performance than conditioning on snapshots or on the final state. The authors suggest two possible reasons. First, the final state alone carries less information than the full demonstration. Second, temporal dropout acts as a regularizer and a form of data augmentation when conditioning on the full demonstration, but it is not applied when conditioning only on the snapshots or the final state. Both DAGGER and behavioral cloning performed comparably well. As mentioned above, noise injection was used in training to improve performance; in practice, additional noise can still be injected but some may already come from other sources.

== Visualization ==
The authors visualize the attention mechanisms underlying the main policy architecture to better understand how it operates. There are two kinds of attention that the authors are mainly interested in: one where the policy attends to different time steps in the demonstration, and the other where the policy attends to different blocks in the current state. The figures below show some of the policy attention heatmaps over time.

[[File:paper6_Visualization.png|800px]]

= Conclusions =
The proposed model successfully learns to complete new instances of a new task from just a single demonstration. The model was demonstrated to work on a series of block stacking tasks. The authors propose several extensions including enabling few-shot learning when one demonstration is insufficient, using image data as the demonstrations, and attempting many other tasks aside from block stacking.

= Criticisms =
While the paper shows an incredibly impressive result, namely the ability to learn a new task from just a single demonstration, there are a few points that need clearing up.
First, the authors use a hard-coded policy in their experiments rather than a human. It is clear that the performance of this policy begins to degrade quickly as the complexity of the task increases. It would be useful to know what this hard-coded policy actually was, and whether the proposed model could still achieve comparable performance if a more successful demonstration, perhaps one by a human user, were provided. Given the current popularity of adversarial examples, it would also be interesting to see the performance when conditioned on an "adversarial" demonstration that achieves the correct final state but intentionally performs complex or obfuscated steps to get there.
Second, it would be useful to see the model's performance on a more complex family of tasks than block stacking. Although each block stacking task is slightly different, the differences may turn out to be insignificant compared to other tasks that this model should work on if it is to be a general imitation learning architecture; intuitively, the space of all possible moves and configurations is not large for this task. The 'one-shot' framing is also a bit misleading: many demonstrations are first needed to train a generic policy that can generalize, and only then is a single demonstration of a new task used, with the expectation that the policy generalizes to it. In effect, some sort of pre-training is involved. Regardless, this work is a big step forward for imitation learning, permitting a wider range of tasks for which there is little training data and no reward function available to still be solved successfully.

= Illustrative Example: Particle Reaching =

[[File:f1.png]]

Figure 1: [Left] Agent, [Middle] Orange square is target, [Right] Green triangle is target [2].

Another simple yet insightful example of one-shot imitation learning is the particle reaching problem, which provides a relatively simple suite of tasks from which the network needs to solve an arbitrary one. The problem is formulated such that for each task there is an agent which can move based on a 2D force vector, and n landmarks at varying 2D locations (n varies from task to task), with the goal of moving the agent to the specific landmark reached in the demonstration. This is illustrated in Figure 1.

[[File:f2.png|450px]]

Figure 2: Experimental results [2].

Some insight comes from the use of different network architectures to solve this problem. The three architectures to compare (described below) are plain LSTM, LSTM with attention, and final state with attention. The key insight is that the architectures go from generic to specific, with the best generalization performance achieved with the most specific architecture, final state with attention, as seen in Figure 2. It is important to note that this conclusion does not carry forward to more complicated tasks such as the block stacking task.
*Plain LSTM: 512 hidden units, with the input being the demonstration trajectory (the position of the agent changes over time and approaches one of the targets). The output of the LSTM, together with the current state (from the task to be solved), is the input for a multi-layer perceptron (MLP) that produces the solution.
*LSTM with attention: The output of the LSTM is now used to produce a set of weights over the different targets. These weights are combined with the target positions in the test state, and the resulting 2D output is the input for an MLP as before.
*Final state with attention: Looks only at the final state of the demonstration, since it sufficiently identifies which target to reach (the trajectory is not required). Similar to the previous architecture, it produces weights that are used by the MLP.
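
The idea behind the attention-based variants can be sketched as follows. In this illustration the attention weights are hand-crafted from distances in the demonstration's final state, as a stand-in for the learned attention; the temperature value, the function names, and the example coordinates are all assumptions. The weights are then applied to the landmark positions of the new task, giving a 2D vector that would be passed on to the MLP.

```python
import numpy as np

def landmark_weights(final_agent_pos, demo_landmarks, temperature=0.05):
    """Softmax weights favouring the landmark nearest the agent's final position."""
    distances = np.linalg.norm(demo_landmarks - final_agent_pos, axis=1)
    scores = -distances / temperature
    scores = scores - scores.max()                   # stabilise the softmax
    weights = np.exp(scores)
    return weights / weights.sum()

def attended_target(weights, current_landmarks):
    """Weighted combination of landmark positions in the new task: a 2D vector."""
    return weights @ current_landmarks

# Demonstration: the agent ended next to landmark 1.
demo_landmarks = np.array([[0.0, 0.0], [1.0, 1.0], [-1.0, 2.0]])
final_agent_pos = np.array([0.98, 1.01])
weights = landmark_weights(final_agent_pos, demo_landmarks)

# New instance of the task: same landmarks, different positions.
current_landmarks = np.array([[5.0, 5.0], [-3.0, 4.0], [2.0, -2.0]])
target = attended_target(weights, current_landmarks)
print(int(weights.argmax()), target.shape)           # landmark index 1, shape (2,)
```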

= Source =
# Bahdanau, Dzmitry, Kyunghyun Cho, and Yoshua Bengio. "Neural machine translation by jointly learning to align and translate." arXiv preprint arXiv:1409.0473 (2014).
# Duan, Yan, Marcin Andrychowicz, Bradly Stadie, Jonathan Ho, Jonas Schneider, Ilya Sutskever, Pieter Abbeel, and Wojciech Zaremba. "One-shot imitation learning." In Advances in neural information processing systems, pp. 1087-1098. 2017.
# Duan, Yan, Marcin Andrychowicz, Bradly Stadie, Jonathan Ho, Jonas Schneider, Ilya Sutskever, Pieter Abbeel, and Wojciech Zaremba. "One-shot imitation learning." arXiv preprint arXiv:1703.07326 (2017). (Newer revision)
# Finn, Chelsea, Pieter Abbeel, and Sergey Levine. "Model-agnostic meta-learning for fast adaptation of deep networks." arXiv preprint arXiv:1703.03400 (2017).
# Sutskever, Ilya, Oriol Vinyals, and Quoc V. Le. "Sequence to sequence learning with neural networks." Advances in neural information processing systems. 2014.


One-Shot Imitation Learning

2018-04-21T04:26:26Z

F7xia: /* Criticisms */

= Introduction =
We are interested in robotic systems that are able to perform a variety of complex useful tasks. Robotic systems can be used for many applications, but to truly be useful for complex applications, they need to overcome 2 challenges: having the intent of the task at hand communicated to them, and being able to perform the manipulations necessary to complete this task. It is preferable to use demonstration to teach the robotic systems rather than natural language, as natural language may often fail to convey the details and intricacies required for the task. However, current work on learning from demonstrations is only successful with large amounts of feature engineering or a large number of demonstrations. The proposed model aims to achieve 'one-shot' imitation learning, ie. learning to complete a new task from just a single demonstration of it without any other supervision. As input, the proposed model takes the observation of the current instance of a task, and a demonstration of successfully solving a different instance of the same task. Strong generalization was achieved by using a soft attention mechanism on both the sequence of actions and states that the demonstration consists of, as well as on the vector of element locations within the environment. The success of this proposed model at completing a series of block stacking tasks can be viewed at http://bit.ly/nips2017-oneshot.

= Related Work =
While one-shot imitation learning is a novel combination of ideas, each of the components has previously been studied.
* Imitation Learning:
** Behavioural learning uses supervised learning to map from observations to actions (e.g. [https://papers.nips.cc/paper/95-alvinn-an-autonomous-land-vehicle-in-a-neural-network.pdf (Pomerleau 1988)], [https://arxiv.org/pdf/1011.0686.pdf (Ross et. al 2011)])
** Inverse reinforcement learning estimates a reward function that considers demonstrations as optimal behavior (e.g. [http://ai.stanford.edu/~ang/papers/icml00-irl.pdf (Ng et. al 2000)])
* One-Shot Learning: is an object categorization problem in computer vision. Whereas most machine learning based object categorization algorithms require training on hundreds or thousands of images and very large datasets, one-shot learning aims to learn information about object categories from one, or only a few , training images.
** Typically a form of meta-learning
** Previously used for a variety of tasks, but all domain-specific
** [https://arxiv.org/abs/1703.03400 (Finn et al. 2017)] proposed a generic solution but excluded imitation learning
* Reinforcement Learning:
** Demonstrated to work on a variety of tasks and environments, in particular on games and robotic control
** Requires a large number of trials and a user-specified reward function
* Multi-task/Transfer Learning:
** Shown to be particularly effective at computer vision tasks
** Not meant for one-shot learning
* Attention Modelling:
** The proposed model makes use of the attention model from [https://arxiv.org/abs/1409.0473 (Bahdanau et al. 2016)]
** The attention modelling over demonstration is similar in nature to the seq2seq models from the well known [https://papers.nips.cc/paper/5346-sequence-to-sequence-learning-with-neural-networks.pdf (Sutskever et al. 2014)]

= One-Shot Imitation Learning =

[[File:oneshot1.jpg|1000px]]

The figure above shows the differences between traditional and one-shot imitation learning. In a), the traditional method may require training a separate policy for each of several tasks that are similar in nature, for example stacking blocks to a height of 2 and to a height of 3. In b), one-shot imitation learning allows the same policy to be used for these tasks given a single demonstration, achieving good performance without any additional system interactions. In c), the policy is trained on a set of different training tasks, with enough examples that the learned policy generalizes to other similar tasks. Each task has a set of successful demonstrations. Each training iteration uses two demonstrations of the same task: the policy is conditioned on the first demonstration and on a state taken from the second, and is trained to produce the action that the second demonstration takes in that state.

== Problem Formalization ==
The problem is briefly formalized with the authors describing a distribution of tasks, an individual task, a distribution of demonstrations for this task, and a single demonstration respectively as \[T, \: t\sim T, \: D(t), \: d\sim D(t)\]
In addition, an action, an observation, parameters, and a policy are respectively defined as \[a, o, \theta, \pi_\theta(a|o,d)\]
In particular, a demonstration is a sequence of observation and action pairs \[d = [(o_1, a_1),(o_2, a_2), \ldots ,(o_H , a_H )]\]
Assuming that <math>H</math>, the length or horizon of a demonstration, and some evaluation function <math>R_t(d): R^H \rightarrow R</math> are given, and that successful demonstrations are available for each task, the objective is to maximize the expected performance of the policy over <math>t\sim T, d\sim D(t)</math>.

== Block Stacking Tasks ==
The task that the authors focus on is block stacking. A user specifies the final configuration in which cubic blocks should be stacked, and the goal is to use a 7-DOF Fetch robotic arm to arrange the blocks in this configuration. The number of blocks and their desired configuration (i.e. the number of towers, the height of each tower, and the order of blocks within each tower) can be varied and are encoded as a string. For example, 'abc def' signifies 2 towers of height 3, with block A on block B on block C in one tower, and block D on block E on block F in a second tower. To add complexity, the initial configuration of the blocks can vary; it is encoded as a set of 3-dimensional vectors describing the position of each block relative to the robotic arm.
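
The string encoding described above can be unpacked with a few lines of Python. This is only an illustrative sketch: the function names <code>parse_layout</code> and <code>stacking_subtasks</code> are hypothetical, and the bottom-up ordering of the (source, target) pairs is an assumption made here for illustration, not something the paper prescribes.

```python
def parse_layout(layout):
    """Split a layout string such as 'abc def' into towers, each listed top block first."""
    return [list(tower) for tower in layout.split()]


def stacking_subtasks(layout):
    """Return (source, target) pairs that build each tower from the bottom up."""
    subtasks = []
    for tower in parse_layout(layout):
        # The string lists blocks top-to-bottom, so reverse it to build upwards.
        bottom_up = tower[::-1]
        for target, source in zip(bottom_up, bottom_up[1:]):
            subtasks.append((source, target))
    return subtasks


towers = parse_layout("abc def")
subtasks = stacking_subtasks("abc def")
```

Each pair reads "stack source on target", which matches the source/target subtask view used later for the manipulation network.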

== Algorithm ==
To avoid needing to specify a reward function, the authors train with behavioral cloning and DAGGER, two imitation learning methods that require only demonstrations. In each training step, a list of tasks is sampled, and for each task a demonstration with injected noise is sampled along with some observation-action pairs. Given the current observation and the demonstration as input, the policy is trained against the sampled actions by minimizing an L2 loss for continuous actions and a cross-entropy loss for discrete ones. Adamax is used as the optimizer with a learning rate of 0.001.
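
The two supervised losses mentioned above can be sketched in NumPy as follows. This is a minimal illustration rather than the authors' implementation: the squared form of the L2 loss, the averaging over the batch, and the function names are all assumptions.

```python
import numpy as np


def l2_loss(predicted, demonstrated):
    """Mean squared L2 distance between predicted and demonstrated continuous actions."""
    return float(np.mean(np.sum((predicted - demonstrated) ** 2, axis=-1)))


def cross_entropy_loss(logits, labels):
    """Mean negative log-likelihood of the demonstrated discrete actions."""
    shifted = logits - logits.max(axis=-1, keepdims=True)  # for numerical stability
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    return float(-np.mean(log_probs[np.arange(len(labels)), labels]))


# Toy batch: two 3-dimensional continuous actions and two 4-way discrete actions.
continuous_loss = l2_loss(np.zeros((2, 3)), np.ones((2, 3)))
discrete_loss = cross_entropy_loss(np.zeros((2, 4)), np.array([0, 1]))
```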

= Architecture =
While, in principle, a generic neural network could learn the mapping from demonstration and current observation to the appropriate action, the authors propose a novel architecture, which they claim as one of the main contributions of the paper and believe will be useful for more complex tasks in the future.
The proposed architecture consists of three modules: the demonstration network, the context network, and the manipulation network.

[[File:oneshot2.jpg|1000px|center]]

== Demonstration Network ==
This network takes a demonstration as input and produces an embedding with size linearly proportional to the number of blocks and the size of the demonstration.
=== Temporal Dropout ===
For block stacking, demonstrations can span hundreds to thousands of time steps, and training on such long sequences is demanding in both time and memory. The authors therefore randomly discard a subset of the time steps during training, an operation they call 'temporal dropout'. Let p denote the proportion of time steps that are thrown away (here p = 95%). The reduced size of the demonstrations also allows multiple downsampled trajectories to be evaluated at test time and combined into an ensemble estimate. Dilated temporal convolutions and neighborhood attention are then repeatedly applied to the downsampled demonstration.
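
A minimal sketch of temporal dropout is given below. It assumes that each time step is dropped independently with probability p and that at least one step is always kept; the summary above does not say whether the paper drops steps independently or removes a fixed fraction, so both choices are assumptions, as is the function name.

```python
import numpy as np


def temporal_dropout(demo, p=0.95, rng=None):
    """Keep each time step independently with probability 1 - p, preserving order."""
    rng = np.random.default_rng() if rng is None else rng
    keep = rng.random(len(demo)) >= p
    if not keep.any():
        # Assumption: never return an empty demonstration.
        keep[rng.integers(len(demo))] = True
    return demo[keep]


rng = np.random.default_rng(0)
demo = np.arange(1000)  # stand-in for 1000 time steps of a demonstration
subsampled = temporal_dropout(demo, p=0.95, rng=rng)
```

Calling the function several times on the same demonstration yields different subsampled trajectories, which is what makes the test-time ensemble possible.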

=== Neighborhood Attention ===
Since demonstration sizes can vary, a mechanism is needed that is not restricted to fixed-length inputs. While soft attention is one such mechanism, the problem with it is that there may be increasingly large amounts of information lost if soft attention is used to map longer demonstrations to the same fixed length as shorter demonstrations. As a solution, the authors propose having the same number of outputs as inputs, but with attention performed on other inputs relative to the current input.

A query <math>q</math>, a list of context vectors <math>\{c_j\}</math>, and a list of memory vectors <math>\{m_j\}</math> are given as input to soft attention. Each attention weight is given by the product of a learned weight vector and a nonlinearity applied to the sum of the query and corresponding context vector. Softmaxed weights applied to the corresponding memory vector form the output of the soft attention.

\[Inputs: q, \{c_j\}, \{m_j\}\]
\[Weights: w_i \leftarrow v^T \tanh(q+c_i)\]
\[Output: \sum_i{m_i\frac{\exp(w_i)}{\sum_j{\exp(w_j)}}}\]
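
The three formulas above translate directly into NumPy. The sketch below is illustrative only; the array shapes and the name <code>soft_attention</code> are assumptions.

```python
import numpy as np


def soft_attention(q, contexts, memories, v):
    """Soft attention with q: (d,), contexts: (n, d), memories: (n, m), v: (d,)."""
    scores = np.tanh(q + contexts) @ v   # w_i = v^T tanh(q + c_i)
    scores = scores - scores.max()       # stabilise the softmax
    weights = np.exp(scores) / np.exp(scores).sum()
    return weights @ memories            # softmax-weighted sum of memory vectors


rng = np.random.default_rng(0)
memories = rng.normal(size=(5, 3))
# With identical context vectors every weight is equal, so the output is the mean memory.
identical_contexts = np.tile(rng.normal(size=4), (5, 1))
output = soft_attention(rng.normal(size=4), identical_contexts, memories, rng.normal(size=4))
```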

A list of same-length embeddings, coming from a previous neighbourhood attention layer or a projection from the list of block coordinates, is given as input to neighborhood attention. For each block, two separate linear layers produce a query vector and a context vector, while a memory vector is a list of tuples that describe the position of each block joined with the input embedding for that block. Soft attention is then performed on this query, context vector, and memory vector. The authors claim that the intuition behind this process is to allow each block to provide information about itself relative to the other blocks in the environment. Finally, for each block, a linear transformation is performed on the vector composed by concatenating the input embedding, the result of the soft attention for that block, and the robot's state.

For an environment with B blocks:
\[State: s\]
\[Block_i: b_i \leftarrow (x_i, y_i, z_i)\]
\[Embeddings: h_1^{in}, ..., h_B^{in}\]
\[Query_i: q_i \leftarrow Linear(h_i^{in})\]
\[Context_i: c_i \leftarrow Linear(h_i^{in})\]
\[Memory_i: m_i \leftarrow (b_i, h_i^{in}) \]
\[Result_i: result_i \leftarrow SoftAttn(q_i, \{c_j\}_{j=1}^B, \{m_k\}_{k=1}^B)\]
\[Output_i: output_i \leftarrow Linear(concat(h_i^{in}, result_i, b_i, s))\]
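
The per-block computation listed above can be sketched as follows. The layer sizes, the omission of bias terms in the linear layers, and the helper <code>make_params</code> are assumptions made for illustration; only the data flow follows the formulas.

```python
import numpy as np


def soft_attention(q, contexts, memories, v):
    scores = np.tanh(q + contexts) @ v
    scores = scores - scores.max()
    weights = np.exp(scores) / np.exp(scores).sum()
    return weights @ memories


def make_params(rng, d=8, d_att=16, state_dim=5, d_out=8):
    """Hypothetical random weights; bias terms are omitted for brevity."""
    return {
        "W_q": rng.normal(size=(d, d_att)),
        "W_c": rng.normal(size=(d, d_att)),
        "v": rng.normal(size=d_att),
        "W_out": rng.normal(size=(2 * d + 6 + state_dim, d_out)),
    }


def neighborhood_attention(h_in, blocks, state, params):
    """h_in: (B, d) embeddings, blocks: (B, 3) positions, state: (s,) robot state."""
    queries = h_in @ params["W_q"]                      # q_i = Linear(h_i)
    contexts = h_in @ params["W_c"]                     # c_i = Linear(h_i)
    memories = np.concatenate([blocks, h_in], axis=1)   # m_i = (b_i, h_i)
    outputs = []
    for i in range(len(h_in)):
        result = soft_attention(queries[i], contexts, memories, params["v"])
        features = np.concatenate([h_in[i], result, blocks[i], state])
        outputs.append(features @ params["W_out"])
    return np.stack(outputs)


rng = np.random.default_rng(0)
params = make_params(rng)
state = rng.normal(size=5)
out_4 = neighborhood_attention(rng.normal(size=(4, 8)), rng.normal(size=(4, 3)), state, params)
out_7 = neighborhood_attention(rng.normal(size=(7, 8)), rng.normal(size=(7, 3)), state, params)
```

The same parameters handle 4 and 7 blocks, and the output has one row per block, which is the property that lets the layer be stacked and applied to environments of varying size.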

== Context network ==
This network takes the current state and the embedding produced by the demonstration network as inputs and outputs a fixed-length "context embedding" which captures only the information relevant for the manipulation network at this particular step.
=== Attention over demonstration ===
The current state is used to compute a query vector, which is then used to attend over all the time steps of the demonstration embedding. Since there are multiple blocks at each time step, the weights for the blocks are summed together to produce a scalar for each time step. Neighbourhood attention is then applied several times, using an LSTM with untied weights, since the information at each time step needs to be propagated to each block's embedding.

Performing attention over the demonstration yields a vector whose size is independent of the demonstration size; however, it is still dependent on the number of blocks in the environment, so it is natural to now attend over the state in order to get a fixed-length vector.
=== Attention over current state ===
The authors propose that in general, within each subtask, only a limited number of blocks are relevant for performing the subtask. If the subtask is to stack A on B, then intuitively, one would suppose that only block A and B are relevant, and perhaps any blocks that may be blocking access to either A or B. This is not enforced during training, but once soft attention is applied to the current state to produce a fixed-length context embedding, the authors believe that the model does indeed learn in this way.

== Manipulation network ==
Given the context embedding as input, this simple feedforward network (an MLP) decides on the particular action needed to complete the subtask of stacking one particular 'source' block on top of another 'target' block. Since the network in the paper can only take the source and target block into account, it may take suboptimal paths. For example, changing [ABC, D] to [C, ABD] could be done in one motion if it were possible to manipulate two blocks at once. The manipulation network is the simplest part of the architecture and leaves room for expansion in future work.

= Experiments =
The proposed model was tested on the block stacking tasks. The experiments were designed to answer the following questions:
* How does training with behavioral cloning compare with DAGGER?
* How does conditioning on the entire demonstration compare to conditioning on the final state?
* How does conditioning on the entire demonstration compare to conditioning on a “snapshot” of the trajectory?
* Can the authors' framework generalize to tasks that it has never seen during training?
For the experiments, 140 training tasks and 43 testing tasks were collected, each with between 2 and 10 blocks and a different desired final layout. Over 1000 demonstrations for each task were collected using a hard-coded policy rather than a human user. The authors compare 4 different architectures in these experiments:
* Behavioural cloning used to train the proposed model
* DAGGER used to train the proposed model
* The proposed model, trained with DAGGER, but conditioned on the desired final state rather than an entire demonstration
* The proposed model, trained with DAGGER, but conditioned on a 'snapshot' of the environment at the end of each subtask (ie. every time a block is stacked on another block)

== Performance Evaluation ==
[[File:oneshot3.jpg|1000px]]

The most confident action at each timestep is chosen in 100 different task configurations, and results are averaged over tasks that had the same number of blocks. The results suggest that the performance of each of the architectures is comparable to that of the hard-coded policy which they aim to imitate. Performance degrades similarly across all architectures and the hard-coded policy as the number of blocks increases. On the harder tasks, conditioning on the entire demonstration led to better performance than conditioning on snapshots or on the final state. The authors believe that this may be due to the lack of information when conditioning only on the final state as well as due to regularization caused by temporal dropout which leads to data augmentation when conditioning on the full demonstration but is omitted when conditioning only on the snapshots or final state. Both DAGGER and behavioral cloning performed comparably well. As mentioned above, noise injection was used in training to improve performance; in practice, additional noise can still be injected but some may already come from other sources.

== Visualization ==
The authors visualize the attention mechanisms underlying the main policy architecture to have a better understanding about how it operates. There are two kinds of attention that the authors are mainly interested in, one where the policy attends to different time steps in the demonstration, and the other where the policy attends to different blocks in the current state. The figures below show some of the policy attention heatmaps over time.

[[File:paper6_Visualization.png|800px]]

= Conclusions =
The proposed model successfully learns to complete new instances of a new task from just a single demonstration. The model was demonstrated to work on a series of block stacking tasks. The authors propose several extensions including enabling few-shot learning when one demonstration is insufficient, using image data as the demonstrations, and attempting many other tasks aside from block stacking.

= Criticisms =
While the paper shows an incredibly impressive result: the ability to learn a new task from just a single demonstration, there are a few points that need clearing up.
Firstly, the authors use a hard-coded policy in their experiments rather than a human. It is clear that the performance of this policy begins to degrade quickly as the complexity of the task increases. It would be useful to know what this hard-coded policy actually was, and whether the proposed model could still achieve comparable performance if a more successful demonstration, perhaps one by a human user, were provided. Given the current popularity of adversarial examples, it would also be interesting to see the performance when conditioned on an "adversarial" demonstration that achieves the correct final state but intentionally performs complex or obfuscated steps to get there.
Second, it would be useful to see the model's performance on a more complex family of tasks than block stacking. Although each block stacking task is slightly different, the differences may turn out to be insignificant compared to other tasks that this model should work on if it is to be a general imitation learning architecture; intuitively, the space of all possible moves and configurations is not large for this task. The term 'one-shot' is also somewhat misleading: many demonstrations are first needed to train a generic policy that can generalize, and only then is a single demonstration of a new task enough. In effect, some form of pre-training is involved. Regardless, this work is a big step forward for imitation learning, allowing a wider range of tasks for which there is little training data and no reward function available to still be solved successfully.

= Illustrative Example: Particle Reaching =

[[File:f1.png]]

Figure 1: [Left] Agent, [Middle] Orange square is target, [Right] Green triangle is target [2].

Another simple yet insightful example of the One-Shot Imitation Learning is the particle reaching problem which provides a relatively simple suite of tasks from which the network needs to solve an arbitrary one. The problem is formulated such that for each task: there is an agent which can move based on a 2D force vector, and n landmarks at varying 2D locations (n varies from task to task) with the goal of moving the agent to the specific landmark reached in the demonstration. This is illustrated in Figure 1.

[[File:f2.png|450px]]

Figure 2: Experimental results [2].

Some insight comes from the use of different network architectures to solve this problem. The three architectures to compare (described below) are plain LSTM, LSTM with attention, and final state with attention. The key insight is that the architectures go from generic to specific, with the best generalization performance achieved with the most specific architecture, final state with attention, as seen in Figure 2. It is important to note that this conclusion does not carry forward to more complicated tasks such as the block stacking task.
*Plain LSTM: 512 hidden units, with the input being the demonstration trajectory (the position of the agent changes over time and approaches one of the targets). Output of the LSTM with the current state (from the task needed to be solved) is the input for a multi-layer perceptron (MLP) for finding the solution.
*LSTM with attention: The output of the LSTM is now a set of weights over the different targets during training. These weights and the test state are used in the test task. The resulting 2D output is the input for an MLP, as before.
*Final state with attention: Looks only at the final state of the demonstration since it can sufficiently provide the needed detail of which target to reach (trajectory is not required). Similar to previous architecture, produces weights used by MLP.

= Source =
# Bahdanau, Dzmitry, Kyunghyun Cho, and Yoshua Bengio. "Neural machine translation by jointly learning to align and translate." arXiv preprint arXiv:1409.0473 (2014).
# Duan, Yan, Marcin Andrychowicz, Bradly Stadie, OpenAI Jonathan Ho, Jonas Schneider, Ilya Sutskever, Pieter Abbeel, and Wojciech Zaremba. "One-shot imitation learning." In Advances in neural information processing systems, pp. 1087-1098. 2017.
# Y. Duan, M. Andrychowicz, B. Stadie, J. Ho, J. Schneider, I. Sutskever, P. Abbeel, and W. Zaremba. One-shot imitation learning. arXiv preprint arXiv:1703.07326, 2017. (Newer revision)
# Finn, Chelsea, Pieter Abbeel, and Sergey Levine. "Model-agnostic meta-learning for fast adaptation of deep networks." arXiv preprint arXiv:1703.03400 (2017).
# Sutskever, Ilya, Oriol Vinyals, and Quoc V. Le. "Sequence to sequence learning with neural networks." Advances in neural information processing systems. 2014.

stat946w18/IMPROVING GANS USING OPTIMAL TRANSPORT

2018-04-21T04:23:43Z

F7xia: /* Discussion */

== Introduction ==
Recently, the problem of how to learn models that generate media such as images, video, audio and text has become very popular and is called Generative Modeling. One of the main benefits of such an approach is that generative models can be trained on unlabeled data, which is readily available. Therefore, generative networks have a huge potential in the field of deep learning.

Generative Adversarial Networks (GANs) are powerful generative models for unsupervised learning in which two agents compete in a zero-sum game. A GAN model consists of a generator and a discriminator or critic. The generator is a neural network which is trained to generate data whose distribution matches the distribution of the real data. The critic is also a neural network, which is trained to separate the generated data from the real data. A loss function that measures the distance between the distribution of the generated data and that of the real data is important for training the generator.

Optimal transport theory offers another approach to measuring distances between distributions: it evaluates the distance between the distribution of the generated data and that of the training data based on a metric, which provides another way to train the generator. The main advantage of optimal transport theory over the distance measurement in GANs is its closed-form solution, which makes the training process tractable. However, it can also lead to inconsistent statistical estimation, because the gradients are biased when mini-batches are used (Bellemare et al., 2017).

This paper presents a variant of GANs named OT-GAN, which incorporates a discriminative metric called 'Mini-batch Energy Distance' into its critic in order to overcome the issue of biased gradients.

== GANs and Optimal Transport ==

===Generative Adversarial Nets===
The paper first reviews the original GAN, whose objective function is:

[[File:equation1.png|700px]]

The goal of GAN training is to find a pair (g,d) of generator and discriminator that achieves a Nash equilibrium (such that neither of them can reduce its cost without changing the other's parameters). However, training may fail to converge, since the generator and the discriminator are trained with gradient descent techniques.

===Wasserstein Distance (Earth-Mover Distance)===

In order to solve the problem of convergence failure, Arjovsky et al. (2017) suggested the Wasserstein distance (Earth-Mover distance), which is based on optimal transport theory.

[[File:equation2.png|600px]]

where <math> \Pi (p,g) </math> is the set of all joint distributions <math> \gamma (x,y) </math> with marginals <math> p(x) </math> (real data) and <math> g(y) </math> (generated data). <math> c(x,y) </math> is a cost function; Arjovsky et al. used the Euclidean distance in their paper.

The Wasserstein distance can be interpreted as the minimum cost of moving probability mass between <math> g(y) </math> and <math> p(x) </math> so that the generator distribution <math> g(y) </math> matches the real data distribution <math> p(x) </math>.

Computing the Wasserstein distance is intractable. The proposed Wasserstein GAN (W-GAN) provides an estimated solution by switching the optimal transport problem into Kantorovich-Rubinstein dual formulation using a set of 1-Lipschitz functions. A neural network can then be used to obtain an estimation.

[[File:equation3.png|600px]]

W-GAN helps to solve the unstable training process of original GAN and it can solve the optimal transport problem approximately, but it is still intractable.

===Sinkhorn Distance===
Genevay et al. (2017) proposed using the primal formulation of optimal transport, instead of the dual formulation, for generative modeling. They introduced the Sinkhorn distance, which is a smoothed generalization of the Wasserstein distance.
[[File: equation4.png|600px]]

It introduces an entropy restriction (<math> \beta </math>) on the joint distribution, giving the set <math> \Pi_{\beta} (p,g) </math>. This distance can be approximated on mini-batches of data <math> X, Y </math>, each consisting of <math> K </math> vectors <math> x, y </math>. The <math> (i, j) </math>-th entry of the cost matrix <math> C </math> can be interpreted as the cost of transporting <math> x_i </math> in mini-batch <math> X </math> to <math> y_j </math> in mini-batch <math> Y </math>. The resulting distance is:

[[File: equation5.png|550px]]

where <math> M </math> is a <math> K \times K </math> matrix with positive entries that plays the role of the joint distribution <math> \gamma (x,y) </math> on the mini-batches. Each row and each column of <math> M </math> sums to 1.

This mini-batch Sinkhorn distance is not only fully tractable but also capable of solving the instability problem of GANs. However, it is not a valid metric over probability distribution when taking the expectation of <math> \mathcal{W}_{c} </math> and the gradients are biased when the mini-batch size is fixed.
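
The mini-batch Sinkhorn distance can be approximated with the classic Sinkhorn scaling iterations. The sketch below is illustrative and makes several assumptions: the cost is the Euclidean distance, the entropy regularization enters through the kernel <code>exp(-lam * C)</code>, and the values of <code>lam</code> and <code>n_iters</code> are arbitrary choices for this toy example rather than the paper's settings.

```python
import numpy as np


def pairwise_cost(X, Y):
    """Euclidean cost matrix with C[i, j] = ||x_i - y_j||."""
    return np.linalg.norm(X[:, None, :] - Y[None, :, :], axis=-1)


def sinkhorn_matching(C, lam=2.0, n_iters=100):
    """Soft matching M with positive entries whose rows and columns sum to 1."""
    kernel = np.exp(-lam * C)
    u = np.ones(C.shape[0])
    v = np.ones(C.shape[1])
    for _ in range(n_iters):
        u = 1.0 / (kernel @ v)    # rescale so every row of M sums to 1
        v = 1.0 / (kernel.T @ u)  # rescale so every column of M sums to 1
    return u[:, None] * kernel * v[None, :]


def sinkhorn_distance(X, Y, lam=2.0, n_iters=100):
    C = pairwise_cost(X, Y)
    M = sinkhorn_matching(C, lam, n_iters)
    return float(np.sum(M * C))   # equals Tr[M C^T]


rng = np.random.default_rng(0)
X, Y = rng.random((6, 2)), rng.random((6, 2))
M = sinkhorn_matching(pairwise_cost(X, Y))
distance = sinkhorn_distance(X, Y)
```

Larger values of <code>lam</code> weaken the entropy smoothing and bring the result closer to the unregularized optimal transport cost, at the price of slower convergence of the iterations.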

===Energy Distance (Cramer Distance)===
In order to solve the above problem, Bellemare et al. proposed Energy distance:

[[File: equation6.png|700px]]

where <math> x, x' </math> and <math> y, y'</math> are independent samples from data distribution <math> p </math> and generator distribution <math> g </math>, respectively. Based on the Energy distance, Cramer GAN is to minimize the ED distance metric when training the generator.
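
As a rough illustration, the squared energy distance, 2 E||x - y|| - E||x - x'|| - E||y - y'||, can be estimated from two sets of samples. The sketch below uses the simple plug-in estimate that averages over all pairs, including each sample paired with itself; this is an assumption made for brevity and is slightly biased compared with using truly independent pairs.

```python
import numpy as np


def mean_pairwise_distance(A, B):
    """Average Euclidean distance over all pairs (a, b) with a in A and b in B."""
    return float(np.mean(np.linalg.norm(A[:, None, :] - B[None, :, :], axis=-1)))


def squared_energy_distance(X, Y):
    """Plug-in estimate of 2 E||x - y|| - E||x - x'|| - E||y - y'||."""
    return (2.0 * mean_pairwise_distance(X, Y)
            - mean_pairwise_distance(X, X)
            - mean_pairwise_distance(Y, Y))


rng = np.random.default_rng(0)
real = rng.normal(size=(200, 2))
fake_far = rng.normal(size=(200, 2)) + 3.0  # generator samples shifted away from the data
same = squared_energy_distance(real, real)
far = squared_energy_distance(real, fake_far)
```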

==Mini-Batch Energy Distance==
Salimans et al. (2016) observed that a GAN that works with distributions over mini-batches, <math> g(X), p(X) </math>, is more powerful than one that works with distributions over individual images. The distance measure is therefore defined over mini-batches.

===Generalized Energy Distance===
The generalized energy distance allows non-Euclidean distance functions d to be used. It is also valid for mini-batches, and working with mini-batches is considered better than working with individual data points.

[[File: equation7.png|670px]]

As in the energy distance, <math> X, X' </math> and <math> Y, Y'</math> are independent samples from the data distribution <math> p </math> and the generator distribution <math> g </math>, respectively; in the generalized energy distance, these samples can also be mini-batches. <math> D_{GED}(p,g) </math> is a metric whenever <math> d </math> is a metric. In particular, by the triangle inequality for <math> d </math>, <math> D(p,g) \geq 0,</math> and <math> D(p,g)=0 </math> when <math> p=g </math>.

===Mini-Batch Energy Distance===
As <math> d </math> is free to choose, the authors propose the Mini-batch Energy Distance, which uses the entropy-regularized Wasserstein distance as <math> d </math>.

[[File: equation8.png|650px]]

where <math> X, X' </math> and <math> Y, Y'</math> are independently sampled mini-batches from the data distribution <math> p </math> and the generator distribution <math> g </math>, respectively. This distance metric combines the energy distance with the primal form of optimal transport over the mini-batch distributions <math> g(Y) </math> and <math> p(X) </math>. Inside the generalized energy distance, the Sinkhorn distance is a valid metric between mini-batches. By adding the terms <math> - \mathcal{W}_c (Y,Y')</math> and <math> \mathcal{W}_c (X,Y)</math> to equation (5) and using the energy distance, the objective becomes statistically consistent (meaning the objective converges to the true parameter value for large sample sizes) and the mini-batch gradients are unbiased.
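
Putting the pieces together, a single-sample estimate of the squared mini-batch energy distance, 2 W_c(X,Y) - W_c(X,X') - W_c(Y,Y'), can be sketched as below. The Euclidean cost, the Sinkhorn settings and the function names are assumptions for illustration; the paper instead learns the cost, as described in the next section.

```python
import numpy as np


def sinkhorn_distance(X, Y, lam=2.0, n_iters=100):
    """Entropy-smoothed transport cost between two mini-batches (Euclidean cost assumed)."""
    C = np.linalg.norm(X[:, None, :] - Y[None, :, :], axis=-1)
    kernel = np.exp(-lam * C)
    u = np.ones(len(X))
    v = np.ones(len(Y))
    for _ in range(n_iters):
        u = 1.0 / (kernel @ v)
        v = 1.0 / (kernel.T @ u)
    M = u[:, None] * kernel * v[None, :]
    return float(np.sum(M * C))


def minibatch_energy_distance_sq(X, X2, Y, Y2):
    """One-sample estimate of 2 W(X, Y) - W(X, X') - W(Y, Y')."""
    return (2.0 * sinkhorn_distance(X, Y)
            - sinkhorn_distance(X, X2)
            - sinkhorn_distance(Y, Y2))


rng = np.random.default_rng(0)
X, X2 = rng.random((8, 2)), rng.random((8, 2))              # two real mini-batches
Y, Y2 = rng.random((8, 2)) + 5.0, rng.random((8, 2)) + 5.0  # two shifted "generated" mini-batches
estimate = minibatch_energy_distance_sq(X, X2, Y, Y2)
```

The two subtracted terms are what distinguish this objective from the plain mini-batch Sinkhorn distance: they cancel the cost that would remain even if the generator matched the data distribution exactly.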

==Optimal Transport GAN (OT-GAN)==

The proposed mini-batch energy distance depends on the transport cost function <math>c(x,y)</math>. One possibility would be to choose c to be some fixed function over vectors, like Euclidean distance, but the authors found this to perform poorly in preliminary experiments. For simple fixed cost functions like Euclidean distance, there exist many bad distributions <math>g</math> in higher dimensions for which the mini-batch energy distance is close to zero, so that it is difficult to tell <math>p</math> and <math>g</math> apart if the sample size is not big enough. To solve this, the authors propose learning the cost function adversarially, so that it can adapt to the generator distribution <math>g</math> and thereby become more discriminative.

In practice, in order to secure statistical efficiency (i.e. being able to tell <math>p</math> and <math>g</math> apart without requiring an enormous sample size when their distance is close to zero), the authors suggest using the cosine distance between vectors <math> v_\eta (x) </math> and <math> v_\eta (y) </math>, where <math> v_\eta </math> is a deep neural network that maps the mini-batch data to a learned latent space. The transportation cost is:

[[File: euqation9.png|370px]]

where the <math> v_\eta </math> is chosen to maximize the resulting minibatch energy distance.
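
The cosine-distance cost, 1 minus the cosine similarity of the critic features, can be sketched as follows. The function <code>critic_features</code> is a hypothetical stand-in (a single linear map) for the deep network <math>v_\eta</math>, and the small <code>eps</code> added to the norms is an assumption to avoid division by zero.

```python
import numpy as np


def critic_features(X, W):
    """Hypothetical stand-in for the learned critic v_eta: a single linear map."""
    return X @ W


def cosine_cost(X, Y, W, eps=1e-12):
    """Cost matrix with C[i, j] = 1 - cos(v(x_i), v(y_j))."""
    vx = critic_features(X, W)
    vy = critic_features(Y, W)
    vx = vx / (np.linalg.norm(vx, axis=1, keepdims=True) + eps)
    vy = vy / (np.linalg.norm(vy, axis=1, keepdims=True) + eps)
    return 1.0 - vx @ vy.T


rng = np.random.default_rng(0)
W = rng.normal(size=(10, 4))   # random weights; in OT-GAN these would be trained
X = rng.normal(size=(5, 10))
Y = rng.normal(size=(5, 10))
C = cosine_cost(X, Y, W)
C_self = cosine_cost(X, X, W)
```

The entries of this matrix lie between 0 and 2, and it can be used as the cost matrix <math>C</math> in the mini-batch Sinkhorn distance.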

Unlike common practice with the original GAN, the generator is trained more often than the critic, which keeps the cost function from degenerating. The resulting generator in OT-GAN has a well-defined and statistically consistent objective throughout the training process.

The algorithm is defined below. The gradient is not backpropagated through the matching <math> M </math>; ignoring this gradient flow is justified by the envelope theorem (i.e. when changing the parameters of the objective function, changes in the optimizer do not contribute to a change in the objective function). Stochastic gradient descent is used as the optimization method in Algorithm 1 below, although other optimizers are also possible. In fact, Adam was used in the experiments.

[[File: al.png|600px]]

[[File: al_figure.png|600px]]

==Experiments==

In order to demonstrate the superior performance of OT-GAN, the authors compared it with the original GAN and other popular models in four experiments: dataset recovery, a CIFAR-10 test, an ImageNet test, and a conditional image synthesis test.

===Mixture of Gaussian Dataset===
Compared with the original GAN (DC-GAN), OT-GAN has a statistically consistent objective, so the generator is not updated in a wrong direction even when the signal that the cost function provides to the generator is poor. To demonstrate this advantage, the authors compared OT-GAN with the original GAN loss (DAN-S) on a simple task: recovering all 8 modes of a mixture of 8 Gaussians whose means are arranged in a circle. MLPs with ReLU activation functions were used for this task. The critic was updated for only 15K iterations, and the generator distribution was then tracked for another 25K iterations. The results showed that the original GAN experiences mode collapse after the discriminator is fixed, while OT-GAN recovers all 8 modes of the Gaussian mixture.

[[File: 5_1.png|600px]]

===CIFAR-10===

The CIFAR-10 dataset was then used to inspect the effect of batch size on the training process and on image quality. OT-GAN and four other methods were compared using the "inception score" as the criterion. Figure 3 shows how the inception score (y-axis) changes as the number of iterations increases. Scores for four different batch sizes (200, 800, 3200 and 8000) were compared. The results show that a larger batch size, which is more likely to cover more modes of the data distribution, leads to a more stable model with a higher inception score. However, a large batch size also requires a high-performance computational environment. The sample quality of all 5 methods, run with a batch size of 8000, is compared in Table 1, where OT-GAN has the best score.

The OT-GAN was trained using the Adam optimizer. The learning rate was set to <math> 0.0003, \beta_1 = 0.5, \beta_2 = 0.999 </math>. The OT-GAN algorithm also has two additional hyperparameters for the Sinkhorn algorithm. The first is the number of iterations to run the algorithm, and the second, <math> 1 / \lambda </math>, is the entropy penalty of alignments. The authors found that a value of 500 worked well for both hyperparameters. The network uses the following architecture:

[[File: cf10gc.png|600px]]

[[File: 5_2.png|600px]]

Figure 4 below shows samples generated by the OT-GAN trained with a batch size of 8000. Figure 5 below shows random samples from a model trained with the same architecture and hyperparameters, but with random matching of samples in place of optimal transport.

[[File: ot_gan_cifar_10_samples.png|600px]]

In order to show the advantage of learning the cost function adversarially, the CIFAR-10 experiment was re-run with the cost as follows:

[[File: OTGAN_CosineDist.png|250px]]

When using this fixed cost and keeping the other experiment settings constant, the max inception score dropped from 8.47 with learned to 4.93 with fixed cost functions. The results of the fixed cost are seen in Figure 8 below.

[[File: OTGAN_fixedDist.png|600px]]

===ImageNet Dogs===

In order to investigate the performance of OT-GAN on higher-resolution images, the dog subset of ImageNet (128*128) was used to train the model. Figure 6 shows that OT-GAN produces fewer nonsensical images and has a higher inception score than DC-GAN.

[[File: 5_3.png|600px]]

To analyze mode collapse in GANs the authors trained both types of GANs for a large number of epochs. They find the DCGAN shows mode collapse as soon as 900 epochs. They trained the OT-GAN for 13000 epochs and saw no evidence of mode collapse or less diversity in the samples. Samples can be viewed in Figure 9.

[[File: ModelCollapseImageNetDogs.png|600px]]

===Conditional Generation of Birds===

The last experiment compared OT-GAN with three popular GAN models on text-to-image generation, demonstrating its performance on conditional image synthesis. As shown in Table 2, OT-GAN achieved a higher inception score than the other three models.

[[File: 5_4.png|600px]]

The algorithm used to obtain the results above is a conditional variant of '''Algorithm 1''', generalized to include conditional information <math>s</math> such as a text description of an image. The modified algorithm is outlined in '''Algorithm 2'''.

[[File: paper23_alg2.png|600px]]

==Conclusion==

In this paper, an OT-GAN method was proposed based on optimal transport theory. A distance metric that combines the primal form of optimal transport and the energy distance was presented for realizing the OT-GAN. The results showed OT-GAN to be uniquely stable when trained with large mini-batches, and state-of-the-art results were achieved on some datasets. One of the advantages of OT-GAN over other GAN models is that OT-GAN can stay on the correct track with an unbiased gradient even if the training of the critic is stopped or the critic presents a weak cost signal. The performance of the OT-GAN can be maintained as the batch size increases, though the computational cost has to be taken into consideration.

==Critique==

The paper presents a variant of GANs by defining a new distance metric based on the primal form of optimal transport and the mini-batch energy distance. The stability was demonstrated through four experiments comparing OT-GAN with other popular methods. However, limitations in computational efficiency were not discussed in much detail, even though, as mentioned in the paper, one downside of OT-GAN is that it requires large amounts of computation and memory. Furthermore, in section 2, the paper lacks an explanation of why mini-batches rather than a vector are used as input when applying the Sinkhorn distance. The explanation of the algorithm in section 4 is also confusing regarding the choice of M for minimizing <math> \mathcal{W}_c </math>. Lastly, the paper lacks a side-by-side comparison with existing GAN variants, so readers may feel that it jumps from one algorithm to another without the necessary explanations.

= Discussion =
We have presented OT-GAN, a new variant of GANs where the generator is trained to minimize a novel distance metric over probability distributions. This metric, which we call mini-batch energy distance, combines optimal transport in primal form with an energy distance defined in an adversarially learned feature space, resulting in a highly discriminative distance function with unbiased mini-batch gradients. OT-GAN was shown to be uniquely stable when trained with large mini-batches and to achieve state-of-the-art results on several common benchmarks.

One downside of OT-GAN, as currently proposed, is that it requires large amounts of computation and memory. We achieve the best results when using very large mini-batches, which increases the time required for each update of the parameters. All experiments in this paper, except for the mixture of Gaussians toy example, were performed using 8 GPUs and trained for several days. In future work, we hope to make the method more computationally efficient, as well as to scale up our approach to multi-machine training to enable generation of even more challenging and high-resolution image data sets.

A unique property of OT-GAN is that the mini-batch energy distance remains a valid training objective even when we stop training the critic. Our implementation of OT-GAN updates the generative model more often than the critic, where GANs typically do this the other way around (see e.g. Gulrajani et al., 2017). As a result, we learn a relatively stable transport cost function <math>c(x, y)</math>, describing how (dis)similar two images are, as well as an image embedding function <math>v_\eta(x)</math> capturing the geometry of the training data. Preliminary experiments suggest these learned functions can be used successfully for unsupervised learning and other applications, which we plan to investigate further in future work.

==Reference==
Salimans, Tim, Han Zhang, Alec Radford, and Dimitris Metaxas. "Improving GANs using optimal transport." (2018).

Do Deep Neural Networks Suffer from Crowding

2018-04-21T04:21:44Z

F7xia: /* Drawbacks of CNNs */

= Introduction =
Since the increase in popularity of Deep Neural Networks (DNNs), there has been increased research in making machines capable of recognizing objects the same way humans do. Humans can recognize objects in ways that are invariant to scale, translation, and clutter. Crowding is a visual effect suffered by humans, in which an object that can be recognized in isolation can no longer be recognized when other objects, called flankers, are placed close to it. This paper focuses on studying the impact of crowding on DNNs trained for object recognition by adding clutter to the images and then analyzing which models and settings suffer less from such effects.

[[File:paper25_fig_crowding_ex.png|center|600px]]
The figure shows a visual example of crowding [3]. Keep your eyes still and look at the dot in the center and try to identify the "A" in the two circles. You should see that it is much easier to make out the "A" in the right than in the left circle. The same "A" exists in both circles, however, the left circle contains flankers which are those line segments.

Another common example illustrating the same effect:
[[File:crowding-tigger.jpg|center|600px]]

===Drawbacks of CNNs===
CNNs fall short in explaining human perceptual invariance. Firstly, CNNs typically take input at a single uniform resolution. Biological measurements suggest that resolution is not uniform across the human visual field, but rather decays with eccentricity, i.e. distance from the center of focus. The major cause of this issue is the pooling layer in the CNN structure. Pooling is an efficient technique but loses important spatial information. Pooling is also unable to capture the hierarchical structure of the image, which is crucial to viewpoint problems. Even more importantly, CNNs rely not only on weight-sharing but also on data augmentation to achieve transformation invariance, so a lot of processing is needed for CNNs.

The paper investigates two types of DNNs for crowding: traditional deep convolutional neural networks (DCNNs) and a multi-scale eccentricity-dependent model. The latter is an extension of the DCNN inspired by the retina, in which the receptive field size of the convolutional filters grows with increasing distance from the center of the image, called the eccentricity (explained below). The authors focus on the dependence of crowding on image factors, such as flanker configuration, target-flanker similarity, target eccentricity and premature pooling in particular. Along with that, there is major emphasis on reducing the training time of the networks since the motive is to have a simple network capable of learning space-invariant features.

= Models =
The authors describe two kinds of DNN architectures: Deep Convolutional Neural Networks, and eccentricity dependent networks, with varying pooling strategies across space and scale. Of particular note is the pooling operation, as many researchers have suggested that this may be the cause of crowding in human perception.

== Deep Convolutional Neural Networks ==
The DCNN is a basic architecture with 3 convolutional layers, spatial 3x3 max-pooling with varying strides and a fully connected layer for classification, as shown in the figure below.
[[File:DCNN.png|800px|center]]

The network is fed with images resized to 60x60, with mini-batches of 128 images, 32 feature channels for all convolutional layers, and convolutional filters of size 5x5 and stride 1.

As highlighted earlier, the effect of pooling is the main consideration, and hence three different configurations are investigated:

# '''No total pooling''' Feature map sizes decrease only due to boundary effects, as the 3x3 max pooling has stride 1. The square feature map sizes after each pool layer are 60-54-48-42.
# '''Progressive pooling''' 3x3 pooling with a stride of 2 halves the square size of the feature maps, until we pool over what remains in the final layer, getting rid of any spatial information before the fully connected layer. (60-27-11-1).
# '''At end pooling''' Same as no total pooling, but before the fully connected layer, max-pool over the entire feature map. (60-54-48-1).
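
The feature map sizes quoted in the three configurations above can be reproduced with a short calculation. The sketch below assumes unpadded ("valid") 5x5 convolutions and 3x3 pooling windows, which is consistent with the quoted numbers but is not stated explicitly here; the helper names are hypothetical.

```python
def window_out(size, kernel, stride):
    """Output width of an unpadded ('valid') convolution or pooling window."""
    return (size - kernel) // stride + 1


def feature_map_sizes(input_size, pool_strides, pool_all_at_end):
    """Square feature map size after each conv + pool layer."""
    sizes = [input_size]
    size = input_size
    last = len(pool_strides) - 1
    for layer, stride in enumerate(pool_strides):
        size = window_out(size, kernel=5, stride=1)           # 5x5 convolution, stride 1
        if layer == last and pool_all_at_end:
            size = 1                                          # max-pool over the whole map
        else:
            size = window_out(size, kernel=3, stride=stride)  # 3x3 max-pooling
        sizes.append(size)
    return sizes


no_total = feature_map_sizes(60, [1, 1, 1], pool_all_at_end=False)
progressive = feature_map_sizes(60, [2, 2, 2], pool_all_at_end=True)
at_end = feature_map_sizes(60, [1, 1, 1], pool_all_at_end=True)
print(no_total, progressive, at_end)
# [60, 54, 48, 42] [60, 27, 11, 1] [60, 54, 48, 1]
```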

==Eccentricity-dependent Model==
In order to take care of the scale invariance in the input image, the eccentricity dependent DNN is utilized. This was proposed as a model of the human visual cortex by [https://arxiv.org/pdf/1406.1770.pdf Poggio et al.] and later further studied in [2]. The main intuition behind this architecture is that as we increase eccentricity, the receptive fields also increase and hence the model will become invariant to changing input scales. The authors note that the width of each scale is roughly related to the amount of translation invariance for objects at that scale, simply because once the object is outside that window, the filter no longer observes it. Therefore, the authors say that the architecture emphasizes scale invariance over translation invariance, in contrast to traditional DCNNs. From a biological perspective, eye movement can compensate for the limitations of translation invariance, but compensating for scale invariance requires changing distance from the object. In this model, the input image is cropped into varying scales (11 crops increasing by a factor of <math>\sqrt{2}</math>, which are then resized to 60x60 pixels) and then fed to the network. Exponentially interpolated crops are used over linearly interpolated crops since they produce fewer boundary effects while maintaining the same behavior qualitatively. The model computes an invariant representation of the input by sampling the inverted pyramid at a discrete set of scales with the same number of filters at each scale. Since the same number of filters is used for each scale, the smaller crops will be sampled at a high resolution while the larger crops will be sampled with a low resolution. These scales are fed into the network as input channels to the convolutional layers and share the weights across scale and space. Due to the downsampling of the input image, this is equivalent to having receptive fields of varying sizes.
Intuitively, this means that the network generalizes what it learns across scales. This is guaranteed during back-propagation by averaging the error derivatives over all scale channels, then using the averages to compute weight adjustments. The same set of weight adjustments is applied to the convolutional units across the different scale channels.
[[File:EDM.png|2000x450px|center]]
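
The multi-scale input described above can be sketched as follows. The sketch assumes that the smallest crop is 60 pixels wide (so that 11 crops growing by <math>\sqrt{2}</math> end at 1920 pixels), that crops are centered on the image, and that nearest-neighbour sampling is used for resizing; none of these details are confirmed by the text, and the function names are hypothetical.

```python
import numpy as np


def crop_sizes(smallest=60, n_scales=11, factor=np.sqrt(2)):
    """Widths of the centered crops, each a factor sqrt(2) larger than the last."""
    return [int(round(smallest * factor ** i)) for i in range(n_scales)]


def multiscale_crops(image, out_size=60, n_scales=11):
    """Stack of centered crops, each resampled to out_size x out_size."""
    height, width = image.shape
    cy, cx = height // 2, width // 2
    crops = []
    for size in crop_sizes(out_size, n_scales):
        half = size // 2
        top, left = max(cy - half, 0), max(cx - half, 0)
        crop = image[top:top + size, left:left + size]
        # Nearest-neighbour downsampling: pick out_size evenly spaced rows and columns.
        rows = np.arange(out_size) * crop.shape[0] // out_size
        cols = np.arange(out_size) * crop.shape[1] // out_size
        crops.append(crop[np.ix_(rows, cols)])
    return np.stack(crops)


image = np.zeros((1920, 1920), dtype=np.float32)
image[900:1020, 900:1020] = 1.0   # a 120-pixel stand-in "target" at the center
scales = multiscale_crops(image)  # shape (11, 60, 60)
```

The smallest crop sees only the center of the image at full resolution, while the largest crop covers the whole image at 1/32 of the resolution, which is what gives the effect of receptive fields that grow with eccentricity.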

The architecture of this model is the same as the previous DCNN model, the only change being that the filters are applied to each of the scales; since the weights are shared across scales, the number of parameters remains the same as in the DCNN models. The authors perform two kinds of pooling: spatial pooling, for which the aforementioned ''At end pooling'' is used, and scale pooling, which reduces the number of scales by taking the maximum value of corresponding locations in the feature maps across multiple scales. Scale pooling has three configurations: (1) at the beginning, in which all the different scales are pooled together after the first layer, 11-1-1-1-1; (2) progressively, 11-7-5-3-1; and (3) at the end, 11-11-11-11-1, in which all 11 scales are pooled together at the last layer.
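
Scale pooling as described above is a maximum over the scale axis at corresponding feature map locations. The sketch below illustrates it on a random feature tensor. How neighbouring scales are grouped in the progressive configuration is not specified here, so the sliding windows of 5, 3, 3 and 3 adjacent scales are purely an assumption chosen to reproduce the 11-7-5-3-1 counts, and the convolutional layers that sit between pooling steps in the real network are omitted.

```python
import numpy as np


def pool_scales(features, window):
    """Max over a sliding window of adjacent scales (stride 1, no padding)."""
    n_scales = features.shape[0]
    return np.stack([
        features[start:start + window].max(axis=0)
        for start in range(n_scales - window + 1)
    ])


# Hypothetical feature tensor with axes (scale, channel, height, width).
features = np.random.default_rng(0).random((11, 4, 12, 12))

# "At the end": all 11 scales are kept, then pooled with a single max.
at_end = features.max(axis=0, keepdims=True)

# "Progressively": assumed window sizes giving 11 -> 7 -> 5 -> 3 -> 1 scales.
progressive_counts = [features.shape[0]]
pooled = features
for window in (5, 3, 3, 3):
    pooled = pool_scales(pooled, window)
    progressive_counts.append(pooled.shape[0])
```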

===Contrast Normalization===
Since there are multiple scales of an input image, in some experiments, normalization is performed such that the sum of the pixel intensities in each scale is in the same range [0,1] (this is to prevent smaller crops, which have more non-black pixels, from disproportionately dominating max-pooling across scales). The normalized pixel intensities are then divided by a factor proportional to the crop area [[File:sqrtf.png|60px]] where i=1 is the smallest crop.

=Experiments=
Targets are the set of objects to be recognized and flankers are the set of objects the model has not been trained to recognize, which act as clutter with respect to these target objects. The target objects are the even MNIST numbers having translational variance (shifted to different locations of the image along the horizontal axis), while flankers are drawn from the odd MNIST numbers, the notMNIST dataset (which contains letters of the alphabet) and the Omniglot dataset (which contains characters). Examples of the target and flanker configurations are shown below:
[[File:eximages.png|800px|center]]

The target and the flanker are referred to as ''a'' and ''x'' respectively, with the following four configurations:
# No flankers. Only the target object. (a in the plots)
# One central flanker closer to the center of the image than the target. (xa)
# One peripheral flanker closer to the boundary of the image than the target. (ax)
# Two flankers spaced equally around the target, being both the same object, see Figure 1 above for an example (xax).

Training is done using backpropagation with images of size <math>1920 px^2</math> containing embedded target objects and flankers of size <math>120 px^2</math>. The training and test images are divided as per the usual MNIST configuration. To determine if there is a difference between the peripheral flankers and the central flankers, all the tests are performed in the right half image plane.
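
The four flanker configurations listed above can be sketched as a small image-composition routine. It assumes 1920x1920 canvases, 120x120 objects, vertical centering, and targets in the right half of the image (so that "central" means a smaller x-coordinate); the function names and the constant-valued stand-in patches are hypothetical, and real experiments would paste resized MNIST, notMNIST or Omniglot images instead.

```python
import numpy as np

IMAGE_SIZE = 1920   # canvas width and height in pixels
OBJECT_SIZE = 120   # width and height of each embedded object


def place(canvas, patch, center_x):
    """Paste a patch onto the canvas, vertically centered, at horizontal position center_x."""
    top = (canvas.shape[0] - patch.shape[0]) // 2
    left = center_x - patch.shape[1] // 2
    region = canvas[top:top + patch.shape[0], left:left + patch.shape[1]]
    np.maximum(region, patch, out=region)  # region is a view, so this edits the canvas


def make_image(target, flanker, config, eccentricity, spacing):
    """Build one test image for config in {'a', 'xa', 'ax', 'xax'}."""
    if config not in ("a", "xa", "ax", "xax"):
        raise ValueError(f"unknown configuration: {config}")
    canvas = np.zeros((IMAGE_SIZE, IMAGE_SIZE), dtype=np.float32)
    target_x = IMAGE_SIZE // 2 + eccentricity     # target in the right half plane
    place(canvas, target, target_x)
    if config in ("xa", "xax"):
        place(canvas, flanker, target_x - spacing)  # central flanker
    if config in ("ax", "xax"):
        place(canvas, flanker, target_x + spacing)  # peripheral flanker
    return canvas


target = np.ones((OBJECT_SIZE, OBJECT_SIZE), dtype=np.float32)
flanker = np.full((OBJECT_SIZE, OBJECT_SIZE), 0.5, dtype=np.float32)
images = {
    config: make_image(target, flanker, config, eccentricity=240, spacing=120)
    for config in ("a", "xa", "ax", "xax")
}
```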

==DNNs trained with Target and Flankers==
This is a constant spacing training setup where identical flankers are placed at a distance of 120 pixels on either side of the target (xax), with the target having translational variance. The tests are evaluated on (i) the DCNN with at end pooling, and (ii) the eccentricity-dependent model with 11-11-11-11-1 scale pooling, at end spatial pooling and contrast normalization. The results are reported for the different flanker types <math>(xax,ax, xa)</math> at test.
[[File:result1.png|x450px|center]]

===Observations===
* With the flanker configuration same as the training one, models are better at recognizing objects in clutter rather than isolated objects for all image locations
* If the target-flanker spacing is changed, then models perform worse
* The eccentricity model is much better at recognizing objects in isolation than the DCNN because the multi-scale crops divide the image into discrete regions, letting the model learn from image parts as well as the whole image
* Only the eccentricity-dependent model is robust to different flanker configurations not included in training when the target is centered.

==DNNs trained with Images with the Target in Isolation==
Here the target objects are in isolation and with translational variance while the test-set is the same set of flanker configurations as used before. The constant spacing and constant eccentricity effect have been evaluated.

[[File:result2.png|750x400px|center]]

In addition to the evaluation of DCNNs at a constant target eccentricity of 240 pixels, here they are tested with images in which the target is fixed at 720 pixels from the center of the image, as shown in Fig 3. Since the target is already at the edge of the visual field, a flanker cannot be more peripheral in the image than the target. The same conclusions can be drawn as for the 240-pixel target eccentricity: the closer the flanker is to the target, the more accuracy decreases. Also, it can be seen that when the target is close to the image boundary, recognition is poor because of boundary effects eroding away information about the target.

[[File:paper25_supplemental1.png|800px|center]]

The authors also test the effect of flankers from different datasets on a DCNN model with at end pooling, with results shown in Fig. 7 below. Omniglot flankers crowd less than MNIST digits, and the authors note that this is because they are visually similar to MNIST digits, but are not actually digits, and thus activate the model's convolutional filters less than MNIST digits. The notMNIST flankers, however, result in more crowding. This is due to the fact that the different font style results in more high-intensity pixels and edges. The intensity distributions for the 3 datasets are shown in the histograms in Fig. 12 below. The correlation between crowding and the relative frequency of high-intensity pixels can be seen from this figure.

[[File:crowding_at_end_pooling.png|750px|center]]

[[File:DCNN dataset histogram.png|750px|center]]

===DCNN Observations===
* Accuracy decreases with the increase in the number of flankers.
* Unsurprisingly, CNNs are capable of being invariant to translations.
* In the constant target eccentricity setup, where the target is fixed at the center of the image with varying target-flanker spacing, we observe that as the distance between target and flankers increases, recognition gets better.
* Spatial pooling helps the network in learning invariance.
* Flankers similar to the target object help in recognition since they activate the convolutional filters more.
* notMNIST flankers lead to more crowding since they have many more edges and white image pixels, which activate the convolutional layers more.

===Eccentric Model===
The set-up is the same as explained earlier. The spatial pooling is kept constant, and the effect of pooling across scales is investigated. The three configurations for scale pooling are (i) at the beginning, (ii) progressively and (iii) at the end.
[[File:result3.png|750x400px|center]]

====Observations====
* The recognition accuracy is dependent on the eccentricity of the target object.
* If the target is placed at the center and no contrast normalization is done, then the recognition accuracy is high since this model concentrates the most on the central region of the image.
* If contrast normalization is done, then all the scales contribute equally and hence the eccentricity dependence is removed.
* Early pooling is harmful since it may discard information that would have been useful to the network later on.

Without contrast normalization, the model focuses more on the middle portion of the image, which is sampled at high resolution, so a target at the center is recognized well. With normalization, all the segments of the image contribute to the classification, so the overall accuracy is lower but the system becomes robust to changes in eccentricity.

==Complex Clutter==
Here, the targets are randomly embedded into images of the Places dataset and shifted horizontally in order to investigate model robustness when the target is not at the image center. Tests are performed on the DCNN and the eccentricity model with and without contrast normalization using at end pooling. The results are shown in Figure 9 below.

[[File:result4.png|750x400px|center]]

====Observations====
* Only the eccentricity model without contrast normalization can recognize the target, and only when the target is close to the image center.
* The eccentricity model does not need to be trained on different types of clutter to become robust to those types of clutter, but it needs to fixate on the relevant part of the image to recognize the target. If it can fixate on the relevant part of the image, it can still discriminate it, even at different scales. This implies that the eccentricity model is robust to clutter.

=Conclusions=
This paper investigates the effect of crowding on DNNs. Simply adding clutter to the training images did not improve performance. It is often assumed that training a network with data similar to the test data will also give good results in a general scenario, but that was not the case here: the models trained with flankers did not give ideal results for the target objects. The following 4 factors influenced crowding in DNNs:
*'''Flanker Configuration''': When models are trained with images of objects in isolation, adding flankers harms recognition. Adding two flankers is the same or worse than adding just one and the smaller the spacing between flanker and target, the more crowding occurs. This is because the pooling operation merges nearby responses, such as the target and flankers if they are close.
*'''The Similarity between target and flanker''': Flankers more similar to targets cause more crowding, because of the selectivity property of the learned DNN filters.
*'''Dependence on target location and contrast normalization''': In DCNNs and eccentricity-dependent models with contrast normalization, recognition accuracy is the same across all eccentricities. In eccentricity-dependent networks without contrast normalization, recognition does not decrease despite the presence of clutter when the target is at the center of the image.
*'''Effect of pooling''': Adding pooling leads to better recognition accuracy of the models. Yet, in the eccentricity model, pooling across the scales too early in the hierarchy leads to lower accuracy.
* The Eccentricity Dependent Models can be used for modeling the feedforward path of the primate visual cortex.
* If target locations are proposed, then the system can become even more robust and hence a simple network can become robust to clutter while also reducing the amount of training data and time needed

=Critique=
This paper only tries to check the impact of flankers on targets as to how crowding can affect recognition but it does not propose anything novel in terms of architecture to take care of such type of crowding. The paper only shows that the eccentricity based model does better (than plain DCNN model) when the target is placed at the center of the image but maybe windowing over the frames the same way that a convolutional model passes a filter over an image, instead of taking crops starting from the middle, might help.

This paper focuses on image classification. For a stronger argument, their model could be applied to the task of object detection. Perhaps crowding does not have as large of an impact when the objects of interest are localized by a region proposal network. Further, the artificial crowding introduced in the paper may not be random enough for the neural network to learn to classify the object of interest as opposed to the entire cluster of objects. For example, in the case of an even MNIST digit being flanked by two odd MNIST digits, there are only 25 possible combinations of flankers and targets.
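
The count of 25 mentioned above follows from pairing each of the 5 even target digits with one of the 5 odd flanker digits, assuming (as in the xax configuration) that both flankers are the same digit class. A quick enumeration, with hypothetical variable names:

```python
from itertools import product

targets = [0, 2, 4, 6, 8]    # even MNIST digit classes used as targets
flankers = [1, 3, 5, 7, 9]   # odd MNIST digit classes used as flankers

# Both flankers are the same class, so one (target, flanker) pair defines a combination.
combinations = list(product(targets, flankers))
print(len(combinations))  # 25
```

This counts digit classes only; the individual handwritten samples within each class still vary.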

This paper does not provide a convincing argument that the problem of crowding as experienced by humans somehow shares a similar mechanism to the problem of DNN accuracy falling when there is more clutter in the scene. The multi-scale architecture does not appear similar to the distribution of rods and cones in the retina[https://www.ncbi.nlm.nih.gov/books/NBK10848/figure/A763/?report=objectonly]. It might be that the eccentric model does well when the target is centered because it is being sampled by more scales, not because it is similar to a primate visual cortex, and primates are able to recognize an object in clutter when looking directly at it.

=References=
# Volokitin A, Roig G, Poggio T:"Do Deep Neural Networks Suffer from Crowding?" Conference on Neural Information Processing Systems (NIPS). 2017
# Francis X. Chen, Gemma Roig, Leyla Isik, Xavier Boix and Tomaso Poggio: "Eccentricity Dependent Deep Neural Networks for Modeling Human Vision" Journal of Vision. 17. 808. 10.1167/17.10.808.
# Harrison WJ, Remington RW, Mattingley JB: "Visual crowding is anisotropic along the horizontal meridian during smooth pursuit" Journal of Vision. 14. 2014. 10.1167/14.1.21. http://willjharrison.com/2014/01/new-paper-visual-crowding-is-anisotropic-along-the-horizontal-meridian-during-smooth-pursuit/

Do Deep Neural Networks Suffer from Crowding

2018-04-21T04:20:50Z

F7xia: /* Conclusions */

= Introduction =
Since the increase in popularity of Deep Neural Networks (DNNs), there has been increased research in making machines capable of recognizing objects the same way humans do. Humans can recognize objects in ways that are invariant to scale, translation, and clutter. Crowding is a visual effect suffered by humans, in which an object that can be recognized in isolation can no longer be recognized when other objects, called flankers, are placed close to it. This paper focuses on studying the impact of crowding on DNNs trained for object recognition by adding clutter to the images and then analyzing which models and settings suffer less from such effects.

[[File:paper25_fig_crowding_ex.png|center|600px]]
The figure shows a visual example of crowding [3]. Keep your eyes still and look at the dot in the center and try to identify the "A" in the two circles. You should see that it is much easier to make out the "A" in the right than in the left circle. The same "A" exists in both circles, however, the left circle contains flankers which are those line segments.

Another common example to visualize the same:
[[File:crowding-tigger.jpg|center|600px]]

===Drawbacks of CNNs===
CNNs fall short in explaining human perceptual invariance. Firstly, CNNs typically take input at a single uniform resolution. Biological measurements suggest that resolution is not uniform across the human visual field, but rather decays with eccentricity, i.e. distance from the center of focus. The major cause of this issue is the pooling layer in CNN structure. The pooling is an efficient technique but loses important spatial information. Pooling is also not capable to capture the hierarchical structure in image, which is also crucial to view point problems. Even more importantly, CNNs rely not only on weight-sharing but also on data augmentation to achieve transformation invariance and so obviously a lot of processing is needed for CNNs.

The paper investigates two types of DNNs for crowding: traditional deep convolutional neural networks (DCNN) and a multi-scale eccentricity-dependent model which is an extension of the DCNNs and inspired by the retina where the receptive field size of the convolutional filters in the model grows with increasing distance from the center of the image, called the eccentricity and is explained below. The authors focus on the dependence of crowding on image factors, such as flanker configuration, target-flanker similarity, target eccentricity and premature pooling in particular. Along with that, there is major emphasis on reducing the training time of the networks since the motive is to have a simple network capable of learning space-invariant features.

= Models =
The authors describe two kinds of DNN architectures: Deep Convolutional Neural Networks, and eccentricity dependent networks, with varying pooling strategies across space and scale. Of particular note is the pooling operation, as many researchers have suggested that this may be the cause of crowding in human perception.

== Deep Convolutional Neural Networks ==
The DCNN is a basic architecture with 3 convolutional layers, spatial 3x3 max-pooling with varying strides and a fully connected layer for classification as shown in the below figure.
[[File:DCNN.png|800px|center]]

The network is fed with images resized to 60x60, with mini-batches of 128 images, 32 feature channels for all convolutional layers, and convolutional filters of size 5x5 and stride 1.

As highlighted earlier, the effect of pooling is the main consideration, and hence three different configurations are investigated:

# '''No total pooling''' Feature maps sizes decrease only due to boundary effects, as the 3x3 max pooling has stride 1. The square feature maps sizes after each pool layer are 60-54-48-42.
# '''Progressive pooling''' 3x3 pooling with a stride of 2 halves the square size of the feature maps, until we pool over what remains in the final layer, getting rid of any spatial information before the fully connected layer. (60-27-11-1).
# '''At end pooling''' Same as no total pooling, but before the fully connected layer, max-pool over the entire feature map. (60-54-48-1).

==Eccentricity-dependent Model==
In order to take care of the scale invariance in the input image, the eccentricity dependent DNN is utilized. This was proposed as a model of the human visual cortex by [https://arxiv.org/pdf/1406.1770.pdf, Poggio et al] and later further studied in [2]. The main intuition behind this architecture is that as we increase eccentricity, the receptive fields also increase and hence the model will become invariant to changing input scales. The authors note that the width of each scale is roughly related to the amount of translation invariance for objects at that scale, simply because once the object is outside that window, the filter no longer observes it. Therefore, the authors say that the architecture emphasizes scale invariance over translation invariance, in contrast to traditional DCNNs. From a biological perspective, eye movement can compensate for the limitations of translation invariance, but compensating for scale invariance requires changing distance from the object. In this model, the input image is cropped into varying scales (11 crops increasing by a factor of <math>\sqrt{2}</math> which are then resized to 60x60 pixels) and then fed to the network. Exponentially interpolated crops are used over linearly interpolated crops since they produce fewer boundary effects while maintaining the same behavior qualitatively. The model computes an invariant representation of the input by sampling the inverted pyramid at a discrete set of scales with the same number of filters at each scale. Since the same number of filters are used for each scale, the smaller crops will be sampled at a high resolution while the larger crops will be sampled with a low resolution. These scales are fed into the network as an input channel to the convolutional layers and share the weights across scale and space. Due to the downsampling of the input image, this is equivalent to having receptive fields of varying sizes. 
Intuitively, this means that the network generalizes what it learns across scales. This is guaranteed during back-propagation by averaging the error derivatives over all scale channels, then using the averages to compute weight adjustments. The same set of weight adjustments is applied to the convolutional units across the different scale channels.
[[File:EDM.png|2000x450px|center]]

The architecture of this model is the same as the previous DCNN model with the only change being the extra filters added for each of the scales, so the number of parameters remains the same as DCNN models. The authors perform spatial pooling, the aforementioned ''At end pooling'' is used here, and scale pooling which helps in reducing the number of scales by taking the maximum value of corresponding locations in the feature maps across multiple scales. It has three configurations: (1) at the beginning, in which all the different scales are pooled together after the first layer, 11-1-1-1-1 (2) progressively, 11-7-5-3-1 and (3) at the end, 11-11-11-11-1, in which all 11 scales are pooled together at the last layer.

===Contrast Normalization===
Since there are multiple scales of an input image, in some experiments, normalization is performed such that the sum of the pixel intensities in each scale is in the same range [0,1] (this is to prevent smaller crops, which have more non-black pixels, from disproportionately dominating max-pooling across scales). The normalized pixel intensities are then divided by a factor proportional to the crop area [[File:sqrtf.png|60px]] where i=1 is the smallest crop.

=Experiments=
Targets are the set of objects to be recognized and flankers are the set of objects the model has not been trained to recognize, which act as clutter with respect to these target objects. The target objects are the even MNIST numbers having translational variance (shifted at different locations of the image along the horizontal axis), while flankers are from odd MNIST numbers, not MNIST dataset (contains alphabet letters) and Omniglot dataset (contains characters). Examples of the target and flanker configurations are shown below:
[[File:eximages.png|800px|center]]

The target and the flanker are referred to as ''a'' and ''x'' respectively, in the following four configurations:
# No flankers. Only the target object. (a in the plots)
# One central flanker closer to the center of the image than the target. (xa)
# One peripheral flanker closer to the boundary of the image than the target. (ax)
# Two flankers spaced equally around the target, both being the same object; see Figure 1 above for an example. (xax)

Training is done using backpropagation with images of size <math>1920 px^2</math> with embedded target objects and flankers of size <math>120 px^2</math>. The training and test images are divided as per the usual MNIST configuration. To determine whether there is a difference between peripheral flankers and central flankers, all tests are performed in the right half of the image plane.

==DNNs trained with Target and Flankers==
This is a constant spacing training setup where identical flankers are placed at a distance of 120 pixels on either side of the target (xax), with the target having translational variance. The tests are evaluated on (i) a DCNN with at end pooling, and (ii) the eccentricity-dependent model with 11-11-11-11-1 scale pooling, at end spatial pooling and contrast normalization. The results are reported for the different flanker types <math>(xax, ax, xa)</math> at test.
[[File:result1.png|x450px|center]]

===Observations===
* When the flanker configuration at test is the same as in training, models are better at recognizing objects in clutter than isolated objects, at all image locations.
* If the target-flanker spacing is changed, the models perform worse.
* The eccentricity model is much better at recognizing objects in isolation than the DCNN, because the multi-scale crops divide the image into discrete regions, letting the model learn from image parts as well as the whole image.
* Only the eccentricity-dependent model is robust to flanker configurations not included in training when the target is centered.

==DNNs trained with Images with the Target in Isolation==
Here the models are trained on target objects in isolation with translational variance, while the test set uses the same flanker configurations as before. Both the constant spacing and the constant eccentricity setups are evaluated.

[[File:result2.png|750x400px|center]]

In addition to the evaluation of DCNNs at a constant target eccentricity of 240 pixels, they are also tested with images in which the target is fixed at 720 pixels from the center of the image, as shown in Fig 3. Since the target is already at the edge of the visual field, a flanker cannot be more peripheral in the image than the target. The same conclusions as for the 240-pixel target eccentricity can be drawn: the closer the flanker is to the target, the more the accuracy decreases. It can also be seen that when the target is close to the image boundary, recognition is poor because boundary effects erode information about the target.

[[File:paper25_supplemental1.png|800px|center]]

The authors also test the effect of flankers from different datasets on a DCNN model with at end pooling, with results shown in Fig. 7 below. Omniglot flankers crowd less than MNIST digits; the authors note that this is because, although they are visually similar to MNIST digits, they are not actually digits and thus activate the model's convolutional filters less than MNIST digits do. The notMNIST flankers, however, result in more crowding. This is because the different font styles result in more high-intensity pixels and edges. The intensity distributions for the three datasets are shown in the histograms in Fig. 12 below, from which the correlation between crowding and the relative frequency of high-intensity pixels can be seen.

[[File:crowding_at_end_pooling.png|750px|center]]

[[File:DCNN dataset histogram.png|750px|center]]

===DCNN Observations===
* Accuracy decreases with the increase in the number of flankers.
* Unsurprisingly, CNNs are capable of being invariant to translations.
* In the constant target eccentricity setup, where the target is fixed at the center of the image with varying target-flanker spacing, we observe that as the distance between target and flankers increases, recognition gets better.
* Spatial pooling helps the network in learning invariance.
* Flankers similar to the target object cause more crowding, since they activate the convolutional filters more.
* notMNIST flankers lead to more crowding, since they have many more edges and white pixels, which activate the convolutional layers more.

===Eccentric Model===
The set-up is the same as explained earlier. The spatial pooling is kept constant, and the effect of pooling across scales is investigated. The three configurations for scale pooling are (i) at the beginning, (ii) progressively and (iii) at the end.
[[File:result3.png|750x400px|center]]

====Observations====
* The recognition accuracy is dependent on the eccentricity of the target object.
* If the target is placed at the center and no contrast normalization is done, then the recognition accuracy is high since this model concentrates the most on the central region of the image.
* If contrast normalization is done, then all the scales contribute equally and hence the eccentricity dependence is removed.
* Early pooling is harmful, since it may discard information that would have been useful to later layers of the network.

Without contrast normalization, the model focuses on the middle portion of the image at high resolution, so a target at the center is recognized well. With normalization, all segments of the image contribute to the classification; the overall accuracy is then lower, but the system becomes robust to changes in eccentricity.

==Complex Clutter==
Here, the targets are randomly embedded into images of the Places dataset and shifted horizontally in order to investigate model robustness when the target is not at the image center. Tests are performed on the DCNN and on the eccentricity model, with and without contrast normalization, using at end pooling. The results are shown in Figure 9 below.

[[File:result4.png|750x400px|center]]

====Observations====
* Only the eccentricity model without contrast normalization can recognize the target, and only when the target is close to the image center.
* The eccentricity model does not need to be trained on different types of clutter to become robust to those types of clutter, but it needs to fixate on the relevant part of the image to recognize the target. If it can fixate on the relevant part of the image, it can still discriminate it, even at different scales. This implies that the eccentricity model is robust to clutter.

=Conclusions=
This paper investigates the effect of crowding on DNNs. Simply adding clutter to the training data did not improve performance. One might expect that training the network on data similar to the test data would give good results in general, but this was not the case: models trained with flankers did not achieve the ideal results for the target objects. The following four factors influenced crowding in DNNs:
*'''Flanker Configuration''': When models are trained with images of objects in isolation, adding flankers harms recognition. Adding two flankers is the same or worse than adding just one and the smaller the spacing between flanker and target, the more crowding occurs. This is because the pooling operation merges nearby responses, such as the target and flankers if they are close.
*'''The Similarity between target and flanker''': Flankers more similar to targets cause more crowding, because of the selectivity property of the learned DNN filters.
*'''Dependence on target location and contrast normalization''': In DCNNs and eccentricity-dependent models with contrast normalization, recognition accuracy is the same across all eccentricities. In eccentricity-dependent networks without contrast normalization, recognition does not decrease despite the presence of clutter when the target is at the center of the image.
*'''Effect of pooling''': adding pooling leads to better recognition accuracy of the models. Yet, in the eccentricity model, pooling across the scales too early in the hierarchy leads to lower accuracy.
* The Eccentricity Dependent Models can be used for modeling the feedforward path of the primate visual cortex.
* If target locations are proposed, the system can become even more robust; a simple network can thus become robust to clutter while also reducing the amount of training data and time needed.

=Critique=
This paper only examines how flankers affect the recognition of targets, i.e. how crowding affects recognition; it does not propose anything novel in terms of architecture to deal with this type of crowding. The paper only shows that the eccentricity-based model does better than the plain DCNN model when the target is placed at the center of the image. Windowing over the image the same way that a convolutional model passes a filter over an image, instead of taking crops starting from the middle, might help.

This paper focuses on image classification. For a stronger argument, their model could be applied to the task of object detection. Perhaps crowding does not have as large of an impact when the objects of interest are localized by a region proposal network. Further, the artificial crowding introduced in the paper may not be random enough for the neural network to learn to classify the object of interest as opposed to the entire cluster of objects. For example, in the case of an even MNIST digit being flanked by two odd MNIST digits, there are only 25 possible combinations of flankers and targets.

This paper does not provide a convincing argument that the problem of crowding as experienced by humans somehow shares a similar mechanism to the problem of DNN accuracy falling when there is more clutter in the scene. The multi-scale architecture does not appear similar to the distribution of rods and cones in the retina[https://www.ncbi.nlm.nih.gov/books/NBK10848/figure/A763/?report=objectonly]. It might be that the eccentric model does well when the target is centered because it is being sampled by more scales, not because it is similar to a primate visual cortex, and primates are able to recognize an object in clutter when looking directly at it.

=References=
# Volokitin A, Roig G, Poggio T: "Do Deep Neural Networks Suffer from Crowding?" Conference on Neural Information Processing Systems (NIPS). 2017.
# Chen F X, Roig G, Isik L, Boix X, Poggio T: "Eccentricity Dependent Deep Neural Networks for Modeling Human Vision" Journal of Vision. 17. 808. 10.1167/17.10.808.
# Harrison W J, Remington R W, Mattingley J: "Visual crowding is anisotropic along the horizontal meridian during smooth pursuit" Journal of Vision. 14. 2014. 10.1167/14.1.21. http://willjharrison.com/2014/01/new-paper-visual-crowding-is-anisotropic-along-the-horizontal-meridian-during-smooth-pursuit/

Do Deep Neural Networks Suffer from Crowding

2018-04-21T04:20:19Z

F7xia: /* Critique */

= Introduction =
Since the increase in popularity of Deep Neural Networks (DNNs), there has been increased research in making machines capable of recognizing objects the same way humans do. Humans can recognize objects in ways that are invariant to scale, translation, and clutter. Crowding is a visual effect suffered by humans, in which an object that can be recognized in isolation can no longer be recognized when other objects, called flankers, are placed close to it. This paper studies the impact of crowding on DNNs trained for object recognition by adding clutter to the images and then analyzing which models and settings suffer less from such effects.

[[File:paper25_fig_crowding_ex.png|center|600px]]
The figure shows a visual example of crowding [3]. Keep your eyes still, look at the dot in the center and try to identify the "A" in the two circles. You should find that it is much easier to make out the "A" in the right circle than in the left one. The same "A" exists in both circles; however, the left circle also contains flankers, namely the surrounding line segments.

Another common example that illustrates the same effect:
[[File:crowding-tigger.jpg|center|600px]]

===Drawbacks of CNNs===
CNNs fall short in explaining human perceptual invariance. Firstly, CNNs typically take input at a single uniform resolution. Biological measurements suggest that resolution is not uniform across the human visual field, but rather decays with eccentricity, i.e. distance from the center of focus. The major cause of this issue is the pooling layer in the CNN structure. Pooling is an efficient technique, but it loses important spatial information. Pooling is also unable to capture the hierarchical structure in an image, which is crucial for viewpoint problems. Even more importantly, CNNs rely not only on weight-sharing but also on data augmentation to achieve transformation invariance, so a lot of processing is needed for CNNs.

The paper investigates two types of DNNs for crowding: traditional deep convolutional neural networks (DCNNs) and a multi-scale eccentricity-dependent model. The latter is an extension of the DCNN inspired by the retina, in which the receptive field size of the convolutional filters grows with increasing distance from the center of the image (called the eccentricity); it is explained below. The authors focus on the dependence of crowding on image factors such as flanker configuration, target-flanker similarity and target eccentricity, and on premature pooling in particular. In addition, there is a major emphasis on reducing the training time of the networks, since the motive is to have a simple network capable of learning space-invariant features.

= Models =
The authors describe two kinds of DNN architectures: Deep Convolutional Neural Networks, and eccentricity dependent networks, with varying pooling strategies across space and scale. Of particular note is the pooling operation, as many researchers have suggested that this may be the cause of crowding in human perception.

== Deep Convolutional Neural Networks ==
The DCNN is a basic architecture with 3 convolutional layers, spatial 3x3 max-pooling with varying strides and a fully connected layer for classification as shown in the below figure.
[[File:DCNN.png|800px|center]]

The network is fed with images resized to 60x60, with mini-batches of 128 images, 32 feature channels for all convolutional layers, and convolutional filters of size 5x5 and stride 1.

As highlighted earlier, the effect of pooling is the main consideration, and hence three different configurations are investigated:

# '''No total pooling''' Feature map sizes decrease only due to boundary effects, as the 3x3 max pooling has stride 1. The square feature map sizes after each pool layer are 60-54-48-42.
# '''Progressive pooling''' 3x3 pooling with a stride of 2 halves the square size of the feature maps, until we pool over what remains in the final layer, getting rid of any spatial information before the fully connected layer. (60-27-11-1).
# '''At end pooling''' Same as no total pooling, but before the fully connected layer, max-pool over the entire feature map. (60-54-48-1).
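
The feature map sizes quoted in the three configurations can be reproduced with a short calculation. The sketch below assumes that both the 5x5 convolutions and the 3x3 max-pooling use no padding, which is an assumption that happens to match the reported numbers; the function names are illustrative and not taken from the paper.

```python
def window_out(size, kernel, stride):
    """Output width of an unpadded sliding window (convolution or pooling)."""
    return (size - kernel) // stride + 1


def feature_map_sizes(input_size, pool_stride, pool_all_at_end, n_layers=3):
    """Map width at the input and after each conv + pool stage."""
    sizes = [input_size]
    size = input_size
    for layer in range(n_layers):
        size = window_out(size, kernel=5, stride=1)  # 5x5 convolution, stride 1
        is_last = layer == n_layers - 1
        if is_last and pool_all_at_end:
            size = 1  # max-pool over the entire remaining feature map
        else:
            size = window_out(size, kernel=3, stride=pool_stride)  # 3x3 max-pool
        sizes.append(size)
    return sizes


no_total = feature_map_sizes(60, pool_stride=1, pool_all_at_end=False)
progressive = feature_map_sizes(60, pool_stride=2, pool_all_at_end=True)
at_end = feature_map_sizes(60, pool_stride=1, pool_all_at_end=True)

print("no total pooling:   ", no_total)     # [60, 54, 48, 42]
print("progressive pooling:", progressive)  # [60, 27, 11, 1]
print("at end pooling:     ", at_end)       # [60, 54, 48, 1]
```

With stride 1, each stage removes 4 pixels in the convolution and 2 in the pooling, which is the "boundary effect" shrinkage of 6 pixels per layer.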

==Eccentricity-dependent Model==
In order to handle scale invariance in the input image, the eccentricity-dependent DNN is utilized. This was proposed as a model of the human visual cortex by [https://arxiv.org/pdf/1406.1770.pdf Poggio et al.] and later studied further in [2]. The main intuition behind this architecture is that as eccentricity increases, the receptive fields also grow, and hence the model becomes invariant to changing input scales. The authors note that the width of each scale is roughly related to the amount of translation invariance for objects at that scale, simply because once the object is outside that window, the filter no longer observes it. Therefore, the authors say that the architecture emphasizes scale invariance over translation invariance, in contrast to traditional DCNNs. From a biological perspective, eye movement can compensate for the limitations of translation invariance, but compensating for limited scale invariance requires changing the distance from the object. In this model, the input image is cropped at varying scales (11 crops increasing in size by a factor of <math>\sqrt{2}</math>, which are then resized to 60x60 pixels) and then fed to the network. Exponentially interpolated crops are used over linearly interpolated crops since they produce fewer boundary effects while maintaining the same behavior qualitatively. The model computes an invariant representation of the input by sampling the inverted pyramid at a discrete set of scales with the same number of filters at each scale. Since the same number of filters is used for each scale, the smaller crops are sampled at a high resolution while the larger crops are sampled at a low resolution. These scales are fed into the network as input channels to the convolutional layers, and the weights are shared across scale and space. Due to the downsampling of the input image, this is equivalent to having receptive fields of varying sizes.
Intuitively, this means that the network generalizes what it learns across scales. This is guaranteed during back-propagation by averaging the error derivatives over all scale channels and then using the averages to compute weight adjustments. The same set of weight adjustments is applied to the convolutional units across the different scale channels.
[[File:EDM.png|2000x450px|center]]
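
The crop geometry described above can be sketched as follows. The smallest crop width of 60 px is an assumption made here for illustration: with 11 crops growing by <math>\sqrt{2}</math>, it makes the largest crop 1920 px wide, which matches the image size used in the experiments. The helper name <code>crop_widths</code> is hypothetical.

```python
import math


def crop_widths(smallest=60.0, n_scales=11, factor=math.sqrt(2)):
    """Widths of the centered crops, from the smallest to the largest scale."""
    return [smallest * factor ** i for i in range(n_scales)]


widths = crop_widths()
for i, w in enumerate(widths, start=1):
    # Every crop is resized to 60x60, so larger crops are downsampled more,
    # which acts like a larger receptive field at higher eccentricity.
    print(f"scale {i:2d}: crop {w:7.1f} px, downsampled by {w / 60:.2f}x")
```

Under this assumption the smallest crop is seen at full resolution, while the largest crop is downsampled by a factor of about 32.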

The architecture of this model is the same as the previous DCNN model, with the only change being the extra filters added for each of the scales; because the weights are shared across scales, the number of parameters remains the same as in the DCNN models. The authors perform two kinds of pooling: spatial pooling, for which the aforementioned ''At end pooling'' is used, and scale pooling, which reduces the number of scales by taking the maximum value at corresponding locations in the feature maps across multiple scales. Scale pooling has three configurations: (1) at the beginning (11-1-1-1-1), in which all the different scales are pooled together after the first layer; (2) progressively (11-7-5-3-1); and (3) at the end (11-11-11-11-1), in which all 11 scales are pooled together at the last layer.
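
As a minimal sketch of scale pooling, the function below takes one feature map per scale (all of the same spatial size) and keeps the maximum at each location. It shows only the case where a group of scales is merged into one; how the paper groups scales in the progressive 11-7-5-3-1 configuration is not specified here, so that is left out. The toy values are made up.

```python
def scale_max_pool(feature_maps):
    """Element-wise maximum over same-sized 2-D feature maps, one per scale."""
    rows, cols = len(feature_maps[0]), len(feature_maps[0][0])
    return [
        [max(fm[r][c] for fm in feature_maps) for c in range(cols)]
        for r in range(rows)
    ]


# Three toy 2x2 feature maps standing in for three scale channels.
scale_a = [[0.1, 0.9], [0.3, 0.2]]
scale_b = [[0.5, 0.4], [0.1, 0.8]]
scale_c = [[0.2, 0.2], [0.7, 0.6]]

pooled = scale_max_pool([scale_a, scale_b, scale_c])
print(pooled)  # [[0.5, 0.9], [0.7, 0.8]]
```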

===Contrast Normalization===
Since there are multiple scales of an input image, in some experiments, normalization is performed such that the sum of the pixel intensities in each scale is in the same range [0,1] (this is to prevent smaller crops, which have more non-black pixels, from disproportionately dominating max-pooling across scales). The normalized pixel intensities are then divided by a factor proportional to the crop area [[File:sqrtf.png|60px]] where i=1 is the smallest crop.

=Experiments=
Targets are the set of objects to be recognized, and flankers are the set of objects the model has not been trained to recognize, which act as clutter with respect to the target objects. The target objects are the even MNIST numbers with translational variance (shifted to different locations of the image along the horizontal axis), while flankers are drawn from the odd MNIST numbers, the notMNIST dataset (which contains letters of the alphabet) and the Omniglot dataset (which contains characters). Examples of the target and flanker configurations are shown below:
[[File:eximages.png|800px|center]]

The target and the flanker are referred to as ''a'' and ''x'' respectively, in the following four configurations:
# No flankers. Only the target object. (a in the plots)
# One central flanker closer to the center of the image than the target. (xa)
# One peripheral flanker closer to the boundary of the image than the target. (ax)
# Two flankers spaced equally around the target, both being the same object; see Figure 1 above for an example. (xax)

Training is done using backpropagation with images of size <math>1920 px^2</math> with embedded target objects and flankers of size <math>120 px^2</math>. The training and test images are divided as per the usual MNIST configuration. To determine whether there is a difference between peripheral flankers and central flankers, all tests are performed in the right half of the image plane.
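
The four configurations can be read as positions along the horizontal axis of the right half of the image. The sketch below is only an illustration of the naming convention: the function <code>layout</code> and its arguments are hypothetical, and positions are measured in pixels from the image center.

```python
def layout(config, target_ecc, spacing):
    """Return (label, x_center) pairs, with x measured from the image center.

    config is one of 'a', 'xa', 'ax', 'xax'. The target 'a' sits at
    target_ecc in the right half of the image; flankers 'x' sit at a
    distance of spacing from the target.
    """
    if config not in ("a", "xa", "ax", "xax"):
        raise ValueError(f"unknown configuration: {config}")
    objects = []
    if config in ("xa", "xax"):
        objects.append(("x", target_ecc - spacing))  # central flanker
    objects.append(("a", target_ecc))                # target
    if config in ("ax", "xax"):
        objects.append(("x", target_ecc + spacing))  # peripheral flanker
    return objects


for config in ("a", "xa", "ax", "xax"):
    print(config, layout(config, target_ecc=240, spacing=120))
```

Reading each result from left to right gives the configuration name, with the image center to the left of the first object.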

==DNNs trained with Target and Flankers==
This is a constant spacing training setup where identical flankers are placed at a distance of 120 pixels on either side of the target (xax), with the target having translational variance. The tests are evaluated on (i) a DCNN with at end pooling, and (ii) the eccentricity-dependent model with 11-11-11-11-1 scale pooling, at end spatial pooling and contrast normalization. The results are reported for the different flanker types <math>(xax, ax, xa)</math> at test.
[[File:result1.png|x450px|center]]

===Observations===
* When the flanker configuration at test is the same as in training, models are better at recognizing objects in clutter than isolated objects, at all image locations.
* If the target-flanker spacing is changed, the models perform worse.
* The eccentricity model is much better at recognizing objects in isolation than the DCNN, because the multi-scale crops divide the image into discrete regions, letting the model learn from image parts as well as the whole image.
* Only the eccentricity-dependent model is robust to flanker configurations not included in training when the target is centered.

==DNNs trained with Images with the Target in Isolation==
Here the models are trained on target objects in isolation with translational variance, while the test set uses the same flanker configurations as before. Both the constant spacing and the constant eccentricity setups are evaluated.

[[File:result2.png|750x400px|center]]

In addition to the evaluation of DCNNs at a constant target eccentricity of 240 pixels, they are also tested with images in which the target is fixed at 720 pixels from the center of the image, as shown in Fig 3. Since the target is already at the edge of the visual field, a flanker cannot be more peripheral in the image than the target. The same conclusions as for the 240-pixel target eccentricity can be drawn: the closer the flanker is to the target, the more the accuracy decreases. It can also be seen that when the target is close to the image boundary, recognition is poor because boundary effects erode information about the target.

[[File:paper25_supplemental1.png|800px|center]]

The authors also test the effect of flankers from different datasets on a DCNN model with at end pooling, with results shown in Fig. 7 below. Omniglot flankers crowd less than MNIST digits; the authors note that this is because, although they are visually similar to MNIST digits, they are not actually digits and thus activate the model's convolutional filters less than MNIST digits do. The notMNIST flankers, however, result in more crowding. This is because the different font styles result in more high-intensity pixels and edges. The intensity distributions for the three datasets are shown in the histograms in Fig. 12 below, from which the correlation between crowding and the relative frequency of high-intensity pixels can be seen.

[[File:crowding_at_end_pooling.png|750px|center]]

[[File:DCNN dataset histogram.png|750px|center]]

===DCNN Observations===
* Accuracy decreases with the increase in the number of flankers.
* Unsurprisingly, CNNs are capable of being invariant to translations.
* In the constant target eccentricity setup, where the target is fixed at the center of the image with varying target-flanker spacing, we observe that as the distance between target and flankers increases, recognition gets better.
* Spatial pooling helps the network in learning invariance.
* Flankers similar to the target object cause more crowding, since they activate the convolutional filters more.
* notMNIST flankers lead to more crowding, since they have many more edges and white pixels, which activate the convolutional layers more.

===Eccentric Model===
The set-up is the same as explained earlier. The spatial pooling is kept constant, and the effect of pooling across scales is investigated. The three configurations for scale pooling are (i) at the beginning, (ii) progressively and (iii) at the end.
[[File:result3.png|750x400px|center]]

====Observations====
* The recognition accuracy is dependent on the eccentricity of the target object.
* If the target is placed at the center and no contrast normalization is done, then the recognition accuracy is high since this model concentrates the most on the central region of the image.
* If contrast normalization is done, then all the scales contribute equally and hence the eccentricity dependence is removed.
* Early pooling is harmful, since it may discard information that would have been useful to later layers of the network.

Without contrast normalization, the model focuses on the middle portion of the image at high resolution, so a target at the center is recognized well. With normalization, all segments of the image contribute to the classification; the overall accuracy is then lower, but the system becomes robust to changes in eccentricity.

==Complex Clutter==
Here, the targets are randomly embedded into images of the Places dataset and shifted horizontally in order to investigate model robustness when the target is not at the image center. Tests are performed on the DCNN and on the eccentricity model, with and without contrast normalization, using at end pooling. The results are shown in Figure 9 below.

[[File:result4.png|750x400px|center]]

====Observations====
* Only the eccentricity model without contrast normalization can recognize the target, and only when the target is close to the image center.
* The eccentricity model does not need to be trained on different types of clutter to become robust to those types of clutter, but it needs to fixate on the relevant part of the image to recognize the target. If it can fixate on the relevant part of the image, it can still discriminate it, even at different scales. This implies that the eccentricity model is robust to clutter.

=Conclusions=
This paper investigates the effect of crowding on DNNs. Simply adding clutter to the training data did not improve performance. One might expect that training the network on data similar to the test data would give good results in general, but this was not the case: models trained with flankers did not achieve the ideal results for the target objects. The following four factors influenced crowding in DNNs:
*'''Flanker Configuration''': When models are trained with images of objects in isolation, adding flankers harms recognition. Adding two flankers is the same or worse than adding just one and the smaller the spacing between flanker and target, the more crowding occurs. This is because the pooling operation merges nearby responses, such as the target and flankers if they are close.
*'''Similarity between target and flanker''': Flankers more similar to targets cause more crowding, because of the selectivity property of the learned DNN filters.
*'''Dependence on target location and contrast normalization''': In DCNNs and eccentricity-dependent models with contrast normalization, recognition accuracy is the same across all eccentricities. In eccentricity-dependent networks without contrast normalization, recognition does not decrease despite the presence of clutter when the target is at the center of the image.
*'''Effect of pooling''': adding pooling leads to better recognition accuracy of the models. Yet, in the eccentricity model, pooling across the scales too early in the hierarchy leads to lower accuracy.
* The Eccentricity Dependent Models can be used for modeling the feedforward path of the primate visual cortex.
* If target locations are proposed, the system can become even more robust; a simple network can thus become robust to clutter while also reducing the amount of training data and time needed.

=Critique=
This paper only examines how flankers affect the recognition of targets, i.e. how crowding affects recognition; it does not propose anything novel in terms of architecture to deal with this type of crowding. The paper only shows that the eccentricity-based model does better than the plain DCNN model when the target is placed at the center of the image. Windowing over the image the same way that a convolutional model passes a filter over an image, instead of taking crops starting from the middle, might help.

This paper focuses on image classification. For a stronger argument, their model could be applied to the task of object detection. Perhaps crowding does not have as large of an impact when the objects of interest are localized by a region proposal network. Further, the artificial crowding introduced in the paper may not be random enough for the neural network to learn to classify the object of interest as opposed to the entire cluster of objects. For example, in the case of an even MNIST digit being flanked by two odd MNIST digits, there are only 25 possible combinations of flankers and targets.
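
The count of 25 can be checked by enumeration. The sketch below counts class combinations only (not individual digit images) and assumes the xax configuration, in which both flankers are the same object.

```python
from itertools import product

targets = [0, 2, 4, 6, 8]   # even MNIST classes
flankers = [1, 3, 5, 7, 9]  # odd MNIST classes

# Both flankers are identical in xax, so one flanker class and one
# target class fully determine the (flanker, target, flanker) triple.
triples = [(f, t, f) for f, t in product(flankers, targets)]

print(len(triples))  # 25
```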

This paper does not provide a convincing argument that the problem of crowding as experienced by humans somehow shares a similar mechanism to the problem of DNN accuracy falling when there is more clutter in the scene. The multi-scale architecture does not appear similar to the distribution of rods and cones in the retina[https://www.ncbi.nlm.nih.gov/books/NBK10848/figure/A763/?report=objectonly]. It might be that the eccentric model does well when the target is centered because it is being sampled by more scales, not because it is similar to a primate visual cortex, and primates are able to recognize an object in clutter when looking directly at it.

=References=
# Volokitin A, Roig G, Poggio T: "Do Deep Neural Networks Suffer from Crowding?" Conference on Neural Information Processing Systems (NIPS). 2017.
# Chen F X, Roig G, Isik L, Boix X, Poggio T: "Eccentricity Dependent Deep Neural Networks for Modeling Human Vision" Journal of Vision. 17. 808. 10.1167/17.10.808.
# Harrison W J, Remington R W, Mattingley J: "Visual crowding is anisotropic along the horizontal meridian during smooth pursuit" Journal of Vision. 14. 2014. 10.1167/14.1.21. http://willjharrison.com/2014/01/new-paper-visual-crowding-is-anisotropic-along-the-horizontal-meridian-during-smooth-pursuit/

stat946w18/Rethinking the Smaller-Norm-Less-Informative Assumption in Channel Pruning of Convolutional Layers

2018-04-21T04:18:58Z

F7xia: /* Results */

== Introduction ==

With the recent and ongoing surge in low-power, intelligent agents (such as wearables, smartphones, and IoT devices), there exists a growing need for machine learning models to work well in resource-constrained environments. Deep learning models have achieved state-of-the-art results on a broad range of tasks; however, they are difficult to deploy in their original forms. For example, AlexNet (Krizhevsky et al., 2012), a model for image classification, contains 61 million parameters and requires 1.5 billion floating point operations (FLOPs) in one inference pass. A more accurate model, ResNet-50 (He et al., 2016), has 25 million parameters but requires 4.08 billion FLOPs. A high-end desktop GPU such as a Titan Xp is capable of [https://www.nvidia.com/en-us/titan/titan-xp/ (12 TFLOPS (tera-FLOPs per second))], while the Adreno 540 GPU used in a Samsung Galaxy S8 is only capable of [https://gflops.surge.sh (567 GFLOPS)], which is less than 5% of the Titan Xp. Clearly, it would be difficult to deploy and run these models on low-power devices.
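
The "less than 5%" figure follows directly from the two throughput numbers quoted above. The sketch below also derives an idealized lower bound on the time for one AlexNet inference pass; it assumes peak throughput is fully used, which real workloads do not achieve, so the times are illustrative only.

```python
titan_xp_flops = 12e12    # 12 TFLOPS, as quoted above
adreno_540_flops = 567e9  # 567 GFLOPS, as quoted above

ratio = adreno_540_flops / titan_xp_flops
print(f"Adreno 540 / Titan Xp throughput: {ratio:.2%}")

alexnet_flops = 1.5e9     # FLOPs per inference pass, as quoted above
for name, peak in (("Titan Xp", titan_xp_flops), ("Adreno 540", adreno_540_flops)):
    # Idealized lower bound: assumes the device runs at peak throughput.
    print(f"{name}: at least {1000 * alexnet_flops / peak:.3f} ms per pass")
```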

In general, model compression can be accomplished using four main non-mutually exclusive methods (Cheng et al., 2017): weight pruning, quantization, matrix transformations, and weight tying. By non-mutually exclusive, we mean that these methods can be used not only separately but also in combination for compressing a single model; the use of one method does not exclude any of the other methods from being viable.

Ye et al. (2018) explore pruning entire channels in a convolutional neural network (CNN). Past work has mostly focused on norm-based or error-based heuristics to prune channels; instead, Ye et al. (2018) show that their approach is easily reproducible and has favorable qualities from an optimization standpoint. In other words, they argue that the norm-based assumption is not as informative or theoretically justified as their approach, and provide strong empirical evidence of these findings.

== Motivation ==

Some previous works on pruning channel filters (Li et al., 2016; Molchanov et al., 2016) have focused on using the L1 norm to determine the importance of a channel. Ye et al. (2018) show that, in the deep linear convolution case, penalizing the per-layer norm is coarse-grained; they argue that one cannot assign different coefficients to L1 penalties associated with different layers without risking the loss function being susceptible to trivial re-parameterizations. As an example, consider the following deep linear convolutional neural network with modified LASSO loss:

$$\min \mathbb{E}_D \lVert W_{2n} * \dots * W_1 x - y\rVert^2 + \lambda \sum_{i=1}^n \lVert W_{2i} \rVert_1$$

where W are the weights and * is convolution. Here we have chosen the coefficient 0 for the L1 penalty associated with odd-numbered layers and the coefficient 1 for the L1 penalty associated with even-numbered layers. This loss is susceptible to trivial re-parameterizations: without affecting the least-squares loss, we can always reduce the LASSO loss by halving the weights of all even-numbered layers and doubling the weights of all odd-numbered layers.
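
The re-parameterization argument can be checked numerically. The following is a minimal sketch, assuming toy 1-D kernels and four layers (these sizes, and the helper names <code>deep_linear_conv</code> and <code>l1_penalty_even_layers</code>, are illustrative and not from the paper): doubling the odd-numbered layers and halving the even-numbered layers leaves the network output unchanged while halving the penalized L1 term.

```python
import numpy as np

def l1_penalty_even_layers(weights):
    """Sum of L1 norms over even-numbered layers W_2, W_4, ... (layers are 1-indexed)."""
    return sum(np.abs(w).sum() for i, w in enumerate(weights, start=1) if i % 2 == 0)

def deep_linear_conv(weights, x):
    """Apply W_n * ... * W_1 * x as a chain of full 1-D convolutions."""
    out = x
    for w in weights:
        out = np.convolve(w, out)
    return out

rng = np.random.default_rng(0)
weights = [rng.normal(size=3) for _ in range(4)]  # W_1 .. W_4, toy 1-D kernels
x = rng.normal(size=8)

# Re-parameterize: double odd-numbered layers, halve even-numbered layers.
rescaled = [w * 2.0 if i % 2 == 1 else w * 0.5
            for i, w in enumerate(weights, start=1)]

same_output = np.allclose(deep_linear_conv(weights, x),
                          deep_linear_conv(rescaled, x))
penalty_before = l1_penalty_even_layers(weights)
penalty_after = l1_penalty_even_layers(rescaled)
print(same_output, penalty_before, penalty_after)
```

Repeating the rescaling drives the penalty toward zero without ever changing the least-squares term, which is why the penalty alone says little about channel importance.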

Furthermore, batch normalization (Ioffe & Szegedy, 2015) is incompatible with this method of weight regularization. Consider batch normalization at the <math>l</math>-th layer.

<center><math>x^{l+1} = \max\{\gamma \cdot BN_{\mu,\sigma,\epsilon}(W^l * x^l) + \beta, 0\}</math></center>

Due to the batch normalization, any uniform scaling of <math>W^l</math> changes its <math>l_1</math> and <math>l_2</math> norms but has no effect on <math>x^{l+1}</math>. Thus, when trying to minimize weight norms of multiple layers, it is unclear how to properly choose penalties for each layer. Therefore, penalizing the norm of a filter in a deep convolutional network is hard to justify from a theoretical perspective.
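
The scale invariance can be illustrated with a small numerical sketch. It assumes a matrix product standing in for the convolution and batch statistics computed over the first axis; the function names and array sizes are hypothetical. With a nonzero <math>\epsilon</math> the invariance holds only up to a small numerical error, hence the tolerance in the comparison.

```python
import numpy as np

def batch_norm(z, eps=1e-5):
    """Normalize each channel (column) using statistics over the batch axis."""
    return (z - z.mean(axis=0)) / np.sqrt(z.var(axis=0) + eps)

def bn_relu_layer(x, w, gamma, beta, eps=1e-5):
    """x^{l+1} = max(gamma * BN(W x) + beta, 0); x @ w stands in for W * x."""
    return np.maximum(gamma * batch_norm(x @ w, eps) + beta, 0.0)

rng = np.random.default_rng(0)
x = rng.normal(size=(64, 5))          # batch of 64 inputs with 5 features
w = rng.normal(size=(5, 3))           # 3 output channels
gamma = np.array([1.0, 0.5, 2.0])
beta = np.array([0.1, -0.2, 0.0])

out_original = bn_relu_layer(x, w, gamma, beta)
out_scaled = bn_relu_layer(x, 10.0 * w, gamma, beta)   # weights scaled by 10

# The weight norm grows tenfold, yet the layer output is (numerically) unchanged.
print(np.abs(w).sum(), np.abs(10.0 * w).sum())
print(np.allclose(out_original, out_scaled, atol=1e-3))
```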

In contrast with these existing approaches, the authors focus on enforcing sparsity on a tiny set of parameters in the CNN: the scale parameters <math>\gamma</math> of all batch normalization layers. Placing sparsity constraints on <math>\gamma</math> is simpler and easier to monitor; more importantly, the authors put forward two reasons for this choice:

1. Every <math>\gamma</math> always multiplies a normalized random variable, thus the channel importance becomes comparable across different layers by measuring the magnitude values of <math>\gamma</math>;

2. The reparameterization effect across different layers is avoided if the subsequent convolution layer is also batch-normalized. In other words, the effects of scale changes in the <math>\gamma</math> parameters are independent across different layers.

Thus, although not providing a complete theoretical guarantee on loss, Ye et al. (2018) develop a pruning technique that claims to be more justified than norm-based pruning is.

== Method ==

At a high level, Ye et al. (2018) propose that, instead of discovering sparsity by penalizing the per-filter or per-channel norm, one should penalize the batch normalization scale parameters ''gamma''. The reasoning is that by having fewer parameters to constrain and working with normalized values, sparsity is easier to enforce, monitor, and learn. Having sparse batch normalization terms has the effect of pruning '''entire''' channels: if ''gamma'' is zero, then the output at that layer becomes constant (the bias term), and thus the preceding channels can be pruned.

=== Summary ===

The basic algorithm can be summarized as follows:

1. Penalize the L1-norm of the batch normalization scaling parameters in the loss

2. Train until loss plateaus

3. Remove channels that correspond to a downstream zero in batch normalization

4. Fine-tune the pruned model using regular learning

=== Details ===

There still exist a few problems that this summary has not addressed so far. Sub-gradient descent is known to have an inverse square root convergence rate on non-smooth objectives (Gordon & Tibshirani, 2012), so the sub-gradient update for the sparsity penalty may be suboptimal. Furthermore, the sparse penalty needs to be normalized with respect to previous channel sizes, since the penalty should be roughly equally distributed across all convolution layers.

==== Slow Convergence ====
To address the issue of slow convergence, Ye et al. (2018) use the iterative shrinkage-thresholding algorithm (ISTA) (Beck & Teboulle, 2009) to update the batch normalization scale parameter. The intuition for ISTA is that the structure of the optimization objective can be taken advantage of. Consider: $$L(x) = f(x) + g(x).$$

Let ''f'' be the model loss and ''g'' be the non-differentiable penalty (LASSO). ISTA exploits this structure to converge at a rate of O(1/n), instead of the O(1/sqrt(n)) rate of subgradient descent, which assumes no structure in the loss. Even though ISTA is designed for convex settings and the network loss is non-convex, Ye et al. (2018) argue that it still performs better than gradient descent.
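
As an illustration, one ISTA iteration is a gradient step on ''f'' followed by soft-thresholding, which is the proximal operator of the L1 penalty. The sketch below is not the authors' implementation: it assumes a toy convex least-squares problem in place of the network loss, and the names <code>soft_threshold</code>, <code>ista</code>, and <code>objective</code> as well as the matrix sizes are hypothetical.

```python
import numpy as np

def soft_threshold(v, tau):
    """Proximal operator of tau * ||.||_1: shrink every entry toward zero by tau."""
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

def ista(grad_f, x0, step, penalty, n_iters):
    """Minimize f(x) + penalty * ||x||_1, given the gradient of the smooth part f."""
    x = x0.copy()
    for _ in range(n_iters):
        # Gradient step on f, then the proximal (shrinkage) step on the L1 term.
        x = soft_threshold(x - step * grad_f(x), step * penalty)
    return x

# Toy convex problem standing in for the model loss: f(x) = 0.5 * ||A x - b||^2.
rng = np.random.default_rng(0)
A = rng.normal(size=(20, 5))
b = rng.normal(size=20)
penalty = 2.0

def objective(x):
    return 0.5 * np.sum((A @ x - b) ** 2) + penalty * np.sum(np.abs(x))

step = 1.0 / np.linalg.norm(A, 2) ** 2   # 1 / Lipschitz constant of grad f
x0 = np.ones(5)
x_hat = ista(lambda x: A.T @ (A @ x - b), x0, step, penalty, n_iters=200)
print(objective(x0), objective(x_hat))
```

Unlike a plain subgradient step, the shrinkage step can set entries to exactly zero, which is what makes the zeroed scale parameters easy to detect afterwards.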

==== Penalty Normalization ====

In the paper, Ye et al. (2018) normalize the per-layer sparse penalty with respect to the global input size, the current layer kernel areas, the previous layer kernel areas, and the local input feature map area.

[[File:Screenshot_from_2018-02-28_17-06-41.png]] (Ye et al., 2018)

To control the global penalty, a hyperparameter ''rho'' is multiplied with all the per-layer ''lambda'' in the final loss.
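
A minimal sketch of how the global and per-layer constants combine in the penalty term follows. The per-layer ''lambda'' values used here are placeholders, not values computed from the normalization formula in the figure above, and the function name <code>sparsity_penalty</code> is hypothetical.

```python
import numpy as np

def sparsity_penalty(gammas, lambdas, rho):
    """rho * sum_l lambda_l * ||gamma_l||_1 over all batch-normalized layers."""
    return rho * sum(lam * np.abs(g).sum() for g, lam in zip(gammas, lambdas))

# Two layers with hypothetical scale parameters and per-layer constants.
gammas = [np.array([1.0, -2.0]), np.array([0.5])]
lambdas = [0.1, 0.2]      # placeholder per-layer penalties
rho = 0.5                 # global scaling hyperparameter

penalty_value = sparsity_penalty(gammas, lambdas, rho)
print(penalty_value)      # added to the model loss during training
```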

=== Steps ===

The final algorithm can be summarized as follows:

1. Compute the per-layer normalized sparse penalty constant <math>\lambda</math>

2. Compute the global LASSO loss with global scaling constant <math>\rho</math>

3. Until convergence, train scaling parameters using ISTA and non-scaling parameters using regular gradient descent.

4. Remove channels that correspond to a downstream zero in batch normalization

5. Fine-tune the pruned model using regular learning
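
Step 4 can be sketched for a pair of layers. The example below rests on stated assumptions: matrix products stand in for convolutions, and the constant output of each removed channel is folded into the bias of the next layer so that the pruned network computes the same function. All names and sizes are illustrative rather than taken from the paper or its implementation.

```python
import numpy as np

def batch_norm(z, eps=1e-5):
    """Normalize each channel (column) using statistics over the batch axis."""
    return (z - z.mean(axis=0)) / np.sqrt(z.var(axis=0) + eps)

def forward(x, w1, gamma, beta, w2, b2):
    """Layer 1 with batch norm and ReLU, followed by a linear layer 2."""
    h = np.maximum(gamma * batch_norm(x @ w1) + beta, 0.0)
    return h @ w2 + b2

def prune_zero_gamma_channels(w1, gamma, beta, w2, b2):
    """Drop channels whose batch normalization scale is exactly zero."""
    keep = gamma != 0
    # A channel with gamma == 0 outputs the constant max(beta, 0) for every
    # input, so its contribution to the next layer is folded into that bias.
    const = np.maximum(beta[~keep], 0.0)
    b2_new = b2 + const @ w2[~keep]
    return w1[:, keep], gamma[keep], beta[keep], w2[keep], b2_new

rng = np.random.default_rng(0)
x = rng.normal(size=(16, 4))
w1 = rng.normal(size=(4, 6))
gamma = np.array([1.2, 0.0, 0.7, 0.0, 0.3, 0.9])   # two channels already zeroed
beta = rng.normal(size=6)
w2 = rng.normal(size=(6, 3))
b2 = np.zeros(3)

full_output = forward(x, w1, gamma, beta, w2, b2)
pruned_params = prune_zero_gamma_channels(w1, gamma, beta, w2, b2)
pruned_output = forward(x, *pruned_params)
print(pruned_params[0].shape, np.allclose(full_output, pruned_output))
```

For real convolutions with zero padding, folding the constant into the bias is only exact away from the image borders, which is one reason a fine-tuning step follows.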

== Results ==

The authors show state-of-the-art performance, compared with other channel-pruning approaches. It is important to note that it would be unfair to compare against general pruning approaches; channel pruning specifically removes channels without introducing '''intra-kernel sparsity''', whereas other pruning approaches introduce irregular kernel sparsity and hence computational inefficiencies.

=== CIFAR-10 Experiment ===

Model A is trained with a sparse penalty of <math>\rho = 0.0002</math> for 30 thousand steps, after which the penalty is increased to <math>\rho = 0.001</math>. Model B is trained by taking Model A and increasing the sparse penalty up to 0.002. Similarly, Model C is a continuation of Model B with a penalty of 0.008.

[[File:Screenshot_from_2018-02-28_17-24-25.png]]

For the convNet, reducing the number of parameters in the base model increased the accuracy in model A. This suggests that the base model is over-parameterized. Otherwise, there would be a trade-off of accuracy and model efficiency.

=== ILSVRC2012 Experiment ===

The authors note that while ResNet-101 takes hundreds of epochs to train, pruning only takes 5-10, with fine-tuning adding another 2, giving an empirical example of how long pruning might take in practice. Both models were trained with an aggressive sparsity penalty of 0.1.

[[File:Screenshot_from_2018-02-28_17-24-36.png]]

=== Image Foreground-Background Segmentation Experiment ===

The authors note that it is common practice to take a network pre-trained on a large task and fine-tune it to apply it to a different, smaller task. One might expect that some extra channels, while useful for the large task, can be omitted for the simpler task. This experiment replicated that use-case by taking a NN originally trained on multiple datasets and applying the proposed pruning method. The authors note that the pruned network actually improves over the original network in all but the most challenging test dataset, which is in line with the initial expectation. The model was trained with a sparsity penalty of 0.5 and the results are shown in the table below.

[[File:paper8_Segmentation.png|700px]]

The neural network used in this experiment is composed of two branches:
* An inception branch that locates the foreground objects
* A DenseNet branch to regress the edges

It was found that the pruning primarily affected the inception branch as shown in Figure 1 below. This likely explains the poor performance on more challenging datasets as a result of a higher requirement on foreground objects, which has been impacted by the pruning of the inception branch.

[[File:pruned_inception.png|600px]]

== Conclusion ==

Pruning large neural architectures to fit on low-power devices is an important task. For a real quantitative measure of efficiency, it would be interesting to conduct actual power measurements on the pruned models versus baselines; reduction in FLOPs doesn't necessarily correspond with vastly reduced power since memory accesses dominate energy consumption (Han et al., 2015). However, the reduction in the number of FLOPs and parameters is encouraging, so moderate power savings should be expected.

It would also be interesting to combine multiple approaches, or "throw the whole kitchen sink" at this task. Han et al. (2015) sparked much recent interest by successfully combining weight pruning, quantization, and Huffman coding without loss in accuracy. However, their approach introduced irregular sparsity in the convolutional layers, so a direct comparison cannot be made.

In conclusion, this novel, theoretically-motivated interpretation of channel pruning was successfully applied to several important tasks.

== Implementation ==
A PyTorch implementation is available here: https://github.com/jack-willturner/batchnorm-pruning

== References ==

* Krizhevsky, A., Sutskever, I., & Hinton, G. E. (2012). Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems (pp. 1097-1105).
* He, K., Zhang, X., Ren, S., & Sun, J. (2016). Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 770-778).
* Cheng, Y., Wang, D., Zhou, P., & Zhang, T. (2017). A Survey of Model Compression and Acceleration for Deep Neural Networks. arXiv preprint arXiv:1710.09282.
* Ye, J., Lu, X., Lin, Z., & Wang, J. Z. (2018). Rethinking the Smaller-Norm-Less-Informative Assumption in Channel Pruning of Convolution Layers. arXiv preprint arXiv:1802.00124.
* Li, H., Kadav, A., Durdanovic, I., Samet, H., & Graf, H. P. (2016). Pruning filters for efficient convnets. arXiv preprint arXiv:1608.08710.
* Molchanov, P., Tyree, S., Karras, T., Aila, T., & Kautz, J. (2016). Pruning convolutional neural networks for resource efficient inference.
* Ioffe, S., & Szegedy, C. (2015, June). Batch normalization: Accelerating deep network training by reducing internal covariate shift. In International conference on machine learning (pp. 448-456).
* Gordon, G., & Tibshirani, R. (2012). Subgradient method. https://www.cs.cmu.edu/~ggordon/10725-F12/slides/06-sg-method.pdf
* Beck, A., & Teboulle, M. (2009). A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM journal on imaging sciences, 2(1), 183-202.
* Han, S., Mao, H., & Dally, W. J. (2015). Deep compression: Compressing deep neural networks with pruning, trained quantization and huffman coding. arXiv preprint arXiv:1510.00149

stat946w18/Rethinking the Smaller-Norm-Less-Informative Assumption in Channel Pruning of Convolutional Layers

2018-04-21T04:18:49Z

F7xia: /* Results */

== Introduction ==

With the recent and ongoing surge in low-power, intelligent agents (such as wearables, smartphones, and IoT devices), there exists a growing need for machine learning models to work well in resource-constrained environments. Deep learning models have achieved state-of-the-art results on a broad range of tasks; however, they are difficult to deploy in their original forms. For example, AlexNet (Krizhevsky et al., 2012), a model for image classification, contains 61 million parameters and requires 1.5 billion floating point operations (FLOPs) in one inference pass. A more accurate model, ResNet-50 (He et al., 2016), has 25 million parameters but requires 4.08 billion FLOPs. A high-end desktop GPU such as a Titan Xp is capable of [https://www.nvidia.com/en-us/titan/titan-xp/ (12 TFLOPS (tera-FLOPs per second))], while the Adreno 540 GPU used in a Samsung Galaxy S8 is only capable of [https://gflops.surge.sh (567 GFLOPS)], which is less than 5% of the Titan Xp. Clearly, it would be difficult to deploy and run these models on low-power devices.

In general, model compression can be accomplished using four main non-mutually exclusive methods (Cheng et al., 2017): weight pruning, quantization, matrix transformations, and weight tying. By non-mutually exclusive, we mean that these methods can be used not only separately but also in combination for compressing a single model; the use of one method does not exclude any of the other methods from being viable.

Ye et al. (2018) explore pruning entire channels in a convolutional neural network (CNN). Past work has mostly focused on norm-based or error-based heuristics to prune channels; instead, Ye et al. (2018) show that their approach is easily reproducible and has favorable qualities from an optimization standpoint. In other words, they argue that the norm-based assumption is not as informative or theoretically justified as their approach, and provide strong empirical evidence of these findings.

== Motivation ==

Some previous works on pruning channel filters (Li et al., 2016; Molchanov et al., 2016) have focused on using the L1 norm to determine the importance of a channel. Ye et al. (2018) show that, in the deep linear convolution case, penalizing the per-layer norm is coarse-grained; they argue that one cannot assign different coefficients to L1 penalties associated with different layers without risking the loss function being susceptible to trivial re-parameterizations. As an example, consider the following deep linear convolutional neural network with modified LASSO loss:

$$\min \mathbb{E}_D \lVert W_{2n} * \dots * W_1 x - y\rVert^2 + \lambda \sum_{i=1}^n \lVert W_{2i} \rVert_1$$

where W are the weights and * is convolution. Here we have chosen the coefficient 0 for the L1 penalty associated with odd-numbered layers and the coefficient 1 for the L1 penalty associated with even-numbered layers. This loss is susceptible to trivial re-parameterizations: without affecting the least-squares loss, we can always reduce the LASSO loss by halving the weights of all even-numbered layers and doubling the weights of all odd-numbered layers.

Furthermore, batch normalization (Ioffe & Szegedy, 2015) is incompatible with this method of weight regularization. Consider batch normalization at the <math>l</math>-th layer.

<center><math>x^{l+1} = \max\{\gamma \cdot BN_{\mu,\sigma,\epsilon}(W^l * x^l) + \beta, 0\}</math></center>

Due to the batch normalization, any uniform scaling of <math>W^l</math> changes its <math>l_1</math> and <math>l_2</math> norms but has no effect on <math>x^{l+1}</math>. Thus, when trying to minimize weight norms of multiple layers, it is unclear how to properly choose penalties for each layer. Therefore, penalizing the norm of a filter in a deep convolutional network is hard to justify from a theoretical perspective.

In contrast with these existing approaches, the authors focus on enforcing sparsity on a tiny set of parameters in the CNN: the scale parameters <math>\gamma</math> of all batch normalization layers. Placing sparsity constraints on <math>\gamma</math> is simpler and easier to monitor; more importantly, the authors put forward two reasons for this choice:

1. Every <math>\gamma</math> always multiplies a normalized random variable, thus the channel importance becomes comparable across different layers by measuring the magnitude values of <math>\gamma</math>;

2. The reparameterization effect across different layers is avoided if the subsequent convolution layer is also batch-normalized. In other words, the effects of scale changes in the <math>\gamma</math> parameters are independent across different layers.

Thus, although not providing a complete theoretical guarantee on loss, Ye et al. (2018) develop a pruning technique that claims to be more justified than norm-based pruning is.

== Method ==

At a high level, Ye et al. (2018) propose that, instead of discovering sparsity by penalizing the per-filter or per-channel norm, one should penalize the batch normalization scale parameters ''gamma''. The reasoning is that by having fewer parameters to constrain and working with normalized values, sparsity is easier to enforce, monitor, and learn. Having sparse batch normalization terms has the effect of pruning '''entire''' channels: if ''gamma'' is zero, then the output at that layer becomes constant (the bias term), and thus the preceding channels can be pruned.

=== Summary ===

The basic algorithm can be summarized as follows:

1. Penalize the L1-norm of the batch normalization scaling parameters in the loss

2. Train until loss plateaus

3. Remove channels that correspond to a downstream zero in batch normalization

4. Fine-tune the pruned model using regular learning

=== Details ===

There still exist a few problems that this summary has not addressed so far. Sub-gradient descent is known to have an inverse square root convergence rate on non-smooth objectives (Gordon & Tibshirani, 2012), so the sub-gradient update for the sparsity penalty may be suboptimal. Furthermore, the sparse penalty needs to be normalized with respect to previous channel sizes, since the penalty should be roughly equally distributed across all convolution layers.

==== Slow Convergence ====
To address the issue of slow convergence, Ye et al. (2018) use the iterative shrinkage-thresholding algorithm (ISTA) (Beck & Teboulle, 2009) to update the batch normalization scale parameter. The intuition for ISTA is that the structure of the optimization objective can be taken advantage of. Consider: $$L(x) = f(x) + g(x).$$

Let ''f'' be the model loss and ''g'' be the non-differentiable penalty (LASSO). ISTA exploits this structure to converge at a rate of O(1/n), instead of the O(1/sqrt(n)) rate of subgradient descent, which assumes no structure in the loss. Even though ISTA is designed for convex settings and the network loss is non-convex, Ye et al. (2018) argue that it still performs better than gradient descent.

==== Penalty Normalization ====

In the paper, Ye et al. (2018) normalize the per-layer sparse penalty with respect to the global input size, the current layer kernel areas, the previous layer kernel areas, and the local input feature map area.

[[File:Screenshot_from_2018-02-28_17-06-41.png]] (Ye et al., 2018)

To control the global penalty, a hyperparameter ''rho'' is multiplied with all the per-layer ''lambda'' in the final loss.

=== Steps ===

The final algorithm can be summarized as follows:

1. Compute the per-layer normalized sparse penalty constant <math>\lambda</math>

2. Compute the global LASSO loss with global scaling constant <math>\rho</math>

3. Until convergence, train scaling parameters using ISTA and non-scaling parameters using regular gradient descent.

4. Remove channels that correspond to a downstream zero in batch normalization

5. Fine-tune the pruned model using regular learning

== Results ==

=== CIFAR-10 Experiment ===

Model A is trained with a sparse penalty of <math>\rho = 0.0002</math> for 30 thousand steps, after which the penalty is increased to <math>\rho = 0.001</math>. Model B is trained by taking Model A and increasing the sparse penalty up to 0.002. Similarly, Model C is a continuation of Model B with a penalty of 0.008.

[[File:Screenshot_from_2018-02-28_17-24-25.png]]

For the convNet, reducing the number of parameters in the base model increased the accuracy in model A. This suggests that the base model is over-parameterized. Otherwise, there would be a trade-off of accuracy and model efficiency.

=== ILSVRC2012 Experiment ===

The authors note that while ResNet-101 takes hundreds of epochs to train, pruning only takes 5-10, with fine-tuning adding another 2, giving an empirical example of how long pruning might take in practice. Both models were trained with an aggressive sparsity penalty of 0.1.

[[File:Screenshot_from_2018-02-28_17-24-36.png]]

=== Image Foreground-Background Segmentation Experiment ===

The authors note that it is common practice to take a network pre-trained on a large task and fine-tune it to apply it to a different, smaller task. One might expect that some extra channels, while useful for the large task, can be omitted for the simpler task. This experiment replicated that use-case by taking a NN originally trained on multiple datasets and applying the proposed pruning method. The authors note that the pruned network actually improves over the original network in all but the most challenging test dataset, which is in line with the initial expectation. The model was trained with a sparsity penalty of 0.5 and the results are shown in the table below.

[[File:paper8_Segmentation.png|700px]]

The neural network used in this experiment is composed of two branches:
* An inception branch that locates the foreground objects
* A DenseNet branch to regress the edges

It was found that the pruning primarily affected the inception branch as shown in Figure 1 below. This likely explains the poor performance on more challenging datasets as a result of a higher requirement on foreground objects, which has been impacted by the pruning of the inception branch.

[[File:pruned_inception.png|600px]]

== Conclusion ==

Pruning large neural architectures to fit on low-power devices is an important task. For a real quantitative measure of efficiency, it would be interesting to conduct actual power measurements on the pruned models versus baselines; reduction in FLOPs doesn't necessarily correspond with vastly reduced power since memory accesses dominate energy consumption (Han et al., 2015). However, the reduction in the number of FLOPs and parameters is encouraging, so moderate power savings should be expected.

It would also be interesting to combine multiple approaches, or "throw the whole kitchen sink" at this task. Han et al. (2015) sparked much recent interest by successfully combining weight pruning, quantization, and Huffman coding without loss in accuracy. However, their approach introduced irregular sparsity in the convolutional layers, so a direct comparison cannot be made.

In conclusion, this novel, theoretically-motivated interpretation of channel pruning was successfully applied to several important tasks.

== Implementation ==
A PyTorch implementation is available here: https://github.com/jack-willturner/batchnorm-pruning

== References ==

* Krizhevsky, A., Sutskever, I., & Hinton, G. E. (2012). Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems (pp. 1097-1105).
* He, K., Zhang, X., Ren, S., & Sun, J. (2016). Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 770-778).
* Cheng, Y., Wang, D., Zhou, P., & Zhang, T. (2017). A Survey of Model Compression and Acceleration for Deep Neural Networks. arXiv preprint arXiv:1710.09282.
* Ye, J., Lu, X., Lin, Z., & Wang, J. Z. (2018). Rethinking the Smaller-Norm-Less-Informative Assumption in Channel Pruning of Convolution Layers. arXiv preprint arXiv:1802.00124.
* Li, H., Kadav, A., Durdanovic, I., Samet, H., & Graf, H. P. (2016). Pruning filters for efficient convnets. arXiv preprint arXiv:1608.08710.
* Molchanov, P., Tyree, S., Karras, T., Aila, T., & Kautz, J. (2016). Pruning convolutional neural networks for resource efficient inference.
* Ioffe, S., & Szegedy, C. (2015, June). Batch normalization: Accelerating deep network training by reducing internal covariate shift. In International conference on machine learning (pp. 448-456).
* Gordon, G., & Tibshirani, R. (2012). Subgradient method. https://www.cs.cmu.edu/~ggordon/10725-F12/slides/06-sg-method.pdf
* Beck, A., & Teboulle, M. (2009). A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM journal on imaging sciences, 2(1), 183-202.
* Han, S., Mao, H., & Dally, W. J. (2015). Deep compression: Compressing deep neural networks with pruning, trained quantization and huffman coding. arXiv preprint arXiv:1510.00149

stat946w18/MaskRNN: Instance Level Video Object Segmentation

2018-04-21T04:15:25Z

F7xia: /* MaskRNN: Binary Segmentation Network */

== Introduction ==
Deep learning has produced state-of-the-art results in many computer vision tasks such as image classification, object localization, object detection, object segmentation, semantic segmentation and instance level video object segmentation. Image classification assigns a label to an image based on its prominent objects. Object localization is the task of finding the objects' locations in the frame. Object segmentation involves providing a pixel map that represents the pixel-wise location of the objects in the image. Semantic segmentation attempts to segment the image into meaningful parts. Instance level video object segmentation is the task of consistent object segmentation in video sequences. Deforming shapes, fast movements, and occlusion from multiple objects are just some of the significant challenges in instance level video object segmentation.

There are 2 different types of video object segmentation: unsupervised and semi-supervised.
* In unsupervised video object segmentation, the task is to find the salient objects and track the main objects in the video.
* In a semi-supervised setting, the ground truth mask of the salient objects is provided for the first frame. The task is thus simplified to only track the objects required.

In this paper, the authors look at a semi-supervised video object segmentation technique, in which the ground truth mask for the first frame is provided.

== Background Papers ==
Video object segmentation has been performed using spatio-temporal graphs [[https://pdfs.semanticscholar.org/7221/c3470fa89879aab3ef270570ced15cde28de.pdf 5], [http://ieeexplore.ieee.org/abstract/document/5539893/ 6], [http://openaccess.thecvf.com/content_iccv_2013/papers/Li_Video_Segmentation_by_2013_ICCV_paper.pdf 7], [https://link.springer.com/content/pdf/10.1007/s11263-011-0512-5.pdf 8]] and deep learning. The graph-based methods construct 3D spatio-temporal graphs in order to model the inter-frame and intra-frame relationships of pixels or superpixels in a video. Hence they are computationally slower than deep learning methods and are unable to run in real time. There are 2 main deep learning techniques for semi-supervised video object segmentation: One Shot Video Object Segmentation (OSVOS) and Learning Video Object Segmentation from Static Images (MaskTrack). The following is a brief description of the new techniques introduced by these papers for the semi-supervised video object segmentation task.

=== OSVOS (One-Shot Video Object Segmentation) ===

[[File:OSVOS.jpg | 1000px]]

This paper introduces the technique of frame-by-frame object segmentation without any temporal information from the previous frames of the video. The paper uses a VGG-16 network with pre-trained weights from an image classification task. This network is then converted into a fully convolutional network (FCN) by removing the fully connected dense layers at the end and adding convolution layers to generate a segment mask of the input. This network is then trained on the DAVIS 2016 dataset.

During testing, the trained VGG-16 FCN is fine-tuned on the first frame of the video and its ground truth mask. Because this is a semi-supervised case, the segmented mask (ground truth) for the first frame is available. The first frame data is augmented by zooming/rotating/flipping the first frame and the associated segment mask.
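
The key point of this augmentation is that every geometric transform is applied identically to the frame and to its mask. The sketch below is illustrative only: it covers a horizontal flip and a 90-degree rotation (zooming is omitted), and the function name <code>augment_first_frame</code> and the toy array sizes are hypothetical.

```python
import numpy as np

def augment_first_frame(frame, mask):
    """Return (frame, mask) pairs: original, horizontal flip, 90-degree rotation."""
    pairs = [(frame, mask)]
    # The same transform must be applied to the frame and to its mask.
    pairs.append((np.fliplr(frame), np.fliplr(mask)))
    pairs.append((np.rot90(frame, k=1, axes=(0, 1)), np.rot90(mask, k=1)))
    return pairs

frame = np.arange(2 * 3 * 3).reshape(2, 3, 3)   # toy 2 x 3 frame with 3 channels
mask = np.array([[1, 0, 0],
                 [0, 0, 0]])                     # object at the top-left pixel
augmented = augment_first_frame(frame, mask)
for aug_frame, aug_mask in augmented:
    print(aug_frame.shape, aug_mask.shape)
```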

=== MaskTrack (Learning Video Object Segmentation from Static Images) ===

[[File:MaskTrack.jpg | 500px]]

MaskTrack takes the output of the previous frame to improve its predictions and to generate the segmentation mask for the next frame. Thus the input to the network is 4 channels wide (3 RGB channels from the frame at time <math>t</math> plus one binary segmentation mask from frame <math>t-1</math>). The output of the network is the binary segmentation mask for the frame at time <math>t</math>. Using the binary segmentation mask as guidance (referred to as guided object segmentation in the paper), the network is able to use some temporal information from the previous frame to improve its segmentation mask prediction for the next frame.
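
The 4-channel input can be sketched as a simple concatenation. The channel-last layout and the function name <code>build_masktrack_input</code> are assumptions made for illustration, not details taken from the MaskTrack paper.

```python
import numpy as np

def build_masktrack_input(frame_rgb, prev_mask):
    """Stack an H x W x 3 frame at time t with the H x W mask from time t-1."""
    if frame_rgb.shape[:2] != prev_mask.shape:
        raise ValueError("frame and mask must share the same height and width")
    mask_channel = prev_mask[..., None].astype(frame_rgb.dtype)
    return np.concatenate([frame_rgb, mask_channel], axis=-1)

frame_t = np.zeros((4, 5, 3), dtype=np.float32)       # toy RGB frame at time t
mask_prev = np.zeros((4, 5), dtype=np.uint8)          # binary mask from frame t-1
mask_prev[1:3, 2:4] = 1

net_input = build_masktrack_input(frame_t, mask_prev)
print(net_input.shape)
```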

The model of the MaskTrack network is similar to a modular VGG-16 and is referred to as MaskTrack ConvNet in the paper. The network is trained offline on saliency segmentation datasets: ECSSD, MSRA 10K, SOD and PASCAL-S. The input mask for the binary segmentation mask channel is generated via non-rigid deformation and affine transformation of the ground truth segmentation mask. Similar data-augmentation techniques are also used during online training. Just like OSVOS, MaskTrack uses the first frame as ground truth (with augmented images) to fine-tune the network to improve prediction score for the particular video sequence.

A parallel ConvNet network is used to generate a predicted segment mask based on the optical flow magnitude. The optical flow between 2 frames is calculated using the EpicFlow algorithm. The output of the two networks is combined using an averaging operation to generate the final predicted segmented mask.

Table 1 below gives a summary comparison of the different state of the art algorithms. The noteworthy information included in this table is that the technique presented in this paper is the only one which takes into account long-term temporal information. This is accomplished with a recurrent neural net. Furthermore, the bounding box is also estimated instead of just a segmentation mask. The authors claim that this allows the incorporation of a location prior from the tracked object.

[[File:Paper19-SegmentationComp.png]]

== Dataset ==
The three major datasets used in this paper are DAVIS-2016, DAVIS-2017 and Segtrack v2. The DAVIS-2016 dataset provides video sequences with only one segment mask for all salient objects. DAVIS-2017 improves the ground truth data by providing a segmentation mask for each salient object as a separate color segment mask. Segtrack v2 also provides multiple segmentation masks for the salient objects in the video sequence. These datasets try to recreate real-life scenarios like occlusions, low resolution videos, background clutter, motion blur, fast motion etc.

== MaskRNN: Introduction ==
Most techniques mentioned above don't work directly on instance level segmentation of the objects through the video sequence. The above approaches focus on image segmentation of each frame and use additional information (mask propagation and optical flow) from the preceding frame to make predictions for the current frame. To address the instance level segmentation problem, MaskRNN proposes a framework where the salient objects are tracked and segmented by capturing the temporal information in the video sequence using a recurrent neural network.

== MaskRNN: Overview ==
In a video sequence <math>I = \{I_1, I_2, \dots, I_T\}</math>, the sequence of <math>T</math> frames is given as input to the network, where the video sequence contains <math>N</math> salient objects. The ground truth for the first frame, <math>y_1^*</math>, is also provided for the <math>N</math> salient objects.
In this paper, the problem is formulated as a time-dependent one: using a recurrent neural network, the prediction for the previous frame influences the prediction for the next frame. The approach also computes the optical flow between frames and uses it as an input to the neural network. (Optical flow is the apparent motion of objects between two consecutive frames, expressed as a 2D vector field that gives the displacement of brightness patterns at each pixel; it is "apparent" because it depends on the relative motion between the observer and the scene.) The optical flow is also used to align the previous predicted mask with the current frame. “The warped prediction, the optical flow itself, and the appearance of the current frame are then used as input for <math>N</math> deep nets, one for each of the <math>N</math> objects.”[1 - MaskRNN] Each deep net is made of an object localization network and a binary segmentation network. The binary segmentation network generates the segmentation mask for an object, and the object localization network is used to remove outliers from the predictions. The final prediction of the segmentation mask is generated by merging the predictions of the 2 networks. Since there is one deep net per object, the <math>N</math> per-object predictions are then merged into a single prediction using an <math>\text{argmax}</math> operation at test time.

== MaskRNN: Multiple Instance Level Segmentation ==

[[File:2ObjectSeg.jpg | 850px]]

Image segmentation requires producing a pixel level segmentation mask and this can become a multi-class problem. Instead, using the approach from [2- Mask R-CNN], the problem is converted into multiple binary segmentation problems. A segmentation mask is predicted separately for each salient object, so each prediction is a binary segmentation problem. The binary segments are combined using an <math>\text{argmax}</math> operation where each pixel is assigned to the object with the largest predicted probability.
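The merging step can be illustrated with a minimal NumPy sketch. It assumes the <math>N</math> per-object predictions are available as probability maps of the same size; the function name <code>merge_instance_masks</code> and the background threshold of 0.5 are assumptions made for illustration and are not taken from the paper.

```python
import numpy as np

def merge_instance_masks(probs, threshold=0.5):
    """Merge N per-object probability maps into one label map.

    probs: array of shape (N, H, W), foreground probability per object.
    Returns an (H, W) integer map: 0 = background, k = object k (1-based).
    The background threshold is an assumption of this sketch.
    """
    probs = np.asarray(probs, dtype=float)
    best = probs.argmax(axis=0)       # most confident object at each pixel
    best_prob = probs.max(axis=0)     # its predicted probability
    # Pixels where no object is confident enough stay background.
    return np.where(best_prob > threshold, best + 1, 0)

# Two objects on a 2x2 frame.
probs = np.array([[[0.9, 0.2], [0.1, 0.6]],
                  [[0.3, 0.8], [0.2, 0.7]]])
labels = merge_instance_masks(probs)
```

Here the bottom-left pixel stays background because neither object reaches the threshold, while the bottom-right pixel goes to the second object, whose probability is larger.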

=== MaskRNN: Binary Segmentation Network ===

[[File:MaskRNNDeepNet.jpg | 850px]]

The above picture shows a single deep net employed for predicting the segment mask for one salient object in the video frame. The deep net consists of 2 networks: a binary segmentation network and an object localization network. The binary segmentation network is split into two streams: an appearance stream and a flow stream. The input of the appearance stream is the RGB frame at time <math>t</math> and the warped prediction of the binary segmentation mask from time <math>t-1</math>. The warping function uses the optical flow between frame <math>t-1</math> and frame <math>t</math> to generate a new binary segmentation mask for frame <math>t</math>. The input to the flow stream is the concatenation of the optical flow magnitude between frames <math>t-1</math> to <math>t</math> and frames <math>t</math> to <math>t+1</math> and the warped prediction of the segmentation mask from frame <math>t-1</math>. The magnitude of the optical flow is replicated into an RGB format before feeding it to the flow stream. The network architecture closely resembles a VGG-16 network without the pooling or fully connected layers at the end. The fully connected layers are replaced with convolutional and bilinear interpolation upsampling layers which are then linearly combined to form a feature representation that is the same size as the input image. This feature representation is then used to generate a binary segment mask. This technique is borrowed from the Fully Convolutional Network mentioned above. The outputs of the flow stream and the appearance stream are linearly combined and a sigmoid function is applied to the result to generate a binary mask for the <math>i</math>-th object. All parts of the network are fully differentiable and thus it can be fully trained in every pass.
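The final fusion step (a linear combination of the two stream outputs followed by a sigmoid) can be sketched as follows. This is only an illustration: the weights <code>w_app</code>, <code>w_flow</code> and <code>bias</code> are hypothetical scalars standing in for the learned combination, and the function name is not from the paper.

```python
import numpy as np

def fuse_streams(app_out, flow_out, w_app=1.0, w_flow=1.0, bias=0.0):
    """Linearly combine the appearance and flow stream outputs, then
    squash with a sigmoid to get a per-pixel foreground probability.

    app_out, flow_out: (H, W) arrays of pre-sigmoid scores.
    w_app, w_flow, bias: hypothetical scalars (learned in the real model).
    """
    z = w_app * np.asarray(app_out, dtype=float) \
        + w_flow * np.asarray(flow_out, dtype=float) + bias
    return 1.0 / (1.0 + np.exp(-z))

app = np.array([[2.0, -2.0], [0.0, 0.0]])
flow = np.array([[1.0, -1.0], [0.0, 0.0]])
prob = fuse_streams(app, flow)
binary_mask = prob > 0.5   # thresholding at 0.5 is an assumption
```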

=== MaskRNN: Object Localization Network: ===
The object localization network generates a bounding box of the salient object in the frame using a technique similar to the Fast-RCNN method of object localization: region of interest (RoI) pooling is performed on the features of the region proposals (i.e. the bounding box proposals here), and the pooled features are passed through fully connected layers to perform regression. This bounding box is enlarged by a factor of 1.25 and combined with the output of the binary segmentation network. Only the segment mask available in the bounding box is used for prediction and the pixels outside of the bounding box are marked as zero. MaskRNN uses the convolutional feature output of the appearance stream as the input to the RoI-pooling layer to generate the predicted bounding box. A pixel is classified as foreground if it is both predicted to be in the foreground by the binary segmentation net and within the enlarged estimated bounding box from the object localization net.
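The way the enlarged bounding box restricts the segmentation mask can be sketched in NumPy. The sketch assumes that "enlarged by a factor of 1.25" means scaling each side of the box about its center, and that boxes are given as <code>(x0, y0, x1, y1)</code> pixel coordinates; both are assumptions, as are the function names.

```python
import numpy as np

def enlarge_box(box, scale=1.25):
    """Scale a box (x0, y0, x1, y1) about its center by `scale`."""
    x0, y0, x1, y1 = box
    cx, cy = (x0 + x1) / 2.0, (y0 + y1) / 2.0
    half_w = (x1 - x0) * scale / 2.0
    half_h = (y1 - y0) * scale / 2.0
    return cx - half_w, cy - half_h, cx + half_w, cy + half_h

def restrict_to_box(mask, box, scale=1.25):
    """Zero out every mask pixel outside the enlarged bounding box."""
    h, w = mask.shape
    x0, y0, x1, y1 = enlarge_box(box, scale)
    # Clip to the image and round outwards to whole pixels.
    c0, c1 = max(0, int(np.floor(x0))), min(w, int(np.ceil(x1)))
    r0, r1 = max(0, int(np.floor(y0))), min(h, int(np.ceil(y1)))
    out = np.zeros_like(mask)
    out[r0:r1, c0:c1] = mask[r0:r1, c0:c1]
    return out

mask = np.ones((10, 10), dtype=np.uint8)   # segmentation net output
restricted = restrict_to_box(mask, (4, 4, 8, 8))
```

A pixel survives only if the segmentation net marked it as foreground and it lies inside the enlarged box, which is how outliers far from the object are removed.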

=== Training and Finetuning ===
For training the network depicted in Figure 1, backpropagation through time is used in order to preserve the recurrence relationship connecting the frames of the video sequence. Predictive performance is further improved by following the algorithm for semi-supervised setting for video object segmentation with fine-tuning achieved by using the first frame segmentation mask of the ground truth. In this way, the network is further optimized using the ground truth data.

== MaskRNN: Implementation Details ==
=== Offline Training ===
The deep net is first trained offline on a set of static images. The ground truth is randomly perturbed locally to generate the imperfect mask from frame <math>t-1</math>. Two different networks are trained offline separately for the DAVIS-2016 and DAVIS-2017 datasets for a fair evaluation on both datasets. After both the object localization net and the binary segmentation network have been trained, the temporal information in the network is used to further improve the segmented prediction results. Because of GPU memory constraints, the RNN can only backpropagate the gradients back through 7 frames when learning long-term temporal information.

For optical flow, a pre-trained flowNet2.0 is used to compute the optical flow between frames. (A flowNet (Dosovitskiy 2015) is a deep neural network trained to predict optical flow. The simplest form of flowNet has an architecture consisting of two parts. The first part accepts as input the two images between which the optical flow is to be computed, and applies a sequence of convolution and max-pooling operations, as in a standard convolutional neural network. In the second part, repeated up-convolution operations are applied, increasing the dimensions of the feature-maps. Besides the output of the previous up-convolution, each up-convolution is also fed as input the output of the corresponding down-convolution from the first part of the network. Thus part of the architecture resembles that of a U-net (Ronneberger, 2015). The output of the network is the predicted optical flow.)

=== Online Finetuning ===
The deep nets (without the RNN) are then fine-tuned during test time by online training the networks on the ground truth of the first frame and some augmentations of the first frame data. The learning rate is set to <math>10^{-5}</math> for online training for 200 iterations and the learning rate is gradually decayed over time. Data augmentation techniques similar to those in offline training, namely random resizing, rotating, cropping and flipping, are applied. Also, it should be noted that the RNN is ''not'' employed during online finetuning since only a single frame of training data is available.

== MaskRNN: Experimental Results ==
=== Evaluation Metrics ===
There are 3 different techniques for performance analysis for Video Object Segmentation techniques:

1. Region Similarity (Jaccard Index): Region similarity or Intersection-over-union is used to capture precision of the area covered by the prediction segmentation mask compared to the ground truth segmentation mask. It calculates the average across all frames of the dataset. This is particularly challenging for small sized foreground objects.

\begin{equation}
IoU = \frac{|M \cap G|}{|M| + |G| - |M \cap G|}
\label{equation:Jaccard}
\end{equation}

2. Contour Accuracy (F-score): This metric measures the accuracy in the boundary of the predicted segment mask and the ground truth segment mask, by calculating the precision and the recall of the two sets of points on the contours of the ground truth segment and the output segment via a bipartite graph matching. It is a measure of accurate delineation of the foreground objects.

[[File:Fscore.jpg | 200px|center]]

3. Temporal Stability : This estimates the degree of deformation needed to transform the segmentation masks from one frame to the next and is measured by the dissimilarity of the set of points on the contours of the segmentation between two adjacent frames.

Region similarity measures the true segmented area in the prediction, while Contour Accuracy measures the accuracy of the contours/segmented mask boundary.
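As a concrete illustration of the region similarity formula above, the Jaccard index of two binary masks can be computed as in the following sketch. Returning 1.0 when both masks are empty is a convention assumed here, not something stated in the paper.

```python
import numpy as np

def jaccard(pred, gt):
    """IoU = |M ∩ G| / (|M| + |G| - |M ∩ G|) for two binary masks."""
    pred = np.asarray(pred, dtype=bool)
    gt = np.asarray(gt, dtype=bool)
    inter = np.logical_and(pred, gt).sum()
    union = pred.sum() + gt.sum() - inter
    if union == 0:
        return 1.0   # assumed convention: two empty masks agree perfectly
    return float(inter) / float(union)

pred = np.array([[1, 1], [0, 0]])
gt = np.array([[1, 0], [1, 0]])
score = jaccard(pred, gt)   # 1 shared pixel out of 3 covered pixels
```

The dataset-level score is then the average of this value over all frames.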

=== Ablation Study ===

The ablation study summarized how the different components contributed to the algorithm evaluated on DAVIS-2016 and DAVIS-2017 datasets.

[[File:MaskRNNTable2.jpg | 700px|center]]

The above table presents the contribution of each component of the network to the final prediction score. Online fine-tuning improves the performance by a large margin, as the network becomes adjusted to the appearance of the specific object being tracked. The addition of the RNN, the localization net and the FStream all seem to positively affect the performance of the deep net. The FStream provides information on motion boundaries, which helps in videos with cluttered backgrounds, while the RNN provides more consistent segmentation masks over time. The localization net has a more ambiguous effect on the network: adding the bounding box regression loss decreases the performance of the segmentation net, but applying the bounding box to restrict the segmentation mask improves the results over those achieved by only using the segmentation net. In other words, the localization net should only be used in conjunction with the segmentation net, while the segmentation net can be used by itself.

=== Quantitative Evaluation ===

The authors use DAVIS-2016, DAVIS-2017 and Segtrack v2 to compare the performance of the proposed approach to other methods based on foreground-background video object segmentation and multiple instance-level video object segmentation.

[[File:MaskRNNTable3.jpg | 700px]]

The above table shows the results for contour accuracy mean and region similarity. The MaskRNN method seems to outperform all previously proposed methods. The performance gain from employing a recurrent neural network to learn the recurrence relationship and using an object localization network to improve prediction results is significant.

The following table shows the improvements in the state of the art achieved by MaskRNN on the DAVIS-2017 and the SegTrack v2 dataset.

[[File:MaskRNNTable4.jpg | 700px]]

=== Qualitative Evaluation ===
The authors showed example qualitative results from the DAVIS and Segtrack datasets.

Below are some success cases of object segmentation under complex motion, cluttered background, and/or multiple object occlusion.

[[File:maskrnn_example.png | 700px]]

Below are a few failure cases. The authors explain two reasons for failure: a) when similar objects of interest are contained in the frame (left two images), and b) when there are large variations in scale and viewpoint (right two images).

[[File:maskrnn_example_fail.png | 700px]]

== Conclusion ==
In this paper a novel approach to instance level video object segmentation task is presented which performs better than current state of the art. The long-term recurrence relationship is learnt using an RNN. The object localization network is added to improve accuracy of the system. Due to the recurrent component and the combination of segmentation and localization nets, the approach takes advantage of the long-term temporal information and the location prior to improve the results. Using online fine-tuning the network is adjusted to predict better for the current video sequence.

== Critique ==
The paper provides a technique to track multiple objects in a video. The novelty is to add back-propagation through time to improve the tracking accuracy and to use a localization network to remove any outliers in the segmented binary mask. However, the network architecture is too large and isn't able to run in real-time. There are N deep-Nets for N objects and each deep-Net contains 2 parallel VGG-16 convolutional networks.

== Implementation ==

The implementation of this paper was produced as part of the NIPS Paper Implementation Challenge. This implementation can be found at the following open source project: https://github.com/philferriere/tfvos.

== References ==
# Dosovitskiy, Alexey, et al. "Flownet: Learning optical flow with convolutional networks." Proceedings of the IEEE International Conference on Computer Vision. 2015.
# Hu, Y., Huang, J., & Schwing, A. "MaskRNN: Instance level video object segmentation". Conference on Neural Information Processing Systems (NIPS). 2017
# Ferriere, P. (n.d.). Semi-Supervised Video Object Segmentation (VOS) with Tensorflow. Retrieved March 20, 2018, from https://github.com/philferriere/tfvos
# Ronneberger, Olaf, Philipp Fischer, and Thomas Brox. "U-net: Convolutional networks for biomedical image segmentation." International Conference on Medical image computing and computer-assisted intervention. Springer, Cham, 2015.
# Lee, Yong Jae, Jaechul Kim, and Kristen Grauman. "Key-segments for video object segmentation." Computer Vision (ICCV), 2011 IEEE International Conference on. IEEE, 2011.
# Grundmann, Matthias, et al. "Efficient hierarchical graph-based video segmentation." Computer Vision and Pattern Recognition (CVPR), 2010 IEEE Conference on. IEEE, 2010.
# Li, Fuxin, et al. "Video segmentation by tracking many figure-ground segments." Computer Vision (ICCV), 2013 IEEE International Conference on. IEEE, 2013.
# Tsai, David, et al. "Motion coherent tracking using multi-label MRF optimization." International journal of computer vision 100.2 (2012): 190-202.

stat946w18/MaskRNN: Instance Level Video Object Segmentation

2018-04-21T04:13:33Z

F7xia: /* Ablation Study */

== Introduction ==
Deep Learning has produced state of the art results in many computer vision tasks like image classification, object localization, object detection, object segmentation, semantic segmentation and instance level video object segmentation. Image classification classifies the image based on the prominent objects. Object localization is the task of finding objects’ location in the frame. Object segmentation involves providing a pixel map which represents the pixel wise location of the objects in the image. Semantic segmentation attempts to segment the image into meaningful parts. Instance level video object segmentation is the task of consistent object segmentation in video sequences. Deforming shapes, fast movements, and occlusion from multiple objects are just some of the significant challenges in instance level video object segmentation.

There are 2 different types of video object segmentation, Unsupervised and Semi-supervised.
* In unsupervised video object segmentation, the task is to find the salient objects and track the main objects in the video.
* In a semi-supervised setting, the ground truth mask of the salient objects is provided for the first frame. The task is thus simplified to only track the objects required.

In this paper, the authors look at a semi-supervised video object segmentation technique, in which the ground truth mask of the first frame is provided.

== Background Papers ==
Video object segmentation has been performed using spatio-temporal graphs [[https://pdfs.semanticscholar.org/7221/c3470fa89879aab3ef270570ced15cde28de.pdf 5], [http://ieeexplore.ieee.org/abstract/document/5539893/ 6], [http://openaccess.thecvf.com/content_iccv_2013/papers/Li_Video_Segmentation_by_2013_ICCV_paper.pdf 7], [https://link.springer.com/content/pdf/10.1007/s11263-011-0512-5.pdf 8]] and deep learning. The graph based methods construct 3D spatio-temporal graphs in order to model the inter and the intra-frame relationship of pixels or superpixels in a video. Hence they are computationally slower than deep learning methods and are unable to run at real-time. There are 2 main deep learning techniques for semi-supervised video object segmentation: One Shot Video Object Segmentation (OSVOS) and Learning Video Object Segmentation from Static Images (MaskTrack). Following is a brief description of the new techniques introduced by these papers for semi-supervised video object segmentation task.

=== OSVOS (One-Shot Video Object Segmentation) ===

[[File:OSVOS.jpg | 1000px]]

This paper introduces the technique of using a frame-by-frame object segmentation without any temporal information from the previous frames of the video. The paper uses a VGG-16 network with pre-trained weights from an image classification task. This network is then converted into a fully convolutional network (FCN) by removing the fully connected dense layers at the end and adding convolution layers to generate a segment mask of the input. This network is then trained on the DAVIS 2016 dataset.

During testing, the trained VGG-16 FCN is fine-tuned using the first frame of the video using the ground truth. Because this is a semi-supervised case, the segmented mask (ground truth) for the first frame is available. The first frame data is augmented by zooming/rotating/flipping the first frame and the associated segment mask.

=== MaskTrack (Learning Video Object Segmentation from Static Images) ===

[[File:MaskTrack.jpg | 500px]]

MaskTrack takes the output of the previous frame to improve its predictions and to generate the segmentation mask for the next frame. Thus the input to the network is 4 channel wide (3 RGB channels from the frame at time <math>t</math> plus one binary segmentation mask from frame <math>t-1</math>). The output of the network is the binary segmentation mask for frame at time <math>t</math>. Using the binary segmentation mask (referred to as guided object segmentation in the paper), the network is able to use some temporal information from the previous frame to improve its segmentation mask prediction for the next frame.
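The 4-channel input described above can be illustrated with a short NumPy sketch that stacks the RGB frame at time <math>t</math> with the mask from frame <math>t-1</math>. The channels-last layout and the function name <code>masktrack_input</code> are assumptions made for illustration.

```python
import numpy as np

def masktrack_input(frame_rgb, prev_mask):
    """Stack an (H, W, 3) RGB frame with the (H, W) mask of the
    previous frame into one (H, W, 4) network input."""
    frame_rgb = np.asarray(frame_rgb, dtype=np.float32)
    prev_mask = np.asarray(prev_mask, dtype=np.float32)
    if frame_rgb.shape[:2] != prev_mask.shape:
        raise ValueError("frame and mask must have the same height and width")
    # Append the mask as a fourth channel after R, G and B.
    return np.concatenate([frame_rgb, prev_mask[..., None]], axis=-1)

frame = np.zeros((4, 5, 3))          # frame at time t
mask = np.ones((4, 5))               # predicted mask from time t-1
net_input = masktrack_input(frame, mask)
```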

The model of the MaskTrack network is similar to a modular VGG-16 and is referred to as MaskTrack ConvNet in the paper. The network is trained offline on saliency segmentation datasets: ECSSD, MSRA 10K, SOD and PASCAL-S. The input mask for the binary segmentation mask channel is generated via non-rigid deformation and affine transformation of the ground truth segmentation mask. Similar data-augmentation techniques are also used during online training. Just like OSVOS, MaskTrack uses the first frame as ground truth (with augmented images) to fine-tune the network to improve prediction score for the particular video sequence.

A parallel ConvNet network is used to generate a predicted segment mask based on the optical flow magnitude. The optical flow between 2 frames is calculated using the EpicFlow algorithm. The output of the two networks is combined using an averaging operation to generate the final predicted segmented mask.

Table 1 below gives a summary comparison of the different state of the art algorithms. The noteworthy information included in this table is that the technique presented in this paper is the only one which takes into account long-term temporal information. This is accomplished with a recurrent neural net. Furthermore, the bounding box is also estimated instead of just a segmentation mask. The authors claim that this allows the incorporation of a location prior from the tracked object.

[[File:Paper19-SegmentationComp.png]]

== Dataset ==
The three major datasets used in this paper are DAVIS-2016, DAVIS-2017 and Segtrack v2. The DAVIS-2016 dataset provides video sequences with only one segmentation mask for all salient objects. DAVIS-2017 improves the ground truth data by providing the segmentation mask for each salient object as a separate color segment mask. Segtrack v2 also provides multiple segmentation masks for all salient objects in the video sequence. These datasets try to recreate real-life scenarios such as occlusions, low resolution videos, background clutter, motion blur and fast motion.

== MaskRNN: Introduction ==
Most techniques mentioned above do not work directly on instance level segmentation of the objects through the video sequence. They focus on segmenting each frame as an image and use additional information (mask propagation and optical flow) from the preceding frame to make predictions for the current frame. To address the instance level segmentation problem, MaskRNN proposes a framework where the salient objects are tracked and segmented by capturing the temporal information in the video sequence using a recurrent neural network.

== MaskRNN: Overview ==
In a video sequence <math>I = \{I_1, I_2, \ldots, I_T\}</math>, the sequence of <math>T</math> frames is given as input to the network, where the video sequence contains <math>N</math> salient objects. The ground truth for the first frame <math>y_1^*</math> is also provided for the <math>N</math> salient objects.
In this paper, the problem is formulated as a time dependency problem: using a recurrent neural network, the prediction for the previous frame influences the prediction for the next frame. The approach also computes the optical flow between frames (optical flow is the apparent motion of objects between two consecutive frames in the form of a 2D vector field representing the displacement in brightness patterns for each pixel, apparent because it depends on the relative motion between the observer and the scene) and uses that as the input to the neural network. The optical flow is also used to align the output of the predicted mask. “The warped prediction, the optical flow itself, and the appearance of the current frame are then used as input for <math>N</math> deep nets, one for each of the <math>N</math> objects.”[1 - MaskRNN] Each deep net is made of an object localization network and a binary segmentation network. The binary segmentation network is used to generate the segmentation mask for an object. The object localization network is used to alleviate outliers from the predictions. The final prediction of the segmentation mask is generated by merging the predictions of the 2 networks. For <math>N</math> objects, there are <math>N</math> deep nets which predict the mask for each salient object. The predictions are then merged into a single prediction using an <math>\text{argmax}</math> operation at test time.

== MaskRNN: Multiple Instance Level Segmentation ==

[[File:2ObjectSeg.jpg | 850px]]

Image segmentation requires producing a pixel level segmentation mask and this can become a multi-class problem. Instead, using the approach from [2- Mask R-CNN], the problem is converted into multiple binary segmentation problems. A segmentation mask is predicted separately for each salient object, so each prediction is a binary segmentation problem. The binary segments are combined using an <math>\text{argmax}</math> operation where each pixel is assigned to the object with the largest predicted probability.

=== MaskRNN: Binary Segmentation Network ===

[[File:MaskRNNDeepNet.jpg | 850px]]

The above picture shows a single deep net employed for predicting the segment mask for one salient object in the video frame. The deep net consists of 2 networks: a binary segmentation network and an object localization network. The binary segmentation network is split into two streams: an appearance stream and a flow stream. The input of the appearance stream is the RGB frame at time <math>t</math> and the warped prediction of the binary segmentation mask from time <math>t-1</math>. The warping function uses the optical flow between frame <math>t-1</math> and frame <math>t</math> to generate a new binary segmentation mask for frame <math>t</math>. The input to the flow stream is the concatenation of the optical flow magnitude between frames <math>t-1</math> to <math>t</math> and frames <math>t</math> to <math>t+1</math> and the warped prediction of the segmentation mask from frame <math>t-1</math>. The magnitude of the optical flow is replicated into an RGB format before feeding it to the flow stream. The network architecture closely resembles a VGG-16 network without the pooling or fully connected layers at the end. The fully connected layers are replaced with convolutional and bilinear interpolation upsampling layers which are then linearly combined to form a feature representation that is the same size as the input image. This feature representation is then used to generate a binary segment mask. This technique is borrowed from the Fully Convolutional Network mentioned above. The outputs of the flow stream and the appearance stream are linearly combined and a sigmoid function is applied to the result to generate a binary mask for the <math>i</math>-th object. All parts of the network are fully differentiable and thus it can be fully trained in every pass.
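The idea of carrying the mask from frame <math>t-1</math> over to frame <math>t</math> with optical flow can be sketched with a nearest-neighbour backward warp. This is an illustrative simplification and not the paper's implementation: it assumes a flow field that, for each pixel of frame <math>t</math>, points to its source location in frame <math>t-1</math>, stored as <code>(dx, dy)</code> per pixel, and the function name <code>warp_mask</code> is hypothetical.

```python
import numpy as np

def warp_mask(prev_mask, backward_flow):
    """Warp the mask of frame t-1 into frame t.

    prev_mask: (H, W) mask predicted for frame t-1.
    backward_flow: (H, W, 2) array; for each pixel of frame t, the (dx, dy)
        offset to its source position in frame t-1 (assumed convention).
    Pixels whose source falls outside the image are set to 0.
    """
    h, w = prev_mask.shape
    ys, xs = np.mgrid[0:h, 0:w]
    src_x = np.rint(xs + backward_flow[..., 0]).astype(int)
    src_y = np.rint(ys + backward_flow[..., 1]).astype(int)
    valid = (src_x >= 0) & (src_x < w) & (src_y >= 0) & (src_y < h)
    warped = np.zeros_like(prev_mask)
    warped[valid] = prev_mask[src_y[valid], src_x[valid]]
    return warped

prev_mask = np.zeros((3, 4))
prev_mask[1, 1] = 1.0                 # object pixel in frame t-1
flow = np.zeros((3, 4, 2))
flow[..., 0] = -1.0                   # everything moved one pixel to the right
warped = warp_mask(prev_mask, flow)
```

In this toy case the object pixel moves from column 1 to column 2, so the warped mask lines up with where the object is expected in frame <math>t</math>.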

=== MaskRNN: Object Localization Network: ===
The object localization network generates a bounding box of the salient object in the frame using a technique similar to the Fast-RCNN method of object localization: region of interest (RoI) pooling is performed on the features of the region proposals (i.e. the bounding box proposals here), and the pooled features are passed through fully connected layers to perform regression. This bounding box is enlarged by a factor of 1.25 and combined with the output of the binary segmentation network. Only the segment mask available in the bounding box is used for prediction and the pixels outside of the bounding box are marked as zero. MaskRNN uses the convolutional feature output of the appearance stream as the input to the RoI-pooling layer to generate the predicted bounding box. A pixel is classified as foreground if it is both predicted to be in the foreground by the binary segmentation net and within the enlarged estimated bounding box from the object localization net.

=== Training and Finetuning ===
For training the network depicted in Figure 1, backpropagation through time is used in order to preserve the recurrence relationship connecting the frames of the video sequence. Predictive performance is further improved by following the algorithm for semi-supervised setting for video object segmentation with fine-tuning achieved by using the first frame segmentation mask of the ground truth. In this way, the network is further optimized using the ground truth data.

== MaskRNN: Implementation Details ==
=== Offline Training ===
The deep net is first trained offline on a set of static images. The ground truth is randomly perturbed locally to generate the imperfect mask from frame <math>t-1</math>. Two different networks are trained offline separately for the DAVIS-2016 and DAVIS-2017 datasets for a fair evaluation on both datasets. After both the object localization net and the binary segmentation network have been trained, the temporal information in the network is used to further improve the segmented prediction results. Because of GPU memory constraints, the RNN can only backpropagate the gradients back through 7 frames when learning long-term temporal information.

For optical flow, a pre-trained flowNet2.0 is used to compute the optical flow between frames. (A flowNet (Dosovitskiy 2015) is a deep neural network trained to predict optical flow. The simplest form of flowNet has an architecture consisting of two parts. The first part accepts as input the two images between which the optical flow is to be computed, and applies a sequence of convolution and max-pooling operations, as in a standard convolutional neural network. In the second part, repeated up-convolution operations are applied, increasing the dimensions of the feature-maps. Besides the output of the previous up-convolution, each up-convolution is also fed as input the output of the corresponding down-convolution from the first part of the network. Thus part of the architecture resembles that of a U-net (Ronneberger, 2015). The output of the network is the predicted optical flow.)

=== Online Finetuning ===
The deep nets (without the RNN) are then fine-tuned during test time by online training the networks on the ground truth of the first frame and some augmentations of the first frame data. The learning rate is set to <math>10^{-5}</math> for online training for 200 iterations and the learning rate is gradually decayed over time. Data augmentation techniques similar to those in offline training, namely random resizing, rotating, cropping and flipping, are applied. Also, it should be noted that the RNN is ''not'' employed during online finetuning since only a single frame of training data is available.

== MaskRNN: Experimental Results ==
=== Evaluation Metrics ===
There are 3 different techniques for performance analysis for Video Object Segmentation techniques:

1. Region Similarity (Jaccard Index): Region similarity or Intersection-over-union is used to capture precision of the area covered by the prediction segmentation mask compared to the ground truth segmentation mask. It calculates the average across all frames of the dataset. This is particularly challenging for small sized foreground objects.

\begin{equation}
IoU = \frac{|M \cap G|}{|M| + |G| - |M \cap G|}
\label{equation:Jaccard}
\end{equation}

2. Contour Accuracy (F-score): This metric measures the accuracy in the boundary of the predicted segment mask and the ground truth segment mask, by calculating the precision and the recall of the two sets of points on the contours of the ground truth segment and the output segment via a bipartite graph matching. It is a measure of accurate delineation of the foreground objects.

[[File:Fscore.jpg | 200px|center]]

3. Temporal Stability : This estimates the degree of deformation needed to transform the segmentation masks from one frame to the next and is measured by the dissimilarity of the set of points on the contours of the segmentation between two adjacent frames.

Region similarity measures the true segmented area in the prediction, while Contour Accuracy measures the accuracy of the contours/segmented mask boundary.
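Once the contour precision and recall have been obtained from the bipartite matching, the contour accuracy is their F-score. The sketch below assumes the standard F-measure, i.e. the harmonic mean of precision and recall, and leaves out the contour matching step itself; the function name is hypothetical.

```python
def f_measure(precision, recall):
    """Harmonic mean of contour precision and recall.

    Returns 0.0 when both are zero (an assumed convention that
    avoids dividing by zero).
    """
    if precision + recall == 0:
        return 0.0
    return 2.0 * precision * recall / (precision + recall)

# Half of the predicted contour points are correct and every
# ground truth contour point is matched.
score = f_measure(0.5, 1.0)
```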

=== Ablation Study ===

The ablation study summarized how the different components contributed to the algorithm evaluated on DAVIS-2016 and DAVIS-2017 datasets.

[[File:MaskRNNTable2.jpg | 700px|center]]

The above table presents the contribution of each component of the network to the final prediction score. Online fine-tuning improves the performance by a large margin, as the network becomes adjusted to the appearance of the specific object being tracked. The addition of the RNN, the localization net and the FStream all seem to positively affect the performance of the deep net. The FStream provides information on motion boundaries, which helps in videos with cluttered backgrounds, while the RNN provides more consistent segmentation masks over time. The localization net has a more ambiguous effect on the network: adding the bounding box regression loss decreases the performance of the segmentation net, but applying the bounding box to restrict the segmentation mask improves the results over those achieved by only using the segmentation net. In other words, the localization net should only be used in conjunction with the segmentation net, while the segmentation net can be used by itself.

=== Quantitative Evaluation ===

The authors use DAVIS-2016, DAVIS-2017 and Segtrack v2 to compare the performance of the proposed approach to other methods based on foreground-background video object segmentation and multiple instance-level video object segmentation.

[[File:MaskRNNTable3.jpg | 700px]]

The above table shows the results for contour accuracy mean and region similarity. The MaskRNN method seems to outperform all previously proposed methods. The performance gain from employing a recurrent neural network to learn the recurrence relationship and using an object localization network to improve prediction results is significant.

The following table shows the improvements over the state of the art achieved by MaskRNN on the DAVIS-2017 and SegTrack v2 datasets.

[[File:MaskRNNTable4.jpg | 700px]]

=== Qualitative Evaluation ===
The authors show example qualitative results from the DAVIS and SegTrack datasets.

Below are some success cases of object segmentation under complex motion, cluttered background, and/or multiple object occlusion.

[[File:maskrnn_example.png | 700px]]

Below are a few failure cases. The authors explain two reasons for failure: a) when similar objects of interest are contained in the frame (left two images), and b) when there are large variations in scale and viewpoint (right two images).

[[File:maskrnn_example_fail.png | 700px]]

== Conclusion ==
This paper presents a novel approach to instance-level video object segmentation that performs better than the current state of the art. The long-term recurrence relationship is learned using an RNN, and an object localization network is added to improve the accuracy of the system. Thanks to the recurrent component and the combination of segmentation and localization nets, the approach takes advantage of both long-term temporal information and a location prior to improve the results. With online fine-tuning, the network is further adapted to the current video sequence.

== Critique ==
The paper provides a technique to track multiple objects in a video. Its novelty lies in adding back-propagation through time to improve tracking accuracy and in using a localization network to remove outliers from the segmented binary mask. However, the network architecture is too large and cannot run in real time: there are N deep nets for N objects, and each deep net contains 2 parallel VGG-16 convolutional networks.

== Implementation ==

The implementation of this paper was produced as part of the NIPS Paper Implementation Challenge. This implementation can be found at the following open source project: https://github.com/philferriere/tfvos.

== References ==
# Dosovitskiy, Alexey, et al. "Flownet: Learning optical flow with convolutional networks." Proceedings of the IEEE International Conference on Computer Vision. 2015.
# Hu, Y., Huang, J., & Schwing, A. "MaskRNN: Instance level video object segmentation." Conference on Neural Information Processing Systems (NIPS). 2017.
# Ferriere, P. (n.d.). Semi-Supervised Video Object Segmentation (VOS) with Tensorflow. Retrieved March 20, 2018, from https://github.com/philferriere/tfvos
# Ronneberger, Olaf, Philipp Fischer, and Thomas Brox. "U-net: Convolutional networks for biomedical image segmentation." International Conference on Medical image computing and computer-assisted intervention. Springer, Cham, 2015.
# Lee, Yong Jae, Jaechul Kim, and Kristen Grauman. "Key-segments for video object segmentation." Computer Vision (ICCV), 2011 IEEE International Conference on. IEEE, 2011.
# Grundmann, Matthias, et al. "Efficient hierarchical graph-based video segmentation." Computer Vision and Pattern Recognition (CVPR), 2010 IEEE Conference on. IEEE, 2010.
# Li, Fuxin, et al. "Video segmentation by tracking many figure-ground segments." Computer Vision (ICCV), 2013 IEEE International Conference on. IEEE, 2013.
# Tsai, David, et al. "Motion coherent tracking using multi-label MRF optimization." International journal of computer vision 100.2 (2012): 190-202.

Continuous Adaptation via Meta-Learning in Nonstationary and Competitive Environments

2018-04-21T04:12:11Z