http://wiki.math.uwaterloo.ca/statwiki/api.php?action=feedcontributions&user=Mpafla&feedformat=atomstatwiki - User contributions [US]2022-09-25T08:32:56ZUser contributionsMediaWiki 1.28.3http://wiki.math.uwaterloo.ca/statwiki/index.php?title=learn_what_not_to_learn&diff=41166learn what not to learn2018-11-23T17:03:25Z<p>Mpafla: /* Action Elimination */</p>
<hr />
<div>=Introduction=<br />
In reinforcement learning, it is often difficult for an agent to learn when the action space is large. In the specific case where many actions are irrelevant, it is sometimes easier for the algorithm to learn which actions not to take. The paper proposes a new reinforcement learning approach for dealing with large action spaces by restricting the available actions in each state to a subset of the most likely ones. More specifically, it proposes a system that learns an approximation of the Q-function and concurrently learns to eliminate actions. The method relies on an external elimination signal which incorporates domain-specific prior knowledge. For example, in parser-based text games, the parser gives feedback regarding irrelevant actions after the action is played (e.g., Player: "Climb the tree." Parser: "There are no trees to climb"). A machine learning model can then be trained to generalize this feedback to unseen states. <br />
<br />
The paper focuses mainly on tasks where both the states and the actions are natural language. It introduces a novel deep reinforcement learning approach that combines a DQN with an Action Elimination Network (AEN), both of which use a CNN architecture designed for NLP tasks. The AEN is trained to predict invalid actions, supervised by the elimination signal from the environment. '''Note that the core assumption is that it is easy to predict which actions are invalid or inferior in each state and leverage that information for control.'''<br />
<br />
The elimination framework is tested on the text-based game "Zork", which lets the player interact with a virtual world through a text-based interface. By eliminating irrelevant actions, the AE algorithm learns faster than the baseline agents.<br />
<br />
Below is an example of the Zork interface:<br />
<br />
[[File:AEF_zork_interface.png]]<br />
<br />
All states and actions are given in natural language. The game admits more than a thousand possible actions in each state, since the player can type anything.<br />
<br />
=Related Work=<br />
Text-Based Games (TBG): The state of the environment in a TBG is described in simple language. The player interacts with the environment through text commands that respect a pre-defined grammar. A popular example is Zork, which is the game tested in the paper. TBGs sit at a useful intersection of RL and NLP research: they require language understanding, long-term memory, planning, exploration, affordance extraction and common sense. They also often introduce stochastic dynamics to increase randomness.<br />
<br />
Representations for TBG: Good word representations are necessary in order to learn control policies from text. Previous work on TBG used pre-trained embeddings directly for control. Other works combined pre-trained embeddings with neural networks.<br />
<br />
DRL with linear function approximation: DRL methods such as the DQN have achieved state-of-the-art results in a variety of challenging, high-dimensional domains. This is mainly because neural networks can learn rich domain representations for the value function and the policy. On the other hand, batch reinforcement learning methods with linear representations are more stable and accurate, but they require feature engineering.<br />
<br />
RL in Large Action Spaces: Prior work concentrated on factorizing the action space into binary subspaces (Pazis and Parr, 2011; Dulac-Arnold et al., 2012; Lagoudakis and Parr, 2003). Other works proposed to embed the discrete actions into a continuous space and then choose the nearest discrete action according to the optimal actions in the continuous space (Dulac-Arnold et al., 2015; Van Hasselt and Wiering, 2009). He et al. (2015) extended DQN to unbounded (natural language) action spaces.<br />
Learning to eliminate actions was first mentioned by Even-Dar, Mannor, and Mansour (2003), who proposed to learn confidence intervals around the value function in each state. Lipton et al. (2016a) proposed to learn a classifier that detects hazardous states and then use it to shape the reward. Fulda et al. (2017) presented a method for affordance extraction via inner products of pre-trained word embeddings.<br />
<br />
=Action Elimination=<br />
<br />
After executing an action, the agent observes a binary elimination signal e(s, a) that indicates which actions not to take. It equals 1 if action a may be eliminated in state s, and 0 otherwise. The signal helps mitigate the problem of large discrete action spaces. We start with the following definitions:<br />
<br />
'''Definition 1:''' <br />
<br />
Valid state-action pairs with respect to an elimination signal are state-action pairs which the elimination process should not eliminate.<br />
<br />
'''Definition 2:'''<br />
<br />
Admissible state-action pairs with respect to an elimination algorithm are state-action pairs which the elimination algorithm does not eliminate.<br />
<br />
'''Definition 3:'''<br />
<br />
Action Elimination Q-learning is a Q-learning algorithm which updates only admissible state-action pairs and chooses the best action in the next state from its admissible actions. We allow the base Q-learning algorithm to be any algorithm that converges to <math display="inline">Q^*</math> with probability 1 after observing each state-action pair infinitely often.<br />
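<br />
Definition 3 can be illustrated with a minimal tabular sketch. It is only an illustration under stated assumptions: the helper <code>admissible</code>, the step size <code>alpha</code> and the toy states and actions are hypothetical and do not come from the paper.<br />

```python
from collections import defaultdict


def ae_q_update(Q, s, a, r, s_next, admissible, alpha=0.1, gamma=0.8):
    """One tabular Q-learning step restricted to admissible actions.

    `admissible` is a hypothetical callable mapping a state to the list of
    actions that the elimination algorithm has not eliminated.
    """
    if a not in admissible(s):
        return  # eliminated state-action pairs are never updated
    # Max over admissible next actions only (0.0 if none remain).
    best_next = max((Q[(s_next, b)] for b in admissible(s_next)), default=0.0)
    Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])


Q = defaultdict(float)
Q[("room", "climb tree")] = 5.0   # spuriously high estimate of an invalid action
Q[("room", "go north")] = 1.0


def admissible(state):
    # Toy elimination result: "climb tree" has been eliminated everywhere.
    return ["go north", "open door"]


ae_q_update(Q, "hall", "open door", r=0.0, s_next="room", admissible=admissible)
```

Because the max ignores the eliminated action "climb tree", its overestimated value of 5.0 does not leak into the update for ("hall", "open door").<br />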
<br />
The approach in the paper builds on the standard RL formulation. At each time step t, the agent observes state <math display="inline">s_t </math> and chooses a discrete action <math display="inline">a_t\in\{1,...,|A|\} </math>. Then the agent obtains a reward <math display="inline">r_t(s_t,a_t) </math> and the next state <math display="inline">s_{t+1} </math>. The goal of the algorithm is to learn a policy <math display="inline">\pi(a|s) </math> which maximizes the expected future discounted return <math display="inline">V^\pi(s)=E^\pi[\sum_{t=0}^{\infty}\gamma^tr(s_t,a_t)|s_0=s]. </math> After executing an action, the agent observes a binary elimination signal e(s,a), which equals 1 if action a can be eliminated in state s and 0 otherwise. <br />
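<br />
The discounted return inside the expectation can be computed for a finite reward sequence as in the short sketch below. The reward values and the discount factor are illustrative assumptions.<br />

```python
def discounted_return(rewards, gamma):
    """Return sum_t gamma^t * r_t for a finite list of rewards."""
    g = 0.0
    # Accumulate backwards so that each step applies one factor of gamma.
    for r in reversed(rewards):
        g = r + gamma * g
    return g


# Two steps with a penalty of -1 followed by a final reward of 100.
example_return = discounted_return([-1, -1, 100], gamma=0.8)
```

With <code>gamma=1</code> the function reduces to the plain sum of rewards.<br />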
<br />
==Advantages of Action Elimination==<br />
The main advantage of action elimination is that it allows the agent to overcome two of the main difficulties in large action spaces: function approximation and sample complexity. <br />
<br />
Function approximation: Errors in the Q-function estimates may cause the learning algorithm to converge to a suboptimal policy, and this phenomenon becomes more noticeable when the action space is large. Action elimination mitigates this effect by taking the max operator only over valid actions, thus reducing potential overestimation. In addition, by ignoring the invalid actions, the function approximator can learn a simpler mapping, leading to faster convergence.<br />
<br />
Sample complexity: The sample complexity measures the number of steps during learning in which the policy is not <math display="inline">\epsilon</math>-optimal. An invalid action often returns no reward and does not change the state, resulting in an action gap of <math display="inline">\epsilon=(1-\gamma)V^*(s)</math>. According to (Lattimore and Hutter, 2012), this translates to <math display="inline">V^*(s)^{-2}(1-\gamma)^{-5}\log(1/\delta)</math> wasted samples for learning each invalid state-action pair. In practice, an elimination algorithm can eliminate these invalid actions and therefore speed up the learning process by a factor of approximately <math display="inline">A/A'</math>.<br />
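<br />
To get a feel for the size of this term, the expression can be evaluated numerically. The sketch below omits constant factors, and the values chosen for <math display="inline">V^*(s)</math>, <math display="inline">\gamma</math> and <math display="inline">\delta</math> are illustrative assumptions.<br />

```python
import math


def wasted_samples(v_star, gamma, delta):
    """Evaluate V*(s)^-2 * (1 - gamma)^-5 * log(1/delta), constants omitted."""
    return v_star ** -2 * (1.0 - gamma) ** -5 * math.log(1.0 / delta)


# Illustrative values: V*(s) = 1, gamma = 0.8, delta = 0.05.
estimate = wasted_samples(v_star=1.0, gamma=0.8, delta=0.05)
```

The term grows quickly as <math display="inline">\gamma</math> approaches 1, which is why removing invalid actions early can save many samples.<br />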
<br />
==Action elimination with contextual bandits==<br />
<br />
Let <math display="inline">x(s_t)\in R^d </math> be the feature representation of <math display="inline">s_t </math>. We assume that under this representation there exists a set of parameters <math display="inline">\theta_a^*\in R^d </math> such that the elimination signal in state <math display="inline">s_t </math> is <math display="inline">e_t(s_t,a) = \theta_a^{*T}x(s_t)+\eta_t </math>, where <math display="inline"> \Vert\theta_a^*\Vert_2\leq S</math> and <math display="inline">\eta_t</math> is an R-subgaussian random variable with zero mean that models additive noise in the elimination signal. When there is no noise in the elimination signal, R=0. Otherwise, <math display="inline">R\leq 1</math> since the elimination signal is bounded in [0,1]. Assume the elimination signal satisfies <math display="inline">0\leq E[e_t(s_t,a)]\leq l </math> for any valid action and <math display="inline"> u\leq E[e_t(s_t, a)]\leq 1</math> for any invalid action, with <math display="inline"> l<u</math>. Denote by <math display="inline">X_{t,a}</math> the matrix whose rows are the representation vectors of the states in which action a was chosen, up to time t, and by <math display="inline">E_{t,a}</math> the vector whose elements are the elimination signals observed in those states. Denote the solution to the regularized linear regression <math display="inline">\Vert X_{t,a}\theta_{t,a}-E_{t,a}\Vert_2^2+\lambda\Vert \theta_{t,a}\Vert_2^2 </math> (for some <math display="inline">\lambda>0</math>) by <math display="inline">\hat{\theta}_{t,a}=\bar{V}_{t,a}^{-1}X_{t,a}^TE_{t,a} </math>, where <math display="inline">\bar{V}_{t,a}=\lambda I + X_{t,a}^TX_{t,a}</math>.<br />
<br />
<br />
According to Theorem 2 in (Abbasi-Yadkori, Pal, and Szepesvari, 2011), <math display="inline">|\hat{\theta}_{t,a}^{T}x(s_t)-\theta_a^{*T}x(s_t)|\leq\sqrt{\beta_t(\delta)x(s_t)^T\bar{V}_{t,a}^{-1}x(s_t)} \ \forall t>0</math> with probability at least <math display="inline">1-\delta</math>, where <math display="inline">\sqrt{\beta_t(\delta)}=R\sqrt{2\log(\det(\bar{V}_{t,a})^{1/2}\det(\lambda I)^{-1/2}/\delta)}+\lambda^{1/2}S</math>. If <math display="inline">\forall s \ \Vert x(s)\Vert_2 \leq L</math>, then <math display="inline">\beta_t</math> can be bounded by <math display="inline">\sqrt{\beta_t(\delta)} \leq R \sqrt{d\log\left(\frac{1+tL^2/\lambda}{\delta}\right)}+\lambda^{1/2}S</math>. Next, define <math display="inline">\tilde{\delta}=\delta/k</math> and bound this probability for all the actions, i.e., <math display="inline">\forall a,t>0</math><br />
<br />
<math display="inline">Pr\left(|\hat{\theta}_{t,a}^{T}x(s_t)-\theta_a^{*T}x(s_t)|\leq\sqrt{\beta_t(\tilde{\delta})x(s_t)^T\bar{V}_{t,a}^{-1}x(s_t)}\right) \geq 1-\delta</math><br />
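<br />
The upper bound on <math display="inline">\sqrt{\beta_t(\delta)}</math> stated above (the one that holds when <math display="inline">\Vert x(s)\Vert_2 \leq L</math>) can be evaluated with a few lines of code. This is a sketch that reads the logarithm's argument as <math display="inline">(1+tL^2/\lambda)/\delta</math>. The function name and the example values are assumptions.<br />

```python
import math


def sqrt_beta_bound(t, d, R, S, L, lam, delta):
    """Evaluate R * sqrt(d * log((1 + t * L^2 / lam) / delta)) + sqrt(lam) * S."""
    log_term = math.log((1.0 + t * L ** 2 / lam) / delta)
    return R * math.sqrt(d * log_term) + math.sqrt(lam) * S


# Noiseless signal (R = 0): only the regularization term remains.
noiseless = sqrt_beta_bound(t=100, d=10, R=0.0, S=2.0, L=1.0, lam=1.0, delta=0.05)
# Noisy signal (R = 1): the bound grows slowly (logarithmically) with t.
noisy_early = sqrt_beta_bound(t=100, d=10, R=1.0, S=2.0, L=1.0, lam=1.0, delta=0.05)
noisy_late = sqrt_beta_bound(t=10000, d=10, R=1.0, S=2.0, L=1.0, lam=1.0, delta=0.05)
```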
<br />
Recall that <math display="inline">E[e_t(s_t,a)]=\theta_a^{*T}x(s_t)\leq l</math> if a is a valid action. Then we can eliminate action a at state <math display="inline">s_t</math> if it satisfies:<br />
<br />
<math display="inline">\hat{\theta}_{t,a}^{T}x(s_t)-\sqrt{\beta_t(\tilde{\delta})x(s_t)^T\bar{V}_{t,a}^{-1}x(s_t)}>l</math><br />
<br />
This rule ensures that, with probability at least <math display="inline">1-\delta</math>, we never eliminate a valid action. Note that <math display="inline">l, u</math> are not known. In practice, choosing <math display="inline">l</math> to be 0.5 should suffice.<br />
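<br />
The elimination rule can be sketched in pure Python for a single action: fit the ridge regression, compute the confidence width, and compare the lower confidence bound with <math display="inline">l</math>. This is a minimal illustration, not the paper's implementation. The helper names, the toy data and the fixed value of <code>beta</code> are assumptions.<br />

```python
import math


def solve(A, b):
    """Solve A x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(A)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(n):
            if r != col:
                f = M[r][col] / M[col][col]
                M[r] = [x - f * y for x, y in zip(M[r], M[col])]
    return [M[i][n] / M[i][i] for i in range(n)]


def should_eliminate(X, E, x, beta, l=0.5, lam=1.0):
    """Check theta_hat^T x - sqrt(beta * x^T V^-1 x) > l for one action.

    X: feature rows of states where the action was tried.
    E: elimination signals observed in those states.
    """
    d = len(x)
    # V = lam * I + X^T X and b = X^T E
    V = [[lam * (i == j) + sum(row[i] * row[j] for row in X) for j in range(d)]
         for i in range(d)]
    b = [sum(row[i] * e for row, e in zip(X, E)) for i in range(d)]
    theta_hat = solve(V, b)      # ridge solution V^-1 X^T E
    V_inv_x = solve(V, x)        # V^-1 x without forming the inverse
    mean = sum(t * xi for t, xi in zip(theta_hat, x))
    width = math.sqrt(beta * sum(xi * vi for xi, vi in zip(x, V_inv_x)))
    return mean - width > l


# Toy data: the action was rejected (signal 1) in 20 visits to the same state.
X_many = [[1.0, 0.0]] * 20
E_many = [1.0] * 20
enough_evidence = should_eliminate(X_many, E_many, [1.0, 0.0], beta=1.0)
too_few_visits = should_eliminate(X_many[:2], E_many[:2], [1.0, 0.0], beta=1.0)
unseen_state = should_eliminate(X_many, E_many, [0.0, 1.0], beta=1.0)
```

In this toy setting the action is eliminated only after enough rejections have shrunk the confidence width. It stays admissible after two visits and in a state whose features have never been observed.<br />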
<br />
==Concurrent Learning==<br />
In fact, Q-learning and contextual bandit algorithms can learn simultaneously, resulting in the convergence of both algorithms, i.e., finding an optimal policy and a minimal valid action space. <br />
<br />
If the elimination is done based on the concentration bounds of the linear contextual bandits, it can be ensured that Action Elimination Q-learning converges, as shown in Proposition 1.<br />
<br />
'''Proposition 1:'''<br />
<br />
Assume that all state-action pairs (s,a) are visited infinitely often, unless eliminated according to <math display="inline">\hat{\theta}_{t-1,a}^Tx(s)-\sqrt{\beta_{t-1}(\tilde{\delta})x(s)^T\bar{V}_{t-1,a}^{-1}x(s)}>l</math>. Then, with a probability of at least <math display="inline">1-\delta</math>, action elimination Q-learning converges to the optimal Q-function for all valid state-action pairs. In addition, actions which should be eliminated are visited at most <math display="inline">T_{s,a}(t)\leq 4\beta_t/(u-l)^2+1</math> times.<br />
<br />
Notice that when there is no noise in the elimination signal (R=0), we correctly eliminate actions with probability 1, so invalid actions will be sampled only a finite number of times.<br />
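<br />
As a worked instance of the visit bound in Proposition 1 (the numerical values are illustrative assumptions, not taken from the paper): in the noiseless case <math display="inline">R=0</math>, the definition of <math display="inline">\sqrt{\beta_t(\delta)}</math> reduces to <math display="inline">\lambda^{1/2}S</math>, so <math display="inline">\beta_t=\lambda S^2</math> and<br />
<br />
<math display="inline">T_{s,a}(t)\leq \frac{4\lambda S^2}{(u-l)^2}+1.</math><br />
<br />
For example, with <math display="inline">\lambda=1, S=1, u=1, l=0.5</math> the bound evaluates to <math display="inline">4/0.25+1=17</math> visits, independent of t.<br />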
<br />
=Method=<br />
The assumption that <math display="inline">e_t(s_t,a)=\theta_a^{*T}x(s_t)+\eta_t </math> might not hold when using raw features like word2vec. The paper therefore proposes to use the activations of the neural network's last layer as features. A practical challenge here is that the features must be fixed over time when used by the contextual bandit. To address this, a batch-update framework (Levine et al., 2017; Riquelme, Tucker, and Snoek, 2018) is used, in which a new contextual bandit model is learned every few steps using the last-layer activations of the AEN as features.<br />
<br />
==Architecture of action elimination framework==<br />
<br />
[[File:AEF_architecture.png]]<br />
<br />
After taking action <math display="inline">a_t</math>, the agent observes <math display="inline">(r_t,s_{t+1},e_t)</math>. The agent uses this tuple to train two deep neural network function approximators: a DQN and an AEN. The AEN provides an admissible action set <math display="inline">A'</math> to the DQN. The architecture of both the AEN and the DQN is an NLP CNN (100 convolutional filters for the AEN and 500 for the DQN, with three different 1D kernels of length (1,2,3)), based on (Kim, 2014). The state is represented as a sequence of words, composed of the game descriptor and the player's inventory. These are truncated or zero-padded to a length of 50 descriptor + 15 inventory words, and each word is embedded into a continuous vector in <math display="inline">R^{300}</math> using word2vec. The features of the last four states are then concatenated such that the final state representations s are in <math display="inline">R^{78000}</math>. The AEN is trained to minimize the MSE loss, using the elimination signal as a label.<br />
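<br />
The dimension of the state representation follows directly from the numbers above: (50 + 15) words × 300 embedding dimensions × 4 states. The short sketch below checks this arithmetic and shows one way to truncate or pad a token list. The helper name and the pad token are hypothetical (the summary only states that inputs are zero-padded).<br />

```python
DESCRIPTOR_WORDS = 50
INVENTORY_WORDS = 15
EMBEDDING_DIM = 300   # word2vec vector size
HISTORY_LENGTH = 4    # number of concatenated states

STATE_DIM = (DESCRIPTOR_WORDS + INVENTORY_WORDS) * EMBEDDING_DIM * HISTORY_LENGTH


def pad_or_truncate(tokens, length, pad="<pad>"):
    """Cut a token list to `length`, or fill it up with a placeholder token."""
    return tokens[:length] + [pad] * max(0, length - len(tokens))


descriptor = pad_or_truncate("you are standing in an open field".split(), DESCRIPTOR_WORDS)
inventory = pad_or_truncate(["lantern"], INVENTORY_WORDS)
```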
<br />
==Pseudocode of the Algorithm==<br />
<br />
[[File:AEF_pseudocode.png]]<br />
<br />
AE-DQN trains two networks: a DQN denoted by Q and an AEN denoted by E. Every L iterations, the algorithm creates a linear contextual bandit model from E with the procedure AENUpdate(). This procedure uses the activations of the last hidden layer of E as features, which are then used to create a contextual linear bandit model. AENUpdate() then solves this model and plugs it into the target AEN. The contextual linear bandit model <math display="inline">(E^-,V)</math> is then used to eliminate actions via the ACT() and Targets() functions. ACT() follows an <math display="inline">\epsilon</math>-greedy mechanism on the admissible action set <math display="inline">A'</math>. For exploitation, it selects the action with the highest Q-value by taking an argmax over the Q-values among <math display="inline">A'</math>. For exploration, it selects an action uniformly from <math display="inline">A'</math>. The Targets() procedure estimates the value function by taking the max over Q-values only among admissible actions, hence reducing function approximation errors.<br />
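<br />
The action-selection and target computations described above can be sketched as follows. This is a simplified illustration, not the paper's code: Q-values are passed in as a plain list indexed by action, the admissible set is given as a list of action indices, and the function signatures are assumptions.<br />

```python
import random


def act(q_values, admissible, epsilon, rng=random):
    """Epsilon-greedy action selection restricted to the admissible set A'."""
    if rng.random() < epsilon:
        return rng.choice(admissible)                    # explore uniformly over A'
    return max(admissible, key=lambda a: q_values[a])    # exploit: argmax over A'


def targets(reward, next_q_values, next_admissible, gamma, terminal=False):
    """Q-learning target with the max taken over admissible actions only."""
    if terminal or not next_admissible:
        return reward
    return reward + gamma * max(next_q_values[a] for a in next_admissible)


q = [5.0, 1.0, 2.0]        # action 0 has a spuriously high Q-value
admissible_set = [1, 2]    # action 0 has been eliminated
greedy_action = act(q, admissible_set, epsilon=0.0)
y = targets(-1.0, q, admissible_set, gamma=0.8)
```

Here the greedy choice is action 2 rather than the eliminated action 0, and the target uses 2.0 rather than 5.0 as the next-state value.<br />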
<br />
<br />
=Experiment=<br />
==Zork domain==<br />
The world of Zork presents a rich environment with a large state and action space. <br />
Zork players describe their actions using natural language instructions, for example, "open the mailbox". These actions are then processed by a sophisticated natural language parser, and based on the result the game presents the outcome of the action. The goal of Zork is to collect the Twenty Treasures of Zork and install them in the trophy case. Points generated by the game's scoring system are given to the agent as the reward. For example, the player gets points for solving puzzles, and placing all treasures in the trophy case yields 350 points. The elimination signal is given in two forms: a "wrong parse" flag and text feedback such as "you cannot take that". These two signals are grouped together into a single binary signal, which is then provided to the algorithm. <br />
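<br />
A minimal sketch of how the two forms of feedback could be merged into one binary signal is shown below. The function name and the simple substring match are assumptions, since the summary does not specify how the game output is processed.<br />

```python
def elimination_signal(wrong_parse, feedback):
    """Return 1 if either form of feedback marks the action as invalid, else 0."""
    rejected = "you cannot take that" in feedback.lower()
    return int(wrong_parse or rejected)


signal_parse_error = elimination_signal(True, "")
signal_text = elimination_signal(False, "You cannot take that.")
signal_valid = elimination_signal(False, "Opening the mailbox reveals a leaflet.")
```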
<br />
Experiments begin with two subdomains of Zork: the Egg Quest and the Troll Quest. For these subdomains, an additional reward signal is provided to guide the agent towards solving specific tasks and to make the results more visible. A reward of -1 is applied at every time step to encourage the agent to favor short paths. Each trajectory terminates upon completing the quest or after T steps are taken. The discount factor is <math display="inline">\gamma=0.8</math> during training and <math display="inline">\gamma=1</math> during evaluation. In addition, <math display="inline">\beta=0.5, l=0.6</math> in all experiments. <br />
<br />
===Egg Quest===<br />
The goal of this quest is to find and open the jewel-encrusted egg hidden in a tree in the forest. The agent gets 100 points upon completing this task. The action space consists of 9 fixed actions for navigation and a second subset of <math display="inline">N_{Take}</math> actions for taking possible objects in the game. Two settings were tested separately: <math display="inline">N_{Take}=200</math> (set <math display="inline">A_1</math>) and <math display="inline">N_{Take}=300</math> (set <math display="inline">A_2</math>).<br />
AE-DQN (blue) and a vanilla DQN agent (green) were tested in this quest.<br />
<br />
[[File:AEF_zork_comparison.png]]<br />
<br />
Figure a) corresponds to the set <math display="inline">A_1</math> with T=100, and b) corresponds to the set <math display="inline">A_2</math> with T=200. Both agents perform well on these two sets. However, the AE-DQN agent learns much faster than DQN, which implies that action elimination is more robust when the action space is large.<br />
<br />
<br />
===Troll Quest===<br />
The goal of this quest is to find the troll. To do so, the agent needs to find its way to the house and use a lantern to expose the hidden entrance to the underworld. It gets 100 points upon achieving the goal. This quest is a larger problem than Egg Quest. The action set <math display="inline">A_1</math> consists of 200 take actions and 15 necessary actions, 215 in total.<br />
<br />
[[File:AEF_troll_comparison.png]]<br />
<br />
The red line above is an "optimal elimination" baseline which consists of only 35 actions (15 essential and 20 relevant take actions). We can see that AE-DQN still outperforms DQN and also achieves performance comparable to the "optimal elimination" baseline. <br />
<br />
<br />
===Open Zork===<br />
Lastly, the "Open Zork" domain is tested, in which only the environment reward is used. The agents are trained for 1M steps, and each trajectory terminates after T=200 steps. Two action sets are used: <math display="inline">A_3</math>, the "Minimal Zork" action set, which is the minimal set of actions (131) required to solve the game, and <math display="inline">A_4</math>, the "Open Zork" action set (1227), which is composed of {Verb, Object} tuples for all the verbs and objects in the game.<br />
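<br />
An action set made of {Verb, Object} tuples can be built as a Cartesian product, as in the sketch below. The verb and object lists are small hypothetical examples, not the game's actual vocabulary.<br />

```python
from itertools import product

# Hypothetical vocabulary, for illustration only.
verbs = ["take", "open", "climb"]
objects = ["egg", "mailbox", "tree", "lantern"]

# Every verb is paired with every object, so most combinations are invalid.
open_actions = [f"{verb} {obj}" for verb, obj in product(verbs, objects)]
```

Even this tiny vocabulary yields 12 actions, most of which (such as "climb egg") are the kind of irrelevant action that elimination is meant to remove.<br />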
<br />
[[File:AEF_open_zork_comparison.png]]<br />
<br />
The figure above shows the learning curves for both AE-DQN and DQN. We can see that AE-DQN (blue) still outperforms DQN in terms of speed and cumulative reward.<br />
<br />
=Conclusion=<br />
In this paper, the authors proposed a deep reinforcement learning model that eliminates sub-optimal actions while performing Q-learning. Moreover, they showed that the model learned faster and reduced the action space when tested on Zork, a text-based game.<br />
<br />
=Critique=<br />
<br />
=Reference=</div>Mpaflahttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=learn_what_not_to_learn&diff=41165learn what not to learn2018-11-23T17:03:12Z<p>Mpafla: /* Action Elimination */</p>
<hr />
<div>=Introduction=<br />
In reinforcement learning, it is often difficult for agent to learn when the action space is large. For a specific case that many actions are irrelevant, it is sometimes easier for the algorithm to learn which action not to take. The paper propose a new reinforcement learning approach for dealing with large action spaces by restricting the available actions in each state to a subset of the most likely ones. More specifically, it propose a system that learns the approximation of Q-function and concurrently learns to eliminate actions. The method need to utilize an external elimination signal which incorporates domain-specific prior knowledge. For example, in parser-based text games, the parser gives feedback regarding irrelevant actions after the action is played. (e.g., Player: "Climb the tree." Parser: "There are no trees to climb") Then a machine learning model can be trained to generalize to unseen states. <br />
<br />
The paper focus mainly on tasks where both states and the actions are natural language. It introduces a novel deep reinforcement learning approach which has a DQN network and an Action Elimination Network(AEN), both using the CNN for NLP tasks. The AEN is trained to predict invalid actions, supervised by the elimination signal from the environment. '''Note that the core assumption is that it is easy to predict which actions are invalid or inferior in each state and leverage that information for control.'''<br />
<br />
The text-based game called "Zork", which let player to interact with a virtual world through a text based interface, is tested by using the elimination framework. The AE algorithm has achieved faster learning rate than the baseline agents through eliminating irrelevant actions.<br />
<br />
Below shows an example for the Zork interface:<br />
<br />
[[File:AEF_zork_interface.png]]<br />
<br />
All state and action are given in natural language. Input for the game contains more than a thousand possible actions in each state since player can type anything.<br />
<br />
=Related Work=<br />
Text-Based Games(TBG): The state of the environment in TBG is described by simple language. The player interacts with the environment with text command which respects a pre-defined grammar. A popular example is Zork which has been tested in the paper. TBG is a good research intersection of RL and NLP, it requires language understanding, long-term memory, planning, exploration, affordance extraction and common sense. It also often introduce stochastic dynamics to increase randomness.<br />
<br />
Representations for TBG: Good word representation is necessary in order to learn control policies from texts. Previous work on TBG used pre-trained embeddings directly for control. other works combined pre-trained embeddings with neural networks.<br />
<br />
DRL with linear function approximation: DRL methods such as the DQN have achieved state-of-the-art results in a variety of challenging, high-dimensional domains. This is mainly because neural networks can learn rich domain representations for value function and policy. On the other hand, linear representation batch reinforcement learning methods are more stable and accurate, while feature engineering is necessary.<br />
<br />
RL in Large Action Spaces: Prior work concentrated on factorizing the action space into binary subspace(Pazis and Parr, 2011; Dulac-Arnold et al., 2012; Lagoudakis and Parr, 2003), other works proposed to embed the discrete actions into a continuous space, then choose the nearest discrete action according to the optimal actions in the continuous space(Dulac-Arnold et al., 2015; Van Hasselt and Wiering, 2009). He et. al. (2015)extended DQN to unbounded(natural language) action spaces.<br />
Learning to eliminate actions was first mentioned by (Even-Dar, Mannor, and Mansour, 2003). They proposed to learn confidence intervals around the value function in each state. Lipton et al.(2016a) proposed to learn a classifier that detects hazardous state and then use it to shape the reward. Fulda et al.(2017) presented a method for affordance extraction via inner products of pre-trained word embeddings.<br />
<br />
=Action Elimination=<br />
<br />
After executing an action, the agent observes a binary elimination signal e(s, a) to determine which actions not to take. It equals 1<br />
if action a may be eliminated in state s (and 0 otherwise). The signal helps mitigating the problem of large discrete action spaces. We start with the following<br />
definitions:<br />
<br />
<br />
'''Definition 1:''' <br />
<br />
Valid state-action pairs with respect to an elimination signal are state action pairs which the elimination process should not eliminate.<br />
<br />
'''Definition 2:'''<br />
<br />
Admissible state-action pairs with respect to an elimination algorithm are state action pairs which the elimination algorithm does not eliminate.<br />
<br />
'''Definition 3:'''<br />
<br />
Action Elimination Q-learning is a Q-learning algorithm which updates only admissible state-action pairs and chooses the best action in the next state from its admissible actions. We allow the base Q-learning algorithm to be any algorithm that converges to <math display="inline">Q^*</math> with probability 1 after observing each state-action infinitely often.<br />
<br />
The approach in the paper builds on the standard RL formulation. At each time step t, the agent observes state <math display="inline">s_t </math> and chooses a discrete action <math display="inline">a_t\in\{1,...,|A|\} </math>. Then the agent obtains a reward <math display="inline">r_t(s_t,a_t) </math> and next state <math display="inline">s_{t+1} </math>. The goal of the algorithm is to learn a policy <math display="inline">\pi(a|s) </math> which maximizes the expected future discount return <math display="inline">V^\pi(s)=E^\pi[\sum_{t=0}^{\infty}\gamma^tr(s_t,a_t)|s_0=s]. </math>After executing an action, the agent observes a binary elimination signal e(s,a), which equals to 1 if action a can be eliminated for state s, 0 otherwise. <br />
<br />
==Advantages of Action Elimination==<br />
The main advantages of action elimination is that it allows the agent to overcome some of the main difficulties in large action spaces which are Function Approximation and Sample Complexity. <br />
<br />
Function approximation: Errors in the Q-function estimates may cause the learning algorithm to converge to a suboptimal policy, this phenomenon becomes more noticeable when the action space is large. Action elimination mitigate this effect by taking the max operator only on valid actions, thus, reducing potential overestimation. Besides, by ignoring the invalid actions, the function approximation can also learn a simpler mapping leading to faster convergence.<br />
<br />
Sample complexity: The sample complexity measures the number of steps during learning, in which the policy is not <math display="inline">\epsilon</math>-optimal. The invalid action often returns no reward and doesn't change the state, (Lattimore and Hutter, 2012)resulting in an action gap of <math display="inline">\epsilon=(1-\gamma)V^*(s)</math>, and this translates to <math display="inline">V^*(s)^{-2}(1-\gamma)^{-5}log(1/\delta)</math> wasted samples for learning each invalid state-action pair. Practically, elimination algorithm can eliminate these invalid actions and therefore speed up the learning process approximately by <math display="inline">A/A'</math>.<br />
<br />
==Action elimination with contextual bandits==<br />
<br />
Let <math display="inline">x(s_t)\in R^d </math> be the feature representation of <math display="inline">s_t </math>. We assume that under this representation there exists a set of parameters <math display="inline">\theta_a^*\in R_d </math> such that the elimination signal in state <math display="inline">s_t </math> is <math display="inline">e_t(s_t,a) = \theta_a^Tx(s_t)+\eta_t </math>, where <math display="inline"> \Vert\theta_a^*\Vert_2\leq S</math>. <math display="inline">\eta_t</math> is an R-subgaussian random variable with zero mean that models additive noise to the elimination signal. When there is no noise in the elimination signal, R=0. Otherwise, <math display="inline">R\leq 1</math> since the elimination signal is bounded in [0,1]. Assume the elimination signal satisfies: <math display="inline">0\leq E[e_t(s_t,a)]\leq l </math> for any valid action and <math display="inline"> u\leq E[e_t(s_t, a)]\leq 1</math> for any invalid action. And <math display="inline"> l\leq u</math>. Denote by <math display="inline">X_{t,a}</math> as the matrix whose rows are the observed state representation vectors in which action a was chosen, up to time t. <math display="inline">E_{t,a}</math> as the vector whose elements are the observed state representation elimination signals in which action a was chosen, up to time t. Denote the solution to the regularized linear regression <math display="inline">\Vert X_{t,a}\theta_{t,a}-E_{t,a}\Vert_2^2+\lambda\Vert \theta_{t,a}\Vert_2^2 </math> (for some <math display="inline">\lambda>0</math>) by <math display="inline">\hat{\theta}_{t,a}=\bar{V}_{t,a}^{-1}X_{t,a}^TE_{t,a} </math>, where <math display="inline">\bar{V}_{t,a}=\lambda I + X_{t,a}^TX_{t,a}</math>.<br />
<br />
<br />
According to Theorem 2 in (Abbasi-Yadkori, Pal, and Szepesvari, 2011), <math display="inline">|\hat{\theta}_{t,a}^{T}x(s_t)-\theta_a^{*T}x(s_t)|\leq\sqrt{\beta_t(\delta)x(s_t)^T\bar{V}_{t,a}^{-1}x(s_t)} \forall t>0</math>, where <math display="inline">\sqrt{\beta_t(\delta)}=R\sqrt{2log(det(\bar{V}_{t,a}^{1/2})det(\lambda I)^{-1/2}/\delta)}+\lambda^{1/2}S</math>, with probability of at least <math display="inline">1-\delta</math>. If <math display="inline">\forall s \Vert x(s)\Vert_2 \leq L</math>, then <math display="inline">\beta_t</math> can be bounded by <math display="inline">\sqrt{\beta_t(\delta)} \leq R \sqrt{dlog(1+tL^2/\lambda/\delta)}+\lambda^{1/2}S</math>. Next, define <math display="inline">\tilde{\delta}=\delta/k</math> and bound this probability for all the actions. i.e., <math display="inline">\forall a,t>0</math><br />
<br />
<math display="inline">Pr(|\hat{\theta}_{t,a}^{T}x(s_t)-\theta_a^{*T}x(s_t)|\leq\sqrt{\beta_t(\delta)x(s_t)^T\bar{V}_{t,a}^{-1}x(s_t)}) \leq 1-\delta</math><br />
<br />
Recall that <math display="inline">E[e_t(s,a)]=\theta_a^{*T}x(s_t)\leq l</math> if a is a valid action. Then we can eliminate action a at state <math display="inline">s_t</math> if it satisfies:<br />
<br />
<math display="inline">\hat{\theta}_{t,a}^{T}x(s_t)-\sqrt{\beta_t(\delta)x(s_t)^T\bar{V}_{t,a}^{-1}x(s_t)})>l</math><br />
<br />
with probability <math display="inline">1-\delta</math> that we never eliminate any valid action. Note that <math display="inline">l, u</math> are not known. In practice, choosing <math display="inline">l</math> to be 0.5 should suffice.<br />
<br />
==Concurrent Learning==<br />
In fact, Q-learning and contextual bandit algorithms can learn simultaneously, resulting in the convergence of both algorithms, i.e., finding an optimal policy and a minimal valid action space. <br />
<br />
If the elimination is done based on the concentration bounds of the linear contextual bandits, it can be ensured that Action Elimination Q-learning converges, as shown in Proposition 1.<br />
<br />
'''Proposition 1:'''<br />
<br />
Assume that all state action pairs (s,a) are visited infinitely often, unless eliminated according to <math display="inline">\hat{\theta}_{t-1,a}^Tx(s)-\sqrt{\beta_{t-1}(\tilde{\delta})x(s)^T\bar{V}_{t-1,a}^{-1}x(s))}>l</math>. Then, with a probability of at least <math display="inline">1-\delta</math>, action elimination Q-learning converges to the optimal Q-function for any valid state-action pairs. In addition, actions which should be eliminated are visited at most <math display="inline">T_{s,a}(t)\leq 4\beta_t/(u-l)^2<br />
+1</math> times.<br />
<br />
Notice that when there is no noise in the elimination signal(R=0), we correctly eliminate actions with probability 1. so invalid actions will be sampled a finite number of times.<br />
<br />
=Method=<br />
The assumption that <math display="inline">e_t(s_t,a)=\theta_a^{*T}x(s_t)+\eta_t </math> might not hold when using raw features like word2vec, so the paper proposes to use the activations of the neural network's last layer as features. A practical challenge is that the features must stay fixed over time while they are used by the contextual bandit. The paper therefore uses a batch-update framework (Levine et al., 2017; Riquelme, Tucker, and Snoek, 2018), in which a new contextual bandit model is learned every few steps from the last-layer activations of the AEN.<br />
<br />
==Architecture of action elimination framework==<br />
<br />
[[File:AEF_architecture.png]]<br />
<br />
After taking action <math display="inline">a_t</math>, the agent observes <math display="inline">(r_t,s_{t+1},e_t)</math> and uses these observations to train two deep neural network function approximators: a DQN and an AEN. The AEN provides an admissible action set <math display="inline">A'</math> to the DQN. Both the AEN and the DQN use an NLP CNN architecture based on (Kim, 2014), with 100 convolutional filters for the AEN and 500 for the DQN, and three different 1D kernels of length (1,2,3). The state is represented as a sequence of words, composed of the game descriptor and the player's inventory. These are truncated or zero-padded to a length of 50 descriptor words + 15 inventory words, and each word is embedded into a continuous vector in <math display="inline">R^{300}</math> using word2vec. The features of the last four states are then concatenated, so that the final state representation s is in <math display="inline">R^{78000}</math>. The AEN is trained to minimize the MSE loss, using the elimination signal as a label.<br />
<br />
==Pseudocode of the Algorithm==<br />
<br />
[[File:AEF_pseudocode.png]]<br />
<br />
AE-DQN trains two networks: a DQN denoted by Q and an AEN denoted by E. Every L iterations, the procedure AENUpdate() builds a linear contextual bandit model from E: it uses the activations of the last hidden layer of E as features, solves the resulting linear contextual bandit model, and plugs the solution into the target AEN. The contextual linear bandit model <math display="inline">(E^-,V)</math> is then used to eliminate actions via the ACT() and Targets() functions. ACT() follows an <math display="inline">\epsilon</math>-greedy mechanism on the admissible action set. For exploitation, it selects the action with the highest Q-value by taking an argmax over Q-values among <math display="inline">A'</math>. For exploration, it selects an action uniformly from <math display="inline">A'</math>. The Targets() procedure estimates the value function by taking the max over Q-values only among admissible actions, hence reducing function approximation errors.<br />
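<br />
The action selection and target computation described above can be sketched as follows. This is a simplified illustration under stated assumptions, not the paper's code: the names <code>act</code> and <code>td_target</code> are hypothetical, the admissible set <math display="inline">A'</math> is given as a boolean mask, and at least one action is assumed to be admissible.<br />

```python
import numpy as np

def act(q_values, admissible, epsilon, rng):
    """Epsilon-greedy selection restricted to the admissible actions."""
    candidates = np.flatnonzero(admissible)   # indices of actions in A'
    if rng.random() < epsilon:
        return int(rng.choice(candidates))    # explore uniformly over A'
    return int(candidates[np.argmax(q_values[candidates])])  # exploit within A'

def td_target(reward, next_q_values, next_admissible, gamma, done):
    """One-step target that takes the max only over admissible next actions."""
    if done:
        return float(reward)
    return float(reward + gamma * np.max(next_q_values[next_admissible]))

rng = np.random.default_rng(0)
q = np.array([1.0, 5.0, 3.0])
mask = np.array([True, False, True])   # action 1 has been eliminated

greedy_action = act(q, mask, epsilon=0.0, rng=rng)
target = td_target(-1.0, q, mask, gamma=0.8, done=False)
print(greedy_action, target)
```

Action 1 has the largest Q-value but is never chosen or used in the target, because it is outside the admissible set.<br />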
<br />
<br />
=Experiment=<br />
==Zork domain==<br />
The world of Zork presents a rich environment with a large state and action space. <br />
Zork players describe their actions using natural language instructions, for example, "open the mailbox". These actions are processed by a sophisticated natural language parser, and the game then presents the outcome of the action. The goal of Zork is to collect the Twenty Treasures of Zork and install them in the trophy case. Points generated by the game's scoring system are given to the agent as the reward; for example, the player earns points for solving puzzles. Placing all treasures in the trophy case yields 350 points. The elimination signal is given in two forms: a "wrong parse" flag and text feedback such as "you cannot take that". These two signals are grouped together into a single binary signal, which is then provided to the algorithm. <br />
<br />
Experiments begin with two subdomains of Zork: the Egg Quest and the Troll Quest. For these subdomains, an additional reward signal is provided to guide the agent towards solving specific tasks and make the results more visible. A reward of -1 is applied at every time step to encourage the agent to favor short paths. Each trajectory terminates upon completing the quest or after T steps are taken. The discount factor is <math display="inline">\gamma=0.8</math> during training and <math display="inline">\gamma=1</math> during evaluation. Also, <math display="inline">\beta=0.5, l=0.6</math> in all experiments. <br />
<br />
===Egg Quest===<br />
The goal of this quest is to find and open the jewel-encrusted egg hidden in a tree in the forest. The agent gets 100 points upon completing this task. The action space consists of 9 fixed navigation actions and a second subset of <math display="inline">N_{Take}</math> actions for taking possible objects in the game. Two settings were tested separately: <math display="inline">N_{Take}=200</math> (set <math display="inline">A_1</math>) and <math display="inline">N_{Take}=300</math> (set <math display="inline">A_2</math>).<br />
AE-DQN (blue) and a vanilla DQN agent (green) were tested on this quest.<br />
<br />
[[File:AEF_zork_comparison.png]]<br />
<br />
In the figure above, a) corresponds to the set <math display="inline">A_1</math> with T=100, and b) corresponds to the set <math display="inline">A_2</math> with T=200. Both agents perform well on these two sets. However, the AE-DQN agent learns much faster than DQN, which suggests that action elimination is more robust when the action space is large.<br />
<br />
<br />
===Troll Quest===<br />
The goal of this quest is to find the troll. To do so, the agent needs to find its way to the house and use a lantern to expose the hidden entrance to the underworld. It gets 100 points upon achieving the goal. This quest is a larger problem than the Egg Quest. The action set <math display="inline">A_1</math> consists of 200 take actions and 15 necessary actions, 215 in total.<br />
<br />
[[File:AEF_troll_comparison.png]]<br />
<br />
The red line above is an "optimal elimination" baseline, which consists of only 35 actions (15 essential actions and 20 relevant take actions). AE-DQN still outperforms DQN and achieves performance comparable to the "optimal elimination" baseline. <br />
<br />
<br />
===Open Zork===<br />
Lastly, the "Open Zork" domain was tested, in which only the environment reward is used. The agents were trained for 1M steps, and each trajectory terminates after T=200 steps. Two action sets were used: <math display="inline">A_3</math>, the "Minimal Zork" action set, which is the minimal set of actions (131) required to solve the game, and <math display="inline">A_4</math>, the "Open Zork" action set (1227), which is composed of {Verb, Object} tuples for all the verbs and objects in the game.<br />
<br />
[[File:AEF_open_zork_comparison.png]]<br />
<br />
The figure above shows the learning curves for both AE-DQN and DQN. AE-DQN (blue) still outperforms DQN (green) in terms of speed and cumulative reward.<br />
<br />
=Conclusion=<br />
In this paper, the authors proposed a deep reinforcement learning model that learns to eliminate sub-optimal actions while performing Q-learning. When tested on Zork, a text-based game, the model reduced the effective action space and learned faster than the DQN baseline.<br />
<br />
=Critique=<br />
<br />
=Reference=</div>Mpaflahttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=learn_what_not_to_learn&diff=41164learn what not to learn2018-11-23T17:03:02Z<p>Mpafla: /* Related Work */</p>
<hr />
<div>=Introduction=<br />
In reinforcement learning, it is often difficult for an agent to learn when the action space is large. In the specific case where many actions are irrelevant, it is sometimes easier for the algorithm to learn which actions not to take. The paper proposes a new reinforcement learning approach for dealing with large action spaces by restricting the available actions in each state to a subset of the most likely ones. More specifically, it proposes a system that learns an approximation of the Q-function and concurrently learns to eliminate actions. The method relies on an external elimination signal which incorporates domain-specific prior knowledge. For example, in parser-based text games, the parser gives feedback regarding irrelevant actions after the action is played (e.g., Player: "Climb the tree." Parser: "There are no trees to climb"). A machine learning model can then be trained to generalize to unseen states. <br />
<br />
The paper focuses mainly on tasks where both the states and the actions are natural language. It introduces a novel deep reinforcement learning approach which combines a DQN and an Action Elimination Network (AEN), both using a CNN designed for NLP tasks. The AEN is trained to predict invalid actions, supervised by the elimination signal from the environment. '''Note that the core assumption is that it is easy to predict which actions are invalid or inferior in each state and leverage that information for control.'''<br />
<br />
The elimination framework is tested on the text-based game "Zork", which lets the player interact with a virtual world through a text-based interface. The AE algorithm learns faster than the baseline agents by eliminating irrelevant actions.<br />
<br />
An example of the Zork interface is shown below:<br />
<br />
[[File:AEF_zork_interface.png]]<br />
<br />
All states and actions are given in natural language. The game accepts more than a thousand possible actions in each state, since the player can type anything.<br />
<br />
=Related Work=<br />
Text-Based Games (TBG): The state of the environment in a TBG is described in simple language. The player interacts with the environment through text commands that respect a pre-defined grammar. A popular example is Zork, which is tested in the paper. TBGs are a good research intersection of RL and NLP: they require language understanding, long-term memory, planning, exploration, affordance extraction and common sense. They also often introduce stochastic dynamics to increase randomness.<br />
<br />
Representations for TBG: Good word representations are necessary in order to learn control policies from text. Previous work on TBG used pre-trained embeddings directly for control. Other works combined pre-trained embeddings with neural networks.<br />
<br />
DRL with linear function approximation: DRL methods such as the DQN have achieved state-of-the-art results in a variety of challenging, high-dimensional domains. This is mainly because neural networks can learn rich domain representations for the value function and policy. On the other hand, batch reinforcement learning methods with linear representations are more stable and accurate, but they require feature engineering.<br />
<br />
RL in Large Action Spaces: Prior work concentrated on factorizing the action space into binary subspaces (Pazis and Parr, 2011; Dulac-Arnold et al., 2012; Lagoudakis and Parr, 2003). Other works proposed to embed the discrete actions into a continuous space and then choose the nearest discrete action according to the optimal action in the continuous space (Dulac-Arnold et al., 2015; Van Hasselt and Wiering, 2009). He et al. (2015) extended DQN to unbounded (natural language) action spaces.<br />
Learning to eliminate actions was first mentioned by Even-Dar, Mannor, and Mansour (2003), who proposed to learn confidence intervals around the value function in each state. Lipton et al. (2016a) proposed to learn a classifier that detects hazardous states and then use it to shape the reward. Fulda et al. (2017) presented a method for affordance extraction via inner products of pre-trained word embeddings.<br />
<br />
=Action Elimination=<br />
<br />
'''Definition 1:''' <br />
<br />
Valid state-action pairs with respect to an elimination signal are state-action pairs which the elimination process should not eliminate.<br />
<br />
'''Definition 2:'''<br />
<br />
Admissible state-action pairs with respect to an elimination algorithm are state-action pairs which the elimination algorithm does not eliminate.<br />
<br />
'''Definition 3:'''<br />
<br />
Action Elimination Q-learning is a Q-learning algorithm which updates only admissible state-action pairs and chooses the best action in the next state from its admissible actions. We allow the base Q-learning algorithm to be any algorithm that converges to <math display="inline">Q^*</math> with probability 1 after observing each state-action infinitely often.<br />
<br />
The approach in the paper builds on the standard RL formulation. At each time step t, the agent observes state <math display="inline">s_t </math> and chooses a discrete action <math display="inline">a_t\in\{1,...,|A|\} </math>. Then the agent obtains a reward <math display="inline">r_t(s_t,a_t) </math> and next state <math display="inline">s_{t+1} </math>. The goal of the algorithm is to learn a policy <math display="inline">\pi(a|s) </math> which maximizes the expected future discounted return <math display="inline">V^\pi(s)=E^\pi[\sum_{t=0}^{\infty}\gamma^tr(s_t,a_t)|s_0=s]. </math> After executing an action, the agent observes a binary elimination signal e(s,a), which equals 1 if action a can be eliminated in state s and 0 otherwise. <br />
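<br />
Definition 3 can be illustrated with a tabular update. The sketch below is an assumption-laden toy, not the paper's algorithm: the function name <code>ae_q_update</code>, the learning rate <code>alpha</code>, and the boolean <code>admissible</code> mask are all introduced here for illustration.<br />

```python
import numpy as np

def ae_q_update(Q, s, a, r, s_next, admissible, alpha=0.1, gamma=0.8):
    """One tabular Q-learning step that respects an admissible-action mask.

    Q          : (n_states, n_actions) array of Q-values, updated in place
    admissible : boolean array of the same shape (True = not eliminated)
    """
    if not admissible[s, a]:
        return Q  # eliminated pairs are never updated
    # Best next action is chosen only among admissible actions in s_next.
    best_next = np.max(Q[s_next, admissible[s_next]])
    Q[s, a] += alpha * (r + gamma * best_next - Q[s, a])
    return Q

Q = np.zeros((2, 3))
Q[1] = [1.0, 10.0, 2.0]
admissible = np.ones((2, 3), dtype=bool)
admissible[1, 1] = False          # the high-valued action in state 1 is eliminated

ae_q_update(Q, s=0, a=0, r=0.0, s_next=1, admissible=admissible)
print(Q[0, 0])
```

Because the eliminated action is excluded from the max, its (possibly overestimated) value of 10 does not propagate into the update for state 0.<br />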
<br />
==Advantages of Action Elimination==<br />
The main advantage of action elimination is that it allows the agent to overcome two of the main difficulties in large action spaces: function approximation and sample complexity. <br />
<br />
Function approximation: Errors in the Q-function estimates may cause the learning algorithm to converge to a suboptimal policy, and this phenomenon becomes more noticeable when the action space is large. Action elimination mitigates this effect by taking the max operator only over valid actions, thus reducing potential overestimation. Besides, by ignoring the invalid actions, the function approximator can learn a simpler mapping, leading to faster convergence.<br />
<br />
Sample complexity: The sample complexity measures the number of steps during learning in which the policy is not <math display="inline">\epsilon</math>-optimal. An invalid action often returns no reward and doesn't change the state (Lattimore and Hutter, 2012), resulting in an action gap of <math display="inline">\epsilon=(1-\gamma)V^*(s)</math>, and this translates to <math display="inline">V^*(s)^{-2}(1-\gamma)^{-5}\log(1/\delta)</math> wasted samples for learning each invalid state-action pair. In practice, an elimination algorithm can eliminate these invalid actions and therefore speed up the learning process by a factor of approximately <math display="inline">A/A'</math>.<br />
<br />
==Action elimination with contextual bandits==<br />
<br />
Let <math display="inline">x(s_t)\in R^d </math> be the feature representation of <math display="inline">s_t </math>. We assume that under this representation there exists a set of parameters <math display="inline">\theta_a^*\in R^d </math> such that the elimination signal in state <math display="inline">s_t </math> is <math display="inline">e_t(s_t,a) = \theta_a^{*T}x(s_t)+\eta_t </math>, where <math display="inline"> \Vert\theta_a^*\Vert_2\leq S</math> and <math display="inline">\eta_t</math> is an R-subgaussian random variable with zero mean that models additive noise to the elimination signal. When there is no noise in the elimination signal, R=0. Otherwise, <math display="inline">R\leq 1</math> since the elimination signal is bounded in [0,1]. Assume the elimination signal satisfies <math display="inline">0\leq E[e_t(s_t,a)]\leq l </math> for any valid action and <math display="inline"> u\leq E[e_t(s_t, a)]\leq 1</math> for any invalid action, with <math display="inline"> l\leq u</math>. Denote by <math display="inline">X_{t,a}</math> the matrix whose rows are the representation vectors of the states in which action a was chosen, up to time t, and by <math display="inline">E_{t,a}</math> the vector whose elements are the elimination signals observed in those states. Denote the solution to the regularized linear regression <math display="inline">\Vert X_{t,a}\theta_{t,a}-E_{t,a}\Vert_2^2+\lambda\Vert \theta_{t,a}\Vert_2^2 </math> (for some <math display="inline">\lambda>0</math>) by <math display="inline">\hat{\theta}_{t,a}=\bar{V}_{t,a}^{-1}X_{t,a}^TE_{t,a} </math>, where <math display="inline">\bar{V}_{t,a}=\lambda I + X_{t,a}^TX_{t,a}</math>.<br />
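<br />
The closed-form ridge solution for a single action can be computed directly. The following is a minimal NumPy sketch with toy data; the function name <code>ridge_estimate</code> is hypothetical and the data are made up for illustration.<br />

```python
import numpy as np

def ridge_estimate(X, E, lam=1.0):
    """Regularized least-squares estimate for one action.

    X   : (n, d) matrix of state features observed when the action was taken
    E   : (n,) vector of the corresponding elimination signals
    lam : regularization strength lambda > 0
    """
    d = X.shape[1]
    V_bar = lam * np.eye(d) + X.T @ X            # V_bar = lambda*I + X^T X
    theta_hat = np.linalg.solve(V_bar, X.T @ E)  # theta_hat = V_bar^{-1} X^T E
    return theta_hat, V_bar

# Toy data (assumed): two observations with orthogonal features.
X = np.eye(2)
E = np.array([1.0, 0.0])
theta_hat, V_bar = ridge_estimate(X, E, lam=1.0)
print(theta_hat)
print(V_bar)
```

With <math display="inline">\lambda=1</math> the estimate is shrunk towards zero relative to the unregularized fit, which is the usual effect of the ridge penalty.<br />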
<br />
<br />
According to Theorem 2 in (Abbasi-Yadkori, Pal, and Szepesvari, 2011), with probability at least <math display="inline">1-\delta</math>, <math display="inline">|\hat{\theta}_{t,a}^{T}x(s_t)-\theta_a^{*T}x(s_t)|\leq\sqrt{\beta_t(\delta)x(s_t)^T\bar{V}_{t,a}^{-1}x(s_t)} \ \forall t>0</math>, where <math display="inline">\sqrt{\beta_t(\delta)}=R\sqrt{2\log(\det(\bar{V}_{t,a})^{1/2}\det(\lambda I)^{-1/2}/\delta)}+\lambda^{1/2}S</math>. If <math display="inline">\forall s \ \Vert x(s)\Vert_2 \leq L</math>, then <math display="inline">\beta_t</math> can be bounded by <math display="inline">\sqrt{\beta_t(\delta)} \leq R \sqrt{d\log((1+tL^2/\lambda)/\delta)}+\lambda^{1/2}S</math>. Next, define <math display="inline">\tilde{\delta}=\delta/k</math>, where <math display="inline">k</math> is the number of actions, and apply a union bound so that the inequality holds for all actions simultaneously, i.e., <math display="inline">\forall a,t>0</math><br />
<br />
<math display="inline">Pr(|\hat{\theta}_{t,a}^{T}x(s_t)-\theta_a^{*T}x(s_t)|\leq\sqrt{\beta_t(\tilde{\delta})x(s_t)^T\bar{V}_{t,a}^{-1}x(s_t)}) \geq 1-\delta</math><br />
<br />
Recall that <math display="inline">E[e_t(s,a)]=\theta_a^{*T}x(s_t)\leq l</math> if a is a valid action. Then we can eliminate action a at state <math display="inline">s_t</math> if it satisfies:<br />
<br />
<math display="inline">\hat{\theta}_{t,a}^{T}x(s_t)-\sqrt{\beta_t(\tilde{\delta})x(s_t)^T\bar{V}_{t,a}^{-1}x(s_t)}>l</math><br />
<br />
With probability at least <math display="inline">1-\delta</math>, this rule never eliminates a valid action. Note that <math display="inline">l</math> and <math display="inline">u</math> are not known in advance; in practice, choosing <math display="inline">l</math> to be 0.5 should suffice.<br />
<br />
==Concurrent Learning==<br />
In fact, Q-learning and contextual bandit algorithms can learn simultaneously, resulting in the convergence of both algorithms, i.e., finding an optimal policy and a minimal valid action space. <br />
<br />
If the elimination is done based on the concentration bounds of the linear contextual bandits, it can be ensured that Action Elimination Q-learning converges, as shown in Proposition 1.<br />
<br />
'''Proposition 1:'''<br />
<br />
Assume that all state-action pairs (s,a) are visited infinitely often, unless eliminated according to <math display="inline">\hat{\theta}_{t-1,a}^Tx(s)-\sqrt{\beta_{t-1}(\tilde{\delta})x(s)^T\bar{V}_{t-1,a}^{-1}x(s)}>l</math>. Then, with a probability of at least <math display="inline">1-\delta</math>, action elimination Q-learning converges to the optimal Q-function for any valid state-action pairs. In addition, actions which should be eliminated are visited at most <math display="inline">T_{s,a}(t)\leq 4\beta_t/(u-l)^2+1</math> times.<br />
<br />
Notice that when there is no noise in the elimination signal (R=0), actions are eliminated correctly with probability 1, so invalid actions are sampled only a finite number of times.<br />
<br />
=Method=<br />
The assumption that <math display="inline">e_t(s_t,a)=\theta_a^{*T}x(s_t)+\eta_t </math> might not hold when using raw features like word2vec, so the paper proposes to use the activations of the neural network's last layer as features. A practical challenge is that the features must stay fixed over time while they are used by the contextual bandit. The paper therefore uses a batch-update framework (Levine et al., 2017; Riquelme, Tucker, and Snoek, 2018), in which a new contextual bandit model is learned every few steps from the last-layer activations of the AEN.<br />
<br />
==Architecture of action elimination framework==<br />
<br />
[[File:AEF_architecture.png]]<br />
<br />
After taking action <math display="inline">a_t</math>, the agent observes <math display="inline">(r_t,s_{t+1},e_t)</math> and uses these observations to train two deep neural network function approximators: a DQN and an AEN. The AEN provides an admissible action set <math display="inline">A'</math> to the DQN. Both the AEN and the DQN use an NLP CNN architecture based on (Kim, 2014), with 100 convolutional filters for the AEN and 500 for the DQN, and three different 1D kernels of length (1,2,3). The state is represented as a sequence of words, composed of the game descriptor and the player's inventory. These are truncated or zero-padded to a length of 50 descriptor words + 15 inventory words, and each word is embedded into a continuous vector in <math display="inline">R^{300}</math> using word2vec. The features of the last four states are then concatenated, so that the final state representation s is in <math display="inline">R^{78000}</math>. The AEN is trained to minimize the MSE loss, using the elimination signal as a label.<br />
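<br />
The dimension 78000 follows from (50 + 15) words × 300 embedding dimensions × 4 states. The sketch below reproduces the padding and concatenation with placeholder vectors; the helper names are hypothetical and real word2vec lookups are replaced by arrays of ones, so it only checks the shapes.<br />

```python
import numpy as np

EMBED_DIM, DESC_LEN, INV_LEN, HISTORY = 300, 50, 15, 4

def pad_or_truncate(vectors, length):
    """Truncate to `length` word vectors, or zero-pad up to it."""
    out = np.zeros((length, EMBED_DIM))
    n = min(len(vectors), length)
    if n:
        out[:n] = np.asarray(vectors[:n])
    return out

def frame(desc_vectors, inv_vectors):
    """Embedding matrix for one state: descriptor words, then inventory words."""
    return np.concatenate([pad_or_truncate(desc_vectors, DESC_LEN),
                           pad_or_truncate(inv_vectors, INV_LEN)])

def state(frames):
    """Concatenate the features of the last HISTORY frames into one vector."""
    return np.concatenate([f.ravel() for f in frames[-HISTORY:]])

# Placeholder embeddings (assumed): 60 descriptor words and 3 inventory words.
desc = np.ones((60, EMBED_DIM))
inv = np.ones((3, EMBED_DIM))
f = frame(desc, inv)
s = state([f, f, f, f])
print(f.shape, s.shape)
```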
<br />
==Pseudocode of the Algorithm==<br />
<br />
[[File:AEF_pseudocode.png]]<br />
<br />
AE-DQN trains two networks: a DQN denoted by Q and an AEN denoted by E. Every L iterations, the procedure AENUpdate() builds a linear contextual bandit model from E: it uses the activations of the last hidden layer of E as features, solves the resulting linear contextual bandit model, and plugs the solution into the target AEN. The contextual linear bandit model <math display="inline">(E^-,V)</math> is then used to eliminate actions via the ACT() and Targets() functions. ACT() follows an <math display="inline">\epsilon</math>-greedy mechanism on the admissible action set. For exploitation, it selects the action with the highest Q-value by taking an argmax over Q-values among <math display="inline">A'</math>. For exploration, it selects an action uniformly from <math display="inline">A'</math>. The Targets() procedure estimates the value function by taking the max over Q-values only among admissible actions, hence reducing function approximation errors.<br />
<br />
<br />
=Experiment=<br />
==Zork domain==<br />
The world of Zork presents a rich environment with a large state and action space. <br />
Zork players describe their actions using natural language instructions, for example, "open the mailbox". These actions are processed by a sophisticated natural language parser, and the game then presents the outcome of the action. The goal of Zork is to collect the Twenty Treasures of Zork and install them in the trophy case. Points generated by the game's scoring system are given to the agent as the reward; for example, the player earns points for solving puzzles. Placing all treasures in the trophy case yields 350 points. The elimination signal is given in two forms: a "wrong parse" flag and text feedback such as "you cannot take that". These two signals are grouped together into a single binary signal, which is then provided to the algorithm. <br />
<br />
Experiments begin with two subdomains of Zork: the Egg Quest and the Troll Quest. For these subdomains, an additional reward signal is provided to guide the agent towards solving specific tasks and make the results more visible. A reward of -1 is applied at every time step to encourage the agent to favor short paths. Each trajectory terminates upon completing the quest or after T steps are taken. The discount factor is <math display="inline">\gamma=0.8</math> during training and <math display="inline">\gamma=1</math> during evaluation. Also, <math display="inline">\beta=0.5, l=0.6</math> in all experiments. <br />
<br />
===Egg Quest===<br />
The goal of this quest is to find and open the jewel-encrusted egg hidden in a tree in the forest. The agent gets 100 points upon completing this task. The action space consists of 9 fixed navigation actions and a second subset of <math display="inline">N_{Take}</math> actions for taking possible objects in the game. Two settings were tested separately: <math display="inline">N_{Take}=200</math> (set <math display="inline">A_1</math>) and <math display="inline">N_{Take}=300</math> (set <math display="inline">A_2</math>).<br />
AE-DQN (blue) and a vanilla DQN agent (green) were tested on this quest.<br />
<br />
[[File:AEF_zork_comparison.png]]<br />
<br />
In the figure above, a) corresponds to the set <math display="inline">A_1</math> with T=100, and b) corresponds to the set <math display="inline">A_2</math> with T=200. Both agents perform well on these two sets. However, the AE-DQN agent learns much faster than DQN, which suggests that action elimination is more robust when the action space is large.<br />
<br />
<br />
===Troll Quest===<br />
The goal of this quest is to find the troll. To do so, the agent needs to find its way to the house and use a lantern to expose the hidden entrance to the underworld. It gets 100 points upon achieving the goal. This quest is a larger problem than the Egg Quest. The action set <math display="inline">A_1</math> consists of 200 take actions and 15 necessary actions, 215 in total.<br />
<br />
[[File:AEF_troll_comparison.png]]<br />
<br />
The red line above is an "optimal elimination" baseline, which consists of only 35 actions (15 essential actions and 20 relevant take actions). AE-DQN still outperforms DQN and achieves performance comparable to the "optimal elimination" baseline. <br />
<br />
<br />
===Open Zork===<br />
Lastly, the "Open Zork" domain was tested, in which only the environment reward is used. The agents were trained for 1M steps, and each trajectory terminates after T=200 steps. Two action sets were used: <math display="inline">A_3</math>, the "Minimal Zork" action set, which is the minimal set of actions (131) required to solve the game, and <math display="inline">A_4</math>, the "Open Zork" action set (1227), which is composed of {Verb, Object} tuples for all the verbs and objects in the game.<br />
<br />
[[File:AEF_open_zork_comparison.png]]<br />
<br />
The figure above shows the learning curves for both AE-DQN and DQN. AE-DQN (blue) still outperforms DQN (green) in terms of speed and cumulative reward.<br />
<br />
=Conclusion=<br />
In this paper, the authors proposed a deep reinforcement learning model that learns to eliminate sub-optimal actions while performing Q-learning. When tested on Zork, a text-based game, the model reduced the effective action space and learned faster than the DQN baseline.<br />
<br />
=Critique=<br />
<br />
=Reference=</div>Mpaflahttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=learn_what_not_to_learn&diff=41162learn what not to learn2018-11-23T17:01:08Z<p>Mpafla: /* Action Elimination */</p>
<hr />
<div>=Introduction=<br />
In reinforcement learning, it is often difficult for an agent to learn when the action space is large. In the specific case where many actions are irrelevant, it is sometimes easier for the algorithm to learn which actions not to take. The paper proposes a new reinforcement learning approach for dealing with large action spaces by restricting the available actions in each state to a subset of the most likely ones. More specifically, it proposes a system that learns an approximation of the Q-function and concurrently learns to eliminate actions. The method relies on an external elimination signal which incorporates domain-specific prior knowledge. For example, in parser-based text games, the parser gives feedback regarding irrelevant actions after the action is played (e.g., Player: "Climb the tree." Parser: "There are no trees to climb"). A machine learning model can then be trained to generalize to unseen states. <br />
<br />
The paper focuses mainly on tasks where both the states and the actions are natural language. It introduces a novel deep reinforcement learning approach which combines a DQN and an Action Elimination Network (AEN), both using a CNN designed for NLP tasks. The AEN is trained to predict invalid actions, supervised by the elimination signal from the environment. '''Note that the core assumption is that it is easy to predict which actions are invalid or inferior in each state and leverage that information for control.'''<br />
<br />
The elimination framework is tested on the text-based game "Zork", which lets the player interact with a virtual world through a text-based interface. The AE algorithm learns faster than the baseline agents by eliminating irrelevant actions.<br />
<br />
An example of the Zork interface is shown below:<br />
<br />
[[File:AEF_zork_interface.png]]<br />
<br />
All states and actions are given in natural language. The game accepts more than a thousand possible actions in each state, since the player can type anything.<br />
<br />
=Related Work=<br />
Text-Based Games (TBG): The state of the environment in a TBG is described in simple language. The player interacts with the environment through text commands that respect a pre-defined grammar. A popular example is Zork, which is tested in the paper. TBGs are a good research intersection of RL and NLP: they require language understanding, long-term memory, planning, exploration, affordance extraction and common sense. They also often introduce stochastic dynamics to increase randomness.<br />
<br />
Representations for TBG: Good word representations are necessary in order to learn control policies from text. Previous work on TBG used pre-trained embeddings directly for control. Other works combined pre-trained embeddings with neural networks.<br />
<br />
DRL with linear function approximation: DRL methods such as the DQN have achieved state-of-the-art results in a variety of challenging, high-dimensional domains. This is mainly because neural networks can learn rich domain representations for the value function and policy. On the other hand, batch reinforcement learning methods with linear representations are more stable and accurate, but they require feature engineering.<br />
<br />
RL in Large Action Spaces: Prior work concentrated on factorizing the action space into binary subspaces (Pazis and Parr, 2011; Dulac-Arnold et al., 2012; Lagoudakis and Parr, 2003). Other works proposed to embed the discrete actions into a continuous space and then choose the nearest discrete action according to the optimal action in the continuous space (Dulac-Arnold et al., 2015; Van Hasselt and Wiering, 2009). He et al. (2015) extended DQN to unbounded (natural language) action spaces.<br />
Learning to eliminate actions was first mentioned by Even-Dar, Mannor, and Mansour (2003), who proposed to learn confidence intervals around the value function in each state. Lipton et al. (2016a) proposed to learn a classifier that detects hazardous states and then use it to shape the reward. Fulda et al. (2017) presented a method for affordance extraction via inner products of pre-trained word embeddings.<br />
<br />
After executing an action, the agent observes a binary elimination signal e(s, a) that indicates which actions not to take: it equals 1 if action a may be eliminated in state s, and 0 otherwise. The signal helps mitigate the problem of large discrete action spaces. We start with the following definitions:<br />
<br />
<br />
=Action Elimination=<br />
<br />
'''Definition 1:''' <br />
<br />
Valid state-action pairs with respect to an elimination signal are state-action pairs which the elimination process should not eliminate.<br />
<br />
'''Definition 2:'''<br />
<br />
Admissible state-action pairs with respect to an elimination algorithm are state-action pairs which the elimination algorithm does not eliminate.<br />
<br />
'''Definition 3:'''<br />
<br />
Action Elimination Q-learning is a Q-learning algorithm which updates only admissible state-action pairs and chooses the best action in the next state from its admissible actions. We allow the base Q-learning algorithm to be any algorithm that converges to <math display="inline">Q^*</math> with probability 1 after observing each state-action infinitely often.<br />
<br />
The approach in the paper builds on the standard RL formulation. At each time step t, the agent observes state <math display="inline">s_t </math> and chooses a discrete action <math display="inline">a_t\in\{1,...,|A|\} </math>. Then the agent obtains a reward <math display="inline">r_t(s_t,a_t) </math> and next state <math display="inline">s_{t+1} </math>. The goal of the algorithm is to learn a policy <math display="inline">\pi(a|s) </math> which maximizes the expected future discounted return <math display="inline">V^\pi(s)=E^\pi[\sum_{t=0}^{\infty}\gamma^tr(s_t,a_t)|s_0=s]. </math> After executing an action, the agent observes a binary elimination signal e(s,a), which equals 1 if action a can be eliminated in state s and 0 otherwise. <br />
<br />
==Advantages of Action Elimination==<br />
The main advantage of action elimination is that it allows the agent to overcome two of the main difficulties in large action spaces: function approximation and sample complexity. <br />
<br />
Function approximation: Errors in the Q-function estimates may cause the learning algorithm to converge to a suboptimal policy, and this phenomenon becomes more noticeable when the action space is large. Action elimination mitigates this effect by taking the max operator only over valid actions, thus reducing potential overestimation. In addition, by ignoring the invalid actions, the function approximator can learn a simpler mapping, leading to faster convergence.<br />
<br />
Sample complexity: The sample complexity measures the number of steps during learning in which the policy is not <math display="inline">\epsilon</math>-optimal. An invalid action often returns no reward and does not change the state (Lattimore and Hutter, 2012), resulting in an action gap of <math display="inline">\epsilon=(1-\gamma)V^*(s)</math>, which translates to <math display="inline">V^*(s)^{-2}(1-\gamma)^{-5}\log(1/\delta)</math> wasted samples for learning each invalid state-action pair. In practice, an elimination algorithm can remove these invalid actions and therefore speed up learning by a factor of approximately <math display="inline">A/A'</math>.<br />
<br />
==Action elimination with contextual bandits==<br />
<br />
Let <math display="inline">x(s_t)\in R^d </math> be the feature representation of <math display="inline">s_t </math>. We assume that under this representation there exists a set of parameters <math display="inline">\theta_a^*\in R^d </math> such that the elimination signal in state <math display="inline">s_t </math> is <math display="inline">e_t(s_t,a) = \theta_a^{*T}x(s_t)+\eta_t </math>, where <math display="inline"> \Vert\theta_a^*\Vert_2\leq S</math>. Here <math display="inline">\eta_t</math> is an R-subgaussian random variable with zero mean that models additive noise in the elimination signal. When there is no noise in the elimination signal, R=0. Otherwise, <math display="inline">R\leq 1</math> since the elimination signal is bounded in [0,1]. Assume the elimination signal satisfies <math display="inline">0\leq E[e_t(s_t,a)]\leq l </math> for any valid action and <math display="inline"> u\leq E[e_t(s_t, a)]\leq 1</math> for any invalid action, with <math display="inline"> l\leq u</math>. Denote by <math display="inline">X_{t,a}</math> the matrix whose rows are the representation vectors of the states in which action a was chosen, up to time t, and by <math display="inline">E_{t,a}</math> the vector whose elements are the elimination signals observed in those states. Denote the solution to the regularized linear regression <math display="inline">\Vert X_{t,a}\theta_{t,a}-E_{t,a}\Vert_2^2+\lambda\Vert \theta_{t,a}\Vert_2^2 </math> (for some <math display="inline">\lambda>0</math>) by <math display="inline">\hat{\theta}_{t,a}=\bar{V}_{t,a}^{-1}X_{t,a}^TE_{t,a} </math>, where <math display="inline">\bar{V}_{t,a}=\lambda I + X_{t,a}^TX_{t,a}</math>.<br />
<br />
<br />
According to Theorem 2 in (Abbasi-Yadkori, Pal, and Szepesvari, 2011), with probability at least <math display="inline">1-\delta</math>, <math display="inline">|\hat{\theta}_{t,a}^{T}x(s_t)-\theta_a^{*T}x(s_t)|\leq\sqrt{\beta_t(\delta)x(s_t)^T\bar{V}_{t,a}^{-1}x(s_t)} \ \forall t>0</math>, where <math display="inline">\sqrt{\beta_t(\delta)}=R\sqrt{2\log(\det(\bar{V}_{t,a}^{1/2})\det(\lambda I)^{-1/2}/\delta)}+\lambda^{1/2}S</math>. If <math display="inline">\forall s \ \Vert x(s)\Vert_2 \leq L</math>, then <math display="inline">\beta_t</math> can be bounded by <math display="inline">\sqrt{\beta_t(\delta)} \leq R \sqrt{d\log\left(\frac{1+tL^2/\lambda}{\delta}\right)}+\lambda^{1/2}S</math>. Next, define <math display="inline">\tilde{\delta}=\delta/k</math>, with k the number of actions, and apply a union bound so that the guarantee holds for all actions simultaneously, i.e., <math display="inline">\forall a,t>0</math><br />
<br />
<math display="inline">Pr(|\hat{\theta}_{t,a}^{T}x(s_t)-\theta_a^{*T}x(s_t)|\leq\sqrt{\beta_t(\tilde{\delta})x(s_t)^T\bar{V}_{t,a}^{-1}x(s_t)}) \geq 1-\delta</math><br />
<br />
Recall that <math display="inline">E[e_t(s_t,a)]=\theta_a^{*T}x(s_t)\leq l</math> if a is a valid action. Then we can eliminate action a at state <math display="inline">s_t</math> if it satisfies:<br />
<br />
<math display="inline">\hat{\theta}_{t,a}^{T}x(s_t)-\sqrt{\beta_t(\tilde{\delta})x(s_t)^T\bar{V}_{t,a}^{-1}x(s_t)}>l</math><br />
<br />
Under this rule, with probability at least <math display="inline">1-\delta</math> we never eliminate a valid action. Note that <math display="inline">l, u</math> are not known. In practice, choosing <math display="inline">l</math> to be 0.5 should suffice.<br />
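<br />
The elimination rule can be sketched for a single action as follows. This is a minimal illustration under stated assumptions, not the authors' code: the class name <code>LinearEliminator</code> is hypothetical, <math display="inline">\beta</math> is passed in as a constant instead of being computed from the bound, and the inverse of <math display="inline">\bar{V}_{t,a}</math> is maintained incrementally with the Sherman-Morrison formula, which is an implementation convenience and not something the summary specifies.<br />

```python
import math


class LinearEliminator:
    """Ridge-regression estimate of the elimination signal for one action."""

    def __init__(self, d, lam=1.0):
        self.d = d
        # V_inv holds (lam * I + X^T X)^{-1}; b holds X^T E.
        self.V_inv = [[(1.0 / lam if i == j else 0.0) for j in range(d)]
                      for i in range(d)]
        self.b = [0.0] * d

    def _mat_vec(self, x):
        return [sum(self.V_inv[i][j] * x[j] for j in range(self.d))
                for i in range(self.d)]

    def update(self, x, e):
        """Add one observation (features x, elimination signal e)."""
        Vx = self._mat_vec(x)
        denom = 1.0 + sum(x[i] * Vx[i] for i in range(self.d))
        # Sherman-Morrison rank-one update (V_inv is symmetric).
        for i in range(self.d):
            for j in range(self.d):
                self.V_inv[i][j] -= Vx[i] * Vx[j] / denom
        for i in range(self.d):
            self.b[i] += e * x[i]

    def lower_bound(self, x, beta):
        """theta_hat^T x - sqrt(beta * x^T V^{-1} x)."""
        theta_hat = self._mat_vec(self.b)
        Vx = self._mat_vec(x)
        mean = sum(theta_hat[i] * x[i] for i in range(self.d))
        width = math.sqrt(beta * sum(x[i] * Vx[i] for i in range(self.d)))
        return mean - width

    def should_eliminate(self, x, beta, l=0.5):
        # Eliminate only when even the pessimistic estimate exceeds l.
        return self.lower_bound(x, beta) > l


elim = LinearEliminator(d=2, lam=1.0)
x = [1.0, 0.0]
before = elim.should_eliminate(x, beta=0.5)
for _ in range(50):
    elim.update(x, 1.0)  # the environment keeps flagging this action as invalid
after = elim.should_eliminate(x, beta=0.5)
unseen = elim.should_eliminate([0.0, 1.0], beta=0.5)
```

In this toy run the action is not eliminated before any feedback is observed, it is eliminated in the observed feature direction after repeated invalid signals, and it stays admissible in a feature direction that was never observed because the confidence width there remains large.<br />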
<br />
==Concurrent Learning==<br />
In fact, Q-learning and contextual bandit algorithms can learn simultaneously, resulting in the convergence of both algorithms, i.e., finding an optimal policy and a minimal valid action space. <br />
<br />
If the elimination is done based on the concentration bounds of the linear contextual bandits, it can be ensured that Action Elimination Q-learning converges, as shown in Proposition 1.<br />
<br />
'''Proposition 1:'''<br />
<br />
Assume that all state-action pairs (s,a) are visited infinitely often, unless eliminated according to <math display="inline">\hat{\theta}_{t-1,a}^Tx(s)-\sqrt{\beta_{t-1}(\tilde{\delta})x(s)^T\bar{V}_{t-1,a}^{-1}x(s)}>l</math>. Then, with a probability of at least <math display="inline">1-\delta</math>, action elimination Q-learning converges to the optimal Q-function for any valid state-action pairs. In addition, actions which should be eliminated are visited at most <math display="inline">T_{s,a}(t)\leq 4\beta_t/(u-l)^2+1</math> times.<br />
<br />
Notice that when there is no noise in the elimination signal (R=0), we correctly eliminate actions with probability 1, so invalid actions will be sampled only a finite number of times.<br />
<br />
=Method=<br />
The assumption that <math display="inline">e_t(s_t,a)=\theta_a^{*T}x(s_t)+\eta_t </math> might not hold when using raw features like word2vec, so the paper proposes to use the activations of the neural network's last layer as features. A practical challenge here is that the features must be fixed over time when used by the contextual bandit. Therefore a batch-updates framework (Levine et al., 2017; Riquelme, Tucker, and Snoek, 2018) is used, in which a new contextual bandit model is learned every few steps using the last-layer activations of the AEN as features.<br />
<br />
==Architecture of action elimination framework==<br />
<br />
[[File:AEF_architecture.png]]<br />
<br />
After taking action <math display="inline">a_t</math>, the agent observes <math display="inline">(r_t,s_{t+1},e_t)</math>. The agent uses this to train two deep neural network function approximators: a DQN and an AEN. The AEN provides an admissible action set <math display="inline">A'</math> to the DQN. The architecture of both the AEN and the DQN is an NLP CNN (100 convolutional filters for the AEN and 500 for the DQN, with three different 1D kernels of length (1,2,3)), based on (Kim, 2014). The state is represented as a sequence of words, composed of the game descriptor and the player's inventory. These are truncated or zero-padded to a length of 50 descriptor + 15 inventory words, and each word is embedded into a continuous vector in <math display="inline">R^{300}</math> using word2vec. The features of the last four states are then concatenated such that the final state representations s are in <math display="inline">R^{78000}</math>. The AEN is trained to minimize the MSE loss, using the elimination signal as a label.<br />
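<br />
The state dimension quoted above follows directly from the stated sizes. The short sketch below reproduces the arithmetic and shows one possible way to truncate or pad a token sequence; the helper name <code>pad_or_truncate</code> and the padding token are hypothetical, and the padding token is assumed to map to an all-zero embedding.<br />

```python
DESCRIPTOR_WORDS = 50   # words kept from the game descriptor
INVENTORY_WORDS = 15    # words kept from the player's inventory
EMBEDDING_DIM = 300     # word2vec embedding size
STACKED_STATES = 4      # the last four states are concatenated
PAD = "<pad>"           # hypothetical token, assumed to embed to zeros


def pad_or_truncate(tokens, length, pad=PAD):
    """Cut the sequence to `length` tokens or fill it up with `pad`."""
    return list(tokens[:length]) + [pad] * max(0, length - len(tokens))


words_per_state = DESCRIPTOR_WORDS + INVENTORY_WORDS          # 65
state_dim = words_per_state * EMBEDDING_DIM * STACKED_STATES  # 65 * 300 * 4

descriptor = pad_or_truncate("you are standing in an open field".split(),
                             DESCRIPTOR_WORDS)
inventory = pad_or_truncate(["lantern"], INVENTORY_WORDS)
```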
<br />
==Pseudocode of the Algorithm==<br />
<br />
[[File:AEF_pseudocode.png]]<br />
<br />
AE-DQN trains two networks: a DQN denoted by Q and an AEN denoted by E. Every L iterations, the procedure AENUpdate() creates a linear contextual bandit model from E, using the activations of the last hidden layer of E as features. AENUpdate() then solves this model and plugs it into the target AEN. The contextual linear bandit model <math display="inline">(E^-,V)</math> is then used to eliminate actions via the ACT() and Targets() functions. ACT() follows an <math display="inline">\epsilon</math>-greedy mechanism on the admissible action set. For exploitation, it selects the action with the highest Q-value by taking an argmax over the Q-values among <math display="inline">A'</math>. For exploration, it selects an action uniformly from <math display="inline">A'</math>. The Targets() procedure estimates the value function by taking the max over Q-values only among admissible actions, hence reducing function approximation errors.<br />
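<br />
The two procedures that consume the admissible set can be sketched as plain functions. This is a simplified illustration, not the pseudocode from the figure: the function names, the argument lists, and the handling of an empty admissible set are assumptions, and Q-values are passed in as plain lists instead of network outputs.<br />

```python
import random


def act(q_values, admissible, epsilon, rng=random):
    """Epsilon-greedy action choice restricted to the admissible set A'."""
    if rng.random() < epsilon:
        return rng.choice(admissible)                   # explore uniformly over A'
    return max(admissible, key=lambda a: q_values[a])   # exploit: argmax over A'


def target(reward, next_q_values, next_admissible, gamma, terminal=False):
    """Bootstrapped target using a max over admissible next actions only."""
    if terminal or not next_admissible:
        return reward
    return reward + gamma * max(next_q_values[a] for a in next_admissible)


q = [0.2, 9.0, 0.5]   # action 1 has an inflated Q-value
admissible = [0, 2]   # action 1 has been eliminated
chosen = act(q, admissible, epsilon=0.0)
y = target(-1.0, q, admissible, gamma=0.8)
```

With action 1 eliminated, the greedy choice is action 2 and the target is computed from 0.5 instead of the inflated value 9.0.<br />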
<br />
<br />
=Experiment=<br />
==Zork domain==<br />
The world of Zork presents a rich environment with a large state and action space. <br />
Zork players describe their actions using natural language instructions, for example "open the mailbox". These actions are then processed by a sophisticated natural language parser, and based on the result the game presents the outcome of the action. The goal of Zork is to collect the Twenty Treasures of Zork and install them in the trophy case. Points generated by the game's scoring system are given to the agent as the reward; for example, the player earns points for solving puzzles. Placing all treasures in the trophy case yields 350 points. The elimination signal is given in two forms: a "wrong parse" flag and text feedback such as "you cannot take that". These two signals are grouped together into a single binary signal, which is then provided to the algorithm. <br />
<br />
Experiments begin with two subdomains of Zork: the Egg Quest and the Troll Quest. For these subdomains, an additional reward signal is provided to guide the agent towards solving specific tasks and to make the results more visible. A reward of -1 is applied at every time step to encourage the agent to favor short paths. Each trajectory terminates upon completing the quest or after T steps are taken. The discount factor is <math display="inline">\gamma=0.8</math> during training and <math display="inline">\gamma=1</math> during evaluation. Also, <math display="inline">\beta=0.5, l=0.6</math> in all experiments. <br />
<br />
===Egg Quest===<br />
The goal of this quest is to find and open the jewel-encrusted egg hidden in a tree in the forest. The agent gets 100 points upon completing this task. The action space consists of 9 fixed actions for navigation and a second subset of <math display="inline">N_{Take}</math> actions for taking possible objects in the game. Two settings were tested separately: <math display="inline">N_{Take}=200</math> (set <math display="inline">A_1</math>) and <math display="inline">N_{Take}=300</math> (set <math display="inline">A_2</math>).<br />
AE-DQN (blue) and a vanilla DQN agent (green) were tested in this quest.<br />
<br />
[[File:AEF_zork_comparison.png]]<br />
<br />
Figure a) corresponds to the set <math display="inline">A_1</math> with T=100, and b) corresponds to the set <math display="inline">A_2</math> with T=200. Both agents performed well on these two sets. However, the AE-DQN agent learned much faster than DQN, which suggests that action elimination is more robust when the action space is large.<br />
<br />
<br />
===Troll Quest===<br />
The goal of this quest is to find the troll. To do so, the agent needs to find the way to the house and use a lantern to expose the hidden entrance to the underworld. It gets 100 points upon achieving the goal. This quest is a larger problem than the Egg Quest. The action set <math display="inline">A_1</math> consists of 200 take actions and 15 necessary actions, 215 in total.<br />
<br />
[[File:AEF_troll_comparison.png]]<br />
<br />
The red line above is an "optimal elimination" baseline which consists of only 35 actions (15 essential and 20 relevant take actions). We can see that AE-DQN still outperforms DQN and achieves performance comparable to the "optimal elimination" baseline. <br />
<br />
<br />
===Open Zork===<br />
Lastly, the "Open Zork" domain was tested, in which only the environment reward is used. The agents were trained for 1M steps, and each trajectory terminates after T=200 steps. Two action sets were used: <math display="inline">A_3</math>, the "Minimal Zork" action set, which is the minimal set of actions (131) required to solve the game, and <math display="inline">A_4</math>, the "Open Zork" action set (1227), which is composed of {Verb, Object} tuples for all the verbs and objects in the game.<br />
<br />
[[File:AEF_open_zork_comparison.png]]<br />
<br />
The figure above shows the learning curves of both AE-DQN and DQN. We can see that AE-DQN (blue) still outperforms the DQN baseline in terms of speed and cumulative reward.<br />
<br />
=Conclusion=<br />
In this paper, the authors proposed a deep reinforcement learning model that learns to eliminate sub-optimal actions while performing Q-learning. The approach improved learning and reduced the effective action space when tested on Zork, a text-based game.<br />
<br />
=Critique=<br />
<br />
=Reference=</div>Mpaflahttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=policy_optimization_with_demonstrations&diff=41161policy optimization with demonstrations2018-11-23T16:43:35Z<p>Mpafla: /* Conclusion */</p>
<hr />
<div>= Introduction =<br />
<br />
==Introduction==<br />
Reinforcement learning (RL) has made significant progress in a variety of applications, but exploration, that is, how to gain more experience from novel policies to improve long-term performance, remains a challenge, especially in environments where reward signals are sparse and rare. There are currently two ways to address such exploration problems in RL: 1) guide the agent to explore states that have never been seen; 2) guide the agent to imitate demonstration trajectories sampled from an expert policy. When guiding the agent to imitate expert behavior, there are again two methods: putting the demonstrations directly into the replay memory [1] [2] [3], or using the demonstration trajectories to pre-train the policy in a supervised manner [4]. However, neither of these methods takes full advantage of the demonstration data. To address this problem, a novel policy optimization method based on demonstrations (POfD) is proposed, which takes full advantage of the demonstrations and does not require the expert policy to be optimal. In this paper, the authors evaluate the performance of POfD on Mujoco [5] in sparse-reward environments. The experimental results show that POfD greatly improves on several strong baselines and is even comparable to policy gradient methods trained in dense-reward environments.<br />
<br />
==Intuition==<br />
The agent should imitate the demonstrated behavior when rewards are sparse and then explore new states on its own after acquiring sufficient skills. This acts as a dynamic intrinsic reward mechanism that reshapes the native rewards in RL.<br />
<br />
=Related Work =<br />
There is related work on overcoming exploration difficulties by learning from demonstration [6] and by imitation learning in RL.<br />
<br />
For learning from demonstration (LfD),<br />
# Most LfD methods adopt value-based RL algorithms, such as DQfD [2], which applies to discrete action spaces, and DDPGfD [3], which extends it to continuous spaces. But both of them underutilize the demonstration data.<br />
# There are some methods based on policy iteration [7] [8], which shape the value function by using demonstration data. But they perform poorly when the demonstration data is imperfect.<br />
# A hybrid framework [9] has been proposed that learns the policy by maximizing the probability of taking demonstrated actions, but it makes less use of the demonstration data.<br />
# A reward reshaping mechanism [10] that encourages taking actions close to the demonstrated ones has been proposed. It is similar to the method in this paper, but differs in that it is defined as a potential function based on a multivariate Gaussian that models the distribution of state-actions.<br />
All of the above methods require many perfect demonstrations to reach satisfactory performance, which is different from POfD in this paper.<br />
<br />
For imitation learning, <br />
# Inverse Reinforcement Learning [11] problems are solved by alternating between fitting the reward function and selecting the policy [12] [13]. But this approach does not scale to large problems.<br />
# Generative Adversarial Imitation Learning (GAIL) [14] uses a discriminator to distinguish whether a state-action pair comes from the expert or from the learned policy, and it can be applied to high-dimensional continuous control problems.<br />
All of the above methods are effective for imitation learning, but they usually perform poorly when the expert data is imperfect. That is different from POfD in this paper.<br />
<br />
=Background=<br />
<br />
==Preliminaries==<br />
A Markov Decision Process (MDP) [15] is defined by a tuple <math>⟨S, A, P, r, \gamma⟩ </math>, where <math>S</math> is the state space, <math>A </math> is the action space, <math>P(s'|s,a)</math> is the transition distribution of taking action <math> a </math> at state <math>s </math>, <math> r(s,a) </math> is the reward function, and <math> \gamma </math> is the discount factor between 0 and 1. A policy <math> \pi(a|s) </math> is a mapping from states to actions, and the performance of <math> \pi </math> is usually evaluated by its expected discounted reward <math> \eta(\pi) </math>: <br />
\[\eta(\pi)=\mathbb{E}_{\pi}[r(s,a)]=\mathbb{E}_{(s_0,a_0,s_1,...)}[\sum_{t=0}^\infty\gamma^{t}r(s_t,a_t)] \]<br />
The value function is <math> V_{\pi}(s) =\mathbb{E}_{\pi}[r(·,·)|s_0=s] </math>, the action value function is <math> Q_{\pi}(s,a) =\mathbb{E}_{\pi}[r(·,·)|s_0=s,a_0=a] </math>, and the advantage function that reflects the expected additional reward after taking action a at state s is <math> A_{\pi}(s,a)=Q_{\pi}(s,a)-V_{\pi}(s)</math>.<br />
Then the authors define the occupancy measure, which is used to estimate the probability of visiting state <math>s</math> and state-action pair <math>(s,a)</math> when executing a certain policy.<br />
[[File:def1.png|500px|center]]<br />
Then the performance of <math> \pi </math> can be rewritten as: <br />
[[File:equ2.png|500px|center]]<br />
At the same time, the authors propose a lemma: <br />
[[File:lemma1.png|500px|center]]<br />
<br />
==Problem Definition==<br />
In this paper, the authors aim to develop a method that boosts exploration by effectively leveraging the demonstrations <math>D^E </math> from the expert policy <math> \pi_E </math> and maximizes <math> \eta(\pi) </math> in the sparse-reward environment. The authors define the demonstrations <math>D^E=\{\tau_1,\tau_2,...,\tau_N\} </math>, where <math>\tau_i=\{(s_0^i,a_0^i),(s_1^i,a_1^i),...,(s_T^i,a_T^i)\} </math> is generated from the expert policy. In addition, there is an assumption on the quality of the expert policy:<br />
[[File:asp1.png|500px|center]]<br />
Moreover, it is not necessary to ensure that the expert policy is advantageous over all other policies, because POfD learns a better policy than the expert policy by exploring on its own in later learning stages. <br />
<br />
=Method=<br />
<br />
==Policy Optimization with Demonstration (POfD)==<br />
[[File:ff1.png|500px|center]]<br />
This method optimizes the policy by forcing the policy to explore in the nearby region of the expert policy that is specified by several demonstrated trajectories <math>D^E </math> (as shown in Fig.1) in order to avoid causing slow convergence or failure when the environment feedback is sparse. In addition, the authors encourage the policy π to explore by "following" the demonstrations <math>D^E </math>. Thus, a new learning objective is given:<br />
\[ \mathcal{L}(\pi_{\theta})=-\eta(\pi_{\theta})+\lambda_{1}D_{JS}(\pi_{\theta},\pi_{E})\]<br />
where <math>D_{JS}(\pi_{\theta},\pi_{E})</math> is the Jensen-Shannon divergence between the current policy <math>\pi_{\theta}</math> and the expert policy <math>\pi_{E}</math>, <math>\lambda_1</math> is a trade-off parameter, and <math>\theta</math> is the policy parameter. According to Lemma 1, the authors use <math>D_{JS}(\rho_{\theta},\rho_{E})</math> instead of <math>D_{JS}(\pi_{\theta},\pi_{E})</math>, because it is easier to optimize through adversarial training on demonstrations. The learning objective is: <br />
\[ \mathcal{L}(\pi_{\theta})=-\eta(\pi_{\theta})+\lambda_{1}D_{JS}(\rho_{\theta},\rho_{E})\]<br />
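<br />
To make the regularizer concrete, the sketch below evaluates the Jensen-Shannon divergence and the resulting loss for small discrete distributions. It is only an illustration: the function names are hypothetical, the distributions are normalized toy stand-ins for occupancy measures, and the return value <code>eta</code> is supplied by hand. POfD itself does not compute the divergence in closed form but optimizes a lower bound adversarially.<br />

```python
import math


def kl(p, q):
    """Kullback-Leibler divergence between two discrete distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js(p, q):
    """Jensen-Shannon divergence: average KL to the mixture m = (p + q) / 2."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


def pofd_loss(eta, rho_theta, rho_expert, lam1):
    """L = -eta + lam1 * D_JS(rho_theta, rho_expert), on toy distributions."""
    return -eta + lam1 * js(rho_theta, rho_expert)


rho_expert = [0.7, 0.2, 0.1]
far_policy = [0.1, 0.2, 0.7]
near_policy = [0.6, 0.3, 0.1]
# With equal return, the policy closer to the expert has the lower loss.
loss_far = pofd_loss(0.0, far_policy, rho_expert, lam1=0.1)
loss_near = pofd_loss(0.0, near_policy, rho_expert, lam1=0.1)
```

The divergence is zero for identical distributions and bounded by <math>\log 2</math>, so in a sparse-reward setting where the return term is uninformative, the regularizer still gives a bounded signal that favors policies close to the demonstrations.<br />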
<br />
==Benefits of Exploration with Demonstrations==<br />
The authors introduce the benefits of POfD. Firstly, we consider the expression of expected return in policy gradient methods [16].<br />
\[ \eta(\pi)=\eta(\pi_{old})+\mathbb{E}_{\tau\sim\pi}[\sum_{t=0}^\infty\gamma^{t}A_{\pi_{old}}(s_t,a_t)]\]<br />
That is, <math>\eta(\pi)</math> equals the return of the policy <math>\pi_{old}</math> from the previous iteration plus the expected discounted advantage over <math>\pi_{old}</math>, so the expression can be rewritten as<br />
\[ \eta(\pi)=\eta(\pi_{old})+\sum_{s}\rho_{\pi}(s)\sum_{a}\pi(a|s)A_{\pi_{old}}(s,a)\]<br />
Because the complex dependency of <math>\rho_{\pi}(s)</math> on <math> \pi </math> makes <math>\eta(\pi)</math> difficult to optimize directly, policy gradient methods usually optimize a first-order local approximation to <math>\eta(\pi)</math> as the surrogate learning objective:<br />
\[ J_{\pi_{old}}(\pi)=\eta(\pi_{old})+\sum_{s}\rho_{\pi_{old}}(s)\sum_{a}\pi(a|s)A_{\pi_{old}}(s,a)\]<br />
The policy gradient methods improve <math>\eta(\pi)</math> monotonically by optimizing the above <math>J_{\pi_{old}}(\pi)</math> with a sufficiently small update step from <math>\pi_{old}</math> to <math>\pi</math> such that <math>D_{KL}^{max}(\pi, \pi_{old})</math> is bounded [16] [17] [18]. For POfD, it imposes a regularization <math>D_{JS}(\pi_{\theta}, \pi_{E})</math> in order to encourage explorations around regions demonstrated by the expert policy. Theorem 1 shows such benefits,<br />
[[File:them1.png|500px|center]]<br />
<br />
==Optimization==<br />
<br />
For POfD, the authors choose to optimize a lower bound of the learning objective rather than the objective itself. This optimization method is compatible with any policy gradient method. Theorem 2 gives the lower bound of <math>D_{JS}(\rho_{\theta}, \rho_{E})</math>:<br />
[[File:them2.png|500px|center]]<br />
Thus, the occupancy measure matching objective can be written as:<br />
[[File:eqnlm.png|500px|center]]<br />
where <math> D(s,a)=\frac{1}{1+e^{-U(s,a)}}: S\times A \rightarrow (0,1)</math>. The supremum ranges over such mappings, which act like discriminators that distinguish whether a state-action pair comes from the current policy or from the expert policy.<br />
To avoid overfitting, the authors add the causal entropy <math>-H (\pi_{\theta}) </math> as a regularization term. Thus, the learning objective is: <br />
\[\min_{\theta}\mathcal{L}=-\eta(\pi_{\theta})-\lambda_{2}H(\pi_{\theta})+\lambda_{1} \sup_{D\in(0,1)^{S\times A}} \left(\mathbb{E}_{\pi_{\theta}}[\log(D(s,a))]+\mathbb{E}_{\pi_{E}}[\log(1-D(s,a))]\right)\]<br />
At this point, the problem resembles Generative Adversarial Networks (GANs) [19]. The difference is that the discriminative model D of GANs is well-trained but the expert policy of POfD is not optimal. Now suppose D is parameterized by w. Maximizing the objective below over w pushes <math>D_w</math> toward 1 on state-action pairs generated by the current policy and toward 0 on pairs from the expert demonstrations. Thus, the minimax learning objective is:<br />
\[\min_{\theta}\max_{w}\mathcal{L}=-\eta(\pi_{\theta})-\lambda_{2}H (\pi_{\theta})+\lambda_{1}( \mathbb{E}_{\pi_{\theta}}[\log(D_{w}(s,a))]+\mathbb{E}_{\pi_{E}}[\log(1-D_{w}(s,a))])\]<br />
The minimax learning objective can be rewritten by substituting the expression of <math> \eta(\pi) </math>:<br />
\[\min_{\theta}\max_{w}-\mathbb{E}_{\pi_{\theta}}[r'(s,a)]-\lambda_{2}H (\pi_{\theta})+\lambda_{1}\mathbb{E}_{\pi_{E}}[\log(1-D_{w}(s,a))]\]<br />
where <math> r'(s,a)=r(s,a)-\lambda_{1}\log(D_{w}(s,a))</math> is the reshaped reward function.<br />
The above objective can be optimized efficiently by alternately updating the policy parameters θ and the discriminator parameters w. The gradient with respect to the discriminator parameters is given by:<br />
\[\mathbb{E}_{\pi_{\theta}}[\nabla_{w}\log(D_{w}(s,a))]+\mathbb{E}_{\pi_{E}}[\nabla_{w}\log(1-D_{w}(s,a))]\]<br />
Then, fixing the discriminator <math>D_w</math>, the reshaped policy gradient is:<br />
\[\nabla_{\theta}\mathbb{E}_{\pi_{\theta}}[r'(s,a)]=\mathbb{E}_{\pi_{\theta}}[\nabla_{\theta}\log\pi_{\theta}(a|s)Q'(s,a)]\]<br />
where <math>Q'(\bar{s},\bar{a})=\mathbb{E}_{\pi_{\theta}}[r'(s,a)|s_0=\bar{s},a_0=\bar{a}]</math>.<br />
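<br />
The reshaped reward <math>r'(s,a)=r(s,a)-\lambda_{1}\log(D_{w}(s,a))</math> defined above can be evaluated directly once a discriminator output is available. The sketch below is illustrative only: the function name is hypothetical and the discriminator outputs are hand-picked numbers, not the output of a trained network.<br />

```python
import math


def reshaped_reward(r, d_w, lam1):
    """r'(s, a) = r(s, a) - lam1 * log(D_w(s, a)), with D_w(s, a) in (0, 1)."""
    return r - lam1 * math.log(d_w)


# In a sparse-reward step the environment reward is 0, so the agent
# is driven only by the discriminator-based bonus.
bonus_small_d = reshaped_reward(0.0, d_w=0.1, lam1=0.1)
bonus_large_d = reshaped_reward(0.0, d_w=0.9, lam1=0.1)
# When the environment does pay out, the two terms simply add up.
with_env_reward = reshaped_reward(1.0, d_w=0.9, lam1=0.1)
```

Since <math>D_w</math> lies in (0,1), the term <math>-\lambda_{1}\log(D_{w}(s,a))</math> is always positive and grows as the discriminator output shrinks, which is what provides a learning signal when the environment reward is zero.<br />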
<br />
At the end, Algorithm 1 gives the detailed process.<br />
[[File:pofd.png|500px|center]]<br />
<br />
=Discussion on Existing LfD Methods=<br />
<br />
==DQfD==<br />
DQfD [2] puts the demonstrations into a replay memory D and keeps them throughout the Q-learning process. The objective for DQfD is:<br />
\[J_{DQfD}={\hat{\mathbb{E}}}_{D}[(R_t(n)-Q_w(s_t,a_t))^2]+\alpha{\hat{\mathbb{E}}}_{D^E}[(R_t(n)-Q_w(s_t,a_t))^2]\]<br />
The second term can be rewritten as <math> {\hat{\mathbb{E}}}_{D^E}[(R_t(n)-Q_w(s_t,a_t))^2]={\hat{\mathbb{E}}}_{D^E}[(\hat{\rho}_E(s,a)-\rho_{\pi}(s,a))^{2}r^2(s,a)]</math>, which can be regarded as a regularization forcing current policy's occupancy measure to match the expert's empirical occupancy measure, weighted by the potential reward.<br />
<br />
==DDPGfD==<br />
DDPGfD [3] also puts the demonstrations into a replay memory D, but it is based on an actor-critic framework [21]. The objective for DDPGfD is the same as for DQfD. Its policy gradient is:<br />
\[\nabla_{\theta}J_{DDPGfD}\approx \mathbb{E}_{s,a}[\nabla_{a}Q_w(s,a)\nabla_{\theta}\pi_{\theta}(s)], a=\pi_{\theta}(s) \]<br />
From this equation, the policy is updated based on the learned Q-network <math>Q_w </math> rather than on the demonstrations <math>D^{E} </math>. DDPGfD shares the same objective function for <math>Q_w </math> as DQfD, so the two methods leverage demonstrations in the same way: the demonstrations in DQfD and DDPGfD induce an occupancy measure matching regularization.<br />
<br />
=Experiments=<br />
<br />
==Goal==<br />
The authors aim at investigating 1) whether POfD can aid exploration by leveraging a few demonstrations, even though the demonstrations are imperfect. 2) whether POfD can succeed and achieve high empirical return, especially in environments where reward signals are sparse and rare. <br />
<br />
==Settings==<br />
The authors conduct the experiments on 8 physical control tasks, ranging from low-dimensional spaces to high-dimensional spaces and naturally sparse environments, based on OpenAI Gym [20] and Mujoco [5]. Due to the uniqueness of the environments, the authors introduce 4 ways to sparsify their built-in dense rewards. TYPE1: a reward of +1 is given when the agent reaches the terminal state, and 0 otherwise. TYPE2: a reward of +1 is given when the agent survives for a while. TYPE3: a reward of +1 is given every time the agent moves forward over a specific number of units in Mujoco environments. TYPE4: specially designed for InvertedDoublePendulum, a reward of +1 is given when the second pole stays above a specific height of 0.89. The details are shown in Table 1. Moreover, only a single imperfect trajectory is used as the demonstration in this paper. The authors collect the demonstrations by training an agent insufficiently with TRPO in the corresponding dense environment. <br />
[[File:pofdt1.png|900px|center]]<br />
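<br />
Three of the sparsification schemes described above can be sketched as small reward functions. These are illustrative assumptions only: the function names, the use of the forward position for TYPE3, and the way threshold crossings are counted are not taken from the paper; only the 0.89 height threshold for TYPE4 comes from the text.<br />

```python
import math


def type1_reward(reached_terminal):
    """TYPE1: +1 when the agent reaches the terminal state, 0 otherwise."""
    return 1.0 if reached_terminal else 0.0


def type3_reward(prev_x, x, unit):
    """TYPE3 (assumed form): +1 for each multiple of `unit` crossed forward."""
    crossed = math.floor(x / unit) - math.floor(prev_x / unit)
    return float(max(0, crossed))


def type4_reward(second_pole_height, threshold=0.89):
    """TYPE4: +1 while the second pole is above the height threshold."""
    return 1.0 if second_pole_height > threshold else 0.0


# A dense forward-progress signal becomes mostly zeros after sparsification.
positions = [0.0, 0.4, 0.9, 1.1, 1.5, 2.3]
sparse = [type3_reward(a, b, unit=1.0) for a, b in zip(positions, positions[1:])]
```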
<br />
==Baselines==<br />
The authors compare POfD against 5 strong baselines:<br />
* training the policy with TRPO [17] in dense environments, which is called expert <br />
* training the policy with TRPO [17] in sparse environments<br />
* applying GAIL [14] to learn the policy from demonstrations<br />
* DQfD [2]<br />
* DDPGfD [3]<br />
<br />
==Results==<br />
Firstly, the authors test the performance of POfD in sparse control environments with discrete actions. From Table 1, POfD achieves performance comparable with the policy learned under dense environments. From Figure 2, only POfD succeeds in exploring sufficiently and achieves great performance in both sparse environments. TRPO [17] and DQfD [2] fail to explore, and GAIL [14] converges to the imperfect demonstration in MountainCar [22].<br />
<br />
[[File:pofdf2.png|500px|center]]<br />
<br />
Then, the authors test the performance of POfD under sparse environments with continuous action spaces. From Figure 3, POfD achieves expert-level performance in terms of cumulative reward and surpasses the other strong baselines, including the policy trained with TRPO. By watching the learning process of the different methods, we can see that TRPO consistently fails to explore the environments when the feedback is sparse, except for HalfCheetah. This may be because there is no terminal state in HalfCheetah, so a random agent can perform reasonably well as long as the time horizon is sufficiently long. This is shown in Figure 3, where the improvement of TRPO begins to show after 400 iterations. DDPGfD and GAIL have a common drawback: during training, they both converge to the imperfect demonstration data. For HalfCheetah, GAIL fails to converge and DDPGfD converges to an even worse point. This is expected because the policy and value networks tend to overfit when data is scarce, so the training of GAIL and DDPGfD is severely biased by the imperfect data. Finally, POfD can effectively explore the environment with the help of demonstration-based intrinsic reward reshaping, and it succeeds consistently across different tasks in terms of both learning stability and convergence speed.<br />
[[File:pofdf3.png|900px|center]]<br />
<br />
The authors also implement a locomotion task <math>Humanoid</math>, which teaches a human-like robot to walk. The state space has dimension 376, which makes the task very hard. Even so, POfD outperformed all three baseline methods, which failed to learn policies in such a sparse-reward environment.<br />
<br />
The Reacher environment is a task in which the goal is to control a robot arm to touch an object. The location of the object is random for each instantiation. The authors select 15 random trajectories as demonstration data, and the performance of POfD is much better than that of the expert, while all other baseline methods fail.<br />
<br />
=Conclusion=<br />
The paper proposes POfD, a method that can acquire knowledge from a limited amount of imperfect demonstration data to aid exploration in environments with sparse feedback. It is compatible with any policy gradient method. POfD induces implicit dynamic reward shaping and brings provable benefits for policy improvement. Moreover, the experimental results show the validity and effectiveness of POfD in encouraging the agent to explore the region near the expert policy and to learn better policies. The key contribution is that POfD helps the agent work with few and imperfect demonstrations in an environment with sparse rewards.<br />
<br />
=Critique=<br />
# A novel demonstration-based policy optimization method is proposed. In the process of policy optimization, POfD reshapes the reward function. This new reward function guides the agent to imitate the expert behavior when the reward is sparse and to explore on its own once reward can be obtained. It takes full advantage of the demonstration data and does not require the expert policy to be optimal.<br />
# POfD can be combined with any policy gradient method. Its performance surpasses five strong baselines and is comparable to that of agents trained in the dense-reward environment.<br />
# The paper is well structured and the flow of ideas is easy to follow. For related work, the authors clearly explain the similarities and differences among the related methods.<br />
# The paper demonstrates the method's scalability. The experimental environments range from low-dimensional to high-dimensional spaces and from discrete to continuous action spaces. For future work, can it be realized in the real world?<br />
# It is doubtful whether a trajectory from an agent trained insufficiently in the dense-reward environment is an appropriate choice of imperfect demonstration.<br />
# In this paper, performance is judged only by the cumulative reward. Can other evaluation criteria be considered, for example the convergence rate?<br />
<br />
=References=<br />
[1] Nair, A., McGrew, B., Andrychowicz, M., Zaremba, W., and Abbeel, P. Overcoming exploration in reinforcement learning with demonstrations. arXiv preprint arXiv:1709.10089, 2017.<br />
<br />
[2] Hester, T., Vecerik, M., Pietquin, O., Lanctot, M., Schaul, T., Piot, B., Sendonaris, A., Dulac-Arnold, G., Osband, I., Agapiou, J., et al. Learning from demonstrations for real world reinforcement learning. arXiv preprint arXiv:1704.03732, 2017.<br />
<br />
[3] Večerík, M., Hester, T., Scholz, J., Wang, F., Pietquin, O., Piot, B., Heess, N., Rothörl, T., Lampe, T., and Riedmiller, M. Leveraging demonstrations for deep reinforcement learning on robotics problems with sparse rewards. arXiv preprint arXiv:1707.08817, 2017.<br />
<br />
[4] Silver, D., Huang, A., Maddison, C. J., Guez, A., Sifre, L., Van Den Driessche, G., Schrittwieser, J., Antonoglou, I., Panneershelvam, V., Lanctot, M., et al. Mastering the game of go with deep neural networks and tree search. nature, 529(7587):484–489, 2016.<br />
<br />
[5] Todorov, E., Erez, T., and Tassa, Y. Mujoco: A physics engine for model-based control. In Intelligent Robots and Systems (IROS), 2012 IEEE/RSJ International Conference on, pp. 5026–5033. IEEE, 2012.<br />
<br />
[6] Schaal, S. Learning from demonstration. In Advances in neural information processing systems, pp. 1040–1046, 1997.<br />
<br />
[7] Kim, B., Farahmand, A.-m., Pineau, J., and Precup, D. Learning from limited demonstrations. In Advances in Neural Information Processing Systems, pp. 2859–2867, 2013.<br />
<br />
[8] Piot, B., Geist, M., and Pietquin, O. Boosted bellman residual minimization handling expert demonstrations. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pp. 549–564. Springer, 2014.<br />
<br />
[9] Aravind S. Lakshminarayanan, Sherjil Ozair, Y. B. Reinforcement learning with few expert demonstrations. In NIPS workshop, 2016.<br />
<br />
[10] Brys, T., Harutyunyan, A., Suay, H. B., Chernova, S., Taylor, M. E., and Nowé, A. Reinforcement learning from demonstration through shaping. In IJCAI, pp. 3352–3358, 2015.<br />
<br />
[11] Ng, A. Y., Russell, S. J., et al. Algorithms for inverse reinforcement learning. In Icml, pp. 663–670, 2000.<br />
<br />
[12] Syed, U. and Schapire, R. E. A game-theoretic approach to apprenticeship learning. In Advances in neural information processing systems, pp. 1449–1456, 2008.<br />
<br />
[13] Syed, U., Bowling, M., and Schapire, R. E. Apprenticeship learning using linear programming. In Proceedings of the 25th international conference on Machine learning, pp. 1032–1039. ACM, 2008.<br />
<br />
[14] Ho, J. and Ermon, S. Generative adversarial imitation learning. In Advances in Neural Information Processing Systems, pp. 4565–4573, 2016.<br />
<br />
[15] Sutton, R. S. and Barto, A. G. Reinforcement learning: An introduction, volume 1. MIT press Cambridge, 1998.<br />
<br />
[16] Kakade, S. M. A natural policy gradient. In Advances in neural information processing systems, pp. 1531–1538, 2002.<br />
<br />
[17] Schulman, J., Levine, S., Abbeel, P., Jordan, M., and Moritz, P. Trust region policy optimization. In Proceedings of the 32nd International Conference on Machine Learning (ICML-15), pp. 1889–1897, 2015.<br />
<br />
[18] Schulman, J., Wolski, F., Dhariwal, P., Radford, A., and Klimov, O. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017.<br />
<br />
[19] Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., and Bengio, Y. Generative adversarial nets. In Advances in neural information processing systems, pp. 2672–2680, 2014.<br />
<br />
[20] Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., and Zaremba, W. Openai gym, 2016.<br />
<br />
[21] Lillicrap, T. P., Hunt, J. J., Pritzel, A., Heess, N., Erez, T., Tassa, Y., Silver, D., and Wierstra, D. Continuous control with deep reinforcement learning. arXiv preprint arXiv:1509.02971, 2015.<br />
<br />
[22] Moore, A. W. Efficient memory-based learning for robot control. 1990.</div>Mpaflahttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Visual_Reinforcement_Learning_with_Imagined_Goals&diff=41160Visual Reinforcement Learning with Imagined Goals2018-11-23T16:22:32Z<p>Mpafla: /* Variational Autoencoder (VAE) */</p>
<hr />
<div>Video and details of this work is available [https://sites.google.com/site/visualrlwithimaginedgoals/ here]<br />
<br />
=Introduction and Motivation=<br />
<br />
Humans are able to accomplish many tasks without any explicit or supervised training, simply by exploring their environment. We are able to set our own goals and learn from our experiences, and thus able to accomplish specific tasks without ever having been trained explicitly for them. It would be ideal if an autonomous agent could also set its own goals and learn from its environment.<br />
<br />
In the paper “Visual Reinforcement Learning with Imagined Goals”, the authors devise such an unsupervised reinforcement learning system. They introduce a system that sets abstract goals and autonomously learns to achieve those goals. They then show that the system can use these autonomously learned skills to achieve a variety of user-specified goals, such as pushing objects, grasping objects, and opening doors, without any additional learning. Lastly, they demonstrate that their method is efficient enough to work in the real world on a Sawyer robot. The robot learns to set and achieve goals with only images as the input to the system.<br />
<br />
=Related Work=<br />
<br />
Many previous works on vision-based deep reinforcement learning for robotics studied a variety of behaviours such as grasping [1], pushing [2], navigation [3], and other manipulation tasks [4]. However, their modeling assumptions limit their suitability for training general-purpose robots. Some authors proposed time-varying models, which require episodic setups. Others proposed an approach that uses goal images, but it requires instrumented training simulations. No prior work uses model-free RL to learn policies on real-world robotic systems without ground-truth information.<br />
<br />
In this paper, the authors utilize a goal-conditioned value function to tackle more general tasks through goal relabeling, which improves sample efficiency. Specifically, they use a model-free Q-learning method that operates on raw state observations and actions.<br />
<br />
Unsupervised learning has been used in a number of prior works to acquire better representations for RL. In these methods, the learned representation is used as a substitute for the state as input to the policy. However, these methods require additional information, such as access to the ground truth reward function based on the true state during training time [5], expert trajectories [6], human demonstrations [7], or pre-trained object-detection features [8]. In contrast, the authors learn to generate goals and use the learned representation to get a reward function for those goals without any of these extra sources of supervision.<br />
<br />
=Goal-Conditioned Reinforcement Learning=<br />
<br />
The ultimate goal in reinforcement learning is to learn a policy that, given a state and a goal, dictates the optimal action. In this paper, goals are not explicitly defined during training, so the agent must be able to generate a set of synthetic goals automatically. Thus, suppose we let an autonomous agent explore an environment with a random policy. After executing each action, state observations are collected and stored. These state observations take the form of images. The agent can randomly select goals from the set of state observations, and can also randomly select initial states from the same set.<br />
<br />
[[File:human-giving-goal.png|center|thumb|400px|The task: Make the world look like this image. [9]]]<br />
<br />
Now given a set of all possible states, a goal, and an initial state, a reinforcement learning framework can be used to find the optimal policy such that the value function is maximized. However, to implement such a framework, a reward function needs to be defined. One choice for the reward is the negative distance between the current state and the goal state, so that maximizing the reward corresponds to minimizing the distance to a goal state.<br />
<br />
In reinforcement learning, a goal-conditioned Q function can be used to find a single policy to maximize rewards and therefore reach goal states. A goal-conditioned Q function Q(s,a,g) tells us how good an action a is, given the current state s and goal g. For example, a Q function tells us, “How good is it to move my hand up (action a), if I’m holding a plate (state s) and want to put the plate on the table (goal g)?” Once this Q function is trained, a goal-conditioned policy can be obtained by performing the following optimization<br />
<br />
[[File:policy-extraction.png|center|600px]]<br />
<br />
which effectively says, “choose the best action according to this Q function.” By using this procedure, one can obtain a policy that maximizes the sum of rewards, i.e. reaches various goals.<br />
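<br />
The policy extraction above can be sketched in a few lines. This is a minimal illustration, assuming a small discrete action set and a hand-written stand-in for a trained Q function; the names <code>toy_q</code>, <code>ACTIONS</code> and <code>greedy_policy</code> are hypothetical and do not come from the paper.<br />

```python
import math
from typing import Callable, Sequence, Tuple

Vec = Tuple[float, float]


def toy_q(state: Vec, action: Vec, goal: Vec) -> float:
    """Stand-in for a trained Q(s, a, g): negative distance to the goal
    after applying the action as a displacement (an assumption)."""
    nxt = (state[0] + action[0], state[1] + action[1])
    return -math.hypot(goal[0] - nxt[0], goal[1] - nxt[1])


def greedy_policy(q: Callable[[Vec, Vec, Vec], float],
                  state: Vec, goal: Vec, actions: Sequence[Vec]) -> Vec:
    """pi(s, g) = argmax over a of Q(s, a, g), for a finite action set."""
    return max(actions, key=lambda a: q(state, a, goal))


# Four unit moves: right, left, up, down (hypothetical action set).
ACTIONS = [(1.0, 0.0), (-1.0, 0.0), (0.0, 1.0), (0.0, -1.0)]
best = greedy_policy(toy_q, (0.0, 0.0), (3.0, 0.0), ACTIONS)
print(best)
```

With the goal to the right of the state, the greedy choice is the move to the right. With continuous actions the argmax cannot be enumerated like this and has to be approximated, for instance by a learned policy network.<br />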
<br />
Q-learning is popular because it can be trained in an off-policy manner. Therefore, the only things the Q function needs are samples of state, action, next state, goal, and reward: (s,a,s′,g,r). This data can be collected by any policy and can be reused across multiple tasks. So a preliminary goal-conditioned Q-learning algorithm looks like this:<br />
<br />
[[File:ql.png|center|600px]]<br />
<br />
The main bottleneck in this training procedure is collecting data. In theory, if more data were available, one could learn to solve various tasks without even interacting with the world. Unfortunately, it is difficult to learn an accurate model of the world, so sampling is usually used to get state-action-next-state data, (s,a,s′). However, if the reward function r(s,g) can be accessed, one can retroactively relabel goals and recompute rewards. In this way, more data can be artificially generated from a single (s,a,s′) tuple. So, the training procedure can be modified as follows:<br />
<br />
[[File:qlr.png|center|600px]]<br />
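<br />
The relabeling idea can be illustrated with a short sketch. It assumes 2-D states, a reward equal to the negative Euclidean distance between the next state and the goal, and goals drawn uniformly from a candidate list; <code>relabel</code> and <code>reward</code> are hypothetical helper names, not the authors' code.<br />

```python
import math
import random
from typing import List, Sequence, Tuple

State = Tuple[float, float]
# (state, action, next_state, goal, reward)
Transition = Tuple[State, State, State, State, float]


def reward(next_state: State, goal: State) -> float:
    """Negative Euclidean distance to the goal (assumed reward)."""
    return -math.dist(next_state, goal)


def relabel(transition: Transition, candidate_goals: Sequence[State],
            k: int, rng: random.Random) -> List[Transition]:
    """Create k extra training tuples from one environment transition."""
    s, a, s_next, _old_goal, _old_reward = transition
    relabeled = []
    for g in rng.sample(list(candidate_goals), k):
        # Same (s, a, s') but a new goal and a recomputed reward.
        relabeled.append((s, a, s_next, g, reward(s_next, g)))
    return relabeled


original = ((0.0, 0.0), (1.0, 0.0), (1.0, 0.0), (5.0, 0.0), -4.0)
goals = [(1.0, 0.0), (2.0, 0.0), (1.0, 3.0)]
extra = relabel(original, goals, k=3, rng=random.Random(0))
for t in extra:
    print(t)
```

Each relabeled tuple reuses the same environment step, so no additional interaction with the environment is needed to obtain it.<br />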
<br />
This goal resampling makes it possible to learn how to reach multiple goals at once without needing more data from the environment. Thus, this simple modification can result in substantially faster learning. However, the method described above makes two major assumptions: (1) you have access to a reward function and (2) you have access to a goal sampling distribution p(g). When moving to vision-based tasks where goals are images, both of these assumptions introduce practical concerns.<br />
<br />
First, a fundamental problem with this reward function is that it assumes that the distance between raw images yields semantically useful information. Images are noisy, and a large amount of the information in an image may be unrelated to the object of interest. Thus, the distance between two images may not correlate with their semantic distance.<br />
<br />
Second, because the goals are images, a goal image distribution p(g) is needed so that one can sample goal images. Manually designing a distribution over goal images is a non-trivial task, and image generation is still an active field of research. It would be ideal if the agent could autonomously imagine its own goals and learn how to reach them.<br />
<br />
=Variational Autoencoder (VAE)=<br />
An autoencoder is a type of machine learning model that can learn to extract a robust, space-efficient feature vector from an image. A VAE is a generative variant that converts high-dimensional observations x, like images, into low-dimensional latent variables z, and vice versa. The model is trained so that the latent variables capture the underlying factors of variation in an image. A current image x and goal image xg can be converted into latent variables z and zg, respectively. These latent variables can then be used to represent the state and goal for the reinforcement learning algorithm. Learning Q functions and policies on top of this low-dimensional latent space rather than directly on images results in faster learning.<br />
<br />
[[File:robot-interpreting-scene.png|center|thumb|600px|The agent encodes the current image (x) and goal image (xg) into a latent space and uses distances in that latent space for reward. [9]]]<br />
<br />
Using the latent variable representations for the images and goals also solves the problem of computing rewards. Instead of using pixel-wise error as our reward, the distance in the latent space is used as the reward to train the agent to reach a goal. The paper shows that this corresponds to rewarding reaching states that maximize the probability of the latent goal zg.<br />
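<br />
A toy example can show why a latent distance is a more useful reward than a pixel distance. The sketch below assumes one-dimensional "images" containing a single bright pixel, and uses a hand-written <code>encode</code> function (the index of the brightest pixel) as a stand-in for the mean of a trained VAE encoder; all names are hypothetical.<br />

```python
from typing import List


def render(pos: int, width: int = 8) -> List[float]:
    """A 1-D 'image' with a single bright pixel at the object position."""
    return [1.0 if i == pos else 0.0 for i in range(width)]


def pixel_mse(x: List[float], y: List[float]) -> float:
    """Pixel-wise mean squared error between two images."""
    return sum((a - b) ** 2 for a, b in zip(x, y)) / len(x)


def encode(image: List[float]) -> float:
    """Stand-in for a trained encoder: index of the brightest pixel."""
    return float(max(range(len(image)), key=image.__getitem__))


def latent_reward(image: List[float], goal_image: List[float]) -> float:
    """Negative distance between the latent codes of image and goal."""
    return -abs(encode(image) - encode(goal_image))


goal = render(1)
near = render(2)  # object one pixel away from the goal position
far = render(7)   # object six pixels away from the goal position

print(pixel_mse(near, goal), pixel_mse(far, goal))
print(latent_reward(near, goal), latent_reward(far, goal))
```

The pixel error is identical for the near and far images, so it gives no signal about progress, while the latent reward is higher for the image that is closer to the goal.<br />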
<br />
This generative model is also important because it allows an agent to easily generate goals in the latent space. In particular, the authors design the generative model so that latent variables are sampled from the VAE prior. This sampling mechanism is used for two reasons. First, it provides a mechanism for an agent to set its own goals: the agent simply samples a value for the latent variable from the generative model and tries to reach that latent goal. Second, this resampling mechanism is also used to relabel goals as mentioned above. Since the VAE is trained on real images, meaningful latent goals can be sampled from the latent variable prior. This helps the agent set its own goals and practice towards them if no goal is provided at test time.<br />
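<br />
Sampling an imagined goal can be sketched as follows, assuming the usual VAE choice of a standard normal prior over the latent variables; the function name and the latent dimension are hypothetical.<br />

```python
import random
from typing import List


def sample_latent_goal(latent_dim: int, rng: random.Random) -> List[float]:
    """Draw z_g from an assumed standard normal prior N(0, I)."""
    return [rng.gauss(0.0, 1.0) for _ in range(latent_dim)]


rng = random.Random(0)
z_goal = sample_latent_goal(4, rng)
print(z_goal)
```

The same function could serve both uses described above: choosing a goal to practice at the start of an episode, and drawing replacement goals when relabeling stored transitions.<br />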
<br />
[[File:robot-imagining-goals.png|center|thumb|600px|Even without a human providing a goal, our agent can still generate its own goals, both for exploration and for goal relabeling. [9]]]<br />
<br />
The authors summarize the purpose of the latent variable representation of images as follows: it (1) captures the underlying factors of a scene, (2) provides meaningful distances to optimize, and (3) provides an efficient goal sampling mechanism which the agent can use to generate its own goals. The authors call the overall method reinforcement learning with imagined goals (RIG).<br />
The process starts with collecting data through a simple exploration policy. Alternative exploration strategies could be employed here, including off-the-shelf exploration bonuses or unsupervised reinforcement learning methods. Then, a VAE latent variable model is trained on state observations and fine-tuned during training. The latent variable model is used for multiple purposes: sampling a latent goal <math>z_g</math> from the model and conditioning the policy on this goal. All states and goals are embedded using the model’s encoder and then used to train the goal-conditioned value function. The authors then resample goals from the prior and compute rewards in the latent space.<br />
<br />
=Experiments=<br />
<br />
The authors evaluated their method against some prior algorithms and ablated versions of their approach on a suite of simulated and real-world tasks: Visual Reacher, Visual Pusher, and Visual Multi-Object Pusher. They compared their model with the following prior works: L&R, DSAE, HER, and Oracle. It is concluded that their approach substantially outperforms the previous methods and is close to the state-based "oracle" method in terms of efficiency and performance.<br />
<br />
They then investigated the effectiveness of distances in the VAE latent space for the Visual Pusher task. They observed that latent distance significantly outperforms the log probability and pixel mean-squared error. The resampling strategies were also varied while fixing the other components of the algorithm, to study the effect of the relabeling strategy. In this experiment, RIG, which uses an equal mixture of the VAE and Future sampling strategies, performs best. Subsequently, learning with variable numbers of objects was studied by evaluating on a task where the environment, based on the Visual Multi-Object Pusher, randomly contains zero, one, or two objects during testing. The results show that their model can tackle this task successfully.<br />
<br />
Finally, the authors tested RIG on a real-world robot for its ability to reach user-specified positions and push objects to desired locations, as indicated by a goal image. The robot is trained with access only to 84x84 RGB images and without access to joint angles or object positions. The robot first learns by setting its own goals in the latent space and autonomously practices reaching different positions without human involvement. After a reasonable amount of training time, the robot is given a goal image. Because the robot has practiced reaching so many goals, it is able to reach this goal without additional training:<br />
<br />
[[File:reaching.JPG|center|thumb|600px|(Left) The robot setup is pictured. (Right) Test rollouts of the learned policy.]]<br />
<br />
The method for reaching only needs 10,000 samples and an hour of real-world interactions.<br />
<br />
They also used RIG to train a policy to push objects to target locations:<br />
<br />
[[File:pushing.JPG|center|thumb|600px|The robot pushing setup is pictured, with frames from test rollouts of the learned policy.]]<br />
<br />
The pushing task is more complicated, and the method requires about 25,000 samples. Since the authors do not have the true position during training, they report test episode returns computed with the VAE latent distance reward.<br />
<br />
=Conclusion & Future Work=<br />
<br />
In this paper, a new RL algorithm is proposed to efficiently solve goal-conditioned, vision-based tasks without any ground truth state information or reward functions. The authors suggest that one could instead use other representations, such as language and demonstrations, to specify goals. Also, while the paper provides a mechanism to sample goals for autonomous exploration, one can combine the proposed method with existing work by choosing these goals in a more principled way to perform even better exploration. Lastly, there are a variety of robot tasks whose state representation would be difficult to capture with sensors, such as manipulating deformable objects or handling scenes with a variable number of objects. It would be interesting to see whether RIG can be scaled up to solve these tasks.<br />
<br />
=Critique=<br />
1. This paper is novel because it uses visual data and trains in an unsupervised fashion. The algorithm has no access to a ground truth state or to a pre-defined reward function. It can perform well in a real-world environment with no explicit programming.<br />
<br />
2. From the videos, one major concern is that the position of the robotic arm is not stable during training and test time. It is likely that the encoder compresses the image features so much that the images in the latent space are too blurry to be used as goal images. It would be good to investigate this in the future. <br />
<br />
3. The algorithm seems to perform better when there is only one object in the images. For example, in the Visual Multi-Object Pusher experiment, the relative positions of the two pucks do not correspond well with their relative positions in the goal images. The same situation is also observed in the Variable-object experiment. We may guess that the more information an image contains, the less likely the robot is to perform well. This limits the applicability of the current algorithm to solving real-world problems.<br />
<br />
4. The instability mentioned in #2 is even more apparent in the multi-object scenario, and appears to result from the model attempting to optimize the positions of both objects at the same time. Reducing the problem to a sequence of single-object targets may reduce the amount of time the robot spends moving between the multiple objects in the scene (which it currently does quite frequently). <br />
<br />
=References=<br />
1. Lerrel Pinto, Marcin Andrychowicz, Peter Welinder, Wojciech Zaremba, and Pieter Abbeel. Asymmetric Actor Critic for Image-Based Robot Learning. arXiv preprint arXiv:1710.06542, 2017.<br />
<br />
2. Pulkit Agrawal, Ashvin Nair, Pieter Abbeel, Jitendra Malik, and Sergey Levine. Learning to Poke by Poking: Experiential Learning of Intuitive Physics. In Advances in Neural Information Processing Systems (NIPS), 2016.<br />
<br />
3. Deepak Pathak, Parsa Mahmoudieh, Guanghao Luo, Pulkit Agrawal, Dian Chen, Yide Shentu, Evan Shelhamer, Jitendra Malik, Alexei A Efros, and Trevor Darrell. Zero-Shot Visual Imitation. In International Conference on Learning Representations (ICLR), 2018.<br />
<br />
4. Timothy P Lillicrap, Jonathan J Hunt, Alexander Pritzel, Nicolas Heess, Tom Erez, Yuval Tassa, David Silver, and Daan Wierstra. Continuous control with deep reinforcement learning. In International Conference on Learning Representations (ICLR), 2016.<br />
<br />
5. Irina Higgins, Arka Pal, Andrei A Rusu, Loic Matthey, Christopher P Burgess, Alexander Pritzel, Matthew Botvinick, Charles Blundell, and Alexander Lerchner. Darla: Improving zero-shot transfer in reinforcement learning. International Conference on Machine Learning (ICML), 2017.<br />
<br />
6. Aravind Srinivas, Allan Jabri, Pieter Abbeel, Sergey Levine, and Chelsea Finn. Universal Planning Networks. In International Conference on Machine Learning (ICML), 2018.<br />
<br />
7. Pierre Sermanet, Corey Lynch, Yevgen Chebotar, Jasmine Hsu, Eric Jang, Stefan Schaal, and Sergey Levine. Time-contrastive networks: Self-supervised learning from video. arXiv preprint arXiv:1704.06888, 2017.<br />
<br />
8. Alex Lee, Sergey Levine, and Pieter Abbeel. Learning Visual Servoing with Deep Features and Fitted Q-Iteration. In International Conference on Learning Representations (ICLR), 2017.<br />
<br />
9. Online source: https://bair.berkeley.edu/blog/2018/09/06/rig/</div>Mpaflahttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Unsupervised_Neural_Machine_Translation&diff=41159Unsupervised Neural Machine Translation2018-11-23T16:06:14Z<p>Mpafla: /* Methodology */</p>
<hr />
<div>This paper was published in ICLR 2018, authored by Mikel Artetxe, Gorka Labaka, Eneko Agirre, and Kyunghyun Cho.<br />
<br />
= Introduction =<br />
The paper presents an unsupervised Neural Machine Translation (NMT) method that uses only monolingual corpora, without any alignment between sentences or documents. Monolingual corpora are text corpora that are made up of one language only. This contrasts with the usual supervised NMT approach, which uses parallel corpora, where two corpora are direct translations of each other and the translations are aligned by words or sentences. This problem is important because NMT often requires large parallel corpora to achieve good results; in reality, however, many language pairs lack parallel data, e.g. German-Russian.<br />
<br />
Other authors have recently tried to address this problem with semi-supervised approaches, but these methods still require a strong cross-lingual signal. The proposed method eliminates the need for cross-lingual information, relying solely on monolingual data.<br />
<br />
The general approach of the methodology is to:<br />
<br />
# Use monolingual corpora in the source and target languages to learn source and target word embeddings.<br />
# Align the 2 sets of word embeddings in the same latent space.<br />
Then iteratively perform:<br />
# Train an encoder-decoder to reconstruct sentences from their noisy versions in both the source and target languages, where the encoder is shared and the decoder is different for each language.<br />
# Tune the decoder in each language by back-translating between the source and target language.<br />
<br />
= Background =<br />
<br />
===Word Embedding Alignment===<br />
<br />
The paper uses word2vec [Mikolov, 2013] to convert each monolingual corpus to vector embeddings. These embeddings have been shown to contain contextual and syntactic features independent of language, and so, in theory, there could exist a linear map that maps the embeddings from language L1 to language L2. <br />
<br />
Figure 1 shows an example of aligning the word embeddings in English and French.<br />
<br />
[[File:Figure1_lwali.png|frame|400px|center|Figure 1: the word embeddings in English and French (a & b), and (c) shows the aligned word embeddings after some linear transformation.[Gouws,2016]]]<br />
<br />
Most cross-lingual word embedding methods use bilingual signals in the form of parallel corpora. Usually, the embedding mapping methods train the embeddings in different languages using monolingual corpora, then use a linear transformation to map them into a shared space based on a bilingual dictionary.<br />
<br />
The paper uses the methodology proposed by [Artetxe, 2017] to align cross-lingual embeddings in an unsupervised manner and without parallel data. Without going into the details, the general approach of that method is to start from a seed dictionary of numeral pairings (e.g. 1-1, 2-2, etc.) and iteratively learn the mapping between the embeddings of the 2 languages, while concurrently improving the dictionary with the learned mapping at each iteration. <br />
<br />
===Other related work and inspirations===<br />
====STATISTICAL DECIPHERMENT FOR MACHINE TRANSLATION====<br />
There has been significant work on statistical decipherment techniques that induce a machine translation model from monolingual data, similar to the noisy-channel model used by SMT (Ravi & Knight, 2011; Dou & Knight, 2012). These techniques treat the source language as ciphertext and model the distribution of the ciphertext. This approach can take advantage of syntactic knowledge of the languages. It has also been shown that incorporating word embeddings improves statistical decipherment for machine translation.<br />
<br />
====LOW-RESOURCE NEURAL MACHINE TRANSLATION====<br />
There are also proposals that use techniques other than direct parallel corpora to do neural machine translation (NMT). Some use a third, intermediate language that is well connected to two other languages that otherwise have few direct resources. For example, to translate German into Russian when few direct resources exist for this pair, English can be used as an intermediate language (German-English and English-Russian), since there are plenty of resources connecting English with other languages. Johnson et al. (2017) show that a multilingual extension of a standard NMT architecture performs reasonably well even for language pairs for which no direct data was given.<br />
<br />
Other works use monolingual data in combination with scarce parallel corpora. One simple but effective approach is to create a synthetic parallel corpus by back-translating a monolingual corpus in the target language.<br />
<br />
The most important contribution to the problem of training an NMT model with monolingual data was from [He, 2016], which trains two agents to translate in opposite directions (e.g. French → English and English → French) and teach each other through reinforcement learning. However, this approach still requires a large parallel corpus for a warm start, while the present paper does not use parallel data.<br />
<br />
= Methodology =<br />
<br />
The corpus data is first processed in a standard way to tokenize and case the words. The authors also experiment with an additional way of translation using Byte-Pair Encoding (BPE) [Sennrich, 2016], where the translation is done by sub-words instead of words. BPE is often used to improve rare-word translations. To test the effectiveness of BPE, they limited the vocabulary to the most frequent 50,000 BPE tokens.<br />
<br />
The words or BPEs are then converted to word embeddings using word2vec with 300 dimensions and then aligned between languages using the method proposed by [Artetxe, 2017]. The alignment method proposed by [Artetxe, 2017] is also used as a baseline to evaluate this model as discussed later in Results.<br />
<br />
The translation model uses a standard encoder-decoder model with attention. The encoder is a 2-layer bidirectional RNN, and the decoder is a 2 layer RNN. All RNNs use GRU cells with 600 hidden units while the dimensionality of the embeddings is set to 300. The encoder is shared by the source and target language, while the decoder is different by language.<br />
<br />
Although the architecture uses standard models, the proposed system differs from the standard NMT through 3 aspects:<br />
<br />
#Dual structure: NMT systems are usually built for one translation direction, English<math>\rightarrow</math>French or French<math>\rightarrow</math>English, whereas the proposed model trains both directions at the same time, translating English<math>\leftrightarrow</math>French.<br />
#Shared encoder: one encoder is shared for both source and target languages in order to produce a representation in the latent space independent of language, and each decoder learns to transform the representation back to its corresponding language. <br />
#Fixed embeddings in the encoder: Most NMT systems initialize the embeddings and update them during training, whereas the proposed system trains the embeddings in the beginning and keeps these fixed throughout training, so the encoder receives language-independent representations of the words. This requires existing unsupervised methods to create embeddings using monolingual corpora as discussed in the background.<br />
<br />
[[File:Figure2_lwali.png|600px|center]]<br />
<br />
The translation model iteratively improves the encoder and decoder by performing 2 tasks: Denoising, and Back-translation.<br />
<br />
===Denoising===<br />
<br />
Random noise is added to the input sentences in order to allow the model to learn some structure of languages. Without noise, the model would simply learn to copy the input word by word. Noise also allows the shared encoder to compose the embeddings of both languages in a language-independent fashion, and then be decoded by the language-dependent decoder.<br />
<br />
Denoising reconstructs the original sentence from a noisy version of it in the same language. In mathematical form, if <math>x</math> is a sentence in language L1:<br />
<br />
# Construct <math>C(x)</math>, noisy version of <math>x</math>,<br />
# Input <math>C(x)</math> into the current iteration of the shared encoder and use decoder for L1 to get reconstructed <math>\hat{x}</math>.<br />
<br />
The training objective is to minimize the cross entropy loss between <math>{x}</math> and <math>\hat{x}</math>.<br />
<br />
In other words, the whole system is optimized to take an input sentence in a given language, encode it using the shared encoder, and reconstruct the original sentence using the decoder of that language.<br />
<br />
The proposed noise function is to perform <math>N/2</math> random swaps of words that are near each other, where <math>N</math> is the number of words in the sentence.<br />
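<br />
The noise function can be sketched as below. This is a minimal illustration that assumes "near each other" means adjacent positions and that <math>N/2</math> is rounded down; <code>add_swap_noise</code> is a hypothetical name, not taken from the paper's code.<br />

```python
import random
from typing import List, Sequence


def add_swap_noise(tokens: Sequence[str], rng: random.Random) -> List[str]:
    """Return C(x): a copy of x after N // 2 random adjacent swaps."""
    noisy = list(tokens)
    n = len(noisy)
    for _ in range(n // 2):
        # Pick a random position and swap it with its right neighbour.
        i = rng.randrange(n - 1)
        noisy[i], noisy[i + 1] = noisy[i + 1], noisy[i]
    return noisy


sentence = "the cat sat on the mat".split()
noisy_sentence = add_swap_noise(sentence, random.Random(0))
print(noisy_sentence)
```

The noisy sentence contains exactly the same words as the original, only in a locally perturbed order, so the model has to learn about word order in order to undo the noise.<br />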
<br />
===Back-Translation===<br />
<br />
With only denoising, the system has no objective that improves the actual translation. Back-translation works by using the decoder of the target language to create a translation, then encoding this translation and decoding it again with the source-language decoder to reconstruct the original sentence. In mathematical form, if <math>C(x)</math> is a noisy version of sentence <math>x</math> in language L1:<br />
<br />
# Input <math>C(x)</math> into the current iteration of shared encoder and the decoder in L2 to construct translation <math>y</math> in L2,<br />
# Construct <math>C(y)</math>, noisy version of translation <math>y</math>,<br />
# Input <math>C(y)</math> into the current iteration of shared encoder and the decoder in L1 to reconstruct <math>\hat{x}</math> in L1.<br />
<br />
The training objective is to minimize the cross entropy loss between <math>{x}</math> and <math>\hat{x}</math>.<br />
<br />
Contrary to standard back-translation that uses an independent model to back-translate the entire corpus at one time, the system uses mini-batches and the dual architecture to generate pseudo-translations and then train the model with the translation, improving the model iteratively as the training progresses.<br />
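<br />
The data flow of one back-translation step can be illustrated with a toy sketch. All callables below are hypothetical placeholders: word-for-word dictionaries stand in for the shared encoder and the two decoders, and the noise function is the identity. This is only meant to show the order of operations, not the authors' implementation.<br />
<br />
```python
def backtranslation_pair(x, encode, decode_l1, decode_l2, noise):
    """Return (y, x_hat) for one L1 sentence x.

    The four callables are hypothetical stand-ins for the shared encoder,
    the two language-specific decoders and the noise function C.
    """
    y = decode_l2(encode(noise(x)))      # step 1: translate C(x) into L2
    x_hat = decode_l1(encode(noise(y)))  # steps 2-3: build C(y), reconstruct in L1
    return y, x_hat


# Toy stand-ins: L1 is English, L2 is French, dictionaries replace the networks.
EN_TO_FR = {"the": "le", "cat": "chat"}
FR_TO_EN = {fr: en for en, fr in EN_TO_FR.items()}


def to_concepts(words):
    # "Shared encoder": map words of either language to English-keyed concepts.
    return [FR_TO_EN.get(w, w) for w in words]


def to_french(concepts):
    return [EN_TO_FR.get(c, c) for c in concepts]


def to_english(concepts):
    return list(concepts)


y, x_hat = backtranslation_pair(
    ["the", "cat"], to_concepts, to_english, to_french, lambda s: list(s)
)
```
<br />
In the real system, the loss would be the cross entropy between <math>x</math> and <math>\hat{x}</math>, and the pair <math>(y, x)</math> acts as a pseudo-parallel training example.<br />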
<br />
===Training===<br />
<br />
Training is done by alternating these 2 objectives from mini-batch to mini-batch. Each iteration would perform one mini-batch of denoising for L1, another one for L2, one mini-batch of back-translation from L1 to L2, and another one from L2 to L1. The procedure is repeated until convergence. <br />
Greedy decoding was used at training time for back-translation, while inference at test time was done using beam search with a beam size of 12.<br />
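<br />
The alternation between objectives can be sketched as the loop below. The two step functions are hypothetical placeholders for one optimizer update on the corresponding objective, and a fixed number of iterations is assumed here in place of a convergence check.<br />
<br />
```python
def train_schedule(batches_l1, batches_l2, denoise_step, backtranslate_step,
                   num_iterations):
    """Alternate the four objectives, one mini-batch each per iteration.

    `denoise_step` and `backtranslate_step` are hypothetical callables that
    perform one parameter update and return the loss for that mini-batch.
    """
    losses = []
    for it in range(num_iterations):
        x1 = batches_l1[it % len(batches_l1)]  # mini-batch of L1 sentences
        x2 = batches_l2[it % len(batches_l2)]  # mini-batch of L2 sentences
        losses.append(denoise_step(x1, lang="L1"))
        losses.append(denoise_step(x2, lang="L2"))
        losses.append(backtranslate_step(x1, src="L1", tgt="L2"))
        losses.append(backtranslate_step(x2, src="L2", tgt="L1"))
    return losses


# Usage with stub step functions that only record the call order.
calls = []


def fake_denoise(batch, lang):
    calls.append(("denoise", lang))
    return 0.0


def fake_backtranslate(batch, src, tgt):
    calls.append(("backtranslate", src, tgt))
    return 0.0


history = train_schedule(["batch a"], ["batch b"], fake_denoise,
                         fake_backtranslate, num_iterations=2)
```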
<br />
Optimizer choice and other hyperparameters can be found in the paper.<br />
<br />
=Experiments and Results=<br />
<br />
The model is evaluated using the Bilingual Evaluation Understudy (BLEU) score, which is commonly used to measure the quality of a translation against a reference (ground-truth) translation.<br />
<br />
The paper trains the translation model under 3 different settings to compare performance (Table 1). All training and testing data came from a standard NMT dataset, WMT'14.<br />
<br />
[[File:Table1_lwali.png|600px|center]]<br />
<br />
===Unsupervised===<br />
<br />
The model only has access to monolingual corpora, using the News Crawl corpus with articles from 2007 to 2013. The baseline for unsupervised is the method proposed by [Artetxe, 2017], which was the unsupervised word vector alignment method discussed in the Background section.<br />
<br />
The paper adds each component piece-wise during evaluation to test the impact each piece has on the final score. As shown in Table 1, the unsupervised results are strong compared to the word-by-word baseline, with improvements between 40% and 140%. The results also show that back-translation is essential. Denoising alone doesn't show a big improvement; however, it is required for back-translation, because otherwise back-translation would translate nonsensical sentences.<br />
<br />
For the BPE experiment, results show that it helps for some language pairs but detracts for others. This is because, while BPE helped to translate some rare words, it increased the error rate on other words.<br />
<br />
===Semi-supervised===<br />
<br />
Since there is often some small parallel data but not enough to train a Neural Machine Translation system, the authors test a semi-supervised setting with the same monolingual data from the unsupervised settings together with either 10,000 or 100,000 random sentence pairs from the News Commentary parallel corpus. The supervision is included to improve the model during the back-translation stage to directly predict sentences that are in the parallel corpus.<br />
<br />
Table 1 shows that the model can greatly benefit from the addition of a small parallel corpus to the monolingual corpora. Surprisingly, the semi-supervised system in row 6 outperforms the supervised system in row 7. One possible explanation is that both the semi-supervised training set and the test set belong to the news domain, whereas the supervised training set covers corpora from all domains.<br />
<br />
===Supervised===<br />
<br />
This setting provides an upper bound for the proposed unsupervised system. The data used was the combination of all parallel corpora provided at WMT 2014, which includes Europarl, Common Crawl and News Commentary for both language pairs, plus the UN and the Gigaword corpus for French-English. Moreover, the authors use the same subsets of News Commentary alone to run separate experiments in order to compare with the semi-supervised scenario.<br />
<br />
The Comparable NMT was trained using the same proposed model, except that it does not use monolingual corpora and consequently was trained without denoising and back-translation. The proposed model under a supervised setting does much worse than the state-of-the-art NMT in row 10, which suggests that the additional constraints that enable unsupervised learning also limit the potential performance. To improve these results, the authors suggest using larger models, longer training times, and several well-known NMT techniques.<br />
<br />
===Qualitative Analysis===<br />
<br />
[[File:Table2_lwali.png|600px|center]]<br />
<br />
Table 2 shows 4 examples of French to English translations, which indicate that the proposed system produces high-quality translations and adequately models non-trivial translation relations. Examples 1 and 2 show that the model is able not only to go beyond a literal word-by-word substitution but also to model structural differences between the languages (e.g., it correctly translates "l’aeroport international de Los Angeles" as "Los Angeles International Airport"), and that it is capable of producing high-quality translations of long and more complex sentences. However, in Examples 3 and 4, the system fails to translate the months and numbers correctly and has difficulty comprehending odd sentence structures, which shows that the proposed system has limitations. In particular, the authors point out that the proposed model has difficulty preserving some concrete details from source sentences.<br />
<br />
=Conclusions and Future Work=<br />
<br />
The paper presented an unsupervised model to perform translations with monolingual corpora by using an attention-based encoder-decoder system trained with denoising and back-translation.<br />
<br />
Although experimental results show that the proposed model is effective as an unsupervised approach, there is significant room for improvement when using the model in a supervised way, suggesting the model is limited by the architectural modifications. Some ideas for future improvement include:<br />
*Instead of using fixed cross-lingual word embeddings at the beginning which forces the encoder to learn a common representation for both languages, progressively update the weight of the embeddings as training progresses.<br />
*Decouple the shared encoder into 2 independent encoders at some point during training<br />
*Progressively reduce the noise level<br />
*Incorporate character-level information into the model, which might help address some of the adequacy issues observed in the authors' manual analysis<br />
*Use other noise/denoising techniques, and analyze their effect in relation to the typological divergences of different language pairs.<br />
<br />
= Critique =<br />
<br />
While the idea is interesting and the results are impressive for an unsupervised approach, much of the model had actually already been proposed by other papers that are referenced. The paper doesn't add a lot of new ideas but only builds on existing techniques and combines them in a different way to achieve good experimental results. The paper is not a significant algorithmic contribution. <br />
<br />
The results showed that the proposed system performed far worse than the state of the art when used in a supervised setting, which is concerning and shows that the techniques used create a limitation and a ceiling for performance.<br />
<br />
Additionally, there was no rigorous hyperparameter exploration/optimization for the model. As a result, it is difficult to conclude whether the performance limit observed in the constrained supervised model is the absolute limit, or whether this could be overcome in both supervised/unsupervised models with the right constraints to achieve more competitive results. <br />
<br />
The best results shown are for two very closely related languages (English and French), and the model does much worse for English-German, even though English and German are also closely related (though less so than English and French). This suggests that the model may not be successful at translating between distant language pairs. More testing on such pairs would be interesting to see.<br />
<br />
The results comparison could have shown how the semi-supervised version of the model scores compared to other semi-supervised approaches as touched on in the other works section.<br />
<br />
* (As pointed out by an anonymous reviewer [https://openreview.net/forum?id=Sy2ogebAW]) Future work is vague: “we would like to detect and mitigate the specific causes…” “We also think that a better handling of rare words…” That’s great, but how will you do these things? Do you have specific reasons to think this, or ideas on how to approach them? Otherwise, this is just hand-waving.<br />
<br />
= References =<br />
#'''[Mikolov, 2013]''' Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. "Distributed representations of words and phrases and their compositionality."<br />
#'''[Artetxe, 2017]''' Mikel Artetxe, Gorka Labaka, Eneko Agirre, "Learning bilingual word embeddings with (almost) no bilingual data".<br />
#'''[Gouws, 2016]''' Stephan Gouws, Yoshua Bengio, Greg Corrado, "BilBOWA: Fast Bilingual Distributed Representations without Word Alignments."<br />
#'''[He, 2016]''' Di He, Yingce Xia, Tao Qin, Liwei Wang, Nenghai Yu, Tieyan Liu, and Wei-Ying Ma. "Dual learning for machine translation."<br />
#'''[Sennrich, 2016]''' Rico Sennrich, Barry Haddow, and Alexandra Birch, "Neural Machine Translation of Rare Words with Subword Units."<br />
#'''[Ravi & Knight, 2011]''' Sujith Ravi and Kevin Knight, "Deciphering foreign language."<br />
#'''[Dou & Knight, 2012]''' Qing Dou and Kevin Knight, "Large scale decipherment for out-of-domain machine translation."<br />
#'''[Johnson et al. 2017]''' Melvin Johnson,et al, "Google’s multilingual neural machine translation system: Enabling zero-shot translation."</div>Mpaflahttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=a_neural_representation_of_sketch_drawings&diff=41151a neural representation of sketch drawings2018-11-23T14:38:11Z<p>Mpafla: /* Experiments */</p>
<hr />
<div><br />
== Introduction ==<br />
In this paper, the authors present a recurrent neural network, sketch-rnn, that can be used to construct stroke-based drawings. Besides new robust training methods, they also outline a framework for conditional and unconditional sketch generation.<br />
<br />
Neural networks have been heavily used as image generation tools. For example, Generative Adversarial Networks, Variational Inference, and Autoregressive models have been used. Most of those models are designed to generate pixels to construct images. However, people learn to draw using sequences of strokes, beginning when they are young. The authors propose a new generative model that creates vector images so that it might generalize abstract concepts in a manner more similar to how humans do. <br />
<br />
The model is trained with hand-drawn sketches as input sequences. The model is able to produce sketches in vector format. In the conditional generation model, they also explore the latent space representation for vector images and discuss a few future applications of this model. The model and dataset are now available as an open source project ([https://magenta.tensorflow.org/sketch_rnn link]).<br />
<br />
=== Terminology ===<br />
Pixel images, also referred to as raster or bitmap images, are files that encode image data as a set of pixels. These are the most common image type, with extensions such as .png, .jpg, and .bmp.<br />
<br />
Vector images are files that encode image data as paths between points. SVG and EPS file types are used to store vector images. <br />
<br />
For a visual comparison of raster and vector images, see this [https://www.youtube.com/watch?v=-Fs2t6P5AjY video]. As mentioned, vector images are generally simpler and more abstract, whereas raster images generally are used to store detailed images. <br />
<br />
For this paper, the important distinction between the two is that the encoding of images in the model will be inherently more abstract because of the vector representation. The intuition is that generating abstract representations is more effective using a vector representation. <br />
<br />
== Related Work ==<br />
Some earlier works used a similar approach to generate images, such as Portrait Drawing by Paul the Robot and some reinforcement learning approaches, but they work more like mimics of digitized photographs. There are also some neural-network-based approaches, but those mostly deal with pixel images. Little work has been done on vector image generation. There are models that use Hidden Markov Models or Mixture Density Networks to generate human sketches, continuous data points or vectorized Kanji characters.<br />
<br />
The model also allows us to explore the latent space representation of vector images. There are previous works that achieved similar functions as well, such as combining Sequence-to-Sequence models with Variational Autoencoder to model sentences into latent space and using probabilistic program induction to model Omniglot dataset.<br />
<br />
The dataset they use contains 50 million vector sketches. Before this paper, there were the Sketch dataset with 20k vector sketches, the Sketchy dataset with 70k vector sketches along with pixel images, and the ShadowDraw system, which used 30k raster images along with extracted vectorized features. They are all comparatively small.<br />
<br />
== Methodology ==<br />
=== Dataset ===<br />
QuickDraw is a dataset of 50 million vector drawings collected through the online game Quick, Draw!, in which players are asked to draw objects belonging to a particular object class in less than 20 seconds. It contains hundreds of classes; each class has 70k training samples, 2.5k validation samples and 2.5k test samples.<br />
<br />
Each sample represents a sequence of pen stroke actions. The origin is the initial coordinate of the drawing. A sketch is a list of points, and each point consists of 5 elements <math> (\Delta x, \Delta y, p_{1}, p_{2}, p_{3})</math>, where <math>\Delta x</math> and <math>\Delta y</math> are the offsets in the x and y directions from the previous point. The parameters <math>p_{1}, p_{2}, p_{3}</math> are a binary one-hot representation of three possible pen states: <math>p_{1}</math> indicates that the pen is touching the paper, <math>p_{2}</math> indicates that the pen will be lifted after this point, and <math>p_{3}</math> indicates that the drawing has ended.<br />
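<br />
As an illustration of this format, the following sketch converts a drawing given as strokes of absolute coordinates into the 5-element representation. The input layout (a list of strokes, each a list of absolute points) and the function name <code>to_stroke5</code> are assumptions made for this example, not the dataset's actual API.<br />
<br />
```python
def to_stroke5(strokes):
    """Convert a non-empty list of strokes into rows (dx, dy, p1, p2, p3).

    Assumed input: each stroke is a non-empty list of absolute (x, y) points.
    """
    rows = []
    prev_x, prev_y = strokes[0][0]  # the origin is the first point of the drawing
    for stroke in strokes:
        for k, (x, y) in enumerate(stroke):
            last_in_stroke = (k == len(stroke) - 1)
            # p1 = pen stays on the paper, p2 = pen is lifted after this point
            p1, p2 = (0, 1) if last_in_stroke else (1, 0)
            rows.append((x - prev_x, y - prev_y, p1, p2, 0))
            prev_x, prev_y = x, y
    rows.append((0, 0, 0, 0, 1))  # p3 = 1 marks the end of the drawing
    return rows


# Two strokes: an "L" shape, then a short horizontal line elsewhere.
example = to_stroke5([[(0, 0), (1, 0), (1, 2)], [(3, 3), (4, 3)]])
```
<br />
Because only offsets are stored, the same rows describe the drawing regardless of where on the canvas it started.<br />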
<br />
=== Sketch-RNN ===<br />
[[File:sketchfig2.png|700px|center]]<br />
<br />
The model is a Sequence-to-Sequence Variational Autoencoder (VAE). The encoder is a bidirectional RNN whose inputs are a sketch sequence denoted by <math>S</math> and the same sequence in reverse order, denoted by <math>S_{reverse}</math>, so there are two final hidden states. The output is a latent vector of size <math>N_{z}</math>.<br />
<br />
\begin{align*}<br />
h_{ \rightarrow} = encode_{ \rightarrow }(S), <br />
h_{ \leftarrow} = encode_{ \leftarrow }(S_{reverse}), <br />
h = [h_{\rightarrow}; h_{\leftarrow}].<br />
\end{align*}<br />
<br />
Then the authors project <math>h</math> into <math>\mu</math> and <math>\hat{\sigma}</math>, both vectors of size <math>N_{z}</math>, using a fully connected layer. Using the exponential function, they convert <math>\hat{\sigma}</math> into a non-negative standard deviation parameter denoted by <math>\sigma</math>. They then use <math>\mu</math> and <math>\sigma</math> together with a sample from <math>\mathcal{N}(0,I)</math> to construct a random vector <math>z\in\mathbb{R}^{N_{z}}</math>.<br />
<br />
\begin{align*}<br />
\mu = W_\mu h + b_\mu, <br />
\hat \sigma = W_\sigma h + b_\sigma, <br />
\sigma = \exp\left( \frac{\hat \sigma}{2}\right), <br />
z = \mu + \sigma \odot \mathcal{N}(0,I).<br />
\end{align*}<br />
<br />
<br />
Note that <math>z</math> is not deterministic but a random vector conditioned on the input sketch.<br />
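<br />
The sampling of <math>z</math> can be sketched as follows, using plain Python lists in place of tensors. The function name <code>sample_latent</code> is hypothetical; <code>mu</code> and <code>sigma_hat</code> are assumed to be the outputs of the fully connected layer described above.<br />
<br />
```python
import math
import random


def sample_latent(mu, sigma_hat, rng):
    """Sample z = mu + sigma * eps with sigma = exp(sigma_hat / 2), eps ~ N(0, I)."""
    sigma = [math.exp(s / 2.0) for s in sigma_hat]  # always non-negative
    eps = [rng.gauss(0.0, 1.0) for _ in mu]         # one standard normal draw per dimension
    z = [m + s * e for m, s, e in zip(mu, sigma, eps)]
    return z, sigma


# With sigma_hat = 0 the standard deviation is exactly 1 in every dimension.
z, sigma = sample_latent([0.5, -1.0], [0.0, 0.0], random.Random(0))
```
<br />
Writing the sample as a deterministic function of <math>\mu</math>, <math>\sigma</math> and independent noise is what allows gradients to flow back into the encoder during training.<br />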
<br />
The decoder is an autoregressive RNN. The initial hidden states are generated using <math>[h_0;c_0] = \tanh(W_z z+b_z)</math>, where the cell state <math>c_0</math> is used if applicable. <math>S_0</math> is defined as <math>(0,0,1,0,0)</math>. At each step <math>i</math> of the decoder, the input <math>x_i</math> is the concatenation of the previous point <math>S_{i-1}</math> and the latent vector <math>z</math>. The outputs are the parameters of a probability distribution for the next data point <math>S_i</math>. The authors model <math>(\Delta x,\Delta y)</math> as a Gaussian mixture model (GMM) with <math>M</math> normal distributions and model the ground truth data <math>(p_1, p_2, p_3)</math> as a categorical distribution <math>(q_1, q_2, q_3)</math>, where <math>q_1, q_2\ \text{and}\ q_3</math> sum to 1. The generated sequence is conditioned on the latent vector <math>z</math> sampled from the encoder, which is trained end-to-end together with the decoder.<br />
<br />
\begin{align*}<br />
p(\Delta x, \Delta y) = \sum_{j=1}^{M} \Pi_j \mathcal{N}(\Delta x,\Delta y | \mu_{x,j}, \mu_{y,j}, \sigma_{x,j},\sigma_{y,j}, \rho _{xy,j}), \quad \text{where } \sum_{j=1}^{M}\Pi_j = 1<br />
\end{align*}<br />
<br />
Here <math>\mathcal{N}(\Delta x,\Delta y | \mu_{x,j}, \mu_{y,j}, \sigma_{x,j},\sigma_{y,j}, \rho _{xy,j})</math> is the probability density function of a bivariate normal distribution over <math>(\Delta x, \Delta y)</math>. Each of the <math>M</math> bivariate normal distributions being summed has five parameters: <math>(\mu_x, \mu_y, \sigma_x, \sigma_y, \rho_{xy})</math>, where <math>\mu_x, \mu_y</math> are the means, <math>\sigma_x, \sigma_y</math> are the standard deviations, and <math>\rho_{xy}</math> is the correlation parameter. The vector <math>\Pi</math>, a categorical distribution of length <math>M</math>, contains the mixture weights of the Gaussian mixture model.<br />
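<br />
For concreteness, the mixture density can be evaluated as in the sketch below, which uses the standard formula for the bivariate normal density. The function names and the tuple layout of a mixture component are assumptions made for this example.<br />
<br />
```python
import math


def bivariate_normal_pdf(dx, dy, mu_x, mu_y, sigma_x, sigma_y, rho):
    """Density of a bivariate normal with correlation rho, evaluated at (dx, dy)."""
    zx = (dx - mu_x) / sigma_x
    zy = (dy - mu_y) / sigma_y
    one_minus_rho2 = 1.0 - rho * rho
    quad = zx * zx - 2.0 * rho * zx * zy + zy * zy
    norm = 2.0 * math.pi * sigma_x * sigma_y * math.sqrt(one_minus_rho2)
    return math.exp(-quad / (2.0 * one_minus_rho2)) / norm


def gmm_pdf(dx, dy, components):
    """Mixture density; each component is (weight, mu_x, mu_y, sigma_x, sigma_y, rho)."""
    return sum(w * bivariate_normal_pdf(dx, dy, *params)
               for w, *params in components)


# Two identical components with weights 0.5 give the single-component density.
density = gmm_pdf(0.0, 0.0, [(0.5, 0.0, 0.0, 1.0, 1.0, 0.0),
                             (0.5, 0.0, 0.0, 1.0, 1.0, 0.0)])
```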
<br />
The output vector <math>y_i</math> is generated by applying a fully connected layer to the hidden state of the RNN.<br />
<br />
\begin{align*}<br />
x_i = [S_{i-1}; z], <br />
[h_i; c_i] = forward(x_i,[h_{i-1}; c_{i-1}]), <br />
y_i = W_y h_i + b_y,<br />
y_i \in \mathbb{R}^{6M+3}.<br />
\end{align*}<br />
<br />
The output consists of the parameters of the probability distribution of the next data point.<br />
<br />
\begin{align*}<br />
[(\hat\Pi\ \mu_x\ \mu_y\ \hat\sigma_x\ \hat\sigma_y\ \hat\rho_{xy})_1\ (\hat\Pi\ \mu_x\ \mu_y\ \hat\sigma_x\ \hat\sigma_y\ \hat\rho_{xy})_2\ \ldots\ (\hat\Pi\ \mu_x\ \mu_y\ \hat\sigma_x\ \hat\sigma_y\ \hat\rho_{xy})_M\ (\hat{q_1}\ \hat{q_2}\ \hat{q_3})] = y_i<br />
\end{align*}<br />
<br />
<math>\exp</math> and <math>\tanh</math> operations are applied to ensure that the standard deviations are non-negative and the correlation value is between -1 and 1.<br />
<br />
\begin{align*}<br />
\sigma_x = \exp (\hat \sigma_x),\ <br />
\sigma_y = \exp (\hat \sigma_y),\ <br />
\rho_{xy} = \tanh(\hat \rho_{xy}). <br />
\end{align*}<br />
<br />
Categorical distribution probabilities for <math>(p_1, p_2, p_3)</math> using <math>(q_1, q_2, q_3)</math> can be obtained as :<br />
<br />
\begin{align*}<br />
q_k = \frac{\exp{(\hat q_k)}}{ \sum\nolimits_{j = 1}^{3} \exp {(\hat q_j)}},<br />
k \in \left\{1,2,3\right\}, <br />
\Pi _k = \frac{\exp{(\hat \Pi_k)}}{ \sum\nolimits_{j = 1}^{M} \exp {(\hat \Pi_j)}},<br />
k \in \left\{1,...,M\right\}.<br />
\end{align*}<br />
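<br />
Putting the last few equations together, the raw output vector <math>y_i</math> could be turned into valid distribution parameters as sketched below. The ordering of the entries inside <math>y_i</math> (six values per mixture component, followed by the three pen logits) follows the layout written above but is otherwise an assumption, as are the function names.<br />
<br />
```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def parse_output(y, num_mixtures):
    """Split y (length 6M + 3) into GMM components and pen-state probabilities."""
    assert len(y) == 6 * num_mixtures + 3
    groups = [y[6 * j: 6 * j + 6] for j in range(num_mixtures)]
    weights = softmax([g[0] for g in groups])  # mixture weights sum to 1
    components = []
    for w, g in zip(weights, groups):
        _, mu_x, mu_y, sx_hat, sy_hat, rho_hat = g
        components.append((w, mu_x, mu_y,
                           math.exp(sx_hat),      # sigma_x >= 0
                           math.exp(sy_hat),      # sigma_y >= 0
                           math.tanh(rho_hat)))   # -1 < rho < 1
    q = softmax(y[6 * num_mixtures:])              # pen-state probabilities
    return components, q


components, q = parse_output([0.0] * 15, num_mixtures=2)
```
<br />
With an all-zero output vector, both mixture components get weight 0.5, unit standard deviations and zero correlation, and the three pen states are equally likely.<br />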
<br />
It is hard for the model to decide when to stop drawing because the probabilities of the three events <math>(p_1, p_2, p_3)</math> are very unbalanced. Researchers in the past have used different weights for each pen event probability, but the authors found this approach inelegant and inadequate. Instead, they define a hyperparameter <math>N_{max}</math>, the length of the longest sketch in the training set, and set <math>S_i</math> to <math>(0, 0, 0, 0, 1)</math> for <math>i > N_s</math>, where <math>N_s</math> is the length of the sketch <math>S</math>.<br />
<br />
The sample <math>S_i^{'}</math> is generated at each time step during the sampling process and fed as input to the next time step. The process stops when <math>p_3 = 1</math> or <math>i = N_{max}</math>. The output is not deterministic but a random sequence conditioned on the latent vector <math>z</math>. The level of randomness can be controlled using a temperature parameter <math>\tau</math>.<br />
<br />
\begin{align*}<br />
\hat q_k \rightarrow \frac{\hat q_k}{\tau}, <br />
\hat \Pi_k \rightarrow \frac{\hat \Pi_k}{\tau}, <br />
\sigma_x^2 \rightarrow \sigma_x^2\tau, <br />
\sigma_y^2 \rightarrow \sigma_y^2\tau. <br />
\end{align*}<br />
<br />
The temperature <math>\tau</math> ranges from 0 to 1. As <math>\tau \rightarrow 0</math> the output becomes deterministic, as the sample consists of the points at the peak of the probability density function.<br />
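<br />
The effect of the temperature can be sketched as follows. Scaling the variance by <math>\tau</math> corresponds to scaling the standard deviation by <math>\sqrt{\tau}</math>; the function name and argument layout are assumptions for this example, and <math>\tau</math> is assumed to be strictly positive.<br />
<br />
```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def apply_temperature(pi_hat, q_hat, sigma_x, sigma_y, tau):
    """Rescale logits and standard deviations by a temperature tau > 0."""
    pi = softmax([v / tau for v in pi_hat])  # sharper mixture weights for small tau
    q = softmax([v / tau for v in q_hat])    # sharper pen-state probabilities
    scale = math.sqrt(tau)                   # sigma^2 -> sigma^2 * tau
    return pi, q, sigma_x * scale, sigma_y * scale


pi_cold, q_cold, sx_cold, sy_cold = apply_temperature(
    [1.0, 0.0], [0.0, 0.0, 0.0], 2.0, 2.0, tau=0.25)
pi_unit, _, sx_unit, _ = apply_temperature(
    [1.0, 0.0], [0.0, 0.0, 0.0], 2.0, 2.0, tau=1.0)
```
<br />
Lowering <math>\tau</math> concentrates probability mass on the most likely mixture component and shrinks the spread around its mean, which is why low-temperature samples look more regular.<br />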
<br />
=== Unconditional Generation ===<br />
A special case is when only the decoder RNN module is trained. The decoder RNN can work as a standalone autoregressive model without latent variables. In this case, the initial states are 0 and the input <math>x_i</math> is only <math>S_{i-1}</math> or <math>S_{i-1}^{'}</math>. Figure 3 shows sketches generated unconditionally with the temperature parameter varying from τ = 0.2 at the top (in blue) to τ = 0.9 at the bottom (in red).<br />
<br />
[[File:sketchfig3.png|700px|center]]<br />
<br />
=== Training ===<br />
The training process is the same as for a Variational Autoencoder. The loss function is the sum of the Reconstruction Loss <math>L_R</math> and the Kullback-Leibler Divergence Loss <math>L_{KL}</math>. The reconstruction loss <math>L_R</math> is computed from the generated pdf parameters and the training data <math>S</math>. It is the sum of <math>L_s</math> and <math>L_p</math>, which are the log losses of the offsets <math>(\Delta x, \Delta y)</math> and the pen states <math>(p_1, p_2, p_3)</math>, respectively.<br />
<br />
\begin{align*}<br />
L_s = - \frac{1 }{N_{max}} \sum_{i = 1}^{N_s} \log(\sum_{j = 1}^{M} \Pi_{j,i} \mathcal{N}(\Delta x_i,\Delta y_i | \mu_{x,j,i}, \mu_{y,j,i}, \sigma_{x,j,i},\sigma_{y,j,i}, \rho _{xy,j,i})), <br />
\end{align*}<br />
\begin{align*}<br />
L_p = - \frac{1 }{N_{max}} \sum_{i = 1}^{N_{max}} \sum_{k = 1}^{3} p_{k,i} \log (q_{k,i}), <br />
L_R = L_s + L_p.<br />
\end{align*}<br />
<br />
<br />
Both terms are normalized by <math>N_{max}</math>.<br />
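<br />
As a small illustration of the pen-state term, <math>L_p</math> can be computed as below for a sequence that has already been padded to length <math>N_{max}</math>. The function name and the list-of-tuples layout are assumptions for this example; <math>L_s</math> would be computed analogously from the mixture density, summing only over the first <math>N_s</math> steps.<br />
<br />
```python
import math


def pen_state_loss(p, q, n_max):
    """L_p: cross entropy between one-hot pen states p and predicted probabilities q.

    p and q are lists of length n_max holding (p1, p2, p3) and (q1, q2, q3) tuples.
    """
    total = 0.0
    for p_i, q_i in zip(p, q):
        for p_k, q_k in zip(p_i, q_i):
            if p_k > 0:  # one-hot target: only the true pen state contributes
                total += p_k * math.log(q_k)
    return -total / n_max


# Two time steps; the model assigns probability 0.5 to the true state each time.
targets = [(1, 0, 0), (0, 0, 1)]
predictions = [(0.5, 0.25, 0.25), (0.25, 0.25, 0.5)]
l_p = pen_state_loss(targets, predictions, n_max=2)
```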
<br />
<math>L_{KL}</math> measures the difference between the distribution of the latent vector <math>z</math> and an IID Gaussian vector with zero mean and unit variance.<br />
<br />
\begin{align*}<br />
L_{KL} = - \frac{1}{2 N_z} (1+\hat \sigma - \mu^2 - \exp(\hat \sigma))<br />
\end{align*}<br />
<br />
The overall loss is weighted as:<br />
<br />
\begin{align*}<br />
Loss = L_R + w_{KL} L_{KL}<br />
\end{align*}<br />
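<br />
The KL term and the weighted total can be sketched as follows. The equation for <math>L_{KL}</math> above is written on vectors, so this example assumes that the expression is summed over the <math>N_z</math> latent dimensions; the function names are hypothetical.<br />
<br />
```python
import math


def kl_loss(mu, sigma_hat):
    """L_KL, assuming the vector expression is summed over the N_z dimensions."""
    n_z = len(mu)
    total = sum(1.0 + s - m * m - math.exp(s) for m, s in zip(mu, sigma_hat))
    return -total / (2.0 * n_z)


def total_loss(l_r, l_kl, w_kl):
    """Weighted sum of the reconstruction loss and the KL loss."""
    return l_r + w_kl * l_kl


# mu = 0 and sigma_hat = 0 already match the N(0, I) prior, so the KL term vanishes.
kl_at_prior = kl_loss([0.0, 0.0], [0.0, 0.0])
kl_shifted = kl_loss([1.0], [0.0])
```
<br />
Moving the mean away from zero or the variance away from one makes the KL term positive, so a larger <math>w_{KL}</math> pulls the encoder's output distribution towards the prior.<br />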
<br />
In the decoder-only (unconditional) setting there is no <math>L_{KL} </math> term, and only <math>L_{R} </math> is optimized. In the conditional model, as <math>w_{KL} \rightarrow 0</math> the model approaches a pure autoencoder: it sacrifices the ability to enforce a prior over the latent space and gains better reconstruction loss metrics.<br />
<br />
== Experiments ==<br />
The authors experiment with the sketch-rnn model using different settings and record both losses. They use a Long Short-Term Memory (LSTM) model as the encoder and a HyperLSTM as the decoder. They also run experiments on multi-class datasets. The results are as follows.<br />
<br />
[[File:sketchtable1.png|700px]]<br />
<br />
The trade-off between <math>L_R</math> and <math>L_{KL}</math> is clearly visible in this table. Furthermore, <math>L_R</math> decreases as <math>w_{KL} </math> is halved.<br />
<br />
=== Conditional Reconstruction ===<br />
The authors compare a given sketch with its reconstructions at different <math>\tau</math> values. With the higher <math>\tau</math> values on the right, the reconstructed sketches are more random.<br />
<br />
[[File:sketchfig5.png|700px|center]]<br />
<br />
They also experiment with feeding in a sketch from a different class. The output still keeps some features of the class that the model was trained on.<br />
<br />
=== Latent Space Interpolation ===<br />
The authors visualize the reconstruction sketches while interpolating between latent vectors using different <math>w_{KL}</math> values. With high <math>w_{KL}</math> values, the generated images are more coherently interpolated.<br />
<br />
[[File:sketchfig6.png|700px|center]]<br />
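<br />
Interpolating between two latent vectors can be sketched as below. The summary above does not state which interpolation scheme is used, so spherical interpolation is an assumption here (linear interpolation would be the simplest alternative), and the function name <code>slerp</code> is chosen for this example. Both input vectors are assumed to be non-zero.<br />
<br />
```python
import math


def slerp(z0, z1, t):
    """Spherical interpolation between non-zero latent vectors z0 and z1, 0 <= t <= 1."""
    dot = sum(a * b for a, b in zip(z0, z1))
    norm0 = math.sqrt(sum(a * a for a in z0))
    norm1 = math.sqrt(sum(b * b for b in z1))
    cos_omega = max(-1.0, min(1.0, dot / (norm0 * norm1)))
    omega = math.acos(cos_omega)  # angle between the two vectors
    sin_omega = math.sin(omega)
    if sin_omega < 1e-8:
        # (Anti-)parallel vectors: fall back to linear interpolation.
        return [(1.0 - t) * a + t * b for a, b in zip(z0, z1)]
    w0 = math.sin((1.0 - t) * omega) / sin_omega
    w1 = math.sin(t * omega) / sin_omega
    return [w0 * a + w1 * b for a, b in zip(z0, z1)]


# Five latent vectors from z_a to z_b; each would be passed to the decoder.
z_a, z_b = [1.0, 0.0], [0.0, 1.0]
path = [slerp(z_a, z_b, k / 4.0) for k in range(5)]
```
<br />
Each vector on the path would be decoded into a sketch; how smoothly those sketches change is what the figure above compares across <math>w_{KL}</math> values.<br />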
<br />
=== Sketch Drawing Analogies ===<br />
Since the latent vector <math>z</math> encodes conceptual features of a sketch, those features can also be used to augment other sketches that do not have these features. This is possible when models are trained with low <math>L_{KL}</math> values. The authors are able to perform vector arithmetic on latent vectors from different sketches and explore how the model generates sketches based on the resulting latent vectors.<br />
<br />
=== Predicting Different Endings of Incomplete Sketches === <br />
This model is able to predict possible endings of an incomplete sketch by encoding the incomplete sketch into a hidden state <math>h</math> using the decoder and then using <math>h</math> as the initial hidden state to generate the rest of the sketch. The authors train decoder-only models on individual classes and set τ = 0.8 when completing samples. Figure 7 shows the results.<br />
<br />
[[File:sketchfig7.png|700px|center]]<br />
<br />
== Applications and Future Work ==<br />
The authors believe this model can assist artists by suggesting how to finish a sketch, helping them to find interesting intersections between different drawings or objects, or generating many similar but different designs. In the simplest use, pattern designers can apply sketch-rnn to generate a large number of similar but unique designs for textile or wallpaper prints. Creative designers can also come up with abstract designs that resonate more with their target audience.<br />
<br />
This model may also find its place in teaching students how to draw. Even with the simple sketches in QuickDraw, the authors of this work have become much more proficient at drawing animals, insects, and various sea creatures after conducting these experiments.<br />
When the model is trained with a high <math>w_{KL}</math> and sampled with a low <math>\tau</math>, it may help to turn a poor sketch into a more aesthetically pleasing one. Latent vector augmentation could also help to create a better drawing by inputting user-rating data during the training process.<br />
<br />
The authors conclude by providing the following future directions to this work:<br />
# Investigate using user-rating data to augment the latent vector in the direction that maximizes the aesthetics of the drawing.<br />
# Look into combining variations of sequence-generation models with unsupervised, cross-domain pixel image generation models.<br />
<br />
It would be exciting to combine this model with other unsupervised, cross-domain pixel image generation models to create photorealistic images from sketches.<br />
<br />
The authors also mention the opposite direction, converting a photograph of an object into an unrealistic but similar-looking sketch of the object composed of a minimal number of lines, as a more interesting problem.<br />
<br />
<br />
== Conclusion ==<br />
This paper introduced an interesting model, sketch-rnn, that can encode and decode sketches and generate and complete unfinished sketches. The authors demonstrated how to interpolate between latent vectors of sketches from different classes and how to use the latent space to augment sketches or generate similar-looking sketches. They also showed that it is important to enforce a prior distribution on the latent vector in order to generate coherent sketches while interpolating. Finally, they created a large dataset of sketch drawings to be used in future research.<br />
<br />
== Critique ==<br />
* The performance of the decoder model can hardly be evaluated. The authors present the performance of the decoder by showing the generated sketches; this is clear and straightforward, but not very efficient. It would be great if the authors could present a method or metric to evaluate how well the sketches are generated, rather than printing them out and evaluating them by human judgment.<br />
<br />
* As with the output, the authors didn't present an evaluation of the algorithms either. They provided <math>L_R</math> and <math>L_{KL}</math> for reference; however, a lower loss doesn't necessarily mean better performance.<br />
<br />
* I understand that using strokes as inputs is a novel and innovative move; however, the paper does not provide a baseline or any comparison with other methods or algorithms. Some other research using similar, smaller datasets was mentioned in the paper. It would be great if the authors could use some basic or existing methods as a baseline and compare them with the new algorithm.<br />
<br />
* Besides the comparison with other algorithms, it would also be great if the authors could remove or replace some component of the algorithm in the model to show if one part is necessary, or what made them decide to include a specific component in the algorithm.<br />
<br />
* The authors proposed a few future applications for the model; however, the current output does not seem very close to their descriptions. But I do believe that this is a very good beginning, and the release of the sketch dataset should attract more scholars to research and improve on it!<br />
<br />
* ([https://openreview.net/forum?id=Hy6GHpkCW]) The paper presents both a novel large dataset of sketches and a new RNN architecture to generate new sketches.<br />
<br />
+ new and large dataset<br />
<br />
+ novel algorithm<br />
<br />
+ well written<br />
<br />
- no evaluation of dataset<br />
<br />
- virtually no evaluation of the algorithm<br />
<br />
- no baselines or comparison<br />
<br />
== References == <br />
# Jimmy L. Ba, Jamie R. Kiros, and Geoffrey E. Hinton. Layer normalization. NIPS, 2016.<br />
# Christopher M. Bishop. Mixture density networks. Technical Report, 1994. URL http://publications.aston.ac.uk/373/.<br />
# Samuel R. Bowman, Luke Vilnis, Oriol Vinyals, Andrew M. Dai, Rafal Józefowicz, and Samy Bengio. Generating Sentences from a Continuous Space. CoRR, abs/1511.06349, 2015. URL http://arxiv.org/abs/1511.06349.<br />
# H. Dong, P. Neekhara, C. Wu, and Y. Guo. Unsupervised Image-to-Image Translation with Generative Adversarial Networks. ArXiv e-prints, January 2017.<br />
# David H. Douglas and Thomas K. Peucker. Algorithms for the reduction of the number of points required to represent a digitized line or its caricature. Cartographica: The International Journal for Geographic Information and Geovisualization, 10(2):112–122, October 1973. doi: 10.3138/fm57-6770-u75u-7727. URL http://dx.doi.org/10.3138/fm57-6770-u75u-7727.<br />
# Mathias Eitz, James Hays, and Marc Alexa. How Do Humans Sketch Objects? ACM Trans. Graph.(Proc. SIGGRAPH), 31(4):44:1–44:10, 2012.<br />
# I. Goodfellow. NIPS 2016 Tutorial: Generative Adversarial Networks. ArXiv e-prints, December 2016.<br />
# Alex Graves. Generating sequences with recurrent neural networks. arXiv:1308.0850, 2013.<br />
# David Ha. Recurrent Net Dreams Up Fake Chinese Characters in Vector Format with TensorFlow, 2015.<br />
# David Ha, Andrew M. Dai, and Quoc V. Le. HyperNetworks. In ICLR, 2017.<br />
# Sepp Hochreiter and Juergen Schmidhuber. Long short-term memory. Neural Computation, 1997.<br />
# P. Isola, J.-Y. Zhu, T. Zhou, and A. A. Efros. Image-to-Image Translation with Conditional Adversarial Networks. ArXiv e-prints, November 2016.<br />
# Jonas Jongejan, Henry Rowley, Takashi Kawashima, Jongmin Kim, and Nick Fox-Gieg. The Quick, Draw! - A.I. Experiment. 2016. URL https://quickdraw.withgoogle.com/.<br />
# C. Kaae Sønderby, T. Raiko, L. Maaløe, S. Kaae Sønderby, and O. Winther. Ladder Variational Autoencoders. ArXiv e-prints, February 2016.<br />
# T. Kim, M. Cha, H. Kim, J. Lee, and J. Kim. Learning to Discover cross-domain Relations with Generative Adversarial Networks. ArXiv e-prints, March 2017.<br />
# D. P Kingma and M. Welling. Auto-Encoding Variational Bayes. ArXiv e-prints, December 2013.<br />
# Diederik Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In ICLR, 2015.<br />
# Diederik P. Kingma, Tim Salimans, and Max Welling. Improving variational inference with inverse autoregressive flow. CoRR, abs/1606.04934, 2016. URL http://arxiv.org/abs/1606.04934.<br />
# Brenden M. Lake, Ruslan Salakhutdinov, and Joshua B. Tenenbaum. Human level concept learning through probabilistic program induction. Science, 350(6266):1332–1338, December 2015. ISSN 1095-9203. doi: 10.1126/science.aab3050. URL http://dx.doi.org/10.1126/science.aab3050.<br />
# Yong Jae Lee, C. Lawrence Zitnick, and Michael F. Cohen. Shadowdraw: Real-time user guidance for freehand drawing. In ACM SIGGRAPH 2011 Papers, SIGGRAPH ’11, pp. 27:1–27:10, New York, NY, USA, 2011. ACM. ISBN 978-1-4503-0943-1. doi: 10.1145/1964921.1964922. URL http://doi.acm.org/10.1145/1964921.1964922.<br />
# M.-Y. Liu, T. Breuel, and J. Kautz. Unsupervised Image-to-Image Translation Networks. ArXiv e-prints, March 2017.<br />
# S. Reed, A. van den Oord, N. Kalchbrenner, S. Gómez Colmenarejo, Z. Wang, D. Belov, and N. de Freitas. Parallel Multiscale Autoregressive Density Estimation. ArXiv e-prints, March 2017.<br />
# Patsorn Sangkloy, Nathan Burnell, Cusuh Ham, and James Hays. The Sketchy Database: Learning to Retrieve Badly Drawn Bunnies. ACM Trans. Graph., 35(4):119:1–119:12, July 2016. ISSN 0730-0301. doi: 10.1145/2897824.2925954. URL http://doi.acm.org/10.1145/2897824.2925954.<br />
# Mike Schuster, Kuldip K. Paliwal, and A. General. Bidirectional recurrent neural networks. IEEE Transactions on Signal Processing, 1997.<br />
# Saul Simhon and Gregory Dudek. Sketch interpretation and refinement using statistical models. In Proceedings of the Fifteenth Eurographics Conference on Rendering Techniques, EGSR’04, pp. 23–32, Aire-la-Ville, Switzerland, 2004. Eurographics Association. ISBN 3-905673-12-6. doi: 10.2312/EGWR/EGSR04/023-032. URL http://dx.doi.org/10.2312/EGWR/EGSR04/023-032.<br />
# Patrick Tresset and Frederic Fol Leymarie. Portrait drawing by Paul the robot. Comput. Graph., 37(5):348–363, August 2013. ISSN 0097-8493. doi: 10.1016/j.cag.2013.01.012. URL http://dx.doi.org/10.1016/j.cag.2013.01.012.<br />
# T. White. Sampling Generative Networks. ArXiv e-prints, September 2016.<br />
# Ning Xie, Hirotaka Hachiya, and Masashi Sugiyama. Artist agent: A reinforcement learning approach to automatic stroke generation in oriental ink painting. In ICML. icml.cc / Omnipress, 2012. URL http://dblp.uni-trier.de/db/conf/icml/icml2012.html#XieHS12.<br />
# Xu-Yao Zhang, Fei Yin, Yan-Ming Zhang, Cheng-Lin Liu, and Yoshua Bengio. Drawing and Recognizing Chinese Characters with Recurrent Neural Network. CoRR, abs/1606.06539, 2016. URL http://arxiv.org/abs/1606.06539.</div>Mpaflahttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Mapping_Images_to_Scene_Graphs_with_Permutation-Invariant_Structured_Prediction&diff=40740Mapping Images to Scene Graphs with Permutation-Invariant Structured Prediction2018-11-22T01:37:55Z<p>Mpafla: /* Structured prediction */</p>
<hr />
<div>The paper ''Mapping Images to Scene Graphs with Permutation-Invariant Structured Prediction'' was written by Roei Herzig* from Tel Aviv University, Moshiko Raboh* from Tel Aviv University, Gal Chechik from Google Brain, Bar-Ilan University, Jonathan Berant from Tel Aviv University, and Amir Globerson from Tel Aviv University. This paper is part of the NIPS 2018 conference to be hosted in December 2018 at Montréal, Canada. This paper summary is based on version 3 of the pre-print (as of May 2018) obtained from [https://arxiv.org/pdf/1802.05451v3.pdf arXiv] <br />
<br />
(*) Equal contribution<br />
<br />
=Motivation=<br />
In the field of artificial intelligence, a major goal is to enable machines to understand complex images, such as the underlying relationships between objects that exist in each scene. Although there are models today that capture both complex labels and interactions between labels, there are no clear guidelines for how such models should be designed when leveraging deep learning. This paper introduces a design principle for such models that stems from the concept of permutation invariance, and demonstrates state-of-the-art performance with models that follow this principle.<br />
<br />
The primary contributions that this paper makes include:<br />
# Deriving sufficient and necessary conditions for respecting graph-permutation invariance in deep structured prediction architectures<br />
# Empirically demonstrating the benefit of graph-permutation invariance<br />
# Developing a state-of-the-art model for scene graph predictions over a large set of complex visual scenes<br />
<br />
=Introduction=<br />
In order for a machine to interpret complex visual scenes, it must recognize and understand both objects and relationships between the objects in the scene. A '''scene graph''' is a representation of the set of objects and relations that exist in the scene, where objects are represented as nodes and relations are represented as edges connecting the different nodes. Hence, the prediction of the scene graph is analogous to inferring the joint set of objects and relations of a visual scene.<br />
<br />
[[File:scene_graph_example.png|600px|center]]<br />
<br />
Given that objects in scenes are interdependent on each other, joint prediction of the objects and relations is necessary. The field of structured prediction, which involves the general problem of inferring multiple inter-dependent labels, is of interest for this problem.<br />
<br />
In structured prediction models, a score function <math>s(x, y)</math> is defined to evaluate the compatibility between label <math>y</math> and input <math>x</math>. For instance, when interpreting the scene of an image, <math>x</math> refers to the image itself, and <math>y</math> refers to a complex label, which contains both the objects and the relations between objects. As with most other inference methods, the goal is to find the label <math>y^*</math> that maximizes <math>s(x,y)</math>. However, the major concern is that the space of possible label assignments grows exponentially with the input size. For example, although an image may seem very simple, the set of possible labels for its objects may be very large, making the score function difficult to optimize.<br />
<br />
The paper presents an alternative approach, for which input <math>x</math> is mapped to structured output <math>y</math> using a "black box" neural network, omitting the definition of a score function. The main concern for this approach is the determination of the network architecture.<br />
<br />
=Structured prediction=<br />
This paper further considers structured prediction using score-based methods. In a score-based approach, a score function <math>s(x, y)</math> measures how compatible label <math>y</math> is with input <math>x</math>. To make optimization tractable, previous works have decomposed the score as <math>s(x,y) = \sum_i f_i(x,y)</math>, where each local score function <math>f_i</math> depends on only a small subset of the <math>y</math> variables, so that each local problem <math>\max_y f_i(x,y)</math> can be solved efficiently.<br />
<br />
In the area of structured prediction, the most commonly used score functions are the singleton score function <math>f_i(y_i, x)</math> and the pairwise score function <math>f_{ij} (y_i, y_j, x)</math>. Previous works explored two-stage architectures (which learn local scores independently of the structured prediction goal) and end-to-end architectures (which include the inference algorithm within the computation graph).<br />
<br />
==Advantages of using score-based methods==<br />
# Allow for intuitive specification of local dependencies between labels, and how they map to global dependencies<br />
# Linear score functions offer natural convex surrogates<br />
# Inference in large label space is sometimes possible via exact algorithms or empirically accurate approximations<br />
<br />
The concern for modelling score functions using deep networks is that learning may no longer be convex. Hence, the paper presents properties for how deep networks can be used for structured predictions by considering architectures that do not require explicit maximization of a score function.<br />
<br />
=Background, Notations, and Definitions=<br />
We denote a structured label by <math>y = [y_1, \dots, y_n]</math>.<br />
<br />
'''Score functions:''' for score-based methods, the score is defined as either the sum of a set of singleton scores <math>f_i = f_i(y_i, x)</math> or the sum of pairwise scores <math>f_{ij} = f_{ij}(y_i, y_j, x)</math>.<br />
<br />
Let <math>s(x,y)</math> be the score of a score-based method. Then:<br />
<br />
<div align="center"><br />
<math>s(x,y) = \begin{cases}<br />
\sum_i f_i ~ \text{if we have a set of singleton scores}\\<br />
\sum_{ij} f_{ij} ~ \text{if we have a set of pairwise scores } \\<br />
\end{cases}</math><br />
</div><br />
<br />
'''Inference algorithm:''' an inference algorithm takes as input a set of local scores (either <math>f_i</math> or <math>f_{ij}</math>) and outputs an assignment of labels <math>y_1, \dots, y_n</math> that maximizes the score function <math>s(x,y)</math>.<br />
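To make the score function and inference definitions concrete, the following is a minimal sketch of a decomposed score and a brute-force inference routine. The function names, the toy score tables, and the choice to combine singleton and pairwise terms in one score are illustrative assumptions, not code from the paper. Enumerating every assignment is only feasible for tiny problems, which is exactly the exponential blow-up mentioned in the introduction.

```python
from itertools import product

def score(y, singleton, pairwise):
    """s(x, y) as a sum of singleton scores f_i(y_i) and pairwise scores f_ij(y_i, y_j)."""
    total = sum(singleton[i][y[i]] for i in range(len(y)))
    total += sum(table[y[i]][y[j]] for (i, j), table in pairwise.items())
    return total

def brute_force_inference(singleton, pairwise, num_labels):
    """Enumerate all num_labels**n assignments and return the highest-scoring one."""
    n = len(singleton)
    return max(product(range(num_labels), repeat=n),
               key=lambda y: score(y, singleton, pairwise))

# Toy problem (all numbers are made up): 3 variables, 2 possible labels each.
singleton = [[1.0, 0.0], [0.0, 0.5], [0.2, 0.0]]
pairwise = {
    (0, 1): [[2.0, 0.0], [0.0, 2.0]],  # rewards y_0 == y_1
    (1, 2): [[0.0, 1.0], [1.0, 0.0]],  # rewards y_1 != y_2
}
best = brute_force_inference(singleton, pairwise, num_labels=2)
print(best, score(best, singleton, pairwise))
```

With <math>n</math> variables and <math>L</math> labels each, this loop visits <math>L^n</math> assignments, which motivates either exploiting the decomposition or avoiding explicit maximization altogether.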
<br />
'''Graph labeling function:''' a graph labeling function <math>\mathcal{F} : (V,E) \rightarrow Y</math> is a function that takes as input an ordered set of node features <math>V = [z_1, \dots, z_n]</math> and an ordered set of edge features <math>E = [z_{1,2},\dots,z_{i,j},\dots,z_{n,n-1}]</math>, and outputs a set of node labels <math>\mathbf{y} = [y_1, \dots, y_n]</math>. For instance, <math>z_i</math> can be set equal to <math>f_i</math> and <math>z_{ij}</math> can be set equal to <math>f_{ij}</math>.<br />
<br />
For convenience, the joint set of node and edge features is denoted by <math>\mathbf{z}</math>, a collection of <math>n^2</math> features (<math>n</math> node features and <math>n(n-1)</math> edge features).<br />
<br />
'''Permutation:''' Let <math>z</math> be a set of node and edge features. Given a permutation <math>\sigma</math> of <math>\{1,\dots,n\}</math>, let <math>\sigma(z)</math> be a new set of node and edge features given by <math>[\sigma(z)]_i = z_{\sigma(i)}</math> and <math>[\sigma(z)]_{i,j} = z_{\sigma(i), \sigma(j)}</math>.<br />
<br />
'''One-hot representation:''' let <math>\mathbf{1}[j]</math> be a one-hot vector with 1 in the <math>j^{th}</math> coordinate.<br />
<br />
=Permutation-Invariant Structured prediction=<br />
<br />
With permutation-invariant structured prediction, we expect the result to be unaffected by the order in which the variables are presented: given the same score function, reordering the inputs should only reorder the outputs in the same way. For instance, consider the case where we have a label space for 3 variables <math>y_1, y_2, y_3</math> with input <math>\mathbf{z} = (f_1, f_2, f_3, f_{12}, f_{13}, f_{23})</math> that outputs label <math>\mathbf{y} = (y_1^*, y_2^*, y_3^*)</math>. Then if the algorithm is run on a permuted version of the input, <math>z' = (f_2, f_1, f_3, f_{21}, f_{23}, f_{13})</math>, we would expect <math>\mathbf{y} = (y_2^*, y_1^*, y_3^*)</math> given the same score function.<br />
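The three-variable example above can be reproduced with a small sketch of the permutation operation from the definitions section. The helper name <code>permute</code>, the 0-based indexing, and the string placeholders standing in for features are assumptions made for illustration.

```python
def permute(nodes, edges, sigma):
    """Apply sigma to the features:
    [sigma(z)]_i = z_sigma(i) and [sigma(z)]_ij = z_sigma(i),sigma(j)."""
    n = len(nodes)
    new_nodes = [nodes[sigma[i]] for i in range(n)]
    new_edges = {(i, j): edges[(sigma[i], sigma[j])]
                 for i in range(n) for j in range(n) if i != j}
    return new_nodes, new_edges

# Placeholder features named after the scores in the text (0-based indices in code).
nodes = ["f1", "f2", "f3"]
edges = {(i, j): f"f{i + 1}{j + 1}" for i in range(3) for j in range(3) if i != j}

sigma = [1, 0, 2]  # swap the first two variables
p_nodes, p_edges = permute(nodes, edges, sigma)
print(p_nodes, p_edges[(0, 1)], p_edges[(0, 2)], p_edges[(1, 2)])
```

The permuted node list and the three edge entries printed here correspond to the permuted input <math>z'</math> in the example.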
<br />
'''Graph permutation invariance (GPI):''' a graph labeling function <math>\mathcal{F}</math> is graph-permutation invariant if, for all permutations <math>\sigma</math> of <math>\{1, \dots, n\}</math> and for all feature sets <math>\mathbf{z}</math>, <math>\mathcal{F}(\sigma(\mathbf{z})) = \sigma(\mathcal{F}(\mathbf{z}))</math>.<br />
<br />
The paper presents a theorem on the necessary and sufficient conditions for a function <math>\mathcal{F}</math> to be graph permutation invariant. Intuitively, because <math>\mathcal{F}</math> takes an ordered set <math>\mathbf{z}</math> as input, its output on <math>\sigma(\mathbf{z})</math> could in general be unrelated to its output on <math>\mathbf{z}</math>, which means <math>\mathcal{F}</math> needs to have some form of symmetry in order to satisfy <math>[\mathcal{F}(\sigma(\mathbf{z}))]_k = [\mathcal{F}(\mathbf{z})]_{\sigma(k)}</math>.<br />
<br />
[[File:graph_permutation_invariance.jpg|400px|center]]<br />
<br />
==Theorem 1==<br />
Let <math>\mathcal{F}</math> be a graph labeling function. Then <math>\mathcal{F}</math> is graph-permutation invariant if and only if there exist functions <math>\alpha, \rho, \phi</math> such that for all <math>k=1, \dots, n</math>:<br />
\begin{align}<br />
[\mathcal{F}(\mathbf{z})]_k = \rho(\mathbf{z}_k, \sum_{i=1}^n \alpha(\mathbf{z}_i, \sum_{j\neq i} \phi(\mathbf{z}_i, \mathbf{z}_{i,j}, \mathbf{z}_j)))<br />
\end{align}<br />
where <math>\phi: \mathbb{R}^{2d+e} \rightarrow \mathbb{R}^L, \alpha: \mathbb{R}^{d + L} \rightarrow \mathbb{R}^{W}, \rho: \mathbb{R}^{W+d} \rightarrow \mathbb{R}</math>.<br />
<br />
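In these input and output dimensions, <math>d</math> is the dimension of each node (singleton) feature vector <math>\mathbf{z}_i</math> and <math>e</math> is the dimension of each edge feature vector <math>\mathbf{z}_{i,j}</math>, which is why <math>\phi</math> takes an input of size <math>2d+e</math>.<br />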
<br />
[[File:GPI_architecture.jpg|thumb|A schematic representation of the GPI architecture. Singleton features <math>z_i</math> are omitted for simplicity. First, the features <math>z_{i,j}</math> are processed element-wise by <math>\phi</math>. Next, they are summed to create a vector <math>s_i</math>, which is concatenated with <math>z_i</math>. Third, a representation of the entire graph is created by applying <math>\alpha\ n</math> times and summing the created vector. The graph representation is then finally processed by <math>\rho</math> together with <math>z_k</math>.|600px|center]]<br />
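The functional form of Theorem 1 can be checked numerically with a short sketch. The particular choices of <math>\phi</math>, <math>\alpha</math> and <math>\rho</math> below are arbitrary toy functions (in the paper they are learned networks), and the sizes <math>L = W = 2</math> are assumptions made for illustration. Any choice of the three functions should pass the same check.

```python
import math
import random

def vec_add(a, b):
    return [x + y for x, y in zip(a, b)]

def phi(z_i, z_ij, z_j):
    # Toy pairwise map from R^(2d+e) to R^L, with L = 2.
    return [math.tanh(sum(z_i) + 2.0 * sum(z_ij) - sum(z_j)),
            math.tanh(sum(z_i) * sum(z_j) + sum(z_ij))]

def alpha(z_i, s_i):
    # Toy per-node map from R^(d+L) to R^W, with W = 2.
    return [math.tanh(sum(z_i) + s_i[0]), math.tanh(s_i[1] - sum(z_i))]

def rho(z_k, g):
    # Toy output map from R^(W+d) to R.
    return sum(z_k) + 3.0 * g[0] - g[1]

def gpi_label(nodes, edges):
    """[F(z)]_k = rho(z_k, sum_i alpha(z_i, sum_{j != i} phi(z_i, z_ij, z_j)))."""
    n = len(nodes)
    g = [0.0, 0.0]                      # representation of the whole graph
    for i in range(n):
        s_i = [0.0, 0.0]                # summed messages into node i
        for j in range(n):
            if j != i:
                s_i = vec_add(s_i, phi(nodes[i], edges[(i, j)], nodes[j]))
        g = vec_add(g, alpha(nodes[i], s_i))
    return [rho(nodes[k], g) for k in range(n)]

# Random features: n nodes with d-dimensional features, e-dimensional edge features.
rng = random.Random(0)
n, d, e = 4, 3, 2
nodes = [[rng.uniform(-1, 1) for _ in range(d)] for _ in range(n)]
edges = {(i, j): [rng.uniform(-1, 1) for _ in range(e)]
         for i in range(n) for j in range(n) if i != j}

# Permute the input and check that [F(sigma(z))]_k = [F(z)]_sigma(k).
sigma = [2, 0, 3, 1]
p_nodes = [nodes[sigma[i]] for i in range(n)]
p_edges = {(i, j): edges[(sigma[i], sigma[j])] for (i, j) in edges}
out = gpi_label(nodes, edges)
p_out = gpi_label(p_nodes, p_edges)
equivariant = all(abs(p_out[k] - out[sigma[k]]) < 1e-9 for k in range(n))
print(equivariant)
```

The comparison uses a small tolerance because the sums are accumulated in a different order after permutation, which can change floating-point rounding.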
<br />
==Proof Sketch for Theorem 1==<br />
The proof of this theorem can be found in the paper. A proof sketch is provided below:<br />
<br />
'''For the forward direction''' (any function of the form set out in equation (1) is GPI):<br />
# Use the definition of the permutation <math>\sigma</math> and rewrite <math>[\mathcal{F}(\mathbf{z})]_{\sigma(k)}</math> in the form of equation (1)<br />
# The second argument of <math>\rho</math> is invariant under <math>\sigma</math>, since it sums over all indices <math>i</math> and all other indices <math>j \neq i </math>, so only the first argument <math>\mathbf{z}_{\sigma(k)}</math> depends on the permutation.<br />
<br />
'''For the backward direction''' (any black-box GPI function can be expressed in the form of equation (1)):<br />
# Construct <math>\phi, \alpha</math> such that the second argument of <math>\rho</math> contains all information about the graph features <math>z</math>, including which edge each feature originates from<br />
# Assume each <math>z_k</math> uniquely identifies its node and <math>\mathcal{F}</math> is a function only of the pairwise features <math>z_{i,j}</math><br />
# Let <math>H</math> be a perfect hash function with <math>L</math> buckets, and let <math>\phi</math> map '''pairwise features''' to a vector of size <math>L</math><br />
# <math>*</math>Construct <math>\phi(z_i, z_{i,j}, z_j) = \mathbf{1}[H(z_j)] z_{i,j}</math>, which intuitively means that <math>\phi</math> stores <math>z_{i,j}</math> in the unique bucket for node <math>j</math><br />
# Construct the function <math>\alpha</math> to output a matrix in <math>\mathbb{R}^{L \times L}</math> that places each pairwise feature in a unique position (<math>\alpha(z_i, s_i) = \mathbf{1}[H(z_i)]s_i^T</math>, where <math>s_i = \sum_{j \neq i} \phi(z_i, z_{i,j}, z_j)</math>)<br />
# Construct the matrix <math>M = \sum_i \alpha(z_i,s_i)</math> and discard the rows/columns of <math>M</math> that do not correspond to original nodes (which reduces its dimension to <math>n\times n</math>); set <math>\rho</math> to have the same outcome as <math>\mathcal{F}</math>, and set the output of <math>\mathcal{F}</math> on <math>M</math> to be the labels <math>\mathbf{y} = y_1, \dots, y_n</math><br />
<br />
<math>*</math>The paper presents the proof for the edge features <math>z_{ij}</math> being scalar (<math>e = 1</math>) for simplicity, which can be extended easily to vectors with additional indexing.<br />
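The hashing construction in the backward direction can be illustrated with a small sketch for scalar edge features. The node identifiers, the dictionary standing in for the perfect hash <math>H</math>, and the bucket count <math>L = 5</math> are hypothetical values chosen for illustration. The point is that <math>M</math> stores each <math>z_{i,j}</math> at position <math>(H(z_i), H(z_j))</math>, so it does not depend on the order in which nodes are listed.

```python
def one_hot(index, size):
    return [1.0 if k == index else 0.0 for k in range(size)]

def build_M(nodes, edges, H, L):
    """M = sum_i alpha(z_i, s_i), with phi = 1[H(z_j)] * z_ij and alpha = 1[H(z_i)] s_i^T."""
    n = len(nodes)
    M = [[0.0] * L for _ in range(L)]
    for i in range(n):
        s_i = [0.0] * L
        for j in range(n):
            if j != i:
                # phi stores the scalar z_ij in the bucket of node j.
                hot = one_hot(H[nodes[j]], L)
                s_i = [s + h * edges[(i, j)] for s, h in zip(s_i, hot)]
        # alpha writes s_i into the row that belongs to the bucket of node i.
        row_hot = one_hot(H[nodes[i]], L)
        for r in range(L):
            for c in range(L):
                M[r][c] += row_hot[r] * s_i[c]
    return M

nodes = ["a", "b", "c"]                 # features that uniquely identify each node
H = {"a": 4, "b": 1, "c": 2}            # hypothetical perfect hash into L = 5 buckets
L = 5
edges = {(0, 1): 0.1, (0, 2): 0.2, (1, 0): 0.3,
         (1, 2): 0.4, (2, 0): 0.5, (2, 1): 0.6}
M = build_M(nodes, edges, H, L)

# The same graph with its nodes listed in a different order.
sigma = [2, 0, 1]
p_nodes = [nodes[s] for s in sigma]
p_edges = {(i, j): edges[(sigma[i], sigma[j])] for (i, j) in edges}
M_permuted = build_M(p_nodes, p_edges, H, L)
print(M == M_permuted)
```

Rows and columns of <math>M</math> for unused buckets stay zero, which matches the step of discarding them to recover an <math>n \times n</math> matrix.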
<br />
Although the results discussed previously apply to complete graphs (edge features exist for all node pairs), they can be easily extended to incomplete graphs. However, in place of permutation invariance, the relevant property is then automorphism invariance.<br />
<br />
==Implications and Applications of Theorem 1==<br />
===Key Implications of Theorem 1===<br />
# Architecture "collects" information from the different edges of the graph, and does so in an invariant fashion using <math>\alpha</math> and <math>\phi</math><br />
# Architecture is parallelizable, since all <math>\phi</math> functions can be applied simultaneously<br />
<br />
===Some applications of Theorem 1===<br />
# '''Attention:''' the concept of attention can be implemented in the GPI characterization, with slight alterations to the functions <math>\alpha</math> and <math>\phi</math>. With attention, each node aggregates the features of its neighbours weighted by a function of each neighbour's relevance, which means the label of an entity can depend strongly on the entities close to it. The complete details can be found in the supplementary materials of the paper.<br />
# '''RNN:''' recurrent architectures can maintain the GPI property, since GPI functions are closed under composition. The output of one step of running <math>\mathcal{F}</math> acts as input for the next step, and the GPI property is maintained throughout.<br />
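As an illustration of why attention is compatible with the GPI form, the following sketch replaces a plain sum over neighbours with a softmax-weighted sum and checks that the result does not depend on neighbour order. This is a generic attention aggregation written under assumed names and toy values. The paper's exact formulation is in its supplementary material and may differ.

```python
import math

def attention_aggregate(features, scores):
    """Softmax-weighted sum of neighbour feature vectors.

    features: one vector per neighbour; scores: one relevance score per neighbour.
    """
    m = max(scores)                                 # subtract the max for stability
    weights = [math.exp(s - m) for s in scores]
    total = sum(weights)
    dim = len(features[0])
    return [sum(w * f[k] for w, f in zip(weights, features)) / total
            for k in range(dim)]

# Toy neighbour features and relevance scores (made-up values).
features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
scores = [0.0, 0.0, math.log(2.0)]
agg = attention_aggregate(features, scores)

# Reordering the neighbours (features and scores together) leaves the result unchanged.
order = [2, 0, 1]
shuffled = attention_aggregate([features[i] for i in order],
                               [scores[i] for i in order])
print(agg, shuffled)
```

Because the weights are computed per neighbour and then normalised by a sum over all neighbours, the aggregation remains a symmetric function of the neighbour set.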
<br />
=Related Work=<br />
# '''Architectural invariance:''' suggested recently in the 2017 paper Deep Sets by Zaheer et al., which considers a more restrictive notion of invariance.<br />
# '''Deep structured prediction:''' previous work applied deep learning to structured prediction, for instance, semantic segmentation. Algorithms used include message passing, gradient descent for maximizing score functions, and greedy decoding (inferring one label at a time based on previous labels). Apart from those algorithms, deep learning has been applied to other graph-based problems such as the Travelling Salesman Problem (Bello et al., 2016; Gilmer et al., 2017; Khalil et al., 2017). However, none of the previous work specifically addresses the notion of invariance in the general architecture; it focuses instead on message passing architectures, which are generalized by this paper.<br />
# '''Scene graph prediction:''' scene graph extraction allows for reasoning, question answering, and image retrieval (Johnson et al., 2015; Lu et al., 2016; Raposo et al., 2017). Some other works in this area include object detection, action recognition, and even detection of human-object interactions (Liao et al., 2016; Plummer et al., 2017). Additional work has been done with the use of message passing algorithms (Xu et al., 2017), word embeddings (Lu et al., 2016), and end-to-end prediction directly from pixels (Newell & Deng, 2017). A notable mention is NeuralMotif (Zellers et al., 2017), which the authors describe as the current state-of-the-art model for scene graph predictions on the Visual Genome dataset.<br />
# '''Burst Image Deblurring Using Permutation Invariant Convolutional Neural Networks:''' similar ideas were applied in this work, where permutation-invariant CNNs are used to restore sharp and noise-free images from bursts of photographs affected by hand tremor and noise. The approach produced good-quality, detailed images on challenging datasets.<br />
<br />
=Experimental Results=<br />
==Synthetic Graph Labeling==<br />
The authors created a synthetic problem to study GPI. This involved using an input graph <math>G = (V,E)</math> where each node <math>i</math> belongs to a set <math>\Gamma(i) \in \{1, \dots, K\}</math>, where <math>K</math> is the number of sets. The task is to compute, for each node, the number of neighbours that belong to the same set (i.e. the label of node <math>i</math> is <math>y_i = \sum_{j \in N(i)} \mathbf{1}[\Gamma(i) = \Gamma(j)]</math>). Random graphs (each with 10 nodes) were generated by sampling the edges and the set <math>\Gamma(i) \in \{1, \dots, K\}</math> for each node independently and uniformly.<br />
The node features of the graph <math>z_i \in \{0,1\}^K</math> are one-hot vectors of <math>\Gamma(i)</math>, and each pairwise edge feature <math>z_{ij} \in \{0, 1\}</math> denotes whether the edge <math>ij</math> is in the edge set <math>E</math>.<br />
Three architectures were studied in this paper:<br />
# '''GPI-architecture for graph prediction''' (without attention and RNN)<br />
# '''LSTM''': replacing <math>\sum \phi(\cdot)</math> and <math>\sum \alpha(\cdot)</math> in the form of Theorem 1 using two LSTMs with state size 200, reading their input in random order<br />
# '''Fully connected feed-forward network''': with 2 hidden layers, each layer containing 1,000 nodes; the input is a concatenation of all nodes and pairwise features, and the output is all node predictions<br />
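The synthetic task described above can be sketched as a small data generator. The edge probability of 0.5, the value <math>K = 3</math>, the use of undirected edges, and the function names are assumptions made for illustration, since the summary does not state them.

```python
import random

def count_same_set_neighbours(gamma, adj):
    """y_i = number of neighbours j of i with Gamma(i) == Gamma(j)."""
    n = len(gamma)
    return [sum(1 for j in range(n) if adj[i][j] and gamma[i] == gamma[j])
            for i in range(n)]

def make_example(n=10, K=3, edge_prob=0.5, seed=0):
    """Sample one random graph with its features and labels (assumed settings)."""
    rng = random.Random(seed)
    gamma = [rng.randint(1, K) for _ in range(n)]     # set membership per node
    adj = [[0] * n for _ in range(n)]                 # edge features z_ij in {0, 1}
    for i in range(n):
        for j in range(i + 1, n):
            if rng.random() < edge_prob:
                adj[i][j] = adj[j][i] = 1
    # Node features z_i: one-hot encoding of Gamma(i).
    z_nodes = [[1 if gamma[i] == k else 0 for k in range(1, K + 1)]
               for i in range(n)]
    labels = count_same_set_neighbours(gamma, adj)
    return gamma, adj, z_nodes, labels

gamma, adj, z_nodes, labels = make_example()
print(gamma, labels)
```

Each label depends only on a node's own set and its neighbourhood, so relabelling the nodes permutes the labels in the same way, which is the structure a GPI architecture is built to respect.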
<br />
The results show that the GPI architecture requires far fewer samples to converge to the correct solution.<br />
[[File:GPI_synthetic_example.jpg|450px|center]]<br />
<br />
==Scene-Graph Classification==<br />
Applying the concept of GPI to Scene-Graph Prediction (SGP) is the main task of this paper. The input to this problem is an image, along with a set of annotated bounding boxes for the entities in the image. The goal is to correctly label each entity within the bounding boxes and the relationship between every pair of entities, resulting in a coherent scene graph.<br />
<br />
The authors describe two different types of variables to predict. The first type is entity variables <math>[y_1, \dots, y_n]</math> for all bounding boxes, where each <math>y_i</math> can take one of L values and refers to objects such as "dog" or "man". The second type is relation variables <math>[y_{n+1}, \cdots, y_{n^2}]</math>, where each <math>y_i</math> represents the relation (e.g. "on", "below") between a pair of bounding boxes (entities).<br />
<br />
The scene graph contains two types of edges:<br />
# '''Entity-entity edge''': connecting two entities <math>y_i</math> and <math>y_j</math> for <math>1 \leq i \neq j \leq n</math><br />
# '''Entity-relation edges''': connecting every relation variable <math>y_k</math> for <math>k > n</math> to two entities<br />
<br />
The feature set <math>\mathbf{z}</math> is based on the baseline model from Zellers et al. (2017). For entity variables <math>y_i</math>, the vector <math>\mathbf{z}_i \in \mathbb{R}^L</math> models the probability of the entity appearing in <math>y_i</math>. <math>\mathbf{z}_i</math> is augmented by the coordinates of the bounding box. Similarly for relation variables <math>y_j</math>, the vector <math>\mathbf{z}_j \in \mathbb{R}^R</math>, models the probability of the relations between the two entities in <math>j</math>. For entity-entity pairwise features <math>\mathbf{z}_{i,j}</math>, there is a similar representation of the probabilities for the pair. The SGP outputs probability distributions over all entities and relations, which will then be used as input recurrently to maintain GPI. Finally, word embeddings are used and concatenated for the most probable entity-relation labels.<br />
<br />
'''Components of the GPI architecture''' (ent for entity, rel for relation)<br />
# <math>\phi_{ent}</math>: network that integrates two entity variables <math>y_i</math> and <math>y_j</math>, with input <math>z_i, z_j, z_{i,j}</math> and output vector of <math>\mathbb{R}^{n_1}</math> <br />
# <math>\alpha_{ent}</math>: network with inputs from <math>\phi_{ent}</math> for all neighbours of an entity, and uses attention mechanism to output vector <math>\mathbb{R}^{n_2}</math> <br />
# <math>\rho_{ent}</math>: network with inputs from the various <math>\mathbb{R}^{n_2}</math> vectors, and outputs <math>L</math> logits to predict entity value<br />
# <math>\rho_{rel}</math>: network with inputs <math>\alpha_{ent}</math> of two entities and <math>z_{i,j}</math>, and output into <math>R</math> logits<br />
<br />
==Set-up and Results==<br />
'''Dataset''': based on Visual Genome (VG) by (Krishna et al., 2017), which contains a total of 108,077 images annotated with bounding boxes, entities, and relations. An average of 12 entities and 7 relations exist per image. For a fair comparison with previous works, data from (Xu et al., 2017) for train and test splits were used. The authors used the same 150 entities and 50 relations as in (Xu et al., 2017; Newell & Deng, 2017; Zellers et al., 2017). Hyperparameters were tuned using a 70K/5K/32K split for training, validation, and testing respectively.<br />
<br />
'''Training''': all networks were trained using the Adam optimizer, with a batch size of 20. The loss function was the sum of cross-entropy losses over all entities and relations. Penalties for misclassified entities were 4 times stronger than those for relations. Penalties for misclassified negative relations were 10 times weaker than those for positive relations.<br />
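One way to read the weighting scheme in the training description is as a weighted sum of cross-entropy terms. The sketch below is an interpretation under stated assumptions: positive relations get weight 1, entities weight 4, negative relations weight 0.1, and the "no relation" class is encoded as label index 0. None of these encodings or names are confirmed by the paper.

```python
import math

def cross_entropy(probs, target):
    """Negative log-probability assigned to the true label."""
    return -math.log(probs[target])

def scene_graph_loss(entity_preds, relation_preds, negative_label=0):
    """Weighted sum of cross-entropy losses.

    entity_preds / relation_preds: lists of (probability vector, true label index).
    negative_label: assumed index of the "no relation" class.
    """
    loss = 0.0
    for probs, target in entity_preds:
        loss += 4.0 * cross_entropy(probs, target)       # entities weighted 4x
    for probs, target in relation_preds:
        weight = 0.1 if target == negative_label else 1.0  # negatives weighted 10x less
        loss += weight * cross_entropy(probs, target)
    return loss

# Toy predictions (made-up probabilities): one entity, one positive and one negative relation.
loss = scene_graph_loss(
    entity_preds=[([0.5, 0.5], 0)],
    relation_preds=[([0.5, 0.5], 1), ([0.5, 0.5], 0)],
)
print(loss)
```

Down-weighting negative relations is a common way to keep the many "no relation" pairs from dominating the loss, which is presumably the motivation here.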
<br />
'''Evaluation''': there are three major tasks when inferring from the scene graph. The authors focus on the following two:<br />
# '''SGCls''': given ground-truth entity bounding boxes, predict all entity and relation categories<br />
# '''PredCls''': given annotated bounding boxes with entity labels, predict all relations<br />
<br />
The evaluation metric Recall@K (shortened to R@K) is drawn from (Lu et al., 2016). This metric is the fraction of correct ground-truth triplets that appear within the <math>K</math> most confident triplets predicted by the model. The graph-constrained protocol requires the top-<math>K</math> triplets to assign one consistent class per entity and relation. The unconstrained protocol does not enforce this constraint.<br />
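The Recall@K definition above can be written as a short function. The triplet representation as (subject, relation, object) tuples and the example predictions are illustrative assumptions, and the sketch does not implement the graph-constrained filtering step.

```python
def recall_at_k(predicted, ground_truth, k):
    """Fraction of ground-truth triplets found among the k most confident predictions.

    predicted: list of (confidence, triplet); ground_truth: set of triplets.
    """
    top_k = sorted(predicted, key=lambda p: p[0], reverse=True)[:k]
    hits = {triplet for _, triplet in top_k} & set(ground_truth)
    return len(hits) / len(ground_truth)

# Toy example with made-up confidences.
ground_truth = {("man", "on", "horse"), ("man", "wearing", "hat")}
predicted = [
    (0.9, ("man", "on", "horse")),
    (0.8, ("horse", "on", "man")),
    (0.3, ("man", "wearing", "hat")),
]
print(recall_at_k(predicted, ground_truth, 2))  # 0.5
print(recall_at_k(predicted, ground_truth, 3))  # 1.0
```

Recall is used rather than precision here because scene graph annotations are typically incomplete, so a confident prediction that is missing from the ground truth is not necessarily wrong.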
<br />
'''Models and baselines''': The authors compared variants of the GPI approach against four baselines, all of which are state-of-the-art models for scene graph sub-tasks. To maintain consistency, all models used the same training/testing data split, in addition to the preprocessing as per (Xu et al., 2017).<br />
<br />
'''Baselines from existing state-of-the-art models'''<br />
# (Lu et al., 2016): use of word embeddings to fine-tune the likelihood of predicted relations<br />
# (Xu et al., 2017): message passing algorithm between entities and relations to iteratively improve feature map for prediction<br />
# (Newell & Deng, 2017): Pixel2Graph, uses associative embeddings to produce a full graph from image<br />
# (Zellers et al., 2017): NeuralMotif method, encodes global context to capture higher-order motif in scene graphs; Baseline outputs entities and relations distributions without using global context<br />
<br />
'''GPI models'''<br />
# '''GPI with no attention mechanism''': simply following Theorem 1's functional form, with summation over features<br />
# '''GPI NeighborAttention''': same GPI model, but considers attention over neighbours features<br />
# '''GPI Linguistic''': similar to NeighborAttention model, but concatenates word embedding vectors<br />
<br />
'''Key Results''': The GPI Linguistic approach outperforms all baselines on SGCls, and has similar performance to the state-of-the-art NeuralMotif method. The authors argue that PredCls is an easier task with less structure, yielding high performance for the existing state-of-the-art models.<br />
<br />
[[File:GPI_table_results.png|700px|center]]<br />
<br />
=Conclusion=<br />
<br />
This paper presented a deep learning approach to structured prediction that constrains the architecture to be invariant to structurally identical inputs. This approach relies on pairwise features, which are capable of describing inter-label correlations, and inherits the intuitive aspect of score-based approaches. The output produced is invariant to equivalent representations of the pairwise terms.<br />
<br />
As future work, the axiomatic approach can be extended; for example, in image labeling, geometric invariances such as invariance to shifts or rotations may be desired (and in other cases invariance to feature permutations may be desired). Additionally, exploring algorithms that discover symmetries for deep structured prediction, when the invariant structure is unknown and must be learned from data, is also an interesting extension of this work.<br />
<br />
=Critique=<br />
The paper's contribution comes from the novelty of permutation invariance as a design guideline for structured prediction. Although not explicitly considered in many of the previous works, the idea of invariance in architecture has already been considered in Deep Sets by (Zaheer et al., 2017). This paper relaxes the invariance condition compared to that of previous works. In the evaluation of the benefit of GPI models, the paper used a synthetic problem to illustrate the fact that far fewer samples are required for the GPI model to converge to 100% accuracy. However, on the true task of scene graph prediction, the GPI variants had only marginally higher Recall@K scores than the state-of-the-art baselines. The true benefit of this paper's discovery is the avoidance of maximizing a score function (which leads to a computationally difficult problem), and instead directly producing output invariant to how we represent the pairwise terms.<br />
<br />
=References=<br />
Roei Herzig, Moshiko Raboh, Gal Chechik, Jonathan Berant, Amir Globerson, Mapping Images to Scene Graphs with Permutation-Invariant Structured Prediction, 2018.<br />
<br />
Additional resources from Moshiko Raboh's [https://github.com/shikorab/SceneGraph GitHub]</div>Mpaflahttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Learning_to_Navigate_in_Cities_Without_a_Map&diff=40687Learning to Navigate in Cities Without a Map2018-11-21T15:27:00Z<p>Mpafla: /* Goal-dependent Actor-Critic Reinforcement Learning */</p>
<hr />
<div>Paper: <br />
Learning to Navigate in Cities Without a Map[https://arxiv.org/pdf/1804.00168.pdf]<br />
A video of the paper is available here[https://sites.google.com/view/streetlearn].<br />
<br />
== Introduction ==<br />
Navigation is an attractive topic in many research disciplines and technology related domains such as neuroscience and robotics. The majority of algorithms are based on the following steps.<br />
<br />
1. Building an explicit map<br />
<br />
2. Planning and acting using that map. <br />
<br />
In this article, motivated by the fact that humans can learn to navigate through cities without special tools such as maps or GPS, the authors propose new methods to show that a neural network agent can do the same using only visual observations. To do so, they design an interactive environment built from Google StreetView images, together with a dual pathway agent architecture. As shown in Figure 1, parts of the environment are built using Google StreetView images of New York City (Times Square, Central Park) and London (St. Paul’s Cathedral). The green cone represents the agent’s location and orientation. Although learning to navigate from visual input using deep reinforcement learning (RL) has been shown to be successful in some domains such as games and simulated environments, it suffers from data inefficiency and sensitivity to changes in the environment. Thus, it is unclear whether this method could be used for large-scale navigation, which is the question investigated in this paper.<br />
[[File:figure1-soroush.png|600px|thumb|center|Figure 1. Our environment is built of real-world places from StreetView. The figure shows diverse views and corresponding local maps in New York City (Times Square, Central Park) and London (St. Paul’s Cathedral). The green cone represents the agent’s location and orientation.]]<br />
<br />
==Contribution==<br />
This paper has made the following contributions:<br />
<br />
1. Designing a dual pathway agent architecture. This agent can navigate through a real city and is trained with end-to-end reinforcement learning to handle real-world navigation.<br />
<br />
2. Using Goal-dependent learning. This means that the policy and value functions must adapt themselves to a sequence of goals that are provided as input.<br />
<br />
3. Leveraging a recurrent neural architecture. This not only makes navigation through a city possible, but also makes the model scalable to navigation in new cities. The architecture supports both locale-specific learning and general, transferable navigation. The authors achieve this by separating out a recurrent neural pathway that receives and interprets the current goal and also encapsulates and memorizes the features of a single region.<br />
<br />
4. Using a new environment built on top of Google StreetView images, which provides real-world images for the agent’s observations. In this environment, the agent can navigate from an arbitrary starting point to a goal, then to another goal, and so on. London, Paris, and New York City are chosen for navigation.<br />
<br />
==Related Work==<br />
<br />
1. Localization from real-world imagery. For example, in (Weyand et al., 2016), a CNN achieved excellent results on a geolocation task. This paper differs by not relying on supervised training with ground-truth labels and by including planning as a goal.<br />
<br />
2. Deep RL methods for navigation. For instance, (Mirowski et al., 2016; Jaderberg et al., 2016) used self-supervised auxiliary tasks to produce visual navigation in several created mazes. This paper makes use of real-world data, in contrast to many related papers in this area.<br />
<br />
3. Deep RL for path planning and mapping. For example, (Zhang et al., 2017) created an agent that represented a global map.<br />
<br />
==Environment==<br />
Google StreetView consists of both high-resolution 360-degree imagery and graph connectivity, and it provides a public API. These features make it a valuable resource. In this work, large areas of New York, Paris, and London are chosen (Figure 2). They contain between 7,000 and 65,500 nodes (and between 7,200 and 128,600 edges, respectively), have a mean node spacing of 10m, and cover a range of up to 5km. The underlying connections are not simplified, so there are many areas 'congested' with nodes, occlusions, available footpaths, etc. The agent only sees the RGB images that are visible in StreetView (Figure 1) and is not aware of the underlying graph.<br />
<br />
[[File:figure2-soroush.png|700px|thumb|center|Figure 2. Map of the 5 environments in New York City; our experiments focus on the NYU area as well as on transfer learning from the other areas to Wall Street (see Section 5.3). In the zoomed in area, each green dot corresponds to a unique panorama, the goal is marked in blue, and landmark locations are marked with red pins.]]<br />
<br />
==Agent Interface and the Courier Task==<br />
In an RL environment, we need to define observations and actions in addition to tasks. The inputs to the agent are the image <math>x_t</math> and the goal <math>g_t</math>. A first-person view of the 3D environment is simulated by cropping <math>x_t</math> to a 60-degree square RGB image that is scaled to 84×84 pixels. The action space consists of 5 movements: “slow” rotate left or right (±22.5°), “fast” rotate left or right (±67.5°), or move forward (implemented as a ''noop'' in the case where this is not a viable action).<br />
<br />
There are many ways to specify the goal to the agent. In this paper, the current goal is represented in terms of its proximity to a set <math>L</math> of fixed landmarks <math> L=\{(Lat_k, Long_k)\}_k</math>, which are specified in the latitude and longitude coordinate system. Given the distances <math>\{d_{t,k}^g\}_k</math> from the goal to each landmark, the <math>i</math>-th entry of the goal vector is <math> g_{t,i}=\tfrac{\exp(-\alpha d_{t,i}^g)}{\sum_k \exp(-\alpha d_{t,k}^g)} </math> with <math>\alpha=0.002</math> (Figure 3).<br />
<br />
[[File:figure3-soroush.PNG|400px|thumb|center|Figure 3. We illustrate the goal description by showing a goal and a set of 5 landmarks that are nearby, plus 4 that are more distant. The code <math>g_i</math> is a vector with a softmax-normalised distance to each landmark.]]<br />
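<br />
The goal vector can be sketched in a few lines of NumPy. This is only an illustration: it assumes that the goal and the landmarks are already given as planar coordinates in metres (the paper uses latitude and longitude), and the names <code>goal_vector</code>, <code>goal_xy</code> and <code>landmarks_xy</code> are hypothetical.<br />

```python
import numpy as np

def goal_vector(goal_xy, landmarks_xy, alpha=0.002):
    """Softmax over negative scaled distances from the goal to each landmark.

    goal_xy: shape (2,), goal position in metres (assumed planar coordinates).
    landmarks_xy: shape (K, 2), landmark positions in metres.
    """
    d = np.linalg.norm(landmarks_xy - goal_xy, axis=1)  # distance to each landmark
    logits = -alpha * d
    logits = logits - logits.max()  # subtract the max for numerical stability
    w = np.exp(logits)
    return w / w.sum()

# Toy example with three made-up landmarks.
landmarks = np.array([[0.0, 0.0], [1000.0, 0.0], [0.0, 3000.0]])
g = goal_vector(np.array([100.0, 0.0]), landmarks)
```

Nearby landmarks receive most of the weight, and the entries always sum to 1, so the code stays on the same scale regardless of how many landmarks are used.<br />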
<br />
This form of representation has several advantages: <br />
<br />
1. It could easily be extended to new environments.<br />
<br />
2. It is intuitive. Even humans and animals use landmarks to be able to move from one place to another.<br />
<br />
3. It does not rely on arbitrary map coordinates, and provides an absolute (as opposed to relative) goal.<br />
<br />
In this work, 644 landmarks for New York, Paris, and London are manually defined. The courier task is the problem of navigating to a list of random locations within a city. Each episode lasts 1000 agent steps, and the agent starts from a random place with a random orientation. When the agent gets within 100 meters of the goal, the next goal is randomly chosen. The reward for reaching a goal is proportional to the length of the shortest path between the agent and the goal at the moment the goal is first assigned, so longer journeys earn more reward. Thus, to solve the courier task, the agent needs to learn the mapping between the images observed at the goal location and the goal vector. Furthermore, the agent must learn the association between the images observed at its current location and the policy to reach the goal destination.<br />
<br />
==Methods==<br />
<br />
===Goal-dependent Actor-Critic Reinforcement Learning===<br />
In this paper, the learning problem is formulated as a Markov Decision Process with state space <math>S</math>, action space <math>A</math>, environment <math>\mathcal{E}</math>, and a set of possible goals <math>G</math>. The reward function depends on the current goal and state: <math>R: S \times G \times A \rightarrow \mathbb{R}</math>. The objective is to maximize the expected sum of discounted rewards starting from state <math>s_0</math> with discount <math>\gamma</math>. The expected return from <math>s_t</math> also depends on the goals that are sampled. So, the policy and value functions are as follows.<br />
<br />
\begin{align}<br />
\pi(\alpha|s,g)=Pr(\alpha_t=\alpha|s_t=s, g_t=g)<br />
\end{align}<br />
<br />
\begin{align}<br />
V^\pi(s,g)=E[R_t]=E[\sum_{k=0}^\infty \gamma^k r_{t+k}|s_t=s, g_t=g]<br />
\end{align}<br />
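<br />
As a small worked example, the discounted sum inside the value function can be computed for one finite reward sequence as follows. This is a sketch, not the authors' code: the function name and the default discount of 0.99 are assumptions made for illustration.<br />

```python
def discounted_return(rewards, gamma=0.99):
    """Return sum_k gamma**k * rewards[k], accumulated from the last reward backwards."""
    total = 0.0
    for r in reversed(rewards):
        total = r + gamma * total
    return total

# A single reward of 1 that arrives two steps in the future, with discount 0.5.
R = discounted_return([0.0, 0.0, 1.0], gamma=0.5)
```

Here the reward is discounted twice, so <code>R</code> equals 0.5 × 0.5 × 1 = 0.25. The value function is the expectation of this quantity over trajectories and sampled goals.<br />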
<br />
Also, an architecture with multiple pathways is designed to support the two types of learning that are required for this problem. First, an agent needs an internal representation which is general and gives an understanding of a scene. Second, the agent needs to remember the unique features of each scene, which help it organize and recall the scenes.<br />
<br />
===Architectures===<br />
<br />
[[File:figure4-soroush.png|400px|thumb|center|Figure 4. Comparison of architectures. Left: GoalNav is a convolutional encoder plus policy LSTM with goal description input. Middle: CityNav is a single-city navigation architecture with a separate goal LSTM and optional auxiliary heading (θ). Right: MultiCityNav is a multi-city architecture with individual goal LSTM pathways for each city.]]<br />
<br />
The agent takes image pixels as input. These pixels are passed through a convolutional network. The output of the convolutional network is fed to a Long Short-Term Memory (LSTM), together with the past reward <math>r_{t-1}</math> and the previous action <math>\alpha_{t-1}</math>.<br />
<br />
Three different architectures are described below.<br />
<br />
The '''GoalNav''' architecture (Fig. 4a) consists of a convolutional architecture and a policy LSTM. The goal description <math>g_t</math>, the previous action, and the reward are the inputs of this LSTM.<br />
<br />
The '''CityNav''' architecture (Fig. 4b) consists of the previous architecture alongside an additional LSTM, called the goal LSTM. Inputs of this LSTM are visual features and the goal description. The CityNav agent also adds an auxiliary heading (θ) prediction task which is defined as an angle between the north direction and the agent’s pose. This auxiliary task can speed up learning and provides relevant information. <br />
<br />
The '''MultiCityNav''' architecture (Fig. 4c) is an extension of CityNav for learning in different cities. It connects one goal LSTM per city in parallel, each encapsulating the locale-specific features of its city. Moreover, the convolutional architecture and the policy LSTM become general after training on a number of cities, so only a new goal LSTM needs to be trained for a new city.<br />
<br />
===Curriculum Learning===<br />
In curriculum learning, the model is first trained on simple examples. As soon as the model learns those examples, more complex and difficult examples are fed to the model. In this paper, this approach is used to teach the agent to navigate to increasingly distant destinations. The courier task suffers from a common problem of RL tasks, namely very sparse rewards. To overcome this problem, a natural curriculum scheme is defined in which each new goal is initially sampled within 500m of the agent’s position. The maximum range then increases gradually to cover the full range (3.5km in the smaller New York areas, or 5km for central London or Downtown Manhattan).<br />
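<br />
One way to picture such a curriculum is a schedule for the maximum goal distance. The sketch below assumes a linear increase after a fixed first phase; the summary above only says that the range grows gradually, so the linear shape and the parameters <code>phase1_steps</code> and <code>phase2_steps</code> are hypothetical.<br />

```python
def max_goal_range(step, phase1_steps, phase2_steps, start_m=500.0, full_m=5000.0):
    """Maximum distance (in metres) at which a new goal may be sampled.

    Phase 1: goals stay within start_m of the agent.
    Phase 2: the limit grows linearly (an assumption) until it reaches full_m.
    """
    if step < phase1_steps:
        return start_m
    frac = min(1.0, (step - phase1_steps) / phase2_steps)
    return start_m + frac * (full_m - start_m)

# Halfway through phase 2 the limit is halfway between 500m and 5000m.
halfway = max_goal_range(150, phase1_steps=100, phase2_steps=100)
```

A goal sampler would then reject or resample candidate goals that lie farther from the agent than the current limit.<br />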
<br />
==Results==<br />
In this section, the performance of the proposed architectures on the courier task is shown.<br />
<br />
[[File:figure5-2.png|600px|thumb|center|Figure 5. Average per-episode goal rewards (y-axis) are plotted vs. learning steps (x-axis) for the courier task in the NYU (New York City) environment (top), and in central London (bottom). We compare the GoalNav agent, the CityNav agent, and the CityNav agent without skip connection on the NYU environment, and the CityNav agent in London. We also compare the Oracle performance and a Heuristic agent, described below. The London agents were trained with a 2-phase curriculum; we indicate the end of phase 1 (500m only) and the end of phase 2 (500m to 5000m). Results on the Rive Gauche part of Paris (trained in the same way as in London) are comparable and the agent achieved mean goal reward 426.]]<br />
<br />
It is first shown that the CityNav agent, trained with curriculum learning, succeeds in learning the courier task in New York, London and Paris. Figure 5 compares the following agents:<br />
<br />
1. The GoalNav agent.<br />
<br />
2. The CityNav agent.<br />
<br />
3. A CityNav agent without the skip connection from the vision layers to the policy LSTM. This variant is needed to regularise the interface between the goal LSTM and the policy LSTM in the multi-city transfer scenario.<br />
<br />
Also, a lower bound (Heuristic) and an upper bound (Oracle) on the performance are considered. As stated in the paper: "Heuristic is a random walk on the street graph, where the agent turns in a random direction if it cannot move forward; if at an intersection it will turn with a probability <math>P=0.95</math>. Oracle uses the full graph to compute the optimal path using breadth-first search." As is clear from Figure 5, the CityNav architecture attains a higher performance and is more stable than the simpler GoalNav agent.<br />
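<br />
The Oracle's path computation can be illustrated with a plain breadth-first search. This is a minimal sketch on a made-up graph, not the authors' implementation: the adjacency-dictionary format and the node names are assumptions, and BFS minimises the number of hops, which matches path length only when edges have similar lengths.<br />

```python
from collections import deque

def bfs_shortest_path(graph, start, goal):
    """graph: dict mapping a node to a list of neighbours.

    Returns the list of nodes on a path with the fewest edges, or None.
    """
    parents = {start: None}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if node == goal:
            path = []
            while node is not None:  # walk back to the start via parent links
                path.append(node)
                node = parents[node]
            return path[::-1]
        for nb in graph.get(node, []):
            if nb not in parents:  # visit each node at most once
                parents[nb] = node
                queue.append(nb)
    return None

# Hypothetical five-node street graph.
street_graph = {
    "A": ["B", "C"],
    "B": ["A", "D"],
    "C": ["A", "D"],
    "D": ["B", "C", "E"],
    "E": ["D"],
}
path = bfs_shortest_path(street_graph, "A", "E")
```

On this toy graph the search returns a three-edge path from <code>A</code> to <code>E</code>. The trained agents never see the graph, which is why the Oracle serves as an upper bound.<br />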
<br />
The trajectories of the trained agent over two 1000-step episodes and the value function of the agent during navigation to a destination are shown in Figure 6.<br />
<br />
[[File:figure6-soroush.png|400px|thumb|center|Figure 6. Trained CityNav agent’s performance in two environments: Central London (left panes), and NYU (right panes). Top: examples of the agent’s trajectory during one 1000-step episode, showing successful consecutive goal acquisitions. The arrows show the direction of travel of the agent. Bottom: We visualize the value function of the agent during 100 trajectories with random starting points and the same goal (respectively St Paul’s Cathedral and Washington Square). Thicker and warmer color lines correspond to higher value functions.]]<br />
<br />
Figure 7 shows that the agent successfully learns a navigation policy for reaching St Paul’s Cathedral in London and Washington Square in New York.<br />
[[File:figure7-soroush.png|400px|thumb|center|Figure 7. Number of steps required for the CityNav agent to reach a goal (Washington Square in New York or St Paul’s Cathedral in London) from 100 start locations vs. the straight-line distance to the goal in meters. One agent step corresponds to a forward movement of about 10m or a left/right turn by 22.5 or 67.5 degrees.]]<br />
<br />
A critical test for this article is to transfer the model to new cities by learning a new set of landmarks, but without re-learning the visual representation, behaviors, etc. Therefore, the MultiCityNav agent is first trained on a number of cities. Then both the policy LSTM and the convolutional encoder are frozen, and a new locale-specific goal LSTM is trained. The performance is compared using three different training regimes, illustrated in Fig. 9: training on only the target city (single training); training on multiple cities, including the target city, together (joint training); and joint training on all but the target city, followed by training on the target city with the rest of the architecture frozen (pre-train and transfer). Figure 10 shows that transferring to other cities is possible and that pre-training on more cities increases the effectiveness of the transfer. According to the paper: "Remarkably, the agent that is pre-trained on 4 regions and then transferred to Wall Street achieves comparable performance to an agent trained jointly on all the regions, and only slightly worse than single-city training on Wall Street alone". The skip connection is useful when training in a single city, but not in multi-city transfer.<br />
[[File:figure9-soroush.png|400px|thumb|center|Figure 9. Illustration of training regimes: (a) training on a single city (equivalent to CityNav); (b) joint training over multiple cities with a dedicated per-city pathway and shared convolutional net and policy LSTM; (c) joint pre-training on a number of cities followed by training on a target city with convolutional net and policy LSTM frozen (only the target city pathway is optimized).]]<br />
[[File:figure10-soroush.png|400px|thumb|center|Figure 10. Joint multi-city training and transfer learning performance of variants of the MultiCityNav agent evaluated only on the target city (Wall Street). We compare single-city training on the target environment alone vs. joint training on multiple cities (3, 4, or 5-way joint training including Wall Street), vs. pre-training on multiple cities and then transferring to Wall Street while freezing the entire agent except for the new pathway (see Fig. 9). One variant has skip connections between the convolutional encoder and the policy LSTM, the other does not (no-skip).]]<br />
<br />
The article also investigates giving early rewards before the agent reaches the goal and adding random rewards (coins) to encourage exploration. Figure 11a suggests that coins by themselves are ineffective, as the task does not benefit from wide exploration. Based on the results, the authors chose to start sampling the goal within a radius of 500m from the agent’s location and then progressively extend it to the maximum distance an agent could travel within the environment. In addition, to assess the importance of goal conditioning, a goal-less CityNav agent is trained by removing the inputs <math>g_t</math>. The poor performance of this agent is clear in Figure 11b. Furthermore, as Figure 11b shows, reducing the density of the landmarks to 50%, 25%, and 12.5% does not reduce the performance much. Finally, some alternatives for the goal representation are investigated:<br />
<br />
a) Latitude and longitude scalar coordinates normalized to be between 0 and 1.<br />
<br />
b) Binned representation. <br />
<br />
The latitude and longitude scalar goal representations perform the best. However, since the landmark-based representation performs well while remaining independent of the coordinate system, the authors use it as the canonical one.<br />
<br />
[[File:figure11-soroush.PNG|300px|thumb|center|Figure 11. Top: Learning curves of the CityNav agent on NYU, comparing reward shaping with different radii of early rewards (ER) vs. ER with random coins vs. curriculum learning with ER 200m and no coins (ER 200m, Curr.). Bottom: Learning curves for CityNav agents with different goal representations: landmark-based, as well as latitude and longitude classification-based and regression-based.]]<br />
<br />
==Conclusion==<br />
In this paper, a deep reinforcement learning approach that enables navigation in cities is presented. Furthermore, a new courier task and a multi-city neural network agent architecture that can be transferred to new cities are discussed.<br />
<br />
==Critique==<br />
1. It is not clear how this model is applicable in the real world. A real-world navigation problem requires detecting objects, people, and cars, but it is not clear whether the authors model them. From what I understood, they did not account for collisions, which contradicts their claim that this is a real-world problem.<br />
<br />
2. This paper uses only static Google StreetView images as its primary source of data. For a realistic model of navigation in the world, the authors should at least complement this with dynamic data such as traffic and road blockage information.<br />
<br />
3. The 'Transfer in Multi-City Experiments' results could be strengthened significantly by cross-validation (only Wall Street, which covers the smallest area of the four regions, is used as the test case). Additionally, the results do not show true 'multi-city' transfer learning, since all regions are within New York City. It is stated in the paper that not having to re-learn visual representations when transferring between cities is one of the outcomes, but the tests do not actually check for this. There are likely significant differences in the features that would be learned in NYC vs. Waterloo, for example, and this type of transfer has not been evaluated.</div>Mpaflahttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Obfuscated_Gradients_Give_a_False_Sense_of_Security_Circumventing_Defenses_to_Adversarial_Examples&diff=39776Obfuscated Gradients Give a False Sense of Security Circumventing Defenses to Adversarial Examples2018-11-18T14:44:14Z<p>Mpafla: /* Conclusion */</p>
<hr />
<div>= Introduction =<br />
Over the past few years, neural network models have been the source of major breakthroughs in a variety of computer vision problems. However, these networks have been shown to be susceptible to adversarial attacks. In these attacks, small, humanly imperceptible changes are made to correctly classified images, causing the models to misclassify them with high confidence. These attacks pose a major threat that needs to be addressed before these systems can be deployed on a large scale, especially in safety-critical scenarios. <br />
<br />
The seriousness of this threat has generated major interest in both the design of such attacks and defenses against them. In this paper, the authors identify a common technique employed by several recently proposed defenses and design a set of attacks that can be used to overcome them. The use of this technique, masking gradients, is so prevalent that 7 out of the 9 defenses proposed in the ICLR 2018 conference employed it. The authors were able to circumvent the proposed defenses and successfully brought down the accuracy of their models to below 10%.<br />
Their reimplementation of each of the defenses and implementations of the attacks are available [https://github.com/anishathalye/obfuscated-gradients here].<br />
<br />
= Methodology =<br />
<br />
The paper assumes a lot of familiarity with adversarial attack literature. The section below briefly explains some key concepts.<br />
<br />
== Background ==<br />
<br />
==== Adversarial Images Mathematically ====<br />
Given an image <math>x</math> and a classifier <math>f(x)</math>, an adversarial image <math>x'</math> satisfies two properties:<br />
# <math>D(x,x') < \epsilon </math><br />
# <math>c(x') \neq c^*(x) </math><br />
<br />
Where <math>D</math> is some distance metric, <math>\epsilon </math> is a small constant, <math>c(x')</math> is the output ''class'' predicted by the model, and <math>c^*(x)</math> is the true class for input x. In words, the adversarial image is a small distance from the original image, but the classifier classifies it incorrectly.<br />
<br />
==== Adversarial Attacks Terminology ====<br />
#Adversarial attacks can be either '''black''' or '''white-box'''. In black box attacks, the attacker has access to the network output only, while white-box attackers have full access to the network, including its gradients, architecture and weights. This makes white-box attackers much more powerful. Given access to gradients, white-box attacks use back propagation to modify inputs (as opposed to the weights) with respect to the loss function.<br />
#In '''untargeted''' attacks, the objective is to ''maximize'' the loss of the true class, <math>x'=x \mathbf{+} \lambda(sign(\nabla_xL(x,c^*(x))))</math>, while in '''targeted''' attacks, the objective is to ''minimize the loss for a target class'' <math>c^t(x)</math> that is different from the true class, <math>x'=x \mathbf{-} \lambda(sign(\nabla_xL(x,c^t(x))))</math>. Here, <math>\nabla_xL()</math> is the gradient of the loss function with respect to the input, <math>\lambda</math> is a small step size and <math>sign()</math> is the element-wise sign of the gradient.<br />
# An attacker may be allowed to use a single step of back-propagation ('''single step''') or multiple ('''iterative''') steps. Iterative attackers can generate more powerful adversarial images. Typically, to bound iterative attackers a distance measure is used.<br />
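<br />
The two single-step updates can be sketched on a toy linear softmax classifier, for which the input gradient has a closed form. Everything here is an illustrative assumption (the random weights <code>W</code>, the input <code>x</code>, the step size <code>lam</code>); real attacks obtain the gradient by back-propagation through the full network.<br />

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def loss_and_grad(W, x, label):
    """Cross-entropy loss of a linear softmax classifier and its gradient w.r.t. the input x."""
    p = softmax(W @ x)
    onehot = np.zeros_like(p)
    onehot[label] = 1.0
    loss = -np.log(p[label])
    grad_x = W.T @ (p - onehot)
    return loss, grad_x

def untargeted_step(W, x, true_label, lam):
    _, g = loss_and_grad(W, x, true_label)
    return x + lam * np.sign(g)  # move up the loss of the true class

def targeted_step(W, x, target_label, lam):
    _, g = loss_and_grad(W, x, target_label)
    return x - lam * np.sign(g)  # move down the loss of the target class

rng = np.random.default_rng(0)
W = rng.normal(size=(3, 5))          # toy 3-class model on 5 input features
x = rng.normal(size=5)
true_label = int(np.argmax(W @ x))   # treat the current prediction as the true class
target_label = (true_label + 1) % 3
x_untargeted = untargeted_step(W, x, true_label, lam=0.1)
x_targeted = targeted_step(W, x, target_label, lam=0.1)
```

Because only the sign of the gradient is used, every input coordinate moves by at most <code>lam</code>, which is what keeps a single step inside an <math>\ell_{\infty}</math> budget.<br />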
<br />
In this paper the authors focus on the more difficult attacks: white-box iterative targeted and untargeted attacks.<br />
<br />
== Obfuscated Gradients ==<br />
As gradients are used in the generation of white-box adversarial images, many defense strategies have focused on methods that mask gradients. If gradients are masked, they cannot be followed to generate adversarial images. The authors argue against this general approach by showing that it can be easily circumvented. To emphasize their point, they looked at white-box defenses proposed in ICLR 2018. Three types of gradient masking techniques were found:<br />
<br />
# '''Shattered gradients''': Non-differentiable operations are introduced into the model, causing a gradient to be nonexistent or incorrect.<br />
# '''Stochastic gradients''': A stochastic process is added into the model at test time, causing the gradients to become randomized.<br />
# '''Vanishing gradients''': Very deep neural networks or those with recurrent connections are used. Because of the vanishing or exploding gradient problem common in these deep networks, effective gradients at the input are small and not very useful.<br />
<br />
== The Attacks ==<br />
To circumvent these gradient masking techniques, the authors propose:<br />
# '''Backward Pass Differentiable Approximation (BPDA)''': For defenses that introduce non-differentiable components, the authors replace each such component with an approximate function that is differentiable on the backward pass. In a white-box setting, the attacker has full access to any added non-linear transformation and can find its approximation. <br />
# '''Expectation over Transformation [Athalye, 2017]''': For defenses that add some form of test time randomness, the authors propose to use expectation over transformation technique in the backward pass. Rather than moving along the gradient every step, several gradients are sampled and the step is taken in the average direction. This can help with any stochastic misdirection from individual gradients. The technique is similar to using mini-batch gradient descent but applied in the construction of adversarial images.<br />
# '''Re-parameterize the exploration space''': For very deep networks that rely on vanishing or exploding gradients, the authors propose to re-parameterize and search over the range where the gradient does not explode/vanish.<br />
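<br />
Expectation over Transformation can be illustrated with a toy randomized defense that adds zero-mean Gaussian noise to the input before a linear model with squared loss. The model, the noise level <code>sigma</code> and all names below are assumptions chosen so that the expected gradient has a closed form to compare against.<br />

```python
import numpy as np

def randomized_loss_grad(x, w, y, noise):
    """Gradient w.r.t. x of (w.(x + noise) - y)**2 for one draw of the defense's noise."""
    residual = w @ (x + noise) - y
    return 2.0 * residual * w

def eot_gradient(x, w, y, sigma, n_samples, rng):
    """Average the input gradient over many samples of the defense's randomness."""
    grads = [
        randomized_loss_grad(x, w, y, rng.normal(scale=sigma, size=x.shape))
        for _ in range(n_samples)
    ]
    return np.mean(grads, axis=0)

rng = np.random.default_rng(0)
w = np.array([1.0, -2.0, 0.5])
x = np.array([0.2, 0.1, -0.3])
y = 1.0
g_eot = eot_gradient(x, w, y, sigma=0.5, n_samples=20000, rng=rng)
# Gradient of the expected loss; the noise has zero mean, so it drops out.
g_expected = 2.0 * (w @ x - y) * w
```

A single sampled gradient can point in a misleading direction, while the average converges to the gradient of the expected loss, which is the direction the attacker actually wants to follow.<br />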
<br />
= Main Results =<br />
[[File:Summary_Table.png|600px|center]]<br />
<br />
The table above summarizes the results of their attacks. Attacks are mounted on the same dataset each defense targeted. If multiple datasets were used, attacks were performed on the largest one. Two different distance metrics (<math>\ell_{\infty}</math> and <math>\ell_{2}</math>) were used in the construction of adversarial images. Distance metrics specify how much an adversarial image can vary from an original image. For <math>\ell_{\infty}</math> adversarial images, each pixel is allowed to vary by a maximum amount. For example, <math>\ell_{\infty}=0.031</math> specifies that each pixel can vary by about <math>256 \times 0.031 \approx 8</math> intensity levels from its original value. <math>\ell_{2}</math> distances specify the magnitude of the total distortion allowed over all pixels. For MNIST and CIFAR-10, untargeted adversarial images were constructed using the entire test set, while for Imagenet, 1000 test images were randomly selected and used to generate targeted adversarial images. <br />
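<br />
The two distance measures, and the clipping that keeps an adversarial image inside an <math>\ell_{\infty}</math> budget, can be written down directly. This sketch assumes pixel values scaled to [0, 1]; the function names and the tiny 2×2 "image" are made up for illustration.<br />

```python
import numpy as np

def linf_distance(x, x_adv):
    """Largest change applied to any single pixel."""
    return float(np.max(np.abs(x_adv - x)))

def l2_distance(x, x_adv):
    """Euclidean size of the total distortion over all pixels."""
    return float(np.sqrt(np.sum((x_adv - x) ** 2)))

def project_linf(x, x_adv, eps):
    """Clip x_adv so every pixel is within eps of x, then keep pixels in [0, 1]."""
    return np.clip(np.clip(x_adv, x - eps, x + eps), 0.0, 1.0)

x = np.full((2, 2), 0.5)
x_adv = x + np.array([[0.1, -0.02], [0.0, 0.05]])
x_proj = project_linf(x, x_adv, eps=0.031)
```

After projection, the two pixels that moved by more than 0.031 are pulled back to the boundary of the budget, while the others are left unchanged.<br />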
<br />
Standard models were used in evaluating the accuracy of defense strategies under the attacks:<br />
# MNIST: 5-layer Convolutional Neural Network (99.3% top-1 accuracy)<br />
# CIFAR-10: Wide-Resnet (95.0% top-1 accuracy)<br />
# Imagenet: InceptionV3 (78.0% top-1 accuracy)<br />
<br />
The last column shows the accuracies each defense method achieved over the adversarial test set. Except for [Madry, 2018], all defense methods could only achieve an accuracy of <10%. Furthermore, the accuracy of most methods was 0%. The results for [Samangouei, 2018] (double asterisk) show that the attack on this defense was not as successful. The authors claim that this is a result of implementation imperfections, but that theoretically the defense can be circumvented using their proposed method.<br />
<br />
==== The defense that worked - Adversarial Training [Madry, 2018] ====<br />
<br />
As a defense mechanism, [Madry, 2018] proposes training the neural networks with adversarial images. Although this approach was previously known [Szegedy, 2013], in their formulation the problem is set up in a more systematic way using a min-max formulation:<br />
\begin{align}<br />
\theta^* = \arg\min_{\theta} \mathbb{E}_{x} \bigg[ \max_{\delta \in [-\epsilon,\epsilon]} L(x+\delta,y;\theta) \bigg] <br />
\end{align}<br />
<br />
where <math>\theta</math> is the parameter of the model, <math>\theta^*</math> is the optimal set of parameters and <math>\delta</math> is a small perturbation to the input image <math>x</math> and is bounded by <math>[-\epsilon,\epsilon]</math>. <br />
<br />
Training proceeds in the following way. For each clean input image, a distorted version of the image is found by approximately solving the inner maximization problem for a fixed number of iterations. Gradient steps are constrained to fall within the allowed range (projected gradient descent). Next, the classification problem is solved by taking a step on the outer minimization problem using the distorted image.<br />
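<br />
The alternation between the inner and outer problems can be sketched on a toy logistic regression model. This is not the setup of [Madry, 2018], which trains deep networks on image data; the model, the synthetic data and all hyperparameters below are assumptions chosen to keep the example short.<br />

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def pgd_perturbation(w, x, y, eps, step, n_steps):
    """Inner max: search for delta in [-eps, eps]^d that increases the logistic loss."""
    delta = np.zeros_like(x)
    for _ in range(n_steps):
        grad_x = (sigmoid(w @ (x + delta)) - y) * w  # d loss / d input
        # Ascent step on the loss, followed by projection back into the box.
        delta = np.clip(delta + step * np.sign(grad_x), -eps, eps)
    return delta

def adversarial_training(X, Y, eps=0.1, lr=0.5, epochs=50):
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        for x, y in zip(X, Y):
            delta = pgd_perturbation(w, x, y, eps, step=eps / 4, n_steps=8)
            # Outer min: gradient descent on the weights, using the perturbed input.
            grad_w = (sigmoid(w @ (x + delta)) - y) * (x + delta)
            w -= lr * grad_w
    return w

# Two well-separated synthetic classes in 2D.
rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal(loc=+1.0, scale=0.3, size=(20, 2)),
    rng.normal(loc=-1.0, scale=0.3, size=(20, 2)),
])
Y = np.array([1.0] * 20 + [0.0] * 20)
w_robust = adversarial_training(X, Y)
```

The model only ever sees worst-case versions of its training points, so the decision boundary it learns has to leave a margin of roughly <code>eps</code> around them.<br />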
<br />
This approach was shown to provide resilience to the white-box attacks considered in this paper.<br />
<br />
==== How to check for Obfuscated Gradients ====<br />
For future defense proposals, it is recommended to avoid using masked gradients. To assist with this, the authors propose a set of warning signs that can help identify whether a defense is relying on masked gradients:<br />
# Weaker one-step attacks perform better than iterative attacks.<br />
# Black-box attacks find stronger adversarial images than white-box attacks.<br />
# Unbounded iterative attacks do not reach 100% success.<br />
# Random brute-force search is better than gradient-based methods at finding adversarial images.<br />
<br />
==== Recommendations for future defense methods to encourage reproducibility ====<br />
<br />
= Detailed Results =<br />
<br />
== Non-obfuscated Gradients ==<br />
==== Adversarial Training [Madry 2018] ====<br />
'''Defense''': Proposed by Goodfellow et al. (2014b), adversarial training solves a min-max game through a conceptually simple process: train on adversarial examples until the model learns to classify them correctly. The authors study the adversarial training approach of Madry et al. (2018), which for a given <math>\epsilon</math>-ball solves <br />
<br />
<div style="text-align: center;font-size:100%">[[File:sb.png]]</div><br />
<br />
which the authors of the original paper solve by generating adversarial examples for the inner maximization problem using projected gradient descent. The authors' experiments were not able to invalidate the claims of that paper. The authors also mention that this approach does not cause obfuscated gradients, and that the original authors’ evaluation of this defense performs all of the tests for characteristic behaviours of obfuscated gradients listed in this paper. Also, the authors note that (1) adversarial retraining has been shown to be difficult at a large scale like ImageNet, and (2) training exclusively on <math> l_\infty</math> adversarial examples provides only limited robustness to adversarial examples under other distortion metrics.<br />
<br />
==== Cascade Adversarial Training [Na 2018] ====<br />
'''Defense''': The cascade adversarial machine learning approach is similar to the adversarial training approach mentioned above. The main difference is that instead of using iterative methods to generate adversarial examples at each mini-batch, a model is first trained, adversarial examples are generated with iterative methods on that model and added to the training set, and then a new model is trained on the augmented dataset. Again, as above, the authors were unable to invalidate the claims made by the paper, even though the claims are a bit weaker in this case: 16% accuracy with <math>\epsilon</math> = .015, compared to over 70% at the same perturbation budget with adversarial training as in Madry et al. (2018).<br />
<br />
== Gradient Shattering ==<br />
<br />
==== Thermometer Coding, [Buckman, 2018] ====<br />
'''Defense''': Inspired by the observation that neural networks learn linear boundaries between classes [Goodfellow, 2014], [Buckman, 2018] sought to break this linearity by explicitly adding a highly non-linear transform at the input of their model. The non-linear transformation they chose was quantizing inputs to binary vectors. The quantization performed was termed thermometer encoding.<br />
<br />
Given an image, for each pixel value <math>x_{i,j,c}</math> and an <math>l</math>-dimensional thermometer code, the <math>k</math>-th bit is given by:<br />
\begin{align}<br />
\tau(x_{i,j,c})_k = \begin{cases}<br />
1 & \text{if } x_{i,j,c} > k/l \\<br />
0 & \text{otherwise}<br />
\end{cases}<br />
\end{align}<br />
Here it is assumed <math>x_{i,j,c} \in [0, 1] </math> and <math>i, j, c</math> are the row, column and channel index of the pixel respectively. This encoding is similar to one-hot encoding, except that every bit whose threshold <math>k/l</math> lies below the pixel value (not just one bit) is set to 1. This quantization technique preserves pairwise ordering between pixels.<br />
<br />
On CIFAR-10, the model achieved 50% accuracy against <math>\ell_\infty</math> adversarial images with <math>\epsilon=0.031</math>.<br />
<br />
'''Attack''': The authors attack this model using their BPDA approach. Given the non-linear transformation performed in the forward pass, <math>\tau(x)</math>, they develop a differentiable counterpart,<br />
\begin{align}<br />
\hat{\tau}(x_{i,j,c})_k = \min ( \max (x_{i,j,c} - \frac{k}{l}, 0), 1 )<br />
\end{align}<br />
and use it in place of <math>\tau(x)</math> on the backward pass. With this modification they were able to bring the accuracy of the model down to 0%.<br />
<br />
==== Input Transformation, [Guo, 2018] ====<br />
'''Defense''': [Guo, 2018] investigated the effect of different input transformations on robustness to adversarial images. In particular, they found that two techniques provided the greatest resistance: total variance minimization and image quilting. Total variance minimization removes high-frequency noise while preserving legitimate edges (good high-frequency components). In image quilting, a large database of image patches from clean images is collected. At test time, input patches that contain a lot of noise are replaced with similar but clean patches from the database.<br />
<br />
Both techniques remove perturbations from adversarial images, which provides some robustness to adversarial attacks. Moreover, both approaches are non-differentiable, which makes constructing white-box adversarial images difficult. The techniques also include test-time randomness, as the modifications made are input dependent. The best model achieved 60% accuracy on adversarial images with <math>\ell_{2}=0.05</math> perturbations.<br />
<br />
'''Attack''': The authors used the BPDA attack where the input transformations were replaced by an identity function. They were able to bring the accuracy of the model down to 0% under the same type of adversarial attacks.<br />
<br />
==== Local Intrinsic Dimensionality, [Ma, 2018] ====<br />
'''Defense''': Local intrinsic dimensionality (LID) is a distance-based metric that measures the similarity between points in a high-dimensional space. Given a set of points, let the distance between sample <math>x</math> and its <math>i</math>th nearest neighbor be <math>r_i(x)</math>. The LID under the chosen distance metric is then given by<br />
<br />
\begin{align}<br />
LID(x) = - \bigg{(} \frac{1}{k}\sum^k_{i=1}\log \frac{r_i(x)}{r_k(x)} \bigg{)}^{-1}<br />
\end{align}<br />
where <math>k</math> is the number of nearest neighbors considered and <math>r_k(x)</math> is the distance to the farthest of these <math>k</math> neighbors. <br />
<br />
First, <math>L_2</math> distances were computed for all training and adversarial images. Next, the LID scores for each training and adversarial image were calculated. It was found that LID scores for adversarial images were significantly larger than those of clean images. Based on these results, a separate classifier was created that can be used to detect adversarial inputs. [Ma, 2018] claim that this is not a defense method, but a method to study the properties of adversarial images.<br />
<br />
'''Attack''': Instead of attacking this method, the authors show that it is not able to detect, and is therefore vulnerable to, attacks of the [Carlini and Wagner, 2017a] variety.<br />
<br />
== Stochastic Gradients ==<br />
<br />
==== Stochastic Activation Pruning, [Dhillon, 2018] ====<br />
'''Defense''': [Dhillon, 2018] use test-time randomness in their model to guard against adversarial attacks. Within a layer, the activations of component nodes are randomly dropped with a probability proportional to their absolute value. The remaining activations are scaled up to preserve accuracy. This is akin to test-time dropout. This technique was found to reduce accuracy slightly on clean images, but improved performance on adversarial images.<br />
<br />
'''Attack''': The authors used the expectation over transformation attack to get useful gradients out of the model. With their attack they were able to reduce the accuracy of this method down to 0% on CIFAR-10.<br />
<br />
==== Mitigation Through Randomization, [Xie, 2018] ====<br />
'''Defense''': [Xie, 2018] add a randomization layer to their model to help defend against adversarial attacks. For an input image of size [299,299], the image is first randomly re-scaled to size <math>r \in [299,331]</math>. Next, the image is zero-padded to fix the dimension of the modified input. This modified input is then fed into a regular classifier. The authors claim that this strategy can provide an accuracy of 32.8% against ensemble attack patterns (fixed distortions, many of which are picked randomly). Because of the introduced randomness, the authors claim the model builds some robustness to other types of attacks as well.<br />
<br />
'''Attack''': The EOT method was used to build adversarial images to attack this model. With their attack, the authors were able to bring the accuracy of this model down to 0% using <math>L_{\infty}(\epsilon=0.031)</math> perturbations.<br />
<br />
== Vanishing and Exploding Gradients ==<br />
<br />
==== PixelDefend, [Song, 2018] ====<br />
'''Defense''': [Song, 2018] argue that adversarial images lie in low-probability regions of the data manifold. Therefore, one way to handle adversarial attacks is to project them back into the high-probability regions before feeding them into a classifier. They chose to do this by using a generative model (PixelCNN) in a denoising capacity. A PixelCNN model directly estimates the conditional probability of generating an image pixel by pixel [van den Oord, 2016],<br />
<br />
\begin{align}<br />
p(\mathbf{x}) = \prod_{i=1}^{n^2} p(x_i \mid x_1, \ldots, x_{i-1})<br />
\end{align}<br />
<br />
The reason for choosing this model is the long iterative process of generation. In the backward pass, following the gradient all the way to the input would not be possible because of the vanishing/exploding gradient<br />
problem of deep networks. The proposed model was able to obtain an accuracy of 46% on CIFAR-10 images with <math>l_{\infty} (\epsilon=0.031) </math> perturbations.<br />
<br />
'''Attack''': The model was attacked using the BPDA technique, where back-propagation through the PixelCNN was replaced with an identity function. With this approach, the authors were able to bring the accuracy down to 9% under the same kind of perturbations.<br />
<br />
==== Defense-GAN, [Samangouei, 2018] ====<br />
<br />
= Conclusion =<br />
In this paper, it was found that gradient masking is a common technique used by many defense proposals that claim to be robust against a very difficult class of adversarial attacks: white-box, iterative attacks. However, the authors found that these defenses can be easily circumvented. Three attack methods are presented that were able to defeat 7 out of the 9 defense proposals accepted in the 2018 ICLR conference for these types of attacks.<br />
<br />
Some future work that can come out of this paper includes avoiding reliance on obfuscated gradients for perceived robustness and using the proposed evaluation approach to detect when an attack occurs. Early categorization of attacks using supervised techniques can also help in the critical evaluation of incoming data.<br />
<br />
= Critique =<br />
# The third attack method, reparameterization of the input distortion search space, was presented very briefly and at a very high level. Moreover, the one defense proposal they chose to use it against, [Samangouei, 2018], proved to be resilient against the attack. The authors had to resort to one of their other methods to circumvent the defense.<br />
# The BPDA and reparameterization attacks require intrinsic knowledge of the networks. This information is not likely to be available to external users of a network. Most likely, the use-case for these attacks will be in-house to develop more robust networks. This also means that it is still possible to guard against adversarial attack using gradient masking techniques, provided the details of the network are kept secret.<br />
# The BPDA algorithm requires replacing a non-linear part of the model with a differentiable approximation. Since different networks are likely to use different transformations, this technique is not plug-and-play. For each network, the attack needs to be manually constructed.<br />
<br />
= Other Sources =<br />
= References =<br />
#'''[Madry, 2018]''' Madry, A., Makelov, A., Schmidt, L., Tsipras, D. and Vladu, A., 2017. Towards deep learning models resistant to adversarial attacks. arXiv preprint arXiv:1706.06083.<br />
#'''[Buckman, 2018]''' Buckman, J., Roy, A., Raffel, C. and Goodfellow, I., 2018. Thermometer encoding: One hot way to resist adversarial examples.<br />
#'''[Guo, 2018]''' Guo, C., Rana, M., Cisse, M. and van der Maaten, L., 2017. Countering adversarial images using input transformations. arXiv preprint arXiv:1711.00117.<br />
#'''[Xie, 2018]''' Xie, C., Wang, J., Zhang, Z., Ren, Z. and Yuille, A., 2017. Mitigating adversarial effects through randomization. arXiv preprint arXiv:1711.01991.<br />
#'''[Song, 2018]''' Song, Y., Kim, T., Nowozin, S., Ermon, S. and Kushman, N., 2017. Pixeldefend: Leveraging generative models to understand and defend against adversarial examples. arXiv preprint arXiv:1710.10766.<br />
#'''[Szegedy, 2013]''' Szegedy, C., Zaremba, W., Sutskever, I., Bruna, J., Erhan, D., Goodfellow, I. and Fergus, R., 2013. Intriguing properties of neural networks. arXiv preprint arXiv:1312.6199.<br />
#'''[Samangouei, 2018]''' Samangouei, P., Kabkab, M. and Chellappa, R., 2018. Defense-GAN: Protecting classifiers against adversarial attacks using generative models. arXiv preprint arXiv:1805.06605.<br />
#'''[van den Oord, 2016]''' van den Oord, A., Kalchbrenner, N., Espeholt, L., Vinyals, O. and Graves, A., 2016. Conditional image generation with pixelcnn decoders. In Advances in Neural Information Processing Systems (pp. 4790-4798).<br />
#'''[Athalye, 2017]''' Athalye, A. and Sutskever, I., 2017. Synthesizing robust adversarial examples. arXiv preprint arXiv:1707.07397.<br />
#'''[Ma, 2018]''' Ma, Xingjun, Bo Li, Yisen Wang, Sarah M. Erfani, Sudanthi Wijewickrema, Michael E. Houle, Grant Schoenebeck, Dawn Song, and James Bailey. "Characterizing adversarial subspaces using local intrinsic dimensionality." arXiv preprint arXiv:1801.02613 (2018).</div>Mpaflahttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Obfuscated_Gradients_Give_a_False_Sense_of_Security_Circumventing_Defenses_to_Adversarial_Examples&diff=39775Obfuscated Gradients Give a False Sense of Security Circumventing Defenses to Adversarial Examples2018-11-18T14:43:35Z<p>Mpafla: /* Introduction */</p>
<hr />
<div>= Introduction =<br />
Over the past few years, neural network models have been the source of major breakthroughs in a variety of computer vision problems. However, these networks have been shown to be susceptible to adversarial attacks. In these attacks, small, humanly imperceptible changes are made to correctly classified images, causing these models to misclassify them with high confidence. These attacks pose a major threat that needs to be addressed before these systems can be deployed on a large scale, especially in safety-critical scenarios. <br />
<br />
The seriousness of this threat has generated major interest in both the design of these attacks and the defense against them. In this paper, the authors identify a common technique employed by several recently proposed defenses and design a set of attacks that can be used to overcome them. The use of this technique, masking gradients, is so prevalent that 7 out of the 9 defenses proposed in the ICLR 2018 conference employed it. The authors were able to circumvent the proposed defenses and successfully brought down the accuracy of their models to below 10%.<br />
Their reimplementation of each of the defenses and implementations of the attacks are available [https://github.com/anishathalye/obfuscated-gradients here].<br />
<br />
= Methodology =<br />
<br />
The paper assumes a lot of familiarity with adversarial attack literature. The section below briefly explains some key concepts.<br />
<br />
== Background ==<br />
<br />
==== Adversarial Images Mathematically ====<br />
Given an image <math>x</math> and a classifier <math>f(x)</math>, an adversarial image <math>x'</math> satisfies two properties:<br />
# <math>D(x,x') < \epsilon </math><br />
# <math>c(x') \neq c^*(x) </math><br />
<br />
Where <math>D</math> is some distance metric, <math>\epsilon </math> is a small constant, <math>c(x')</math> is the output ''class'' predicted by the model, and <math>c^*(x)</math> is the true class for input x. In words, the adversarial image is a small distance from the original image, but the classifier classifies it incorrectly.<br />
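The two properties can be checked directly in code. The following is a minimal sketch, assuming that the distance <math>D</math> is the <math>\ell_\infty</math> norm and that <code>predict</code> is a hypothetical callable returning a class label; neither choice comes from the paper.

```python
import numpy as np

def is_adversarial(x, x_adv, true_class, predict, eps):
    """Return True if x_adv satisfies both properties of an adversarial image.

    Assumptions (not from the paper): D is the l_inf distance and
    `predict` is any callable mapping an image to a class label.
    """
    distance = np.max(np.abs(x_adv - x))           # D(x, x') under l_inf
    close_enough = distance < eps                  # property 1
    misclassified = predict(x_adv) != true_class   # property 2
    return bool(close_enough and misclassified)

# Toy classifier (hypothetical): class 1 if the mean pixel exceeds 0.5, else 0.
toy_predict = lambda img: int(img.mean() > 0.5)
x = np.full((2, 2), 0.49)   # clean image, true class 0
x_adv = x + 0.02            # small shift that crosses the decision boundary
```

Here <code>is_adversarial(x, x_adv, 0, toy_predict, eps=0.031)</code> holds because the shift is within the budget and flips the toy classifier's prediction.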
<br />
==== Adversarial Attacks Terminology ====<br />
#Adversarial attacks can be either '''black''' or '''white-box'''. In black box attacks, the attacker has access to the network output only, while white-box attackers have full access to the network, including its gradients, architecture and weights. This makes white-box attackers much more powerful. Given access to gradients, white-box attacks use back propagation to modify inputs (as opposed to the weights) with respect to the loss function.<br />
#In '''untargeted''' attacks, the objective is to ''maximize'' the loss of the true class, <math>x'=x \mathbf{+} \lambda(\operatorname{sign}(\nabla_xL(x,c^*(x))))</math>, while in '''targeted''' attacks, the objective is to ''minimize the loss for a target class'' <math>c^t(x)</math> that is different from the true class, <math>x'=x \mathbf{-} \lambda(\operatorname{sign}(\nabla_xL(x,c^t(x))))</math>. Here, <math>\nabla_xL()</math> is the gradient of the loss function with respect to the input, <math>\lambda</math> is a small step size and <math>\operatorname{sign}()</math> is the element-wise sign of the gradient.<br />
# An attacker may be allowed to use a single step of back-propagation ('''single step''') or multiple ('''iterative''') steps. Iterative attackers can generate more powerful adversarial images. Typically, to bound iterative attackers a distance measure is used.<br />
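The single-step untargeted and targeted updates above can be sketched as follows. This is an illustration only: it assumes a linear softmax classifier with weight matrix <code>W</code> (a stand-in chosen because its input gradient has a closed form), a cross-entropy loss, and one step size <code>lam</code> for both attacks. None of these names come from the paper.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def cross_entropy(W, x, label):
    return -np.log(softmax(W @ x)[label])

def loss_grad_wrt_input(W, x, label):
    """Gradient of the cross-entropy loss w.r.t. the input x (not the weights)."""
    p = softmax(W @ x)
    p[label] -= 1.0
    return W.T @ p

def untargeted_step(W, x, true_class, lam):
    # Move along the sign of the gradient to increase the true-class loss.
    return x + lam * np.sign(loss_grad_wrt_input(W, x, true_class))

def targeted_step(W, x, target_class, lam):
    # Move against the sign of the gradient to decrease the target-class loss.
    return x - lam * np.sign(loss_grad_wrt_input(W, x, target_class))

W = np.array([[1.0, -1.0], [-1.0, 1.0]])   # hypothetical two-class model
x = np.array([0.6, 0.4])                   # classified as class 0
x_untargeted = untargeted_step(W, x, true_class=0, lam=0.2)
x_targeted = targeted_step(W, x, target_class=1, lam=0.2)
```

In this two-class toy problem both steps move the input toward class 1; with more classes the two attacks generally produce different perturbations.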
<br />
In this paper the authors focus on the more difficult attacks: white-box, iterative, targeted and untargeted attacks.<br />
<br />
== Obfuscated Gradients ==<br />
As gradients are used in the generation of white-box adversarial images, many defense strategies have focused on methods that mask gradients. If gradients are masked, they cannot be followed to generate adversarial images. The authors argue against this general approach by showing that it can be easily circumvented. To emphasize their point, they looked at white-box defenses proposed in ICLR 2018. Three types of gradient masking techniques were found:<br />
<br />
# '''Shattered gradients''': Non-differentiable operations are introduced into the model, causing a gradient to be nonexistent or incorrect.<br />
# '''Stochastic gradients''': A stochastic process is added into the model at test time, causing the gradients to become randomized.<br />
# '''Vanishing/exploding gradients''': Very deep neural networks or those with recurrent connections are used. Because of the vanishing or exploding gradient problem common in these deep networks, effective gradients at the input are not very useful.<br />
<br />
== The Attacks ==<br />
To circumvent these gradient masking techniques, the authors propose:<br />
# '''Backward Pass Differentiable Approximation (BPDA)''': For defenses that introduce non-differentiable components, the authors replace each such component with a differentiable approximation on the backward pass. In a white-box setting, the attacker has full access to any added non-linear transformation and can find its approximation. <br />
# '''Expectation over Transformation (EOT) [Athalye, 2017]''': For defenses that add some form of test-time randomness, the authors propose using the expectation over transformation technique in the backward pass. Rather than moving along a single gradient at every step, several gradients are sampled and the step is taken in the average direction. This counters the stochastic misdirection of individual gradients. The technique is similar to mini-batch gradient descent, but applied to the construction of adversarial images.<br />
# '''Re-parameterize the exploration space''': For very deep networks that rely on vanishing or exploding gradients, the authors propose to re-parameterize and search over the range where the gradient does not explode/vanish.<br />
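The gradient averaging behind Expectation over Transformation can be sketched as below. This is a toy illustration rather than the authors' implementation: <code>grad_fn</code>, <code>sample_transform</code> and the noise "defense" are hypothetical, and the sketch assumes each random transformation has an identity Jacobian (as additive noise does), so the gradient at the transformed input can be used unchanged.

```python
import numpy as np

def eot_gradient(grad_fn, x, sample_transform, n_samples, rng):
    """Average input gradients over randomly sampled transformations.

    Assumption: each transformation has an identity Jacobian (e.g. additive
    noise), so the gradient at the transformed input is used directly.
    """
    grads = [grad_fn(sample_transform(x, rng)) for _ in range(n_samples)]
    return np.mean(grads, axis=0)

# Toy setup (hypothetical): the loss 0.5 * ||z||^2 has gradient z, and the
# randomized "defense" adds Gaussian noise to the input at test time.
toy_grad = lambda z: z
add_noise = lambda x, rng: x + rng.normal(0.0, 1.0, size=x.shape)

rng = np.random.default_rng(0)
x = np.array([1.0, -2.0, 0.5])
single = toy_grad(add_noise(x, rng))                         # one noisy gradient
averaged = eot_gradient(toy_grad, x, add_noise, 2000, rng)   # close to x
```

A single sampled gradient is dominated by the noise, while the average over many samples approaches the gradient of the expected loss, which is the direction the attack follows.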
<br />
= Main Results =<br />
[[File:Summary_Table.png|600px|center]]<br />
<br />
The table above summarizes the results of their attacks. Each attack is mounted on the same dataset that the defense targeted. If multiple datasets were used, attacks were performed on the largest one. Two different distance metrics (<math>\ell_{\infty}</math> and <math>\ell_{2}</math>) were used in the construction of adversarial images. Distance metrics specify how much an adversarial image can vary from an original image. For <math>\ell_{\infty}</math> adversarial images, each pixel is allowed to vary by a maximum amount. For example, <math>\ell_{\infty}=0.031</math> specifies that each pixel can vary by about <math>256 \times 0.031 \approx 8</math> intensity levels from its original value. <math>\ell_{2}</math> distances specify the magnitude of the total distortion allowed over all pixels. For MNIST and CIFAR-10, untargeted adversarial images were constructed using the entire test set, while for ImageNet, 1000 test images were randomly selected and used to generate targeted adversarial images. <br />
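The two distance metrics can be computed as in the following sketch. The helper names are illustrative, and pixel values are assumed to be scaled to [0, 1].

```python
import numpy as np

def linf_distance(x, x_adv):
    """Largest change applied to any single pixel."""
    return float(np.max(np.abs(x_adv - x)))

def l2_distance(x, x_adv):
    """Magnitude of the total distortion over all pixels."""
    return float(np.sqrt(np.sum((x_adv - x) ** 2)))

x = np.zeros((2, 2))
x_adv = np.array([[0.03, -0.03], [0.03, 0.0]])

linf = linf_distance(x, x_adv)   # 0.03: no pixel moved by more than this
l2 = l2_distance(x, x_adv)       # 0.03 * sqrt(3): three pixels moved by 0.03
levels = 0.031 * 256             # about 8 intensity levels on a 256-level scale
```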
<br />
Standard models were used in evaluating the accuracy of defense strategies under the attacks:<br />
# MNIST: 5-layer Convolutional Neural Network (99.3% top-1 accuracy)<br />
# CIFAR-10: Wide-Resnet (95.0% top-1 accuracy)<br />
# ImageNet: InceptionV3 (78.0% top-1 accuracy)<br />
<br />
The last column shows the accuracy each defense method achieved on the adversarial test set. Except for [Madry, 2018], all defense methods could only achieve an accuracy of <10%. Furthermore, the accuracy of most methods was 0%. The results for [Samangouei, 2018] (double asterisk) show that the authors' attack was not as successful against this defense. The authors claim that this is a result of implementation imperfections, but that theoretically the defense can be circumvented using their proposed method.<br />
<br />
==== The defense that worked - Adversarial Training [Madry, 2018] ====<br />
<br />
As a defense mechanism, [Madry, 2018] propose training neural networks with adversarial images. Although this approach was previously known [Szegedy, 2013], in their formulation the problem is set up in a more systematic way using a min-max formulation:<br />
\begin{align}<br />
\theta^* = \arg \min_{\theta} \mathbb{E}_{x} \bigg[ \max_{\delta \in [-\epsilon,\epsilon]} L(x+\delta,y;\theta) \bigg] <br />
\end{align}<br />
<br />
where <math>\theta</math> is the parameter of the model, <math>\theta^*</math> is the optimal set of parameters and <math>\delta</math> is a small perturbation to the input image <math>x</math> and is bounded by <math>[-\epsilon,\epsilon]</math>. <br />
<br />
Training proceeds in the following way. For each clean input image, a distorted version of the image is found by solving the inner maximization problem for a fixed number of iterations. After each gradient step, the perturbation is projected back into the allowed range (projected gradient descent). Next, the classification problem is solved by minimizing the outer objective over the model parameters.<br />
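The inner maximization can be sketched as a short projected-gradient loop. This is a minimal illustration, not the implementation of [Madry, 2018]: it assumes a linear softmax model <code>W</code> with a closed-form input gradient, an <math>\ell_\infty</math> ball of radius <code>eps</code>, and signed gradient steps of size <code>step</code>. The outer minimization over <math>\theta</math> (ordinary training on the resulting images) is omitted.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def cross_entropy(W, x, label):
    return -np.log(softmax(W @ x)[label])

def input_gradient(W, x, label):
    """Gradient of the cross-entropy loss with respect to the input."""
    p = softmax(W @ x)
    p[label] -= 1.0
    return W.T @ p

def pgd_linf(W, x, label, eps, step, n_iter):
    """Approximate max over delta in [-eps, eps] of L(x + delta, y; theta)."""
    x_adv = x.copy()
    for _ in range(n_iter):
        # Ascent step on the loss, using the sign of the input gradient.
        x_adv = x_adv + step * np.sign(input_gradient(W, x_adv, label))
        # Projection: keep the perturbation inside the allowed range.
        x_adv = np.clip(x_adv, x - eps, x + eps)
    return x_adv

W = np.array([[1.0, -1.0], [-1.0, 1.0]])   # hypothetical two-class model
x = np.array([0.6, 0.4])
x_adv = pgd_linf(W, x, label=0, eps=0.1, step=0.03, n_iter=10)
```

In adversarial training, images such as <code>x_adv</code> would replace (or augment) the clean images in each mini-batch before the parameter update.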
<br />
This approach was shown to provide resilience to all types of adversarial attacks.<br />
<br />
==== How to check for Obfuscated Gradients ====<br />
For future defense proposals, it is recommended to avoid using masked gradients. To assist with this, the authors propose a set of warning signs that can help identify whether a defense is relying on masked gradients:<br />
# One-step attacks perform better than iterative attacks.<br />
# Black-box attacks find stronger adversarial images than white-box attacks.<br />
# Unbounded iterative attacks do not reach 100% success.<br />
# Random brute-force search is better than gradient-based methods at finding adversarial images.<br />
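The checklist above can be turned into a simple diagnostic. The function below is a hypothetical helper, not something the paper provides. Each argument is the accuracy of the defended model under one kind of attack, so a lower accuracy means a stronger attack.

```python
def obfuscation_warnings(acc_one_step, acc_iterative, acc_black_box,
                         acc_white_box, acc_unbounded, acc_random_search):
    """Return the warning signs triggered by a set of measured accuracies.

    All arguments are model accuracies in [0, 1] under the named attack;
    the argument names are illustrative assumptions.
    """
    warnings = []
    if acc_one_step < acc_iterative:
        warnings.append("one-step attack beats iterative attack")
    if acc_black_box < acc_white_box:
        warnings.append("black-box attack beats white-box attack")
    if acc_unbounded > 0.0:
        warnings.append("unbounded attack does not reach 100% success")
    if acc_random_search < acc_iterative:
        warnings.append("random search beats gradient-based attack")
    return warnings

suspicious = obfuscation_warnings(0.2, 0.5, 0.3, 0.5, 0.1, 0.4)
healthy = obfuscation_warnings(0.5, 0.1, 0.6, 0.1, 0.0, 0.9)
```

A non-empty result does not prove that gradients are masked; it only indicates that the evaluation deserves a closer look.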
<br />
==== Recommendations for future defense methods to encourage reproducibility ====<br />
<br />
= Detailed Results =<br />
<br />
== Non-obfuscated Gradients ==<br />
==== Adversarial Training [Madry 2018] ====<br />
'''Defense''': Proposed by Goodfellow et al. (2014b), adversarial training solves a min-max game through a conceptually simple process: train on adversarial examples until the model learns to classify them correctly. The authors study the adversarial training approach of Madry et al. (2018), which for a given <math>\epsilon</math>-ball solves <br />
<br />
<div style="text-align: center;font-size:100%">[[File:sb.png]]</div><br />
<br />
whose inner maximization problem the authors of the original paper solve by generating adversarial examples using projected gradient descent. The authors' experiments were not able to invalidate the claims of the paper. The authors also mention that this approach does not cause obfuscated gradients, and that the original authors’ evaluation of this defense performs all of the tests for characteristic behaviours of obfuscated gradients that the authors of this paper list. However, the authors note that (1) adversarial retraining has been shown to be difficult at a large scale like ImageNet, and (2) training exclusively on <math>\ell_\infty</math> adversarial examples provides only limited robustness to adversarial examples under other distortion metrics.<br />
<br />
==== Cascade Adversarial Training [Na 2018] ====<br />
'''Defense''': The cascade adversarial training approach is similar to the adversarial training approach described above. The main difference is that, instead of using iterative methods to generate adversarial examples at each mini-batch, a model is first trained, adversarial examples are generated on that model with iterative methods, those examples are added to the training set, and a new model is then trained on the augmented dataset. As above, the authors were unable to invalidate the claims made by the paper, although the claims are weaker in this case: 16% accuracy at <math>\epsilon</math> = .015, compared to over 70% at the same perturbation budget with adversarial training as in Madry et al. (2018).<br />
<br />
== Gradient Shattering ==<br />
<br />
==== Thermometer Coding, [Buckman, 2018] ====<br />
'''Defense''': Inspired by the observation that neural networks learn linear boundaries between classes [Goodfellow, 2014], [Buckman, 2018] sought to break this linearity by explicitly adding a highly non-linear transform at the input of their model. The non-linear transformation they chose quantizes inputs to binary vectors, a scheme termed thermometer encoding.<br />
<br />
Given an image, for each pixel value <math>x_{i,j,c}</math>, the <math>k</math>th bit of an <math>l</math>-dimensional thermometer code is given by:<br />
\begin{align}<br />
\tau(x_{i,j,c})_k = \begin{cases}<br />
1 & \text{if } x_{i,j,c} > k/l \\<br />
0 & \text{otherwise}<br />
\end{cases}<br />
\end{align}<br />
Here it is assumed that <math>x_{i,j,c} \in [0, 1] </math> and that <math>i, j, c</math> are the row, column and channel index of the pixel respectively. This encoding resembles one-hot encoding, except that every bit whose threshold <math>k/l</math> lies below the pixel value (not just one bit) is set to 1. This quantization preserves the pairwise ordering between pixel values.<br />
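The encoding can be written in a few lines of NumPy. This sketch follows the formula above and assumes that the bit index <math>k</math> runs from 1 to <math>l</math>, which the text does not specify; the function name is illustrative.

```python
import numpy as np

def thermometer_encode(x, levels):
    """Encode values in [0, 1] as `levels`-bit thermometer codes.

    Bit k (assumed to run over k = 1..levels) is 1 where x > k / levels.
    The code is appended as a new trailing axis.
    """
    thresholds = np.arange(1, levels + 1) / levels
    return (x[..., None] > thresholds).astype(np.float32)

pixels = np.array([0.35, 0.8])
codes = thermometer_encode(pixels, levels=4)
# 0.35 exceeds only the threshold 0.25, giving [1, 0, 0, 0]
# 0.8 exceeds 0.25, 0.5 and 0.75, giving [1, 1, 1, 0]
```

Because a larger pixel value never switches off a bit, the number of set bits is non-decreasing in the pixel value, which is the ordering property mentioned above.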
<br />
On CIFAR-10, the model achieved 50% accuracy against <math>\ell_\infty</math> adversarial images with <math>\epsilon=0.031</math>.<br />
<br />
'''Attack''': The authors attack this model using their BPDA approach. Given the non-linear transformation performed in the forward pass, <math>\tau(x)</math>, they develop a differentiable counterpart,<br />
\begin{align}<br />
\hat{\tau}(x_{i,j,c})_k = \min ( \max (x_{i,j,c} - \frac{k}{l}, 0), 1 )<br />
\end{align}<br />
and use it in place of <math>\tau(x)</math> on the backward pass. With this modification they were able to bring the accuracy of the model down to 0%.<br />
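The effect of swapping in the surrogate on the backward pass can be illustrated as follows. This is a hedged sketch rather than the authors' code: it reads the surrogate as <math>\min(\max(x - k/l, 0), 1)</math>, assumes <math>k = 1, \ldots, l</math>, and takes the gradient flowing back from the rest of the network (<code>grad_wrt_code</code>) as given.

```python
import numpy as np

def thermometer_forward(x, levels):
    """Exact (non-differentiable) encoding used on the forward pass."""
    k = np.arange(1, levels + 1) / levels
    return (x[..., None] > k).astype(np.float64)

def surrogate_derivative(x, levels):
    """Derivative of min(max(x - k/l, 0), 1) with respect to x."""
    k = np.arange(1, levels + 1) / levels
    shifted = x[..., None] - k
    return ((shifted > 0) & (shifted < 1)).astype(np.float64)

def bpda_input_gradient(x, levels, grad_wrt_code):
    """Chain rule using the surrogate derivative in place of tau's true one."""
    return np.sum(grad_wrt_code * surrogate_derivative(x, levels), axis=-1)

x = np.array([0.35, 0.8])
upstream = np.ones((2, 4))   # hypothetical gradient w.r.t. the 4-bit codes

# The true encoding is piecewise constant, so a finite difference gives zero.
h = 1e-4
finite_diff = (thermometer_forward(x + h, 4) - thermometer_forward(x, 4)) / h

# The surrogate passes a usable gradient back to the input.
bpda_grad = bpda_input_gradient(x, 4, upstream)
```

The forward pass still uses the exact encoding, so the attack is evaluated against the real defense; only the gradient computation is approximated.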
<br />
==== Input Transformation, [Guo, 2018] ====<br />
'''Defense''': [Guo, 2018] investigated the effect of different input transformations on robustness to adversarial images. In particular, they found that two techniques provided the greatest resistance: total variance minimization and image quilting. Total variance minimization removes high-frequency noise while preserving legitimate edges (good high-frequency components). In image quilting, a large database of image patches from clean images is collected. At test time, input patches that contain a lot of noise are replaced with similar but clean patches from the database.<br />
<br />
Both techniques remove perturbations from adversarial images, which provides some robustness to adversarial attacks. Moreover, both approaches are non-differentiable, which makes constructing white-box adversarial images difficult. The techniques also include test-time randomness, as the modifications made are input dependent. The best model achieved 60% accuracy on adversarial images with <math>\ell_{2}=0.05</math> perturbations.<br />
<br />
'''Attack''': The authors used the BPDA attack where the input transformations were replaced by an identity function. They were able to bring the accuracy of the model down to 0% under the same type of adversarial attacks.<br />
<br />
==== Local Intrinsic Dimensionality, [Ma, 2018] ====<br />
'''Defense''': Local intrinsic dimensionality (LID) is a distance-based metric that measures the similarity between points in a high-dimensional space. Given a set of points, let the distance between sample <math>x</math> and its <math>i</math>th nearest neighbor be <math>r_i(x)</math>. The LID under the chosen distance metric is then given by<br />
<br />
\begin{align}<br />
LID(x) = - \bigg{(} \frac{1}{k}\sum^k_{i=1}\log \frac{r_i(x)}{r_k(x)} \bigg{)}^{-1}<br />
\end{align}<br />
where <math>k</math> is the number of nearest neighbors considered and <math>r_k(x)</math> is the distance to the farthest of these <math>k</math> neighbors. <br />
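The LID estimate above translates directly into NumPy. This is a minimal sketch that assumes Euclidean (<math>L_2</math>) distances and a small reference set held in memory; the function name is illustrative, and the degenerate case where all <math>k</math> distances are equal is not handled.

```python
import numpy as np

def lid_estimate(x, reference, k):
    """Estimate LID(x) from the k nearest neighbors of x in `reference`.

    Assumes Euclidean distance; zero distances (x itself) are skipped.
    """
    dists = np.sort(np.linalg.norm(reference - x, axis=1))
    dists = dists[dists > 0][:k]                 # r_1(x) <= ... <= r_k(x)
    return -1.0 / np.mean(np.log(dists / dists[-1]))

# Worked example: neighbors at distances 1, 2 and 4 from x.
x = np.array([0.0])
reference = np.array([[1.0], [2.0], [4.0]])
lid = lid_estimate(x, reference, k=3)
# The mean of log(1/4), log(2/4), log(4/4) is -log(2), so LID = 1 / log(2).
```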
<br />
First, <math>L_2</math> distances were computed for all training and adversarial images. Next, the LID scores for each training and adversarial image were calculated. It was found that LID scores for adversarial images were significantly larger than those of clean images. Based on these results, a separate classifier was created that can be used to detect adversarial inputs. [Ma, 2018] claim that this is not a defense method, but a method to study the properties of adversarial images.<br />
<br />
'''Attack''': Instead of attacking this method, the authors show that it is not able to detect, and is therefore vulnerable to, attacks of the [Carlini and Wagner, 2017a] variety.<br />
<br />
== Stochastic Gradients ==<br />
<br />
==== Stochastic Activation Pruning, [Dhillon, 2018] ====<br />
'''Defense''': [Dhillon, 2018] use test-time randomness in their model to guard against adversarial attacks. Within a layer, the activations of component nodes are randomly dropped with a probability proportional to their absolute value. The remaining activations are scaled up to preserve accuracy. This is akin to test-time dropout. This technique was found to reduce accuracy slightly on clean images, but improved performance on adversarial images.<br />
<br />
'''Attack''': The authors used the expectation over transformation attack to get useful gradients out of the model. With their attack they were able to reduce the accuracy of this method down to 0% on CIFAR-10.<br />
<br />
==== Mitigation Through Randomization, [Xie, 2018] ====<br />
'''Defense''': [Xie, 2018] add a randomization layer to their model to help defend against adversarial attacks. For an input image of size [299,299], the image is first randomly re-scaled to size <math>r \in [299,331]</math>. Next, the image is zero-padded to fix the dimension of the modified input. This modified input is then fed into a regular classifier. The authors claim that this strategy can provide an accuracy of 32.8% against ensemble attack patterns (fixed distortions, many of which are picked randomly). Because of the introduced randomness, the authors claim the model builds some robustness to other types of attacks as well.<br />
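The randomization layer can be sketched as below. Several details are assumptions made for illustration and are not stated in the summary above: the padded output size (331), the random placement of the resized image inside the zero canvas, and the nearest-neighbour resize used to keep the example dependency-free.

```python
import numpy as np

def random_resize_and_pad(image, out_size, rng):
    """Randomly rescale a square image, then zero-pad it to out_size x out_size."""
    h, w = image.shape[:2]
    new_size = int(rng.integers(h, out_size + 1))   # r in [h, out_size]
    rows = np.arange(new_size) * h // new_size      # nearest-neighbour row indices
    cols = np.arange(new_size) * w // new_size      # nearest-neighbour column indices
    resized = image[rows][:, cols]
    # Assumed detail: the resized image is placed at a random offset.
    top = int(rng.integers(0, out_size - new_size + 1))
    left = int(rng.integers(0, out_size - new_size + 1))
    padded = np.zeros((out_size, out_size) + image.shape[2:], dtype=image.dtype)
    padded[top:top + new_size, left:left + new_size] = resized
    return padded

rng = np.random.default_rng(0)
image = np.ones((299, 299, 3))
out = random_resize_and_pad(image, out_size=331, rng=rng)
```

Each call produces a different resize and offset, which is the source of the test-time randomness that the EOT attack averages over.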
<br />
'''Attack''': The EOT method was used to build adversarial images to attack this model. With their attack, the authors were able to bring the accuracy of this model down to 0% using <math>L_{\infty}(\epsilon=0.031)</math> perturbations.<br />
<br />
== Vanishing and Exploding Gradients ==<br />
<br />
==== PixelDefend, [Song, 2018] ====<br />
'''Defense''': [Song, 2018] argue that adversarial images lie in low-probability regions of the data manifold. Therefore, one way to handle adversarial attacks is to project them back into the high-probability regions before feeding them into a classifier. They chose to do this by using a generative model (PixelCNN) in a denoising capacity. A PixelCNN model directly estimates the conditional probability of generating an image pixel by pixel [van den Oord, 2016],<br />
<br />
\begin{align}<br />
p(\mathbf{x}) = \prod_{i=1}^{n^2} p(x_i \mid x_1, \ldots, x_{i-1})<br />
\end{align}<br />
<br />
The reason for choosing this model is the long iterative process of generation. In the backward pass, following the gradient all the way to the input would not be possible because of the vanishing/exploding gradient<br />
problem of deep networks. The proposed model was able to obtain an accuracy of 46% on CIFAR-10 images with <math>l_{\infty} (\epsilon=0.031) </math> perturbations.<br />
<br />
'''Attack''': The model was attacked using the BPDA technique, where back-propagation through the PixelCNN was replaced with an identity function. With this approach, the authors were able to bring the accuracy down to 9% under the same kind of perturbations.<br />
<br />
==== Defense-GAN, [Samangouei, 2018] ====<br />
<br />
= Conclusion =<br />
In this paper, it was found that gradient masking is a common technique used by many defense proposals that claim to be robust against a very difficult class of adversarial attacks: white-box, iterative attacks. However, the authors found that these defenses can be easily circumvented. Three attack methods are presented that were able to defeat 7 out of the 9 defense proposals accepted in the 2018 ICLR conference for these types of attacks.<br />
<br />
Some future work that can come out of this paper includes avoiding reliance on obfuscated gradients for perceived robustness and using the proposed evaluation approach to detect when an attack occurs. Early categorization of attacks using supervised techniques can also help in the critical evaluation of incoming data. <br />
<br />
= Critique =<br />
# The third attack method, reparameterization of the input distortion search space, was presented very briefly and at a very high level. Moreover, the one defense proposal they chose to use it against, [Samangouei, 2018], proved to be resilient against the attack. The authors had to resort to one of their other methods to circumvent the defense.<br />
# The BPDA and reparameterization attacks require intrinsic knowledge of the networks. This information is not likely to be available to external users of a network. Most likely, the use-case for these attacks will be in-house to develop more robust networks. This also means that it is still possible to guard against adversarial attack using gradient masking techniques, provided the details of the network are kept secret.<br />
# The BPDA algorithm requires replacing a non-linear part of the model with a differentiable approximation. Since different networks are likely to use different transformations, this technique is not plug-and-play. For each network, the attack needs to be manually constructed.<br />
<br />
= Other Sources =<br />
= References =<br />
#'''[Madry, 2018]''' Madry, A., Makelov, A., Schmidt, L., Tsipras, D. and Vladu, A., 2017. Towards deep learning models resistant to adversarial attacks. arXiv preprint arXiv:1706.06083.<br />
#'''[Buckman, 2018]''' Buckman, J., Roy, A., Raffel, C. and Goodfellow, I., 2018. Thermometer encoding: One hot way to resist adversarial examples.<br />
#'''[Guo, 2018]''' Guo, C., Rana, M., Cisse, M. and van der Maaten, L., 2017. Countering adversarial images using input transformations. arXiv preprint arXiv:1711.00117.<br />
#'''[Xie, 2018]''' Xie, C., Wang, J., Zhang, Z., Ren, Z. and Yuille, A., 2017. Mitigating adversarial effects through randomization. arXiv preprint arXiv:1711.01991.<br />
#'''[Song, 2018]''' Song, Y., Kim, T., Nowozin, S., Ermon, S. and Kushman, N., 2017. Pixeldefend: Leveraging generative models to understand and defend against adversarial examples. arXiv preprint arXiv:1710.10766.<br />
#'''[Szegedy, 2013]''' Szegedy, C., Zaremba, W., Sutskever, I., Bruna, J., Erhan, D., Goodfellow, I. and Fergus, R., 2013. Intriguing properties of neural networks. arXiv preprint arXiv:1312.6199.<br />
#'''[Samangouei, 2018]''' Samangouei, P., Kabkab, M. and Chellappa, R., 2018. Defense-GAN: Protecting classifiers against adversarial attacks using generative models. arXiv preprint arXiv:1805.06605.<br />
#'''[van den Oord, 2016]''' van den Oord, A., Kalchbrenner, N., Espeholt, L., Vinyals, O. and Graves, A., 2016. Conditional image generation with pixelcnn decoders. In Advances in Neural Information Processing Systems (pp. 4790-4798).<br />
#'''[Athalye, 2017]''' Athalye, A. and Sutskever, I., 2017. Synthesizing robust adversarial examples. arXiv preprint arXiv:1707.07397.<br />
#'''[Ma, 2018]''' Ma, Xingjun, Bo Li, Yisen Wang, Sarah M. Erfani, Sudanthi Wijewickrema, Michael E. Houle, Grant Schoenebeck, Dawn Song, and James Bailey. "Characterizing adversarial subspaces using local intrinsic dimensionality." arXiv preprint arXiv:1801.02613 (2018).</div>Mpaflahttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Countering_Adversarial_Images_Using_Input_Transformations&diff=39774Countering Adversarial Images Using Input Transformations2018-11-18T14:38:14Z<p>Mpafla: /* Critiques */</p>
<hr />
<div>The code for this paper is available here[https://github.com/facebookresearch/adversarial_image_defenses]<br />
<br />
==Motivation ==<br />
As the use of machine intelligence has increased, robustness has become a critical feature to guarantee the reliability of deployed machine-learning systems. However, recent research has shown that existing models are not robust to small, adversarially designed perturbations of the input. Adversarial examples are inputs to machine learning models that an attacker has intentionally designed to cause the model to make a mistake. Adversarially perturbed examples have been deployed to attack image classification services (Liu et al., 2016)[11], speech recognition systems (Cisse et al., 2017a)[12], and robot vision (Melis et al., 2017)[13]. The existence of these adversarial examples has motivated proposals for approaches that increase the robustness of learning systems to such examples.
In the example below (Goodfellow et al.) [17], a small perturbation applied to the original image of a panda changes the prediction to a gibbon.<br />
<br />
[[File:Panda.png|center]]<br />
<br />
==Introduction==<br />
The paper studies strategies that defend against adversarial-example attacks on image-classification systems by transforming the images before feeding them to a Convolutional Network Classifier. <br />
Generally, defenses against adversarial examples fall into two main categories -<br />
<br />
# Model-Specific – They enforce model properties such as smoothness and invariance via the learning algorithm. <br />
# Model-Agnostic – They try to remove adversarial perturbations from the input. <br />
<br />
This paper focuses on increasing the effectiveness of model-agnostic defense strategies. Specifically, the authors investigate the following image transformations as a means of protecting against adversarial images:<br />
<br />
# Image Cropping and Re-scaling (Graese et al, 2016). <br />
# Bit Depth Reduction (Xu et al, 2017) <br />
# JPEG Compression (Dziugaite et al, 2016) <br />
# Total Variance Minimization (Rudin et al, 1992) <br />
# Image Quilting (Efros & Freeman, 2001). <br />
<br />
These image transformations are evaluated against adversarial attacks such as the fast gradient sign method (Goodfellow et al., 2015), its iterative extension (Kurakin et al., 2016a), DeepFool (Moosavi-Dezfooli et al., 2016), and the Carlini & Wagner (2017) <math>L_2</math> attack. <br />
<br />
From their experiments, the strongest defenses are based on Total Variance Minimization and Image Quilting: these defenses are non-differentiable and inherently random, which makes it difficult for an adversary to get around them.<br />
<br />
==Previous Work==<br />
Recently, a lot of research has focused on countering adversarial threats. Wang et al. [4] proposed a new adversary-resistant technique that obstructs attackers from constructing impactful adversarial images by randomly nullifying features within images. Tramèr et al. [2] presented the state-of-the-art Ensemble Adversarial Training method, which augments the training data not only with adversarial images constructed from the model being trained but also with adversarial images generated from an ensemble of other models. Their method, implemented on an Inception V2 classifier, finished 1st among 70 submissions in the NIPS 2017 competition on Defenses against Adversarial Attacks. Graese et al. [3] showed how input transformations such as shifting, blurring, and noise can render the majority of adversarial examples non-adversarial. Xu et al. [5] demonstrated how feature squeezing methods, such as reducing the color bit depth of each pixel and spatial smoothing, defend against attacks. Dziugaite et al. [6] studied the effect of JPG compression on adversarial images.<br />
<br />
==Terminology==<br />
<br />
'''Gray Box Attack''': The model architecture and parameters are public.<br />
<br />
'''Black Box Attack''': The adversary does not have access to the model.<br />
<br />
'''Non-Targeted Adversarial Attack''': The goal of the attack is to modify a source image such that the image will be classified incorrectly by the network.<br />
<br />
'''Targeted Adversarial Attack''': The goal of the attack is to modify a source image such that the image will be classified as a ''target'' class by the network.<br />
<br />
'''Defense''': A defense is a strategy that aims to make the prediction on an adversarial example h(x') equal to the prediction on the corresponding clean example h(x).<br />
<br />
== Problem Definition ==<br />
The paper discusses non-targeted adversarial attacks for image recognition systems. Given image space X, a source image x ∈ X, and a classifier h(.), a non-targeted adversarial example of x is a perturbed image x', such that h(x) ≠ h(x') and d(x, x') ≤ ρ for some dissimilarity function d(·, ·) and ρ ≥ 0 (Euclidean distance is often used).<br />
<br />
From a set of N clean images <math>[x_{1}, \dots, x_{N}]</math>, an adversarial attack aims to generate N images <math>[x^{'}_{1}, \dots, x^{'}_{N}]</math>, such that each <math>x^{'}_{n}</math> is an adversarial example of <math>x_{n}</math>.<br />
<br />
The success rate of an attack is given as: <br />
<br />
[[File:Attack.PNG|200px |]],<br />
<br />
which is the proportion of predictions that were altered by an attack.<br />
<br />
The success rate is generally measured as a function of the magnitude of perturbations performed by the attack. In this paper, L2 perturbations are used and are quantified using the normalized L2-dissimilarity metric:<br />
[[File:diss.png|150px |]]<br />
<br />
A strong adversarial attack has a high success rate while its normalized L2-dissimilarity, given by the above equation, is low.<br />
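<br />
The two quantities above can be sketched in a few lines of NumPy. This is only an illustration, not the authors' code: it assumes the success rate is the fraction of images whose predicted label changes, and that the normalized L2-dissimilarity is the mean over images of <math>\|x_n - x^{'}_n\|_2 / \|x_n\|_2</math> (the exact formulas are in the figures above). The function names are hypothetical.<br />

```python
import numpy as np

def attack_success_rate(clean_preds, adv_preds):
    """Fraction of images whose predicted label changed under attack."""
    clean_preds = np.asarray(clean_preds)
    adv_preds = np.asarray(adv_preds)
    return float(np.mean(clean_preds != adv_preds))

def normalized_l2_dissimilarity(clean_images, adv_images):
    """Mean of ||x - x'||_2 / ||x||_2 over a batch (assumed definition)."""
    x = np.asarray(clean_images, dtype=np.float64)
    x_adv = np.asarray(adv_images, dtype=np.float64)
    # Flatten every image to a vector so the norm is taken per image.
    x = x.reshape(len(x), -1)
    x_adv = x_adv.reshape(len(x_adv), -1)
    ratios = np.linalg.norm(x - x_adv, axis=1) / np.linalg.norm(x, axis=1)
    return float(np.mean(ratios))
```

Under these assumptions, a strong attack is one for which the first function returns a value close to 1 while the second stays small.<br />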
<br />
==Adversarial Attacks==<br />
<br />
For the experiments, the following four attacks are studied.<br />
<br />
1. '''Fast Gradient Sign Method (FGSM; Goodfellow et al. (2015)) [17]''': Let x be a source input with true label y, and let l be the differentiable loss function used to train the classifier h(.). The corresponding adversarial example is given by:<br />
<br />
[[File:FGSM.PNG|200px |]]<br />
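<br />
As a rough sketch of the FGSM update, assuming the standard form <math>x' = x + \epsilon \cdot \text{sign}(\nabla_x l(x, y))</math>, the snippet below attacks a toy logistic-regression classifier whose input gradient can be written by hand. The toy model and the helper names are assumptions made for illustration; the paper attacks a ResNet-50.<br />

```python
import numpy as np

def logistic_loss_grad(x, y, w, b):
    """Gradient of the binary cross-entropy loss w.r.t. the input x
    for a toy model p = sigmoid(w . x + b), with label y in {0, 1}."""
    p = 1.0 / (1.0 + np.exp(-(np.dot(w, x) + b)))
    return (p - y) * w

def fgsm(x, grad_wrt_x, epsilon):
    """One FGSM step: move every input dimension by epsilon in the
    direction that increases the loss."""
    return x + epsilon * np.sign(grad_wrt_x)

# Toy usage: the perturbation has the same magnitude in every dimension.
w = np.array([1.0, -2.0])
x = np.array([0.5, 0.5])
x_adv = fgsm(x, logistic_loss_grad(x, y=1, w=w, b=0.0), epsilon=0.1)
```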
<br />
2. '''Iterative FGSM (I-FGSM; Kurakin et al. (2016b)) [14]''': iteratively applies the FGSM update, where M is the number of iterations. It is given as:<br />
<br />
[[File:IFGSM.PNG|300px |]]<br />
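<br />
A minimal sketch of the iterative variant, assuming it simply repeats the signed-gradient step M times starting from the clean input. The gradient callback <code>grad_fn</code> is a hypothetical stand-in for back-propagation through the classifier, and any clipping to a valid pixel range is omitted here.<br />

```python
import numpy as np

def iterative_fgsm(x, y, grad_fn, epsilon, num_iters):
    """Apply the FGSM step num_iters (M) times, recomputing the
    gradient at the current adversarial image on every iteration."""
    x_adv = np.array(x, dtype=np.float64)
    for _ in range(num_iters):
        x_adv = x_adv + epsilon * np.sign(grad_fn(x_adv, y))
    return x_adv

# Toy usage with a constant gradient direction (illustration only).
toy_grad = lambda x_cur, y: np.array([1.0, -1.0])
x_adv_iter = iterative_fgsm(np.zeros(2), 0, toy_grad, epsilon=0.1, num_iters=10)
```

With a fixed number of iterations, increasing <math>\epsilon</math> increases the total perturbation.<br />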
<br />
3. '''DeepFool (Moosavi-Dezfooli et al., 2016) [15]''': projects x onto a linearization of the decision boundary defined by h(.) for M iterations. It is given as:<br />
<br />
[[File:DeepFool.PNG|400px |]]<br />
<br />
4. '''Carlini-Wagner's L2 attack (CW-L2; Carlini & Wagner (2017)) [16]''': an optimization-based attack that combines a differentiable surrogate for the model’s classification accuracy with an L2-penalty term, which encourages the adversarial image to be close to the original image. Let Z(x) be the operation that computes the logit vector (i.e., the output before the softmax layer) for an input x, and <math>Z(x)_k</math> be the logit value corresponding to class k. The untargeted variant of CW-L2 finds a solution to an unconstrained optimization problem. It is given as:<br />
<br />
[[File:Carlini.PNG|500px |]]<br />
<br />
The figure below shows adversarial images and the corresponding perturbations at five levels of normalized L2-dissimilarity for all four attacks mentioned above.<br />
<br />
[[File:Strength.PNG|center|600px |]]<br />
<br />
==Defenses==<br />
A defense is a strategy that aims to make the prediction on an adversarial example equal to the prediction on the corresponding clean example. <br />
Five image transformations that alter the structure of these perturbations have been studied:<br />
# Image Cropping and Re-scaling, <br />
# Bit Depth Reduction, <br />
# JPEG Compression, <br />
# Total Variance Minimization, <br />
# Image Quilting.<br />
<br />
'''Image Cropping and Rescaling''' has the effect of altering the spatial positioning of the adversarial perturbation. In this study, images are cropped and re-scaled at training time. At test time, the predictions over random image crops are averaged.<br />
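<br />
The test-time averaging can be sketched as follows. This is an assumption-laden illustration: <code>predict_proba</code> is a hypothetical callable standing in for the classifier, the crop size and number of crops are free parameters, and the re-scaling of each crop back to the network's input size is left out for brevity.<br />

```python
import numpy as np

def crop_ensemble_predict(image, predict_proba, crop_size, num_crops, rng):
    """Average class probabilities over randomly positioned square crops."""
    height, width = image.shape[:2]
    probs = []
    for _ in range(num_crops):
        # Sample the top-left corner so the crop stays inside the image.
        top = rng.integers(0, height - crop_size + 1)
        left = rng.integers(0, width - crop_size + 1)
        crop = image[top:top + crop_size, left:left + crop_size]
        probs.append(predict_proba(crop))
    return np.mean(probs, axis=0)
```

Because each crop shifts the perturbation relative to the network's receptive fields, the averaged prediction depends less on any single spatial alignment of the adversarial pattern.<br />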
<br />
'''Bit Depth Reduction (Xu et al.)''' performs a simple type of quantization that can remove small (adversarial) variations in pixel values from an image. Images are reduced to 3 bits in the experiment.<br />
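<br />
A minimal sketch of bit depth reduction, assuming uniform quantization of 8-bit pixel values to <math>2^{3} = 8</math> levels per channel; the exact rounding scheme used in the paper is not stated here, so the details below are an assumption.<br />

```python
import numpy as np

def reduce_bit_depth(image_uint8, bits=3):
    """Quantize an 8-bit image to 2**bits evenly spaced levels per channel."""
    levels = 2 ** bits - 1
    scaled = image_uint8.astype(np.float64) / 255.0
    quantized = np.round(scaled * levels) / levels
    return np.round(quantized * 255.0).astype(np.uint8)
```

Nearby pixel values (for example 128 and 130) fall into the same quantization bin, which is how small perturbations are removed.<br />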
<br />
'''JPEG Compression and Decompression (Dziugaite et al., 2016)''' removes small perturbations by performing simple quantization.<br />
<br />
'''Total Variance Minimization [9]''':<br />
This combines pixel dropout with total variance minimization. The approach randomly selects a small set of pixels and reconstructs the “simplest” image that is consistent with the selected pixels. The reconstructed image does not contain the adversarial perturbations because these perturbations tend to be small and localized. Specifically, we first select a random set of pixels by sampling a Bernoulli random variable <math>X(i, j, k)</math> for each pixel location <math>(i, j, k)</math>; we maintain a pixel when <math>X(i, j, k) = 1</math>. Next, we use total variation minimization to construct an image z that is similar to the (perturbed) input image x for the selected set of pixels, whilst also being “simple” in terms of total variation, by solving:<br />
<br />
[[File:TV!.png|300px|]] , <br />
<br />
where <math>TV_{p}(z)</math> represents <math>L_{p}</math> total variation of '''z''' :<br />
<br />
[[File:TV2.png|500px|]]<br />
<br />
The total variation (TV) measures the amount of fine-scale variation in the image z, as a result of which TV minimization encourages removal of small (adversarial) perturbations in the image.<br />
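<br />
The objective can be sketched as below. This only evaluates the objective and does not run the minimization. It assumes that <math>TV_p(z)</math> sums the <math>L_p</math> norms of differences between adjacent rows and adjacent columns in each channel, and that the data term compares z and x only on the retained pixels; <code>keep_mask</code> and the function names are hypothetical, and the exact masking convention is the one in the figure above.<br />

```python
import numpy as np

def total_variation(z, p=2):
    """L_p total variation of an image z with shape (H, W, C)."""
    row_diff = z[1:, :, :] - z[:-1, :, :]   # differences between adjacent rows
    col_diff = z[:, 1:, :] - z[:, :-1, :]   # differences between adjacent columns
    # One L_p norm per (row pair, channel) and per (column pair, channel).
    tv_rows = np.linalg.norm(row_diff, ord=p, axis=1).sum()
    tv_cols = np.linalg.norm(col_diff, ord=p, axis=0).sum()
    return float(tv_rows + tv_cols)

def tv_objective(z, x, keep_mask, lam=0.03, p=2):
    """Data-fidelity term on retained pixels plus lam * TV_p(z)."""
    data_term = np.linalg.norm((keep_mask * (z - x)).ravel())
    return float(data_term + lam * total_variation(z, p))

def sample_keep_mask(shape, keep_prob, rng):
    """Bernoulli mask: 1 where a pixel is retained, 0 where it is dropped."""
    return (rng.random(shape) < keep_prob).astype(np.float64)
```

A constant image has zero total variation, so the TV term pulls the reconstruction toward piecewise-smooth images while the masked data term keeps it close to the retained pixels.<br />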
<br />
'''Image Quilting (Efros & Freeman, 2001) [8]'''<br />
Image Quilting is a non-parametric technique that synthesizes images by piecing together small patches taken from a database of image patches. The algorithm places appropriate patches from the database at a predefined set of grid points and computes minimum graph cuts in all overlapping boundary regions to remove edge artifacts. Image Quilting can be used to remove adversarial perturbations by constructing a patch database that only contains patches from "clean" images (without adversarial perturbations); the patches used to create the synthesized image are selected by finding the K nearest neighbors (in pixel space) of the corresponding patch from the adversarial image in the patch database, and picking one of these neighbors uniformly at random. The motivation for this defense is that the resulting image only contains pixels that were not modified by the adversary, since the database of real patches is unlikely to contain the structures that appear in adversarial images.<br />
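<br />
The patch-selection step described above can be sketched as follows. This covers only the nearest-neighbor lookup and random choice; the graph-cut blending of overlapping patches is not shown, and the function name and array layout are assumptions made for illustration.<br />

```python
import numpy as np

def pick_clean_patch(query_patch, patch_db, k, rng):
    """Return a patch chosen uniformly at random among the k database
    patches closest (in pixel space) to query_patch, plus their indices.

    patch_db has shape (num_patches, h, w, ...) and holds clean patches.
    """
    query = np.asarray(query_patch, dtype=np.float64).ravel()
    db = np.asarray(patch_db, dtype=np.float64).reshape(len(patch_db), -1)
    dists = np.linalg.norm(db - query, axis=1)
    nearest = np.argsort(dists)[:k]
    chosen = rng.choice(nearest)
    return patch_db[chosen], nearest
```

The random choice among the K neighbors is what makes the defense stochastic: the same adversarial image can be reconstructed differently on every call.<br />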
<br />
=Experiments=<br />
<br />
Five experiments were performed to test the efficacy of defenses. <br />
<br />
'''Set up:'''<br />
Experiments are performed on the ImageNet image classification dataset. The dataset comprises 1.2 million training images and 50,000 test images that correspond to one of 1000 classes. The adversarial images are produced by attacking a ResNet-50 model with the attacks described in the Adversarial Attacks section. The strength of an adversary is measured in terms of its normalized L2-dissimilarity. To produce the adversarial images, the L2-dissimilarity for each attack was controlled as follows:<br />
<br />
- FGSM. Increasing the step size <math>\epsilon</math> increases the normalized L2-dissimilarity.<br />
<br />
- I-FGSM. We fix M=10, and increase <math>\epsilon</math> to increase the normalized L2-dissimilarity.<br />
<br />
- DeepFool. We fix M=5, and increase <math>\epsilon</math> to increase the normalized L2-dissimilarity.<br />
<br />
- CW-L2. We fix <math>k</math>=0 and <math>\lambda_{f}</math> =10, and multiply the resulting perturbation <br />
<br />
The hyperparameters of the defenses have been fixed in all the experiments. Specifically the pixel dropout probability was set to <math>p</math>=0.5 and regularization parameter of total variation minimizer <math>\lambda_{TV}</math>=0.03.<br />
<br />
The figure below shows the differences between the setups of the experiments that follow. The network is either trained on a) regular images or b) transformed images. The different settings are marked by 8.1, 8.2 and 8.3 <br />
[[File:models3.png]] <br />
<br />
==GrayBox- Image Transformation at Test Time== <br />
This experiment applies a transformation to adversarial images at test time before feeding them to a ResNet-50 that was trained to classify clean images. The figure below shows the Top-1 accuracy for the five transformations. Some interesting observations from the plot: all of the image transformations partly eliminate the effects of the attack; the crop ensemble gives the best accuracy, around 40-60 percent; and the accuracy of the Image Quilting defense hardly deteriorates as the strength of the adversary increases.<br />
<br />
[[File:sFig4.png|center|600px |]]<br />
<br />
==BlackBox - Image Transformation at Training and Test Time==<br />
A ResNet-50 model was trained on transformed ImageNet training images. Before feeding the images to the network for training, standard data augmentation (from He et al.) along with bit depth reduction, JPEG compression, TV minimization, or image quilting was applied to the images. The classification accuracy on the same adversarial images as in the previous case is shown in the figure below. (The adversary does not have access to this trained model to generate new adversarial images, hence this is treated as a black-box setting.) The figure shows that training convolutional neural networks on images that are transformed in the same way as at test time dramatically improves the effectiveness of all transformation defenses. Nearly 80-90% of the attacks are defended successfully, even when the L2-dissimilarity is high.<br />
<br />
<br />
[[File:sFig5.png|center|600px |]]<br />
<br />
<br />
==Blackbox - Ensembling==<br />
Four networks, ResNet-50, ResNet-101, DenseNet-169, and Inception-v4, along with an ensemble of defenses were studied, as shown in Table 1. The adversarial images are produced by attacking a ResNet-50 model. The results in the table show that Inception-v4 performs best. This could be because that network has a higher accuracy even in non-adversarial settings. The best ensemble of defenses achieves an accuracy of about 71% against all the other attacks. The attacks deteriorate the accuracy of the best defenses by at most 6%.<br />
<br />
<br />
[[File:sTab1.png|600px|thumb|center|Table 1. Top-1 classification accuracy of ensemble and model transfer defenses (columns) against four black-box attacks (rows). The four networks we use to classify images are ResNet-50 (RN50), ResNet-101 (RN101), DenseNet-169 (DN169), and Inception-v4 (Iv4). Adversarial images are generated by running attacks against the ResNet-50 model, aiming for an average normalized <math>L_2</math>-dissimilarity of 0.06. Higher is better. The best defense against each attack is typeset in boldface.]]<br />
<br />
==GrayBox - Image Transformation at Training and Test Time ==<br />
In this experiment, the adversary has access to the network and its parameters (but does not have access to the input transformations applied at test time). Using the networks trained in the previous black-box experiment (image transformation at training and test time), novel adversarial images were generated by the four attack methods. The results show that Bit-Depth Reduction and JPEG Compression are weak defenses in such a gray-box setting. <br />
The results for this experiment are shown in the figure below. Networks using these defenses classify up to 50% of images correctly.<br />
<br />
[[File:sFig6.png|center| 600px |]]<br />
<br />
==Comparison With Ensemble Adversarial Training==<br />
The results of the experiment are compared with the state-of-the-art ensemble adversarial training approach proposed by Tramèr et al. [2]. Ensemble adversarial training fits the parameters of a convolutional neural network on adversarial examples that were generated to attack an ensemble of pre-trained models. The comparison uses the model released by Tramèr et al. [2]: an Inception-Resnet-v2, trained on adversarial examples generated by FGSM against Inception-Resnet-v2 and Inception-v3 models. The results of ensemble training and the preprocessing techniques mentioned in this paper are shown in Table 2.<br />
The results show that ensemble adversarial training works better on FGSM attacks (which it uses at training time), but is outperformed by each of the transformation-based defenses on all other attacks.<br />
<br />
<br />
<br />
[[File:sTab2.png|600px|thumb|center|Table 2. Top-1 classification accuracy on images perturbed using attacks against ResNet-50 models trained on input-transformed images and an Inception-v4 model trained using ensemble adversarial training. Adversarial images are generated by running attacks against the models, aiming for an average normalized <math>L_2</math>-dissimilarity of 0.06. The best defense against each attack is typeset in boldface.]]<br />
<br />
=Discussion/Conclusions=<br />
The paper proposed reasonable approaches to countering adversarial images. The authors evaluated Total Variance Minimization and Image Quilting and compared them with previously proposed ideas such as Image Cropping and Rescaling, Bit Depth Reduction, and JPEG Compression and Decompression on the challenging ImageNet dataset.<br />
Previous work by Wang et al. [10] shows that a strong input defense should be non-differentiable and randomized. Two of the defenses, namely Total Variation Minimization and Image Quilting, possess both properties. Suggested future work includes applying the same techniques to other domains such as speech recognition and image segmentation. For example, in speech recognition, total variance minimization can be used to remove perturbations from waveforms, and "spectrogram quilting" techniques that reconstruct a spectrogram could be developed. The input transformations can also be studied in combination with ensemble adversarial training by Tramèr et al. [2].<br />
<br />
<br />
=Critiques=<br />
1. The terminology of black-box, white-box, and gray-box attacks is not defined precisely or clearly.<br />
<br />
2. White-box attacks, where the adversary has full access to the model as well as the pre-processing techniques, could have been considered.<br />
<br />
3. Though the authors did considerable work in showing the effect of four attacks on the ImageNet database, much stronger attacks (Madry et al.) [7] could have been evaluated.<br />
<br />
4. The authors claim that the success rate is generally measured as a function of the magnitude of perturbations performed by the attack, using the L2-dissimilarity, but the claim is not supported by any references. None of the previous work has used these metrics.<br />
<br />
=References=<br />
<br />
1. Chuan Guo, Mayank Rana, Moustapha Cissé, and Laurens van der Maaten. Countering Adversarial Images Using Input Transformations.<br />
<br />
2. Florian Tramèr, Alexey Kurakin, Nicolas Papernot, Ian Goodfellow, Dan Boneh, Patrick McDaniel, Ensemble Adversarial Training: Attacks and defenses.<br />
<br />
3. Abigail Graese, Andras Rozsa, and Terrance E. Boult. Assessing threat of adversarial examples of deep neural networks. CoRR, abs/1610.04256, 2016. <br />
<br />
4. Qinglong Wang, Wenbo Guo, Kaixuan Zhang, Alexander G. Ororbia II, Xinyu Xing, C. Lee Giles, and Xue Liu. Adversary resistant deep neural networks with an application to malware detection. CoRR, abs/1610.01239, 2016a.<br />
<br />
5. Weilin Xu, David Evans, and Yanjun Qi. Feature squeezing: Detecting adversarial examples in deep neural networks. CoRR, abs/1704.01155, 2017. <br />
<br />
6. Gintare Karolina Dziugaite, Zoubin Ghahramani, and Daniel Roy. A study of the effect of JPG compression on adversarial images. CoRR, abs/1608.00853, 2016.<br />
<br />
7. Aleksander Madry, Aleksandar Makelov, Ludwig Schmidt, Dimitris Tsipras, and Adrian Vladu. Towards Deep Learning Models Resistant to Adversarial Attacks, arXiv:1706.06083v3<br />
<br />
8. Alexei Efros and William Freeman. Image quilting for texture synthesis and transfer. In Proc. SIGGRAPH, pp. 341–346, 2001.<br />
<br />
9. Leonid Rudin, Stanley Osher, and Emad Fatemi. Nonlinear total variation based noise removal algorithms. Physica D, 60:259–268, 1992.<br />
<br />
10. Qinglong Wang, Wenbo Guo, Kaixuan Zhang, Alexander G. Ororbia II, Xinyu Xing, C. Lee Giles, and Xue Liu. Learning adversary-resistant deep neural networks. CoRR, abs/1612.01401, 2016b.<br />
<br />
11. Yanpei Liu, Xinyun Chen, Chang Liu, and Dawn Song. Delving into transferable adversarial examples and black-box attacks. CoRR, abs/1611.02770, 2016.<br />
<br />
12. Moustapha Cisse, Yossi Adi, Natalia Neverova, and Joseph Keshet. Houdini: Fooling deep structured prediction models. CoRR, abs/1707.05373, 2017 <br />
<br />
13. Marco Melis, Ambra Demontis, Battista Biggio, Gavin Brown, Giorgio Fumera, and Fabio Roli. Is deep learning safe for robot vision? adversarial examples against the icub humanoid. CoRR,abs/1708.06939, 2017.<br />
<br />
14. Alexey Kurakin, Ian J. Goodfellow, and Samy Bengio. Adversarial examples in the physical world. CoRR, abs/1607.02533, 2016b.<br />
<br />
15. Seyed-Mohsen Moosavi-Dezfooli, Alhussein Fawzi, and Pascal Frossard. Deepfool: A simple and accurate method to fool deep neural networks. In Proc. CVPR, pp. 2574–2582, 2016.<br />
<br />
16. Nicholas Carlini and David A. Wagner. Towards evaluating the robustness of neural networks. In IEEE Symposium on Security and Privacy, pp. 39–57, 2017.<br />
<br />
17. Ian Goodfellow, Jonathon Shlens, and Christian Szegedy. Explaining and harnessing adversarial examples. In Proc. ICLR, 2015.</div>Mpaflahttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Countering_Adversarial_Images_Using_Input_Transformations&diff=39773Countering Adversarial Images Using Input Transformations2018-11-18T14:37:40Z<p>Mpafla: /* Problem Definition */</p>
<hr />
<div>The code for this paper is available here[https://github.com/facebookresearch/adversarial_image_defenses]<br />
<br />
==Motivation ==<br />
As the use of machine intelligence has increased, robustness has become a critical feature to guarantee the reliability of deployed machine-learning systems. However, recent research has shown that existing models are not robust to small, adversarially designed perturbations of the input. Adversarial examples are inputs to machine learning models that an attacker has intentionally designed to cause the model to make a mistake. Adversarially perturbed examples have been deployed to attack image classification services (Liu et al., 2016)[11], speech recognition systems (Cisse et al., 2017a)[12], and robot vision (Melis et al., 2017)[13]. The existence of these adversarial examples has motivated proposals for approaches that increase the robustness of learning systems to such examples.
In the example below (Goodfellow et al.) [17], a small perturbation applied to the original image of a panda changes the prediction to a gibbon.<br />
<br />
[[File:Panda.png|center]]<br />
<br />
==Introduction==<br />
The paper studies strategies that defend against adversarial-example attacks on image-classification systems by transforming the images before feeding them to a Convolutional Network Classifier. <br />
Generally, defenses against adversarial examples fall into two main categories -<br />
<br />
# Model-Specific – They enforce model properties such as smoothness and invariance via the learning algorithm. <br />
# Model-Agnostic – They try to remove adversarial perturbations from the input. <br />
<br />
This paper focuses on increasing the effectiveness of model-agnostic defense strategies. Specifically, the authors investigate the following image transformations as a means of protecting against adversarial images:<br />
<br />
# Image Cropping and Re-scaling (Graese et al, 2016). <br />
# Bit Depth Reduction (Xu et al, 2017) <br />
# JPEG Compression (Dziugaite et al, 2016) <br />
# Total Variance Minimization (Rudin et al, 1992) <br />
# Image Quilting (Efros & Freeman, 2001). <br />
<br />
These image transformations are evaluated against adversarial attacks such as the fast gradient sign method (Goodfellow et al., 2015), its iterative extension (Kurakin et al., 2016a), DeepFool (Moosavi-Dezfooli et al., 2016), and the Carlini & Wagner (2017) <math>L_2</math> attack. <br />
<br />
From their experiments, the strongest defenses are based on Total Variance Minimization and Image Quilting: these defenses are non-differentiable and inherently random, which makes it difficult for an adversary to get around them.<br />
<br />
==Previous Work==<br />
Recently, a lot of research has focused on countering adversarial threats. Wang et al. [4] proposed a new adversary-resistant technique that obstructs attackers from constructing impactful adversarial images by randomly nullifying features within images. Tramèr et al. [2] presented the state-of-the-art Ensemble Adversarial Training method, which augments the training data not only with adversarial images constructed from the model being trained but also with adversarial images generated from an ensemble of other models. Their method, implemented on an Inception V2 classifier, finished 1st among 70 submissions in the NIPS 2017 competition on Defenses against Adversarial Attacks. Graese et al. [3] showed how input transformations such as shifting, blurring, and noise can render the majority of adversarial examples non-adversarial. Xu et al. [5] demonstrated how feature squeezing methods, such as reducing the color bit depth of each pixel and spatial smoothing, defend against attacks. Dziugaite et al. [6] studied the effect of JPG compression on adversarial images.<br />
<br />
==Terminology==<br />
<br />
'''Gray Box Attack''': The model architecture and parameters are public.<br />
<br />
'''Black Box Attack''': The adversary does not have access to the model.<br />
<br />
'''Non-Targeted Adversarial Attack''': The goal of the attack is to modify a source image such that the image will be classified incorrectly by the network.<br />
<br />
'''Targeted Adversarial Attack''': The goal of the attack is to modify a source image such that the image will be classified as a ''target'' class by the network.<br />
<br />
'''Defense''': A defense is a strategy that aims to make the prediction on an adversarial example h(x') equal to the prediction on the corresponding clean example h(x).<br />
<br />
== Problem Definition ==<br />
The paper discusses non-targeted adversarial attacks for image recognition systems. Given image space X, a source image x ∈ X, and a classifier h(.), a non-targeted adversarial example of x is a perturbed image x', such that h(x) ≠ h(x') and d(x, x') ≤ ρ for some dissimilarity function d(·, ·) and ρ ≥ 0 (Euclidean distance is often used).<br />
<br />
From a set of N clean images <math>[x_{1}, \dots, x_{N}]</math>, an adversarial attack aims to generate N images <math>[x^{'}_{1}, \dots, x^{'}_{N}]</math>, such that each <math>x^{'}_{n}</math> is an adversarial example of <math>x_{n}</math>.<br />
<br />
The success rate of an attack is given as: <br />
<br />
[[File:Attack.PNG|200px |]],<br />
<br />
which is the proportion of predictions that were altered by an attack.<br />
<br />
The success rate is generally measured as a function of the magnitude of perturbations performed by the attack. In this paper, L2 perturbations are used and are quantified using the normalized L2-dissimilarity metric:<br />
[[File:diss.png|150px |]]<br />
<br />
A strong adversarial attack has a high success rate while its normalized L2-dissimilarity, given by the above equation, is low.<br />
<br />
==Adversarial Attacks==<br />
<br />
For the experiments, the following four attacks are studied.<br />
<br />
1. '''Fast Gradient Sign Method (FGSM; Goodfellow et al. (2015)) [17]''': Let x be a source input with true label y, and let l be the differentiable loss function used to train the classifier h(.). The corresponding adversarial example is given by:<br />
<br />
[[File:FGSM.PNG|200px |]]<br />
<br />
2. '''Iterative FGSM (I-FGSM; Kurakin et al. (2016b)) [14]''': iteratively applies the FGSM update, where M is the number of iterations. It is given as:<br />
<br />
[[File:IFGSM.PNG|300px |]]<br />
<br />
3. '''DeepFool (Moosavi-Dezfooli et al., 2016) [15]''': projects x onto a linearization of the decision boundary defined by h(.) for M iterations. It is given as:<br />
<br />
[[File:DeepFool.PNG|400px |]]<br />
<br />
4. '''Carlini-Wagner's L2 attack (CW-L2; Carlini & Wagner (2017)) [16]''': an optimization-based attack that combines a differentiable surrogate for the model’s classification accuracy with an L2-penalty term, which encourages the adversarial image to be close to the original image. Let Z(x) be the operation that computes the logit vector (i.e., the output before the softmax layer) for an input x, and <math>Z(x)_k</math> be the logit value corresponding to class k. The untargeted variant of CW-L2 finds a solution to an unconstrained optimization problem. It is given as:<br />
<br />
[[File:Carlini.PNG|500px |]]<br />
<br />
The figure below shows adversarial images and the corresponding perturbations at five levels of normalized L2-dissimilarity for all four attacks mentioned above.<br />
<br />
[[File:Strength.PNG|center|600px |]]<br />
<br />
==Defenses==<br />
A defense is a strategy that aims to make the prediction on an adversarial example equal to the prediction on the corresponding clean example. <br />
Five image transformations that alter the structure of these perturbations have been studied:<br />
# Image Cropping and Re-scaling, <br />
# Bit Depth Reduction, <br />
# JPEG Compression, <br />
# Total Variance Minimization, <br />
# Image Quilting.<br />
<br />
'''Image Cropping and Re-scaling''' alters the spatial positioning of the adversarial perturbation. In this study, images are cropped and re-scaled at training time. At test time, the predictions over random crops of the image are averaged.<br />
<br />
'''Bit Depth Reduction (Xu et al., 2017) [5]''' performs a simple type of quantization that can remove small (adversarial) variations in pixel values from an image. Images are reduced to 3 bits in the experiment.<br />
<br />
'''JPEG Compression and Decompression (Dziugaite et al., 2016) [6]''' removes small perturbations by performing simple quantization.<br />
<br />
'''Total Variance Minimization [9]''':<br />
This defense combines pixel dropout with total variation minimization. It randomly selects a small set of pixels and reconstructs the “simplest” image that is consistent with the selected pixels. The reconstructed image does not contain the adversarial perturbations because these perturbations tend to be small and localized. Specifically, a random set of pixels is first selected by sampling a Bernoulli random variable <math>X(i, j, k)</math> for each pixel location <math>(i, j, k)</math>; a pixel is kept when <math>X(i, j, k) = 1</math>. Next, total variation minimization is used to construct an image z that is similar to the (perturbed) input image x on the selected set of pixels, whilst also being “simple” in terms of total variation, by solving:<br />
<br />
[[File:TV!.png|300px|]] , <br />
<br />
where <math>TV_{p}(z)</math> represents <math>L_{p}</math> total variation of '''z''' :<br />
<br />
[[File:TV2.png|500px|]]<br />
<br />
The total variation (TV) measures the amount of fine-scale variation in the image z, as a result of which TV minimization encourages removal of small (adversarial) perturbations in the image.<br />
<br />
'''Image Quilting (Efros & Freeman, 2001) [8]'''<br />
Image quilting is a non-parametric technique that synthesizes images by piecing together small patches taken from a database of image patches. The algorithm places appropriate patches from the database at a predefined set of grid points and computes minimum graph cuts in all overlapping boundary regions to remove edge artifacts. Image quilting can be used to remove adversarial perturbations by constructing a patch database that only contains patches from "clean" images (without adversarial perturbations). The patches used to create the synthesized image are selected by finding the K nearest neighbors (in pixel space) of the corresponding patch from the adversarial image in the patch database and picking one of these neighbors uniformly at random. The motivation for this defense is that the resulting image only contains pixels that were not modified by the adversary: the database of real patches is unlikely to contain the structures that appear in adversarial images.<br />
<br />
=Experiments=<br />
<br />
Five experiments were performed to test the efficacy of defenses. <br />
<br />
'''Set up:'''<br />
Experiments are performed on the ImageNet image classification dataset. The dataset comprises 1.2 million training images and 50,000 test images, each belonging to one of 1000 classes. The adversarial images are produced by attacking a ResNet-50 model with the attacks described in the Adversarial Attacks section. The strength of an adversary is measured in terms of its normalized L2-dissimilarity. To produce the adversarial images, the L2-dissimilarity of each attack was controlled as follows:<br />
<br />
- FGSM. Increasing the step size <math>\epsilon</math> increases the normalized L2-dissimilarity.<br />
<br />
- I-FGSM. We fix M=10, and increase <math>\epsilon</math> to increase the normalized L2-dissimilarity.<br />
<br />
- DeepFool. We fix M=5, and increase <math>\epsilon</math> to increase the normalized L2-dissimilarity.<br />
<br />
- CW-L2. We fix <math>k</math>=0 and <math>\lambda_{f}</math> =10, and multiply the resulting perturbation <br />
<br />
The hyperparameters of the defenses are fixed in all the experiments. Specifically, the pixel dropout probability was set to <math>p</math>=0.5 and the regularization parameter of the total variation minimizer to <math>\lambda_{TV}</math>=0.03.<br />
<br />
The figure below shows how the setup differs between the experiments that follow. The network is trained either on (a) regular images or (b) transformed images. The different settings are marked by 8.1, 8.2 and 8.3. <br />
[[File:models3.png]] <br />
<br />
==GrayBox- Image Transformation at Test Time== <br />
This experiment applies a transformation to adversarial images at test time before feeding them to a ResNet-50 that was trained to classify clean images. The figure below shows the Top-1 accuracy for the five transformations. Notable observations from the plot: all of the image transformations partly eliminate the effects of the attack; the crop ensemble gives the best accuracy, around 40-60 percent; and the accuracy of the image quilting defense hardly deteriorates as the strength of the adversary increases.<br />
<br />
[[File:sFig4.png|center|600px |]]<br />
<br />
==BlackBox - Image Transformation at Training and Test Time==<br />
A ResNet-50 model was trained on transformed ImageNet training images. Before feeding the images to the network for training, standard data augmentation (from He et al.) along with bit depth reduction, JPEG compression, TV minimization, or image quilting was applied to the images. The classification accuracy on the same adversarial images as in the previous experiment is shown in the figure below. (The adversary cannot use this trained model to generate new adversarial images, hence this is treated as a black-box setting.) The figure shows that training convolutional neural networks on images that are transformed in the same way as at test time dramatically improves the effectiveness of all transformation defenses. Nearly 80-90% of the attacks are defended successfully, even when the L2-dissimilarity is high.<br />
<br />
<br />
[[File:sFig5.png|center|600px |]]<br />
<br />
<br />
==Blackbox - Ensembling==<br />
Four networks (ResNet-50, ResNet-101, DenseNet-169, and Inception-v4) were studied along with an ensemble of defenses, as shown in Table 1. The adversarial images are produced by attacking a ResNet-50 model. The results in the table show that Inception-v4 performs best, possibly because that network has a higher accuracy even in non-adversarial settings. The best ensemble of defenses achieves an accuracy of about 71% against all of the attacks. The attacks deteriorate the accuracy of the best defenses by at most 6%.<br />
<br />
<br />
[[File:sTab1.png|600px|thumb|center|Table 1. Top-1 classification accuracy of ensemble and model transfer defenses (columns) against four black-box attacks (rows). The four networks we use to classify images are ResNet-50 (RN50), ResNet-101 (RN101), DenseNet-169 (DN169), and Inception-v4 (Iv4). Adversarial images are generated by running attacks against the ResNet-50 model, aiming for an average normalized <math>L_2</math>-dissimilarity of 0.06. Higher is better. The best defense against each attack is typeset in boldface.]]<br />
<br />
==GrayBox - Image Transformation at Training and Test Time ==<br />
In this experiment, the adversary has access to the network and its parameters (but not to the input transformations applied at test time). Novel adversarial images were generated by the four attack methods against the networks trained in the earlier experiment (BlackBox - Image Transformation at Training and Test Time). The results show that bit depth reduction and JPEG compression are weak defenses in such a gray-box setting. <br />
The results for this experiment are shown in the figure below. Networks using these defenses classify up to 50% of images correctly.<br />
<br />
[[File:sFig6.png|center| 600px |]]<br />
<br />
==Comparison With Ensemble Adversarial Training==<br />
The results are compared with the state-of-the-art ensemble adversarial training approach proposed by Tramèr et al. [2]. Ensemble adversarial training fits the parameters of a convolutional neural network on adversarial examples that were generated to attack an ensemble of pre-trained models. The comparison uses the model released by Tramèr et al. [2]: an Inception-ResNet-v2 trained on adversarial examples generated by FGSM against Inception-ResNet-v2 and Inception-v3 models. The results of ensemble adversarial training and of the preprocessing techniques studied in this paper are shown in Table 2.<br />
The results show that ensemble adversarial training works better on FGSM attacks (which it uses at training time), but is outperformed by each of the transformation-based defenses on all other attacks.<br />
<br />
<br />
<br />
[[File:sTab2.png|600px|thumb|center|Table 2. Top-1 classification accuracy on images perturbed using attacks against ResNet-50 models trained on input-transformed images and an Inception-v4 model trained using ensemble adversarial. Adversarial images are generated by running attacks against the models, aiming for an average normalized <math>L_2</math>-dissimilarity of 0.06. The best defense against each attack is typeset in boldface.]]<br />
<br />
=Discussion/Conclusions=<br />
The paper proposes reasonable approaches to countering adversarial images. The authors evaluated Total Variance Minimization and Image Quilting and compared them with previously proposed ideas such as Image Cropping and Re-scaling, Bit Depth Reduction, and JPEG Compression and Decompression on the challenging ImageNet dataset.<br />
Previous work by Wang et al. [10] shows that a strong input defense should be non-differentiable and randomized. Two of the defenses, Total Variance Minimization and Image Quilting, possess both properties. As future work, the authors suggest applying the same techniques to other domains such as speech recognition and image segmentation. For example, in speech recognition, total variance minimization can be used to remove perturbations from waveforms, and "spectrogram quilting" techniques that reconstruct a spectrogram could be developed. The input transformations can also be studied in combination with the ensemble adversarial training of Tramèr et al. [2].<br />
<br />
<br />
=Critiques=<br />
1. The terms Black Box, White Box, and Gray Box attack are not defined precisely.<br />
<br />
2. White-box attacks, in which the adversary has full access to the model as well as the pre-processing techniques, could have been considered.<br />
<br />
3. Although the authors did considerable work in showing the effect of four attacks on the ImageNet dataset, much stronger attacks (Madry et al.) [7] could have been evaluated.<br />
<br />
4. The authors claim that the success rate is generally measured as a function of the magnitude of the perturbations performed by the attack, using the L2-dissimilarity, but the claim is not supported by any references. None of the previous work has used these metrics.<br />
<br />
<br />
=References=<br />
<br />
1. Chuan Guo, Mayank Rana, Moustapha Cissé, and Laurens van der Maaten. Countering Adversarial Images Using Input Transformations.<br />
<br />
2. Florian Tramèr, Alexey Kurakin, Nicolas Papernot, Ian Goodfellow, Dan Boneh, Patrick McDaniel, Ensemble Adversarial Training: Attacks and defenses.<br />
<br />
3. Abigail Graese, Andras Rozsa, and Terrance E. Boult. Assessing threat of adversarial examples of deep neural networks. CoRR, abs/1610.04256, 2016. <br />
<br />
4. Qinglong Wang, Wenbo Guo, Kaixuan Zhang, Alexander G. Ororbia II, Xinyu Xing, C. Lee Giles, and Xue Liu. Adversary resistant deep neural networks with an application to malware detection. CoRR, abs/1610.01239, 2016a.<br />
<br />
5. Weilin Xu, David Evans, and Yanjun Qi. Feature squeezing: Detecting adversarial examples in deep neural networks. CoRR, abs/1704.01155, 2017. <br />
<br />
6. Gintare Karolina Dziugaite, Zoubin Ghahramani, and Daniel Roy. A study of the effect of JPG compression on adversarial images. CoRR, abs/1608.00853, 2016.<br />
<br />
7. Aleksander Madry, Aleksandar Makelov, Ludwig Schmidt, Dimitris Tsipras, and Adrian Vladu. Towards Deep Learning Models Resistant to Adversarial Attacks. arXiv:1706.06083v3.<br />
<br />
8. Alexei Efros and William Freeman. Image quilting for texture synthesis and transfer. In Proc. SIGGRAPH, pp. 341–346, 2001.<br />
<br />
9. Leonid Rudin, Stanley Osher, and Emad Fatemi. Nonlinear total variation based noise removal algorithms. Physica D, 60:259–268, 1992.<br />
<br />
10. Qinglong Wang, Wenbo Guo, Kaixuan Zhang, Alexander G. Ororbia II, Xinyu Xing, C. Lee Giles, and Xue Liu. Learning adversary-resistant deep neural networks. CoRR, abs/1612.01401, 2016b.<br />
<br />
11. Yanpei Liu, Xinyun Chen, Chang Liu, and Dawn Song. Delving into transferable adversarial examples and black-box attacks. CoRR, abs/1611.02770, 2016.<br />
<br />
12. Moustapha Cisse, Yossi Adi, Natalia Neverova, and Joseph Keshet. Houdini: Fooling deep structured prediction models. CoRR, abs/1707.05373, 2017 <br />
<br />
13. Marco Melis, Ambra Demontis, Battista Biggio, Gavin Brown, Giorgio Fumera, and Fabio Roli. Is deep learning safe for robot vision? adversarial examples against the icub humanoid. CoRR,abs/1708.06939, 2017.<br />
<br />
14. Alexey Kurakin, Ian J. Goodfellow, and Samy Bengio. Adversarial examples in the physical world. CoRR, abs/1607.02533, 2016b.<br />
<br />
15. Seyed-Mohsen Moosavi-Dezfooli, Alhussein Fawzi, and Pascal Frossard. Deepfool: A simple and accurate method to fool deep neural networks. In Proc. CVPR, pp. 2574–2582, 2016.<br />
<br />
16. Nicholas Carlini and David A. Wagner. Towards evaluating the robustness of neural networks. In IEEE Symposium on Security and Privacy, pp. 39–57, 2017.<br />
<br />
17. Ian Goodfellow, Jonathon Shlens, and Christian Szegedy. Explaining and harnessing adversarial examples. In Proc. ICLR, 2015.</div>Mpaflahttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Countering_Adversarial_Images_Using_Input_Transformations&diff=39772Countering Adversarial Images Using Input Transformations2018-11-18T14:31:13Z<p>Mpafla: /* Problem Definition */</p>
<hr />
<div>The code for this paper is available [https://github.com/facebookresearch/adversarial_image_defenses here].<br />
<br />
==Motivation ==<br />
As the use of machine intelligence has increased, robustness has become a critical feature to guarantee the reliability of deployed machine-learning systems. However, recent research has shown that existing models are not robust to small, adversarially designed perturbations of the input. Adversarial examples are inputs to machine learning models that an attacker has intentionally designed to cause the model to make a mistake. Adversarially perturbed examples have been deployed to attack image classification services (Liu et al., 2016)[11], speech recognition systems (Cisse et al., 2017a)[12], and robot vision (Melis et al., 2017)[13]. The existence of these adversarial examples has motivated proposals for approaches that increase the robustness of learning systems to such examples.<br />
In the example below (Goodfellow et al.) [17], a small perturbation applied to the original image of a panda changes the prediction to a gibbon.<br />
<br />
[[File:Panda.png|center]]<br />
<br />
==Introduction==<br />
The paper studies strategies that defend against adversarial-example attacks on image-classification systems by transforming the images before feeding them to a convolutional network classifier. <br />
Generally, defenses against adversarial examples fall into two main categories:<br />
<br />
# Model-Specific: they enforce model properties such as smoothness and invariance via the learning algorithm. <br />
# Model-Agnostic: they try to remove adversarial perturbations from the input. <br />
<br />
This paper focuses on increasing the effectiveness of model-agnostic defense strategies. Specifically, the authors investigate the following image transformations as a means of protecting against adversarial images:<br />
<br />
# Image Cropping and Re-scaling (Graese et al., 2016) <br />
# Bit Depth Reduction (Xu et al., 2017) <br />
# JPEG Compression (Dziugaite et al., 2016) <br />
# Total Variance Minimization (Rudin et al., 1992) <br />
# Image Quilting (Efros & Freeman, 2001) <br />
<br />
These image transformations have been studied against adversarial attacks such as the fast gradient sign method (Goodfellow et al., 2015), its iterative extension (Kurakin et al., 2016a), DeepFool (Moosavi-Dezfooli et al., 2016), and the Carlini & Wagner (2017) <math>L_2</math> attack. <br />
<br />
In their experiments, the strongest defenses are based on Total Variance Minimization and Image Quilting: these defenses are non-differentiable and inherently random, which makes it difficult for an adversary to get around them.<br />
<br />
==Previous Work==<br />
Recently, a lot of research has focused on countering adversarial threats. Wang et al. [4] proposed an adversary-resistant technique that obstructs attackers from constructing impactful adversarial images by randomly nullifying features within images. Tramèr et al. [2] introduced the state-of-the-art ensemble adversarial training method, which augments the training data not only with adversarial images constructed from the model being trained but also with adversarial images generated from an ensemble of other models. Their method, implemented on an Inception V2 classifier, finished 1st among 70 submissions in the NIPS 2017 competition on Defenses against Adversarial Attacks. Graese et al. [3] showed how input transformations such as shifting, blurring, and noise can render the majority of adversarial examples non-adversarial. Xu et al. [5] demonstrated how feature squeezing methods, such as reducing the color bit depth of each pixel and spatial smoothing, defend against attacks. Dziugaite et al. [6] studied the effect of JPG compression on adversarial images.<br />
<br />
==Terminology==<br />
<br />
'''Gray Box Attack''': The model architecture and parameters are public.<br />
<br />
'''Black Box Attack''': The adversary does not have access to the model.<br />
<br />
'''Non-Targeted Adversarial Attack''': The goal of the attack is to modify a source image in such a way that the image is classified incorrectly by the network.<br />
<br />
'''Targeted Adversarial Attack''': The goal of the attack is to modify a source image in such a way that the image is classified as a ''target'' class by the network.<br />
<br />
'''Defense''': A defense is a strategy that aims to make the prediction on an adversarial example h(x') equal to the prediction on the corresponding clean example h(x).<br />
<br />
== Problem Definition ==<br />
The paper discusses non-targeted adversarial attacks for image recognition systems. Given image space X, a source image x ∈ X, a perturbed image x' ∈ X, and a classifier h(.), a non-targeted adversarial example of x is a perturbed image x', such that h(x) ≠ h(x').<br />
<br />
From a set of N clean images <math>[{x_{1}, \ldots, x_{N}}]</math>, an adversarial attack aims to generate N images <math>[{x^{'}_{1}, \ldots, x^{'}_{N}}]</math> such that each <math>x^{'}_{n}</math> is an adversarial example of <math>x_{n}</math>.<br />
<br />
The success rate of an attack is given as: <br />
<br />
[[File:Attack.PNG|200px |]],<br />
<br />
which is the proportion of predictions that were altered by the attack.<br />
<br />
The success rate is generally measured as a function of the magnitude of perturbations performed by the attack. In this paper, L2 perturbations are used and are quantified using the normalized L2-dissimilarity metric:<br />
[[File:diss.png|150px |]]<br />
<br />
A strong adversarial attack achieves a high success rate while keeping its normalized L2-dissimilarity, given by the above equation, low.<br />
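<br />
The two quantities above can be sketched in a few lines of Python. This is only an illustration: it assumes the success rate is the mean of the indicator h(x_n) ≠ h(x'_n) and that the normalized L2-dissimilarity is the mean of ||x_n − x'_n||_2 / ||x_n||_2, which is how the formulas in the images above are read here. The function names and the toy data are hypothetical.<br />

```python
import math

def success_rate(clean_preds, adv_preds):
    """Fraction of predictions that the attack changed."""
    changed = sum(1 for c, a in zip(clean_preds, adv_preds) if c != a)
    return changed / len(clean_preds)

def l2_norm(v):
    return math.sqrt(sum(t * t for t in v))

def normalized_l2_dissimilarity(clean_images, adv_images):
    """Mean over images of ||x - x'||_2 / ||x||_2 (images as flat lists)."""
    total = 0.0
    for x, x_adv in zip(clean_images, adv_images):
        diff = [a - b for a, b in zip(x, x_adv)]
        total += l2_norm(diff) / l2_norm(x)
    return total / len(clean_images)

# Toy data: two "images" with two pixels each (made up for illustration).
clean = [[3.0, 4.0], [1.0, 0.0]]
adv = [[3.0, 4.5], [1.0, 0.0]]
rate = success_rate([0, 1], [2, 1])                # one of two predictions changed
dissim = normalized_l2_dissimilarity(clean, adv)   # (0.5 / 5 + 0 / 1) / 2
```

In this toy case only the first image is perturbed, so the attack has a success rate of one half at a small dissimilarity.<br />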
<br />
==Adversarial Attacks==<br />
<br />
The following four attacks are studied in the experiments.<br />
<br />
1. '''Fast Gradient Sign Method (FGSM; Goodfellow et al. (2015)) [17]''': Let x be a source input with true label y, and let l be the differentiable loss function used to train the classifier h(.). The corresponding adversarial example is given by:<br />
<br />
[[File:FGSM.PNG|200px |]]<br />
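<br />
As a rough illustration of the FGSM update x' = x + ε · sign(∇_x l(x, y)), the sketch below attacks a tiny logistic-regression classifier whose input gradient can be written by hand. The model, its weights, and the helper names are assumptions made for this example and are not taken from the paper.<br />

```python
import math

def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))

def logit(x, w, b):
    return sum(wi * xi for wi, xi in zip(w, x)) + b

def predict(x, w, b):
    return 1 if logit(x, w, b) > 0 else 0

def loss_gradient(x, y, w, b):
    """Gradient of the binary cross-entropy loss with respect to the input x."""
    p = sigmoid(logit(x, w, b))
    return [(p - y) * wi for wi in w]

def sign(t):
    return (t > 0) - (t < 0)

def fgsm(x, y, w, b, eps):
    """Single FGSM step: move every input coordinate by eps in the loss-increasing direction."""
    grad = loss_gradient(x, y, w, b)
    return [xi + eps * sign(g) for xi, g in zip(x, grad)]

# Hypothetical toy model and input.
w, b = [2.0, -1.0], 0.0
x, y = [0.5, 0.2], 1
x_adv = fgsm(x, y, w, b, eps=0.5)
```

With these toy numbers the clean input is classified as class 1 and the perturbed input as class 0, so the attack succeeds.<br />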
<br />
2. '''Iterative FGSM (I-FGSM; Kurakin et al. (2016b)) [14]''': iteratively applies the FGSM update, where M is the number of iterations. It is given as:<br />
<br />
[[File:IFGSM.PNG|300px |]]<br />
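<br />
The iterative variant can be sketched by repeating the same sign-gradient step M times, recomputing the gradient at the current iterate. As before, the logistic-regression model and all names are hypothetical, and the sketch omits any clipping of pixel values, which is an assumption of this example.<br />

```python
import math

def input_gradient(x, y, w, b):
    """Input gradient of the binary cross-entropy loss for logistic regression."""
    p = 1.0 / (1.0 + math.exp(-(sum(wi * xi for wi, xi in zip(w, x)) + b)))
    return [(p - y) * wi for wi in w]

def iterative_fgsm(x, y, w, b, eps, num_iters):
    """Apply the FGSM update num_iters times, starting from the clean input."""
    x_adv = list(x)
    for _ in range(num_iters):
        grad = input_gradient(x_adv, y, w, b)
        x_adv = [xi + eps * ((g > 0) - (g < 0)) for xi, g in zip(x_adv, grad)]
    return x_adv

# Hypothetical toy model; M = 10 small steps instead of one large step.
w, b = [2.0, -1.0], 0.0
x_adv = iterative_fgsm([0.5, 0.2], 1, w, b, eps=0.05, num_iters=10)
```

For a linear model the gradient sign never changes, so ten steps of size 0.05 end at the same point as one step of size 0.5; for a deep network the iterates generally follow a curved path.<br />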
<br />
3. '''DeepFool (Moosavi-Dezfooli et al., 2016) [15]''': projects x onto a linearization of the decision boundary defined by h(.) for M iterations. It is given as:<br />
<br />
[[File:DeepFool.PNG|400px |]]<br />
<br />
4. '''Carlini-Wagner's L2 attack (CW-L2; Carlini & Wagner (2017)) [16]''': an optimization-based attack that combines a differentiable surrogate for the model’s classification accuracy with an L2-penalty term that encourages the adversarial image to stay close to the original image. Let Z(x) be the operation that computes the logit vector (i.e., the output before the softmax layer) for an input x, and let <math>Z(x)_k</math> be the logit value corresponding to class k. The untargeted variant of CW-L2 finds a solution to the following unconstrained optimization problem:<br />
<br />
[[File:Carlini.PNG|500px |]]<br />
<br />
The figure below shows adversarial images and the corresponding perturbations at five levels of normalized L2-dissimilarity for all four attacks mentioned above.<br />
<br />
[[File:Strength.PNG|center|600px |]]<br />
<br />
==Defenses==<br />
A defense is a strategy that aims to make the prediction on an adversarial example equal to the prediction on the corresponding clean example. <br />
Five image transformations that alter the structure of adversarial perturbations have been studied:<br />
# Image Cropping and Re-scaling, <br />
# Bit Depth Reduction, <br />
# JPEG Compression, <br />
# Total Variance Minimization, <br />
# Image Quilting.<br />
<br />
'''Image Cropping and Re-scaling''' alters the spatial positioning of the adversarial perturbation. In this study, images are cropped and re-scaled at training time. At test time, the predictions over random crops of the image are averaged.<br />
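<br />
The test-time averaging can be sketched as follows. The classifier here is a made-up stand-in that maps a crop to two class probabilities, the re-scaling step is left out for brevity, and the crop size and number of crops are arbitrary choices for the example.<br />

```python
import random

def random_crop(image, size, rng):
    """Cut a size x size window at a random position (image as a list of rows)."""
    top = rng.randint(0, len(image) - size)
    left = rng.randint(0, len(image[0]) - size)
    return [row[left:left + size] for row in image[top:top + size]]

def crop_ensemble_predict(image, classify, size, num_crops, rng):
    """Average the class probabilities predicted on several random crops."""
    sums = None
    for _ in range(num_crops):
        probs = classify(random_crop(image, size, rng))
        sums = probs if sums is None else [s + p for s, p in zip(sums, probs)]
    return [s / num_crops for s in sums]

def toy_classifier(crop):
    """Hypothetical two-class model: probability of class 1 is the mean pixel value."""
    mean = sum(sum(row) for row in crop) / (len(crop) * len(crop[0]))
    return [1.0 - mean, mean]

image = [[float((r + c) % 2) for c in range(6)] for r in range(6)]
avg_probs = crop_ensemble_predict(image, toy_classifier, size=4, num_crops=8,
                                  rng=random.Random(0))
```

Because each crop sees the perturbation at a different spatial offset, the averaged prediction depends less on any single pixel alignment.<br />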
<br />
'''Bit Depth Reduction (Xu et al., 2017) [5]''' performs a simple type of quantization that can remove small (adversarial) variations in pixel values from an image. Images are reduced to 3 bits in the experiment.<br />
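<br />
A minimal sketch of 3-bit quantization, assuming pixel values are scaled to the range [0, 1] and rounded to the nearest of the 2^3 available levels (the exact rounding rule used in the experiments is not stated here, so this is an assumption):<br />

```python
def reduce_bit_depth(pixels, bits=3):
    """Quantize pixel values in [0, 1] to 2**bits evenly spaced levels."""
    levels = 2 ** bits - 1
    return [round(p * levels) / levels for p in pixels]

# A clean pixel row and a slightly perturbed copy (made-up values).
clean = [0.40, 0.10, 0.90]
adversarial = [0.42, 0.12, 0.91]

clean_q = reduce_bit_depth(clean)
adv_q = reduce_bit_depth(adversarial)
```

In this toy case the small perturbation falls inside the same quantization bins, so both rows map to identical quantized values and the perturbation is removed.<br />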
<br />
'''JPEG Compression and Decompression (Dziugaite et al., 2016) [6]''' removes small perturbations by performing simple quantization.<br />
<br />
'''Total Variance Minimization [9]''':<br />
This defense combines pixel dropout with total variation minimization. It randomly selects a small set of pixels and reconstructs the “simplest” image that is consistent with the selected pixels. The reconstructed image does not contain the adversarial perturbations because these perturbations tend to be small and localized. Specifically, a random set of pixels is first selected by sampling a Bernoulli random variable <math>X(i, j, k)</math> for each pixel location <math>(i, j, k)</math>; a pixel is kept when <math>X(i, j, k) = 1</math>. Next, total variation minimization is used to construct an image z that is similar to the (perturbed) input image x on the selected set of pixels, whilst also being “simple” in terms of total variation, by solving:<br />
<br />
[[File:TV!.png|300px|]] , <br />
<br />
where <math>TV_{p}(z)</math> represents <math>L_{p}</math> total variation of '''z''' :<br />
<br />
[[File:TV2.png|500px|]]<br />
<br />
The total variation (TV) measures the amount of fine-scale variation in the image z, as a result of which TV minimization encourages removal of small (adversarial) perturbations in the image.<br />
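<br />
The idea can be illustrated on a one-dimensional signal. The sketch below is a simplification and not the paper's objective: it uses a squared data term on the kept pixels, a smoothed total variation so that plain gradient descent applies, and hand-picked values for the weight, smoothing constant, and step size. All names are hypothetical.<br />

```python
import math
import random

def smooth_tv(z, delta=1e-2):
    """Smoothed 1-D total variation: sum of sqrt((z[i+1] - z[i])^2 + delta)."""
    return sum(math.sqrt((z[i + 1] - z[i]) ** 2 + delta) for i in range(len(z) - 1))

def tv_minimize(x, mask, lam=0.1, delta=1e-2, lr=0.1, steps=300):
    """Gradient descent on sum_i mask[i] * (z[i] - x[i])^2 + lam * smooth_tv(z)."""
    z = list(x)
    n = len(z)
    for _ in range(steps):
        # Data term: only pixels kept by the dropout mask pull z towards x.
        grad = [2.0 * mask[i] * (z[i] - x[i]) for i in range(n)]
        # TV term: each neighbouring difference contributes to both of its endpoints.
        for i in range(n - 1):
            d = z[i + 1] - z[i]
            g = lam * d / math.sqrt(d * d + delta)
            grad[i + 1] += g
            grad[i] -= g
        z = [zi - lr * gi for zi, gi in zip(z, grad)]
    return z

rng = random.Random(0)
# A step signal with a small oscillation standing in for an adversarial perturbation.
x = [0.0, 0.1, 0.0, 0.1, 0.0, 1.0, 1.1, 1.0, 1.1, 1.0]
mask = [1 if rng.random() < 0.5 else 0 for _ in x]  # keep each pixel with probability 0.5
z = tv_minimize(x, mask)
```

The reconstruction has lower total variation than the input because the small oscillations are cheap to flatten, while the large step is anchored by the kept pixels on either side of it.<br />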
<br />
'''Image Quilting (Efros & Freeman, 2001) [8]'''<br />
Image quilting is a non-parametric technique that synthesizes images by piecing together small patches taken from a database of image patches. The algorithm places appropriate patches from the database at a predefined set of grid points and computes minimum graph cuts in all overlapping boundary regions to remove edge artifacts. Image quilting can be used to remove adversarial perturbations by constructing a patch database that only contains patches from "clean" images (without adversarial perturbations). The patches used to create the synthesized image are selected by finding the K nearest neighbors (in pixel space) of the corresponding patch from the adversarial image in the patch database and picking one of these neighbors uniformly at random. The motivation for this defense is that the resulting image only contains pixels that were not modified by the adversary: the database of real patches is unlikely to contain the structures that appear in adversarial images.<br />
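<br />
The patch-selection step can be sketched as a K-nearest-neighbor lookup followed by a uniform random choice. The sketch covers only that step: the graph-cut blending of overlapping patches is omitted, patches are flat lists of pixel values, and the tiny database is invented for illustration.<br />

```python
import random

def squared_distance(a, b):
    """Squared Euclidean distance between two patches in pixel space."""
    return sum((p - q) ** 2 for p, q in zip(a, b))

def pick_clean_patch(adv_patch, patch_db, k, rng):
    """Find the k clean patches closest to adv_patch and return one uniformly at random."""
    neighbours = sorted(patch_db, key=lambda patch: squared_distance(adv_patch, patch))[:k]
    return rng.choice(neighbours)

# Hypothetical database of clean 2x2 patches, flattened to four values each.
patch_db = [
    [0.0, 0.0, 0.0, 0.0],
    [0.1, 0.1, 0.1, 0.1],
    [1.0, 1.0, 1.0, 1.0],
    [0.9, 1.0, 0.9, 1.0],
]
adv_patch = [0.95, 1.02, 0.93, 0.98]
chosen = pick_clean_patch(adv_patch, patch_db, k=2, rng=random.Random(0))
```

Whichever neighbor is drawn, the output patch comes from the clean database, so none of its pixel values were set by the adversary; the random draw is what makes the transformation non-deterministic.<br />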
<br />
=Experiments=<br />
<br />
Five experiments were performed to test the efficacy of defenses. <br />
<br />
'''Set up:'''<br />
Experiments are performed on the ImageNet image classification dataset. The dataset comprises 1.2 million training images and 50,000 test images, each belonging to one of 1000 classes. The adversarial images are produced by attacking a ResNet-50 model with the attacks described in the Adversarial Attacks section. The strength of an adversary is measured in terms of its normalized L2-dissimilarity. To produce the adversarial images, the L2-dissimilarity of each attack was controlled as follows:<br />
<br />
- FGSM. Increasing the step size <math>\epsilon</math> increases the normalized L2-dissimilarity.<br />
<br />
- I-FGSM. We fix M=10, and increase <math>\epsilon</math> to increase the normalized L2-dissimilarity.<br />
<br />
- DeepFool. We fix M=5, and increase <math>\epsilon</math> to increase the normalized L2-dissimilarity.<br />
<br />
- CW-L2. We fix <math>k</math>=0 and <math>\lambda_{f}</math> =10, and multiply the resulting perturbation <br />
<br />
The hyperparameters of the defenses are fixed in all the experiments. Specifically, the pixel dropout probability was set to <math>p</math>=0.5 and the regularization parameter of the total variation minimizer to <math>\lambda_{TV}</math>=0.03.<br />
<br />
The figure below shows how the setup differs between the experiments that follow. The network is trained either on (a) regular images or (b) transformed images. The different settings are marked by 8.1, 8.2 and 8.3. <br />
[[File:models3.png]] <br />
<br />
==GrayBox- Image Transformation at Test Time== <br />
This experiment applies a transformation to adversarial images at test time before feeding them to a ResNet-50 that was trained to classify clean images. The figure below shows the Top-1 accuracy for the five transformations. Notable observations from the plot: all of the image transformations partly eliminate the effects of the attack; the crop ensemble gives the best accuracy, around 40-60 percent; and the accuracy of the image quilting defense hardly deteriorates as the strength of the adversary increases.<br />
<br />
[[File:sFig4.png|center|600px |]]<br />
<br />
==BlackBox - Image Transformation at Training and Test Time==<br />
A ResNet-50 model was trained on transformed ImageNet training images. Before feeding the images to the network for training, standard data augmentation (from He et al.) along with bit depth reduction, JPEG compression, TV minimization, or image quilting was applied to the images. The classification accuracy on the same adversarial images as in the previous experiment is shown in the figure below. (The adversary cannot use this trained model to generate new adversarial images, hence this is treated as a black-box setting.) The figure shows that training convolutional neural networks on images that are transformed in the same way as at test time dramatically improves the effectiveness of all transformation defenses. Nearly 80-90% of the attacks are defended successfully, even when the L2-dissimilarity is high.<br />
<br />
<br />
[[File:sFig5.png|center|600px |]]<br />
<br />
<br />
==Blackbox - Ensembling==<br />
Four networks (ResNet-50, ResNet-101, DenseNet-169, and Inception-v4) were studied along with an ensemble of defenses, as shown in Table 1. The adversarial images are produced by attacking a ResNet-50 model. The results in the table show that Inception-v4 performs best, possibly because that network has a higher accuracy even in non-adversarial settings. The best ensemble of defenses achieves an accuracy of about 71% against all of the attacks. The attacks deteriorate the accuracy of the best defenses by at most 6%.<br />
<br />
<br />
[[File:sTab1.png|600px|thumb|center|Table 1. Top-1 classification accuracy of ensemble and model transfer defenses (columns) against four black-box attacks (rows). The four networks we use to classify images are ResNet-50 (RN50), ResNet-101 (RN101), DenseNet-169 (DN169), and Inception-v4 (Iv4). Adversarial images are generated by running attacks against the ResNet-50 model, aiming for an average normalized <math>L_2</math>-dissimilarity of 0.06. Higher is better. The best defense against each attack is typeset in boldface.]]<br />
<br />
==GrayBox - Image Transformation at Training and Test Time ==<br />
In this experiment, the adversary has access to the network and its parameters (but not to the input transformations applied at test time). Novel adversarial images were generated by the four attack methods against the networks trained in the earlier experiment (BlackBox - Image Transformation at Training and Test Time). The results show that bit depth reduction and JPEG compression are weak defenses in such a gray-box setting. <br />
The results for this experiment are shown in the figure below. Networks using these defenses classify up to 50% of images correctly.<br />
<br />
[[File:sFig6.png|center| 600px |]]<br />
<br />
==Comparison With Ensemble Adversarial Training==<br />
The results are compared with the state-of-the-art ensemble adversarial training approach proposed by Tramèr et al. [2]. Ensemble adversarial training fits the parameters of a convolutional neural network on adversarial examples that were generated to attack an ensemble of pre-trained models. The comparison uses the model released by Tramèr et al. [2]: an Inception-ResNet-v2 trained on adversarial examples generated by FGSM against Inception-ResNet-v2 and Inception-v3 models. The results of ensemble adversarial training and of the preprocessing techniques studied in this paper are shown in Table 2.<br />
The results show that ensemble adversarial training works better on FGSM attacks (which it uses at training time), but is outperformed by each of the transformation-based defenses on all other attacks.<br />
<br />
<br />
<br />
[[File:sTab2.png|600px|thumb|center|Table 2. Top-1 classification accuracy on images perturbed using attacks against ResNet-50 models trained on input-transformed images and an Inception-v4 model trained using ensemble adversarial. Adversarial images are generated by running attacks against the models, aiming for an average normalized <math>L_2</math>-dissimilarity of 0.06. The best defense against each attack is typeset in boldface.]]<br />
<br />
=Discussion/Conclusions=<br />
The paper proposes reasonable approaches to countering adversarial images. The authors evaluated Total Variance Minimization and Image Quilting and compared them with previously proposed ideas such as Image Cropping and Re-scaling, Bit Depth Reduction, and JPEG Compression and Decompression on the challenging ImageNet dataset.<br />
Previous work by Wang et al. [10] shows that a strong input defense should be non-differentiable and randomized. Two of the defenses, Total Variance Minimization and Image Quilting, possess both properties. As future work, the authors suggest applying the same techniques to other domains such as speech recognition and image segmentation. For example, in speech recognition, total variance minimization can be used to remove perturbations from waveforms, and "spectrogram quilting" techniques that reconstruct a spectrogram could be developed. The input transformations can also be studied in combination with the ensemble adversarial training of Tramèr et al. [2].<br />
<br />
<br />
=Critiques=<br />
1. The terms black-box, white-box, and gray-box attack are not clearly defined.<br />
<br />
2. White-box attacks, in which the adversary has full access to the model as well as to the pre-processing techniques, could have been considered.<br />
<br />
3. Although the authors did considerable work in showing the effect of four attacks on the ImageNet dataset, much stronger attacks (Madry et al. [7]) could have been evaluated.<br />
<br />
4. The authors claim that the success rate is generally measured as a function of the magnitude of the perturbations performed by the attack, using the <math>L_2</math>-dissimilarity, but the claim is not supported by any references. None of the previous work has used this metric.<br />
<br />
<br />
=References=<br />
<br />
1. Chuan Guo, Mayank Rana, Moustapha Cissé, and Laurens van der Maaten. Countering Adversarial Images Using Input Transformations.<br />
<br />
2. Florian Tramèr, Alexey Kurakin, Nicolas Papernot, Ian Goodfellow, Dan Boneh, Patrick McDaniel, Ensemble Adversarial Training: Attacks and defenses.<br />
<br />
3. Abigail Graese, Andras Rozsa, and Terrance E. Boult. Assessing threat of adversarial examples of deep neural networks. CoRR, abs/1610.04256, 2016. <br />
<br />
4. Qinglong Wang, Wenbo Guo, Kaixuan Zhang, Alexander G. Ororbia II, Xinyu Xing, C. Lee Giles, and Xue Liu. Adversary resistant deep neural networks with an application to malware detection. CoRR, abs/1610.01239, 2016a.<br />
<br />
5. Weilin Xu, David Evans, and Yanjun Qi. Feature squeezing: Detecting adversarial examples in deep neural networks. CoRR, abs/1704.01155, 2017.<br />
<br />
6. Gintare Karolina Dziugaite, Zoubin Ghahramani, and Daniel Roy. A study of the effect of JPG compression on adversarial images. CoRR, abs/1608.00853, 2016.<br />
<br />
7. Aleksander Madry, Aleksandar Makelov, Ludwig Schmidt, Dimitris Tsipras, and Adrian Vladu. Towards Deep Learning Models Resistant to Adversarial Attacks. arXiv:1706.06083v3.<br />
<br />
8. Alexei Efros and William Freeman. Image quilting for texture synthesis and transfer. In Proc. SIGGRAPH, pp. 341–346, 2001.<br />
<br />
9. Leonid Rudin, Stanley Osher, and Emad Fatemi. Nonlinear total variation based noise removal algorithms. Physica D, 60:259–268, 1992.<br />
<br />
10. Qinglong Wang, Wenbo Guo, Kaixuan Zhang, Alexander G. Ororbia II, Xinyu Xing, C. Lee Giles, and Xue Liu. Learning adversary-resistant deep neural networks. CoRR, abs/1612.01401, 2016b.<br />
<br />
11. Yanpei Liu, Xinyun Chen, Chang Liu, and Dawn Song. Delving into transferable adversarial examples and black-box attacks. CoRR, abs/1611.02770, 2016.<br />
<br />
12. Moustapha Cisse, Yossi Adi, Natalia Neverova, and Joseph Keshet. Houdini: Fooling deep structured prediction models. CoRR, abs/1707.05373, 2017 <br />
<br />
13. Marco Melis, Ambra Demontis, Battista Biggio, Gavin Brown, Giorgio Fumera, and Fabio Roli. Is deep learning safe for robot vision? adversarial examples against the icub humanoid. CoRR,abs/1708.06939, 2017.<br />
<br />
14. Alexey Kurakin, Ian J. Goodfellow, and Samy Bengio. Adversarial examples in the physical world. CoRR, abs/1607.02533, 2016b.<br />
<br />
15. Seyed-Mohsen Moosavi-Dezfooli, Alhussein Fawzi, and Pascal Frossard. Deepfool: A simple and accurate method to fool deep neural networks. In Proc. CVPR, pp. 2574–2582, 2016.<br />
<br />
16. Nicholas Carlini and David A. Wagner. Towards evaluating the robustness of neural networks. In IEEE Symposium on Security and Privacy, pp. 39–57, 2017.<br />
<br />
17. Ian Goodfellow, Jonathon Shlens, and Christian Szegedy. Explaining and harnessing adversarial examples. In Proc. ICLR, 2015.</div>Mpaflahttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Robot_Learning_in_Homes:_Improving_Generalization_and_Reducing_Dataset_Bias&diff=39771Robot Learning in Homes: Improving Generalization and Reducing Dataset Bias2018-11-18T14:18:45Z<p>Mpafla: /* Grasping Formulation */</p>
<hr />
<div>==Introduction==<br />
<br />
<br />
The use of data-driven approaches in robotics has increased over the last decade. Instead of using hand-designed models, these approaches work on large-scale datasets and learn policies that map from high-dimensional observations to actions. Since collecting data with an actual robot in real time is very expensive, most data-driven approaches in robotics collect simulated data from simulators. The concern is whether these approaches are robust enough to domain shift to be used on real-world data, since there is a wide reality gap between simulators and the real world.<br />
<br />
On the other hand, declining hardware costs have made it feasible to collect more data for a variety of tasks, pushing the robotics community to collect real-world physical data. This approach has been quite successful at tasks such as grasping, pushing, poking, and imitation learning. However, the major problem is that the performance of these learning models is not good enough and tends to plateau quickly. Furthermore, robotic action data have not led to gains similar to those seen in other areas such as computer vision and natural language processing. The paper claims that the solution to all of these obstacles is using “real data”: current robotic datasets lack diversity of environment, so learning-based approaches need to move out of simulators and labs and into real environments such as homes, where they can learn from real datasets.<br />
<br />
Collecting real data and working with it poses several challenges. First, cheap and compact robots are needed to collect data in homes, but current industrial robots (e.g., Sawyer and Baxter) are too expensive. Second, cheap robots are not accurate enough to collect reliable data. Third, data collection in homes cannot be supervised at all times. These challenges, together with other external factors, can result in noisy data. This paper presents a first systematic effort to collect a dataset inside people's homes, which has the following parts:<br />
<br />
-A cheap robot which is appropriate for use in homes<br />
<br />
-Collecting training data in 6 different homes and testing data in 3 homes<br />
<br />
-An approach for modeling the noise in the labeled data<br />
<br />
[[File:aa1.PNG|600px|thumb|center|]]<br />
<br />
==Overview==<br />
<br />
<br />
This paper emphasizes the importance of diversifying the data for robot learning in order to achieve better generalization, focusing on the task of grasping. A diverse dataset also allows for removing biases in the data. The paper argues that even for simple tasks like grasping, datasets collected in labs suffer from strong biases such as simple backgrounds and identical environment dynamics. Hence, models learned from them do not generalize and do not work well on real datasets.<br />
<br />
Collecting large-scale data inside a large number of homes requires a low-cost robot. For this reason, they introduced a customized mobile manipulator: a Dobot Magician robotic arm mounted on a Kobuki low-cost mobile base. The resulting robot arm has five degrees of freedom (DOF). They also added an Intel R200 RGBD camera at a height of 1 m above the ground. An onboard laptop with an Intel Core i5 processor performs all the processing. The whole system can run for 1.5 hours on a single charge.<br />
<br />
There is a trade-off, however: a low-cost robot is less accurate to control. Because it is built from cheaper components than expensive setups such as Baxter and Sawyer, it suffers from higher calibration and execution errors. This means that the dataset collected with this approach is diverse and large, but it has noisy labels. To illustrate, suppose the robot wants to grasp at location <math> {(x, y)}</math>. Because of noise in the execution, the robot may perform the action at location <math> {(x + \delta_{x}, y+ \delta_{y})}</math> instead, so the success or failure label of the action would be assigned to the wrong place. To solve this problem, they used an approach that learns from noisy data: they modeled the noise as a latent variable and used two networks, one for predicting the noise and one for predicting the action to execute.<br />
<br />
==Learning on low-cost robot data==<br />
<br />
This paper uses a patch grasping framework in its proposed architecture. As mentioned before, datasets collected by cheap, inaccurate robots tend to have noisy labels. The noise in the labels can be caused by hardware execution error, inaccurate kinematics, camera calibration, proprioception, wear and tear, etc. The different parts of the architecture are explained below:<br />
<br />
<br />
===Grasping Formulation===<br />
<br />
The architecture focuses on planar grasping, which means that all objects are grasped at the same height and perpendicular to the ground. The final goal is to find <math>{(x, y, \theta)}</math> given an observation <math> {I}</math> of the object, where <math> {x}</math> and <math> {y}</math> are the translational degrees of freedom and <math> {\theta}</math> is the rotational degree of freedom. For the purpose of comparison, they used a model that does not predict <math>{(x, y, \theta)}</math> directly from the image <math> {I}</math>; instead, it samples several smaller patches <math> {I_{P}}</math> at different locations <math>{(x, y)}</math> and predicts the grasp angle <math> {\theta}</math> from these patches. To allow multimodal predictions, the angle <math> {\theta}</math> is discretized into steps <math> {\theta_{D}}</math>.<br />
<br />
Hence, each datapoint consists of an image <math> {I}</math>, the executed grasp <math>{(x, y, \theta)}</math>, and the grasp success/failure label <math> g </math>. The image <math> {I}</math> and the angle <math> {\theta}</math> are converted to an image patch <math> {I_{P}}</math> and a discretized angle <math> {\theta_{D}}</math>. A binary cross-entropy loss is then used to minimize the classification error between the predicted and ground-truth label <math> g </math>.<br />
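<br />
The conversion of <math> {\theta}</math> to <math> {\theta_{D}}</math> and the binary cross-entropy loss can be sketched as below. This is a minimal illustration, not the authors' code: the number of angle bins (18), the folding of angles into a 180-degree range, and all function names are assumptions made for the example.<br />

```python
import math

NUM_ANGLE_BINS = 18  # assumption: the summary does not state the number of bins


def discretize_angle(theta_deg, num_bins=NUM_ANGLE_BINS):
    """Map a grasp angle in degrees to a bin index (theta -> theta_D).

    Assumes angles are folded into [0, 180), treating a grasp and the
    same grasp rotated by 180 degrees as equivalent.
    """
    folded = theta_deg % 180.0
    bin_width = 180.0 / num_bins
    return min(int(folded / bin_width), num_bins - 1)


def binary_cross_entropy(p_success, g, eps=1e-7):
    """BCE between predicted success probability and label g in {0, 1}."""
    p = min(max(p_success, eps), 1.0 - eps)  # clamp to avoid log(0)
    return -(g * math.log(p) + (1 - g) * math.log(1.0 - p))


# One hypothetical datapoint: a grasp executed at 95 degrees that succeeded.
theta_d = discretize_angle(95.0)
loss = binary_cross_entropy(p_success=0.5, g=1)
```

With one output per angle bin, the network can assign high success probability to several distinct angles for the same patch, which is what makes the prediction multimodal.<br />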
<br />
===Modeling noise as latent variable===<br />
<br />
To tackle the problem of inaccurate position control, they looked for structure in the noise, which depends on the robot and its design. They modeled this structure as a latent variable, as shown in Figure 2:<br />
<br />
<br />
[[File:aa2.PNG|600px|thumb|center|]]<br />
<br />
<br />
<br />
The grasp success probability for image patch <math> {I_{P}}</math> at angle <math> {\theta_{D}}</math> is represented as <math> {P(g \mid I_{P},\theta_{D}, R )}</math>, where <math> {R}</math> represents environment variables that can add noise to the system.<br />
<br />
The conditional probability of grasping for this model is computed by:<br />
<br />
<br />
\[ { P(g \mid I_{P},\theta_{D}, R ) = \sum_{\hat{I}_{P} \in P} P(g \mid z=\hat{I}_{P},\theta_{D},R ) \cdot P(z=\hat{I}_{P} \mid \theta_{D},I_{P} ,R ) } \]<br />
<br />
<br />
<br />
<br />
Here, <math> {z}</math> models the latent variable of the actual patch executed, and <math> {\hat{I}_{P}}</math> belongs to a set of possible neighboring patches <math> {P}</math>. <math> {P(z=\hat{I}_{P} \mid \theta_{D},I_{P} ,R )}</math> models the noise caused by the variables <math> {R}</math> and is implemented as the Noise Modelling Network (NMN). <math> {P(g \mid z=\hat{I}_{P},\theta_{D},R )}</math> is the grasp prediction probability given the true patch and is implemented as the Grasp Prediction Network (GPN). The overall Robust-Grasp model is computed by marginalizing the GPN output over the NMN distribution.<br />
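<br />
The marginalization is a probability-weighted sum over candidate patches. The sketch below illustrates it in plain Python, assuming the GPN yields one success probability per candidate patch and the NMN yields one unnormalized score per candidate patch that is turned into a distribution with a softmax. Both the softmax and the function names are assumptions for illustration, not details taken from the paper.<br />

```python
import math


def softmax(scores):
    """Turn unnormalized NMN scores into a distribution over candidate patches."""
    largest = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - largest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def marginal_grasp_probability(gpn_probs, nmn_probs):
    """Sum over patches of P(g | z = patch) * P(z = patch)."""
    if len(gpn_probs) != len(nmn_probs):
        raise ValueError("GPN and NMN outputs must cover the same patches")
    return sum(p_g * p_z for p_g, p_z in zip(gpn_probs, nmn_probs))


# Toy example with three candidate patches: the NMN believes the grasp
# most likely landed on the second patch, where the GPN is confident.
gpn_probs = [0.2, 0.9, 0.4]
nmn_probs = [0.1, 0.8, 0.1]
p_success = marginal_grasp_probability(gpn_probs, nmn_probs)
```

If the NMN put all of its mass on the commanded patch, this expression would reduce to the plain patch-grasping prediction, so the noise model can be read as a learned correction on top of it.<br />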
<br />
===Learning the latent noise model===<br />
<br />
<br />
They assume that <math> {z}</math> is conditionally independent of the local patch-specific variables <math> {(I_{P}, \theta_{D})}</math> given the environment variables <math> {R}</math>. To estimate the latent variable <math> {z}</math>, they use direct optimization to learn both the NMN and the GPN from noisy labels. The inputs of the NMN are the entire image of the scene and the environment information, and its output is a probability distribution over the actual patches where the grasp is executed. Finally, a binary cross-entropy loss is applied between the marginalized output of the two networks and the true grasp label <math> g </math>.<br />
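<br />
Written out under the standard definition of binary cross entropy, the loss for a single datapoint with label <math>g \in \{0, 1\}</math> would take the following form. The symbols <math>\mathcal{L}</math> and <math>\hat{p}</math> are introduced here for illustration only.<br />
<br />
\[ \mathcal{L} = -\left[ g \log \hat{p} + (1-g) \log (1-\hat{p}) \right], \qquad \hat{p} = \sum_{\hat{I}_{P} \in P} P(g=1 \mid z=\hat{I}_{P},\theta_{D},R ) \cdot P(z=\hat{I}_{P} \mid \theta_{D},I_{P} ,R ) \]<br />
<br />
Because <math>\hat{p}</math> depends on the outputs of both networks, the gradient of this loss reaches the GPN and the NMN alike, which is what allows them to be learned jointly from the noisy labels.<br />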
<br />
===Training details===<br />
<br />
<br />
They implemented their model in PyTorch using a pretrained ResNet-18 model. For the NMN, they concatenated the 512-dimensional ResNet feature with a one-hot vector of the robot ID and the raw pixel location of the grasp. The inputs of the GPN are the original noisy patch plus 8 other patches equidistant from the original one.<br />
Training starts with the GPN alone for 5 epochs of the data. Then the NMN and the marginalization operator are added to the model, and the NMN and GPN are trained simultaneously for a further 25 epochs.<br />
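<br />
The input construction and the two-stage schedule described above can be sketched as follows. This is a hedged illustration in plain Python rather than the authors' PyTorch code: the 3x3 grid used for the 9 candidate patches is one possible reading of "equidistant", and the function names, the `offset` parameter, and the number of robots are hypothetical.<br />

```python
FEATURE_DIM = 512      # size of the ResNet-18 feature, as stated above
GPN_ONLY_EPOCHS = 5    # stage 1: GPN trained alone
JOINT_EPOCHS = 25      # stage 2: GPN and NMN trained together


def build_nmn_input(feature, robot_id, num_robots, pixel_xy):
    """Concatenate image feature, one-hot robot ID and raw pixel location."""
    if len(feature) != FEATURE_DIM:
        raise ValueError("expected a 512-dimensional feature")
    one_hot = [1.0 if i == robot_id else 0.0 for i in range(num_robots)]
    return list(feature) + one_hot + [float(pixel_xy[0]), float(pixel_xy[1])]


def candidate_patch_centers(x, y, offset):
    """Original patch center plus 8 neighbors on a 3x3 grid (assumed layout)."""
    return [(x + dx * offset, y + dy * offset)
            for dy in (-1, 0, 1) for dx in (-1, 0, 1)]


def trainable_modules(epoch):
    """Modules updated in a given 0-indexed epoch of the 30-epoch schedule."""
    if epoch < GPN_ONLY_EPOCHS:
        return ("GPN",)
    return ("GPN", "NMN")


# Hypothetical usage: robot 2 of 5, grasp commanded at pixel (120, 80).
nmn_input = build_nmn_input([0.0] * FEATURE_DIM, robot_id=2, num_robots=5,
                            pixel_xy=(120, 80))
centers = candidate_patch_centers(120, 80, offset=10)
schedule = [trainable_modules(e) for e in range(GPN_ONLY_EPOCHS + JOINT_EPOCHS)]
```

Warming up the GPN on its own gives the grasp predictor a reasonable starting point before the noise model begins to redistribute the labels across neighboring patches.<br />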
<br />
==Results==<br />
<br />
<br />
In the results part of the paper, they show that collecting the dataset in homes is essential for generalizing to unseen environments. They also show that modeling the noise in their Low-Cost Arm (LCA) can improve grasping performance.<br />
They collected data in parallel using multiple robots in 6 different homes, as shown in Figure 3. Because the input data were unstructured, they used an object detector; tiny-YOLO was chosen due to the LCA's limited memory and computational capabilities. They scattered different objects within a 2 m area in each home to prevent the robot from colliding with obstacles, and let the robot move randomly and grasp objects. In the end, they collected a dataset of 28K grasp results.<br />
<br />
[[File:aa3.PNG|600px|thumb|center|]]<br />
<br />
<br />
To evaluate their approach in a more quantitative way, they used three test settings:<br />
<br />
- The first one is binary classification on held-out data. The test set is collected by performing random grasps on objects. They measure binary classification performance by predicting the success or failure of a grasp given a location and an angle. Using binary classification allows many models to be tested without running them on real robots. They collected two held-out datasets using the LCA, one in the lab and one in homes, as well as a held-out dataset for the Baxter robot.<br />
<br />
- The second one is the Real Low-Cost Arm (Real-LCA). Here, they evaluate their model by running it in three unseen homes. They placed 20 new objects in these three homes in different orientations. Since the objects and the environments are completely new, this test measures the generalization of the model.<br />
<br />
- The third one is the Real Sawyer (Real-Sawyer). They evaluate the performance of their model by running it on the Sawyer robot, which is more accurate than the LCA. They tested their model in the lab environment to show that training models on datasets collected in homes can improve performance even in lab environments.<br />
<br />
They used baselines for both their data, which is collected in homes, and their model, which is Robust-Grasp. Two datasets serve as data baselines: the dataset collected by a Baxter robot in the lab (Lab-Baxter) and the dataset collected by their LCA in the lab (Lab-LCA).<br />
They compared their model with the noise-independent patch grasping model (Patch-Grasp). They also compared their data and model with DexNet-3.0 (DexNet) as a strong real-world grasping baseline.<br />
<br />
<br />
<br />
===Experiment 1: Performance on held-out data===<br />
<br />
<br />
Table 1 shows that the models trained on lab data cannot generalize to the Home-LCA environment. However, the model trained on Home-LCA performs well on both the lab data and the home environment.<br />
<br />
[[File:aa4.PNG|600px|thumb|center|]]<br />
<br />
<br />
<br />
===Experiment 2: Performance on Real LCA Robot===<br />
<br />
<br />
In Table 2, the performance of the model trained on Home-LCA is compared against a pre-trained DexNet and the model trained on Lab-Baxter. Training on the Home-LCA dataset performs 43.7% better than training on the Lab-Baxter dataset and 33% better than DexNet. The low performance of DexNet can be explained by noise in the depth images caused by natural light: DexNet requires high-quality depth sensing and therefore cannot perform well in this setting. By contrast, for the model trained on data from the cheap commodity RGBD camera of the LCA, the noise in the depth images is not a matter of concern.<br />
<br />
[[File:aa5.PNG|600px|thumb|center|]]<br />
<br />
===Experiment 3: Performance on Real Sawyer===<br />
<br />
<br />
To compare the performance of the Robust-Grasp model against the Patch-Grasp model, they also used the Lab-Baxter data, which were collected with an accurate robot. The Sawyer robot is used for testing to ensure that the testing robot is different from both training robots. As shown in Table 3, the Robust-Grasp model trained on Home-LCA outperforms the Patch-Grasp model and achieves 77.5% accuracy. Furthermore, the visualizations of predicted noise corrections in Figure 4 show that the corrections depend on both the pixel location of the noisy grasp and the robot.<br />
<br />
<br />
[[File:aa6.PNG|600px|thumb|center|]]<br />
<br />
[[File:aa7.PNG|600px|thumb|center|]]<br />
<br />
<br />
<br />
==Related work==<br />
<br />
<br />
Over the last few years, interest in scaling up robot learning with large-scale datasets has increased, and many papers have been published in this area. A hand-annotated grasping dataset, a self-supervised grasping dataset, and grasping using reinforcement learning are some examples of using large-scale datasets for grasping. Many papers have also worked on other robotic tasks such as material recognition or pushing objects. However, none of these papers worked with real data from real environments like homes; they used high-cost hardware and lab data.<br />
<br />
<br />
Furthermore, since grasping is one of the basic problems of robotics, there have been several efforts to improve it. Classic approaches focused on physics-based aspects of grasping and required 3D models of the objects. More recent works focus on data-driven approaches that learn to grasp objects from visual observations. However, they usually require high-quality depth as input, which is a barrier to the practical use of robots in real environments.<br />
<br />
<br />
Most labs use industrial robots or standard collaborative hardware for their experiments, so little research has used low-cost robots. One example is learning to stack multiple blocks using a cheap, inaccurate robot, although it is not clear whether learning approaches are used in it alongside mapping and planning.<br />
<br />
<br />
Learning from noisy inputs is another challenge, specifically in computer vision. A controversial question often raised in this area is whether learning from noise can improve performance. Some works show that it can hurt performance; others find it valuable when the noise depends on the environment. In this paper, they used a model that can exploit the noise and learn a better grasping model.<br />
<br />
==Conclusion==<br />
<br />
All in all, the paper presents an approach for collecting large-scale robot data in real home environments. They implemented their approach by using a mobile manipulator which is a lot cheaper than the existing industrial robots. They collected a dataset of 28K grasps in six different homes. In order to solve the problem of noisy labels which were caused by their inaccurate robots, they presented a framework to factor out the noise in the data. They tested their model by physically grasping 20 new objects in three new homes and in the lab. The model trained with home dataset showed 43.7% improvement over the models trained with lab data. Their results also showed that their model can improve the grasping performance even in lab environments. They also demonstrated that their architecture for modeling the noise improved the performance by about 10%.<br />
<br />
==Critiques==<br />
<br />
This paper does not make a significant algorithmic contribution; it combines a large number of data engineering techniques for the robot learning problem. The authors claim that they obtained 43.7% higher accuracy than the baseline models, but this does not seem to be a fair comparison, as the data collection for the other methods happened in simulated settings in the lab, whereas the authors use the home dataset. The authors should also have discussed the safety issues of training robots in real environments as opposed to simulated environments like labs. They encourage other researchers to look outside the labs but do not discuss the critical safety issues of this approach.<br />
<br />
<br />
The paper argues that the dataset collected by the LCA is noisy because the robot is cheap and inaccurate. It further asserts that the noise in the dataset can be handled by modeling it as a latent variable, and that the resulting model improves grasping performance. Although learning from noisy data and achieving good performance is valuable, it would be better to test the noise modeling network on other robots as well. Since the network takes robot information as an input, testing it on different inaccurate robots would show whether it generalizes and still performs well.<br />
<br />
==References==<br />
<br />
#Josh Tobin, Rachel Fong, Alex Ray, Jonas Schneider, Wojciech Zaremba, and Pieter Abbeel. "Domain randomization for transferring deep neural networks from simulation to the real world." 2017. URL https://arxiv.org/abs/1703.06907.<br />
#Xue Bin Peng, Marcin Andrychowicz, Wojciech Zaremba, and Pieter Abbeel. "Sim-to-real transfer of robotic control with dynamics randomization." arXiv preprint arXiv:1710.06537, 2017.<br />
#Lerrel Pinto, Marcin Andrychowicz, Peter Welinder, Wojciech Zaremba, and Pieter Abbeel. "Asymmetric actor critic for image-based robot learning." Robotics Science and Systems, 2018.<br />
#Lerrel Pinto and Abhinav Gupta. "Supersizing self-supervision: Learning to grasp from 50k tries and 700 robot hours." CoRR, abs/1509.06825, 2015. URL http://arxiv.org/abs/1509.06825.<br />
#Adithyavairavan Murali, Lerrel Pinto, Dhiraj Gandhi, and Abhinav Gupta. "CASSL: Curriculum accelerated self-supervised learning." International Conference on Robotics and Automation, 2018.<br />
#Sergey Levine, Chelsea Finn, Trevor Darrell, and Pieter Abbeel. "End-to-end training of deep visuomotor policies." The Journal of Machine Learning Research, 17(1):1334–1373, 2016.<br />
#Sergey Levine, Peter Pastor, Alex Krizhevsky, and Deirdre Quillen. "Learning hand-eye coordination for robotic grasping with deep learning and large scale data collection." CoRR, abs/1603.02199, 2016. URL http://arxiv.org/abs/1603.02199.<br />
#Pulkit Agrawal, Ashvin Nair, Pieter Abbeel, Jitendra Malik, and Sergey Levine. "Learning to poke by poking: Experiential learning of intuitive physics." 2016. URL http://arxiv.org/abs/1606.07419.<br />
#Chelsea Finn, Ian Goodfellow, and Sergey Levine. "Unsupervised learning for physical interaction through video prediction." In Advances in neural information processing systems, 2016.<br />
#Ashvin Nair, Dian Chen, Pulkit Agrawal, Phillip Isola, Pieter Abbeel, Jitendra Malik, and Sergey Levine. "Combining self-supervised learning and imitation for vision-based rope manipulation." International Conference on Robotics and Automation, 2017.<br />
#Chen Sun, Abhinav Shrivastava, Saurabh Singh, and Abhinav Gupta. "Revisiting unreasonable effectiveness of data in deep learning era." ICCV, 2017.</div>Mpaflahttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Robot_Learning_in_Homes:_Improving_Generalization_and_Reducing_Dataset_Bias&diff=39770Robot Learning in Homes: Improving Generalization and Reducing Dataset Bias2018-11-18T14:10:14Z<p>Mpafla: /* Introduction */</p>
<hr />
<div>==Introduction==<br />
<br />
<br />
The use of data-driven approaches in robotics has increased over the last decade. Instead of using hand-designed models, these approaches work on large-scale datasets and learn policies that map from high-dimensional observations to actions. Since collecting data with an actual robot in real time is very expensive, most data-driven approaches in robotics collect simulated data from simulators. The concern is whether these approaches are robust enough to domain shift to be used on real-world data, since there is a wide reality gap between simulators and the real world.<br />
<br />
On the other hand, declining hardware costs have made it feasible to collect more data for a variety of tasks, pushing the robotics community to collect real-world physical data. This approach has been quite successful at tasks such as grasping, pushing, poking, and imitation learning. However, the major problem is that the performance of these learning models is not good enough and tends to plateau quickly. Furthermore, robotic action data have not led to gains similar to those seen in other areas such as computer vision and natural language processing. The paper claims that the solution to all of these obstacles is using “real data”: current robotic datasets lack diversity of environment, so learning-based approaches need to move out of simulators and labs and into real environments such as homes, where they can learn from real datasets.<br />
<br />
Collecting real data and working with it poses several challenges. First, cheap and compact robots are needed to collect data in homes, but current industrial robots (e.g., Sawyer and Baxter) are too expensive. Second, cheap robots are not accurate enough to collect reliable data. Third, data collection in homes cannot be supervised at all times. These challenges, together with other external factors, can result in noisy data. This paper presents a first systematic effort to collect a dataset inside people's homes, which has the following parts:<br />
<br />
-A cheap robot which is appropriate for use in homes<br />
<br />
-Collecting training data in 6 different homes and testing data in 3 homes<br />
<br />
-An approach for modeling the noise in the labeled data<br />
<br />
[[File:aa1.PNG|600px|thumb|center|]]<br />
<br />
==Overview==<br />
<br />
<br />
This paper emphasizes the importance of diversifying the data for robot learning in order to achieve better generalization, focusing on the task of grasping. A diverse dataset also allows for removing biases in the data. The paper argues that even for simple tasks like grasping, datasets collected in labs suffer from strong biases such as simple backgrounds and identical environment dynamics. Hence, models learned from them do not generalize and do not work well on real datasets.<br />
<br />
Collecting large-scale data inside a large number of homes requires a low-cost robot. For this reason, they introduced a customized mobile manipulator: a Dobot Magician robotic arm mounted on a Kobuki low-cost mobile base. The resulting robot arm has five degrees of freedom (DOF). They also added an Intel R200 RGBD camera at a height of 1 m above the ground. An onboard laptop with an Intel Core i5 processor performs all the processing. The whole system can run for 1.5 hours on a single charge.<br />
<br />
There is a trade-off, however: a low-cost robot is less accurate to control. Because it is built from cheaper components than expensive setups such as Baxter and Sawyer, it suffers from higher calibration and execution errors. This means that the dataset collected with this approach is diverse and large, but it has noisy labels. To illustrate, suppose the robot wants to grasp at location <math> {(x, y)}</math>. Because of noise in the execution, the robot may perform the action at location <math> {(x + \delta_{x}, y+ \delta_{y})}</math> instead, so the success or failure label of the action would be assigned to the wrong place. To solve this problem, they used an approach that learns from noisy data: they modeled the noise as a latent variable and used two networks, one for predicting the noise and one for predicting the action to execute.<br />
<br />
==Learning on low-cost robot data==<br />
<br />
This paper uses a patch grasping framework in its proposed architecture. As mentioned before, datasets collected by cheap, inaccurate robots tend to have noisy labels. The noise in the labels can be caused by hardware execution error, inaccurate kinematics, camera calibration, proprioception, wear and tear, etc. The different parts of the architecture are explained below:<br />
<br />
<br />
===Grasping Formulation===<br />
<br />
The architecture focuses on planar grasping, which means that all objects are grasped at the same height and perpendicular to the ground. The final goal is to find <math>{(x, y, \theta)}</math> given an observation <math> {I}</math> of the object, where <math> {x}</math> and <math> {y}</math> are the translational degrees of freedom and <math> {\theta}</math> is the rotational degree of freedom. For the purpose of comparison, they used a model that does not predict <math>{(x, y, \theta)}</math> directly from the image <math> {I}</math>; instead, it samples several smaller patches <math> {I_{P}}</math> at different locations <math>{(x, y)}</math> and predicts the grasp angle <math> {\theta}</math> from these patches. To allow multimodal predictions, the angle <math> {\theta}</math> is discretized into steps <math> {\theta_{D}}</math>.<br />
<br />
Hence, each datapoint consists of an image <math> {I}</math>, the executed grasp <math>{(x, y, \theta)}</math>, and the grasp success/failure label <math> g </math>. The image <math> {I}</math> and the angle <math> {\theta}</math> are converted to an image patch <math> {I_{P}}</math> and a discretized angle <math> {\theta_{D}}</math>. A binary cross-entropy loss is then used to minimize the classification error.<br />
<br />
<br />
===Modeling noise as latent variable===<br />
<br />
To tackle the problem of inaccurate position control, they looked for structure in the noise, which depends on the robot and its design. They modeled this structure as a latent variable, as shown in Figure 2:<br />
<br />
<br />
[[File:aa2.PNG|600px|thumb|center|]]<br />
<br />
<br />
<br />
The grasp success probability for image patch <math> {I_{P}}</math> at angle <math> {\theta_{D}}</math> is represented as <math> {P(g \mid I_{P},\theta_{D}, R )}</math>, where <math> {R}</math> represents environment variables that can add noise to the system.<br />
<br />
The conditional probability of grasping for this model is computed by:<br />
<br />
<br />
\[ { P(g \mid I_{P},\theta_{D}, R ) = \sum_{\hat{I}_{P} \in P} P(g \mid z=\hat{I}_{P},\theta_{D},R ) \cdot P(z=\hat{I}_{P} \mid \theta_{D},I_{P} ,R ) } \]<br />
<br />
<br />
<br />
<br />
Here, <math> {z}</math> models the latent variable of the actual patch executed, and <math> {\hat{I}_{P}}</math> belongs to a set of possible neighboring patches <math> {P}</math>. <math> {P(z=\hat{I}_{P} \mid \theta_{D},I_{P} ,R )}</math> models the noise caused by the variables <math> {R}</math> and is implemented as the Noise Modelling Network (NMN). <math> {P(g \mid z=\hat{I}_{P},\theta_{D},R )}</math> is the grasp prediction probability given the true patch and is implemented as the Grasp Prediction Network (GPN). The overall Robust-Grasp model is computed by marginalizing the GPN output over the NMN distribution.<br />
<br />
===Learning the latent noise model===<br />
<br />
<br />
They assume that <math> {z}</math> is conditionally independent of the local patch-specific variables <math> {(I_{P}, \theta_{D})}</math>. To estimate the latent variable <math> {z}</math>, they used direct optimization to learn both NMN and GPN with noisy labels. The entire image of the scene and the environment information are the inputs of the NMN. The output of the NMN is the probability distribution of the actual patches where the grasps are executed. Finally, a binary cross entropy loss is applied to the marginalized output of these two networks and the true grasp label g.<br />
<br />
===Training details===<br />
<br />
<br />
They implemented their model in PyTorch using a pretrained ResNet-18 model. For the NMN, they concatenated the 512-dimensional ResNet feature with a one-hot vector of the robot ID and the raw pixel location of the grasp. The inputs of the GPN are the original noisy patch plus 8 other patches equidistant from the original one.<br />
Training starts with only the GPN for 5 epochs of the data. Then the NMN and the marginalization operator are added to the model, and the NMN and GPN are trained simultaneously for another 25 epochs.<br />
<br />
==Results==<br />
<br />
<br />
In the results part of the paper, they show that collecting the dataset in homes is essential for generalizing to unseen environments. They also show that modeling the noise in their Low-Cost Arm (LCA) can improve grasping performance.<br />
They collected data in parallel using multiple robots in 6 different homes, as shown in Figure 3. They used an object detector (tiny-YOLO) because the input data were unstructured and the LCA has limited memory and computational capabilities. They scattered different objects in homes within a 2m area to prevent the robot from colliding with obstacles, and let the robot move randomly and grasp objects. Finally, they collected a dataset with 28K grasp results.<br />
<br />
[[File:aa3.PNG|600px|thumb|center|]]<br />
<br />
<br />
To evaluate their approach in a more quantitative way, they used three test settings:<br />
<br />
- The first one is binary classification on held-out data. The test set is collected by performing random grasps on objects. They measure the performance of binary classification by predicting the success or failure of grasping, given a location and an angle. Using binary classification allows many models to be tested without running them on real robots. They collected two held-out datasets using the LCA in the lab and in homes, in addition to the dataset for the Baxter robot.<br />
<br />
- The second one is Real Low-Cost Arm (Real-LCA). Here, they evaluate their model by running it in three unseen homes. They put 20 new objects in these three homes in different orientations. Since the objects and the environments are completely new, this test measures the generalization of the model.<br />
<br />
- The third one is Real Sawyer (Real-Sawyer). They evaluate the performance of their model by running it on the Sawyer robot, which is more accurate than the LCA. They tested their model in the lab environment to show that training models with the datasets collected from homes can improve the performance of models even in lab environments.<br />
<br />
They used baselines for both their data (collected in homes) and their model (Robust-Grasp). Two datasets serve as data baselines: the dataset collected with the Baxter robot (Lab-Baxter) and the dataset collected by their LCA in the lab (Lab-LCA).<br />
They compared their model with the noise-independent patch grasping model (Patch-Grasp). They also compared their data and model with DexNet-3.0 (DexNet) as a strong real-world grasping baseline.<br />
<br />
<br />
<br />
===Experiment 1: Performance on held-out data===<br />
<br />
<br />
Table 1 shows that the models trained on lab data cannot generalize to the Home-LCA environment. However, the model trained on Home-LCA performs well on both the lab data and the home environment.<br />
<br />
[[File:aa4.PNG|600px|thumb|center|]]<br />
<br />
<br />
<br />
===Experiment 2: Performance on Real LCA Robot===<br />
<br />
<br />
In Table 2, the performance of the Home-LCA is compared against a pre-trained DexNet and the model trained on the Lab-Baxter. Training on the Home-LCA dataset performs 43.7% better than training on the Lab-Baxter dataset and 33% better than DexNet. The low performance of DexNet can be explained by noise in the depth images caused by natural light: DexNet requires high-quality depth sensing and therefore cannot perform well under these conditions. With the cheap commodity RGBD cameras used in the LCA, the noise in the depth images is not a matter of concern.<br />
<br />
[[File:aa5.PNG|600px|thumb|center|]]<br />
<br />
===Experiment 3: Performance on Real Sawyer===<br />
<br />
<br />
To compare the performance of the Robust-Grasp model against the Patch-Grasp model, they used Lab-Baxter, which is an accurate robot. The Sawyer robot is used for testing to ensure that the testing robot is different from both training robots. As shown in Table 3, the Robust-Grasp model trained on Home-LCA outperforms the Patch-Grasp model and achieves 77.5% accuracy. Furthermore, the visualizations of predicted noise corrections in Figure 4 show that the corrections depend on both the pixel locations of the noisy grasp and the robot.<br />
<br />
<br />
[[File:aa6.PNG|600px|thumb|center|]]<br />
<br />
[[File:aa7.PNG|600px|thumb|center|]]<br />
<br />
<br />
<br />
==Related work==<br />
<br />
<br />
Over the last few years, interest in scaling up robot learning with large-scale datasets has increased, and many papers have been published in this area. A hand-annotated grasping dataset, a self-supervised grasping dataset, and grasping using reinforcement learning are some examples of using large-scale datasets for grasping. Many papers have also addressed other robotic tasks such as material recognition or pushing objects. However, none of these papers worked on real data in real environments like homes; they used high-cost hardware and lab data.<br />
<br />
<br />
Furthermore, since grasping is one of the basic problems of robotics, there have been several efforts to improve it. Classic approaches focused on physics-based issues of grasping and required 3D models of the objects. Recent works, however, have focused on data-driven approaches that learn to grasp objects from visual observations. These approaches usually require high-quality depth as input, which is a barrier to the practical use of robots in real environments.<br />
<br />
<br />
Most labs use industrial robots or standard collaborative hardware for their experiments. Therefore, little research has used low-cost robots. One example is learning to stack multiple blocks using a cheap, inaccurate robot, although it is not clear whether learning approaches are used in it alongside mapping and planning.<br />
<br />
<br />
Learning from noisy inputs is another challenge, specifically in computer vision. A controversial question often raised in this area is whether learning from noise can improve performance. Some works show that it can hurt performance; others find it valuable when the noise depends on the environment. In this paper, they used a model that can exploit the noise and learn a better grasping model.<br />
<br />
==Conclusion==<br />
<br />
All in all, the paper presents an approach for collecting large-scale robot data in real home environments. They implemented their approach using a mobile manipulator that is much cheaper than existing industrial robots. They collected a dataset of 28K grasps in six different homes. To solve the problem of noisy labels caused by their inaccurate robots, they presented a framework to factor out the noise in the data. They tested their model by physically grasping 20 new objects in three new homes and in the lab. The model trained with the home dataset showed a 43.7% improvement over the models trained with lab data. Their results also showed that their model can improve grasping performance even in lab environments. They also demonstrated that their architecture for modeling the noise improved the performance by about 10%.<br />
<br />
==Critiques==<br />
<br />
This paper is not a significant algorithmic contribution; it combines a large number of data engineering techniques for the robot learning problem. The authors claim that they have obtained 43.7% more accuracy than baseline models, but this does not seem to be a fair comparison, since the data collection for the other methods happened in simulated settings in the lab, whereas the authors use the home dataset. The authors should also have discussed the safety issues of training robots in real environments as opposed to simulated environments like labs. They encourage other researchers to look outside the labs, but do not discuss the critical safety issues of this approach.<br />
<br />
<br />
The paper argues that the dataset collected by the LCA is noisy, since the robot is cheap and inaccurate. It further asserts that the noise in the dataset can be handled by modeling it as a latent variable, and that this model improves grasping performance. Although learning from noisy data and achieving good performance is valuable, it would be better to test the noise modelling network on other robots as well. Since the network takes robot information as an input, testing it on different inaccurate robots would show whether it generalizes and performs well.<br />
<br />
==References==<br />
<br />
#Josh Tobin, Rachel Fong, Alex Ray, Jonas Schneider, Wojciech Zaremba, and Pieter Abbeel. "Domain randomization for transferring deep neural networks from simulation to the real world." 2017. URL https://arxiv.org/abs/1703.06907.<br />
#Xue Bin Peng, Marcin Andrychowicz, Wojciech Zaremba, and Pieter Abbeel. "Sim-to-real transfer of robotic control with dynamics randomization." arXiv preprint arXiv:1710.06537, 2017.<br />
#Lerrel Pinto, Marcin Andrychowicz, Peter Welinder, Wojciech Zaremba, and Pieter Abbeel. "Asymmetric actor critic for image-based robot learning." Robotics Science and Systems, 2018.<br />
#Lerrel Pinto and Abhinav Gupta. "Supersizing self-supervision: Learning to grasp from 50k tries and 700 robot hours." CoRR, abs/1509.06825, 2015. URL http://arxiv.org/abs/1509.06825.<br />
#Adithyavairavan Murali, Lerrel Pinto, Dhiraj Gandhi, and Abhinav Gupta. "CASSL: Curriculum accelerated self-supervised learning." International Conference on Robotics and Automation, 2018.<br />
#Sergey Levine, Chelsea Finn, Trevor Darrell, and Pieter Abbeel. "End-to-end training of deep visuomotor policies." The Journal of Machine Learning Research, 17(1):1334–1373, 2016.<br />
#Sergey Levine, Peter Pastor, Alex Krizhevsky, and Deirdre Quillen. "Learning hand-eye coordination for robotic grasping with deep learning and large scale data collection." CoRR, abs/1603.02199, 2016. URL http://arxiv.org/abs/1603.02199.<br />
#Pulkit Agrawal, Ashvin Nair, Pieter Abbeel, Jitendra Malik, and Sergey Levine. "Learning to poke by poking: Experiential learning of intuitive physics." 2016. URL http://arxiv.org/abs/1606.07419<br />
#Chelsea Finn, Ian Goodfellow, and Sergey Levine. "Unsupervised learning for physical interaction through video prediction." In Advances in neural information processing systems, 2016.<br />
#Ashvin Nair, Dian Chen, Pulkit Agrawal, Phillip Isola, Pieter Abbeel, Jitendra Malik, and Sergey Levine. "Combining self-supervised learning and imitation for vision-based rope manipulation." International Conference on Robotics and Automation, 2017.<br />
#Chen Sun, Abhinav Shrivastava, Saurabh Singh, and Abhinav Gupta. "Revisiting unreasonable effectiveness of data in deep learning era." ICCV, 2017.</div>Mpaflahttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Searching_For_Efficient_Multi_Scale_Architectures_For_Dense_Image_Prediction&diff=38747Searching For Efficient Multi Scale Architectures For Dense Image Prediction2018-11-13T01:53:27Z<p>Mpafla: /* What they used in this paper */</p>
<hr />
<div><br />
[Need add more pics and references]<br />
=Introduction=<br />
<br />
The design of neural network architectures is an important component of the success of machine learning and data science projects. In recent years, the field of Neural Architecture Search (NAS) has emerged: the study of automatically finding an optimal neural architecture for a given task within a well-defined architecture space. The resulting architectures have often outperformed networks designed by human experts in tasks such as image classification and natural language processing.[2,3,4] <br />
<br />
Goal:<br />
This paper presents a method for letting computers search for a neural architecture that performs well on the task of dense image segmentation.<br />
<br />
=Motivation=<br />
<br />
The success of deep neural networks (DNNs) is largely due to the fact that they greatly reduce the work of feature engineering, since a DNN can automatically extract useful features from the raw input. However, this created a new type of engineering work: network engineering. To extract features successfully, one needs a suitable network architecture. In effect, the engineering work has shifted from feature engineering to designing the network so that it can better abstract useful features.<br />
<br />
The motivation for NAS is that there is no guiding theory on how to design the optimal network architecture. Given abundant computational resources, one intuitive solution is to define a finite search space and let the computers do the dirty work of searching for structures and hyperparameters.<br />
<br />
=Related Work =<br />
<br />
This paper focuses on two main research topics in the literature. One is neural architecture search (NAS) and the other is multi-scale representation for dense image prediction. Neural architecture search trains a controller network to generate neural architectures. The following are the important research directions in this area: <br />
<br />
1) One kind of research transfers architectures learned on a proxy dataset to more challenging datasets and demonstrates superior performance over many human-invented architectures. <br />
<br />
2) Reinforcement learning, evolutionary algorithms and sequential model-based optimization have been used to learn network structures. <br />
<br />
3) Some other works focus on increasing model size, sharing model weights to accelerate model search, or a continuous relaxation of the architecture representation. <br />
<br />
4) Some recent methods focus on proposing methods for embedding an exponentially large number of architectures in a grid arrangement for semantic segmentation tasks. <br />
<br />
In the area of multi-scale representation for dense image prediction the following are useful prior work: <br />
<br />
1) State of the art methods use Convolutional Neural Nets. There are different methods proposed for supplying global features and context information to perform pixel level classification. <br />
<br />
2) Some approaches focus on how to efficiently encode multi-scale context information in a network architecture, for example by designing models that take an image pyramid as input so that large-scale objects are captured by the downsampled image. <br />
<br />
3) Research has also explored how best to tune the architecture to extract context information. Some works focus on sampling rates in atrous convolution to encode multi-scale context. Others build a context module by gradually increasing the rate on top of belief maps. <br />
<br />
<br />
<br />
<br />
=NAS Overview=<br />
<br />
NAS essentially turns a design problem into a search problem. As a search problem in general, we need a clear definition of three things:<br />
<ol><br />
<li> Search space</li><br />
<li> Search strategy</li><br />
<li> Performance Estimation Strategy</li><br />
</ol> <br />
[5]<br />
<br />
<br />
The search space is intuitive: it is the hyperparameter space in which we look for our optimal solution. In the field of NAS, the search space depends heavily on the assumptions we make about the neural architecture. The search strategy details how to explore the search space. The performance estimation strategy specifies how to evaluate a model once we have found a set of hyperparameters. In the field of NAS, the goal is typically to find architectures that achieve high predictive performance on unseen data. [5]<br />
<br />
We will take a deep dive into the above three dimensions of NAS in the following sections.<br />
<br />
=Search Space=<br />
There are typically three ways of defining the search space.<br />
==Chain-structured neural networks ==<br />
<br />
<div class="center" style="width: auto; margin-left: auto; margin-right: auto;">[[File:Screen_Shot_2018-11-10_at_6.03.00_PM.png|100px]]<br />
</div><br />
[5]<br />
The chain-structured network can be viewed as a sequence of n layers, where layer <math> i</math> receives its input from layer <math> i-1</math> and its output serves as the input to layer <math> i+1</math>.<br />
<br />
The search space is then parametrized by:<br />
1) Number of layers n<br />
2) The type of operation that can be executed at each layer<br />
3) Hyperparameters associated with each layer<br />
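<br />
As a rough illustration of these three parameters, the sketch below samples one architecture from a toy chain-structured search space. The layer types and hyperparameter values in <code>LAYER_TYPES</code> are hypothetical and are not taken from the survey [5] or from the paper.<br />

```python
import random

# Hypothetical search space: operation types and the hyperparameters of each type.
LAYER_TYPES = {
    "conv": {"kernel_size": [1, 3, 5], "filters": [16, 32, 64]},
    "pool": {"pool_size": [2, 3]},
    "dense": {"units": [64, 128]},
}

def sample_chain(rng, max_layers=6):
    """Sample a chain: the number of layers, then a type and hyperparameters per layer."""
    num_layers = rng.randint(1, max_layers)            # 1) number of layers n
    layers = []
    for _ in range(num_layers):
        kind = rng.choice(sorted(LAYER_TYPES))         # 2) operation type of the layer
        hparams = {name: rng.choice(values)            # 3) hyperparameters of the layer
                   for name, values in LAYER_TYPES[kind].items()}
        layers.append((kind, hparams))
    return layers

arch = sample_chain(random.Random(0))
```

Each sampled architecture is an ordered list, so layer <code>i</code> implicitly feeds layer <code>i+1</code>, which is exactly the chain assumption.<br />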
<br />
==Multi-branch networks ==<br />
<br />
<div class="center" style="width: auto; margin-left: auto; margin-right: auto;">[[File:Screen Shot 2018-11-10 at 6.03.08 PM.png|200px]]</div><br />
<br />
[5]<br />
This architecture allows significantly more degrees of freedom. It allows shortcuts and parallel branches. Some of the ideas are inspired by hand-crafted networks. For example, shortcuts from shallow layers directly to deep layers come from networks like ResNet. [6]<br />
<br />
The search space includes the search space of chain-structured networks, with additional freedom of adding shortcut connections and allowing parallel branches to exist.<br />
<br />
==Cell/Block ==<br />
<div class="center" style="width: auto; margin-left: auto; margin-right: auto;">[[File:Screen Shot 2018-11-10 at 6.03.31 PM.png|300px]]</div><br />
<br />
[6]<br />
This architecture defines a cell that is used as the building block of the neural network. A good analogy is to think of a cell as a Lego piece: you can define different types of cells as different Lego pieces and then combine them to form a new neural structure. <br />
<br />
<br />
The search space includes the internal structure of the cell and how to combine these blocks to form the resulting architecture.<br />
<br />
==What they used in this paper ==<br />
<div class="center" style="width: auto; margin-left: auto; margin-right: auto;">[[File:Screen Shot 2018-11-10 at 6.50.04 PM.png|500px]]<br />
</div><br />
[1]<br />
This paper's approach is very close to the Cell/Block approach above.<br />
<br />
The paper defines two components: the "network backbone" and a cell unit called the "DPC", which is represented by a directed acyclic graph (DAG) with five branches (the number the authors chose). The network backbone's job is to take the input image as a tensor and return a feature map f that is supposedly a good abstraction of the image. The DPC, short for Dense Prediction Cell, is what they introduced in this paper. In theory, the search space consists of the choice of network backbone and the internal structure of the DPC. In practice, they just used MobileNet and the modified Xception net as the backbone, so the search space only consists of the internal structure of the DPC.<br />
<br />
For the network backbone, they simply choose from existing mature architectures, such as MobileNet-v2 and Inception-Net. For the structure of the DPC, they define a smaller unit called a branch. A branch is a triple (Xi, OP, Yi), where Xi is an input tensor, OP is the operation applied to the tensor, and Yi is the resulting output tensor. <br />
<br />
In the paper, each DPC consists of 5 branches, to balance expressivity and computational tractability.<br />
<br />
The operator space, OP, is defined as the following set of functions:<br />
<ol><br />
<li>Convolution with a 1 × 1 kernel.</li><br />
<li>3×3 atrous separable convolution with rate rh×rw, where rh and rw ∈ {1, 3, 6, 9, . . . , 21}. </li><br />
<li>Average spatial pyramid pooling with grid size gh × gw, where gh and gw ∈ {1, 2, 4, 8}. </li><br />
</ol><br />
<br />
<br />
The operator space contains 1 + 8×8 + 4×4 = 81 functions, so the i-th branch has i × 81 possible options. Therefore, for B = 5, the search space size is B! × 81^B ≈ 4.2 × 10^11 configurations.<br />
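<br />
The count can be checked with a few lines of Python. This is only a worked version of the arithmetic above; the constant and function names are made up for illustration. The factor i for the i-th branch is presumably the number of possible input tensors (the backbone features or the output of any earlier branch).<br />

```python
import math

NUM_ATROUS_RATES = 8  # rates 1, 3, 6, ..., 21 for each of r_h and r_w
NUM_POOL_GRIDS = 4    # grid sizes 1, 2, 4, 8 for each of g_h and g_w

def num_operations():
    """One 1x1 convolution, 8x8 atrous convolutions, 4x4 pooling grids."""
    return 1 + NUM_ATROUS_RATES ** 2 + NUM_POOL_GRIDS ** 2

def search_space_size(num_branches):
    """Product over branches i = 1..B of (i * number of operations)."""
    size = 1
    for i in range(1, num_branches + 1):
        size *= i * num_operations()
    return size

size = search_space_size(5)  # equals 5! * 81**5
```

For B = 5 this gives 418,414,128,120 configurations, which matches the rounded figure of 4.2 × 10^11.<br />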
<br />
=Search Strategy=<br />
<br />
<div class="center" style="width: auto; margin-left: auto; margin-right: auto;">[[File:search_strategy.png|500px]]<br />
</div><br />
<br />
There are some common search strategies used in the field of NAS, such as reinforcement learning, random search, evolutionary algorithms, and grid search.<br />
<br />
The one they used in the paper is random search. It samples points from the search space uniformly at random, and also samples some points close to the best point observed so far. Intuitively this makes sense because it combines exploration and exploitation: sampling points close to the current best point is exploitation, and sampling points at random is exploration.<br />
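<br />
A minimal sketch of this explore/exploit loop over the DPC search space is given below. It is not Google Vizier's algorithm: the 50/50 split (<code>explore_prob</code>), the single-branch mutation, and the toy scoring function are all assumptions made for illustration, and in the paper the score would be the proxy-task mIOU.<br />

```python
import random

ATROUS_RATES = [1, 3, 6, 9, 12, 15, 18, 21]
POOL_GRIDS = [1, 2, 4, 8]
# The 81 operators: one 1x1 conv, 64 atrous convs, 16 pooling grids.
OPS = ([("conv1x1",)]
       + [("atrous", rh, rw) for rh in ATROUS_RATES for rw in ATROUS_RATES]
       + [("pool", gh, gw) for gh in POOL_GRIDS for gw in POOL_GRIDS])

def random_branch(rng, index):
    # Input -1 means the backbone features; otherwise an earlier branch's output.
    return (rng.randrange(-1, index), rng.choice(OPS))

def random_cell(rng, num_branches=5):
    return [random_branch(rng, i) for i in range(num_branches)]

def mutate(cell, rng):
    """Exploitation: resample one branch of the best cell seen so far."""
    child = list(cell)
    i = rng.randrange(len(child))
    child[i] = random_branch(rng, i)
    return child

def random_search(score_fn, num_trials, explore_prob=0.5, seed=0):
    rng = random.Random(seed)
    best_cell, best_score = None, float("-inf")
    for _ in range(num_trials):
        if best_cell is None or rng.random() < explore_prob:
            candidate = random_cell(rng)         # exploration
        else:
            candidate = mutate(best_cell, rng)   # exploitation
        score = score_fn(candidate)
        if score > best_score:
            best_cell, best_score = candidate, score
    return best_cell, best_score

def toy_score(cell):
    # Hypothetical stand-in for proxy-task mIOU: count atrous branches.
    return sum(1 for _, op in cell if op[0] == "atrous")

best_cell, best_score = random_search(toy_score, num_trials=200)
```

Swapping <code>toy_score</code> for a function that trains and evaluates a cell on the proxy task would turn this into an actual, if naive, architecture search.<br />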
<br />
<br />
They cite another paper claiming that random search is competitive with reinforcement learning and other learning techniques. [7] <br />
In the implementation, they used Google Vizier, Google's black-box optimization tool. It is not open source, but there is an open-source implementation of it. [8]<br />
<br />
=Performance Evaluation Strategy=<br />
<br />
Evaluation in this particular task is tricky because the object being evaluated is a neural network, which must be trained before it can be evaluated. Since the task is pixel-level classification on high-resolution images, the naive approach would require a tremendous amount of computational resources. <br />
<br />
The paper solves this by defining a proxy task: a task that requires far fewer computational resources while still giving a good estimate of the performance of the network. In most image classification tasks of NAS, the proxy task is to train the network on images of lower resolution. The assumption is that if the network performs well on lower-resolution images, it should perform reasonably well on higher-resolution images.<br />
<br />
However, the above approach does not work in this case, because dense prediction tasks innately require high-resolution images as training data. The approach used in the paper is the following:<br />
<ol><br />
<li> Use a smaller backbone for the proxy task</li><br />
<li> Cache the feature maps produced by the network backbone on the training set and build a single DPC directly on top of them </li><br />
<li> Stop early: train for 30k iterations with a batch size of 8</li><br />
</ol><br />
<br />
Training with the large-scale backbone without fixing its weights would take about one week per network on a P100 GPU; the proxy task cuts this down to 90 minutes. They then rank the evaluated architectures, choose the top 50, and do a full evaluation on them.<br />
<br />
The evaluation metric they used is mIOU, the mean pixel-level intersection over union. For each class, it is the area of the intersection of the ground truth and the prediction divided by the area of their union, and these ratios are then averaged over classes.<br />
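<br />
The metric can be sketched on flattened label maps as follows. This is a generic implementation of mean intersection over union rather than the evaluation code used in the paper; the tiny label lists are made up, and classes that appear in neither map are skipped.<br />

```python
def mean_iou(truth, pred, num_classes):
    """Mean over classes of |truth ∩ pred| / |truth ∪ pred| on per-pixel labels."""
    intersection = [0] * num_classes
    union = [0] * num_classes
    for t, p in zip(truth, pred):
        if t == p:
            intersection[t] += 1
            union[t] += 1
        else:
            # The pixel belongs to the union of both the true and the predicted class.
            union[t] += 1
            union[p] += 1
    ious = [i / u for i, u in zip(intersection, union) if u > 0]
    if not ious:
        raise ValueError("no class present in either label map")
    return sum(ious) / len(ious)

truth = [0, 0, 1, 1, 2, 2]  # ground-truth class of each pixel
pred = [0, 1, 1, 1, 2, 0]   # predicted class of each pixel
score = mean_iou(truth, pred, num_classes=3)
```

Here the per-class ratios are 1/3, 2/3 and 1/2, so the mean is 0.5.<br />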
<br />
=Result=<br />
<br />
This method achieves state-of-the-art performance on many datasets. The following table quantifies the performance gain on several of them.<br />
<br />
<div class="center" style="width: auto; margin-left: auto; margin-right: auto;">[[File:Screen Shot 2018-11-10 at 6.51.14 PM.png| 400px]]<br />
</div><br />
They chose the modified Xception network as the backbone; the resulting architecture for the DPC is shown below.<br />
<br />
<div class="center" style="width: auto; margin-left: auto; margin-right: auto;">[[File:Screen Shot 2018-11-12 at 12.32.05 PM.png|400px]]<br />
</div><br />
<br />
As we can see, the searched DPC model achieves better performance (measured by mIOU) with less than half of the computational resources (parameters) and 37% fewer operations (adds and multiplies).<br />
<br />
=Future work=<br />
The authors suggest that increasing the number of branches in the DPC might yield further gains on the image segmentation task. However, random search in an exponentially growing space may become more challenging, so a more intelligent search strategy may be needed.<br />
<br />
=Critique=<br />
<br />
1. Rich man's game<br />
<br />
The technique described in the paper can only be applied by parties with abundant computational resources, such as Google, Facebook, and Microsoft. For small research groups and companies, this method is not that useful due to the lack of computational power. Future improvements will be needed to design an even more efficient proxy task that can tell whether a network will perform well while requiring fewer computations. <br />
<br />
2. Benefit/Cost ratio<br />
<br />
The technique does outperform human-designed networks in many cases, but the gain is not huge. On the Cityscapes dataset, the performance gain is 0.7%; on the PASCAL-Person-Part dataset, the gain is 3.7%; and on the PASCAL VOC 2012 dataset, it does not outperform human experts (all measured by mIOU). Even though pushing the state of the art is always worth celebrating, one could argue that after spending so many resources on the search, the computer should achieve superhuman performance (like a chess engine versus a chess grandmaster). In practice, one may simply go with the current state-of-the-art model to avoid the expensive search cost.<br />
<br />
3. Still Heavily influenced by Human Bias<br />
<br />
When we define the search space, we introduce human bias. Firstly, the network backbone is chosen from previous mature architectures, which may not actually be optimal. Secondly, the internal branches in the DPC also consist of layers whose operations are defined by humans, based on previous experience. That also prevents the search algorithm from finding something revolutionary.<br />
<br />
4. May have the potential to take away entry-level data science jobs.<br />
<br />
If there is a significant reduction in the search cost, it will be more cost effective to apply NAS rather than hire data scientists. Once matured, this technology will have the potential to take away entry-level data science jobs and make data science jobs only possessed by high-level researchers. <br />
<br />
There are some real-world applications that already deploy NAS techniques in production. Two good examples are Google AutoML and Microsoft Custom Vision AI.<br />
[9, 10]<br />
<br />
=References=<br />
1. Searching For Efficient Multi-Scale Architectures For Dense Image Prediction, [[https://arxiv.org/abs/1809.04184]].<br />
<br />
2. E. Real, A. Aggarwal, Y. Huang, and Q. V. Le. Regularized evolution for image classifier architecture search. arXiv:1802.01548, 2018.<br />
<br />
3. C. Liu, B. Zoph, M. Neumann, J. Shlens, W. Hua, L.-J. Li, L. Fei-Fei, A. Yuille, J. Huang, and K. Murphy. Progressive neural architecture search. In ECCV, 2018.<br />
<br />
4. B. Zoph, V. Vasudevan, J. Shlens, and Q. V. Le. Learning transferable architectures for scalable image recognition. In CVPR, 2018.<br />
<br />
5. Neural Architecture Search: A Survey [[https://arxiv.org/abs/1808.05377]]<br />
<br />
6. Deep Residual Learning for Image Recognition [[https://arxiv.org/pdf/1512.03385.pdf]]<br />
<br />
7. J. Long, E. Shelhamer, and T. Darrell. Fully convolutional networks for semantic segmentation. In CVPR, 2015.<br />
For the implementation, they used Google Vizier, a tool for black-box optimization. [D. Golovin, B. Solnik, S. Moitra, G. Kochanski, J. Karro, and D. Sculley. Google Vizier: A service for black-box optimization. In SIGKDD, 2017.]<br />
<br />
8. GitHub implementation of Google Vizier, a black-box optimization tool [https://github.com/tobegit3hub/advisor]<br />
<br />
9. AutoML: https://cloud.google.com/automl/ <br />
<br />
10. Custom-vision: https://azure.microsoft.com/en-us/services/cognitive-services/custom-vision-service/</div>Mpaflahttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Annotating_Object_Instances_with_a_Polygon_RNN&diff=38298Annotating Object Instances with a Polygon RNN2018-11-08T04:34:11Z<p>Mpafla: /* Architecture */</p>
<hr />
<div>Summary of the CVPR '17 best [https://www.cs.utoronto.ca/~fidler/papers/paper_polyrnn.pdf ''paper'']<br />
<br />
The presentation video of the paper is available here[https://www.youtube.com/watch?v=S1UUR4FlJ84].<br />
<br />
= Background =<br />
<br />
If a snapshot of an image is given to a human, how will he/she describe a scene? He/she might identify that there is a car parked near the curb, or that the car is parked right beside a street light. This ability to decompose objects in scenes into separate entities is key to understanding what is around us and it helps to reason about the behavior of objects in the scene.<br />
<br />
Automating this process is a classic computer vision problem and is often termed "object detection". There are four distinct levels of detection (refer to Figure 1 for a visual cue):<br />
<br />
1. Classification + Localization: This is the most basic method that detects whether '''an''' object is either present or absent in the image and then identifies the position of the object within the image in the form of a bounding box overlayed on the image.<br />
<br />
2. Object Detection: The classic definition of object detection points to the detection and localization of '''multiple''' objects of interest in the image. The output of the detection is still a bounding box overlayed on the image at the position corresponding to the location of the objects in the image.<br />
<br />
3. Semantic Segmentation: This is a pixel level approach, i.e., each pixel in the image is assigned to a category label. Here, there is no difference between instances; this is to say that there are objects present from three distinct categories in the image, without tracking or reporting the number of appearances of each instance within a category. <br />
<br />
4. Instance Segmentation (''This paper performs this''): The goal is not only to assign pixel-level categorical labels, but also to identify each entity separately as sheep 1, sheep 2, sheep 3, grass, and so on.<br />
<br />
[[File:Figure_1.jpeg | 450px|thumb|center|Figure 1: Different levels of detection in an image.]]<br />
<br />
<br />
== Motivation ==<br />
<br />
Semantic segmentation helps us achieve a deeper understanding of images than image classification or object detection. Over and above this, instance segmentation is crucial in applications where multiple objects of the same category are to be tracked, especially in autonomous driving, mobile robotics, and medical image processing. This paper deals with a novel method to tackle the instance segmentation problem pertaining specifically to the field of autonomous driving, but shown to generalize well in other fields such as medical image processing.<br />
<br />
== Goal ==<br />
<br />
Most of the recent approaches to instance segmentation are based on deep neural networks and have demonstrated impressive performance. Because these approaches require a lot of computational resources and their performance depends on the amount of available training data, demand for labeled/annotated large-scale datasets has increased. Producing such labels is both expensive and time-consuming. <br />
<br />
{| class=wikitable width=700 align=center<br />
|Thus, the '''main goal''' of the paper is to enable '''semi-automatic''' annotation of object instances.<br />
|}<br />
<br />
Figure 2 shows what the annotation interface looks like.<br />
<br />
Most of the datasets available pass through a stage where annotators manually outline the objects with a closed polygon. Polygons allow annotation of objects with a small number of clicks (30 - 40) compared to other methods. This approach works as the silhouette of an object is typically connected without holes. <br />
<br />
{| class=wikitable width=900 align=center<br />
|Thus, the authors propose adopting this same technique of annotating images with polygons, except that they automate the method to replace or reduce manual labeling. The '''intuition''' behind the success of this method is the '''sparse''' nature of these polygons, which allows an object to be annotated through a cluster of pixels rather than through classification at the pixel level.<br />
|}<br />
<br />
[[File:Annotating Object Instances Example.png | 450px|thumb|center|Figure 2: Given a bounding box, a polygon outlining the object instance inside the box is predicted. This approach is designed to facilitate annotation, and it easily incorporates user corrections of points to improve the overall polygon of the object. ]]<br />
<br />
<br />
= Related Works =<br />
<br />
Some of the techniques used in semi-automatic annotation are as follows:<br />
<br />
1. '''GrabCut''': Some researchers use multiple scribbles from users to aid the model in defining the foreground and background. <br />
<br />
[[File:GrabCut_Example.png | 450px|thumb|center|Figure 3: Illustration of GrabCut.]]<br />
<br />
2. '''GrabCut + CNN''': Scribbles have also been used to train CNNs for semantic image segmentation. <br />
<br />
3. '''Superpixels''': Superpixels in the form of small polygons where the color intensity within each superpixel is similar, to a certain threshold, have been used to provide a sparse representation of the large number of pixels in an image. However, the performance of this technique depends on the scale of the superpixels and hence sometimes merges small objects.<br />
<br />
[[File:Superpixel_idea.jpg | 450px|thumb|center|Figure 4: Illustration of the superpixel idea.]] <br />
<br />
<br />
= Model =<br />
<br />
As an '''input''' to the model, an annotator or perhaps another neural network provides a bounding box containing an object of interest and the model auto-generates a polygon outlining the object instance using a Recurrent Neural Network which they call: Polygon-RNN.<br />
<br />
The RNN model predicts the vertices of the polygon at each time step given a CNN representation of the image, the vertices from the last two time steps, and the first vertex location. The first vertex is predicted differently, as described below. The information from the previous two time steps helps the RNN create a polygon in a specific direction, and the first vertex provides a cue for loop closure of the polygon edges.<br />
<br />
The polygon is parametrized as a sequence of 2D vertices, and it is assumed that the polygon is closed. Because a closed polygon is a cyclic structure that can be traversed in multiple ways, the vertex order is fixed to a clockwise orientation. The starting point of the sequence, however, can be any of the vertices of the polygon.<br />
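<br />
The parametrization above can be illustrated with a minimal sketch. The function names and the <code>y_axis_down</code> flag are hypothetical and not taken from the paper; the sketch only assumes that vertices are (x, y) pairs and uses the shoelace formula to test orientation.<br />

```python
import numpy as np

def signed_area(vertices):
    """Shoelace formula: positive for counter-clockwise order when the y-axis points up."""
    v = np.asarray(vertices, dtype=float)
    x, y = v[:, 0], v[:, 1]
    return 0.5 * np.sum(x * np.roll(y, -1) - np.roll(x, -1) * y)

def make_clockwise(vertices, y_axis_down=True):
    """Return the vertices in clockwise order as seen on screen.

    In image coordinates (y grows downward) a visually clockwise polygon has
    a positive shoelace area, so the sign test flips relative to the y-up case.
    """
    area = signed_area(vertices)
    is_clockwise = area > 0 if y_axis_down else area < 0
    return list(vertices) if is_clockwise else list(vertices)[::-1]

def rotate_start(vertices, start_index):
    """Any vertex may start the sequence; rotating keeps the same closed polygon."""
    return list(vertices[start_index:]) + list(vertices[:start_index])
```

Rotating the start index changes the sequence but not the polygon, which is why the orientation has to be fixed while the starting vertex is left free.<br />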
<br />
== Architecture ==<br />
<br />
There are two primary networks at play: 1. CNN with skip connections, and 2. One-to-many type RNN.<br />
<br />
[[File:Figure_2_Neel.JPG | 800px|thumb|center|Figure 5: Model architecture for Polygon-RNN depicting a CNN with skip connections feeding into a 2 layer ConvLSTM (One-to-many type) ('''Note''': A possible point of confusion - the authors have only shown the layers of VGG16 architecture here that have the skip connections introduced).]]<br />
<br />
1. '''CNN with skip connections''':<br />
<br />
The authors have adopted the VGG16 feature extractor architecture with a few modifications that preserve features from several layers and fuse them into a tensor that feeds into the RNN (refer to Figure 5). Namely, the last max-pooling layer (''pool5'') of the VGG16 CNN has been removed. The image fed into the CNN is pre-shrunk to a 224x224x3 tensor (3 being the red, green, and blue channels). The image passes through 2 pooling layers and 2 convolutional layers. Since the features extracted after each of these four steps are to be preserved and fused later on, the idea is to have a tensor with a common width of 512; so the output tensor at pool2 is convolved with 4 3x3x128 filters and the output tensor at pool3 is convolved with 2 3x3x256 filters. The skip connections from the four layers allow the CNN to extract low-level edge and corner features (which help to follow the object's boundaries) as well as boundary/semantic information about the instances (which helps to identify the object). Finally, a 3x3 convolution followed by a ReLU non-linearity results in a 28x28x128 tensor that contains semantic information pertinent to the image frame and is taken as an input by the RNN.<br />
<br />
2. '''RNN - 2 Layer ConvLSTM'''<br />
<br />
The RNN is employed to capture information about the previous vertices in the sequence. Specifically, a Convolutional LSTM is used as a decoder. The ConvLSTM preserves the spatial information in 2D and reduces the number of parameters compared to a fully connected RNN. The polygon is modeled with a two-layer ConvLSTM with a kernel size of 3x3 and 16 channels, which outputs a vertex at each time step. At time step t, the ConvLSTM takes as input a tensor that concatenates 4 features: the CNN feature representation of the image, the one-hot encodings of the previously predicted vertex and of the vertex predicted two time steps ago, and the one-hot encoding of the first predicted vertex. <br />
<br />
The Convolutional LSTM computes the hidden state <math display = "inline">h_t</math> given the input <math display = "inline">x_t</math> based on the following equations:<br />
<center><br />
<math display="block"><br />
\begin{pmatrix}<br />
i_t \\<br />
f_t \\<br />
o_t \\<br />
g_t \\<br />
\end{pmatrix}<br />
= W_h * h_{t-1} + W_x * x_t + b<br />
</math><br />
<br />
<math display="block"><br />
c_t = \sigma(f_t) \bigodot c_{t-1} + \sigma(i_t) \bigodot \tanh(g_t)<br />
</math><br />
<br />
<math display="block"><br />
h_t = \sigma(o_t) \bigodot \tanh(c_t)<br />
</math><br />
</center><br />
where <math display = "inline">i, f, o</math> denote the input, forget, and output gates, <math display = "inline">g</math> is the candidate cell update, <math display = "inline">h</math> is the hidden state and <math display = "inline">c</math> is the cell state. Also, <math display = "inline">\sigma</math> denotes the sigmoid function, <math display = "inline">\bigodot</math> indicates an element-wise product and <math display = "inline">*</math> a convolution. <math display = "inline">W_h</math> denotes the hidden-to-state convolution kernel and <math display = "inline">W_x</math> the input-to-state convolution kernel.<br />
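<br />
The three equations above can be sketched as a single update step in NumPy. This is an illustrative sketch, not the authors' implementation: the helper names, the zero "same" padding, and the weight layout (the four gates stacked along the output-channel axis) are assumptions.<br />

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def conv2d_same(x, w):
    """Zero-padded 'same' 2D convolution (cross-correlation, as in most deep learning code).

    x: (C_in, H, W), w: (C_out, C_in, k, k) with odd k. Returns (C_out, H, W).
    """
    c_out, c_in, k, _ = w.shape
    pad = k // 2
    xp = np.pad(x, ((0, 0), (pad, pad), (pad, pad)))
    _, h, wd = x.shape
    out = np.zeros((c_out, h, wd))
    for dy in range(k):
        for dx in range(k):
            patch = xp[:, dy:dy + h, dx:dx + wd]              # shifted view, (C_in, H, W)
            out += np.einsum('oc,chw->ohw', w[:, :, dy, dx], patch)
    return out

def convlstm_step(x_t, h_prev, c_prev, w_x, w_h, b):
    """One ConvLSTM update.

    w_x: (4*C_hid, C_in, k, k), w_h: (4*C_hid, C_hid, k, k), b: (4*C_hid,).
    """
    # Pre-activations of all four gates: W_h * h_{t-1} + W_x * x_t + b
    pre = conv2d_same(h_prev, w_h) + conv2d_same(x_t, w_x) + b[:, None, None]
    i_t, f_t, o_t, g_t = np.split(pre, 4, axis=0)
    c_t = sigmoid(f_t) * c_prev + sigmoid(i_t) * np.tanh(g_t)
    h_t = sigmoid(o_t) * np.tanh(c_t)
    return h_t, c_t
```

With the values quoted in the text (3x3 kernels, 16 hidden channels, a 28x28 grid), <code>h_t</code> and <code>c_t</code> both have shape (16, 28, 28).<br />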
<br />
The authors treat vertex prediction as a classification task: the location of each vertex is represented as a one-hot vector of dimension DxD + 1 (D is chosen to be 28 by the authors in tests). The one additional dimension is the end-of-sequence token that signals loop closure for the polygon. Because the one-hot representations of the two previously predicted vertices and of the first vertex are taken as input, a clockwise direction (or, for that matter, any fixed direction) can be enforced for the creation of the polygon. Coming back to the prediction of the first vertex: this is done through a further modification of the CNN that adds two DxD layers, with one branch predicting object instance boundaries while the other takes in this output as well as the image features to predict the first vertex. This CNN is trained separately. Here, <math display = "inline">y_t</math> denotes the one-hot encoding of the vertex and is the output at time step t.<br />
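<br />
As a small illustration of the DxD + 1 representation described above, the sketch below encodes and decodes a vertex. The function names and the row-major layout of the grid cells are assumptions made for the example; the text only fixes the vector length and the role of the extra entry.<br />

```python
import numpy as np

D = 28  # grid resolution quoted in the text

def encode_vertex(row, col, d=D):
    """One-hot vector of length d*d + 1 with a 1 at the vertex's grid cell (row-major, assumed)."""
    vec = np.zeros(d * d + 1)
    vec[row * d + col] = 1.0
    return vec

def encode_end_of_polygon(d=D):
    """The extra (last) entry marks loop closure instead of a grid location."""
    vec = np.zeros(d * d + 1)
    vec[d * d] = 1.0
    return vec

def decode(vec, d=D):
    """Return (row, col), or None when the extra closing entry is the most active one."""
    idx = int(np.argmax(vec))
    if idx == d * d:
        return None
    return divmod(idx, d)
```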
<br />
== Training ==<br />
<br />
The training of the model is done as follows:<br />
<br />
1. Cross-entropy is used for the RNN loss function.<br />
<br />
2. Instead of Stochastic Gradient Descent, Adam is used for optimization: batch size = 8, learning rate = 1e-4 (the learning rate decays after 10 epochs by a factor of 10). <br />
<br />
3. For the first vertex prediction, the modified CNN mentioned previously is trained using a multi-task cost function.<br />
<br />
The reported time for training is one day on a Nvidia Titan-X GPU.<br />
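<br />
The loss and learning-rate settings listed above can be sketched as follows. This is a hedged illustration: the function names are hypothetical, and the schedule assumes the decay repeats every 10 epochs, which is only one possible reading of "decays after 10 epochs by a factor of 10".<br />

```python
import numpy as np

def cross_entropy(logits, target_index):
    """Cross-entropy between softmax(logits) and a one-hot target at target_index."""
    shifted = logits - np.max(logits)                    # subtract the max for numerical stability
    log_probs = shifted - np.log(np.sum(np.exp(shifted)))
    return -log_probs[target_index]

def learning_rate(epoch, base_lr=1e-4, decay_epochs=10, factor=10.0):
    """Step schedule: divide the rate by `factor` once every `decay_epochs` epochs (assumed reading)."""
    return base_lr / (factor ** (epoch // decay_epochs))
```

For a uniform prediction over the 28x28 + 1 = 785 classes, the loss equals log(785), which is a useful sanity check for an untrained model.<br />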
<br />
== Importance of Human Annotator in the Loop ==<br />
<br />
The model allows for the prediction at a given time step to be corrected and this corrected vertex is then fed into the next time step of the RNN, effectively rejecting the network predicted vertex. This has the simple effect of putting the model "back on the right track". Note that this is only possible due to the adoption of the RNN architecture i.e. the inherent nature of the RNN to accept previous outputs allows incorporation of the user's judgement. The typical inference time as quoted by the paper is 250ms per object.<br />
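<br />
The correction mechanism described above can be sketched as a decoding loop. Here <code>predict_next</code> is a hypothetical stand-in for one step of the trained RNN, and the <code>corrections</code> dictionary is an assumed way to represent the annotator's input; neither comes from the paper.<br />

```python
def annotate(predict_next, first_vertex, corrections=None, max_steps=50):
    """Decode a polygon, letting annotator corrections override network predictions.

    predict_next(prev2, prev1, first) returns the next vertex, or None to close
    the polygon. corrections maps a time step to the vertex the annotator clicked.
    """
    corrections = corrections or {}
    polygon = [first_vertex]
    prev2, prev1 = None, first_vertex
    for t in range(1, max_steps):
        vertex = predict_next(prev2, prev1, first_vertex)
        if t in corrections:
            vertex = corrections[t]        # reject the network's prediction for this step
        if vertex is None:
            break                          # polygon closed
        polygon.append(vertex)
        prev2, prev1 = prev1, vertex       # the corrected vertex is what the next step sees
    return polygon
```

Because the corrected vertex replaces the prediction before it is fed back, every later step is conditioned on the annotator's input rather than on the rejected output.<br />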
<br />
= Results =<br />
<br />
== Evaluation Metrics ==<br />
<br />
The evaluation of the model performance was conducted based on the Cityscapes and KITTI Datasets. There are two metrics used for evaluation:<br />
<br />
1. '''IoU''': The standard Intersection over Union (IoU) measure is used for comparison. The calculation takes both the predicted and the ground-truth object boundaries: the intersection (the area contained in both boundaries at once) is divided by the union (the area contained in at least one of the boundaries). A low score means that there is little overlap between the boundaries, or large areas of non-overlap, and a score of 1.0 indicates that the two boundaries contain the same area.<br />
<br />
2. '''Number of Clicks''': To evaluate the speed-up factor, the chessboard distance is used to measure the distance between the ground truth (GT) and the output of the Polygon-RNN. A set of distance thresholds <math display = "inline">T \in \{1,2,3,4\}</math> is used; if the distance exceeds the chosen threshold, an annotator corrects the vertex to match the GT, and the number of such corrections (clicks) is recorded.<br />
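<br />
Both metrics can be sketched in a few lines. The function names are hypothetical, the IoU is computed on binary masks rather than on polygon boundaries, and the click count assumes a one-to-one pairing of predicted and ground-truth vertices; it also ignores the fact that a correction changes the vertices predicted afterwards.<br />

```python
import numpy as np

def iou(mask_a, mask_b):
    """Intersection over Union of two binary masks of the same shape."""
    a = np.asarray(mask_a, dtype=bool)
    b = np.asarray(mask_b, dtype=bool)
    union = np.logical_or(a, b).sum()
    if union == 0:
        return 1.0                       # convention chosen here: two empty masks agree
    return np.logical_and(a, b).sum() / union

def chessboard_distance(p, q):
    """Largest coordinate difference between two grid points."""
    return max(abs(p[0] - q[0]), abs(p[1] - q[1]))

def count_clicks(predicted, ground_truth, threshold):
    """Number of vertices farther than `threshold` from their ground-truth counterpart."""
    return sum(chessboard_distance(p, g) > threshold
               for p, g in zip(predicted, ground_truth))
```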
<br />
== Baseline Techniques ==<br />
<br />
1. '''SharpMask''': a 50 layer ResNet considered as the state of the art annotation method.<br />
<br />
2. '''DeepMask''': a build-up on the 50 layer ResNet with an addition of another CNN.<br />
<br />
3. '''Dilation10''': another simple technique using purely convolutional operations.<br />
<br />
4. '''SquareBox''': a simple technique where an entire bounding box is labeled as an object<br />
<br />
== Quantitative Results ==<br />
<br />
The Polygon RNN method outperforms the baselines in 6 out of the 8 categories and has a mean IoU greater than all of the baselines. Particularly, in the car, person, and rider categories, a 12%, 7%, and 6% higher performance than SharpMask is achieved.<br />
<br />
[[File:Table_1_Neel.JPG | 800px|thumb|center|Table 1: IoU performance on Cityscapes data without any annotator intervention.]]<br />
<br />
In addition, with the help of the annotator, the speedup factor was 7.3 times with under 5 clicks which the authors claim is the main advantage of this method.<br />
<br />
[[File:Table_0_Neel.JPG | 800px|thumb|center|Table 2: IoU performance on Cityscapes data with annotator intervention.]]<br />
<br />
The method also works well with other datasets such as KITTI:<br />
<br />
[[File:Table_2_Neel.JPG | 800px|thumb|center|Table 3: IoU performance on KITTI data.]]<br />
<br />
== Qualitative Results ==<br />
<br />
In addition, most of the comparisons with human annotators show that the method is on par with human-level annotation.<br />
<br />
<gallery widths=500px heights=500px perrow=2 mode="packed"><br />
File:Figure_3_Neel.JPG|Figure 6: Qualitative results: comparison with human annotator.|alt=alt language<br />
File:Figure_4_Neel.JPG|Figure 7: Qualitative results: comparison with human annotator.|alt=alt language<br />
</gallery><br />
<br />
=Conclusion=<br />
<br />
The important conclusions from this paper are:<br />
<br />
1. The paper presented a powerful generic annotation tool for modelling complex annotations as a simple polygon that works on different unseen datasets. <br />
<br />
2. Significant improvement in annotation time can be achieved with the Polygon-RNN method itself (speed-up factor of 4.74).<br />
<br />
3. However, the flexibility of having inputs from a human annotator helps increase the IoU for a certain range of clicks.<br />
<br />
4. The model architecture has a down-sampling factor of 16 and the final output resolution and accuracy is sensitive to object size.<br />
<br />
5. Another downside of the model architecture is that training time is increased due to the training of the CNN for the first vertex.<br />
<br />
=Critique=<br />
<br />
1. With the human annotator in the loop, the model speeds up the annotation process by over 7 times, which could be a significant cost and time saving for companies.<br />
<br />
2. Given that this model uses the VGG16 architecture compared to the 50 layer ResNet in SharpMask, this method is quite efficient.<br />
<br />
3. This paper requires training of an entire CNN for the first vertex and is inefficient in that sense as it introduces additional parameters adding to the computation time and resource demand.<br />
<br />
4. The baseline methods have the upper hand over this model when it comes to larger objects because of the down-scaled structure adopted by this model.<br />
<br />
5. In terms of future work, elimination of the additional CNN for the first vertex as well as an enhanced architecture to remain insensitive to the size of the object to be annotated should be implemented.</div>Mpaflahttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=stat946w18/Wavelet_Pooling_For_Convolutional_Neural_Networks&diff=37893stat946w18/Wavelet Pooling For Convolutional Neural Networks2018-11-06T00:28:41Z<p>Mpafla: /* Proposed Method */</p>
<hr />
<div>=Wavelet Pooling For Convolutional Neural Networks=<br />
<br />
<br />
<br />
== Introduction, Important Terms and Brief Summary==<br />
<br />
This paper focuses on the following important techniques: <br />
<br />
1) Convolutional Neural Nets (CNN): These are networks with layered structures that conform to the shape of inputs and consistently obtain high accuracies in the classification of images and objects. Researchers continue to focus on CNN to improve their performances. <br />
<br />
2) Pooling: Pooling subsamples the results of the convolution layers and gradually reduces spatial dimensions of the data throughout the network. It is done to reduce parameters, increase computational efficiency and regulate overfitting. <br />
<br />
Some pooling methods, including max pooling and average pooling, are deterministic: they are efficient and simple but can hinder optimal network learning. In contrast, mixed pooling and stochastic pooling use a probabilistic approach, which can address some problems of the deterministic methods. All of these methods operate on neighborhoods because this is simple and efficient; nevertheless, the neighborhood approach suffers from edge halos, blurring, and aliasing, which need to be minimized. This paper introduces wavelet pooling, which uses a second-level wavelet decomposition to subsample features. The nearest-neighbor interpolation is replaced by an organic, subband method that more accurately represents the feature contents with fewer artifacts: features are decomposed to the second level, and the first-level subbands are discarded to reduce feature dimensions. The method is compared with other state-of-the-art pooling methods on the benchmark classification datasets MNIST, CIFAR-10, SVHN, and KDEF.<br />
<br />
== Intuition ==<br />
<br />
Convolutional networks commonly employ convolutional layers to extract features and use pooling methods for spatial dimensionality reduction. In this study, wavelet pooling is introduced as an alternative to traditional neighborhood pooling that provides a more structural method of feature dimension reduction. The authors note that max pooling is prone to over-fitting, while average pooling tends to smooth the data.<br />
<br />
== History ==<br />
<br />
The study introduces and references the history of different pooling methods:<br />
* Manual subsampling in 1979<br />
* Max pooling in 1992<br />
* Mixed pooling in 2014<br />
* Pooling methods with probabilistic approaches in 2014 and 2015<br />
<br />
== Background ==<br />
Average Pooling and Max Pooling are well-known pooling methods and are popular techniques used in the literature. While these methods are simple and effective, they still have some limitations. The authors identify the following limitations:<br />
<br />
'''Limitations of Max Pooling and Average Pooling'''<br />
<br />
'''Max pooling''': takes the maximum value of a region <math>R_{ij} </math> and selects it to obtain a condensed feature map. It can '''erase the details''' of the image (happens if the main details have less intensity than the insignificant details) and also commonly '''over-fits''' the training data. The max-pooling is defined as:<br />
<br />
\begin{align}<br />
a_{kij} = \max_{(p,q)\in R_{ij}}(a_{kpq})<br />
\end{align}<br />
<br />
'''Average pooling''': calculates the average value of a region and selects it to obtain a condensed feature map. Depending on the data, this method can '''dilute pertinent details''' from an image (this happens for data with values much lower than the significant details). Average pooling is defined as:<br />
<br />
\begin{align}<br />
a_{kij} = \frac{1}{|R_{ij}|}\sum_{(p,q)\in R_{ij}}{{a_{kpq}}}<br />
\end{align}<br />
<br />
where <math>a_{kij}</math> is the output activation of the <math>k^{th}</math> feature map at <math>(i,j)</math>, <math>a_{kpq}</math> is the input activation at <math>(p,q)</math> within <math>R_{ij}</math>, and <math>|R_{ij}|</math> is the size of the pooling region. Figure 2 provides an example of the weaknesses of these two methods using toy images:<br />
<br />
[[File: fig0001.PNG| 700px|center]]<br />
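<br />
The two definitions above can be sketched in NumPy for a single feature map. The function names are hypothetical, and the sketch assumes non-overlapping square regions <math>R_{ij}</math> whose size divides the height and width of the map.<br />

```python
import numpy as np

def pool_regions(a, size=2):
    """Split an (H, W) feature map into non-overlapping size x size regions R_ij.

    Returns an array of shape (H/size, W/size, size*size) holding each region's activations.
    """
    h, w = a.shape
    blocks = a.reshape(h // size, size, w // size, size).swapaxes(1, 2)
    return blocks.reshape(h // size, w // size, size * size)

def max_pool(a, size=2):
    """a_kij = max over (p, q) in R_ij of a_kpq."""
    return pool_regions(a, size).max(axis=-1)

def avg_pool(a, size=2):
    """a_kij = (1 / |R_ij|) * sum over (p, q) in R_ij of a_kpq."""
    return pool_regions(a, size).mean(axis=-1)
```

A region such as [0, 0, 0, 9] shows the difference: max pooling keeps 9, while average pooling dilutes it to 2.25.<br />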
<br />
<br />
'''How do researchers try to combat these issues?'''<br />
By using '''probabilistic pooling methods''' such as:<br />
<br />
1. '''Mixed pooling''': which combines max and average pooling by randomly selecting one over the other during training in three separate ways:<br />
<br />
* For all features within a layer<br />
* Mixed between features within a layer<br />
* Mixed between regions for different features within a layer<br />
<br />
Mixed Pooling is defined as:<br />
<br />
\begin{align}<br />
a_{kij} = \lambda \cdot \max_{(p,q)\in R_{ij}}(a_{kpq})+(1-\lambda) \cdot \frac{1}{|R_{ij}|}\sum_{(p,q)\in R_{ij}}{{a_{kpq}}}<br />
\end{align}<br />
<br />
where <math>\lambda</math> is a random value, either 0 or 1, that selects average pooling (<math>\lambda = 0</math>) or max pooling (<math>\lambda = 1</math>).<br />
<br />
2. '''Stochastic pooling''': improves upon max pooling by randomly sampling from neighborhood regions based on the probability values of each activation. This is defined as:<br />
<br />
\begin{align}<br />
a_{kij} = a_l ~ \text{where } ~ l\sim P(p_1,p_2,...,p_{|R_{ij}|})<br />
\end{align}<br />
<br />
with probability of activations within each region defined as follows:<br />
<br />
\begin{align}<br />
p_{pq} = \dfrac{a_{pq}}{\sum_{(p,q) \in R_{ij}} a_{pq}}<br />
\end{align}<br />
<br />
The figure below describes the process of stochastic pooling. The left panel shows the activations of a given region, and the center panel shows the corresponding probabilities. The activation with the highest probability is the most likely to be selected, but any activation can be selected. In this case, the midrange activation with a probability of 13% is selected. <br />
<br />
[[File: stochastic pooling.jpeg| 700px|center]]<br />
<br />
As stochastic pooling is based on probability and is not deterministic, it avoids the shortcomings of max and average pooling and enjoys some of the advantages of max pooling.<br />
<br />
3. '''Top-k activation pooling''': picks the k largest activations in every pooling region so that as much information as possible passes through the subsampling gates. It extends max pooling: to further improve the representation capability, the top-k activations are summed, and the sum is constrained by a constant. <br />
Details in this paper: https://www.hindawi.com/journals/wcmc/2018/8196906/<br />
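<br />
The mixed and stochastic pooling definitions given above can be sketched as follows. The function names are hypothetical. The sketch assumes non-overlapping square regions, draws a single <math>\lambda</math> per call (the "all features within a layer" variant), assumes non-negative activations for stochastic pooling, and falls back to a uniform choice when a region sums to zero, a case the definitions do not cover.<br />

```python
import numpy as np

def pool_regions(a, size=2):
    """Split an (H, W) map into non-overlapping regions: (H/size, W/size, size*size)."""
    h, w = a.shape
    blocks = a.reshape(h // size, size, w // size, size).swapaxes(1, 2)
    return blocks.reshape(h // size, w // size, size * size)

def mixed_pool(a, rng, size=2):
    """Draw lambda in {0, 1} once: 1 selects max pooling, 0 selects average pooling."""
    lam = rng.integers(0, 2)
    regions = pool_regions(a, size)
    return lam * regions.max(axis=-1) + (1 - lam) * regions.mean(axis=-1)

def stochastic_pool(a, rng, size=2):
    """Sample one activation per region with probability proportional to its value."""
    regions = pool_regions(a, size)
    out = np.empty(regions.shape[:2])
    for i in range(regions.shape[0]):
        for j in range(regions.shape[1]):
            region = regions[i, j]
            total = region.sum()
            if total > 0:
                p = region / total
            else:
                p = np.full(region.size, 1.0 / region.size)   # assumed fallback
            out[i, j] = rng.choice(region, p=p)
    return out
```

A generator such as <code>np.random.default_rng(0)</code> can be passed as <code>rng</code> to make the random draws reproducible.<br />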
<br />
'''Wavelets and Wavelet Transform'''<br />
A wavelet is a short, localized, oscillating function. In a wavelet transform, a signal is represented as a combination of scaled and shifted copies of such basis functions.<br />
<br />
The wavelet transform involves taking the inner product of a signal (in this case, the image), with these basis functions. This produces a set of coefficients for the signal. These coefficients can then be quantized and coded in order to compress the image.<br />
<br />
One issue of note is that wavelets offer a tradeoff between resolution in frequency and resolution in time (or, for images, location). For example, a sine wave is useful for detecting signal content at its own frequency, but it cannot indicate where along the signal this content occurs. Thus, basis functions must be chosen with this tradeoff in mind.<br />
<br />
Source: Compressing still and moving images with wavelets<br />
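<br />
The inner-product view of the wavelet transform described above can be illustrated with the Haar basis for a length-4 signal. This is an illustrative sketch with hypothetical names; the basis is written out explicitly rather than taken from a wavelet library.<br />

```python
import numpy as np

s = 1.0 / np.sqrt(2.0)

# Rows form an orthonormal Haar basis for signals of length 4.
haar_basis = np.array([
    [0.5, 0.5,  0.5,  0.5],   # scaling function: overall average
    [0.5, 0.5, -0.5, -0.5],   # coarse wavelet: first half versus second half
    [s,   -s,   0.0,  0.0],   # fine wavelet located in the first half
    [0.0, 0.0,  s,   -s],     # fine wavelet located in the second half
])

def haar_coefficients(signal):
    """Each coefficient is the inner product of the signal with one basis function."""
    return haar_basis @ np.asarray(signal, dtype=float)

def haar_reconstruct(coefficients):
    """The basis is orthonormal, so its transpose inverts the transform."""
    return haar_basis.T @ np.asarray(coefficients, dtype=float)
```

For the signal [4, 4, 2, 0], only the fine wavelet in the second half gives a non-zero fine coefficient, which shows how wavelet coefficients carry location information that a global sine wave cannot.<br />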
<br />
== Proposed Method ==<br />
<br />
The proposed pooling method uses wavelets (i.e. small waves - generally used in signal processing) to reduce the dimensions of the feature maps. They use wavelet transform to minimize artifacts resulting from neighborhood reduction. They postulate that their approach, which discards the first-order sub-bands, more organically captures the data compression.<br />
<br />
* '''Forward Propagation'''<br />
<br />
The proposed wavelet pooling scheme pools features by performing a 2nd order decomposition in the wavelet domain according to the fast wavelet transform (FWT) which is a more efficient implementation of the two-dimensional discrete wavelet transform (DWT) as follows:<br />
<br />
\begin{align}<br />
W_{\varphi}[j+1,k] = h_{\varphi}[-n]*W_{\varphi}[j,n]|_{n=2k,k\leq0}<br />
\end{align}<br />
<br />
\begin{align}<br />
W_{\psi}[j+1,k] = h_{\psi}[-n]*W_{\varphi}[j,n]|_{n=2k,k\leq0}<br />
\end{align}<br />
<br />
where <math>\varphi</math> is the approximation function, <math>\psi</math> is the detail function, and <math>W_{\varphi},W_{\psi}</math> are called the approximation and detail coefficients. <math>h_{\varphi}[-n]</math> and <math>h_{\psi}[-n]</math> are the time-reversed scaling and wavelet vectors, <math>n</math> represents the sample in the vector, and <math>j</math> denotes the resolution level.<br />
<br />
When using the FWT on images, it is applied twice (once on the rows, then again on the columns). Doing this in combination yields the detail sub-bands (LH, HL, HH) at each decomposition level and the approximation sub-band (LL) for the highest decomposition level.<br />
After performing the 2nd order decomposition, the image features are reconstructed, but only using the 2nd order wavelet sub-bands. This method pools the image features by a factor of 2 using the inverse FWT (IFWT), which is based on the inverse DWT (IDWT).<br />
<br />
\begin{align}<br />
W_{\varphi}[j,k] = h_{\varphi}[-n]*W_{\varphi}[j+1,n]+h_{\psi}[-n]*W_{\psi}[j+1,n]|_{n=\frac{k}{2},k\leq0}<br />
\end{align}<br />
<br />
[[File: wavelet pooling forward.PNG| 700px|center]]<br />
<br />
<br />
* '''Backpropagation'''<br />
<br />
The proposed wavelet pooling algorithm performs backpropagation by reversing the process of its forward propagation. First, the image feature being backpropagated undergoes 1st order wavelet decomposition. After decomposition, the detail coefficient sub-bands are up-sampled by a factor of 2 to create a new 1st level decomposition. The initial decomposition then becomes the 2nd level decomposition. Finally, this new 2nd order wavelet decomposition reconstructs the image feature for further backpropagation using the IDWT. Figure 5 illustrates the wavelet pooling backpropagation algorithm in detail:<br />
<br />
[[File:wavelet pooling backpropagation.PNG| 700px|center]]<br />
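<br />
The forward and backward passes described above can be sketched with an orthonormal 2D Haar transform. This is a hedged sketch and not the authors' MatConvNet code: the function names are hypothetical, the sub-band naming (LH, HL) follows one common convention, the input height and width are assumed divisible by 4, and nearest-neighbour up-sampling of the detail sub-bands is an assumption because the text does not specify the up-sampling method.<br />

```python
import numpy as np

def haar_dwt2(x):
    """One level of the 2D orthonormal Haar transform of an (H, W) array, H and W even."""
    a, b = x[0::2, 0::2], x[0::2, 1::2]
    c, d = x[1::2, 0::2], x[1::2, 1::2]
    ll = (a + b + c + d) / 2      # approximation sub-band
    lh = (a + b - c - d) / 2      # detail: difference between rows
    hl = (a - b + c - d) / 2      # detail: difference between columns
    hh = (a - b - c + d) / 2      # detail: diagonal difference
    return ll, lh, hl, hh

def haar_idwt2(ll, lh, hl, hh):
    """Inverse of haar_dwt2: rebuilds a (2H, 2W) array from four (H, W) sub-bands."""
    h, w = ll.shape
    x = np.empty((2 * h, 2 * w))
    x[0::2, 0::2] = (ll + lh + hl + hh) / 2
    x[0::2, 1::2] = (ll + lh - hl - hh) / 2
    x[1::2, 0::2] = (ll - lh + hl - hh) / 2
    x[1::2, 1::2] = (ll - lh - hl + hh) / 2
    return x

def wavelet_pool(x):
    """Forward pass: 2nd order decomposition, then reconstruct from 2nd level sub-bands only."""
    ll1, _, _, _ = haar_dwt2(x)                 # 1st level detail sub-bands are discarded
    ll2, lh2, hl2, hh2 = haar_dwt2(ll1)         # 2nd level decomposition
    return haar_idwt2(ll2, lh2, hl2, hh2)       # output has half the height and width of x

def wavelet_pool_backward(grad):
    """Backward pass as described in the text, with assumed nearest-neighbour up-sampling."""
    ll, lh, hl, hh = haar_dwt2(grad)            # 1st order decomposition, reused as the 2nd level
    up = lambda band: np.kron(band, np.ones((2, 2)))
    ll1 = haar_idwt2(ll, lh, hl, hh)            # rebuild the 1st level approximation
    return haar_idwt2(ll1, up(lh), up(hl), up(hh))
```

Because the Haar transform reconstructs perfectly, rebuilding from all four 2nd level sub-bands returns the 1st level approximation sub-band in this sketch, which is a scaled 2x2 block average of the input.<br />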
<br />
== Results and Discussion ==<br />
<br />
All experiments have been performed using the MatConvNet (Vedaldi & Lenc, 2015) toolbox. Stochastic gradient descent has been used for training. For the proposed method, the Haar wavelet has been chosen as the basis wavelet for its property of having even, square sub-bands. All CNN structures except for MNIST use a network loosely based on Zeiler's network (Zeiler & Fergus, 2013). The experiments are repeated with Dropout (Srivastava, 2013), and Local Response Normalization (Krizhevsky, 2009) is replaced with Batch Normalization (Ioffe & Szegedy, 2015) for CIFAR-10 and SVHN (Dropout only) to examine how these regularization techniques change the pooling results. The authors have tested the proposed method on four different datasets as shown in the figure:<br />
<br />
[[File: selection of image datasets.PNG| 700px|center]]<br />
<br />
Different methods based on Max, Average, Mixed, Stochastic and Wavelet have been used at the pooling section of each architecture. Accuracy and Model Energy have been used as the metrics to evaluate the performance of the proposed methods. These have been evaluated and their performances have been compared on different data-sets.<br />
<br />
* MNIST:<br />
<br />
The network architecture is based on the example MNIST structure from MatConvNet, with batch normalization inserted. All other parameters are the same. The figure below shows their network structure for the MNIST experiments.<br />
<br />
[[File: CNN MNIST.PNG| 700px|center]]<br />
<br />
The input training data and test data come from the MNIST database of handwritten digits. The full training set of 60,000 images is used, as well as the full testing set of 10,000 images. The table below shows their proposed method outperforms all methods. Given the small number of epochs, max pooling is the only method to start to over-fit the data during training. Mixed and stochastic pooling show a rocky trajectory but do not over-fit. Average and wavelet pooling show a smoother descent in learning and error reduction. The figure below shows the energy of each method per epoch.<br />
<br />
[[File: MNIST pooling method energy.PNG| 700px|center]]<br />
<br />
<br />
The accuracies for both paradigms are shown below:<br />
<br />
<br />
[[File: MNIST perf.PNG| 700px|center]]<br />
<br />
* CIFAR-10:<br />
<br />
The authors perform two sets of experiments with the pooling methods. The first is a regular network structure with no dropout layers. They use this network to observe each pooling method without extra regularization. The second uses dropout and batch normalization and performs over 30 more epochs to observe the effects of these changes. <br />
<br />
[[File: CNN CIFAR.PNG| 700px|center]]<br />
<br />
The input training and test data come from the CIFAR-10 dataset. <br />
The full training set of 50,000 images is used, as well as the full testing set of 10,000 images. For both cases, with and without dropout, the tables below show that the proposed method has the second highest accuracy.<br />
<br />
[[File: fig0000.jpg| 700px|center]]<br />
<br />
Max pooling over-fits fairly quickly, while wavelet pooling resists over-fitting. The change in learning rate prevents their method from over-fitting, and it continues to show a slower propensity for learning. Mixed and stochastic pooling maintain a consistent progression of learning, and their validation sets trend at a similar, but better rate than their proposed method. Average pooling shows the smoothest descent in learning and error reduction, especially in the validation set. The energy of each method per epoch is also shown below:<br />
<br />
[[File: CIFAR_pooling_method_energy.PNG| 700px|center]]<br />
<br />
<br />
* SVHN:<br />
<br />
Two sets of experiments are performed with the pooling methods. The first is a regular network structure with no dropout layers, used to observe each pooling method without extra regularization, as for the previous datasets.<br />
The second network uses dropout to observe the effects of this change. The figure below shows their network structure for the SVHN experiments:<br />
<br />
[[File: CNN SHVN.PNG| 700px|center]]<br />
<br />
The input training and test data come from the SVHN dataset. For the case with no dropout, they use 55,000 images from the training set. For the case with dropout, they use the full training set of 73,257 images, a validation set of 30,000 images that they extract from the extra training set of 531,131 images, and the full testing set of 26,032 images. For both cases, with and without dropout, the tables below show that their proposed method has the second lowest accuracy.<br />
<br />
[[File: SHVN perf.PNG| 700px|center]]<br />
<br />
Max and wavelet pooling both slightly over-fit the data. Their method follows the path of max pooling but performs slightly better in maintaining some stability. Mixed, stochastic, and average pooling maintain a slow progression of learning, and their validation sets trend at near identical rates. The figure below shows the energy of each method per epoch.<br />
<br />
[[File: SHVN pooling method energy.PNG| 700px|center]]<br />
<br />
* KDEF:<br />
<br />
They run one set of experiments with the pooling methods that includes dropout. The figure below shows their network structure for the KDEF experiments:<br />
<br />
[[File:CNN KDEF.PNG| 700px|center]]<br />
<br />
The input training and test data come from the KDEF dataset. This dataset contains 4,900 images of 35 people displaying seven basic emotions (afraid, angry, disgusted, happy, neutral, sad, and surprised) using facial expressions. They display emotions at five poses (full left and right profiles, half left and right profiles, and straight).<br />
<br />
This dataset contains a few errors that they have fixed (missing or corrupted images, uncropped images, etc.). All of the missing images are at angles of -90, -45, 45, or 90 degrees. They fix the missing and corrupt images by mirroring their counterparts in MATLAB and adding them back to the dataset. They manually crop the images that need to match the dimensions set by the creators (762 x 562).<br />
KDEF does not designate a training or test data set. They shuffle the data and separate 3,900 images as training data, and 1,000 images as test data. They resize the images to 128x128 because of memory and time constraints.<br />
<br />
The dropout layers regulate the network and maintain stability in spite of some pooling methods known to over-fit. The table below shows their proposed method has the second highest accuracy. Max pooling eventually over-fits, while wavelet pooling resists over-fitting. Average and mixed pooling resist over-fitting but are unstable for most of the learning. Stochastic pooling maintains a consistent progression of learning. Wavelet pooling also follows a smoother, consistent progression of learning.<br />
The figure below shows the energy of each method per epoch.<br />
<br />
[[File: KDEF pooling method energy.PNG| 700px|center]]<br />
<br />
The accuracies for both paradigms are shown below:<br />
<br />
[[File: KDEF perf.PNG| 700px|center]]<br />
<br />
== Conclusion ==<br />
<br />
They show that wavelet pooling has the potential to equal or eclipse some of the traditional methods currently utilized in CNNs. Their proposed method outperforms all others on the MNIST dataset, outperforms all but one on the CIFAR-10 and KDEF datasets, and performs within a respectable range of the pooling methods that outdo it on the SVHN dataset. The addition of dropout and batch normalization shows the proposed method's response to network regularization. As in the non-dropout cases, it outperforms all but one on both the CIFAR-10 and KDEF datasets and performs within a respectable range of the pooling methods that outdo it on the SVHN dataset.<br />
<br />
== Suggested Future work ==<br />
<br />
The upsampling and downsampling factors in decomposition and reconstruction need to be changed to achieve more feature reduction.<br />
The subbands that are currently discarded should be kept to achieve higher accuracy. To achieve higher computational efficiency, the FWT method needs to be improved.<br />
<br />
== Critiques and Suggestions ==<br />
* The backpropagation process, which could be a strong point of the study, is not described in enough detail compared to existing methods.<br />
* The main study is on wavelet decomposition, yet the reasons for choosing Haar as the mother wavelet and for the number of decomposition levels are not described; they are only mentioned as future work. <br />
* At the beginning, the study mentions that pooling methods do not receive the attention they should. In the end, the results show that the choice of pooling method depends on the dataset, and the authors mention trial and error as a reasonable approach to choosing it. In my view, the authors have not focused on providing a pooling method that effectively improves on the current situation. At the least, trying to extract a clearer pattern relating the results to the dataset structure would be helpful.<br />
* The origin of average pooling, which is one of the main pooling algorithms used for comparison, is not even referenced in the introduction.<br />
* Combining wavelet, max, and average pooling is an interesting option to investigate further, both in sequence (max/average after wavelet pooling) and combined as in mixed pooling.<br />
* While the current datasets demonstrate the performance of the proposed method appropriately, it would be a good idea to evaluate the method on some larger datasets. This may help to understand whether the size of a dataset affects the over-fitting behavior of max pooling mentioned in the paper.<br />
<br />
== References ==<br />
<br />
Williams, Travis, and Robert Li. "Wavelet Pooling for Convolutional Neural Networks." (2018).<br />
<br />
Hilton, Michael L., Björn D. Jawerth, and Ayan Sengupta. "Compressing still and moving images with wavelets." Multimedia systems 2.5 (1994): 218-227.</div>Mpaflahttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=stat946w18/Wavelet_Pooling_For_Convolutional_Neural_Networks&diff=37892stat946w18/Wavelet Pooling For Convolutional Neural Networks2018-11-06T00:28:23Z<p>Mpafla: /* Proposed Method */</p>
<hr />
<div>=Wavelet Pooling For Convolutional Neural Networks=<br />
<br />
[https://goo.gl/forms/8NucSpF36K6IUZ0V2 Your feedback on presentations]<br />
<br />
<br />
== Introduction, Important Terms and Brief Summary==<br />
<br />
This paper focuses on the following important techniques: <br />
<br />
1) Convolutional Neural Nets (CNN): These are networks with layered structures that conform to the shape of inputs and consistently obtain high accuracies in the classification of images and objects. Researchers continue to focus on CNN to improve their performances. <br />
<br />
2) Pooling: Pooling subsamples the results of the convolution layers and gradually reduces spatial dimensions of the data throughout the network. It is done to reduce parameters, increase computational efficiency and regulate overfitting. <br />
<br />
Some of the pooling methods, including max pooling and average pooling, are deterministic. This means they are efficient and simple but hinder the potential for optimal network learning. In contrast, mixed pooling and stochastic pooling use a probabilistic approach, which can address some problems of deterministic methods. All of these pooling methods use a neighborhood approach because of its simplicity and efficiency. Nevertheless, this approach suffers from edge halos, blurring, and aliasing, which need to be minimized. This paper introduces wavelet pooling, which uses a second-level wavelet decomposition to subsample features. The nearest neighbor interpolation is replaced by an organic, subband method that more accurately represents the feature contents with fewer artifacts. The method decomposes features into a second-level decomposition and discards the first-level subbands to reduce feature dimensions. This method is compared with max, average, mixed, and stochastic pooling. Tests are conducted on benchmark classification datasets: MNIST, CIFAR-10, SVHN and KDEF.<br />
<br />
== Intuition ==<br />
<br />
Convolutional networks commonly employ convolutional layers to extract features and use pooling methods for spatial dimensionality reduction. In this study, wavelet pooling is introduced as an alternative to traditional neighborhood pooling by providing a more structural feature dimension reduction method. The authors note that max pooling tends to over-fit and that average pooling smooths the data.<br />
<br />
== History ==<br />
<br />
The study introduces and references the history of different pooling methods:<br />
* Manual subsampling in 1979<br />
* Max pooling in 1992<br />
* Mixed pooling in 2014<br />
* Pooling methods with probabilistic approaches in 2014 and 2015<br />
<br />
== Background ==<br />
Average Pooling and Max Pooling are well-known pooling methods and are popular techniques used in the literature. While these methods are simple and effective, they still have some limitations. The authors identify the following limitations:<br />
<br />
'''Limitations of Max Pooling and Average Pooling'''<br />
<br />
'''Max pooling''': takes the maximum value of a region <math>R_{ij} </math> and selects it to obtain a condensed feature map. It can '''erase the details''' of the image (happens if the main details have less intensity than the insignificant details) and also commonly '''over-fits''' the training data. The max-pooling is defined as:<br />
<br />
\begin{align}<br />
a_{kij} = \max_{(p,q)\in R_{ij}}(a_{kpq})<br />
\end{align}<br />
<br />
'''Average pooling''': calculates the average value of a region and selects it to obtain a condensed feature map. Depending on the data, this method can '''dilute pertinent details''' from an image (this happens for data with values much lower than the significant details). Average pooling is defined as:<br />
<br />
\begin{align}<br />
a_{kij} = \frac{1}{|R_{ij}|}\sum_{(p,q)\in R_{ij}}{{a_{kpq}}}<br />
\end{align}<br />
<br />
Where <math>a_{kij}</math> is the output activation of the <math>k^{th}</math> feature map at <math>(i,j)</math>, <math>a_{kpq}</math> is the input activation at<br />
<math>(p,q)</math> within <math>R_{ij}</math>, and <math>|R_{ij}|</math> is the size of the pooling region. Figure 2 provides an example of the weaknesses of these two methods using toy images:<br />
<br />
[[File: fig0001.PNG| 700px|center]]<br />
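
The two definitions above can be checked on a small example. The following is a minimal sketch in plain Python, assuming non-overlapping square regions <math>R_{ij}</math>; the function names and the toy feature map are hypothetical and are not taken from the paper.<br />

```python
def pool2d(feature_map, size, reduce_fn):
    """Apply reduce_fn to every non-overlapping size x size region R_ij."""
    rows, cols = len(feature_map), len(feature_map[0])
    pooled = []
    for i in range(0, rows - size + 1, size):
        pooled_row = []
        for j in range(0, cols - size + 1, size):
            # Collect the activations a_kpq with (p, q) inside R_ij.
            region = [feature_map[p][q]
                      for p in range(i, i + size)
                      for q in range(j, j + size)]
            pooled_row.append(reduce_fn(region))
        pooled.append(pooled_row)
    return pooled


def max_pool(feature_map, size=2):
    """Max pooling: keep the largest activation of each region."""
    return pool2d(feature_map, size, max)


def avg_pool(feature_map, size=2):
    """Average pooling: keep the mean activation of each region."""
    return pool2d(feature_map, size, lambda region: sum(region) / len(region))


# Hypothetical 4x4 feature map with one isolated strong activation (the 8).
toy = [[1, 2, 0, 0],
       [3, 4, 0, 8],
       [5, 5, 1, 1],
       [5, 5, 1, 1]]

max_result = max_pool(toy)   # [[4, 8], [5, 1]]
avg_result = avg_pool(toy)   # [[2.5, 2.0], [5.0, 1.0]]
```

In the top-right region, max pooling keeps only the isolated value 8, while average pooling dilutes it to 2.0, which illustrates the two weaknesses described above.<br />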
<br />
<br />
'''How do researchers try to combat these issues?'''<br />
By using '''probabilistic pooling methods''' such as:<br />
<br />
1. '''Mixed pooling''': which combines max and average pooling by randomly selecting one over the other during training in three separate ways:<br />
<br />
* For all features within a layer<br />
* Mixed between features within a layer<br />
* Mixed between regions for different features within a layer<br />
<br />
Mixed Pooling is defined as:<br />
<br />
\begin{align}<br />
a_{kij} = \lambda \cdot \max_{(p,q)\in R_{ij}}(a_{kpq})+(1-\lambda) \cdot \frac{1}{|R_{ij}|}\sum_{(p,q)\in R_{ij}}{{a_{kpq}}}<br />
\end{align}<br />
<br />
Where <math>\lambda</math> is a random value, either 0 or 1, selecting average pooling (<math>\lambda = 0</math>) or max pooling (<math>\lambda = 1</math>).<br />
<br />
2. '''Stochastic pooling''': improves upon max pooling by randomly sampling from neighborhood regions based on the probability values of each activation. This is defined as:<br />
<br />
\begin{align}<br />
a_{kij} = a_l ~ \text{where } ~ l\sim P(p_1,p_2,...,p_{|R_{ij}|})<br />
\end{align}<br />
<br />
with probability of activations within each region defined as follows:<br />
<br />
\begin{align}<br />
p_{pq} = \dfrac{a_{pq}}{\sum_{(p,q) \in R_{ij}} a_{pq}}<br />
\end{align}<br />
<br />
The figure below describes the process of stochastic pooling. The figure on the left shows the activations of a given region, and the corresponding probabilities are shown in the center. The activation with the highest probability is the most likely to be selected by the pooling method, but any activation can be selected. In this case, the midrange activation with a probability of 13% is selected. <br />
<br />
[[File: stochastic pooling.jpeg| 700px|center]]<br />
<br />
As stochastic pooling is based on probability and is not deterministic, it avoids the shortcomings of max and average pooling and enjoys some of the advantages of max pooling.<br />
<br />
3. '''Top-k activation pooling''': picks the top-k activations in every pooling region, so that as much information as possible passes through the subsampling gates. It is used together with max pooling: after max pooling, to further improve the representation capability, the top-k activations are picked, summed, and the sum is constrained by a constant. <br />
Details in this paper: https://www.hindawi.com/journals/wcmc/2018/8196906/<br />
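
The mixed and stochastic pooling rules above can be sketched for a single pooling region. This is only an illustration in plain Python: the function names and the example region are hypothetical, and the stochastic rule assumes non-negative activations with a positive sum (for example after a ReLU).<br />

```python
import random


def mixed_pool_region(region, rng):
    """Mixed pooling: lambda in {0, 1} selects average (0) or max (1) pooling."""
    lam = rng.randint(0, 1)
    return lam * max(region) + (1 - lam) * (sum(region) / len(region))


def stochastic_probabilities(region):
    """Probability of each activation: its value divided by the region sum."""
    total = sum(region)  # assumed to be positive
    return [a / total for a in region]


def stochastic_pool_region(region, rng):
    """Stochastic pooling: sample one activation according to its probability."""
    probs = stochastic_probabilities(region)
    return rng.choices(region, weights=probs, k=1)[0]


rng = random.Random(0)       # fixed seed only so that the sketch is repeatable
region = [1, 0, 3, 6]        # hypothetical activations of one region R_ij

mixed_value = mixed_pool_region(region, rng)        # either the max (6) or the average (2.5)
probabilities = stochastic_probabilities(region)    # [0.1, 0.0, 0.3, 0.6]
sampled_value = stochastic_pool_region(region, rng) # 1, 3, or 6; a zero activation has zero probability
```

Larger activations are sampled more often, but smaller non-zero activations can still be selected, which is what distinguishes stochastic pooling from max pooling.<br />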
<br />
'''Wavelets and Wavelet Transform'''<br />
A wavelet is a representation of some signal. For use in wavelet transforms, they are generally represented as combinations of basis signal functions.<br />
<br />
The wavelet transform involves taking the inner product of a signal (in this case, the image), with these basis functions. This produces a set of coefficients for the signal. These coefficients can then be quantized and coded in order to compress the image.<br />
<br />
One issue of note is that wavelets offer a tradeoff between resolution in frequency and resolution in time (or, presumably, image location). For example, a sine wave is useful for detecting signals at its own frequency, but cannot detect where along the sine wave this alignment of signals occurs. Thus, basis functions must be chosen with this tradeoff in mind.<br />
<br />
Source: Compressing still and moving images with wavelets<br />
<br />
== Proposed Method ==<br />
<br />
The proposed pooling method uses wavelets (i.e. small wave - generally used in signal processing) to reduce the dimensions of the feature maps. They use wavelet transform to minimize artifacts resulting from neighborhood reduction. They postulate that their approach, which discards the first-order sub-bands, more organically captures the data compression.<br />
<br />
* '''Forward Propagation'''<br />
<br />
The proposed wavelet pooling scheme pools features by performing a 2nd order decomposition in the wavelet domain according to the fast wavelet transform (FWT) which is a more efficient implementation of the two-dimensional discrete wavelet transform (DWT) as follows:<br />
<br />
\begin{align}<br />
W_{\varphi}[j+1,k] = h_{\varphi}[-n]*W_{\varphi}[j,n]|_{n=2k,k\leq0}<br />
\end{align}<br />
<br />
\begin{align}<br />
W_{\psi}[j+1,k] = h_{\psi}[-n]*W_{\varphi}[j,n]|_{n=2k,k\leq0}<br />
\end{align}<br />
<br />
where <math>\varphi</math> is the approximation function, <math>\psi</math> is the detail function, and <math>W_{\varphi},W_{\psi}</math> are called the approximation and detail coefficients. <math>h_{\varphi}[-n]</math> and <math>h_{\psi}[-n]</math> are the time-reversed scaling and wavelet vectors, <math>n</math> represents the sample in the vector, and <math>j</math> denotes the resolution level.<br />
<br />
When using the FWT on images, it is applied twice (once on the rows, then again on the columns). By doing this in combination, the detail sub-bands (LH, HL, HH) at each decomposition level and the approximation sub-band (LL) for the highest decomposition level are obtained.<br />
After performing the 2nd order decomposition, the image features are reconstructed, but only using the 2nd order wavelet sub-bands. This method pools the image features by a factor of 2 using the inverse FWT (IFWT), which is based on the inverse DWT (IDWT).<br />
<br />
\begin{align}<br />
W_{\varphi}[j,k] = h_{\varphi}[-n]*W_{\varphi}[j+1,n]+h_{\psi}[-n]*W_{\psi}[j+1,n]|_{n=\frac{k}{2},k\leq0}<br />
\end{align}<br />
<br />
[[File: wavelet pooling forward.PNG| 700px|center]]<br />
<br />
<br />
* '''Backpropagation'''<br />
<br />
The proposed wavelet pooling algorithm performs backpropagation by reversing the process of its forward propagation. First, the image feature being backpropagated undergoes 1st order wavelet decomposition. After decomposition, the detail coefficient sub-bands are up-sampled by a factor of 2 to create a new 1st level decomposition. The initial decomposition then becomes the 2nd level decomposition. Finally, this new 2nd order wavelet decomposition reconstructs the image feature for further backpropagation using the IDWT. Figure 5 illustrates the wavelet pooling backpropagation algorithm in detail:<br />
<br />
[[File:wavelet pooling backpropagation.PNG| 700px|center]]<br />
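
The forward pass described in this section (decompose to the 2nd level, discard the 1st level detail sub-bands, reconstruct from the 2nd level sub-bands) can be sketched as follows. This is a minimal plain-Python illustration and not the authors' MatConvNet implementation: it assumes the orthonormal Haar wavelet, a single feature map whose height and width are divisible by 4, and hypothetical function names. Sub-band naming conventions (LH versus HL) vary between references.<br />

```python
def haar_dwt2(x):
    """One-level 2D orthonormal Haar transform; returns (LL, LH, HL, HH)."""
    rows, cols = len(x) // 2, len(x[0]) // 2
    ll = [[0.0] * cols for _ in range(rows)]
    lh = [[0.0] * cols for _ in range(rows)]
    hl = [[0.0] * cols for _ in range(rows)]
    hh = [[0.0] * cols for _ in range(rows)]
    for i in range(rows):
        for j in range(cols):
            a, b = x[2 * i][2 * j], x[2 * i][2 * j + 1]
            c, d = x[2 * i + 1][2 * j], x[2 * i + 1][2 * j + 1]
            ll[i][j] = (a + b + c + d) / 2   # approximation
            lh[i][j] = (a - b + c - d) / 2   # detail across columns
            hl[i][j] = (a + b - c - d) / 2   # detail across rows
            hh[i][j] = (a - b - c + d) / 2   # diagonal detail
    return ll, lh, hl, hh


def haar_idwt2(ll, lh, hl, hh):
    """Inverse of haar_dwt2: rebuild a matrix twice the size of each sub-band."""
    rows, cols = len(ll), len(ll[0])
    x = [[0.0] * (2 * cols) for _ in range(2 * rows)]
    for i in range(rows):
        for j in range(cols):
            s, h, v, g = ll[i][j], lh[i][j], hl[i][j], hh[i][j]
            x[2 * i][2 * j] = (s + h + v + g) / 2
            x[2 * i][2 * j + 1] = (s - h + v - g) / 2
            x[2 * i + 1][2 * j] = (s + h - v - g) / 2
            x[2 * i + 1][2 * j + 1] = (s - h - v + g) / 2
    return x


def wavelet_pool(x):
    """Pool by a factor of 2 using only the 2nd level sub-bands."""
    ll1, _, _, _ = haar_dwt2(x)            # level 1: detail sub-bands are discarded
    ll2, lh2, hl2, hh2 = haar_dwt2(ll1)    # level 2 decomposition of LL1
    return haar_idwt2(ll2, lh2, hl2, hh2)  # reconstruct from level 2 only


features = [[1, 2, 3, 4],
            [5, 6, 7, 8],
            [9, 10, 11, 12],
            [13, 14, 15, 16]]
pooled = wavelet_pool(features)   # 2x2 output: [[7.0, 11.0], [23.0, 27.0]]
```

In this sketch the Haar transform is exactly invertible, so the reconstruction from the 2nd level sub-bands reproduces the 1st level approximation sub-band: each output is the sum of a 2x2 input block divided by 2.<br />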
<br />
== Results and Discussion ==<br />
<br />
All experiments have been performed using the MatConvNet (Vedaldi & Lenc, 2015) framework. Stochastic gradient descent has been used for training. For the proposed method, the Haar wavelet has been chosen as the basis wavelet for its property of having even, square sub-bands. All CNN structures except for MNIST use a network loosely based on Zeiler's network (Zeiler & Fergus, 2013). The experiments are repeated with Dropout (Srivastava, 2013), and Local Response Normalization (Krizhevsky, 2009) is replaced with Batch Normalization (Ioffe & Szegedy, 2015) for CIFAR-10 and SVHN (Dropout only), to examine how these regularization techniques change the pooling results. The authors have tested the proposed method on four different datasets as shown in the figure:<br />
<br />
[[File: selection of image datasets.PNG| 700px|center]]<br />
<br />
Different methods based on Max, Average, Mixed, Stochastic and Wavelet have been used at the pooling section of each architecture. Accuracy and Model Energy have been used as the metrics to evaluate the performance of the proposed methods. These have been evaluated and their performances have been compared on different data-sets.<br />
<br />
* MNIST:<br />
<br />
The network architecture is based on the example MNIST structure from MatConvNet, with batch normalization inserted. All other parameters are the same. The figure below shows their network structure for the MNIST experiments.<br />
<br />
[[File: CNN MNIST.PNG| 700px|center]]<br />
<br />
The input training data and test data come from the MNIST database of handwritten digits. The full training set of 60,000 images is used, as well as the full testing set of 10,000 images. The table below shows their proposed method outperforms all methods. Given the small number of epochs, max pooling is the only method to start to over-fit the data during training. Mixed and stochastic pooling show a rocky trajectory but do not over-fit. Average and wavelet pooling show a smoother descent in learning and error reduction. The figure below shows the energy of each method per epoch.<br />
<br />
[[File: MNIST pooling method energy.PNG| 700px|center]]<br />
<br />
<br />
The accuracies for both paradigms are shown below:<br />
<br />
<br />
[[File: MNIST perf.PNG| 700px|center]]<br />
<br />
* CIFAR-10:<br />
<br />
The authors perform two sets of experiments with the pooling methods. The first is a regular network structure with no dropout layers. They use this network to observe each pooling method without extra regularization. The second uses dropout and batch normalization and performs over 30 more epochs to observe the effects of these changes. <br />
<br />
[[File: CNN CIFAR.PNG| 700px|center]]<br />
<br />
The input training and test data come from the CIFAR-10 dataset. <br />
The full training set of 50,000 images is used, as well as the full testing set of 10,000 images. For both cases, with and without dropout, the tables below show that the proposed method has the second highest accuracy.<br />
<br />
[[File: fig0000.jpg| 700px|center]]<br />
<br />
Max pooling over-fits fairly quickly, while wavelet pooling resists over-fitting. The change in learning rate prevents their method from over-fitting, and it continues to show a slower propensity for learning. Mixed and stochastic pooling maintain a consistent progression of learning, and their validation sets trend at a similar, but better rate than their proposed method. Average pooling shows the smoothest descent in learning and error reduction, especially in the validation set. The energy of each method per epoch is also shown below:<br />
<br />
[[File: CIFAR_pooling_method_energy.PNG| 700px|center]]<br />
<br />
<br />
* SVHN:<br />
<br />
Two sets of experiments are performed with the pooling methods. The first is a regular network structure with no dropout layers. As with the previous datasets, they use this network to observe each pooling method without extra regularization.<br />
The second network uses dropout to observe the effects of this change. The figure below shows their network structure for the SVHN experiments:<br />
<br />
[[File: CNN SHVN.PNG| 700px|center]]<br />
<br />
The input training and test data come from the SVHN dataset. For the case with no dropout, they use 55,000 images from the training set. For the case with dropout, they use the full training set of 73,257 images, a validation set of 30,000 images they extract from the extra training set of 531,131 images, as well as the full testing set of 26,032 images. For both cases, with and without dropout, the tables below show that their proposed method has the second lowest accuracy.<br />
<br />
[[File: SHVN perf.PNG| 700px|center]]<br />
<br />
Max and wavelet pooling both slightly over-fit the data. Their method follows the path of max pooling but performs slightly better in maintaining some stability. Mixed, stochastic, and average pooling maintain a slow progression of learning, and their validation sets trend at near identical rates. The figure below shows the energy of each method per epoch.<br />
<br />
[[File: SHVN pooling method energy.PNG| 700px|center]]<br />
<br />
* KDEF:<br />
<br />
They run one set of experiments with the pooling methods that includes dropout. The figure below shows their network structure for the KDEF experiments:<br />
<br />
[[File:CNN KDEF.PNG| 700px|center]]<br />
<br />
The input training and test data come from the KDEF dataset. This dataset contains 4,900 images of 35 people displaying seven basic emotions (afraid, angry, disgusted, happy, neutral, sad, and surprised) using facial expressions. They display emotions at five poses (full left and right profiles, half left and right profiles, and straight).<br />
<br />
This dataset contains a few errors that they have fixed (missing or corrupted images, uncropped images, etc.). All of the missing images are at angles of -90, -45, 45, or 90 degrees. They fix the missing and corrupt images by mirroring their counterparts in MATLAB and adding them back to the dataset. They manually crop the images that need to match the dimensions set by the creators (762 x 562).<br />
KDEF does not designate a training or test data set. They shuffle the data and separate 3,900 images as training data, and 1,000 images as test data. They resize the images to 128x128 because of memory and time constraints.<br />
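
The data preparation described above (mirroring a counterpart image to replace a missing one, then shuffling and splitting into 3,900 training and 1,000 test images) can be sketched as follows. This is a plain-Python stand-in and not the authors' MATLAB code; the function names, the integer image identifiers, and the fixed seed are assumptions made for illustration.<br />

```python
import random


def mirror(image):
    """Horizontally flip an image given as a list of pixel rows."""
    return [row[::-1] for row in image]


def shuffle_split(items, n_train, seed=0):
    """Shuffle a copy of the items and split it into training and test parts."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    return shuffled[:n_train], shuffled[n_train:]


# Tiny hypothetical image: mirroring swaps left and right within every row.
flipped = mirror([[1, 2, 3],
                  [4, 5, 6]])     # [[3, 2, 1], [6, 5, 4]]

# Integer identifiers stand in for the 4,900 KDEF images.
image_ids = range(4900)
train_ids, test_ids = shuffle_split(image_ids, 3900)
```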
<br />
The dropout layers regularize the network and maintain stability even though some pooling methods are known to over-fit. The table below shows their proposed method has the second highest accuracy. Max pooling eventually over-fits, while wavelet pooling resists over-fitting. Average and mixed pooling resist over-fitting but are unstable for most of the learning. Stochastic pooling maintains a consistent progression of learning. Wavelet pooling also follows a smoother, consistent progression of learning.<br />
The figure below shows the energy of each method per epoch.<br />
<br />
[[File: KDEF pooling method energy.PNG| 700px|center]]<br />
<br />
The accuracies for both paradigms are shown below:<br />
<br />
[[File: KDEF perf.PNG| 700px|center]]<br />
<br />
== Conclusion ==<br />
<br />
They show that wavelet pooling has the potential to equal or eclipse some of the traditional methods currently utilized in CNNs. Their proposed method outperforms all others on the MNIST dataset, outperforms all but one on the CIFAR-10 and KDEF datasets, and performs within a respectable range of the pooling methods that outdo it on the SVHN dataset. The addition of dropout and batch normalization shows the proposed method's response to network regularization. As in the non-dropout cases, it outperforms all but one on both the CIFAR-10 and KDEF datasets and performs within a respectable range of the pooling methods that outdo it on the SVHN dataset.<br />
<br />
== Suggested Future work ==<br />
<br />
The upsampling and downsampling factors in decomposition and reconstruction need to be changed to achieve more feature reduction.<br />
The subbands that are currently discarded should be kept to achieve higher accuracy. To achieve higher computational efficiency, the FWT method needs to be improved.<br />
<br />
== Critiques and Suggestions ==<br />
* The backpropagation process, which could be a strong point of the study, is not described in enough detail compared to existing methods.<br />
* The main study is on wavelet decomposition, yet the reasons for choosing Haar as the mother wavelet and for the number of decomposition levels are not described; they are only mentioned as future work. <br />
* At the beginning, the study mentions that pooling methods do not receive the attention they should. In the end, the results show that the choice of pooling method depends on the dataset, and the authors mention trial and error as a reasonable approach to choosing it. In my view, the authors have not focused on providing a pooling method that effectively improves on the current situation. At the least, trying to extract a clearer pattern relating the results to the dataset structure would be helpful.<br />
* The origin of average pooling, which is one of the main pooling algorithms used for comparison, is not even referenced in the introduction.<br />
* Combining wavelet, max, and average pooling is an interesting option to investigate further, both in sequence (max/average after wavelet pooling) and combined as in mixed pooling.<br />
* While the current datasets demonstrate the performance of the proposed method appropriately, it would be a good idea to evaluate the method on some larger datasets. This may help to understand whether the size of a dataset affects the over-fitting behavior of max pooling mentioned in the paper.<br />
<br />
== References ==<br />
<br />
Williams, Travis, and Robert Li. "Wavelet Pooling for Convolutional Neural Networks." (2018).<br />
<br />
Hilton, Michael L., Björn D. Jawerth, and Ayan Sengupta. "Compressing still and moving images with wavelets." Multimedia systems 2.5 (1994): 218-227.</div>Mpaflahttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Learning_to_Teach&diff=37862Learning to Teach2018-11-06T00:00:32Z<p>Mpafla: /* Introduction */</p>
<hr />
<div><br />
<br />
=Introduction=<br />
<br />
This paper proposed the "learning to teach" (L2T) framework with two intelligent agents: a student model/agent, corresponding to the learner in traditional machine learning algorithms, and a teacher model/agent, determining the appropriate data, loss function, and hypothesis space to facilitate the learning of the student model.<br />
<br />
In modern human society, teaching plays a central role in our education system, whose goal is to equip students with the necessary knowledge and skills in an efficient manner. This is the fundamental ''student'' and ''teacher'' framework on which education stands. However, in the field of artificial intelligence, and specifically machine learning, researchers have focused most of their efforts on the ''student'', i.e. designing various optimization algorithms to enhance the learning ability of intelligent agents. The paper argues that a formal study on the role of ‘teaching’ in AI is required. Analogous to teaching in human society, the teaching framework can select training data, which corresponds to choosing the right teaching materials (e.g. textbooks); design the loss functions, which corresponds to setting up targeted examinations; and define the hypothesis space, which corresponds to imparting the proper methodologies. Furthermore, an optimization framework (instead of heuristics) should be used to update the teaching skills based on the feedback from students, so as to achieve teacher-student co-evolution.<br />
<br />
Thus, the training phase of L2T would have several episodes of interactions between the teacher and the student model. Based on the state information in each step, the teacher model would update the teaching actions so that the student model could perform better on the Machine Learning problem. The student model would then provide reward signals back to the teacher model. These reward signals are used by the teacher model as part of the Reinforcement Learning process to update its parameters. This process is end-to-end trainable and the authors are convinced that once converged, the teacher model could be applied to new learning scenarios and even new students, without extra efforts on re-training.<br />
<br />
To demonstrate the practical value of the proposed approach, the '''training data scheduling''' problem is chosen as an example. The authors show that by using the proposed method to adaptively select the most<br />
suitable training data, they can significantly improve the accuracy and convergence speed of various neural networks including multi-layer perceptron (MLP), convolutional neural networks (CNNs)<br />
and recurrent neural networks (RNNs), for different applications including image classification and text understanding.<br />
<br />
=Related Work=<br />
The L2T framework connects with two emerging trends in machine learning. The first is the movement from simple to advanced learning. This includes meta-learning (Schmidhuber, 1987; Thrun & Pratt, 2012) which explores automatic learning by transferring learned knowledge from meta tasks [1]. This approach has been applied to few-shot learning scenarios and in designing general optimizers and neural network architectures. (Hochreiter et al., 2001; Andrychowicz et al., 2016; Li & Malik, 2016; Zoph & Le, 2017)<br />
<br />
The second is teaching, which can be classified into machine teaching (Zhu, 2015) [2] and hardness-based methods. The former seeks to construct a minimal training set for the student to learn a target model (i.e. an oracle). The latter assumes an order of data from easy instances to hard ones, with hardness being determined in different ways. Curriculum learning (CL) (Bengio et al, 2009; Spitkovsky et al. 2010; Tsvetkov et al, 2016) [3] measures hardness through heuristics of the data, while self-paced learning (SPL) (Kumar et al., 2010; Lee & Grauman, 2011; Jiang et al., 2014; Supancic & Ramanan, 2013) [4] measures hardness by the loss on the data. <br />
<br />
The limitations of these works boil down to a lack of formally defined teaching problem as well as the reliance on heuristics and fixed rules for teaching which hinders generalization of the teaching task.<br />
<br />
=Learning to Teach=<br />
To introduce the problem and framework, without loss of generality, consider the setting of supervised learning.<br />
<br />
In supervised learning, each sample <math>x</math> is from a fixed but unknown distribution <math>P(x)</math>, and the corresponding label <math> y </math> is from a fixed but unknown distribution <math>P(y|x) </math>. The goal is to find a function <math>f_\omega(x)</math> with parameter vector <math>\omega</math> that minimizes the gap between the predicted label and the actual label.<br />
<br />
<br />
<br />
<br />
==Problem Definition==<br />
The student model, denoted &mu;(), takes the set of training data <math> D </math>, the function class <math> Ω </math>, and the loss function <math> L </math> as input and outputs a function <math> f_{ω^*} </math> whose parameter <math>ω^*</math> minimizes the risk <math>R(ω)</math>, as in:<br />
<br />
\begin{align*}<br />
\omega^* = \arg\min_{\omega \in \Omega} \sum_{(x,y) \in D} L(y, f_\omega(x)) =: \mu (D, L, \Omega)<br />
\end{align*}<br />
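
As a small illustration of the definition above, the sketch below implements a student <math>\mu(D, L, \Omega)</math> by exhaustive search over a finite hypothesis space. The linear model <math>f_\omega(x) = \omega x</math>, the squared loss, the candidate parameters, and the data are all assumptions chosen for the example; real students in the paper are neural networks trained with SGD.<br />

```python
def student(data, loss, omega_space):
    """mu(D, L, Omega): return the parameter that minimizes the summed loss."""
    def empirical_risk(omega):
        # f_omega(x) = omega * x is a hypothetical one-parameter model.
        return sum(loss(y, omega * x) for x, y in data)
    return min(omega_space, key=empirical_risk)


def squared_loss(y, prediction):
    """L(y, f(x)) = (y - f(x))^2."""
    return (y - prediction) ** 2


data = [(1.0, 2.1), (2.0, 3.9), (3.0, 6.2)]   # D: hypothetical (x, y) pairs
omega_space = [0.0, 1.0, 2.0, 3.0]            # Omega: finite set of candidates

omega_star = student(data, squared_loss, omega_space)   # 2.0
```

Changing any of the three arguments (the data, the loss, or the hypothesis space) changes what the student learns, and these are exactly the quantities the teacher model controls.<br />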
<br />
The teaching model, denoted φ, tries to provide <math> D </math>, <math> L </math>, and <math> Ω </math> (or any combination, denoted <math> A </math>) to the student model such that the student model either achieves lower risk R(ω) or progresses as fast as possible.<br />
<br />
::'''Training Data''': Outputting a good training set <math> D </math>, analogous to human teachers providing students with proper learning materials such as textbooks.<br />
::'''Loss Function''': Designing a good loss function <math> L </math> , analogous to providing useful assessment criteria for students.<br />
::'''Hypothesis Space''': Defining a good function class <math> Ω </math> which the student model can select from. This is analogous to human teachers providing appropriate context, e.g. middle school students are taught math with basic algebra while undergraduate students are taught with calculus. Different choices of Ω lead to different errors and optimization problems (Mohri et al., 2012).<br />
<br />
==Framework==<br />
The training phase consists of the teacher providing the student with the subset <math> A_{train} </math> of <math> A </math> and then taking feedback to improve its own parameters. The L2T process is outlined in figure below:<br />
<br />
[[File: L2T_process.png | 500px|center]]<br />
<br />
* <math> s_t &isin; S </math> represents information available to the teacher model at time <math> t </math>. <math> s_t </math> is typically constructed from the current student model <math> f_{t−1} </math> and the past teaching history of the teacher model.<br />
* <math> a_t &isin; A </math> represents the action taken by the teacher model at time <math> t </math>. It can be any combination of teaching tasks involving the training data, loss function, and hypothesis space.<br />
* <math> φ_θ : S → A </math> is the policy used by the teacher model to generate its action <math> φ_θ(s_t) = a_t </math>.<br />
* Student model takes <math> a_t </math> as input and outputs function <math> f_t </math>, by using the conventional ML techniques.<br />
<br />
Once the training process converges, the teacher model may be utilized to teach a different subset of <math> A </math> or teach a different student model.<br />
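
The interaction described in this section can be sketched as a simple loop. The example below is a toy illustration under strong assumptions: the student is a single scalar parameter, the teacher's action is the subset of each batch to train on, both teacher policies are fixed hand-written rules, and the step that updates the teacher's parameters from the reward is omitted. All names are hypothetical.<br />

```python
def run_episode(teacher_policy, student_step, evaluate, batches):
    """One episode: the teacher acts on each state, then the student trains."""
    student = 0.0                                  # hypothetical scalar student parameter
    for t, batch in enumerate(batches):
        state = {"step": t, "student": student}    # s_t: information available to the teacher
        action = teacher_policy(state, batch)      # a_t = phi_theta(s_t): data to train on
        student = student_step(student, action)    # the student produces f_t from a_t
    return evaluate(student)                       # reward signal returned to the teacher


def keep_all(state, batch):
    """Baseline teacher: pass every instance to the student."""
    return batch


def drop_outliers(state, batch):
    """Toy teacher: filter out instances with a large magnitude."""
    return [x for x in batch if abs(x) <= 10.0]


def student_step(student, selected, lr=0.5):
    """Move the student parameter toward the mean of the selected data."""
    if not selected:
        return student
    mean = sum(selected) / len(selected)
    return student + lr * (mean - student)


def evaluate(student, target=2.0):
    """Reward: negative distance to a hypothetical target value."""
    return -abs(student - target)


batches = [[2.0, 2.0, 100.0], [1.0, 3.0], [2.0, 50.0]]
reward_plain = run_episode(keep_all, student_step, evaluate, batches)
reward_taught = run_episode(drop_outliers, student_step, evaluate, batches)   # -0.25
```

In this toy setting the filtering teacher yields a higher reward than the baseline, which is the kind of feedback the teacher model would use to improve its policy.<br />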
<br />
=Application=<br />
<br />
There are different approaches to training the teacher model; this paper applies reinforcement learning, with <math> φ_θ </math> being the ''policy'' that interacts with <math> S </math>, the ''environment''. The paper applies data teaching to train a deep neural network student, <math> f </math>, for several classification tasks. Thus the student feedback measure will be classification accuracy. Its learning rule will be mini-batch stochastic gradient descent, where batches of data will arrive sequentially in random order. The teacher model is responsible for providing the training data, which in this case means it must determine which instances (subset) of the mini-batch of data will be fed to the student.<br />
<br />
The optimizer for training the teacher model is maximum expected reward: <br />
<br />
\begin{align} <br />
J(θ) = E_{φ_θ(a|s)}[R(s,a)]<br />
\end{align}<br />
<br />
<math> J(θ) </math> is non-differentiable w.r.t. <math> θ </math>, so a likelihood ratio policy gradient algorithm is used to optimize it (Williams, 1992) [4]<br />
<br />
==Experiments==<br />
<br />
The L2T framework is tested on the following student models: multi-layer perceptron (MLP), ResNet (CNN), and Long-Short-Term-Memory network (RNN). <br />
<br />
The student tasks are image classification on MNIST and CIFAR-10, and sentiment classification on the IMDB movie review dataset. <br />
<br />
The strategy will be benchmarked against the following teaching strategies:<br />
<br />
::'''NoTeach''': The student model is trained by conventional mini-batch SGD on all of the data, without any teaching strategy (no data filtering).<br />
::'''Self-Paced Learning (SPL)''': Teaching by ''hardness'' of data, defined as the loss. This strategy begins by filtering out data with larger loss value to train the student with "easy" data and gradually increases the hardness.<br />
::'''L2T''': The Learning to Teach framework.<br />
::'''RandTeach''': Randomly filter data in each epoch according to the logged ratio of filtered data instances per epoch (as opposed to deliberate and dynamic filtering by L2T).<br />
<br />
When teaching a new student with the same model architecture, we observe that L2T achieves significantly faster convergence than other strategies across all tasks:<br />
<br />
[[File: L2T_speed.png | 1100px|center]]<br />
<br />
===Filtration Number===<br />
<br />
When investigating the number of data instances filtered per epoch, for the two image classification tasks the L2T teacher filters an increasing amount of data as training goes on. Thus for the two image classification tasks, the student model can learn from hard instances of data from the beginning of training, while for the natural language task the student model must first learn from easy data instances.<br />
<br />
[[File: L2T_fig3.png | 1100px|center]]<br />
<br />
===Teaching New Student with Different Model Architecture===<br />
<br />
In this part, a teacher model is first trained by interacting with a student model. The trained teacher is then used to teach another student model which has a different architecture.
The results of applying the teacher trained on ResNet32 to teach other architectures are shown below.<br />
<br />
[[File: L2T_fig4.png | 1100px|center]]<br />
<br />
===Training Time Analysis===<br />
<br />
The learning curves show that L2T reaches a given accuracy in less training time than the other strategies. This is especially evident during the earlier training stages.<br />
<br />
[[File: L2T_fig5.png | 600px|center]]<br />
<br />
===Accuracy Improvement===<br />
<br />
When comparing training accuracy on the IMDB sentiment classification task, the teaching policy learned by L2T improves over NoTeach and SPL.<br />
<br />
[[File: L2T_t1.png | 500px|center]]<br />
<br />
=Future Work=<br />
<br />
There is some useful future work that can be extended from this work: <br />
<br />
1) Recent advances in multi-agent reinforcement learning could be tried on the Reinforcement Learning problem formulation of this paper. <br />
<br />
2) Some human in the loop architectures like CHAT and HAT (https://www.ijcai.org/proceedings/2017/0422.pdf) should give better results for the same framework. <br />
<br />
3) It would be interesting to try out the framework suggested in this paper (L2T) in Imperfect information and partially observable settings. <br />
<br />
=Critique=<br />
<br />
While the conceptual framework of L2T is sound, the paper only experimentally demonstrates efficacy for ''data teaching'', which would seem to be the simplest to implement. The feasibility and effectiveness of teaching the loss function and hypothesis space are not explored in a real-world scenario. Furthermore, the experimental results for data teaching suggest that the speed of convergence is the main improvement over other teaching strategies, whereas the difference in accuracy is less remarkable. The paper also assesses accuracy only by comparing L2T with NoTeach and SPL on the IMDB classification task; the improvement (or lack thereof) on the other classification tasks and teaching strategies is omitted. Again, this distinction is not possible to assess in loss function or hypothesis space teaching within the scope of this paper.</div>Mpaflahttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Learning_to_Teach&diff=37861Learning to Teach2018-11-05T23:59:24Z<p>Mpafla: /* Problem Definition */</p>
<hr />
<div><br />
<br />
=Introduction=<br />
<br />
This paper proposed the "learning to teach" (L2T) framework with two intelligent agents: a student model/agent, corresponding to the learner in traditional machine learning algorithms, and a teacher model/agent, determining the appropriate data, loss function, and hypothesis space to facilitate the learning of the student model.<br />
<br />
In modern human society, teaching plays a central role in the education system, whose goal is to equip students with the necessary knowledge and skills in an efficient manner. This is the fundamental ''student'' and ''teacher'' framework on which education stands. However, in the field of artificial intelligence and specifically machine learning, researchers have focused most of their efforts on the ''student'', i.e. designing various optimization algorithms to enhance the learning ability of intelligent agents. The paper argues that a formal study on the role of ‘teaching’ in AI is required. Analogous to teaching in human society, the teaching framework can select training data, which corresponds to choosing the right teaching materials (e.g. textbooks); design the loss functions, which corresponds to setting up targeted examinations; and define the hypothesis space, which corresponds to imparting the proper methodologies. Furthermore, an optimization framework (instead of heuristics) should be used to update the teaching skills based on the feedback from students, so as to achieve teacher-student co-evolution.<br />
<br />
Thus, the training phase of L2T would have several episodes of interactions between the teacher and the student model. Based on the state information in each step, the teacher model would update the teaching actions so that the student model could perform better on the Machine Learning problem. The student model would then provide reward signals back to the teacher model. These reward signals are used by the teacher model as part of the Reinforcement Learning process to update its parameters. This process is end-to-end trainable and the authors are convinced that once converged, the teacher model could be applied to new learning scenarios and even new students, without extra efforts on re-training.<br />
<br />
To demonstrate the practical value of the proposed approach, the authors choose a specific problem, '''training data scheduling''', as an example. They show that by using the proposed method to adaptively select the most suitable training data, they can significantly improve the accuracy and convergence speed of various neural networks, including multi-layer perceptrons (MLPs), convolutional neural networks (CNNs) and recurrent neural networks (RNNs), for different applications including image classification and text understanding.<br />
<br />
=Related Work=<br />
The L2T framework connects with two emerging trends in machine learning. The first is the movement from simple to advanced learning. This includes meta-learning (Schmidhuber, 1987; Thrun & Pratt, 2012) which explores automatic learning by transferring learned knowledge from meta tasks [1]. This approach has been applied to few-shot learning scenarios and in designing general optimizers and neural network architectures. (Hochreiter et al., 2001; Andrychowicz et al., 2016; Li & Malik, 2016; Zoph & Le, 2017)<br />
<br />
The second is teaching, which can be classified into machine teaching (Zhu, 2015) [2] and hardness-based methods. The former seeks to construct a minimal training set for the student to learn a target model (i.e. an oracle). The latter assumes an order of data from easy instances to hard ones, with hardness determined in different ways. Curriculum learning (CL) (Bengio et al, 2009; Spitkovsky et al. 2010; Tsvetkov et al, 2016) [3] measures hardness through heuristics of the data, while self-paced learning (SPL) (Kumar et al., 2010; Lee & Grauman, 2011; Jiang et al., 2014; Supancic & Ramanan, 2013) [4] measures hardness by loss on data. <br />
<br />
The limitations of these works boil down to the lack of a formally defined teaching problem, as well as a reliance on heuristics and fixed rules for teaching, which hinders generalization of the teaching task.<br />
<br />
=Learning to Teach=<br />
To introduce the problem and framework, without loss of generality, consider the setting of supervised learning.<br />
<br />
In supervised learning, each sample <math>x</math> is from a fixed but unknown distribution <math>P(x)</math>, and the corresponding label <math> y </math> is from a fixed but unknown distribution <math>P(y|x) </math>. The goal is to find a function <math>f_\omega(x)</math> with parameter vector <math>\omega</math> that minimizes the gap between the predicted label and the actual label.<br />
<br />
<br />
<br />
<br />
==Problem Definition==<br />
The student model, denoted &mu;(), takes the set of training data <math> D </math>, the function class <math> Ω </math>, and loss function <math> L </math> as input and outputs a function <math> f_{ω^*} </math> whose parameter <math>ω^*</math> minimizes the risk <math>R(ω)</math>, as in:<br />
<br />
\begin{align*}<br />
ω^* = \arg\min_{ω \in \Omega} \sum_{(x,y) \in D} L(y, f_ω(x)) =: \mu (D, L, \Omega)<br />
\end{align*}<br />
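As a rough illustration of the student model &mu;(D, L, Ω), the sketch below performs the arg min by brute force. It assumes a toy setting that is not from the paper: a one-parameter student <math> f_ω(x) = ωx </math>, squared loss, and a finite grid standing in for Ω. The names <code>student_mu</code>, <code>squared_loss</code> and <code>omega_grid</code> are hypothetical.<br />

```python
def squared_loss(y, y_hat):
    """Squared error between a label and a prediction (stand-in for L)."""
    return (y - y_hat) ** 2


def student_mu(D, loss, omega_candidates):
    """Toy student mu(D, L, Omega): return the candidate parameter with the
    lowest summed loss on D. The model f_omega(x) = omega * x and the finite
    candidate list are illustrative assumptions, not the paper's setup."""
    def empirical_risk(omega):
        return sum(loss(y, omega * x) for x, y in D)
    return min(omega_candidates, key=empirical_risk)


# Training set D generated by y = 2x; Omega approximated by a grid on [0, 4].
D = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]
omega_grid = [i / 10 for i in range(41)]
omega_star = student_mu(D, squared_loss, omega_grid)
```

Changing any of the three arguments (the data, the loss, or the candidate set) changes the minimizer the student returns, which is exactly the lever the teacher model is meant to control.<br />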
<br />
The teaching model, denoted φ, tries to provide <math> D </math>, <math> L </math>, and <math> Ω </math> (or any combination, denoted <math> A </math>) to the student model such that the student model either achieves lower risk R(ω) or progresses as fast as possible.<br />
<br />
::'''Training Data''': Outputting a good training set <math> D </math>, analogous to human teachers providing students with proper learning materials such as textbooks.<br />
::'''Loss Function''': Designing a good loss function <math> L </math> , analogous to providing useful assessment criteria for students.<br />
::'''Hypothesis Space''': Defining a good function class <math> Ω </math> which the student model can select from. This is analogous to human teachers providing appropriate context, e.g. middle school students are taught math with basic algebra while undergraduate students are taught with calculus. Different choices of <math> Ω </math> lead to different errors and optimization problems (Mohri et al., 2012).<br />
<br />
==Framework==<br />
The training phase consists of the teacher providing the student with the subset <math> A_{train} </math> of <math> A </math> and then taking feedback to improve its own parameters. The L2T process is outlined in the figure below:<br />
<br />
[[File: L2T_process.png | 500px|center]]<br />
<br />
* <math> s_t &isin; S </math> represents information available to the teacher model at time <math> t </math>. <math> s_t </math> is typically constructed from the current student model <math> f_{t−1} </math> and the past teaching history of the teacher model.<br />
* <math> a_t &isin; A </math> represents the action taken by the teacher model at time <math> t </math>. It can be any combination of teaching tasks involving the training data, loss function, and hypothesis space.<br />
* <math> φ_θ : S → A </math> is the policy used by the teacher model to generate its action <math> φ_θ(s_t) = a_t </math>.<br />
* The student model takes <math> a_t </math> as input and outputs the function <math> f_t </math>, using conventional ML techniques.<br />
<br />
Once the training process converges, the teacher model may be utilized to teach a different subset of <math> A </math> or teach a different student model.<br />
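The interaction between state, action, policy and student can be sketched as a simple loop. This is a minimal sketch under strong assumptions that are not from the paper: the student is a one-parameter linear model trained by SGD, the state <math> s_t </math> is just the current weight and step index, and the teacher policy is a hypothetical loss-threshold filter controlled by a single parameter <code>theta</code>. The reward signal and the teacher's own update are omitted.<br />

```python
import random


def teacher_policy(theta, state, batch):
    """Hypothetical policy phi_theta: keep the examples whose current
    squared loss under the student is at most theta (the action a_t)."""
    w = state["w"]
    return [(x, y) for x, y in batch if (y - w * x) ** 2 <= theta]


def student_step(w, kept, lr=0.05):
    """One mini-batch SGD step for a toy student f(x) = w * x (squared loss)."""
    if not kept:
        return w  # the teacher filtered everything, so no update happens
    grad = sum(-2 * x * (y - w * x) for x, y in kept) / len(kept)
    return w - lr * grad


def run_episode(theta, data, steps=300, batch_size=4, seed=0):
    """Teacher and student interact for a fixed number of steps."""
    rng = random.Random(seed)
    w = 0.0
    for t in range(steps):
        batch = rng.sample(data, batch_size)
        state = {"w": w, "t": t}                    # s_t: student + history
        kept = teacher_policy(theta, state, batch)  # a_t = phi_theta(s_t)
        w = student_step(w, kept)                   # student produces f_t
    return w


data = [(k / 4, 3 * k / 4) for k in range(1, 9)]    # samples of y = 3x
w_all = run_episode(float("inf"), data)             # teacher keeps every example
w_none = run_episode(0.0, data)                     # teacher filters every example
```

With the threshold at infinity the loop reduces to ordinary SGD, and with a threshold of zero the student never updates; a learned teacher would sit between these extremes and adjust its choices as the student's state changes.<br />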
<br />
=Application=<br />
<br />
There are different approaches to training the teacher model. This paper applies reinforcement learning, with <math> φ_θ </math> being the ''policy'' that interacts with <math> S </math>, the ''environment''. The paper applies data teaching to train a deep neural network student, <math> f </math>, for several classification tasks, so the student feedback measure is classification accuracy. The student's learning rule is mini-batch stochastic gradient descent, where batches of data arrive sequentially in random order. The teacher model is responsible for providing the training data, which in this case means it must determine which instances (subset) of each mini-batch are fed to the student.<br />
<br />
The objective for training the teacher model is to maximize the expected reward: <br />
<br />
\begin{align} <br />
J(θ) = E_{φ_θ(a|s)}[R(s,a)]<br />
\end{align}<br />
<br />
<math> J(θ) </math> is non-differentiable w.r.t. <math> θ </math>, so a likelihood ratio policy gradient algorithm is used to optimize it (Williams, 1992) [4]<br />
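The likelihood ratio (REINFORCE) estimator replaces the gradient of the expectation by the expectation of <math> \nabla_θ \log φ_θ(a|s) \, R(s,a) </math>, which can be estimated from sampled actions. The sketch below illustrates this for a keep/drop decision. The logistic policy <math> P(a=1|s) = \sigma(θ s) </math>, the scalar state, and the reward function <code>toy_reward</code> are assumptions made for illustration and are not the paper's teacher architecture or reward.<br />

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def reinforce_gradient(theta, states, reward_fn, rng):
    """Monte Carlo likelihood-ratio estimate of dJ/dtheta for a Bernoulli
    keep/drop policy with P(a = 1 | s) = sigmoid(theta * s)."""
    grad = 0.0
    for s in states:
        p = sigmoid(theta * s)
        a = 1 if rng.random() < p else 0
        # For this logistic policy, d/dtheta log phi_theta(a|s) = (a - p) * s.
        grad += (a - p) * s * reward_fn(s, a)
    return grad / len(states)


def toy_reward(s, a):
    """Hypothetical reward: keeping an instance (a = 1) is always good."""
    return 1.0 if a == 1 else 0.0


rng = random.Random(0)
theta = 0.0
for _ in range(200):
    # Gradient ascent on J(theta) with a fixed step size of 0.5.
    theta += 0.5 * reinforce_gradient(theta, [1.0] * 20, toy_reward, rng)
```

Only sampled actions and their rewards are needed, so the reward itself never has to be differentiated with respect to <math> θ </math>.<br />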
<br />
==Experiments==<br />
<br />
The L2T framework is tested on the following student models: multi-layer perceptron (MLP), ResNet (CNN), and Long-Short-Term-Memory network (RNN). <br />
<br />
The student tasks are image classification on MNIST and CIFAR-10, and sentiment classification on the IMDB movie review dataset. <br />
<br />
The strategy will be benchmarked against the following teaching strategies:<br />
<br />
::'''NoTeach''': The student model is trained by conventional mini-batch SGD on all of the data, without any teaching strategy (no data filtering).<br />
::'''Self-Paced Learning (SPL)''': Teaching by ''hardness'' of data, defined as the loss. This strategy begins by filtering out data with larger loss value to train the student with "easy" data and gradually increases the hardness.<br />
::'''L2T''': The Learning to Teach framework.<br />
::'''RandTeach''': Randomly filter data in each epoch according to the logged ratio of filtered data instances per epoch (as opposed to deliberate and dynamic filtering by L2T).<br />
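The self-paced learning baseline described above can be sketched as a loss-threshold filter whose threshold grows with the epoch. The schedule below (<code>base_threshold</code> multiplied by <code>growth</code> each epoch) is a hypothetical choice made for illustration; the paper's exact SPL schedule is not reproduced here.<br />

```python
def spl_filter(losses, epoch, base_threshold=0.5, growth=1.5):
    """Return the indices of instances kept by a self-paced filter.

    Instances whose loss exceeds the current threshold are filtered out;
    the threshold grows geometrically so harder data is admitted later.
    Both schedule parameters are illustrative assumptions."""
    threshold = base_threshold * growth ** epoch
    return [i for i, loss in enumerate(losses) if loss <= threshold]


losses = [0.1, 0.6, 2.0]
kept_early = spl_filter(losses, epoch=0)   # threshold 0.5
kept_mid = spl_filter(losses, epoch=1)     # threshold 0.75
kept_late = spl_filter(losses, epoch=4)    # threshold about 2.53
```

Unlike L2T, the rule here is fixed in advance and does not adapt to feedback from the student.<br />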
<br />
When teaching a new student with the same model architecture, we observe that L2T achieves significantly faster convergence than other strategies across all tasks:<br />
<br />
[[File: L2T_speed.png | 1100px|center]]<br />
<br />
===Filtration Number===<br />
<br />
When investigating the number of data instances filtered per epoch, for the two image classification tasks the L2T teacher filters an increasing amount of data as training goes on. Thus for the two image classification tasks, the student model can learn from hard instances of data from the beginning of training, while for the natural language task the student model must first learn from easy data instances.<br />
<br />
[[File: L2T_fig3.png | 1100px|center]]<br />
<br />
===Teaching New Student with Different Model Architecture===<br />
<br />
In this part, a teacher model is first trained by interacting with a student model. The trained teacher is then used to teach another student model which has a different architecture.
The results of applying the teacher trained on ResNet32 to teach other architectures are shown below.<br />
<br />
[[File: L2T_fig4.png | 1100px|center]]<br />
<br />
===Training Time Analysis===<br />
<br />
The learning curves show that L2T reaches a given accuracy in less training time than the other strategies. This is especially evident during the earlier training stages.<br />
<br />
[[File: L2T_fig5.png | 600px|center]]<br />
<br />
===Accuracy Improvement===<br />
<br />
When comparing training accuracy on the IMDB sentiment classification task, the teaching policy learned by L2T improves over NoTeach and SPL.<br />
<br />
[[File: L2T_t1.png | 500px|center]]<br />
<br />
=Future Work=<br />
<br />
There is some useful future work that can be extended from this work: <br />
<br />
1) Recent advances in multi-agent reinforcement learning could be tried on the Reinforcement Learning problem formulation of this paper. <br />
<br />
2) Some human in the loop architectures like CHAT and HAT (https://www.ijcai.org/proceedings/2017/0422.pdf) should give better results for the same framework. <br />
<br />
3) It would be interesting to try out the framework suggested in this paper (L2T) in Imperfect information and partially observable settings. <br />
<br />
=Critique=<br />
<br />
While the conceptual framework of L2T is sound, the paper only experimentally demonstrates efficacy for ''data teaching'', which would seem to be the simplest to implement. The feasibility and effectiveness of teaching the loss function and hypothesis space are not explored in a real-world scenario. Furthermore, the experimental results for data teaching suggest that the speed of convergence is the main improvement over other teaching strategies, whereas the difference in accuracy is less remarkable. The paper also assesses accuracy only by comparing L2T with NoTeach and SPL on the IMDB classification task; the improvement (or lack thereof) on the other classification tasks and teaching strategies is omitted. Again, this distinction is not possible to assess in loss function or hypothesis space teaching within the scope of this paper.</div>Mpaflahttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=stat946F18/differentiableplasticity&diff=37474stat946F18/differentiableplasticity2018-11-01T02:21:31Z<p>Mpafla: /* Model */</p>
<hr />
<div>'''Differentiable Plasticity: ''' Summary of the ICML 2018 paper https://arxiv.org/abs/1804.02464<br />
<br />
= Presented by =<br />
<br />
1. Ganapathi Subramanian, Sriram [Quest ID: 20676799]<br />
<br />
= Motivation =<br />
1. Neural Networks naturally have a static architecture. Once a Neural Network is trained, the network architecture components (ex. network connections) cannot be changed and learning, effectively, stops with the training step. If a different task needs to be considered, then the agent must be trained again from scratch. <br />
<br />
2. Plasticity is a characteristic of biological systems such as humans, in which network connections can change over time. This enables lifelong learning and lets biological systems adapt to dynamic changes in the environment with great sample efficiency. This is called synaptic plasticity, which is based on Hebb's rule (i.e. if a neuron repeatedly takes part in making another neuron fire, the connection between them is strengthened). Neural networks are very far from achieving synaptic plasticity. <br />
<br />
3. Differentiable plasticity is a step in this direction. The behavior of the plastic connections is trained using gradient descent so that previously trained networks can adapt to changing conditions. <br />
<br />
Example: Using current state-of-the-art supervised learning, we can train neural networks to recognize the specific letters that they have seen during training. Using lifelong learning, the agent can develop knowledge about any alphabet, including those that it has never been exposed to during training.<br />
<br />
= Objectives =<br />
The paper has the following objectives: <br />
<br />
1. To tackle the problem of meta-learning (learning to learn). <br />
<br />
2. To design neural networks with plastic connections with a special emphasis on gradient descent capability for backpropagation training. <br />
<br />
3. To use backpropagation to optimize both the base weights and the amount of plasticity in each connection. <br />
<br />
4. To demonstrate the performance of such networks on three complex and different domains, namely complex pattern memorization, one-shot classification and reinforcement learning.<br />
<br />
= Important Terms =<br />
<br />
Hebb’s rule: This is a famous rule in neuroscience. It defines the relationship of activities between neurons with their connection. It states that if a neuron repeatedly takes part in making another neuron fire, the connection between them is strengthened. Also summarized as "neurons that fire together, wire together".<br />
<br />
= Related Work =<br />
<br />
Previous Approaches to solving this problem are summarized below: <br />
<br />
1. Train standard recurrent neural networks to incorporate past experience in their future responses within each episode. To improve their learning abilities, the RNN can be augmented with an external content-addressable memory bank. An attention mechanism within the controller network reads from and writes to the memory bank and thus enables fast memorization. <br />
<br />
2. Augment each weight with a plastic component that automatically grows and decays as a function of inputs and outputs. All connections have the same non-trainable plasticity and only the corresponding weights are trained. Recent approaches have tried fast weights, which augment recurrent networks with fast-changing Hebbian weights and compute the activation function at each step. The network has a high bias towards the recently seen patterns. <br />
<br />
3. Optimize the learning rule itself, instead of the connections. A parametrized learning rule is used where the structure of the network is fixed beforehand. <br />
<br />
4. Have all the weight updates computed on the fly by the network itself or by a separate network at each time step. The advantage is flexibility; the disadvantage is the large learning burden placed on the network. <br />
<br />
5. Perform gradient descent via backpropagation during the episode. The meta-learning involves training the base network so that it can be fine-tuned using additional gradient descent. <br />
<br />
6. For classification tasks, a separate embedding is trained to discriminate between different classes. Classification is then a comparison between the embedding of the test and example instances.<br />
<br />
The advantages of trainable synaptic plasticity for meta-learning are as follows: <br />
<br />
1. Great potential for flexibility. For example, Memory Networks enforce a specific memory storage model in which memories must be embedded in fixed-size vectors and retrieved through some attention mechanism. In contrast, trainable synaptic plasticity translates into very different forms of memory, the exact implementation of which can be determined by the (trainable) network structure.<br />
<br />
2. Fixed-weight recurrent networks, meanwhile, require neurons to be used for both storage and computation, which increases the computational burden on neurons. This is avoided in the approach suggested in the paper. <br />
<br />
3. Non-trainable plasticity networks can exploit network connectivity for storage of short-term information, but their uniform, non-trainable plasticity imposes a stereotypical behavior on these memories. In the synaptic plasticity, the amount and rate of plasticity are actively molded by the mechanism itself. Also, it allows for more sustained memory. <br />
<br />
= Model =<br />
<br />
The formulation proposed in the paper is in such a way that the plastic and non-plastic components for each connection are kept separate, while multiple Hebbian rules can be easily defined. <br />
<br />
Model Components: <br />
<br />
1. A connection between any two neurons <math display = "inline">i</math> and <math display = "inline">j</math> has both a fixed component and a plastic component. <br />
<br />
2. The fixed part is just a traditional connection weight <math display = "inline">w_{i,j}</math>. The plastic part is stored in a Hebbian trace <math display = "inline">H_{i,j}</math>, which varies during a lifetime according to ongoing inputs and outputs. <br />
<br />
3. The relative importance of plastic and fixed components in the connection is structurally determined by the plasticity coefficient <math display = "inline">\alpha_{i,j}</math>, which multiplies the Hebbian trace to form the full plastic component of the connection. <br />
<br />
The network equations for the output <math display = "inline">x_j(t)</math> of the neuron <math display = "inline">j</math> are as follows: <br />
<br />
<br />
<math display="block"><br />
x_j(t) = \sigma\left\{ \sum_{i \in \text{inputs}} \left[ w_{i,j} x_i(t-1) + \alpha_{i,j} H_{i,j}(t) x_i(t-1) \right] \right\}
</math><br />
<br />
<br />
<br />
<math display="block"><br />
H_{i,j}(t+1) = \eta x_i(t-1) x_j(t) + (1 - \eta) H_{i,j}(t) <br />
</math><br />
<br />
Here the first equation gives the activation of neuron <math display = "inline">j</math>, where <math display = "inline">w_{i,j} x_i(t-1)</math> is the fixed component and the remaining term (<math display = "inline"> \alpha_{i,j} H_{i,j}(t) x_i(t-1) </math>) is the plastic component. <math display = "inline">\sigma</math> is a nonlinear function, always chosen to be tanh in this paper. The <math display = "inline">H_{i,j}</math> in the second equation is updated as a function of ongoing inputs and outputs, after being initialized to zero at each episode. In contrast, <math display = "inline">w_{i,j}</math> and <math display = "inline">\alpha_{i,j}</math> are the structural parameters trained by gradient descent and conserved across episodes.<br />
<br />
From the first equation above, a connection can be fully fixed if <math display = "inline">\alpha = 0 </math>, fully plastic if <math display = "inline">w = 0</math>, or have both fixed and plastic components. <br />
<br />
<br />
<math display = "inline">\eta</math>, which denotes the learning rate of the plasticity, is also an optimized parameter of the network. After this training, the agent can learn automatically from ongoing experience. In equation 2 above, the <math display = "inline">(1 - \eta)</math> factor makes the Hebbian traces decay to 0 in the absence of input. An alternative form of the update (Oja's rule), which avoids this decay, is as follows: <br />
<br />
<br />
<math display="block"><br />
H_{i,j}(t+1) = H_{i,j}(t) + \eta x_j(t)(x_i(t-1) - x_j(t)H_{i,j}(t))<br />
</math><br />
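<br />
The update equations can be made concrete with a small sketch of one time step of a plastic recurrent layer. It computes each input's contribution to neuron <math display = "inline">j</math> as <math display = "inline">(w_{i,j} + \alpha_{i,j} H_{i,j}(t)) x_i(t-1)</math>, applies tanh, and then updates the Hebbian traces with either the decaying rule or the alternative rule given above. The function name <code>plastic_step</code>, the nested-list representation and the <code>rule</code> flag are illustrative assumptions; this is not the authors' implementation, and no gradient-based training of <math display = "inline">w</math>, <math display = "inline">\alpha</math> or <math display = "inline">\eta</math> is shown.<br />

```python
import math


def plastic_step(x_prev, w, alpha, hebb, eta, rule="decay"):
    """One time step of a fully connected plastic layer.

    x_prev      : activations x_i(t-1), list of length n
    w, alpha    : n x n fixed weights and plasticity coefficients
    hebb        : n x n Hebbian traces H_{i,j}(t)
    eta         : plasticity learning rate
    rule        : "decay" for the decaying Hebbian update, anything else
                  for the alternative (Oja-style) update
    Returns (x_new, new_hebb)."""
    n = len(x_prev)
    x_new = []
    for j in range(n):
        total = sum((w[i][j] + alpha[i][j] * hebb[i][j]) * x_prev[i]
                    for i in range(n))
        x_new.append(math.tanh(total))

    new_hebb = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if rule == "decay":
                new_hebb[i][j] = (eta * x_prev[i] * x_new[j]
                                  + (1 - eta) * hebb[i][j])
            else:
                new_hebb[i][j] = hebb[i][j] + eta * x_new[j] * (
                    x_prev[i] - x_new[j] * hebb[i][j])
    return x_new, new_hebb


# Two neurons, a single fixed connection 0 -> 1, no plasticity yet.
w = [[0.0, 1.0], [0.0, 0.0]]
alpha = [[0.0, 0.0], [0.0, 0.0]]
hebb = [[0.0, 0.0], [0.0, 0.0]]
x, h_decay = plastic_step([1.0, 0.0], w, alpha, hebb, eta=0.5)
_, h_oja = plastic_step([1.0, 0.0], w, alpha, hebb, eta=0.5, rule="oja")
```

Starting from zero traces, both rules give the same first update; they differ on later steps, where the decaying rule shrinks existing traces by <math display = "inline">(1 - \eta)</math> regardless of activity.<br />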
<br />
= Experiment 1 - Binary Pattern Memorization =<br />
<br />
<br />
<br />
This test involves quickly memorizing sets of arbitrary high-dimensional patterns and reconstructing them when exposed to partial, degraded versions of them. This is a very simple test, as it is already known that hand-designed recurrent networks with Hebbian plastic connections can solve it for binary patterns.<br />
<br />
<br />
<br />
[[File:binarypatternrecog.png | 650px|thumb|center|Figure 1: Pattern Memorization experiment - Input Structure and Architecture]]<br />
<br />
<br />
<br />
Steps in the experiment: <br />
<br />
<br />
1) The network is shown a set of five binary patterns in succession, as shown in figure 1. Each of these patterns has 1000 elements, each of which takes one of two binary values (1 or -1). Here 1 corresponds to dark red and -1 corresponds to dark blue. <br />
<br />
2) The few-shot learning paradigm is followed, where each pattern is shown for 10 time steps, with 3 time steps of zero input between the presentations, and the whole sequence of patterns is presented 3 times in random order. <br />
<br />
3) One of the presented patterns is chosen in random order and degraded by setting half of its bits to 0. <br />
<br />
4) This degraded pattern is then fed to the network. The network has to reproduce the correct full pattern in its output using the memory that it developed during training. <br />
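<br />
The input side of these steps can be sketched as follows: generate random patterns with entries in {-1, 1} and degrade one of them by zeroing half of its elements. The helper names <code>make_pattern</code> and <code>degrade</code> are hypothetical, and choosing the zeroed positions uniformly at random is an assumption; the presentation schedule (10 time steps per pattern, 3 time steps of zero input) is not modelled here.<br />

```python
import random


def make_pattern(n, rng):
    """Random binary pattern with n elements, each equal to -1 or 1."""
    return [rng.choice((-1, 1)) for _ in range(n)]


def degrade(pattern, rng):
    """Set half of the elements to 0 (positions chosen at random, which is
    an illustrative assumption) and leave the others untouched."""
    zeroed = set(rng.sample(range(len(pattern)), len(pattern) // 2))
    return [0 if i in zeroed else v for i, v in enumerate(pattern)]


rng = random.Random(0)
patterns = [make_pattern(1000, rng) for _ in range(5)]   # five 1000-bit patterns
target = rng.choice(patterns)                            # pattern to be recalled
query = degrade(target, rng)                             # degraded test input
```

The network's task is then to output <code>target</code> when given <code>query</code>.<br />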
<br />
The architecture of the network is described as follows: <br />
<br />
1) It is a fully connected RNN with one neuron per pattern element, plus one fixed-output neuron. There are a total of 1001 neurons. <br />
<br />
2) The value of each neuron is clamped to the value of the corresponding element in the pattern if that value is not 0. If the value is 0, the corresponding neuron does not receive pattern input and must use what it gets from lateral connections to reconstruct the correct, expected output value. <br />
<br />
3) Outputs are read from the activation of the neurons. <br />
<br />
4) The performance evaluation is done by computing the loss between the final network output and the correct expected pattern. <br />
<br />
5) The gradient of the error over the <math display = "inline">w_{i,j}</math> and the <math display = "inline">\alpha_{i,j}</math> coefficients is computed by backpropagation and optimized through Adam solver with learning rate 0.001. <br />
<br />
6) The simple decaying Hebbian formula in Equation 2 is used to update the Hebbian traces. Each network has 2 trainable parameters <math display = "inline">w</math> and <math display = "inline">\alpha</math> for each connection, thus there are a total 1001 <math display = "inline">\times</math> 1001 <math display = "inline">\times</math> 2 = 2004002 trainable parameters. <br />
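<br />
The parameter count follows directly from having one <math display = "inline">w</math> and one <math display = "inline">\alpha</math> per connection in a fully connected network. A quick check of the arithmetic, using a hypothetical helper <code>plastic_param_count</code>:<br />

```python
def plastic_param_count(num_neurons):
    """Trainable parameters of a fully connected plastic network:
    one weight w and one plasticity coefficient alpha per connection."""
    return 2 * num_neurons * num_neurons


pattern_task_params = plastic_param_count(1001)   # 1000 pattern neurons + 1 fixed-output neuron
```

The same formula applies to any fully connected plastic network of this form, whatever its number of neurons.<br />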
<br />
[[File:exp1results.png | 650px|thumb|center|Figure 2:Experiment 1 - Pattern Memorization Results]]<br />
<br />
<br />
The results are shown in the figure 2 where 10 runs are considered. The error becomes quite low after about 200 episodes of training. <br />
<br />
[[File:exp1nonplasticresults.png| 650px|thumb|center|Figure 3: Pattern Memorization results with non plastic networks]]<br />
<br />
Comparison with Non-Plastic Networks: <br />
<br />
1) In principle, non-plastic networks can solve this task, but they require additional neurons to do so. In practice, the authors report that the task is not solved using a non-plastic RNN or LSTM. <br />
<br />
2) The figure 3 shows the results using non-plastic networks. The best results required the addition of 2000 extra neurons. <br />
<br />
3) For the non-plastic RNN, the error flattens around 0.13, which is quite high. Using LSTMs, the task can be solved, albeit imperfectly, and the error drops drastically to around 0.001. <br />
<br />
4) The plastic network solves the task very quickly, with the mean error going below 0.01 within 2000 episodes, which is reported to be 250 times faster than the LSTM.<br />
<br />
<br />
<br />
<br />
<br />
<br />
<br />
= Experiment 2 - Memorizing natural images=<br />
<br />
This is an image reconstruction task where a network is trained on a set of natural images which it has to memorize. Natural images with graded pixel values contain more information per element than the binary patterns of the last experiment, so this experiment is inherently more complex than the previous one. One image is then chosen at random and half of it is displayed to the agent. The task is to complete the image. The paper shows that this method effectively solves this task, which other state-of-the-art network architectures fail to solve. <br />
<br />
The experiment is as follows: <br />
<br />
1) Images are from the CIFAR-10 database where there are a total of 60000 images each of size 32 <math display = "inline">\times</math> 32. <br />
<br />
2) The architecture has 1025 neurons in total with a total of 2 <math display = "inline">\times</math> 1025 <math display = "inline">\times</math> 1025 = 2101250 parameters. <br />
<br />
3) Each episode has 3 pictures, shown 3 times for 20 time steps each time, with 3 time steps of zero input between the presentations. <br />
<br />
4) The images are degraded by zeroing out one full contiguous half of the image to prevent a trivial solution of simply reconstructing the missing pixel as the average of its neighbors.<br />
<br />
[[File:exp2results.png| 650px|thumb|center|Figure 4: Natural Image memorization results]]<br />
<br />
<br />
<br />
The results are shown in figure 4. The final output of the network is shown in the last column which is the reconstructed image. The results show that the model has learned to perform this task. <br />
<br />
[[File:exp2weights.png| 650px|thumb|center|Figure 5: Final matrices and plasticity coefficients]]<br />
<br />
The final weight matrix and plasticity coefficient matrix are shown in Figure 5. The plasticity matrix shows structure related to the high correlation of neighboring pixels and to the half-field zeroing of test images. <br />
<br />
The fully plastic network is compared against a similar architecture with shared plasticity coefficients, in which all connections share a single trainable <math display = "inline">\alpha</math> value. <br />
<br />
[[File:independentvsshared.png| 650px|thumb|center|Figure 6: Comparing independent and shared <math display = "inline">\alpha</math> value runs]]<br />
<br />
Figure 6 shows the result of this comparison: an independent plasticity coefficient for each connection gives better performance. Thus the structure observed in the learned matrices is actually useful.<br />
<br />
<br />
= Experiment 3 - Omniglot task =<br />
<br />
This task involves handwritten symbol recognition. It is a standard task for one-shot and few-shot learning. <br />
<br />
Experimental Setup: <br />
<br />
1) The Omniglot data set is a collection of handwritten characters from various writing systems, including 20 instances each of 1,623 different handwritten characters, written by different subjects.<br />
<br />
2) In each episode, N character classes are randomly selected and K instances from each class are sampled. <br />
<br />
3) These instances, together with the class label (from 1 to N), are shown to the model. <br />
<br />
4) Then, a new, unlabeled instance is sampled from one of the N classes and shown to the model.<br />
<br />
5) Model performance is defined as the model’s accuracy in classifying this unlabeled example.<br />
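<br />
The episode structure above can be sketched as follows. This is a minimal illustration, not the authors' data pipeline: the function <code>sample_episode</code>, the array layout <code>(n_classes, n_instances, height, width)</code> and the 0-based labels (the text uses 1 to N) are all assumptions made for the example.<br />

```python
import numpy as np

def sample_episode(rng, data, n_way=5, k_shot=1):
    """Sample one N-way, K-shot episode (hypothetical helper).

    data: array of shape (n_classes, n_instances, height, width).
    Returns labelled support instances plus one unlabelled query instance.
    """
    n_classes, n_instances = data.shape[:2]
    classes = rng.choice(n_classes, size=n_way, replace=False)
    query_label = int(rng.integers(n_way))
    support, labels = [], []
    query = None
    for label, c in enumerate(classes):
        # Draw K + 1 distinct instances so the query is never a support item.
        idx = rng.choice(n_instances, size=k_shot + 1, replace=False)
        support.extend(data[c, idx[:k_shot]])
        labels.extend([label] * k_shot)
        if label == query_label:
            query = data[c, idx[k_shot]]
    return np.stack(support), np.array(labels), query, query_label

rng = np.random.default_rng(0)
fake_data = rng.random((30, 20, 8, 8))   # stand-in for Omniglot images
support, labels, query, query_label = sample_episode(rng, fake_data)
print(support.shape, labels, query_label)
```

Accuracy is then the fraction of episodes in which the model assigns <code>query_label</code> to the query instance.<br />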
<br />
Architecture: <br />
<br />
1) Model architecture has 4 convolutional layers with 3 <math display = "inline">\times</math> 3 receptive fields and 64 channels. <br />
<br />
2) All convolutions have a stride of 2 to reduce the dimensionality between layers. <br />
<br />
3) The output is a single vector of 64 features, which feeds into an N-way softmax. <br />
<br />
4) The label of the current character is also concurrently fed as a one-hot encoding to this softmax layer, to serve as a guide for the correct output when a label is present.<br />
<br />
Plasticity in the architecture: <br />
<br />
1) Plasticity is applied to the weights from the final layer to the softmax layer, leaving the rest of the convolutional embedding non-plastic. <br />
<br />
2) The expectation is that the convolutional architecture will learn an adequate discriminant between arbitrary handwritten characters, while the plastic weights learn to memorize associations between observed patterns and outputs. <br />
<br />
Data Preparation: <br />
<br />
1) The dataset is augmented with rotations by multiples of <math display = "inline">90</math> degrees. <br />
<br />
2) It is divided into 1,523 classes for training and 100 classes (together with their augmentations) for testing. <br />
<br />
3) The networks are trained with an Adam optimizer with a learning rate 3 <math display = "inline">\times 10^{-5}</math>, multiplied by 2/3 every 1M episodes over 5,000,000 episodes. <br />
<br />
4) To evaluate final model performance, 10 models are trained with different random seeds and each of those is tested on 100 episodes using previously unseen test classes.<br />
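<br />
The learning-rate schedule in step 3 can be written out as a small function. This sketch assumes the decay is applied as a step function at each multiple of 1M episodes; the function name and parameter names are hypothetical.<br />

```python
def learning_rate(episode, base_lr=3e-5, decay=2 / 3, every=1_000_000):
    """Step-decayed learning rate: multiplied by `decay` every `every` episodes.

    Assumption: the decay is applied in discrete steps, not continuously.
    """
    return base_lr * decay ** (episode // every)

# Learning rate at the start of each 1M-episode block of the 5M-episode run.
for block in range(5):
    print(block, learning_rate(block * 1_000_000))
```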
<br />
Results: <br />
<br />
1) The overall accuracy (i.e. the proportion of episodes with correct classification, aggregated over all test episodes of all runs) is 98.3%, with a 95% confidence interval of 0.80%.<br />
<br />
2) The median accuracy across the 10 runs was 98.5%, indicating consistency in learning.<br />
<br />
{| class="wikitable"<br />
|-<br />
! Memory Networks<br />
! Matching Networks<br />
! ProtoNets<br />
! Memory Module<br />
! MAML<br />
! SNAIL<br />
! DP (this paper)<br />
|-<br />
| 82.8%<br />
| 98.1%<br />
| 97.4%<br />
| 98.4%<br />
| 98.7% <math display = "inline">\pm</math> 0.4<br />
| 99.07% <math display = "inline">\pm</math> 0.16<br />
| 98.3% <math display = "inline">\pm</math> 0.80<br />
|}<br />
<br />
<br />
<br />
3) The table above compares performance with other, non-plastic approaches. The results of the plastic approach are largely similar to those reported for the computationally intensive MAML method and the classification-specialized Matching Networks method. <br />
<br />
4) Performance is slightly below that reported for the SNAIL method, which trains a whole additional temporal-convolution network on top of the convolutional architecture and thus has many more parameters.<br />
<br />
5) The conclusion is that a few plastic connections to the output of the network allow for competitive one-shot learning over arbitrary man-made visual symbols.<br />
<br />
= Experiment 4 - Reinforcement learning Maze navigation task =<br />
<br />
This is a maze exploration task in which an agent must learn to reach a reward location. The plastic networks are shown to outperform non-plastic ones. <br />
<br />
Experimental setup: <br />
<br />
1) The maze is composed of 9 <math display = "inline">\times</math> 9 squares, surrounded by walls, in which every other square (in either direction) is occupied by a wall. <br />
<br />
[[File:exp4maze.png| 650px|thumb|center|Figure 7: Maze Environment]]<br />
<br />
<br />
2) The maze contains 16 wall squares arranged in a regular grid, as shown in Figure 7. <br />
<br />
3) At each episode, one non-wall square is randomly chosen as the reward location. When the agent hits this location, it receives a large reward (10.0) and is immediately transported to a random location in the maze. (Also, a small negative reward of -0.1 is provided every time the agent tries to walk into a wall.)<br />
<br />
4) Each episode lasts 250 time steps, during which the agent must accumulate as much reward as possible. The reward location is fixed within an episode and randomized across episodes. <br />
<br />
5) The reward is invisible to the agent, and thus the agent only knows it has hit the reward location by the activation of the reward input at the next step.<br />
<br />
6) Inputs to the agent consist of a binary vector describing the 3 <math display = "inline">\times</math> 3 neighborhood centered on the agent (each element being set to 1 if the corresponding square is a wall and 0 otherwise), together with the reward at the previous time step. <br />
<br />
7) The A2C algorithm is used to meta-train the network. <br />
<br />
8) The experiments are run under three conditions: full differentiable plasticity, no plasticity at all, and homogeneous plasticity in which all connections share the same (learnable) <math display = "inline">\alpha</math> parameter. <br />
<br />
9) For each condition, 15 runs with different random seeds are performed. <br />
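<br />
The maze layout and the agent's input vector can be sketched as below. The sketch assumes that the 9 x 9 squares are the interior and that the surrounding wall adds one extra ring, giving an 11 x 11 array; this reading is chosen because it yields exactly the 16 interior wall squares mentioned above. The helpers <code>make_maze</code> and <code>observe</code> are hypothetical names.<br />

```python
import numpy as np

def make_maze(inner=9):
    """Build the maze as a 0/1 array (1 = wall).

    Assumption: `inner` x `inner` interior plus a one-square wall border.
    """
    size = inner + 2
    maze = np.ones((size, size), dtype=int)
    maze[1:-1, 1:-1] = 0          # open interior
    maze[2:-1:2, 2:-1:2] = 1      # wall on every other interior square
    return maze

def observe(maze, pos, prev_reward):
    """Agent input: flattened 3 x 3 wall neighborhood plus previous reward."""
    r, c = pos
    patch = maze[r - 1 : r + 2, c - 1 : c + 2].ravel()
    return np.concatenate([patch, [prev_reward]]).astype(float)

maze = make_maze()
print(maze.shape, int(maze[1:-1, 1:-1].sum()))   # interior wall count
print(observe(maze, (1, 1), 0.0))
```

The observation carries no direct information about the reward location, which is why the agent has to explore and then remember where the reward input was activated.<br />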
<br />
<br />
Architecture: <br />
<br />
1) It is a simple recurrent network with 200 neurons, with a softmax layer on top of it to select between the 4 possible actions (up, right, left or down).<br />
<br />
<br />
[[File:exp4performance.png| 650px|thumb|center|Figure 8: Performance curve for the maze navigation experiment]]<br />
<br />
<br />
Results: <br />
<br />
1) The results are shown in Figure 8. The plastic network performs considerably better than the other networks.<br />
<br />
2) The non-plastic and homogeneous networks get stuck on a sub-optimal policy. <br />
<br />
3) Thus, the conclusion is that, for this task, individually sculpting the plasticity of each connection is crucial to reaping the benefits of plasticity.<br />
<br />
= Conclusions =<br />
<br />
<br />
The important contributions from this paper are as follows: <br />
<br />
1) The results show that simple plastic models support efficient meta-learning.<br />
<br />
2) Gradient descent itself is shown to be capable of optimizing the plasticity of a meta-learning system. <br />
<br />
3) The meta-learning is shown to vastly outperform alternative options in the experiments considered. <br />
<br />
4) The method achieved results competitive with the state of the art on the Omniglot benchmark. <br />
<br />
= Open Source Code =<br />
<br />
Code for this paper can be found at: https://github.com/uber-common/differentiable-plasticity<br />
<br />
<br />
= Critiques =<br />
<br />
The paper addresses the important problem of learning to learn ("meta-learning") and provides a novel framework based on gradient descent to achieve this objective. It leaves a large scope for future work, as many widely used architectures such as LSTMs could be combined with a plastic component. Such approaches also have many potential applications in deep reinforcement learning, and plastic networks have a good chance of beating the current baselines on popular test beds such as Atari games. This paper opens up possibilities for a whole class of meta-learning algorithms. <br />
<br />
With regards to the drawbacks of the paper, the paper does not mention how plastic networks will behave if the test sets are completely different from the training dataset. Will the performance be the same as non-plastic networks? It is not very clear if this method will be scalable as there are a large number of parameters to be determined even with the simplest of problems. Also, each experimental domain considered in this paper needed significantly different network architectures (for example in the Omniglot domain plasticity was applied only for the final layers). The paper does not mention any reasons for the specific decisions and if such differences will hold good for other similar problems as well. There has been work in transfer learning applied to both supervised learning and reinforcement learning problems. The authors should have ideally compared plastic networks to performances of some algorithms there as these methods transfer existing knowledge to other related problems and also prevent the need to start training from scratch much similar to the methods adopted in this paper.</div>Mpaflahttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=stat946F18/differentiableplasticity&diff=37472stat946F18/differentiableplasticity2018-11-01T02:17:12Z<p>Mpafla: /* Model */</p>
<hr />
<div>'''Differentiable Plasticity: ''' Summary of the ICML 2018 paper https://arxiv.org/abs/1804.02464<br />
<br />
= Presented by =<br />
<br />
1. Ganapathi Subramanian, Sriram [Quest ID: 20676799]<br />
<br />
= Motivation =<br />
1. Neural networks typically have a static architecture. Once a neural network is trained, the components of its architecture (e.g., network connections) cannot be changed, and learning effectively stops with the training step. If a different task needs to be considered, the agent must be trained again from scratch. <br />
<br />
2. Plasticity is a characteristic of biological systems such as humans, in which network connections can change over time. This enables lifelong learning and allows biological systems to adapt to dynamic changes in the environment with great sample efficiency. This is called synaptic plasticity and is based on Hebb's rule (i.e. if a neuron repeatedly takes part in making another neuron fire, the connection between them is strengthened). Artificial neural networks are very far from achieving synaptic plasticity. <br />
<br />
3. Differentiable plasticity is a step in this direction. The behavior of the plastic connections is trained using gradient descent so that previously trained networks can adapt to changing conditions. <br />
<br />
Example: Using current state-of-the-art supervised learning methods, we can train neural networks to recognize the specific letters that they have seen during training. With lifelong learning, the agent can learn any alphabet, including those that it has never been exposed to during training.<br />
<br />
= Objectives =<br />
The paper has the following objectives: <br />
<br />
1. To tackle the problem of meta-learning (learning to learn). <br />
<br />
2. To design neural networks with plastic connections that can be trained by gradient descent through backpropagation. <br />
<br />
3. To use backpropagation to optimize both the base weights and the amount of plasticity in each connection. <br />
<br />
4. To demonstrate the performance of such networks on three different and complex domains, namely complex pattern memorization, one-shot classification and reinforcement learning.<br />
<br />
= Important Terms =<br />
<br />
Hebb’s rule: This is a famous rule in neuroscience. It relates the activity of neurons to the strength of the connection between them. It states that if a neuron repeatedly takes part in making another neuron fire, the connection between them is strengthened. It is also summarized as "neurons that fire together, wire together".<br />
<br />
= Related Work =<br />
<br />
Previous approaches to this problem are summarized as follows: <br />
<br />
1. Train standard recurrent neural networks to incorporate past experience in their future responses within each episode. To support learning, the RNN is augmented with an external content-addressable memory bank. An attention mechanism within the controller network reads from and writes to the memory bank and thus enables fast memorization. <br />
<br />
2. Augment each weight with a plastic component that automatically grows and decays as a function of inputs and outputs. All connections have the same non-trainable plasticity and only the corresponding weights are trained. Recent approaches have tried fast weights, which augment recurrent networks with fast-changing Hebbian weights and compute activations at each step. The network has a high bias towards recently seen patterns. <br />
<br />
3. Optimize the learning rule itself instead of the connections. A parametrized learning rule is used where the structure of the network is fixed beforehand. <br />
<br />
4. Have all the weight updates computed on the fly by the network itself or by a separate network at each time step. The advantage is flexibility; the drawback is the large learning burden placed on the network. <br />
<br />
5. Perform gradient descent via backpropagation during the episode. The meta-learning involves training the base network so that it can be fine-tuned using additional gradient descent. <br />
<br />
6. For classification tasks, a separate embedding is trained to discriminate between different classes. Classification is then a comparison between the embedding of the test and example instances.<br />
<br />
The advantages of trainable synaptic plasticity for meta-learning are as follows: <br />
<br />
1. Great potential for flexibility. For example, Memory Networks enforce a specific memory storage model in which memories must be embedded in fixed-size vectors and retrieved through some attention mechanism. In contrast, trainable synaptic plasticity translates into very different forms of memory, the exact implementation of which can be determined by the (trainable) network structure.<br />
<br />
2. Fixed-weight recurrent networks, meanwhile, require neurons to be used for both storage and computation, which increases the computational burden on neurons. This is avoided in the approach suggested in the paper. <br />
<br />
3. Non-trainable plasticity networks can exploit network connectivity for storage of short-term information, but their uniform, non-trainable plasticity imposes a stereotypical behavior on these memories. With trainable synaptic plasticity, the amount and rate of plasticity are actively molded by the mechanism itself. It also allows for more sustained memory. <br />
<br />
= Model =<br />
<br />
The formulation proposed in the paper keeps the plastic and non-plastic components of each connection separate, while allowing multiple Hebbian rules to be easily defined. <br />
<br />
Model Components: <br />
<br />
1. A connection between any two neurons <math display = "inline">i</math> and <math display = "inline">j</math> has both a fixed component and a plastic component. <br />
<br />
2. The fixed part is just a traditional connection weight <math display = "inline">w_{i,j}</math>. The plastic part is stored in a Hebbian trace <math display = "inline">H_{i,j}</math>, which varies during a lifetime according to ongoing inputs and outputs. <br />
<br />
3. The relative importance of plastic and fixed components in the connection is structurally determined by the plasticity coefficient <math display = "inline">\alpha_{i,j}</math>, which multiplies the Hebbian trace to form the full plastic component of the connection. <br />
<br />
The network equations for the output <math display = "inline">x_j(t)</math> of the neuron <math display = "inline">j</math> are as follows: <br />
<br />
<br />
<math display="block"><br />
x_j(t) = \sigma\left\{ \sum_{i \in \text{inputs}} \left[ w_{i,j}x_i(t-1) + \alpha_{i,j} H_{i,j}(t)x_i(t-1) \right] \right\}<br />
</math><br />
<br />
<br />
<br />
<math display="block"><br />
H_{i,j}(t+1) = \eta x_i(t-1) x_j(t) + (1 - \eta) H_{i,j}(t) <br />
</math><br />
<br />
Here the first equation gives the activation of neuron <math display = "inline">j</math>, where <math display = "inline">w_{i,j}</math> is the fixed component and the remaining term <math display = "inline"> \alpha_{i,j} H_{i,j}(t)x_i(t-1) </math> is the plastic component. <math display = "inline">\sigma</math> is a nonlinear function, always chosen to be tanh in this paper. <math display = "inline">H_{i,j}</math> in the second equation is updated as a function of ongoing inputs and outputs, after being initialized to zero at the start of each episode. <br />
<br />
From the first equation above, a connection can be fully fixed if <math display = "inline">\alpha = 0 </math>, fully plastic if <math display = "inline">w = 0</math>, or have both fixed and plastic components. <br />
<br />
<br />
<br />
The terms <math display = "inline">w_{i,j}</math> and <math display = "inline">\alpha_{i,j}</math> are the structural parameters, which are trained by gradient descent and conserved across episodes. The parameter <math display = "inline">\eta</math>, which acts as the learning rate of plasticity, is also an optimized parameter of the network. After this training, the agent can learn automatically from ongoing experience. In equation 2 above, the factor <math display = "inline">(1 - \eta)</math> causes the Hebbian traces to decay to 0 in the absence of input. An alternative form of the update, known as Oja's rule, avoids this decay while keeping the traces bounded: <br />
<br />
<br />
<math display="block"><br />
H_{i,j}(t+1) = H_{i,j}(t) + \eta x_j(t)(x_i(t-1) - x_j(t)H_{i,j}(t))<br />
</math><br />
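<br />
The three equations above can be put together in a short NumPy sketch of one time step of a fully connected plastic layer. This is an illustration under stated assumptions, not the authors' implementation: the function name <code>plastic_step</code>, the array layout (entry <code>[i, j]</code> is the connection from neuron i to neuron j) and the use of a scalar <code>eta</code> are choices made for the example.<br />

```python
import numpy as np

def plastic_step(x_prev, w, alpha, hebb, eta, rule="decay"):
    """One time step of a plastic recurrent layer (hypothetical helper).

    x_prev : (n,) activations x_i(t-1)
    w, alpha, hebb : (n, n) arrays, entry [i, j] = connection from i to j
    eta : scalar plasticity learning rate
    """
    # Effective weight is fixed part plus plastic part; sigma is tanh.
    x_new = np.tanh(x_prev @ (w + alpha * hebb))
    if rule == "decay":
        # H(t+1) = eta * x_i(t-1) * x_j(t) + (1 - eta) * H(t)
        hebb_new = eta * np.outer(x_prev, x_new) + (1.0 - eta) * hebb
    else:
        # Oja's rule: H(t+1) = H(t) + eta * x_j(t) * (x_i(t-1) - x_j(t) * H(t))
        hebb_new = hebb + eta * x_new[None, :] * (
            x_prev[:, None] - x_new[None, :] * hebb
        )
    return x_new, hebb_new

rng = np.random.default_rng(0)
n = 4
x = rng.standard_normal(n)
w = 0.1 * rng.standard_normal((n, n))
alpha = 0.1 * rng.standard_normal((n, n))
hebb = np.zeros((n, n))            # traces start at zero in each episode
for _ in range(3):
    x, hebb = plastic_step(x, w, alpha, hebb, eta=0.1)
print(x.shape, hebb.shape)
```

In meta-training, <code>w</code>, <code>alpha</code> and <code>eta</code> would be updated by gradient descent across episodes, while <code>hebb</code> changes only within an episode; the sketch shows the within-episode dynamics only.<br />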
<br />
= Experiment 1 - Binary Pattern Memorization =<br />
<br />
<br />
<br />
This test involves quickly memorizing sets of arbitrary high-dimensional patterns and reconstructing them when exposed to partial, degraded versions. This is a very simple test, as it is already known that hand-designed recurrent networks with Hebbian plastic connections can solve it for binary patterns.<br />
<br />
<br />
<br />
[[File:binarypatternrecog.png | 650px|thumb|center|Figure 1: Pattern Memorization experiment - Input Structure and Architecture]]<br />
<br />
<br />
<br />
Steps in the experiment: <br />
<br />
<br />
1) The network is shown a set of five binary patterns in succession, as shown in Figure 1. Each pattern has 1000 elements, each of which takes one of two binary values (1 or -1). Here 1 corresponds to dark red and -1 corresponds to dark blue. <br />
<br />
2) The few-shot learning paradigm is followed: each pattern is shown for 10 time steps, with 3 time steps of zero input between presentations, and the whole sequence of patterns is presented 3 times in random order. <br />
<br />
3) One of the presented patterns is chosen at random and degraded by setting half of its bits to 0. <br />
<br />
4) This degraded pattern is then fed to the network, which has to reproduce the correct full pattern in its output using the memory it formed during the presentations. <br />
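<br />
The steps above can be sketched as an input-sequence generator. This is a minimal sketch under assumptions: the function name <code>make_episode</code> is hypothetical, and the number of time steps for which the degraded pattern is shown (<code>test_steps</code>) is not stated in this summary, so the default used here is only a placeholder.<br />

```python
import numpy as np

def make_episode(rng, n_patterns=5, size=1000, show_steps=10, gap_steps=3,
                 n_presentations=3, test_steps=10):
    """Build the input sequence for one pattern-memorization episode.

    Assumption: `test_steps` (duration of the degraded probe) is a
    placeholder value, not taken from the paper.
    """
    patterns = rng.choice([-1.0, 1.0], size=(n_patterns, size))
    frames = []
    for _ in range(n_presentations):
        for idx in rng.permutation(n_patterns):    # random order each pass
            frames += [patterns[idx]] * show_steps
            frames += [np.zeros(size)] * gap_steps  # zero input between patterns
    target = patterns[rng.integers(n_patterns)]
    degraded = target.copy()
    zeroed = rng.choice(size, size=size // 2, replace=False)
    degraded[zeroed] = 0.0                          # half of the bits set to 0
    frames += [degraded] * test_steps
    return np.stack(frames), target

rng = np.random.default_rng(0)
inputs, target = make_episode(rng)
print(inputs.shape, target.shape)
```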
<br />
The architecture of the network is described as follows: <br />
<br />
1) It is a fully connected RNN with one neuron per pattern element, plus one fixed-output neuron. There are a total of 1001 neurons. <br />
<br />
2) The value of each neuron is clamped to the value of the corresponding element in the pattern if that value is not 0. If the value is 0, the corresponding neuron does not receive pattern input and must use the input it gets from lateral connections to reconstruct the correct, expected output value. <br />
<br />
3) Outputs are read from the activation of the neurons. <br />
<br />
4) The performance evaluation is done by computing the loss between the final network output and the correct expected pattern. <br />
<br />
5) The gradient of the error with respect to the <math display = "inline">w_{i,j}</math> and <math display = "inline">\alpha_{i,j}</math> coefficients is computed by backpropagation, and the coefficients are optimized with the Adam solver with learning rate 0.001. <br />
<br />
6) The simple decaying Hebbian formula in Equation 2 is used to update the Hebbian traces. Each network has 2 trainable parameters, <math display = "inline">w</math> and <math display = "inline">\alpha</math>, for each connection, for a total of 1001 <math display = "inline">\times</math> 1001 <math display = "inline">\times</math> 2 = 2004002 trainable parameters. <br />
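<br />
The clamping rule in step 2 and the parameter count in step 6 can be checked with a few lines of NumPy. The helper name <code>clamp</code> is hypothetical and the sketch only illustrates the rule as described here, not the authors' code.<br />

```python
import numpy as np

def clamp(activations, pattern_input):
    """Overwrite activations wherever the pattern input is nonzero.

    Neurons whose input is 0 keep the value computed from lateral connections.
    """
    clamped = activations.copy()
    mask = pattern_input != 0
    clamped[mask] = pattern_input[mask]
    return clamped

# One neuron per pattern element plus one fixed-output neuron.
N_NEURONS = 1000 + 1
# One fixed weight w and one plasticity coefficient alpha per connection.
N_PARAMS = 2 * N_NEURONS * N_NEURONS

free = np.array([0.2, -0.3, 0.5])
pattern = np.array([1.0, 0.0, -1.0])
print(clamp(free, pattern), N_PARAMS)
```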
<br />
[[File:exp1results.png | 650px|thumb|center|Figure 2:Experiment1 - Pattern Memorization Results]]<br />
<br />
<br />
The results are shown in Figure 2, where 10 runs are considered. The error becomes quite low after about 200 episodes of training. <br />
<br />
[[File:exp1nonplasticresults.png| 650px|thumb|center|Figure 3: Pattern Memorization results with non plastic networks]]<br />
<br />
Comparison with Non-Plastic Networks: <br />
<br />
1) In principle, non-plastic networks can solve this task, but they require additional neurons. In practice, the authors report that the task is not solved using a non-plastic RNN or LSTM. <br />
<br />
2) Figure 3 shows the results using non-plastic networks. The best results required the addition of 2000 extra neurons. <br />
<br />
3) For the non-plastic RNN, the error flattens around 0.13, which is quite high. Using LSTMs, the task can be solved, albeit imperfectly, with the error eventually falling to around 0.001. <br />
<br />
4) The plastic network solves the task quickly: the mean error falls below 0.01 within 2000 episodes, which the authors report to be 250 times faster than the LSTM.<br />
<br />
<br />
<br />
<br />
<br />
<br />
<br />
= Experiment 2 - Memorizing natural images =<br />
<br />
This is an image reconstruction task in which a network is shown a set of natural images that it must memorize. One image is then chosen at random and half of it is displayed to the network, whose task is to complete the image. Natural images with graded pixel values contain more information per element than the binary patterns of the previous experiment, so this task is inherently more complex. The paper shows that the plastic network solves this task, which the authors report other state-of-the-art network architectures fail to solve. <br />
<br />
The experiment is as follows: <br />
<br />
1) Images are from the CIFAR-10 database where there are a total of 60000 images each of size 32 <math display = "inline">\times</math> 32. <br />
<br />
2) The architecture has 1025 neurons in total with a total of 2 <math display = "inline">\times</math> 1025 <math display = "inline">\times</math> 1025 = 2101250 parameters. <br />
<br />
3) Each episode has 3 pictures, each shown 3 times for 20 time steps per presentation, with 3 time steps of zero input between presentations. <br />
<br />
4) The images are degraded by zeroing out one full contiguous half of the image, to prevent the trivial solution of simply reconstructing each missing pixel as the average of its neighbors.<br />
<br />
[[File:exp2results.png| 650px|thumb|center|Figure 4: Natural Image memorization results]]<br />
<br />
<br />
<br />
The results are shown in Figure 4. The last column shows the final output of the network, i.e. the reconstructed image. The results show that the model has learned to perform this task. <br />
<br />
[[File:exp2weights.png| 650px|thumb|center|Figure 5: Final matrices and plasticity coefficients]]<br />
<br />
The final weight matrix and plasticity coefficient matrix are shown in Figure 5. The plasticity matrix shows structure related to the high correlation of neighboring pixels and to the half-field zeroing of test images. <br />
<br />
The fully plastic network is compared against a similar architecture with shared plasticity coefficients, in which all connections share a single trainable <math display = "inline">\alpha</math> value. <br />
<br />
[[File:independentvsshared.png| 650px|thumb|center|Figure 6: Comparing independent and shared <math display = "inline">\alpha</math> value runs]]<br />
<br />
Figure 6 shows the result of this comparison: an independent plasticity coefficient for each connection gives better performance. Thus the structure observed in the learned matrices is actually useful.<br />
<br />
<br />
= Experiment 3 - Omniglot task =<br />
<br />
This task involves handwritten symbol recognition. It is a standard task for one-shot and few-shot learning. <br />
<br />
Experimental Setup: <br />
<br />
1) The Omniglot data set is a collection of handwritten characters from various writing systems, including 20 instances each of 1,623 different handwritten characters, written by different subjects.<br />
<br />
2) In each episode, N character classes are randomly selected and K instances from each class are sampled. <br />
<br />
3) These instances, together with the class label (from 1 to N), are shown to the model. <br />
<br />
4) Then, a new, unlabelled instance is sampled from one of the N classes and shown to the model.<br />
<br />
5) Model performance is defined as the model’s accuracy in classifying this unlabelled example.<br />
<br />
Architecture: <br />
<br />
1) Model architecture has 4 convolutional layers with 3 <math display = "inline">\times</math> 3 receptive fields and 64 channels. <br />
<br />
2) All convolutions have a stride of 2 to reduce the dimensionality between layers. <br />
<br />
3) The output is a single vector of 64 features, which feeds into an N-way softmax. <br />
<br />
4) The label of the current character is also concurrently fed as a one-hot encoding to this softmax layer, to serve as a guide for the correct output when a label is present.<br />
<br />
Plasticity in the architecture: <br />
<br />
1) Plasticity is applied to the weights from the final layer to the softmax layer, leaving the rest of the convolutional embedding non-plastic. <br />
<br />
2) The expectation is that the convolutional architecture will learn an adequate discriminant between arbitrary handwritten characters, while the plastic weights learn to memorize associations between observed patterns and outputs. <br />
<br />
Data Preparation: <br />
<br />
1) The dataset is augmented with rotations by multiples of <math display = "inline">90</math> degrees. <br />
<br />
2) It is divided into 1,523 classes for training and 100 classes (together with their augmentations) for testing. <br />
<br />
3) The networks are trained with an Adam optimizer with a learning rate 3 <math display = "inline">\times 10^{-5}</math>, multiplied by 2/3 every 1M episodes over 5,000,000 episodes. <br />
<br />
4) To evaluate final model performance, 10 models are trained with different random seeds and each of those is tested on 100 episodes using previously unseen test classes.<br />
<br />
Results: <br />
<br />
1) The overall accuracy (i.e. the proportion of episodes with correct classification, aggregated over all test episodes of all runs) is 98.3%, with a 95% confidence interval of 0.80%.<br />
<br />
2) The median accuracy across the 10 runs was 98.5%, indicating consistency in learning.<br />
<br />
{| class="wikitable"<br />
|-<br />
! Memory Networks<br />
! Matching Networks<br />
! ProtoNets<br />
! Memory Module<br />
! MAML<br />
! SNAIL<br />
! DP (this paper)<br />
|-<br />
| 82.8%<br />
| 98.1%<br />
| 97.4%<br />
| 98.4%<br />
| 98.7% <math display = "inline">\pm</math> 0.4<br />
| 99.07% <math display = "inline">\pm</math> 0.16<br />
| 98.3% <math display = "inline">\pm</math> 0.80<br />
|}<br />
<br />
<br />
<br />
3) The table above compares performance with other, non-plastic approaches. The results of the plastic approach are largely similar to those reported for the computationally intensive MAML method and the classification-specialized Matching Networks method. <br />
<br />
4) Performance is slightly below that reported for the SNAIL method, which trains a whole additional temporal-convolution network on top of the convolutional architecture and thus has many more parameters.<br />
<br />
5) The conclusion is that a few plastic connections to the output of the network allow for competitive one-shot learning over arbitrary man-made visual symbols.<br />
<br />
= Experiment 4 - Reinforcement learning Maze navigation task =<br />
<br />
This is a maze exploration task in which an agent must learn to reach a reward location. The plastic networks are shown to outperform non-plastic ones. <br />
<br />
Experimental setup: <br />
<br />
1) The maze is composed of 9 <math display = "inline">\times</math> 9 squares, surrounded by walls, in which every other square (in either direction) is occupied by a wall. <br />
<br />
[[File:exp4maze.png| 650px|thumb|center|Figure 7: Maze Environment]]<br />
<br />
<br />
2) The maze contains 16 wall squares arranged in a regular grid, as shown in Figure 7. <br />
<br />
3) At each episode, one non-wall square is randomly chosen as the reward location. When the agent hits this location, it receives a large reward (10.0) and is immediately transported to a random location in the maze. (Also, a small negative reward of -0.1 is provided every time the agent tries to walk into a wall.)<br />
<br />
4) Each episode lasts 250 time steps, during which the agent must accumulate as much reward as possible. The reward location is fixed within an episode and randomized across episodes. <br />
<br />
5) The reward is invisible to the agent, and thus the agent only knows it has hit the reward location by the activation of the reward input at the next step.<br />
<br />
6) Inputs to the agent consist of a binary vector describing the 3 <math display = "inline">\times</math> 3 neighborhood centered on the agent (each element being set to 1 if the corresponding square is a wall and 0 otherwise), together with the reward at the previous time step. <br />
<br />
7) The A2C algorithm is used to meta-train the network. <br />
<br />
8) The experiments are run under three conditions: full differentiable plasticity, no plasticity at all, and homogeneous plasticity in which all connections share the same (learnable) <math display = "inline">\alpha</math> parameter. <br />
<br />
9) For each condition, 15 runs with different random seeds are performed. <br />
<br />
<br />
Architecture: <br />
<br />
1) It is a simple recurrent network with 200 neurons, with a softmax layer on top of it to select between the 4 possible actions (up, right, left or down).<br />
<br />
<br />
[[File:exp4performance.png| 650px|thumb|center|Figure 8: Performance curve for the maze navigation experiment]]<br />
<br />
<br />
Results: <br />
<br />
1) The results are shown in Figure 8. The plastic network performs considerably better than the other networks.<br />
<br />
2) The non-plastic and homogeneous networks get stuck on a suboptimal policy. <br />
<br />
3) Thus, the conclusion is that, in this domain, individually sculpting the plasticity of each connection is crucial for reaping the benefits of plasticity.<br />
<br />
= Conclusions =<br />
<br />
<br />
The important contributions from this paper are as follows: <br />
<br />
1) The results show that simple plastic models support efficient meta-learning.<br />
<br />
2) Gradient descent itself is shown to be capable of optimizing the plasticity of a meta-learning system. <br />
<br />
3) The meta-learning is shown to vastly outperform alternative options in the experiments considered. <br />
<br />
4) The method achieved competitive results on the Omniglot one-shot classification task, close to those of the strongest methods it was compared against. <br />
<br />
= Open Source Code =<br />
<br />
Code for this paper can be found at: https://github.com/uber-common/differentiable-plasticity<br />
<br />
<br />
= Critiques =<br />
<br />
The paper addresses the important problem of learning to learn ("meta-learning") and provides a novel framework based on gradient descent to achieve this objective. It leaves a large scope for future work, as many widely used architectures such as LSTMs could be combined with a plastic component. Applications in deep reinforcement learning are also plentiful, and there is a good possibility of beating current baselines on popular test beds such as Atari games using plastic networks. This paper opens up possibilities for a whole class of meta-learning algorithms. <br />
<br />
With regards to the drawbacks of the paper, the paper does not mention how plastic networks will behave if the test sets are completely different from the training dataset. Will the performance be the same as non-plastic networks? It is not very clear if this method will be scalable as there are a large number of parameters to be determined even with the simplest of problems. Also, each experimental domain considered in this paper needed significantly different network architectures (for example in the Omniglot domain plasticity was applied only for the final layers). The paper does not mention any reasons for the specific decisions and if such differences will hold good for other similar problems as well. There has been work in transfer learning applied to both supervised learning and reinforcement learning problems. The authors should have ideally compared plastic networks to performances of some algorithms there as these methods transfer existing knowledge to other related problems and also prevent the need to start training from scratch much similar to the methods adopted in this paper.</div>Mpaflahttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=stat946F18/differentiableplasticity&diff=37471stat946F18/differentiableplasticity2018-11-01T02:10:30Z<p>Mpafla: /* Model */</p>
<hr />
<div>'''Differentiable Plasticity: ''' Summary of the ICML 2018 paper https://arxiv.org/abs/1804.02464<br />
<br />
= Presented by =<br />
<br />
1. Ganapathi Subramanian, Sriram [Quest ID: 20676799]<br />
<br />
= Motivation =<br />
1. Neural networks typically have a static architecture. Once a neural network is trained, its components (e.g. network connections) cannot be changed, so learning effectively stops after the training step. If a different task needs to be considered, the agent must be trained again from scratch. <br />
<br />
2. Plasticity is a characteristic of biological systems such as humans, in which connections between neurons change over time. This enables lifelong learning: biological systems adapt to dynamic changes in the environment with great sample efficiency. This synaptic plasticity is commonly described by Hebb's rule (i.e. if a neuron repeatedly takes part in making another neuron fire, the connection between them is strengthened). Artificial neural networks are still very far from achieving synaptic plasticity. <br />
<br />
3. Differentiable plasticity is a step in this direction. The behavior of the plastic connections is trained using gradient descent so that previously trained networks can adapt to changing conditions. <br />
<br />
Example: Using current state-of-the-art supervised learning methods, we can train a neural network to recognize the specific letters that it has seen during training. With lifelong learning, the agent could learn to recognize characters from any alphabet, including those it was never exposed to during training.<br />
<br />
= Objectives =<br />
The paper has the following objectives: <br />
<br />
1. To tackle the problem of meta-learning (learning to learn). <br />
<br />
2. To design neural networks with plastic connections that remain differentiable, so that they can be trained with backpropagation. <br />
<br />
3. To use backpropagation to optimize both the base weights and the amount of plasticity in each connection. <br />
<br />
4. To demonstrate the performance of such networks on three complex and different domains, namely pattern memorization, one-shot classification, and reinforcement learning.<br />
<br />
= Important Terms =<br />
<br />
Hebb’s rule: This is a famous rule in neuroscience. It defines the relationship of activities between neurons with their connection. It states that if a neuron repeatedly takes part in making another neuron fire, the connection between them is strengthened. Also summarized as "neurons that fire together, wire together".<br />
<br />
= Related Work =<br />
<br />
Previous approaches to solving this problem are summarized below: <br />
<br />
1. Train standard recurrent neural networks to incorporate past experience in their future responses within each episode. To add learning abilities, the RNN is coupled with an external content-addressable memory bank. An attention mechanism within the controller network reads from and writes to the memory bank and thus enables fast memorization. <br />
<br />
2. Augment each weight with a plastic component that automatically grows and decays as a function of inputs and outputs. All connections have the same non-trainable plasticity and only the corresponding weights are trained. Recent approaches have tried fast weights, which augment recurrent networks with fast-changing Hebbian weights and compute activations at each step. The network has a high bias towards recently seen patterns. <br />
<br />
3. Optimize the learning rule itself instead of the connections. A parametrized learning rule is used where the structure of the network is fixed beforehand. <br />
<br />
4. Have all the weight updates computed on the fly by the network itself or by a separate network at each time step. The advantage is flexibility; the drawback is the large learning burden placed on the network. <br />
<br />
5. Perform gradient descent via backpropagation during the episode. The meta-learning involves training the base network so that it can be fine-tuned using additional gradient descent. <br />
<br />
6. For classification tasks, a separate embedding is trained to discriminate between different classes. Classification is then a comparison between the embedding of the test and example instances.<br />
<br />
The advantages of trainable synaptic plasticity for meta-learning are as follows: <br />
<br />
1. Great potential for flexibility. For example, Memory Networks enforce a specific memory storage model in which memories must be embedded in fixed-size vectors and retrieved through some attention mechanism. In contrast, trainable synaptic plasticity translates into very different forms of memory, the exact implementation of which can be determined by the (trainable) network structure.<br />
<br />
2. Fixed-weight recurrent networks require neurons to be used for both storage and computation, which increases the computational burden on neurons. This is avoided in the approach suggested in the paper. <br />
<br />
3. Non-trainable plasticity networks can exploit network connectivity for storage of short-term information, but their uniform, non-trainable plasticity imposes a stereotypical behavior on these memories. With trainable synaptic plasticity, the amount and rate of plasticity are actively molded by the mechanism itself, which also allows for more sustained memory. <br />
<br />
= Model =<br />
<br />
The formulation proposed in the paper keeps the plastic and non-plastic components of each connection separate, while allowing multiple Hebbian rules to be easily defined. <br />
<br />
Model Components: <br />
<br />
1. A connection between any two neurons <math display = "inline">i</math> and <math display = "inline">j</math> has both a fixed component and a plastic component. <br />
<br />
2. The fixed part is just a traditional connection weight <math display = "inline">w_{i,j}</math>. The plastic part is stored in a Hebbian trace <math display = "inline">H_{i,j}</math>, which varies during a lifetime according to ongoing inputs and outputs. <br />
<br />
3. The relative importance of plastic and fixed components in the connection is structurally determined by the plasticity coefficient <math display = "inline">\alpha_{i,j}</math>, which multiplies the Hebbian trace to form the full plastic component of the connection. <br />
<br />
The network equations for the output <math display = "inline">x_j(t)</math> of the neuron <math display = "inline">j</math> are as follows: <br />
<br />
<br />
<math display="block"><br />
x_j(t) = \sigma\left\{ \sum_{i \in \text{inputs}} \left[ w_{i,j}x_i(t-1) + \alpha_{i,j} H_{i,j}(t)x_i(t-1) \right] \right\}<br />
</math><br />
<br />
<br />
<br />
<math display="block"><br />
H_{i,j}(t+1) = \eta x_i(t-1) x_j(t) + (1 - \eta) H_{i,j}(t) <br />
</math><br />
<br />
Here the first equation gives the activation of neuron <math display = "inline">j</math>, where <math display = "inline">w_{i,j}x_i(t-1)</math> is the fixed component and <math display = "inline">\alpha_{i,j} H_{i,j}(t)x_i(t-1)</math> is the plastic component. <math display = "inline">\sigma</math> is a nonlinear function, always chosen to be tanh in this paper. The Hebbian trace <math display = "inline">H_{i,j}</math> in the second equation is updated as a function of ongoing inputs and outputs. <br />
<br />
From the first equation above, a connection can be fully fixed (if <math display = "inline">\alpha_{i,j} = 0</math>), fully plastic (if <math display = "inline">w_{i,j} = 0</math>), or have both fixed and plastic components. <br />
<br />
<br />
<br />
The terms <math display = "inline">w_{i,j}</math> and <math display = "inline">\alpha_{i,j}</math> are the structural parameters trained by gradient descent. The parameter <math display = "inline">\eta</math>, which acts as the learning rate of plasticity, is also optimized as part of the network. After this training, the agent can learn automatically from ongoing experience. In Equation 2 above, the <math display = "inline">(1 - \eta)</math> decay term makes the Hebbian traces decay to 0 in the absence of input. An alternative form that avoids this decay is Oja's rule: <br />
<br />
<br />
<math display="block"><br />
H_{i,j}(t+1) = H_{i,j}(t) + \eta x_j(t)(x_i(t-1) - x_j(t)H_{i,j}(t))<br />
</math><br />
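<br />
The two update rules can be compared with a minimal sketch of one time step of a fully connected plastic layer. It assumes the formulation in which each input contributes <math display = "inline">(w_{i,j} + \alpha_{i,j} H_{i,j}(t))x_i(t-1)</math>; the function name <code>plastic_step</code>, its arguments, and the nested-list layout are illustrative choices and are not taken from the authors' code. <br />

```python
import math


def plastic_step(x_prev, w, alpha, hebb, eta, rule="decay"):
    """One time step of a fully connected plastic layer (illustrative sketch).

    x_prev         : list of n activations x_i(t-1)
    w, alpha, hebb : n x n nested lists indexed [i][j] (from neuron i to neuron j)
    eta            : plasticity learning rate
    rule           : "decay" for the decaying Hebbian trace, "oja" for Oja's rule
    Returns (x_new, hebb_new).
    """
    if rule not in ("decay", "oja"):
        raise ValueError("rule must be 'decay' or 'oja'")
    n = len(x_prev)

    # Activation: fixed weight plus plastic component, squashed by tanh.
    x_new = []
    for j in range(n):
        total = sum((w[i][j] + alpha[i][j] * hebb[i][j]) * x_prev[i] for i in range(n))
        x_new.append(math.tanh(total))

    # Hebbian trace update from pre-synaptic x_i(t-1) and post-synaptic x_j(t).
    hebb_new = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if rule == "decay":
                hebb_new[i][j] = eta * x_prev[i] * x_new[j] + (1 - eta) * hebb[i][j]
            else:
                hebb_new[i][j] = hebb[i][j] + eta * x_new[j] * (x_prev[i] - x_new[j] * hebb[i][j])
    return x_new, hebb_new
```

With zero input and zero weights, the decaying rule shrinks every trace by a factor of <math display = "inline">1 - \eta</math> per step, while Oja's rule leaves the traces unchanged, which is the difference described above. <br />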
<br />
= Experiment1 - Binary Pattern Memorization =<br />
<br />
<br />
<br />
This test involves quickly memorizing sets of arbitrary high-dimensional patterns and reconstructing them when exposed to partial, degraded versions. This is a very simple test, as it is already known that hand-designed recurrent networks with Hebbian plastic connections can solve it for binary patterns.<br />
<br />
<br />
<br />
[[File:binarypatternrecog.png | 650px|thumb|center|Figure 1: Pattern Memorization experiment - Input Structure and Architecture]]<br />
<br />
<br />
<br />
Steps in the experiment: <br />
<br />
<br />
1) The network is shown a set of five binary patterns in succession, as shown in Figure 1. Each pattern has 1000 elements, each of which takes one of two binary values (1 or -1). Here 1 corresponds to dark red and -1 corresponds to dark blue. <br />
<br />
2) The few-shot learning paradigm is followed: each pattern is shown for 10 time steps, with 3 time steps of zero input between presentations, and the whole sequence of patterns is presented 3 times in random order. <br />
<br />
3) One of the presented patterns is chosen at random and degraded by setting half of its bits to 0. <br />
<br />
4) This degraded pattern is then fed to the network, which has to reproduce the correct full pattern in its output using the memory it formed during the presentations. <br />
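<br />
The steps above can be sketched as a small episode generator. This is only an illustration under stated assumptions: the name <code>make_episode</code> and its parameters are hypothetical, a zero-input gap is placed after every presentation (including the last), and the bits to zero out are chosen uniformly at random. <br />

```python
import random


def make_episode(rng, n_patterns=5, size=1000, show_steps=10, gap_steps=3, n_rounds=3):
    """Build one pattern-memorization episode (illustrative sketch).

    Returns (inputs, degraded, target):
      inputs   : list of per-time-step input vectors shown to the network
      degraded : the test pattern with half of its elements set to 0
      target   : the full pattern the network should reconstruct
    """
    # Random binary patterns with elements in {-1, 1}.
    patterns = [[rng.choice((-1, 1)) for _ in range(size)] for _ in range(n_patterns)]

    inputs = []
    for _ in range(n_rounds):
        order = list(range(n_patterns))
        rng.shuffle(order)  # each round presents all patterns in a fresh random order
        for idx in order:
            inputs.extend([patterns[idx]] * show_steps)  # pattern presentation
            inputs.extend([[0] * size] * gap_steps)      # zero input between presentations

    # Pick one pattern and degrade it by zeroing half of its elements.
    target = rng.choice(patterns)
    zeroed = set(rng.sample(range(size), size // 2))
    degraded = [0 if k in zeroed else v for k, v in enumerate(target)]
    return inputs, degraded, target
```

With the default settings, one episode contains 3 rounds of 5 presentations, each lasting 10 + 3 time steps, before the degraded pattern is shown. <br />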
<br />
The architecture of the network is described as follows: <br />
<br />
1) It is a fully connected RNN with one neuron per pattern element, plus one fixed-output neuron. There are a total of 1001 neurons. <br />
<br />
2) The value of each neuron is clamped to the value of the corresponding element in the pattern if that value is not 0. If the value is 0, the corresponding neuron does not receive pattern input and must use what it gets from lateral connections to reconstruct the correct, expected output value. <br />
<br />
3) Outputs are read from the activation of the neurons. <br />
<br />
4) The performance evaluation is done by computing the loss between the final network output and the correct expected pattern. <br />
<br />
5) The gradient of the error over the <math display = "inline">w_{i,j}</math> and the <math display = "inline">\alpha_{i,j}</math> coefficients is computed by backpropagation, and the coefficients are optimized with the Adam optimizer with learning rate 0.001. <br />
<br />
6) The simple decaying Hebbian formula in Equation 2 is used to update the Hebbian traces. Each network has 2 trainable parameters, <math display = "inline">w</math> and <math display = "inline">\alpha</math>, for each connection, so there are a total of 1001 <math display = "inline">\times</math> 1001 <math display = "inline">\times</math> 2 = 2004002 trainable parameters. <br />
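<br />
The parameter count in the last item can be checked with a one-line helper. The function name <code>plastic_param_count</code> is hypothetical, and the count assumes a fully connected network (self-connections included) and leaves out the single scalar <math display = "inline">\eta</math>. <br />

```python
def plastic_param_count(n_neurons):
    """Trainable w and alpha coefficients in a fully connected plastic network."""
    # One w and one alpha for every ordered pair of neurons.
    return 2 * n_neurons * n_neurons
```

For 1001 neurons this gives 2004002, and for 1025 neurons it gives 2101250. <br />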
<br />
[[File:exp1results.png | 650px|thumb|center|Figure 2:Experiment1 - Pattern Memorization Results]]<br />
<br />
<br />
The results are shown in the figure 2 where 10 runs are considered. The error becomes quite low after about 200 episodes of training. <br />
<br />
[[File:exp1nonplasticresults.png| 650px|thumb|center|Figure 3: Pattern Memorization results with non plastic networks]]<br />
<br />
Comparison with Non-Plastic Networks: <br />
<br />
1) In principle, non-plastic networks can solve this task, but they require additional neurons. In practice, the authors report that the task is not solved well by a non-plastic RNN or an LSTM. <br />
<br />
2) Figure 3 shows the results using non-plastic networks. The best results required the addition of 2000 extra neurons. <br />
<br />
3) For the non-plastic RNN, the error flattens around 0.13, which is quite high. Using LSTMs, the task can be solved, albeit imperfectly, with the error dropping to around 0.001. <br />
<br />
4) The plastic network solves the task very quickly, with the mean error going below 0.01 within 2000 episodes, which the authors report to be 250 times faster than the LSTM.<br />
<br />
<br />
<br />
<br />
<br />
<br />
<br />
= Experiment2 - Memorizing natural images =<br />
<br />
This is an image reconstruction task in which a network is shown a set of natural images that it must memorize. Natural images with graded pixel values contain more information per element than the binary patterns of the previous experiment, so this experiment is inherently more complex. One image is then chosen at random and half of it is displayed to the agent. The task is to complete the image. The paper shows that this method effectively solves this task, which other state-of-the-art network architectures fail to solve. <br />
<br />
The experiment is as follows: <br />
<br />
1) Images are from the CIFAR-10 database where there are a total of 60000 images each of size 32 <math display = "inline">\times</math> 32. <br />
<br />
2) The architecture has 1025 neurons in total with a total of 2 <math display = "inline">\times</math> 1025 <math display = "inline">\times</math> 1025 = 2101250 parameters. <br />
<br />
3) Each episode has 3 pictures, each shown 3 times for 20 time steps per presentation, with 3 time steps of zero input between the presentations. <br />
<br />
4) The images are degraded by zeroing out one full contiguous half of the image to prevent a trivial solution of simply reconstructing the missing pixel as the average of its neighbors.<br />
<br />
[[File:exp2results.png| 650px|thumb|center|Figure 4: Natural Image memorization results]]<br />
<br />
<br />
<br />
The results are shown in figure 4. The final output of the network is shown in the last column which is the reconstructed image. The results show that the model has learned to perform this task. <br />
<br />
[[File:exp2weights.png| 650px|thumb|center|Figure 5: Final matrices and plasticity coefficients]]<br />
<br />
The final weight matrix and plasticity coefficients matrix are shown in the figure 5. The plasticity matrix shows a structure related to the high correlation of neighboring pixels and half-field zeroing in test images. <br />
<br />
The full plastic network is compared against a similar architecture with shared plasticity coefficients, where all connections share the same <math display = "inline">\alpha</math> value. In that case, a single parameter shared across all connections is trained. <br />
<br />
[[File:independentvsshared.png| 650px|thumb|center|Figure 6: Comparing independent and shared <math display = "inline">\alpha</math> value runs]]<br />
<br />
Figure 6 shows the result of this comparison: an independent plasticity coefficient for each connection gives better performance. Thus the structure observed in the plasticity matrix is actually useful.<br />
<br />
<br />
= Experiment 3 - Omniglot task =<br />
<br />
This task involves handwritten symbol recognition. It is a standard task for one-shot and few-shot learning. <br />
<br />
Experimental Setup: <br />
<br />
1) The Omniglot data set is a collection of handwritten characters from various writing systems, including 20 instances each of 1,623 different handwritten characters, written by different subjects.<br />
<br />
2) In each episode, N character classes are randomly selected and K instances from each class are sampled. <br />
<br />
3) These instances, together with the class label (from 1 to N), are shown to the model. <br />
<br />
4) Then, a new, unlabelled instance is sampled from one of the N classes and shown to the model.<br />
<br />
5) Model performance is defined as the model’s accuracy in classifying this unlabelled example.<br />
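<br />
The episode structure above can be sketched as an N-way, K-shot sampler. The name <code>sample_episode</code>, the dictionary-of-lists data layout, and the use of hashable instance identifiers are assumptions made for illustration; the sketch also assumes that the unlabelled query instance is distinct from the labelled examples of its class. <br />

```python
import random


def sample_episode(rng, class_to_instances, n_way=5, k_shot=1):
    """Sample one N-way, K-shot episode (illustrative sketch).

    class_to_instances : dict mapping a class id to a list of hashable instance ids
    Returns (support, query, query_label) where support is a shuffled list of
    (instance, label) pairs with labels in 1..n_way.
    """
    # Pick N classes and assign them episode-local labels 1..N.
    classes = rng.sample(sorted(class_to_instances), n_way)

    support = []
    for label, cls in enumerate(classes, start=1):
        for inst in rng.sample(class_to_instances[cls], k_shot):
            support.append((inst, label))
    rng.shuffle(support)

    # Draw a new, unlabelled instance from one of the N classes.
    query_label = rng.randrange(1, n_way + 1)
    query_cls = classes[query_label - 1]
    used = {inst for inst, lab in support if lab == query_label}
    candidates = [inst for inst in class_to_instances[query_cls] if inst not in used]
    query = rng.choice(candidates)
    return support, query, query_label
```

The model is scored on whether it predicts <code>query_label</code> for <code>query</code>; labels are local to the episode, so the same character can receive a different label in another episode. <br />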
<br />
Architecture: <br />
<br />
1) Model architecture has 4 convolutional layers with 3 <math display = "inline">\times</math> 3 receptive fields and 64 channels. <br />
<br />
2) All convolutions have a stride of 2 to reduce the dimensionality between layers. <br />
<br />
3) The output is a single vector of 64 features, which feeds into an N-way softmax. <br />
<br />
4) The label of the current character is also concurrently fed as a one-hot encoding to this softmax layer, to serve as a guide for the correct output when a label is present.<br />
<br />
Plasticity in the architecture: <br />
<br />
1) Plasticity is applied to the weights from the final layer to the softmax layer, leaving the rest of the convolutional embedding non-plastic. <br />
<br />
2) The expectation is that the convolutional architecture will learn an adequate discriminant between arbitrary handwritten characters, while the plastic weights learn to memorize associations between observed patterns and outputs. <br />
<br />
Data Preparation: <br />
<br />
1) The dataset is augmented with rotations by multiples of <math display = "inline">90</math> degrees. <br />
<br />
2) It is divided into 1,523 classes for training and 100 classes (together with their augmentations) for testing. <br />
<br />
3) The networks are trained with an Adam optimizer with a learning rate 3 <math display = "inline">\times 10^{-5}</math>, multiplied by 2/3 every 1M episodes over 5,000,000 episodes. <br />
<br />
4) To evaluate final model performance, 10 models are trained with different random seeds and each of those is tested on 100 episodes using previously unseen test classes.<br />
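<br />
The learning-rate schedule in item 3 can be written as a step decay. The function name <code>learning_rate</code> and its parameter names are hypothetical, and the sketch assumes the rate is multiplied by 2/3 at the start of each new block of one million episodes. <br />

```python
def learning_rate(episode, base_lr=3e-5, factor=2 / 3, period=1_000_000):
    """Step-decay schedule: multiply the rate by `factor` every `period` episodes."""
    return base_lr * factor ** (episode // period)
```

Under this assumption the rate is <math display = "inline">3 \times 10^{-5}</math> for the first million episodes and <math display = "inline">2 \times 10^{-5}</math> for the second. <br />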
<br />
Results: <br />
<br />
1) The overall accuracy (i.e. the proportion of episodes with correct classification, aggregated over all test episodes of all runs) is 98.3%, with a 95% confidence interval of 0.80%.<br />
<br />
2) The median accuracy across the 10 runs was 98.5%, indicating consistency in learning.<br />
<br />
{| class="wikitable"<br />
|-<br />
! Memory Networks<br />
! Matching Networks<br />
! ProtoNets<br />
! Memory Module<br />
! MAML<br />
! SNAIL<br />
! DP(This paper)<br />
|-<br />
| 82.8%<br />
| 98.1%<br />
| 97.4%<br />
| 98.4%<br />
| 98.7% <math display = "inline">\pm</math> 0.4<br />
| 99.07% <math display = "inline">\pm</math> 0.16<br />
| 98.3% <math display = "inline">\pm</math> 0.80<br />
|}<br />
<br />
<br />
<br />
3) The above table shows the comparative performance across other non-plastic approaches. The results of the plastic approach are largely similar to those reported for the computationally intensive MAML method and the classification-specialized Matching Networks method. <br />
<br />
4) The performances are slightly below those reported for the SNAIL method, which trains a whole additional temporal-convolution network on top of the convolutional architecture thus having many more parameters.<br />
<br />
5) The conclusion is that a few plastic connections to the output of the network allow for competitive one-shot learning over arbitrary man-made visual symbols.<br />
<br />
= Experiment4 - Reinforcement learning Maze navigation task =<br />
<br />
This is a maze exploration task where the goal is to teach an agent to reach a goal. The plastic networks are shown to outperform non-plastic ones. <br />
<br />
Experimental setup: <br />
<br />
1) The maze is composed of 9 <math display = "inline">\times</math> 9 squares, surrounded by walls, in which every other square (in either direction) is occupied by a wall. <br />
<br />
[[File:exp4maze.png| 650px|thumb|center|Figure 7: Maze Environment]]<br />
<br />
<br />
2) The maze contains 16 wall squares arranged in a regular grid, as shown in Figure 7. <br />
<br />
3) At each episode, one non-wall square is randomly chosen as the reward location. When the agent hits this location, it receives a large reward (10.0) and is immediately transported to a random location in the maze. Also, a small negative reward of -0.1 is provided every time the agent tries to walk into a wall.<br />
<br />
4) Each episode lasts 250 time steps, during which the agent must accumulate as much reward as possible. The reward location is fixed within an episode and randomized across episodes. <br />
<br />
5) The reward is invisible to the agent, and thus the agent only knows it has hit the reward location by the activation of the reward input at the next step.<br />
<br />
6) Inputs to the agent consist of a binary vector describing the 3 <math display = "inline">\times</math> 3 neighborhood centered on the agent (each element being set to 1 or 0 if the corresponding square is or is not a wall), together with the reward at the previous time step. <br />
<br />
7) The A2C algorithm is used to meta-train the network. <br />
<br />
8) The experiments are run under three conditions: full differentiable plasticity, no plasticity at all, and homogeneous plasticity in which all connections share the same (learnable) <math display = "inline">\alpha</math> parameter. <br />
<br />
9) For each condition, 15 runs with different random seeds are performed. <br />
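<br />
The maze layout and the agent's input can be sketched as follows. This is one possible reading of the description, stated as an assumption: a 9 <math display = "inline">\times</math> 9 interior enclosed by a ring of wall squares (an 11 <math display = "inline">\times</math> 11 grid overall), with interior walls on every other square in both directions, which yields the 16 wall squares mentioned above. The names <code>build_maze</code> and <code>observe</code> are hypothetical. <br />

```python
def build_maze(interior=9):
    """Return a square grid of 0 (free) and 1 (wall) values (illustrative sketch)."""
    size = interior + 2  # add the surrounding ring of walls
    last = size - 1
    return [
        [
            1 if r in (0, last) or c in (0, last) or (r % 2 == 0 and c % 2 == 0) else 0
            for c in range(size)
        ]
        for r in range(size)
    ]


def observe(grid, row, col, prev_reward):
    """Binary 3 x 3 wall neighborhood around (row, col) plus the previous reward."""
    patch = [grid[r][c] for r in range(row - 1, row + 2) for c in range(col - 1, col + 2)]
    return patch + [prev_reward]
```

The agent never sees the reward location directly; the only trace of it in the input is <code>prev_reward</code>, in line with item 5 above. <br />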
<br />
<br />
Architecture: <br />
<br />
1) It is a simple recurrent network with 200 neurons, with a softmax layer on top of it to select between the 4 possible actions (up, right, left or down).<br />
<br />
<br />
[[File:exp4performance.png| 650px|thumb|center|Figure 8: Performance curve for the maze navigation experiment]]<br />
<br />
<br />
Results: <br />
<br />
1) The results are shown in the figure 8. The plastic network shows considerably better performance as compared to the other networks.<br />
<br />
2) The non-plastic and homogeneous networks get stuck on a suboptimal policy. <br />
<br />
3) Thus, the conclusion is that, in this domain, individually sculpting the plasticity of each connection is crucial for reaping the benefits of plasticity.<br />
<br />
= Conclusions =<br />
<br />
<br />
The important contributions from this paper are as follows: <br />
<br />
1) The results show that simple plastic models support efficient meta-learning.<br />
<br />
2) Gradient descent itself is shown to be capable of optimizing the plasticity of a meta-learning system. <br />
<br />
3) The meta-learning is shown to vastly outperform alternative options in the experiments considered. <br />
<br />
4) The method achieved competitive results on the Omniglot one-shot classification task, close to those of the strongest methods it was compared against. <br />
<br />
= Open Source Code =<br />
<br />
Code for this paper can be found at: https://github.com/uber-common/differentiable-plasticity<br />
<br />
<br />
= Critiques =<br />
<br />
The paper addresses the important problem of learning to learn ("meta-learning") and provides a novel framework based on gradient descent to achieve this objective. It leaves a large scope for future work, as many widely used architectures such as LSTMs could be combined with a plastic component. Applications in deep reinforcement learning are also plentiful, and there is a good possibility of beating current baselines on popular test beds such as Atari games using plastic networks. This paper opens up possibilities for a whole class of meta-learning algorithms. <br />
<br />
With regards to the drawbacks of the paper, the paper does not mention how plastic networks will behave if the test sets are completely different from the training dataset. Will the performance be the same as non-plastic networks? It is not very clear if this method will be scalable as there are a large number of parameters to be determined even with the simplest of problems. Also, each experimental domain considered in this paper needed significantly different network architectures (for example in the Omniglot domain plasticity was applied only for the final layers). The paper does not mention any reasons for the specific decisions and if such differences will hold good for other similar problems as well. There has been work in transfer learning applied to both supervised learning and reinforcement learning problems. The authors should have ideally compared plastic networks to performances of some algorithms there as these methods transfer existing knowledge to other related problems and also prevent the need to start training from scratch much similar to the methods adopted in this paper.</div>Mpaflahttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=stat946F18/differentiableplasticity&diff=37468stat946F18/differentiableplasticity2018-11-01T01:53:26Z<p>Mpafla: /* Motivation */</p>
<hr />
<div>'''Differentiable Plasticity: ''' Summary of the ICML 2018 paper https://arxiv.org/abs/1804.02464<br />
<br />
= Presented by =<br />
<br />
1. Ganapathi Subramanian, Sriram [Quest ID: 20676799]<br />
<br />
= Motivation =<br />
1. Neural networks typically have a static architecture. Once a neural network is trained, its components (e.g. network connections) cannot be changed, so learning effectively stops after the training step. If a different task needs to be considered, the agent must be trained again from scratch. <br />
<br />
2. Plasticity is a characteristic of biological systems such as humans, in which connections between neurons change over time. This enables lifelong learning: biological systems adapt to dynamic changes in the environment with great sample efficiency. This synaptic plasticity is commonly described by Hebb's rule (i.e. if a neuron repeatedly takes part in making another neuron fire, the connection between them is strengthened). Artificial neural networks are still very far from achieving synaptic plasticity. <br />
<br />
3. Differentiable plasticity is a step in this direction. The behavior of the plastic connections is trained using gradient descent so that previously trained networks can adapt to changing conditions. <br />
<br />
Example: Using current state-of-the-art supervised learning methods, we can train a neural network to recognize the specific letters that it has seen during training. With lifelong learning, the agent could learn to recognize characters from any alphabet, including those it was never exposed to during training.<br />
<br />
= Objectives =<br />
The paper has the following objectives: <br />
<br />
1. To tackle the problem of meta-learning (learning to learn). <br />
<br />
2. To design neural networks with plastic connections that remain differentiable, so that they can be trained with backpropagation. <br />
<br />
3. To use backpropagation to optimize both the base weights and the amount of plasticity in each connection. <br />
<br />
4. To demonstrate the performance of such networks on three complex and different domains, namely pattern memorization, one-shot classification, and reinforcement learning.<br />
<br />
= Important Terms =<br />
<br />
Hebb’s rule: This is a famous rule in neuroscience. It defines the relationship of activities between neurons with their connection. It states that if a neuron repeatedly takes part in making another neuron fire, the connection between them is strengthened. Also summarized as "neurons that fire together, wire together".<br />
<br />
= Related Work =<br />
<br />
Previous approaches to solving this problem are summarized below: <br />
<br />
1. Train standard recurrent neural networks to incorporate past experience in their future responses within each episode. To add learning abilities, the RNN is coupled with an external content-addressable memory bank. An attention mechanism within the controller network reads from and writes to the memory bank and thus enables fast memorization. <br />
<br />
2. Augment each weight with a plastic component that automatically grows and decays as a function of inputs and outputs. All connections have the same non-trainable plasticity and only the corresponding weights are trained. Recent approaches have tried fast weights, which augment recurrent networks with fast-changing Hebbian weights and compute activations at each step. The network has a high bias towards recently seen patterns. <br />
<br />
3. Optimize the learning rule itself instead of the connections. A parametrized learning rule is used where the structure of the network is fixed beforehand. <br />
<br />
4. Have all the weight updates computed on the fly by the network itself or by a separate network at each time step. The advantage is flexibility; the drawback is the large learning burden placed on the network. <br />
<br />
5. Perform gradient descent via backpropagation during the episode. The meta-learning involves training the base network so that it can be fine-tuned using additional gradient descent. <br />
<br />
6. For classification tasks, a separate embedding is trained to discriminate between different classes. Classification is then a comparison between the embedding of the test and example instances.<br />
<br />
The advantages of trainable synaptic plasticity for meta-learning are as follows: <br />
<br />
1. Great potential for flexibility. For example, Memory Networks enforce a specific memory storage model in which memories must be embedded in fixed-size vectors and retrieved through some attention mechanism. In contrast, trainable synaptic plasticity translates into very different forms of memory, the exact implementation of which can be determined by the (trainable) network structure.<br />
<br />
2. Fixed-weight recurrent networks require neurons to be used for both storage and computation, which increases the computational burden on neurons. This is avoided in the approach suggested in the paper. <br />
<br />
3. Non-trainable plasticity networks can exploit network connectivity for storage of short-term information, but their uniform, non-trainable plasticity imposes a stereotypical behavior on these memories. With trainable synaptic plasticity, the amount and rate of plasticity are actively molded by the mechanism itself, which also allows for more sustained memory. <br />
<br />
= Model =<br />
<br />
The formulation proposed in the paper keeps the plastic and non-plastic components of each connection separate, which also makes it easy to define multiple Hebbian rules. <br />
<br />
Model Components: <br />
<br />
1. A connection between any two neurons <math display = "inline">i</math> and <math display = "inline">j</math> has both a fixed component and a plastic component. <br />
<br />
2. The fixed part is just a traditional connection weight <math display = "inline">w_{i,j}</math> . The plastic part is stored in a Hebbian trace <math display = "inline">H_{i,j}</math>, which varies during a<br />
lifetime according to ongoing inputs and outputs. <br />
<br />
3. The relative importance of plastic and fixed components in the connection is structurally determined by the plasticity<br />
coefficient <math display = "inline">\alpha_{i,j}</math>, which multiplies the Hebbian trace to form<br />
the full plastic component of the connection. <br />
<br />
The network equations are as follows: <br />
<br />
<br />
<math display="block"><br />
x_j(t) = \sigma\left\{ \sum_{i \in \text{inputs}} \left[ w_{i,j}x_i(t-1) + \alpha_{i,j} H_{i,j}(t)x_i(t-1) \right] \right\}<br />
</math><br />
<br />
<br />
<br />
<math display="block"><br />
H_{i,j}(t+1) = \eta x_i(t-1) x_j(t) + (1 - \eta) H_{i,j}(t) <br />
</math><br />
<br />
Here <math display = "inline">x_j(t)</math> is the output of neuron <math display = "inline">j</math>. The first equation gives the activation of each neuron, where <math display = "inline">w_{i,j}x_i(t-1)</math> is the fixed component and the remaining term (<math display = "inline"> \alpha_{i,j} H_{i,j}(t)x_i(t-1) </math>) is the plastic component. <math display = "inline">\sigma</math> is a nonlinear function, always chosen to be tanh in this paper. The Hebbian trace <math display = "inline">H_{i,j}</math> in the second equation is updated as a function of ongoing inputs and outputs. <br />
<br />
From the first equation above, a connection can be fully fixed if <math display = "inline">\alpha = 0 </math>, fully plastic if <math display = "inline">w = 0</math>, or have both fixed and plastic components. <br />
<br />
<br />
<br />
The terms <math display = "inline">w_{i,j}</math> and <math display = "inline">\alpha_{i,j}</math> are the structural parameters trained by gradient descent. The parameter <math display = "inline">\eta</math>, the learning rate of plasticity, is also optimized as a parameter of the network. After this training, the agent can learn automatically from ongoing experience. In equation 2 above, <math display = "inline">\eta</math> also acts as a decay term, so the Hebbian traces decay to 0 in the absence of input. An alternative form of the update that avoids this decay, known as Oja's rule, is as follows: <br />
<br />
<br />
<math display="block"><br />
H_{i,j}(t+1) = H_{i,j}(t) + \eta x_j(t)(x_i(t-1) - x_j(t)H_{i,j}(t))<br />
</math><br />
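<br />
The activation equation and the two trace updates above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name <code>plastic_step</code>, the nested-list layout indexed as <code>[i][j]</code> (from neuron i to neuron j), and the <code>rule</code> switch are hypothetical choices made for this example. <br />

```python
import math


def plastic_step(x_prev, w, alpha, hebb, eta, rule="decay"):
    """One time step of a plastic recurrent layer (illustrative sketch).

    x_prev: list of activations x_i(t-1).
    w, alpha, hebb: n-by-n nested lists indexed [i][j].
    eta: scalar plasticity rate.
    """
    n = len(x_prev)
    # x_j(t) = tanh( sum_i (w_ij + alpha_ij * H_ij(t)) * x_i(t-1) )
    x_new = [
        math.tanh(sum((w[i][j] + alpha[i][j] * hebb[i][j]) * x_prev[i]
                      for i in range(n)))
        for j in range(n)
    ]
    hebb_new = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if rule == "decay":
                # H(t+1) = eta * x_i(t-1) * x_j(t) + (1 - eta) * H(t)
                hebb_new[i][j] = (eta * x_prev[i] * x_new[j]
                                  + (1.0 - eta) * hebb[i][j])
            elif rule == "oja":
                # H(t+1) = H(t) + eta * x_j(t) * (x_i(t-1) - x_j(t) * H(t))
                hebb_new[i][j] = hebb[i][j] + eta * x_new[j] * (
                    x_prev[i] - x_new[j] * hebb[i][j])
            else:
                raise ValueError("unknown rule: " + rule)
    return x_new, hebb_new
```

With an all-zero input, the decaying rule shrinks every trace by a factor of (1 - η) per step, whereas the second rule leaves the traces unchanged, which is the difference described above. <br />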
<br />
<br />
= Experiment1 - Binary Pattern Memorization =<br />
<br />
<br />
<br />
This test involves quickly memorizing sets of arbitrary high-dimensional patterns and reconstructing them when exposed to partial, degraded versions. It is a very simple test, as hand-designed recurrent networks with Hebbian plastic connections are already known to solve it for binary patterns.<br />
<br />
<br />
<br />
[[File:binarypatternrecog.png | 650px|thumb|center|Figure 1: Pattern Memorization experiment - Input Structure and Architecture]]<br />
<br />
<br />
<br />
Steps in the experiment: <br />
<br />
<br />
1) The network is shown a set of five binary patterns in succession, as shown in figure 1. Each pattern has 1000 elements, each of which takes one of two binary values (1 or -1). Here 1 corresponds to dark red and -1 corresponds to dark blue. <br />
<br />
2) The few-shot learning paradigm is followed: each pattern is shown for 10 time steps, with 3 time steps of zero input between presentations, and the whole sequence of patterns is presented 3 times in random order. <br />
<br />
3) One of the presented patterns is chosen at random and degraded by setting half of its bits to 0. <br />
<br />
4) This degraded pattern is then fed to the network. The network has to reproduce the correct full pattern in its output using the memory it formed while the patterns were presented. <br />
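<br />
The presentation schedule in the steps above can be sketched as follows. This is only an illustration of the episode structure as summarized here: the helper name <code>make_episode</code> and its arguments are hypothetical, and the seed is used only to make the example reproducible. <br />

```python
import random


def make_episode(n_patterns=5, size=1000, show_steps=10, gap_steps=3,
                 repeats=3, seed=0):
    """Build one pattern-memorization episode (illustrative sketch)."""
    rng = random.Random(seed)
    # Each pattern is a vector of +1 / -1 values.
    patterns = [[rng.choice((-1, 1)) for _ in range(size)]
                for _ in range(n_patterns)]
    blank = [0] * size
    inputs = []
    for _ in range(repeats):
        # The whole set is shown again in a fresh random order.
        order = list(range(n_patterns))
        rng.shuffle(order)
        for idx in order:
            inputs.extend([patterns[idx]] * show_steps)
            inputs.extend([blank] * gap_steps)
    # Pick one pattern and hide half of its bits by setting them to 0.
    target = rng.choice(patterns)
    hidden = set(rng.sample(range(size), size // 2))
    degraded = [0 if k in hidden else v for k, v in enumerate(target)]
    return inputs, degraded, target
```

With the default values, the presentation phase lasts 3 × 5 × (10 + 3) = 195 time steps before the degraded pattern is shown. <br />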
<br />
The architecture of the network is described as follows: <br />
<br />
1) It is a fully connected RNN with one neuron per pattern element, plus one fixed-output neuron. There are a total of 1001 neurons. <br />
<br />
2) The value of each neuron is clamped to the value of the corresponding element in the pattern if that value is not 0. If the value is 0, the corresponding neuron does not receive pattern input and must use what it gets from lateral connections to reconstruct the correct, expected output value. <br />
<br />
3) Outputs are read from the activation of the neurons. <br />
<br />
4) The performance evaluation is done by computing the loss between the final network output and the correct expected pattern. <br />
<br />
5) The gradient of the error with respect to the <math display = "inline">w_{i,j}</math> and <math display = "inline">\alpha_{i,j}</math> coefficients is computed by backpropagation, and the coefficients are optimized with the Adam solver with learning rate 0.001. <br />
<br />
6) The simple decaying Hebbian formula in Equation 2 is used to update the Hebbian traces. Each network has 2 trainable parameters, <math display = "inline">w</math> and <math display = "inline">\alpha</math>, for each connection, for a total of 1001 <math display = "inline">\times</math> 1001 <math display = "inline">\times</math> 2 = 2,004,002 trainable parameters. <br />
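<br />
The clamping in item 2 and the parameter count in item 6 can be illustrated with a short sketch. It assumes that the fixed and plastic parts have already been combined into a single effective weight matrix; the names <code>clamped_step</code> and <code>count_parameters</code> are hypothetical and do not come from the paper. <br />

```python
import math


def clamped_step(x_prev, pattern_input, weights):
    """Update all neurons, clamping those that receive a nonzero input.

    weights[i][j] is the effective weight from neuron i to neuron j
    (assumed here to already include the plastic component).
    """
    n = len(x_prev)
    x_new = []
    for j in range(n):
        if pattern_input[j] != 0:
            # Clamped: the neuron simply takes the pattern value.
            x_new.append(float(pattern_input[j]))
        else:
            # Not clamped: rely on lateral (recurrent) input only.
            lateral = sum(weights[i][j] * x_prev[i] for i in range(n))
            x_new.append(math.tanh(lateral))
    return x_new


def count_parameters(n_neurons):
    """Two trainable values (w and alpha) per directed connection."""
    return 2 * n_neurons * n_neurons
```

For the 1001-neuron network, <code>count_parameters(1001)</code> reproduces the 2,004,002 figure quoted in item 6. <br />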
<br />
[[File:exp1results.png | 650px|thumb|center|Figure 2:Experiment1 - Pattern Memorization Results]]<br />
<br />
<br />
The results over 10 runs are shown in figure 2. The error becomes quite low after about 200 episodes of training. <br />
<br />
[[File:exp1nonplasticresults.png| 650px|thumb|center|Figure 3: Pattern Memorization results with non plastic networks]]<br />
<br />
Comparison with Non-Plastic Networks: <br />
<br />
1) In principle, non-plastic networks can solve this task, but they require additional neurons to do so. In practice, the authors report that they were unable to solve the task using a non-plastic RNN or LSTM. <br />
<br />
2) Figure 3 shows the results using non-plastic networks. The best results required the addition of 2000 extra neurons. <br />
<br />
3) For the non-plastic RNN, the error flattens around 0.13, which is quite high. Using LSTMs, the task can be solved, albeit imperfectly, with the error eventually dropping to around 0.001. <br />
<br />
4) The plastic network solves the task very quickly, with the mean error going below 0.01 within 2000 episodes, which the authors report is 250 times faster than the LSTM.<br />
<br />
<br />
<br />
<br />
<br />
<br />
<br />
= Experiment2 - Memorizing natural images =<br />
<br />
This is an image reconstruction task in which a network is trained to memorize a set of natural images. Natural images with graded pixel values contain more information per element than the binary patterns of the previous experiment, so this experiment is inherently more complex. After the images are presented, one image is chosen at random and half of it is displayed to the agent, whose task is to complete the image. The paper shows that this method effectively solves this task, which other state-of-the-art network architectures fail to solve. <br />
<br />
The experiment is as follows: <br />
<br />
1) Images are from the CIFAR-10 database where there are a total of 60000 images each of size 32 <math display = "inline">\times</math> 32. <br />
<br />
2) The architecture has 1025 neurons in total with a total of 2 <math display = "inline">\times</math> 1025 <math display = "inline">\times</math> 1025 = 2101250 parameters. <br />
<br />
3) Each episode has 3 pictures, each shown 3 times for 20 time steps, with 3 time steps of zero input between the presentations. <br />
<br />
4) The images are degraded by zeroing out one full contiguous half of the image to prevent a trivial solution of simply reconstructing the missing pixel as the average of its neighbors.<br />
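<br />
The degradation in item 4 can be sketched as below. This is a hedged illustration: the function name <code>zero_half</code> is hypothetical, the image is assumed to be a single-channel list of rows, and the choice of which half to hide is a free parameter here because the summary does not say which half is used. <br />

```python
def zero_half(image, half="bottom"):
    """Return a copy of `image` with one contiguous half set to 0.

    image: list of rows, each a list of pixel values.
    half: one of "top", "bottom", "left", "right".
    """
    if half not in ("top", "bottom", "left", "right"):
        raise ValueError("unknown half: " + half)
    height, width = len(image), len(image[0])
    out = [list(row) for row in image]  # copy, leave the input untouched
    for r in range(height):
        for c in range(width):
            hidden = (
                (half == "top" and r < height // 2)
                or (half == "bottom" and r >= height // 2)
                or (half == "left" and c < width // 2)
                or (half == "right" and c >= width // 2)
            )
            if hidden:
                out[r][c] = 0.0
    return out
```

Because the hidden region is one contiguous block, most missing pixels have no visible neighbors, so averaging nearby pixels cannot recover them and the network has to rely on its memory of the image. <br />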
<br />
[[File:exp2results.png| 650px|thumb|center|Figure 4: Natural Image memorization results]]<br />
<br />
<br />
<br />
The results are shown in figure 4. The final output of the network is shown in the last column which is the reconstructed image. The results show that the model has learned to perform this task. <br />
<br />
[[File:exp2weights.png| 650px|thumb|center|Figure 5: Final matrices and plasticity coefficients]]<br />
<br />
The final weight matrix and plasticity coefficients matrix are shown in the figure 5. The plasticity matrix shows a structure related to the high correlation of neighboring pixels and half-field zeroing in test images. <br />
<br />
The fully plastic network is compared against a similar architecture with shared plasticity coefficients, in which a single trainable <math display = "inline">\alpha</math> value is shared across all connections. <br />
<br />
[[File:independentvsshared.png| 650px|thumb|center|Figure 6: Comparing independent and shared <math display = "inline">\alpha</math> value runs]]<br />
<br />
Figure 6 shows the result of this comparison: an independent plasticity coefficient for each connection performs better. Thus the structure observed in the learned matrices is actually useful.<br />
<br />
<br />
= Experiment 3 - Omniglot task =<br />
<br />
This task involves handwritten symbol recognition. It is a standard task for one-shot and few-shot learning. <br />
<br />
Experimental Setup: <br />
<br />
1) The Omniglot data set is a collection of handwritten characters from various writing systems, including 20 instances each of 1,623 different handwritten characters, written by different subjects.<br />
<br />
2) In each episode, N character classes are randomly selected and K instances from each class are sampled. <br />
<br />
3) These instances, together with the class label (from 1 to N), are shown to the model. <br />
<br />
4) Then, a new, unlabelled instance is sampled from one of the N classes and shown to the model.<br />
<br />
5) Model performance is defined as the model’s accuracy in classifying this unlabelled example.<br />
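<br />
The episode construction in the steps above can be sketched as follows. This is an illustration only: <code>sample_episode</code> and the dictionary layout of the dataset are hypothetical, and instances are assumed to be hashable identifiers (for example file names) rather than image arrays. <br />

```python
import random


def sample_episode(dataset, n_way=5, k_shot=1, seed=None):
    """Sample one N-way, K-shot episode (illustrative sketch).

    dataset: dict mapping a class name to a list of instances.
    Returns (support, query, query_label), with labels from 1 to N.
    """
    rng = random.Random(seed)
    # Pick N classes, then K labelled instances from each class.
    classes = rng.sample(sorted(dataset), n_way)
    support = []
    for label, cls in enumerate(classes, start=1):
        for instance in rng.sample(dataset[cls], k_shot):
            support.append((instance, label))
    rng.shuffle(support)
    # The query is a new instance from one of the N classes.
    query_label = rng.randrange(1, n_way + 1)
    shown = {inst for inst, lab in support if lab == query_label}
    candidates = [inst for inst in dataset[classes[query_label - 1]]
                  if inst not in shown]
    query = rng.choice(candidates)
    return support, query, query_label
```

Accuracy is then the fraction of episodes in which the model's prediction for <code>query</code> equals <code>query_label</code>. <br />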
<br />
Architecture: <br />
<br />
1) Model architecture has 4 convolutional layers with 3 <math display = "inline">\times</math> 3 receptive fields and 64 channels. <br />
<br />
2) All convolutions have a stride of 2 to reduce the dimensionality between layers. <br />
<br />
3) The output is a single vector of 64 features, which feeds into an N-way softmax. <br />
<br />
4) The label of the current character is also concurrently fed as a one-hot encoding to this softmax layer, to serve as a guide for the correct output when a label is present.<br />
<br />
Plasticity in the architecture: <br />
<br />
1) Plasticity is applied to the weights from the final layer to the softmax layer, leaving the rest of the convolutional embedding non-plastic. <br />
<br />
2) The expectation is that the convolutional architecture will learn an adequate discriminant between arbitrary handwritten characters, while the plastic weights learn to memorize associations between observed patterns and outputs. <br />
<br />
Data Preparation: <br />
<br />
1) The dataset is augmented with rotations by multiples of <math display = "inline">90</math> degrees. <br />
<br />
2) It is divided into 1,523 classes for training and 100 classes (together with their augmentations) for testing. <br />
<br />
3) The networks are trained with the Adam optimizer with a learning rate of 3 <math display = "inline">\times 10^{-5}</math>, multiplied by 2/3 every 1,000,000 episodes, over 5,000,000 episodes. <br />
<br />
4) To evaluate final model performance, 10 models are trained with different random seeds and each of those is tested on 100 episodes using previously unseen test classes.<br />
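<br />
The learning rate schedule in item 3 can be written out as a small function. This sketch assumes a step-wise decay (the rate changes only at multiples of 1,000,000 episodes), which is one reading of the description; the function name <code>learning_rate</code> is hypothetical. <br />

```python
def learning_rate(episode, base_lr=3e-5, decay=2.0 / 3.0, every=1_000_000):
    """Step-wise schedule: multiply by `decay` once per `every` episodes."""
    return base_lr * decay ** (episode // every)
```

Under this reading the rate falls from 3 × 10^-5 to 2 × 10^-5 after the first million episodes, and to 3 × 10^-5 × (2/3)^4 during the final million. <br />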
<br />
Results: <br />
<br />
1) The overall accuracy (i.e. the proportion of episodes with correct classification, aggregated over all test episodes of all runs) is 98.3%, with a 95% confidence interval of 0.80%.<br />
<br />
2) The median accuracy across the 10 runs was 98.5%, indicating consistency in learning.<br />
<br />
{| class="wikitable"<br />
|-<br />
! Memory Networks<br />
! Matching Networks<br />
! ProtoNets<br />
! Memory Module<br />
! MAML<br />
! SNAIL<br />
! DP (this paper)<br />
|-<br />
| 82.8%<br />
| 98.1%<br />
| 97.4%<br />
| 98.4%<br />
| 98.7% <math display = "inline">\pm</math> 0.4<br />
| 99.07% <math display = "inline">\pm</math> 0.16<br />
| 98.3% <math display = "inline">\pm</math> 0.80<br />
|}<br />
<br />
<br />
<br />
3) The table above compares the plastic approach with other, non-plastic approaches. The results of the plastic approach are largely similar to those reported for the computationally intensive MAML method and the classification-specialized Matching Networks method. <br />
<br />
4) The performance is slightly below that reported for the SNAIL method, which trains a whole additional temporal-convolution network on top of the convolutional architecture and thus has many more parameters.<br />
<br />
5) The conclusion is that a few plastic connections to the output of the network allow for competitive one-shot learning over arbitrary man-made visual symbols.<br />
<br />
= Experiment4 - Reinforcement learning Maze navigation task =<br />
<br />
This is a maze exploration task where the goal is to teach an agent to reach a goal. The plastic networks are shown to outperform non-plastic ones. <br />
<br />
Experimental setup: <br />
<br />
1) The maze is composed of 9 <math display = "inline">\times</math> 9 squares, surrounded by walls, in which every other square (in either direction) is occupied by a wall. <br />
<br />
[[File:exp4maze.png| 650px|thumb|center|Figure 7: Maze Environment]]<br />
<br />
<br />
2) The maze contains 16 wall squares arranged in a regular grid, as shown in figure 7. <br />
<br />
3) At each episode, one non-wall square is randomly chosen as the reward location. When the agent hits this location, it receives a large reward (10.0) and is immediately transported to a random location in the maze. A small negative reward of -0.1 is also given every time the agent tries to walk into a wall.<br />
<br />
4) Each episode lasts 250 time steps, during which the agent must accumulate as much reward as possible. The reward location is fixed within an episode and randomized across episodes. <br />
<br />
5) The reward is invisible to the agent, and thus the agent only knows it has hit the reward location by the activation of the reward input at the next step.<br />
<br />
6) Inputs to the agent consist of a binary vector describing the 3 <math display = "inline">\times</math> 3 neighborhood centered on the agent (each element being set to 1 or 0 if the corresponding square is or is not a wall), together with the reward at the previous time step. <br />
<br />
7) The A2C algorithm is used to meta-train the network. <br />
<br />
8) The experiments are run under three conditions: full differentiable plasticity, no plasticity at all, and homogeneous plasticity in which all connections share the same (learnable) <math display = "inline">\alpha</math> parameter. <br />
<br />
9) For each condition, 15 runs with different random seeds are performed. <br />
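<br />
The maze layout and the agent's input vector described above can be sketched as follows. This is an interpretation, not the authors' code: it assumes the 9 × 9 area is enclosed by an extra border of walls (giving an 11 × 11 grid), which reproduces the 16 interior wall squares mentioned in item 2. The names <code>build_maze</code>, <code>free_squares</code> and <code>observe</code> are hypothetical. <br />

```python
def build_maze(inner=9):
    """Grid of 1 (wall) and 0 (free) with a border of walls.

    Inside the border, squares whose row and column indices are both
    even are walls, giving a regular grid of obstacles.
    """
    size = inner + 2
    maze = [[1] * size for _ in range(size)]
    for r in range(1, size - 1):
        for c in range(1, size - 1):
            maze[r][c] = 1 if (r % 2 == 0 and c % 2 == 0) else 0
    return maze


def free_squares(maze):
    """All non-wall positions, e.g. candidates for the reward location."""
    return [(r, c) for r, row in enumerate(maze)
            for c, cell in enumerate(row) if cell == 0]


def observe(maze, row, col, prev_reward):
    """3 x 3 neighborhood around the agent plus the previous reward."""
    patch = [maze[r][c]
             for r in range(row - 1, row + 2)
             for c in range(col - 1, col + 2)]
    return patch + [prev_reward]
```

The reward location itself never appears in the observation; the agent only learns about it through the <code>prev_reward</code> entry on the step after it reaches the reward. <br />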
<br />
<br />
Architecture: <br />
<br />
1) It is a simple recurrent network with 200 neurons, with a softmax layer on top of it to select between the 4 possible actions (up, right, left or down).<br />
<br />
<br />
[[File:exp4performance.png| 650px|thumb|center|Figure 8: Performance curve for the maze navigation experiment]]<br />
<br />
<br />
Results: <br />
<br />
1) The results are shown in the figure 8. The plastic network shows considerably better performance as compared to the other networks.<br />
<br />
2) The non-plastic and homogeneous networks get stuck on a suboptimal policy. <br />
<br />
3) Thus, the conclusion is that, in this domain, individually sculpting the plasticity of each connection is crucial to reaping the benefits of plasticity.<br />
<br />
= Conclusions =<br />
<br />
<br />
The important contributions from this paper are as follows: <br />
<br />
1) The results show that simple plastic models support efficient meta-learning.<br />
<br />
2) Gradient descent itself is shown to be capable of optimizing the plasticity of a meta-learning system. <br />
<br />
3) The meta-learning is shown to vastly outperform alternative options in the experiments considered. <br />
<br />
4) The method achieved state of the art results on a hard Omniglot test set. <br />
<br />
= Open Source Code =<br />
<br />
Code for this paper can be found at: https://github.com/uber-common/differentiable-plasticity<br />
<br />
<br />
= Critiques =<br />
<br />
The paper addresses the important problem of learning to learn ("meta-learning") and provides a novel framework based on gradient descent to achieve this objective. It leaves a large scope for future work, as many widely used architectures such as LSTMs could be combined with a plastic component. It is also easy to see that the applications of such approaches in deep reinforcement learning are plentiful, and there is a good possibility of beating the current baselines in many popular test beds like Atari games using plastic networks. This paper opens up possibilities for a whole class of meta-learning algorithms. <br />
<br />
As for drawbacks, the paper does not mention how plastic networks will behave if the test sets are completely different from the training dataset. Will the performance be the same as that of non-plastic networks? It is also unclear whether this method will scale, as there are a large number of parameters to be determined even for the simplest of problems. In addition, each experimental domain considered in this paper needed a significantly different network architecture (for example, in the Omniglot domain plasticity was applied only to the final layers). The paper does not give reasons for these specific decisions or say whether such choices will hold for other similar problems. There has been work on transfer learning applied to both supervised learning and reinforcement learning problems. The authors should ideally have compared plastic networks to some of those algorithms, as these methods transfer existing knowledge to other related problems and avoid the need to start training from scratch, much like the methods adopted in this paper.</div>Mpaflahttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=stat946F18/differentiableplasticity&diff=37467stat946F18/differentiableplasticity2018-11-01T01:50:18Z<p>Mpafla: /* Motivation */</p>
<hr />
<div>'''Differentiable Plasticity: ''' Summary of the ICML 2018 paper https://arxiv.org/abs/1804.02464<br />
<br />
= Presented by =<br />
<br />
1. Ganapathi Subramanian, Sriram [Quest ID: 20676799]<br />
<br />
= Motivation =<br />
1. Neural networks naturally have a static architecture. Once a neural network is trained, the network architecture components (e.g. network connections) cannot be changed, so learning effectively stops with the training step. If a different task needs to be considered, the agent must be trained again from scratch. <br />
<br />
2. Plasticity is the characteristic of biological systems, such as humans, that allows them to change their network connections over time. This enables lifelong learning in biological systems, which can thus adapt to dynamic changes in the environment with great sample efficiency. This is called synaptic plasticity and is based on Hebb's rule: if a neuron repeatedly takes part in making another neuron fire, the connection between them is strengthened. Neural networks are very far from achieving synaptic plasticity. <br />
<br />
<br />
3. Differentiable plasticity is a step in this direction. The behavior of the plastic connections is trained using gradient descent so that previously trained networks can adapt to changing conditions. <br />
<br />
Example: Using current state-of-the-art supervised learning methods, we can train neural networks to recognize specific letters that they have seen during training. With lifelong learning, the agent can learn about any alphabet, including those it has never been exposed to during training.<br />
<br />
= Objectives =<br />
The paper has the following objectives: <br />
<br />
1. To tackle the problem of meta-learning (learning to learn). <br />
<br />
2. To design neural networks with plastic connections that remain differentiable, so that they can be trained with gradient descent and backpropagation. <br />
<br />
3. To use backpropagation to optimize both the base weights and the amount of plasticity in each connection. <br />
<br />
4. To demonstrate the performance of such networks in three complex and different domains, namely complex pattern memorization, one-shot classification and reinforcement learning.<br />
<br />
= Important Terms =<br />
<br />
Hebb’s rule: This is a famous rule in neuroscience. It defines the relationship of activities between neurons with their connection. It states that if a neuron repeatedly takes part in making another neuron fire, the connection between them is strengthened. Also summarized as "neurons that fire together, wire together".<br />
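<br />
As a toy illustration of Hebb's rule, the sketch below strengthens a single connection whenever the two neurons it links are active together. It is a simplified textbook form, not the update used in the paper; the function name <code>hebbian_update</code> and the rate of 0.1 are assumptions made for this example. <br />

```python
def hebbian_update(weight, pre, post, rate=0.1):
    """Plain Hebbian update: grow the weight in proportion to pre * post."""
    return weight + rate * pre * post


# Two neurons that fire together on five consecutive steps.
w = 0.0
for _ in range(5):
    w = hebbian_update(w, pre=1.0, post=1.0)
```

If either neuron is silent (activity 0), the product is 0 and the connection is left unchanged. <br />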
<br />
= Related Work =<br />
<br />
Previous approaches to solving this problem are summarized below: <br />
<br />
1. Train standard recurrent neural networks to incorporate past experience in their future responses within each episode. To support learning, the RNN is attached to an external content-addressable memory bank. An attention mechanism within the controller network reads from and writes to the memory bank and thus enables fast memorization. <br />
<br />
2. Augment each weight with a plastic component that automatically grows and decays as a function of inputs and outputs. All connections have the same non-trainable plasticity, and only the corresponding weights are trained. Recent approaches have tried fast weights, which augment recurrent networks with fast-changing Hebbian weights and compute activations at each step. The resulting network is strongly biased towards recently seen patterns. <br />
<br />
3. Optimize the learning rule itself instead of the connections. A parametrized learning rule is used where the structure of the network is fixed beforehand. <br />
<br />
4. Have all the weight updates computed on the fly by the network itself or by a separate network at each time step. The advantage is flexibility; the drawback is the large learning burden placed on the network. <br />
<br />
5. Perform gradient descent via backpropagation during the episode itself. Meta-learning then consists of training the base network so that it can be fine-tuned with a few additional gradient descent steps. <br />
<br />
6. For classification tasks, a separate embedding is trained to discriminate between different classes. Classification is then a comparison between the embedding of the test and example instances.<br />
<br />
The advantages of trainable synaptic plasticity for meta-learning are as follows: <br />
<br />
1. Great potential for flexibility. For example, Memory Networks enforce a specific memory storage model in which memories must be embedded in fixed-size vectors and retrieved through some attention mechanism. In contrast, trainable synaptic plasticity translates into very different forms of memory, the exact implementation of which can be determined by the (trainable) network structure.<br />
<br />
2. Fixed-weight recurrent networks require neurons to be used for both storage and computation, which increases the computational burden on neurons. The approach suggested in the paper avoids this. <br />
<br />
3. Non-trainable plasticity networks can exploit network connectivity for storage of short-term information, but their uniform, non-trainable plasticity imposes a stereotypical behavior on these memories. With trainable synaptic plasticity, the amount and rate of plasticity are shaped by the training mechanism itself, which also allows for more sustained memory. <br />
<br />
= Model =<br />
<br />
The formulation proposed in the paper keeps the plastic and non-plastic components of each connection separate, which also makes it easy to define multiple Hebbian rules. <br />
<br />
Model Components: <br />
<br />
1. A connection between any two neurons <math display = "inline">i</math> and <math display = "inline">j</math> has both a fixed component and a plastic component. <br />
<br />
2. The fixed part is just a traditional connection weight <math display = "inline">w_{i,j}</math> . The plastic part is stored in a Hebbian trace <math display = "inline">H_{i,j}</math>, which varies during a<br />
lifetime according to ongoing inputs and outputs. <br />
<br />
3. The relative importance of plastic and fixed components in the connection is structurally determined by the plasticity<br />
coefficient <math display = "inline">\alpha_{i,j}</math>, which multiplies the Hebbian trace to form<br />
the full plastic component of the connection. <br />
<br />
The network equations are as follows: <br />
<br />
<br />
<math display="block"><br />
x_j(t) = \sigma\left\{ \sum_{i \in \text{inputs}} \left[ w_{i,j}x_i(t-1) + \alpha_{i,j} H_{i,j}(t)x_i(t-1) \right] \right\}<br />
</math><br />
<br />
<br />
<br />
<math display="block"><br />
H_{i,j}(t+1) = \eta x_i(t-1) x_j(t) + (1 - \eta) H_{i,j}(t) <br />
</math><br />
<br />
Here <math display = "inline">x_j(t)</math> is the output of neuron <math display = "inline">j</math>. The first equation gives the activation of each neuron, where <math display = "inline">w_{i,j}x_i(t-1)</math> is the fixed component and the remaining term (<math display = "inline"> \alpha_{i,j} H_{i,j}(t)x_i(t-1) </math>) is the plastic component. <math display = "inline">\sigma</math> is a nonlinear function, always chosen to be tanh in this paper. The Hebbian trace <math display = "inline">H_{i,j}</math> in the second equation is updated as a function of ongoing inputs and outputs. <br />
<br />
From the first equation above, a connection can be fully fixed if <math display = "inline">\alpha = 0 </math>, fully plastic if <math display = "inline">w = 0</math>, or have both fixed and plastic components. <br />
<br />
<br />
<br />
The terms <math display = "inline">w_{i,j}</math> and <math display = "inline">\alpha_{i,j}</math> are the structural parameters trained by gradient descent. The parameter <math display = "inline">\eta</math>, the learning rate of plasticity, is also optimized as a parameter of the network. After this training, the agent can learn automatically from ongoing experience. In equation 2 above, <math display = "inline">\eta</math> also acts as a decay term, so the Hebbian traces decay to 0 in the absence of input. An alternative form of the update that avoids this decay, known as Oja's rule, is as follows: <br />
<br />
<br />
<math display="block"><br />
H_{i,j}(t+1) = H_{i,j}(t) + \eta x_j(t)(x_i(t-1) - x_j(t)H_{i,j}(t))<br />
</math><br />
<br />
<br />
= Experiment1 - Binary Pattern Memorization =<br />
<br />
<br />
<br />
This test involves quickly memorizing sets of arbitrary high-dimensional patterns and reconstructing them when exposed to partial, degraded versions. It is a very simple test, as hand-designed recurrent networks with Hebbian plastic connections are already known to solve it for binary patterns.<br />
<br />
<br />
<br />
[[File:binarypatternrecog.png | 650px|thumb|center|Figure 1: Pattern Memorization experiment - Input Structure and Architecture]]<br />
<br />
<br />
<br />
Steps in the experiment: <br />
<br />
<br />
1) The network is shown a set of five binary patterns in succession, as shown in figure 1. Each pattern has 1000 elements, each of which takes one of two binary values (1 or -1). Here 1 corresponds to dark red and -1 corresponds to dark blue. <br />
<br />
2) The few-shot learning paradigm is followed: each pattern is shown for 10 time steps, with 3 time steps of zero input between presentations, and the whole sequence of patterns is presented 3 times in random order. <br />
<br />
3) One of the presented patterns is chosen at random and degraded by setting half of its bits to 0. <br />
<br />
4) This degraded pattern is then fed to the network. The network has to reproduce the correct full pattern in its output using the memory it formed while the patterns were presented. <br />
<br />
The architecture of the network is described as follows: <br />
<br />
1) It is a fully connected RNN with one neuron per pattern element, plus one fixed-output neuron. There are a total of 1001 neurons. <br />
<br />
2) The value of each neuron is clamped to the value of the corresponding element in the pattern if that value is not 0. If the value is 0, the corresponding neuron does not receive pattern input and must use what it gets from lateral connections to reconstruct the correct, expected output value. <br />
<br />
3) Outputs are read from the activation of the neurons. <br />
<br />
4) The performance evaluation is done by computing the loss between the final network output and the correct expected pattern. <br />
<br />
5) The gradient of the error with respect to the <math display = "inline">w_{i,j}</math> and <math display = "inline">\alpha_{i,j}</math> coefficients is computed by backpropagation, and the coefficients are optimized with the Adam solver with learning rate 0.001. <br />
<br />
6) The simple decaying Hebbian formula in Equation 2 is used to update the Hebbian traces. Each network has 2 trainable parameters, <math display = "inline">w</math> and <math display = "inline">\alpha</math>, for each connection, for a total of 1001 <math display = "inline">\times</math> 1001 <math display = "inline">\times</math> 2 = 2,004,002 trainable parameters. <br />
<br />
[[File:exp1results.png | 650px|thumb|center|Figure 2:Experiment1 - Pattern Memorization Results]]<br />
<br />
<br />
The results over 10 runs are shown in figure 2. The error becomes quite low after about 200 episodes of training. <br />
<br />
[[File:exp1nonplasticresults.png| 650px|thumb|center|Figure 3: Pattern Memorization results with non plastic networks]]<br />
<br />
Comparison with Non-Plastic Networks: <br />
<br />
1) In principle, non-plastic networks can solve this task, but they require additional neurons to do so. In practice, the authors report that they were unable to solve the task using a non-plastic RNN or LSTM. <br />
<br />
2) Figure 3 shows the results using non-plastic networks. The best results required the addition of 2000 extra neurons. <br />
<br />
3) For the non-plastic RNN, the error flattens around 0.13, which is quite high. Using LSTMs, the task can be solved, albeit imperfectly, with the error eventually dropping to around 0.001. <br />
<br />
4) The plastic network solves the task very quickly, with the mean error going below 0.01 within 2000 episodes, which the authors report is 250 times faster than the LSTM.<br />
<br />
= Experiment 2 - Memorizing natural images =<br />
<br />
This is an image reconstruction task in which a network is trained on a set of natural images that it must memorize. Natural images with graded pixel values contain more information per element than the patterns of the previous experiment, so this experiment is inherently more complex. After the images are presented, one of them is chosen at random and half of it is shown to the network, whose task is to complete the image. The paper shows that the method solves this task effectively, whereas other state-of-the-art network architectures fail to solve it. <br />
<br />
The experiment is as follows: <br />
<br />
1) Images are from the CIFAR-10 database where there are a total of 60000 images each of size 32 <math display = "inline">\times</math> 32. <br />
<br />
2) The architecture has 1025 neurons in total with a total of 2 <math display = "inline">\times</math> 1025 <math display = "inline">\times</math> 1025 = 2101250 parameters. <br />
<br />
3) Each episode has 3 pictures, each shown 3 times for 20 time steps per presentation, with 3 time steps of zero input between presentations. <br />
<br />
4) The images are degraded by zeroing out one full contiguous half of the image to prevent a trivial solution of simply reconstructing the missing pixel as the average of its neighbors.<br />
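<br />
The degradation in step 4 can be sketched as follows. This is an illustration only: the function name <code>zero_half</code> and the choice between the top and bottom half are assumptions, since the summary states only that one full contiguous half is zeroed.<br />

```python
import numpy as np

def zero_half(image, half="bottom"):
    """Return a copy of `image` with one contiguous half set to zero."""
    degraded = image.copy()
    mid = image.shape[0] // 2
    if half == "top":
        degraded[:mid] = 0.0
    elif half == "bottom":
        degraded[mid:] = 0.0
    else:
        raise ValueError("half must be 'top' or 'bottom'")
    return degraded

image = np.ones((32, 32))           # stand-in for one 32 x 32 image
cue = zero_half(image, half="bottom")
```

Because a whole half is missing, a pixel in the zeroed region has no intact neighbors to average over, so the network has to recall the stored image rather than interpolate locally.<br />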
<br />
[[File:exp2results.png| 650px|thumb|center|Figure 4: Natural Image memorization results]]<br />
<br />
<br />
<br />
The results are shown in Figure 4. The last column shows the final output of the network, i.e. the reconstructed image. The results show that the model has learned to perform this task. <br />
<br />
[[File:exp2weights.png| 650px|thumb|center|Figure 5: Final matrices and plasticity coefficients]]<br />
<br />
The final weight matrix and plasticity coefficient matrix are shown in Figure 5. The plasticity matrix shows a structure related to the high correlation of neighboring pixels and to the half-field zeroing in test images. <br />
<br />
The full plastic network is compared against a similar architecture with shared plasticity coefficients, in which all connections share the same <math display = "inline">\alpha</math> value. That is, a single <math display = "inline">\alpha</math> parameter, shared across all connections, is trained. <br />
<br />
[[File:independentvsshared.png| 650px|thumb|center|Figure 6: Comparing independent and shared <math display = "inline">\alpha</math> value runs]]<br />
<br />
Figure 6 shows the result of this comparison: the network with an independent plasticity coefficient for each connection performs better. Thus the structure observed in the learned matrices is actually useful.<br />
<br />
<br />
= Experiment 3 - Omniglot task =<br />
<br />
This task involves handwritten symbol recognition. It is a standard task for one-shot and few-shot learning. <br />
<br />
Experimental Setup: <br />
<br />
1) The Omniglot data set is a collection of handwritten characters from various writing systems, including 20 instances each of 1,623 different handwritten characters, written by different subjects.<br />
<br />
2) In each episode, N character classes are randomly selected and K instances from each class are sampled. <br />
<br />
3) These instances, together with the class label (from 1 to N), are shown to the model. <br />
<br />
4) Then, a new, unlabelled instance is sampled from one of the N classes and shown to the model.<br />
<br />
5) Model performance is defined as the model’s accuracy in classifying this unlabelled example.<br />
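<br />
The episode construction described above can be sketched as follows. This is a minimal illustration with a toy dataset, not the authors' code: the function name <code>sample_episode</code>, the dictionary layout and the string instance ids are assumptions made for the example.<br />

```python
import random

def sample_episode(dataset, n_way, k_shot, rng):
    """Sample one N-way, K-shot episode from {class_name: [instances]}."""
    classes = rng.sample(sorted(dataset), n_way)
    support = []
    for label, cls in enumerate(classes, start=1):
        for instance in rng.sample(dataset[cls], k_shot):
            support.append((instance, label))
    rng.shuffle(support)
    # The query is a new instance of one of the N episode classes.
    query_label = rng.randrange(1, n_way + 1)
    shown = {inst for inst, lab in support if lab == query_label}
    candidates = [i for i in dataset[classes[query_label - 1]] if i not in shown]
    query = rng.choice(candidates)
    return support, query, query_label

# Toy stand-in for Omniglot: 30 classes with 20 instance ids each.
toy = {f"char{c}": [f"char{c}_img{i}" for i in range(20)] for c in range(30)}
support, query, query_label = sample_episode(toy, n_way=5, k_shot=1,
                                             rng=random.Random(0))
```

The labels 1 to N are assigned afresh in every episode, so a label carries no meaning across episodes and the model has to bind it to the character within the episode itself.<br />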
<br />
Architecture: <br />
<br />
1) Model architecture has 4 convolutional layers with 3 <math display = "inline">\times</math> 3 receptive fields and 64 channels. <br />
<br />
2) All convolutions have a stride of 2 to reduce the dimensionality between layers. <br />
<br />
3) The output is a single vector of 64 features, which feeds into an N-way softmax. <br />
<br />
4) The label of the current character is also concurrently fed as a one-hot encoding to this softmax layer, to serve as a guide for the correct output when a label is present.<br />
<br />
Plasticity in the architecture: <br />
<br />
1) Plasticity is applied to the weights from the final layer to the softmax layer, leaving the rest of the convolutional embedding non-plastic. <br />
<br />
2) The expectation is that the convolutional architecture will learn an adequate discriminant between arbitrary handwritten characters, while the plastic weights learn to memorize associations between observed patterns and outputs. <br />
<br />
Data Preparation: <br />
<br />
1) The dataset is augmented with rotations by multiples of <math display = "inline">90</math> degrees. <br />
<br />
2) It is divided into 1,523 classes for training and 100 classes (together with their augmentations) for testing. <br />
<br />
3) The networks are trained with the Adam optimizer with a learning rate of 3 <math display = "inline">\times 10^{-5}</math>, multiplied by 2/3 every 1M episodes, over 5,000,000 episodes. <br />
<br />
4) To evaluate final model performance, 10 models are trained with different random seeds and each of those is tested on 100 episodes using previously unseen test classes.<br />
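<br />
The learning rate schedule in step 3 is a simple step decay. A minimal sketch, assuming the decay is applied exactly at each multiple of 1M episodes (the function name <code>learning_rate</code> is hypothetical):<br />

```python
def learning_rate(episode, base=3e-5, factor=2 / 3, every=1_000_000):
    """Step decay: multiply the base rate by `factor` once per `every` episodes."""
    return base * factor ** (episode // every)

first = learning_rate(0)          # rate during the first million episodes
last = learning_rate(4_999_999)   # rate during the final million episodes
```

Over 5,000,000 episodes the rate is decayed four times, so the final rate is (2/3)<sup>4</sup>, roughly one fifth, of the initial one.<br />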
<br />
Results: <br />
<br />
1) The overall accuracy (i.e. the proportion of episodes with correct classification, aggregated over all test episodes of all runs) is 98.3%, with a 95% confidence interval of 0.80%.<br />
<br />
2) The median accuracy across the 10 runs was 98.5%, indicating consistency in learning.<br />
<br />
{| class="wikitable"<br />
|-<br />
! Memory Networks<br />
! Matching Networks<br />
! ProtoNets<br />
! Memory Module<br />
! MAML<br />
! SNAIL<br />
! DP (this paper)<br />
|-<br />
| 82.8%<br />
| 98.1%<br />
| 97.4%<br />
| 98.4%<br />
| 98.7% <math display = "inline">\pm</math> 0.4<br />
| 99.07% <math display = "inline">\pm</math> 0.16<br />
| 98.3% <math display = "inline">\pm</math> 0.80<br />
|}<br />
<br />
<br />
<br />
3) The above table compares this performance with that of other, non-plastic approaches. The results of the plastic approach are largely similar to those reported for the computationally intensive MAML method and the classification-specialized Matching Networks method. <br />
<br />
4) The performance is slightly below that reported for the SNAIL method, which trains a whole additional temporal-convolution network on top of the convolutional architecture and therefore has many more parameters.<br />
<br />
5) The conclusion is that a few plastic connections to the output of the network allow for competitive one-shot learning over arbitrary man-made visual symbols.<br />
<br />
= Experiment 4 - Reinforcement learning maze navigation task =<br />
<br />
This is a maze exploration task where the goal is to teach an agent to reach a goal. The plastic networks are shown to outperform non-plastic ones. <br />
<br />
Experimental setup: <br />
<br />
1) The maze is composed of 9 <math display = "inline">\times</math> 9 squares, surrounded by walls, in which every other square (in either direction) is occupied by a wall. <br />
<br />
[[File:exp4maze.png| 650px|thumb|center|Figure 7: Maze Environment]]<br />
<br />
<br />
2) The maze contains 16 wall squares arranged in a regular grid, as shown in Figure 7. <br />
<br />
3) In each episode, one non-wall square is randomly chosen as the reward location. When the agent hits this location, it receives a large reward (10.0) and is immediately transported to a random location in the maze. In addition, a small negative reward of -0.1 is given every time the agent tries to walk into a wall.<br />
<br />
4) Each episode lasts 250 time steps, during which the agent must accumulate as much reward as possible. The reward location is fixed within an episode and randomized across episodes. <br />
<br />
5) The reward is invisible to the agent, and thus the agent only knows it has hit the reward location by the activation of the reward input at the next step.<br />
<br />
6) Inputs to the agent consist of a binary vector describing the 3 <math display = "inline">\times</math> 3 neighborhood centered on the agent (each element being set to 1 or 0 if the corresponding square is or is not a wall), together with the reward at the previous time step. <br />
<br />
7) The A2C algorithm is used to meta-train the network. <br />
<br />
8) The experiments are run under three conditions: full differentiable plasticity, no plasticity at all, and homogeneous plasticity in which all connections share the same (learnable) <math display = "inline">\alpha</math> parameter. <br />
<br />
9) For each condition, 15 runs with different random seeds are performed. <br />
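<br />
The environment described above can be sketched as follows. This is an illustrative reimplementation under assumptions, not the authors' code: the 9 <math display = "inline">\times</math> 9 arena is embedded in an 11 <math display = "inline">\times</math> 11 grid to hold the surrounding walls, and the function names, the zero reward for ordinary moves and the uniform teleport are choices made for the example.<br />

```python
import numpy as np

MOVES = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}

def build_maze(size=9):
    """Grid of 0 (free) and 1 (wall): a size x size arena inside a wall border."""
    grid = np.zeros((size + 2, size + 2), dtype=int)
    grid[0, :] = grid[-1, :] = 1
    grid[:, 0] = grid[:, -1] = 1
    grid[2:-1:2, 2:-1:2] = 1  # every other interior square is a wall
    return grid

def observe(grid, pos, prev_reward):
    """3 x 3 wall neighborhood around the agent plus the previous reward."""
    r, c = pos
    patch = grid[r - 1:r + 2, c - 1:c + 2].ravel().astype(float)
    return np.append(patch, prev_reward)

def step(grid, pos, action, goal, rng):
    """Apply one action; return the new position and the reward."""
    dr, dc = MOVES[action]
    target = (pos[0] + dr, pos[1] + dc)
    if grid[target] == 1:
        return pos, -0.1  # bumped into a wall: stay in place
    if target == goal:
        free = np.argwhere(grid == 0)
        r, c = free[rng.integers(len(free))]
        return (int(r), int(c)), 10.0  # rewarded, then teleported
    return target, 0.0

grid = build_maze()
rng = np.random.default_rng(0)
obs = observe(grid, (1, 1), prev_reward=0.0)
bumped = step(grid, (1, 1), "up", goal=(1, 3), rng=rng)
moved = step(grid, (1, 1), "right", goal=(1, 3), rng=rng)
```

The observation never contains the goal position, so the only way for the agent to locate the reward is to find it once and then remember it for the rest of the episode, which is what the plastic connections are expected to support.<br />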
<br />
<br />
Architecture: <br />
<br />
1) It is a simple recurrent network with 200 neurons, with a softmax layer on top of it to select between the 4 possible actions (up, right, left or down).<br />
<br />
<br />
[[File:exp4performance.png| 650px|thumb|center|Figure 8: Performance curve for the maze navigation experiment]]<br />
<br />
<br />
Results: <br />
<br />
1) The results are shown in Figure 8. The plastic network performs considerably better than the other networks.<br />
<br />
2) The non-plastic and homogeneous networks get stuck on a suboptimal policy. <br />
<br />
3) Thus, the conclusion is that individually sculpting the plasticity of each connection is crucial to reaping the benefits of plasticity for this task.<br />
<br />
= Conclusions =<br />
<br />
<br />
The important contributions from this paper are as follows: <br />
<br />
1) The results show that simple plastic models support efficient meta-learning.<br />
<br />
2) Gradient descent itself is shown to be capable of optimizing the plasticity of a meta-learning system. <br />
<br />
3) The meta-learning is shown to vastly outperform alternative options in the experiments considered. <br />
<br />
4) The method achieved state of the art results on a hard Omniglot test set. <br />
<br />
= Open Source Code =<br />
<br />
Code for this paper can be found at: https://github.com/uber-common/differentiable-plasticity<br />
<br />
<br />
= Critiques =<br />
<br />
The paper addresses the important problem of learning to learn ("meta-learning") and provides a novel framework based on gradient descent to achieve this objective. It leaves a large scope for future work, as many widely used architectures such as LSTMs could be combined with a plastic component. There are also many potential applications of such approaches in deep reinforcement learning, and there is a good possibility of beating the current baselines in popular test beds such as Atari games using plastic networks. This paper opens up possibilities for a whole class of meta-learning algorithms. <br />
<br />
With regards to the drawbacks of the paper, the paper does not mention how plastic networks will behave if the test sets are completely different from the training dataset. Will the performance be the same as non-plastic networks? It is not very clear if this method will be scalable as there are a large number of parameters to be determined even with the simplest of problems. Also, each experimental domain considered in this paper needed significantly different network architectures (for example in the Omniglot domain plasticity was applied only for the final layers). The paper does not mention any reasons for the specific decisions and if such differences will hold good for other similar problems as well. There has been work in transfer learning applied to both supervised learning and reinforcement learning problems. The authors should have ideally compared plastic networks to performances of some algorithms there as these methods transfer existing knowledge to other related problems and also prevent the need to start training from scratch much similar to the methods adopted in this paper.</div>Mpaflahttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Synthesizing_Programs_for_Images_usingReinforced_Adversarial_Learning&diff=37462Synthesizing Programs for Images usingReinforced Adversarial Learning2018-11-01T01:40:41Z<p>Mpafla: /* Motivation */</p>
<hr />
<div>'''Synthesizing Programs for Images using Reinforced Adversarial Learning: ''' Summary of the ICML 2018 paper <br />
<br />
Paper: [[http://proceedings.mlr.press/v80/ganin18a.html]]<br />
Video: [[https://www.youtube.com/watch?v=iSyvwAwa7vk&feature=youtu.be]]<br />
<br />
== Presented by ==<br />
<br />
1. Nekoei, Hadi [Quest ID: 20727088]<br />
<br />
= Motivation =<br />
<br />
Conventional neural generative models have major problems. <br />
<br />
* It is not clear how to inject knowledge about the data into the model. <br />
<br />
* Latent space is not easily interpretable. <br />
<br />
The solution proposed in this paper is to generate programs that drive existing tools (e.g. graphics editors, illustration software, CAD), '''which gives a more meaningful API (a sequence of complex actions rather than raw pixels)'''.<br />
<br />
= Introduction =<br />
<br />
Humans frequently use their ability to recover structured representations from raw sensations to understand their environment. Decomposing a picture of a hand-written character into strokes, or understanding the layout of a building, are examples that can be exploited to learn how our brain actually works.<br />
To address these problems, a new approach is presented for interpreting and generating images using Deep Reinforced Adversarial Learning, with the aim of removing the need for a large amount of supervision and of scaling to larger real-world datasets. In this approach, an adversarially trained agent '''(SPIRAL)''' generates a program which is executed by a graphics engine to generate images, either conditioned on data or unconditionally. The agent is rewarded for fooling a discriminator network and is trained with distributed reinforcement learning without any extra supervision. The discriminator network itself is trained to distinguish between generated and real images.<br />
<br />
[[File:Fig1 SPIRAL.PNG | 400px|center]]<br />
<br />
== Related Work ==<br />
Related work in this field is summarized as follows:<br />
* There has been a large number of studies on inverting simulators to interpret images (Nair et al., 2008; Paysan et al., 2009; Mansinghka et al., 2013; Loper & Black, 2014; Kulkarni et al., 2015a; Jampani et al., 2015)<br />
<br />
* Inferring motor programs for reconstruction of MNIST digits (Nair & Hinton, 2006)<br />
<br />
* Visual program induction in the context of hand-written characters on the OMNIGLOT dataset (Lake et al., 2015)<br />
<br />
* Inferring and learning feed-forward or recurrent procedures for image generation (LeCun et al., 2015; Hinton & Salakhutdinov, 2006; Goodfellow et al., 2014; Ackley et al., 1987; Kingma & Welling, 2013; Oord et al., 2016; Kulkarni et al., 2015b; Eslami et al., 2016; Reed et al., 2017; Gregor et al., 2015).<br />
<br />
'''However, these methods have limitations such as:''' <br />
<br />
* Difficulty scaling to larger real-world datasets<br />
<br />
* Reliance on hand-crafted parses and on supervision in the form of sketches and corresponding images<br />
<br />
* Inability to infer structured representations of images<br />
<br />
= The SPIRAL Agent =<br />
== Overview ==<br />
The paper aims to construct a generative model <math>\mathbf{G}</math> that can sample from a target data distribution <math>p_{d}</math>. The generative model consists of a recurrent network <math>\pi</math> (called the policy network or agent) and an external rendering simulator R that accepts a sequence of commands from the agent and maps them into the domain of interest, e.g. R could be a CAD program rendering descriptions of primitives into 3D scenes. <br />
In order to train the policy network <math>\pi</math>, the paper uses the generative adversarial network (GAN) framework. In this framework, the generator tries to fool a discriminator network which is trained to distinguish between real and fake samples. Thus, the distribution generated by <math>\mathbf{G}</math> approaches <math>p_{d}</math>.<br />
<br />
== Objectives ==<br />
The authors give the training objectives for <math>\mathbf{G}</math> and <math>\mathbf{D}</math> as follows.<br />
<br />
'''Discriminator:''' Following (Gulrajani et al., 2017), the objective for <math>\mathbf{D}</math> is defined as: <br />
<br />
\begin{align}<br />
\mathcal{L}_D = -\mathbb{E}_{x\sim p_d}[D(x)] + \mathbb{E}_{x\sim p_g}[D(x)] + R<br />
\end{align}<br />
<br />
where <math>\mathbf{R}</math> is a regularization term softly constraining <math>\mathbf{D}</math> to stay in the set of Lipschitz continuous functions (for some fixed Lipschitz constant).<br />
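<br />
To make the three terms of the discriminator objective concrete, the following sketch evaluates them for a toy linear critic. It is only an illustration: the linear form <code>D(x) = w · x</code>, the gradient-penalty form of the regularizer (in the style of Gulrajani et al., 2017) and the weight <code>lam = 10</code> are assumptions made here, chosen because the input gradient of a linear critic is simply <code>w</code> and can be written without automatic differentiation.<br />

```python
import numpy as np

def critic_loss(w, real, fake, lam=10.0):
    """Loss of a toy linear critic D(x) = w . x on real and generated batches."""
    d_real = real @ w
    d_fake = fake @ w
    # For a linear critic the gradient with respect to x is w everywhere,
    # so the gradient penalty pushes the norm of w towards 1.
    penalty = lam * (np.linalg.norm(w) - 1.0) ** 2
    return -d_real.mean() + d_fake.mean() + penalty

w = np.array([1.0, 0.0])
real = np.array([[2.0, 0.0]])   # stand-in for samples from p_d
fake = np.array([[0.0, 0.0]])   # stand-in for renders drawn from p_g
loss = critic_loss(w, real, fake)
```

Minimizing this loss raises the score of real samples and lowers the score of generated ones, while the penalty keeps the critic from achieving this simply by scaling up its output.<br />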
<br />
'''Generator:''' To define the objective for <math>\mathbf{G}</math>, a variant of the REINFORCE (Williams, 1992) algorithm, advantage actor-critic (A2C) is employed:<br />
<br />
<br />
\begin{align}<br />
\mathcal{L}_G = -\sum_{t}\log\pi(a_t|s_t;\theta)[R_t - V^{\pi}(s_t)]<br />
\end{align}<br />
<br />
<br />
where <math>V^{\pi}</math> is an approximation to the value function, which is treated as independent of <math>\theta</math>, and <math>R_{t} = \sum_{t'=t}^{N}r_{t'}</math> is a 1-sample Monte-Carlo estimate of the return. Rewards are set to:<br />
<br />
<math>r_t = \begin{cases} 0, & t < N \\ D(R(a_1, a_2, \ldots, a_N)), & t = N \end{cases}</math><br />
<br />
<br />
One interesting aspect of this new formulation is that the search can be biased by introducing intermediate rewards which may depend not only on the output of R but also on the commands used to generate that output.<br />
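<br />
The generator objective with a terminal-only reward can be sketched as follows. This is an illustration under assumptions rather than the authors' implementation: the function name <code>policy_loss</code> is hypothetical, the inputs are plain arrays instead of network outputs, and the value estimates are treated as constants, in line with the statement that the value function is considered independent of <math>\theta</math>.<br />

```python
import numpy as np

def policy_loss(log_probs, values, final_score):
    """A2C-style loss for an episode whose only reward arrives at the last step."""
    n = len(log_probs)
    rewards = np.zeros(n)
    rewards[-1] = final_score                    # r_N = D(R(a_1, ..., a_N)), r_t = 0 otherwise
    returns = np.cumsum(rewards[::-1])[::-1]     # R_t = sum of rewards from step t to N
    advantages = returns - values                # baseline-corrected return
    return -np.sum(log_probs * advantages)

log_probs = np.log(np.array([0.5, 0.5]))  # log pi(a_t | s_t) for a 2-step episode
values = np.zeros(2)                      # value estimates V(s_t)
loss = policy_loss(log_probs, values, final_score=1.0)
```

With no intermediate rewards, every step of the episode receives the same return, namely the discriminator score of the final render. Intermediate rewards would be added by filling in the earlier entries of <code>rewards</code>.<br />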
<br />
== Conditional generation ==<br />
In some cases such as producing a given image <math>x_{target}</math>, conditioning the model on auxiliary inputs is useful. That can be done by feeding <math>x_{target}</math> to both policy and discriminator networks as:<br />
<br />
<math><br />
p_g = R(p_a(a|x_{target}))<br />
</math><br />
<br />
Here <math>p_{d}</math> becomes a Dirac-<math>\delta</math> function centered at <math>x_{target}</math>, so the first two terms in the objective function for D reduce to <br />
<br />
<math><br />
-D(x_{target}|x_{target})+ \mathbb{E}_{x\sim p_g}[D(x|x_{target})] <br />
</math><br />
<br />
It can be proven that for this particular setting of <math>p_{g}</math> and <math>p_{d}</math>, the <math>l^2</math>-distance is an optimal discriminator. However, it is not the only solution of the objective function for D, and it may be a poor candidate for the reward signal of the generator.<br />
<br />
== Distributed Learning ==<br />
The training pipeline is outlined in Figure 2b. It is an extension of the recently proposed '''IMPALA''' architecture (Espeholt et al., 2018). For training, three kinds of workers are defined:<br />
<br />
<br />
* Actors are responsible for generating the training trajectories through interaction between the policy network and the rendering simulator. Each trajectory contains a sequence <math>((\pi_{t}; a_{t}) | 1 \leq t \leq N)</math> as well as all intermediate renderings produced by R.<br />
<br />
<br />
* A policy learner receives trajectories from the actors, combines them into a batch and updates <math>\pi</math> by performing an '''SGD''' step on <math>\mathcal{L}_G</math> (2). Following common practice (Mnih et al., 2016), <math>\mathcal{L}_G</math> is augmented with an entropy penalty encouraging exploration.<br />
<br />
<br />
* In contrast to the base '''IMPALA''' setup, an additional discriminator learner is defined. This worker consumes random examples from <math>p_{d}</math>, as well as generated data (final renders) coming from the actor workers, and optimizes <math>\mathcal{L}_D</math> (1).<br />
<br />
[[File:Fig2 SPIRAL Architecture.png | 700px|center]]<br />
<br />
'''Note:''' No trajectories are omitted in the policy learner. Instead, the <math>D</math> updates are decoupled from the <math>\pi</math> updates by introducing a replay buffer that serves as a communication layer between the actors and the discriminator learner. This allows the latter to optimize <math>D</math> at a higher rate than the policy network is trained, which is possible because of the difference in network sizes (<math>\pi</math> is a multi-step RNN, while <math>D</math> is a plain '''CNN'''). Even though sampling from a replay buffer inevitably results in smoothing of <math>p_{g}</math>, this setup is found to work well in practice.<br />
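<br />
The role of the replay buffer can be illustrated with a minimal sketch. The class below is hypothetical and single-process, whereas the actual system is distributed; the capacity, the uniform sampling and the names <code>ReplayBuffer</code>, <code>add</code> and <code>sample</code> are assumptions made for the example.<br />

```python
import random
from collections import deque

class ReplayBuffer:
    """Fixed-size buffer of final renders shared by actors and the D learner."""

    def __init__(self, capacity, seed=0):
        self._items = deque(maxlen=capacity)  # oldest renders are dropped first
        self._rng = random.Random(seed)

    def add(self, render):
        """Called by actors whenever an episode finishes."""
        self._items.append(render)

    def sample(self, batch_size):
        """Called by the discriminator learner, independently of the actors."""
        pool = list(self._items)
        return self._rng.sample(pool, min(batch_size, len(pool)))

buffer = ReplayBuffer(capacity=3)
for render_id in range(5):        # stand-ins for rendered images
    buffer.add(render_id)
batch = buffer.sample(batch_size=2)
```

Because the buffer holds renders produced by slightly older versions of the policy, the discriminator sees a mixture of recent generator distributions, which is the smoothing of <math>p_{g}</math> mentioned above.<br />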
<br />
= Experiments=<br />
<br />
<br />
== Environments ==<br />
Two rendering environments are introduced. For MNIST, OMNIGLOT and CELEBA generation, the open-source painting library ''libmypaint'' (libmypaint contributors, 2018) is used. The agent controls a brush and produces a sequence of (possibly disjoint) strokes on a canvas <math>C</math>. The state of the environment comprises the contents of <math>C</math> as well as the current brush location <math>l_{t}</math>. Each action <math>a_{t}</math> is a tuple of 8 discrete decisions <math>(a_t^1, a_t^2, \ldots, a_t^8)</math> (see Figure 3). The first two components are the control point <math>p_{c}</math> and the endpoint <math>l_{t+1}</math> of the stroke.<br />
<br />
[[File:Fig3_agent_action_space.PNG | 450px|center]]<br />
<br />
The next 5 components represent the appearance of the stroke: the pressure that the agent applies to the brush (10 levels), the brush size, and the stroke color, characterized by a mixture of red, green and blue (20 bins for each color component). The last element of <math>a_{t}</math> is a binary flag specifying the type of action: the agent can choose either to produce a stroke or to jump straight to <math>l_{t+1}</math>.<br />
<br />
In the MUJOCO SCENES experiment, images are rendered using a MuJoCo-based environment (Todorov et al., 2012). At each time step, the agent has to decide on the object type (4 options), its location on a 16 <math>\times</math> 16 grid, its size (3 options) and its color (3 color components with 4 bins each). The resulting tuple is sent to the environment, which adds an object to the scene according to the specification.<br />
<br />
== Datasets ==<br />
<br />
=== MNIST ===<br />
For the MNIST dataset, two sets of experiments are conducted:<br />
<br />
1- In the first experiment, an unconditional agent is trained to model the data distribution. In addition to the reward provided by the discriminator, a small negative reward is given to the agent for each continuous sequence of strokes, which encourages the agent to draw a digit in a single continuous motion. Examples of such generations are shown in Fig 4a. <br />
<br />
2- In the second experiment, an agent is trained to reproduce a given digit. Several examples of conditionally generated digits are shown in Fig 4b. <br />
<br />
[[File:Fig4a MNIST.png | 450px|center]]<br />
<br />
=== OMNIGLOT ===<br />
Next, the approach is tested in a similar but more challenging setting: handwritten characters. As can be seen in Fig 5a, the unconditional generations have a lower quality than the digits of the previous dataset. The conditional agents, on the other hand, were able to reach a convincing quality (Fig 5b). Moreover, since OMNIGLOT contains many different symbols, the model was able to learn a general notion of image production without memorizing the training data. This was tested by feeding new, unseen line drawings to the trained agent, which reproduced them well, as shown in Figure 6. <br />
<br />
[[File:Fig5 OMNIGLOT.png | 450px|center]]<br />
<br />
<br />
For the MNIST dataset, two kinds of rewards, the discriminator score and the <math>l^{2}-\text{distance}</math>, have been compared. The discriminator-based approach has a significantly lower training time and a lower final <math>l^{2}</math> error. Following (Sharma et al., 2017), a “blind” version of the agent is also trained, in which no intermediate canvas states are fed as input to <math>\pi</math>. The training curve for this experiment (dotted blue line) is reported in Fig 8a, together with the results of training agents with the discriminator-based and <math>l^{2}-\text{distance}</math> rewards.<br />
<br />
=== CELEBA ===<br />
<br />
Since the ''libmypaint'' environment is also capable of producing complex color paintings, this direction is explored by training a conditional agent on the CELEBA dataset. In this experiment, the agent does not receive any intermediate rewards. In addition to the reconstruction reward (either <math>l^2</math> or discriminator-based), the earth mover’s distance between the color histograms of the model’s output and <math>x_{target}</math> is penalized (Figure 7).<br />
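<br />
For one-dimensional histograms, the earth mover’s distance has a simple closed form: the sum of absolute differences between the two cumulative distributions. The sketch below illustrates this for a single color channel. How the paper bins colors and combines channels is not stated in this summary, so the per-channel treatment and the function name <code>emd_1d</code> are assumptions.<br />

```python
import numpy as np

def emd_1d(hist_a, hist_b):
    """Earth mover's distance between two 1-D histograms, in units of bins."""
    p = hist_a / hist_a.sum()   # normalize counts to probability mass
    q = hist_b / hist_b.sum()
    # Mass that has to cross each bin boundary = difference of the CDFs there.
    return np.abs(np.cumsum(p) - np.cumsum(q)).sum()

dark = np.array([1.0, 0.0, 0.0])    # all pixels in the darkest bin
light = np.array([0.0, 0.0, 1.0])   # all pixels in the brightest bin
distance = emd_1d(dark, light)
```

Unlike a bin-by-bin difference, this distance grows with how far the color mass has to be moved, so an output that is slightly too dark is penalized less than one that is far too dark.<br />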
<br />
[[File:Fig6 CELEBA.png | 450px|center]]<br />
<br />
Although blurry, the model’s reconstructions closely match the high-level structure of each image, for instance the background color, the position of the face and the color of the person’s hair. In some cases, shadows around the eyes and the nose are visible.<br />
<br />
=== MUJOCO SCENES ===<br />
<br />
For the MUJOCO SCENES dataset, the trained agent is used to construct simple CAD programs that best explain input images. Only conditional generation is considered here. As before, the reward function for the generator can be either the <math>l^2</math> score or the discriminator output, and no auxiliary reward signals are used. Because the model is unrolled for 20 time steps, it has the capacity to infer and represent up to 20 objects and their attributes.<br />
<br />
As shown in Figure 8b, the agent trained to directly minimize <math>l^2</math> is unable to solve the task and has a significantly higher pixel-wise error. In comparison, the discriminator-based variant solves the task and produces near-perfect reconstructions on a holdout set (Figure 10).<br />
<br />
[[File:Fig8 MUJOCO_SCENES.png | 500px|center]]<br />
For this experiment, the total number of possible execution traces is <math>M^N</math>, where <math>M = 4 \cdot 16^2 \cdot 3 \cdot 4^3 \cdot 3</math> is the total number of attribute settings for a single object and N = 20 is the length of an episode. As a baseline, a general-purpose Metropolis-Hastings inference algorithm, which samples an execution trace defining attributes for a maximum of 20 primitives, was run on a set of 100 images. These attributes are treated as latent variables. At each step of the inference, the block of attributes (including presence/absence tags) corresponding to a single object is flipped uniformly within the appropriate range. The resulting trace is rendered by the environment into an output sample, which is then accepted or rejected using the Metropolis-Hastings update rule, with a Gaussian likelihood centered on the test image and a fixed diagonal covariance of 0.25. As Figure 9 shows, the MCMC search baseline cannot solve the task even after a large number of evaluations.<br />
[[File:figure9 mcmc.PNG| 500px|center]]<br />
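<br />
The size of the search space and the acceptance rule of the baseline can be made concrete with a short calculation. The sketch below is an illustration under assumptions: it treats 0.25 as the per-pixel variance, assumes a symmetric proposal so that the acceptance probability is the plain likelihood ratio, and uses a hypothetical function name <code>accept_prob</code> whose inputs are summed squared pixel errors.<br />

```python
import math

# Attribute settings for one object, as given in the text, and episode length.
M = 4 * 16**2 * 3 * 4**3 * 3
N = 20
log10_traces = N * math.log10(M)   # number of traces M**N, on a log10 scale

def accept_prob(err_new, err_old, var=0.25):
    """Metropolis-Hastings acceptance probability for a symmetric proposal.

    err_* are summed squared pixel errors against the target image, so the
    Gaussian log-likelihood of a render is -err / (2 * var) up to a constant.
    """
    if err_new <= err_old:
        return 1.0                 # never reject an improvement
    return math.exp(-(err_new - err_old) / (2.0 * var))

p_worse = accept_prob(err_new=2.0, err_old=1.0)
```

With more than 10<sup>115</sup> possible traces and an acceptance probability that falls off exponentially with any increase in pixel error, it is plausible that a local search of this kind makes little progress, which is consistent with the result in Figure 9.<br />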
<br />
= Discussion =<br />
As in the OMNIGLOT experiment, the <math>l^2</math>-based agent demonstrates some improvements over the random policy but gets stuck and, as a result, fails to learn sensible reconstructions (Figure 8b).<br />
<br />
[[File:Fig7 Results.png | 500px|center]]<br />
<br />
<br />
Scaling visual program synthesis to real-world and combinatorial datasets has been a challenge. The paper shows that it is possible to train an adversarial generative agent employing black-box rendering simulators. The results indicate that using the Wasserstein discriminator’s output as a reward function with asynchronous reinforcement learning can provide a scaling path for visual program synthesis. The current exploration strategy used in the agent is entropy-based, which is a limitation that future work should address by employing more sophisticated search algorithms for policy improvement. For instance, Monte Carlo Tree Search could be used, analogous to AlphaGo Zero (Silver et al., 2017). General-purpose inference algorithms could also be used for this purpose.<br />
<br />
= Future Work =<br />
Future work should explore different parameterizations of action spaces. For instance, the use of two arbitrary control points is perhaps not the best way to represent strokes, as it makes straight lines hard to deal with. Actions could also directly parametrize 3D surfaces, planes and learned texture models to invert richer visual scenes. On the reward side, using a joint image-action discriminator similar to BiGAN/ALI (Donahue et al., 2016; Dumoulin et al., 2016) (in this case, the policy can be viewed as an encoder, while the renderer becomes a decoder) could result in a more meaningful learning signal, since D will be forced to focus on the semantics of the image.<br />
<br />
= References =<br />
<br />
# Yaroslav Ganin, Tejas Kulkarni, Igor Babuschkin, S.M. Ali Eslami, Oriol Vinyals, [[https://arxiv.org/abs/1804.01118]].</div>Mpaflahttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Synthesizing_Programs_for_Images_usingReinforced_Adversarial_Learning&diff=37461Synthesizing Programs for Images usingReinforced Adversarial Learning2018-11-01T01:40:11Z<p>Mpafla: /* Motivation */</p>
<hr />
<div>'''Synthesizing Programs for Images using Reinforced Adversarial Learning: ''' Summary of the ICML 2018 paper <br />
<br />
Paper: [[http://proceedings.mlr.press/v80/ganin18a.html]]<br />
Video: [[https://www.youtube.com/watch?v=iSyvwAwa7vk&feature=youtu.be]]<br />
<br />
== Presented by ==<br />
<br />
1. Nekoei, Hadi [Quest ID: 20727088]<br />
<br />
= Motivation =<br />
<br />
Conventional neural generative models have major problems. <br />
<br />
* It is not clear how to inject knowledge about the data into the model. <br />
<br />
* Secondly, latent space is not easily interpretable. <br />
<br />
The solution proposed in this paper is to generate programs that drive existing tools (e.g. graphics editors, illustration software, CAD), '''which gives a more meaningful API (a sequence of complex actions rather than raw pixels)'''.<br />
<br />
= Introduction =<br />
<br />
Humans frequently use their ability to recover structured representations from raw sensations to understand their environment. Decomposing a picture of a hand-written character into strokes, or understanding the layout of a building, are examples that can be exploited to learn how our brain actually works.<br />
To address these problems, a new approach is presented for interpreting and generating images using Deep Reinforced Adversarial Learning, with the aim of removing the need for a large amount of supervision and of scaling to larger real-world datasets. In this approach, an adversarially trained agent '''(SPIRAL)''' generates a program which is executed by a graphics engine to generate images, either conditioned on data or unconditionally. The agent is rewarded for fooling a discriminator network and is trained with distributed reinforcement learning without any extra supervision. The discriminator network itself is trained to distinguish between generated and real images.<br />
<br />
[[File:Fig1 SPIRAL.PNG | 400px|center]]<br />
<br />
== Related Work ==<br />
Related work in this field is summarized as follows:<br />
* There has been a large number of studies on inverting simulators to interpret images (Nair et al., 2008; Paysan et al., 2009; Mansinghka et al., 2013; Loper & Black, 2014; Kulkarni et al., 2015a; Jampani et al., 2015)<br />
<br />
* Inferring motor programs for reconstruction of MNIST digits (Nair & Hinton, 2006)<br />
<br />
* Visual program induction in the context of hand-written characters on the OMNIGLOT dataset (Lake et al., 2015)<br />
<br />
* Inferring and learning feed-forward or recurrent procedures for image generation (LeCun et al., 2015; Hinton & Salakhutdinov, 2006; Goodfellow et al., 2014; Ackley et al., 1987; Kingma & Welling, 2013; Oord et al., 2016; Kulkarni et al., 2015b; Eslami et al., 2016; Reed et al., 2017; Gregor et al., 2015).<br />
<br />
'''However, all of these methods have limitations such as:''' <br />
<br />
* Scaling to larger real-world datasets<br />
<br />
* Requiring hand-crafted parses and supervision in the form of sketches and corresponding images<br />
<br />
* Lacking the ability to infer structured representations of images<br />
<br />
= The SPIRAL Agent =<br />
=== Overview ===<br />
The paper aims to construct a generative model <math>\mathbf{G}</math> that produces samples from a distribution <math>p_{d}</math>. The generative model consists of a recurrent network <math>\pi</math> (called the policy network or agent) and an external rendering simulator <math>R</math> that accepts a sequence of commands from the agent and maps them into the domain of interest, e.g. <math>R</math> could be a CAD program rendering descriptions of primitives into 3D scenes. <br />
To train the policy network <math>\pi</math>, the paper uses the generative adversarial network framework. In this framework, the generator tries to fool a discriminator network which is trained to distinguish between real and fake samples. Thus, the distribution generated by <math>\mathbf{G}</math> approaches <math>p_{d}</math>.<br />
<br />
== Objectives ==<br />
The authors give the training objectives for <math>\mathbf{G}</math> and <math>\mathbf{D}</math> as follows.<br />
<br />
'''Discriminator:''' Following (Gulrajani et al., 2017), the objective for <math>\mathbf{D}</math> is defined as: <br />
<br />
\begin{align}<br />
\mathcal{L}_D = -\mathbb{E}_{x\sim p_d}[D(x)] + \mathbb{E}_{x\sim p_g}[D(x)] + R<br />
\end{align}<br />
<br />
where <math>R</math> is a regularization term softly constraining <math>\mathbf{D}</math> to stay in the set of Lipschitz continuous functions (for some fixed Lipschitz constant).<br />
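The discriminator objective above can be evaluated on a batch as a minimal sketch. The function name <code>discriminator_loss</code>, the example scores and the value of the regularizer are illustrative assumptions, and the regularization term is simply passed in as a precomputed number rather than derived from gradients.<br />

```python
from statistics import fmean


def discriminator_loss(d_real, d_fake, regularizer=0.0):
    """Batch estimate of L_D = -E_{x~p_d}[D(x)] + E_{x~p_g}[D(x)] + R.

    d_real: discriminator scores D(x) on real samples (x ~ p_d)
    d_fake: discriminator scores D(x) on generated samples (x ~ p_g)
    regularizer: precomputed value of the regularization term R (assumed given)
    """
    return -fmean(d_real) + fmean(d_fake) + regularizer


# Hypothetical scores: the discriminator rates real samples higher than fakes.
loss = discriminator_loss(d_real=[0.9, 1.1], d_fake=[-0.2, 0.2], regularizer=0.05)
print(loss)
```

A lower value means that the discriminator separates real from generated samples more strongly, subject to the regularizer.<br />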
<br />
'''Generator:''' To define the objective for <math>\mathbf{G}</math>, a variant of the REINFORCE (Williams, 1992) algorithm, advantage actor-critic (A2C) is employed:<br />
<br />
<br />
\begin{align}<br />
\mathcal{L}_G = -\sum_{t}\log\pi(a_t|s_t;\theta)[R_t - V^{\pi}(s_t)]<br />
\end{align}<br />
<br />
<br />
where <math>V^{\pi}</math> is an approximation to the value function which is considered to be independent of <math>\theta</math>, and <math>R_{t} = \sum_{k=t}^{N}r_{k}</math> is a 
1-sample Monte-Carlo estimate of the return. Rewards are set to:<br />
<br />
<math>
r_t = \begin{cases}
0, & t < N \\
D(R(a_1, a_2, \ldots, a_N)), & t = N
\end{cases}
</math><br />
<br />
<br />
One interesting aspect of this new formulation is that <br />
the search can be biased by introducing intermediate rewards<br />
which may depend not only on the output of R but also on<br />
commands used to generate that output.<br />
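The generator objective and the terminal-only reward can be illustrated with a small sketch. The helper names, the episode length and all numbers below are assumptions made for illustration; in the paper the log-probabilities and values come from the policy network, and the final reward comes from the discriminator.<br />

```python
def returns_from_rewards(rewards):
    """Undiscounted returns R_t = r_t + r_{t+1} + ... + r_N."""
    returns, running = [], 0.0
    for reward in reversed(rewards):
        running += reward
        returns.append(running)
    return returns[::-1]


def generator_loss(log_probs, values, rewards):
    """L_G = -sum_t log pi(a_t|s_t) * (R_t - V(s_t)), with V treated as a constant."""
    returns = returns_from_rewards(rewards)
    return -sum(lp * (ret - v) for lp, ret, v in zip(log_probs, returns, values))


# Hypothetical 3-step episode: zero reward until the final step, where the
# discriminator score of the rendered image (here 2.0) is handed out.
N = 3
d_score = 2.0
rewards = [0.0] * (N - 1) + [d_score]
log_probs = [-1.0, -0.5, -0.25]   # log pi(a_t | s_t), assumed values
values = [1.0, 1.5, 2.0]          # V(s_t), assumed values

loss = generator_loss(log_probs, values, rewards)
print(returns_from_rewards(rewards), loss)
```

Because the only non-zero reward arrives at the last step, every return equals the final discriminator score. Intermediate rewards would change this by making earlier returns larger than later ones.<br />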
<br />
== Conditional generation: ==<br />
In some cases such as producing a given image <math>x_{target}</math>, conditioning the model on auxiliary inputs is useful. That can be done by feeding <math>x_{target}</math> to both policy and discriminator networks as:<br />
<br />
<math><br />
p_g = R(p_a(a|x_{target}))<br />
</math><br />
<br />
In this setting, <math>p_{d}</math> becomes a Dirac-<math>\delta</math> function centered at <math>x_{target}</math>, and the first two terms in the objective function for D reduce to <br />
<br />
<math><br />
-D(x_{target}|x_{target})+ \mathbb{E}_{x\sim p_g}[D(x|x_{target})] <br />
</math><br />
<br />
It can be proven that for this particular setting of <math>p_{g}</math> and <math>p_{d}</math>, the <math>l^2</math>-distance is an optimal discriminator. However, it is not the only solution of the objective function for D, and it may be a poor candidate for the reward signal of the generator.<br />
<br />
== Distributed Learning: ==<br />
The training pipeline is outlined in Figure 2b. It is an extension of the recently proposed '''IMPALA''' architecture (Espeholt et al., 2018). For training, three kinds of workers are defined:<br />
<br />
<br />
* Actors are responsible for generating the training trajectories through interaction between the policy network and the rendering simulator. Each trajectory contains a sequence <math>((\pi_{t}; a_{t}) | 1 \leq t \leq N)</math> as well as all intermediate renderings produced by R.<br />
<br />
<br />
* A policy learner receives trajectories from the actors, combines them into a batch and updates <math>\pi</math> by performing an '''SGD''' step on <math>\mathcal{L}_G</math> (2). Following common practice (Mnih et al., 2016), <math>\mathcal{L}_G</math> is augmented with an entropy penalty encouraging exploration.<br />
<br />
<br />
* In contrast to the base '''IMPALA''' setup, an additional discriminator learner is defined. This worker consumes random examples from <math>p_{d}</math>, as well as generated data (final renders) coming from the actor workers, and optimizes <math>\mathcal{L}_D</math> (1).<br />
<br />
[[File:Fig2 SPIRAL Architecture.png | 700px|center]]<br />
<br />
'''Note:''' no trajectories are omitted in the policy learner.
Instead, the <math>D</math> updates are decoupled from the <math>\pi</math> updates
by introducing a replay buffer that serves as a communication
layer between the actors and the discriminator learner.
This allows the latter to optimize <math>D</math> at a higher rate than
the policy network is trained, which is possible because of the difference in
network sizes (<math>\pi</math> is a multi-step RNN, while <math>D</math> is a plain
'''CNN'''). Even though sampling from a replay
buffer inevitably results in smoothing of <math>p_{g}</math>, this
setup is found to work well in practice.<br />
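The decoupling described above can be sketched with a bounded buffer that actors write to and the discriminator learner samples from. This is only an illustration under assumptions: the class name <code>RenderReplayBuffer</code>, the capacity and the string stand-ins for rendered images are hypothetical, and the paper does not specify its buffer implementation.<br />

```python
import random
from collections import deque


class RenderReplayBuffer:
    """Bounded buffer of final renders shared by actors and the discriminator learner."""

    def __init__(self, capacity):
        # With maxlen set, the oldest renders are dropped once the buffer is full.
        self._items = deque(maxlen=capacity)

    def add(self, render):
        """Called by an actor after finishing an episode."""
        self._items.append(render)

    def sample(self, batch_size, rng=random):
        """Called by the discriminator learner, as often as it likes."""
        items = list(self._items)
        return rng.sample(items, min(batch_size, len(items)))

    def __len__(self):
        return len(self._items)


buffer = RenderReplayBuffer(capacity=3)
for step in range(5):
    buffer.add(f"render-{step}")  # strings stand in for rendered images

batch = buffer.sample(2, rng=random.Random(0))
print(len(buffer), batch)
```

Because the buffer mixes renders produced by slightly older versions of the policy, samples drawn from it follow a smoothed version of <math>p_{g}</math>, which is the effect mentioned in the note.<br />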
<br />
= Experiments=<br />
<br />
<br />
== Environments ==<br />
Two rendering environments are introduced. For MNIST, OMNIGLOT and CELEBA generation, the open-source painting library ''libmypaint'' (libmypaint
contributors, 2018) is used. The agent controls a brush and produces
a sequence of (possibly disjoint) strokes on a canvas
<math>C</math>. The state of the environment comprises the contents
of <math>C</math> as well as the current brush location <math>l_{t}</math>. Each action
<math>a_{t}</math> is a tuple of 8 discrete decisions <math>(a_t^1, a_t^2, \ldots, a_t^8)</math> (see
Figure 3). The first two components are the control point <math>p_{c}</math>
and the endpoint <math>l_{t+1}</math> of the stroke.<br />
<br />
[[File:Fig3_agent_action_space.PNG | 450px|center]]<br />
<br />
The next 5
components represent the appearance of the stroke: the
pressure that the agent applies to the brush (10 levels), the
brush size, and the stroke color characterized by a mixture
of red, green and blue (20 bins for each color component).
The last element of <math>a_{t}</math> is a binary flag specifying the type
of action: the agent can choose either to produce a stroke
or to jump right to <math>l_{t+1}</math>.<br />
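The 8-component action described above can be written down as a small record type. This is a sketch under assumptions: the field names are invented for illustration, positions are represented as integer grid coordinates, and only the ranges stated in the text (10 pressure levels, 20 bins per color component) are checked, since the remaining ranges are not given here.<br />

```python
from typing import NamedTuple, Tuple

PRESSURE_LEVELS = 10  # stated in the text
COLOR_BINS = 20       # stated in the text, per color component


class StrokeAction(NamedTuple):
    """One libmypaint action as a tuple of 8 discrete decisions (names are hypothetical)."""
    control_point: Tuple[int, int]  # p_c, assumed to be a grid position
    end_point: Tuple[int, int]      # l_{t+1}, assumed to be a grid position
    pressure: int
    size: int                       # number of size bins not specified here
    red: int
    green: int
    blue: int
    draw_stroke: bool               # True: produce a stroke, False: jump to end_point


def is_valid(action: StrokeAction) -> bool:
    """Check only the ranges that the text specifies."""
    colors = (action.red, action.green, action.blue)
    return (0 <= action.pressure < PRESSURE_LEVELS
            and all(0 <= c < COLOR_BINS for c in colors))


example = StrokeAction(control_point=(3, 7), end_point=(10, 12), pressure=4,
                       size=1, red=19, green=0, blue=5, draw_stroke=True)
print(len(StrokeAction._fields), is_valid(example))
```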
<br />
In the MUJOCO SCENES experiment, images are rendered
using a MuJoCo-based environment (Todorov et al., 2012).
At each time step, the agent has to decide on the object
type (4 options), its location on a <math>16 \times 16</math> grid, its size
(3 options) and the color (3 color components with 4 bins
each). The resulting tuple is sent to the environment, which
adds an object to the scene according to the specification.<br />
<br />
== Datasets ==<br />
<br />
=== MNIST ===<br />
For the MNIST dataset, two sets of experiments are conducted:<br />
<br />
1- In the first experiment, an unconditional agent is trained to model the data distribution. Along with the reward provided by the discriminator, a small negative reward is given to the agent for starting each continuous sequence of strokes, to encourage the agent to draw a digit in a single continuous motion of the brush. An example of such generation is depicted in Fig 4a. <br />
<br />
2- In the second experiment, an agent is trained to reproduce a given digit. 
Several examples of conditionally generated digits are shown in Fig 4b. <br />
<br />
[[File:Fig4a MNIST.png | 450px|center]]<br />
<br />
=== OMNIGLOT ===<br />
The trained agents are now tested in a similar but more challenging setting of handwritten characters. As can be seen in Fig 5a, unconditional generation has lower quality than for the digits in the previous dataset. The conditional agents, on the other hand, reach a convincing quality (Fig 5b). Moreover, as OMNIGLOT contains many different symbols, the model learns a general notion of image reproduction rather than memorizing the training data. This was tested by feeding new, unseen line drawings to the trained agent, which produced excellent results, as shown in Figure 6. <br />
<br />
[[File:Fig5 OMNIGLOT.png | 450px|center]]<br />
<br />
<br />
For the MNIST dataset, two kinds of rewards, the discriminator score and the <math>l^{2}</math>-distance, have been compared. Note that the discriminator-based approach has a significantly lower training time and a lower final <math>l^{2}</math> error.<br />
Following (Sharma et al., 2017), a “blind” version of the agent, which does not receive any intermediate canvas states as input to <math>\pi</math>, is also trained. The training curve for this experiment is reported in Fig 8a (dotted blue line), 
alongside the results of training agents with the discriminator-based and <math>l^{2}</math>-distance approaches.<br />
<br />
=== CELEBA ===<br />
<br />
Since the ''libmypaint'' environment is also capable of producing<br />
complex color paintings, this direction is explored by<br />
training a conditional agent on the CELEBA dataset. In this<br />
experiment, the agent does not receive any intermediate rewards.<br />
In addition to the reconstruction reward (either <math>l^2</math> or
discriminator-based), the earth mover’s
distance between the color histograms of the model’s output
and <math>x_{target}</math> is penalized (Figure 7).<br />
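For intuition, the earth mover's distance between two one-dimensional histograms can be computed from their cumulative sums. The sketch below assumes a single color channel, histograms with the same number of bins, normalization to unit mass and unit spacing between bins; the summary does not state which exact formulation the authors use, so this is illustrative only.<br />

```python
from itertools import accumulate


def normalize(hist):
    """Scale a histogram so that its bins sum to 1."""
    total = sum(hist)
    return [h / total for h in hist]


def emd_1d(hist_a, hist_b):
    """Earth mover's distance between two 1-D histograms with equal bin layout.

    For 1-D distributions with unit bin spacing this equals the sum of absolute
    differences between the two cumulative distributions.
    """
    cdf_a = accumulate(normalize(hist_a))
    cdf_b = accumulate(normalize(hist_b))
    return sum(abs(a - b) for a, b in zip(cdf_a, cdf_b))


# All mass in the first bin versus all mass in the last bin: it moves 2 bins.
print(emd_1d([1, 0, 0], [0, 0, 1]))
# Same shape at different scales: no mass needs to move.
print(emd_1d([2, 2], [1, 1]))
```

Unlike a bin-by-bin comparison, this distance grows with how far color mass has to move, so an output that is slightly too dark is penalized less than one with an entirely different palette.<br />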
<br />
[[File:Fig6 CELEBA.png | 450px|center]]<br />
<br />
Although blurry, the model’s reconstruction closely matches
the high-level structure of each image, for instance the
background color, the position of the face and the color of
the person’s hair. In some cases, shadows around eyes and
the nose are visible.<br />
<br />
=== MUJOCO SCENES ===<br />
<br />
For the MUJOCO SCENES dataset, the trained agent is used to construct simple CAD programs that best explain input images. Only conditional generation is considered here. As before, the reward function for the generator can be either the <math>l^2</math> score or the discriminator output. No auxiliary reward signals are used. The model has the capacity to infer and represent up to 20 objects and their attributes because it is unrolled for 20 time steps.<br />
<br />
As shown in Figure 8b, the agent trained to directly minimize<br />
<math>l^2</math> is unable to solve the task and has significantly<br />
higher pixel-wise error. In comparison, the discriminator based<br />
variant solves the task and produces near-perfect reconstructions<br />
on a holdout set (Figure 10).<br />
<br />
[[File:Fig8 MUJOCO_SCENES.png | 500px|center]]<br />
For this experiment, the total number of possible execution traces is <math>M^N</math>, where <math>M = 4 \cdot 16^2 \cdot 3 \cdot 4^3 \cdot 3</math> is the total number of attribute settings for a single object and <math>N = 20</math> is the length of an episode. As a baseline, a general-purpose Metropolis-Hastings inference algorithm that samples an execution trace defining attributes for a maximum of 20 primitives was run on a set of 100 images. These attributes are treated as latent variables. During each time step of the inference, the block of attributes (including the presence/absence tag) corresponding to a single object is resampled uniformly within the appropriate range. The resulting trace is rendered by the environment into an output sample, which is then accepted or rejected using the Metropolis-Hastings update rule, with a Gaussian likelihood centered on the test image and a fixed diagonal covariance of 0.25. As Figure 9 shows, the MCMC search baseline cannot solve the task even after a large number of evaluations.<br />
[[File:figure9 mcmc.PNG| 500px|center]]<br />
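To get a feeling for the size of the search space quoted above, the count of execution traces can be evaluated directly. The factors are copied from the formula in the text; the variable names are only illustrative.<br />

```python
# Attribute settings for a single object, using the factors given in the text.
M = 4 * 16**2 * 3 * 4**3 * 3
# Episode length.
N = 20

num_traces = M ** N            # Python integers have arbitrary precision
digits = len(str(num_traces))  # number of decimal digits of M^N

print(M, digits)
```

With more than a hundred decimal digits in <math>M^N</math>, exhaustive search is out of the question, which helps explain why a general-purpose sampler struggles on this task.<br />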
<br />
= Discussion =<br />
As in the OMNIGLOT
experiment, the <math>l^2</math>-based agent demonstrates some
improvement over the random policy but gets stuck and, as
a result, fails to learn sensible reconstructions (Figure 8b).<br />
<br />
[[File:Fig7 Results.png | 500px|center]]<br />
<br />
<br />
Scaling visual program synthesis to real-world and combinatorial
datasets has been a challenge. The paper shows that it is possible to train an adversarial generative agent employing
black-box rendering simulators. The results indicate that
using the Wasserstein discriminator’s output as a reward<br />
function with asynchronous reinforcement learning can provide<br />
a scaling path for visual program synthesis. The current<br />
exploration strategy used in the agent is entropy-based, but<br />
future work should address this limitation by employing sophisticated<br />
search algorithms for policy improvement. For<br />
instance, Monte Carlo Tree Search can be used, analogous<br />
to AlphaGo Zero (Silver et al., 2017). General-purpose<br />
inference algorithms could also be used for this purpose.<br />
<br />
= Future Work =<br />
Future work should explore different parameterizations of
action spaces. For instance, the use of two arbitrary control
points is perhaps not the best way to represent strokes, as it
is hard to deal with straight lines. Actions could also directly parametrize 3D surfaces, planes and learned texture models
to invert richer visual scenes. On the reward side, using<br />
a joint image-action discriminator similar to BiGAN/ALI<br />
(Donahue et al., 2016; Dumoulin et al., 2016) (in this case,<br />
the policy can be viewed as an encoder, while the renderer becomes<br />
a decoder) could result in a more meaningful learning<br />
signal, since D will be forced to focus on the semantics of<br />
the image.<br />
<br />
= References =<br />
<br />
# Yaroslav Ganin, Tejas Kulkarni, Igor Babuschkin, S.M. Ali Eslami, Oriol Vinyals, [[https://arxiv.org/abs/1804.01118]].</div>Mpaflahttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=stat946F18&diff=37040stat946F182018-10-22T21:35:11Z<p>Mpafla: /* Paper presentation */</p>
<hr />
<div>== [[F18-STAT946-Proposal| Project Proposal ]] ==<br />
<br />
=Paper presentation=<br />
{| class="wikitable"<br />
<br />
{| border="1" cellpadding="3"<br />
|-<br />
|width="60pt"|Date<br />
|width="100pt"|Name <br />
|width="30pt"|Paper number <br />
|width="700pt"|Title<br />
|width="30pt"|Link to the paper<br />
|width="30pt"|Link to the summary<br />
|-<br />
|Feb 15 (example)||Ri Wang || ||Sequence to sequence learning with neural networks.||[http://papers.nips.cc/paper/5346-sequence-to-sequence-learning-with-neural-networks.pdf Paper] || [[https://wiki.math.uwaterloo.ca/statwiki/index.php?title=stat946w18/Unsupervised_Machine_Translation_Using_Monolingual_Corpora_Only Summary]]<br />
|-<br />
|Oct 25 || Dhruv Kumar || 1 || Beyond Word Importance: Contextual Decomposition to Extract Interactions from LSTMs || [https://openreview.net/pdf?id=rkRwGg-0Z Paper] || <br />
[https://wiki.math.uwaterloo.ca/statwiki/index.php?title=stat946F18/Beyond_Word_Importance_Contextual_Decomposition_to_Extract_Interactions_from_LSTMs Summary]<br />
|-<br />
|Oct 25 || Amirpasha Ghabussi || 2 || DCN+: Mixed Objective And Deep Residual Coattention for Question Answering || [https://openreview.net/pdf?id=H1meywxRW Paper] ||<br />
[https://wiki.math.uwaterloo.ca/statwiki/index.php?title=DCN_plus:_Mixed_Objective_And_Deep_Residual_Coattention_for_Question_Answering Summary]<br />
|-<br />
|Oct 25 || Juan Carrillo || 3 || Hierarchical Representations for Efficient Architecture Search || [https://arxiv.org/abs/1711.00436 Paper] || <br />
[https://wiki.math.uwaterloo.ca/statwiki/index.php?title=stat946F18/Hierarchical_Representations_for_Efficient_Architecture_Search Summary]<br />
|-<br />
|Oct 30 || Manpreet Singh Minhas || 1|| End-to-end Active Object Tracking via Reinforcement Learning || [http://proceedings.mlr.press/v80/luo18a/luo18a.pdf Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=End_to_end_Active_Object_Tracking_via_Reinforcement_Learning Summary]<br />
|-<br />
|Oct 30 || Marvin Pafla || 2 || Fairness Without Demographics in Repeated Loss Minimization || [http://proceedings.mlr.press/v80/hashimoto18a.html Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=Fairness_Without_Demographics_in_Repeated_Loss_Minimization Summary]<br />
|-<br />
|Oct 30 || Glen Chalatov || 3 || Pixels to Graphs by Associative Embedding || [http://papers.nips.cc/paper/6812-pixels-to-graphs-by-associative-embedding Paper] ||<br />
|-<br />
|Nov 1 || Sriram Ganapathi Subramanian || 1||Differentiable plasticity: training plastic neural networks with backpropagation || [http://proceedings.mlr.press/v80/miconi18a.html Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=stat946F18/differentiableplasticity Summary]<br />
|-<br />
|Nov 1 || Hadi Nekoei || 1|| Synthesizing Programs for Images using Reinforced Adversarial Learning || [http://proceedings.mlr.press/v80/ganin18a.html Paper] || <br />
|-<br />
|Nov 1 || Henry Chen || 1|| DeepVO: Towards end-to-end visual odometry with deep Recurrent Convolutional Neural Networks || [https://ieeexplore.ieee.org/abstract/document/7989236 Paper] || <br />
|-<br />
|Nov 6 || Nargess Heydari || 2 || || || <br />
|-<br />
|Nov 6 || Aravind Ravi || 3 || Towards Image Understanding from Deep Compression Without Decoding || [https://openreview.net/forum?id=HkXWCMbRW Paper] || <br />
|-<br />
|Nov 6 || Ronald Feng || 1 || Learning to Teach || [https://openreview.net/pdf?id=HJewuJWCZ Paper] || <br />
|-<br />
|Nov 8 || Neel Bhatt || 1 || Annotating Object Instances with a Polygon-RNN || [https://www.cs.utoronto.ca/~fidler/papers/paper_polyrnn.pdf Paper] || <br />
|-<br />
|Nov 8 || Jacob Manuel || 2 || || || <br />
|-<br />
|Nov 8 || Charupriya Sharma|| 2 || || || <br />
|-<br />
|Nov 13 || Sagar Rajendran || 1|| Zero-Shot Visual Imitation || [https://openreview.net/pdf?id=BkisuzWRW Paper] || <br />
|-<br />
|Nov 13 || Jiazhen Chen || 2|| || || <br />
|-<br />
|Nov 13 || Neil Budnarain || 2|| PixelNN: Example-Based Image Synthesis || [https://openreview.net/pdf?id=Syhr6pxCW Paper] || <br />
|-<br />
|Nov 15 || Zheng Ma || 3|| Reinforcement Learning of Theorem Proving || [https://arxiv.org/abs/1805.07563 Paper] || <br />
|-<br />
|Nov 15 || Abdul Khader Naik || 4|| || ||<br />
|-<br />
|Nov 15 || Johra Muhammad Moosa || 2|| Attend and Predict: Understanding Gene Regulation by Selective Attention on Chromatin || [https://papers.nips.cc/paper/7255-attend-and-predict-understanding-gene-regulation-by-selective-attention-on-chromatin.pdf Paper] || <br />
|-<br />
|Nov 20 || Zahra Rezapour Siahgourabi || 1|| || || <br />
|-<br />
|Nov 20 || Shubham Koundinya || 6|| || || <br />
|-<br />
|Nov 20 || Salman Khan || || Obfuscated Gradients Give a False Sense of Security: Circumventing Defenses to Adversarial Examples || [http://proceedings.mlr.press/v80/athalye18a.html paper] || <br />
|-<br />
|Nov 22 ||Soroush Ameli || 3|| Learning to Navigate in Cities Without a Map || [https://arxiv.org/abs/1804.00168 paper] || <br />
|-<br />
|Nov 22 ||Ivan Li || 23 || Overfitting or perfect fitting? Risk bounds for classification and regression rules that interpolate || [https://arxiv.org/pdf/1806.05161v2.pdf Paper] ||<br />
|-<br />
|Nov 22 ||Sigeng Chen || 2 || || ||<br />
|-<br />
|Nov 27 || Aileen Li || 8|| Spatially Transformed Adversarial Examples ||[https://openreview.net/pdf?id=HyydRMZC- Paper] || <br />
|-<br />
|Nov 27 ||Xudong Peng || 9|| Multi-Scale Dense Networks for Resource Efficient Image Classification || [https://openreview.net/pdf?id=Hk2aImxAb Paper] || <br />
|-<br />
|Nov 27 ||Xinyue Zhang || 10|| An Inference-Based Policy Gradient Method for Learning Options || [http://proceedings.mlr.press/v80/smith18a/smith18a.pdf Paper] || <br />
|-<br />
|Nov 29 ||Junyi Zhang || 11|| || || <br />
|-<br />
|Nov 29 ||Travis Bender || 12|| Automatic Goal Generation for Reinforcement Learning Agents || [http://proceedings.mlr.press/v80/florensa18a/florensa18a.pdf Paper] ||<br />
|-<br />
|Nov 29 ||Patrick Li || 12|| Near Optimal Frequent Directions for Sketching Dense and Sparse Matrices || [https://www.cse.ust.hk/~huangzf/ICML18.pdf Paper] ||<br />
|-<br />
|Makeup || Ruijie Zhang || 1 || Searching for Efficient Multi-Scale Architectures for Dense Image Prediction || [https://arxiv.org/pdf/1809.04184.pdf Paper]||<br />
|-<br />
|Makeup || Ahmed Afify || 2||Don't Decay the Learning Rate, Increase the Batch Size || [https://openreview.net/pdf?id=B1Yy1BxCZ Paper]||<br />
|-<br />
|Makeup || Gaurav Sahu || 3 || TBD || ||<br />
|-<br />
|Makeup || Kashif Khan || 4 || Wasserstein Auto-Encoders || [https://arxiv.org/pdf/1711.01558.pdf Paper] ||<br />
|-<br />
|Makeup || Shala Chen || || A Neural Representation of Sketch Drawings || ||<br />
|-<br />
|Makeup || Ki Beom Lee || || || ||<br />
|-<br />
|Makeup || Wesley Fisher || || Deep Reinforcement Learning in Continuous Action Spaces: a Case Study in the Game of Simulated Curling || [http://proceedings.mlr.press/v80/lee18b/lee18b.pdf Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=Deep_Reinforcement_Learning_in_Continuous_Action_Spaces_a_Case_Study_in_the_Game_of_Simulated_Curling Summary]</div>Mpaflahttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Fairness_Without_Demographics_in_Repeated_Loss_Minimization&diff=37039Fairness Without Demographics in Repeated Loss Minimization2018-10-22T21:31:52Z<p>Mpafla: /* Representation Disparity */</p>
<hr />
<div>This page contains the summary of the paper "[http://proceedings.mlr.press/v80/hashimoto18a.html Fairness Without Demographics in Repeated Loss Minimization]" by Hashimoto, T. B., Srivastava, M., Namkoong, H., & Liang, P., which was published at the International Conference on Machine Learning (ICML) in 2018. <br />
<br />
=Introduction=<br />
<br />
Usually, machine learning models are trained to minimize their average loss in order to achieve high overall accuracy. While this works well for the majority, minority groups that use the system suffer high error rates because they contribute less data to the model. This phenomenon is known as '''''representation disparity''''' and has been observed in many models that, for instance, recognize faces or identify language. The disparity even increases over time: minority users suffer higher error rates and are therefore less likely to use the system in the future. As a consequence, minority groups shrink further, and less data is available for the next optimization of the model. With less data the disparity becomes even worse, a phenomenon referred to as '''''disparity amplification'''''. <br />
<br />
In this paper, Hashimoto et al. first show that standard '''''empirical risk minimization (ERM)''''' does not control the loss of minority groups, and thus causes representation disparity and its amplification over time (even if the model is fair in the beginning). Second, the researchers try to mitigate this unfairness by proposing the use of '''''distributionally robust optimization (DRO)'''''. Indeed Hashimoto et al. are able to show that DRO can bound the loss for minority groups and is fair on models that ERM turns unfair. <br />
<br />
===Note on Fairness===<br />
<br />
Hashimoto et al. follow the ''difference principle'' to achieve and measure fairness. It is defined as the maximization of the welfare of the worst-off group rather than the whole group (cf. utilitarianism).<br />
<br />
=Representation Disparity=<br />
<br />
If a user makes a query <math display="inline">Z \sim P</math>, the model <math display="inline">\theta \in \Theta</math> makes a prediction, and the user experiences loss <math display="inline">\ell (\theta; Z)</math>. The ''expected'' loss of a model <math display="inline">\theta</math> is denoted as the risk <math display="inline">\mathcal{R}(\theta) = \mathbb{E}_{Z \sim P} [\ell (\theta; Z)] </math>. While the queries (i.e. users using the system and providing data to the model) are made by users from <math display="inline">K</math> latent groups such that <math display="inline">Z \sim P := \sum_{k \in [K]} \alpha_kP_k</math>, neither the actual population proportions <math display="inline">\alpha_k</math> nor the group distributions <math display="inline">P_k</math> are known. Therefore, Hashimoto et al.'s goal is to control the worst-case risk over ''all'' <math display="inline">K</math> groups (and not just the risk of the worst-off group):<br />
<br />
\begin{align}<br />
\mathcal{R}_{max}(\theta) := \underset{k \in [K]}{\max} \mathcal{R}_k(\theta), \qquad \mathcal{R}_k(\theta) := \mathbb{E}_{P_k} [\ell (\theta; Z)]<br />
\end{align}<br />
<br />
There is high representation disparity if the expected loss of the model <math display="inline">\mathcal{R}(\theta)</math> is low, but the worst-case risk <math display="inline">\mathcal{R}_{max}(\theta)</math> is high. This means that a model with high representation disparity performs well on average (i.e. has low overall loss), but fails to represent some groups <math display="inline">k</math> (i.e. the risk for the worst-off group is high). In order to make models with high representation disparity fairer, Hashimoto et al. minimize this worst-case risk <math display="inline">\mathcal{R}_{max}(\theta)</math>.<br />
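The difference between the average risk and the worst-case group risk can be made concrete with a toy example. The group proportions and per-group risks below are made-up numbers; the only fact used is that, for the mixture <math display="inline">P = \sum_k \alpha_k P_k</math>, the overall risk is the proportion-weighted average of the group risks.<br />

```python
def overall_risk(alphas, group_risks):
    """R(theta) = sum_k alpha_k * R_k(theta) for the mixture P = sum_k alpha_k P_k."""
    return sum(a * r for a, r in zip(alphas, group_risks))


def worst_case_risk(group_risks):
    """R_max(theta) = max_k R_k(theta)."""
    return max(group_risks)


# Hypothetical setting: a 90% majority with low risk, a 10% minority with high risk.
alphas = [0.9, 0.1]
group_risks = [0.1, 0.8]

print(overall_risk(alphas, group_risks))  # low on average
print(worst_case_risk(group_risks))       # high for the worst-off group
```

In this example the average risk looks acceptable even though one group experiences a risk several times higher, which is exactly the situation described as high representation disparity.<br />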
<br />
=Disparity Amplification=<br />
<br />
Representation disparity can amplify as time passes and loss is minimized. Over <math display="inline">t = 1, 2, ..., T</math> minimization rounds the group proportions <math display="inline">\alpha_k^{(t)}</math> vary depending on past losses. At each round the expected number of users <math display="inline">\lambda_k^{(t+1)}</math> from group <math display="inline">k</math> is determined by <br />
<br />
\begin{align}<br />
\lambda_k^{(t+1)} := \lambda_k^{(t)} \nu(\mathcal{R}_k(\theta^{(t)})) + b_k<br />
\end{align}<br />
<br />
where <math display="inline">\lambda_k^{(t)} \nu(\mathcal{R}_k(\theta^{(t)}))</math> describes the expected number of users retained from the previous optimization round (<math display="inline">\nu</math> is the retention function, i.e. the fraction of users that stay given the risk they experienced) and <math display="inline">b_k</math> is the number of new users of group <math display="inline">k</math>. Put simply, the expected number of users of a group depends on the number of new users of that group and on the fraction of users that continue to use the system from the previous optimization step. If fewer and fewer users from minority groups return to the model (i.e. the model has a low retention rate for minority group users), Hashimoto et al. argue that the representation disparity amplifies. <br />
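One round of the user dynamics above can be sketched as follows. The retention function <code>linear_retention</code>, the user counts and the risks are hypothetical choices for illustration; the paper only requires the retention function to decrease with the risk, and the numbers are not taken from its experiments.<br />

```python
def next_expected_users(lam, risks, new_users, retention):
    """lambda_k^(t+1) = lambda_k^(t) * nu(R_k(theta^(t))) + b_k for every group k."""
    return [l * retention(r) + b for l, r, b in zip(lam, risks, new_users)]


def proportions(lam):
    """Group proportions alpha_k implied by the expected user counts."""
    total = sum(lam)
    return [l / total for l in lam]


def linear_retention(risk):
    """Hypothetical retention function nu: the higher the risk, the fewer users stay."""
    return 1.0 - risk


lam = [100.0, 20.0]       # expected users: majority, minority (assumed)
risks = [0.1, 0.6]        # the minority group experiences higher risk (assumed)
new_users = [10.0, 5.0]   # b_k (assumed)

lam_next = next_expected_users(lam, risks, new_users, linear_retention)
print(proportions(lam), proportions(lam_next))
```

In this toy round the minority share drops from about 17% to about 12%, because the higher risk lowers its retention. Repeating the update with a model that then fits the minority even worse is the feedback loop behind disparity amplification.<br />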
<br />
==Empirical Risk Minimization (ERM)==<br />
<br />
Without the knowledge of population proportions <math display="inline">\alpha_k^{(t)}</math>, the new user rate <math display="inline">b_k</math>, and the retention function <math display="inline">\nu</math> it is hard in practice, however, to control the worst-case risk over all time periods <math display="inline">\mathcal{R}_{max}^T</math> for all groups. That is why it is the standard approach to fit a sequence of models <math display="inline">\theta^{(t)}</math> by empirically approximating them. Using ERM, for instance, the optimal model is approached by minimizing the loss of the model:<br />
<br />
\begin{align}<br />
\theta^{(t)} = \underset{\theta \in \Theta}{\operatorname{arg\,min}} \sum_i \ell(\theta; Z_i^{(t)})<br />
\end{align}<br />
<br />
However, ERM fails to prevent disparity amplification. By minimizing the expected loss of the model, minority groups experience higher loss (because the loss of the majority group is minimized), and do not return to use the system. In doing so, the population proportions <math display="inline">\alpha_k^{(t)}</math> shift, and certain minority groups contribute even less to the system. This is mirrored in the expected user count <math display="inline">\lambda^{(t)}</math> at each optimization point. In their paper Hashimoto et al. show that, if using ERM, <math display="inline">\lambda^{(t)}</math> is unstable because it loses its fair fixed point (i.e. the population fraction where risk minimization maintains the same population fraction over time). Therefore, ERM fails to control minority risk over time, and is considered unfair.<br />
<br />
=Distributionally Robust Optimization (DRO)=<br />
<br />
To overcome the unfairness of ERM, Hashimoto et al. developed an approach based on distributionally robust optimization (DRO). At this point the goal is still to minimize the worst-case group risk over a single time-step <math display="inline">\mathcal{R}_{max} (\theta^{(t)}) </math> (time steps are omitted in this section's formulas). As previously mentioned, this is difficult to do because neither the population proportions <math display="inline">\alpha_k </math> nor group distributions <math display="inline">P_k </math> are known. Therefore, Hashimoto et al. developed an optimization technique that is robust "against '''''all''''' directions around the data generating distribution". This refers to the notion that DRO is robust to any group distribution <math display="inline">P_k </math> whose loss other optimization techniques such as ERM might try to optimize. To create this distributional robustness, the optimization's risk function <math display="inline">\mathcal{R}_{dro} </math> has to "up-weight" data <math display="inline">Z</math> that cause high loss <math display="inline">\ell(\theta, Z)</math>. In other words, the risk function has to over-represent mixture components (i.e. group distributions <math display="inline">P_k </math>) in relation to their original mixture weights (i.e. the population proportions <math display="inline">\alpha_k </math>) for groups that suffer high loss. <br />
<br />
To do this, Hashimoto et al. considered the worst-case loss (i.e. the highest risk) over all perturbations <math display="inline">P_k </math> around <math display="inline">P</math> within a certain limit (because obviously not every outlier should be up-weighted). This limit is described by the <math display="inline">\chi^2</math>-divergence (roughly speaking, the distance) between probability distributions. For two distributions <math display="inline">P</math> and <math display="inline">Q</math> the divergence is defined as <math display="inline">D_{\chi^2} (P || Q):= \int (\frac{dP}{dQ} - 1)^2 dQ</math>. With the help of the <math display="inline">\chi^2</math>-divergence, Hashimoto et al. defined the chi-squared ball <math display="inline">\mathcal{B}(P,r)</math> around the probability distribution <math display="inline">P</math>. This ball is defined as <math display="inline">\mathcal{B}(P,r) := \{Q \ll P : D_{\chi^2} (Q || P) \leq r \}</math>. With the help of this ball, the worst-case loss (i.e. the highest risk) over all perturbations <math display="inline">P_k </math> that lie inside the ball (i.e. within reasonable range) around the probability distribution <math display="inline">P</math> can be considered. This loss is given by<br />
<br />
\begin{align}<br />
\mathcal{R}_{dro}(\theta; r) := \sup_{Q \in \mathcal{B}(P,r)} \mathbb{E}_Q [\ell(\theta;Z)]<br />
\end{align}<br />
<br />
which, for <math display="inline">P:= \sum_{k \in [K]} \alpha_k P_k</math>, all models <math display="inline">\theta \in \Theta</math> and radius <math display="inline">r_k := (1/\alpha_k -1)^2</math>, bounds the risk of each group: <math display="inline">\mathcal{R}_k(\theta) \leq \mathcal{R}_{dro} (\theta; r_k)</math>. Furthermore, if a lower bound on the group proportions <math display="inline">\alpha_{min} \leq \min_{k \in [K]} \alpha_k</math> is specified, and the radius is defined as <math display="inline">r_{max} := (1/\alpha_{min} -1)^2</math>, then <math display="inline">\mathcal{R}_{dro} (\theta; r_{max}) </math> is an upper bound on the worst-case risk <math display="inline">\mathcal{R}_{max} (\theta) </math> that can be minimized.<br />
<br />
==Optimization of DRO==<br />
<br />
To minimize <math display="inline">\mathcal{R}_{dro}(\theta; r) := \sup_{Q \in \mathcal{B}(P,r)} \mathbb{E}_Q [\ell(\theta;Z)]</math>, Hashimoto et al. look at the dual of this maximization problem (i.e. an equivalent minimization problem that attains the same optimal value). This dual is given by the minimization problem<br />
<br />
\begin{align}<br />
\mathcal{R}_{dro}(\theta; r) = \inf_{\eta \in \mathbb{R}} \left\{ F(\theta; \eta):= C\left(\mathbb{E}_P \left[ [\ell(\theta;Z) - \eta]_+^2 \right] \right)^\frac{1}{2} + \eta \right\}<br />
\end{align}<br />
<br />
with <math display="inline">C = (2(1/\alpha_{min} - 1)^2 + 1)^{1/2}</math>. Here <math display="inline">\eta</math> is the dual variable (i.e. the variable introduced when forming the dual). Since <math display="inline">F(\theta; \eta)</math> involves only an expectation <math display="inline">\mathbb{E}_P</math> over the data generating distribution <math display="inline">P</math>, <math display="inline">F(\theta; \eta)</math> can be minimized directly. For convex losses <math display="inline">\ell(\theta;Z)</math>, <math display="inline">F(\theta; \eta)</math> is convex, and can be minimized by performing a binary search over <math display="inline">\eta</math>. In their paper, Hashimoto et al. further show that optimizing <math display="inline">\mathcal{R}_{dro}(\theta; r_{max})</math> at each time step controls the ''future'' worst-case risk <math display="inline">\mathcal{R}_{max} (\theta) </math>, and therefore retention rates. That means that if the initial group proportions satisfy <math display="inline">\alpha_k^{(0)} \geq \alpha_{min}</math>, and <math display="inline">\mathcal{R}_{dro}(\theta; r_{max})</math> is optimized at every time step (and therefore <math display="inline">\mathcal{R}_{max} (\theta) </math> is bounded), the worst-case risk <math display="inline">\mathcal{R}_{max}^T (\theta) </math> over all time steps is controlled. In other words, optimizing <math display="inline">\mathcal{R}_{dro}(\theta; r_{max})</math> at every time step is enough to avoid disparity amplification.</div>Mpaflahttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Fairness_Without_Demographics_in_Repeated_Loss_Minimization&diff=37038Fairness Without Demographics in Repeated Loss Minimization2018-10-22T21:30:32Z<p>Mpafla: /* Representation Disparity */</p>
<hr />
<div>This page contains a summary of the paper "[http://proceedings.mlr.press/v80/hashimoto18a.html Fairness Without Demographics in Repeated Loss Minimization]" by Hashimoto, T. B., Srivastava, M., Namkoong, H., & Liang, P., which was published at the International Conference on Machine Learning (ICML) in 2018. <br />
<br />
=Introduction=<br />
<br />
Usually, machine learning models are trained to minimize their average loss in order to achieve high overall accuracy. While this works well for the majority, minority groups that use the system suffer high error rates because they contribute less data to the model. This phenomenon is known as '''''representation disparity''''' and has been observed in many models that, for instance, recognize faces or identify language. The disparity grows further because minority users who suffer higher error rates are less likely to use the system in the future. As a consequence, minority groups shrink further, and less data is available for the next optimization of the model. With less data the disparity becomes even worse, a phenomenon referred to as '''''disparity amplification'''''. <br />
<br />
In this paper, Hashimoto et al. first show that standard '''''empirical risk minimization (ERM)''''' does not control the loss of minority groups, and thus causes representation disparity and its amplification over time (even if the model is fair in the beginning). Second, the researchers try to mitigate this unfairness by proposing the use of '''''distributionally robust optimization (DRO)'''''. Indeed, Hashimoto et al. are able to show that DRO can bound the loss for minority groups and remains fair in settings where ERM becomes unfair. <br />
<br />
===Note on Fairness===<br />
<br />
Hashimoto et al. follow the ''difference principle'' to achieve and measure fairness. It is defined as the maximization of the welfare of the worst-off group rather than the whole group (cf. utilitarianism).<br />
<br />
=Representation Disparity=<br />
<br />
If a user makes a query <math display="inline">Z \sim P</math>, the model <math display="inline">\theta \in \Theta</math> makes a prediction, and the user experiences loss <math display="inline">\ell (\theta; Z)</math>. The ''expected'' loss of a model <math display="inline">\theta</math> is denoted as the risk <math display="inline">\mathcal{R}(\theta) = \mathbb{E}_{Z \sim P} [\ell (\theta; Z)] </math>. While the queries (i.e. users using the system and providing data to the model) are made by users from <math display="inline">K</math> latent groups such that <math display="inline">Z \sim P := \sum_{k \in [K]} \alpha_kP_k</math>, neither the actual population proportions <math display="inline">\alpha_k</math> nor the group distributions <math display="inline">P_k</math> are known. Therefore, Hashimoto et al.'s goal is to control the worst-case risk over ''all'' groups <math display="inline">K</math> (and not just the risk of the worst-off group):<br />
<br />
\begin{align}<br />
\mathcal{R}_{max}(\theta) := \max_{k \in [K]} \mathcal{R}_k(\theta), \qquad \mathcal{R}_k(\theta) := \mathbb{E}_{P_k} [\ell (\theta; Z)]<br />
\end{align}<br />
<br />
There is high representation disparity if the expected loss of the model <math display="inline">\mathcal{R}(\theta)</math> is low, but the worst-case risk <math display="inline">\mathcal{R}_{max}(\theta)</math> of one group is high. This means that a model with high representation disparity performs well on average (i.e. has low overall loss), but fails to represent some groups <math display="inline">k</math> (i.e. the risk for the worst-off group is high). In order to make models with high representation disparity fairer, Hashimoto et al. try to minimize this worst-case risk <math display="inline">\mathcal{R}_{max}(\theta)</math> in general.<br />
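<br />
As a rough illustration of the gap between average and worst-case risk, the following Python sketch uses made-up numbers (the proportions and per-group risks are assumptions chosen for illustration, not values from the paper). It relies on the fact that, for a mixture <math display="inline">P = \sum_k \alpha_k P_k</math>, the overall risk is the weighted average of the group risks.<br />
<br />
```python
# Hypothetical values chosen for illustration only (not from the paper).
alpha = [0.9, 0.1]         # population proportions alpha_k (sum to 1)
group_risk = [0.05, 0.60]  # per-group risks R_k(theta)

# Overall risk R(theta) = sum_k alpha_k * R_k(theta) for the mixture P.
overall_risk = sum(a * r for a, r in zip(alpha, group_risk))

# Worst-case group risk R_max(theta) = max_k R_k(theta).
worst_case_risk = max(group_risk)

print(f"R(theta)     = {overall_risk:.3f}")
print(f"R_max(theta) = {worst_case_risk:.3f}")
```
<br />
In this toy setting the average risk looks small even though the group that makes up 10% of the population experiences a much higher loss.<br />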
<br />
=Disparity Amplification=<br />
<br />
Representation disparity can amplify as time passes and loss is minimized. Over <math display="inline">t = 1, 2, ..., T</math> minimization rounds, the group proportions <math display="inline">\alpha_k^{(t)}</math> vary depending on past losses. At each round the expected number of users <math display="inline">\lambda_k^{(t+1)}</math> from group <math display="inline">k</math> is determined by <br />
<br />
\begin{align}<br />
\lambda_k^{(t+1)} := \lambda_k^{(t)} \nu(\mathcal{R}_k(\theta^{(t)})) + b_k<br />
\end{align}<br />
<br />
where <math display="inline">\lambda_k^{(t)} \nu(\mathcal{R}_k(\theta^{(t)}))</math> describes the expected number of users retained from the previous optimization (the retention function <math display="inline">\nu</math> gives the fraction of users that stay) and <math display="inline">b_k</math> is the number of new users of group <math display="inline">k</math>. Put simply, the expected number of users of a group depends on the number of new users of that group and on the fraction of users from the previous optimization step that continue to use the system. If fewer and fewer users from minority groups return to the model (i.e. the model has a low retention rate for minority group users), Hashimoto et al. argue that the representation disparity amplifies. <br />
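<br />
To see how retention shapes group sizes, consider the simplified case (an assumption made here for illustration, not a setting analysed in the paper) in which the risk of group <math display="inline">k</math> stays constant over time, so that the retained fraction is a constant <math display="inline">\nu_k := \nu(\mathcal{R}_k)</math> smaller than 1. Setting <math display="inline">\lambda_k^{(t+1)} = \lambda_k^{(t)} = \lambda_k^*</math> in the recursion gives the steady state<br />
<br />
\begin{align}<br />
\lambda_k^* = \lambda_k^* \nu_k + b_k \quad \Longrightarrow \quad \lambda_k^* = \frac{b_k}{1 - \nu_k}<br />
\end{align}<br />
<br />
For example, with <math display="inline">b_k = 10</math>, a group retained at rate <math display="inline">\nu_k = 0.9</math> settles at 100 expected users, whereas a group retained at rate <math display="inline">\nu_k = 0.5</math> settles at only 20.<br />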
<br />
==Empirical Risk Minimization (ERM)==<br />
<br />
Without knowledge of the population proportions <math display="inline">\alpha_k^{(t)}</math>, the new user rate <math display="inline">b_k</math>, and the retention function <math display="inline">\nu</math>, it is hard in practice to control the worst-case risk over all time periods <math display="inline">\mathcal{R}_{max}^T</math> for all groups. The standard approach is therefore to fit a sequence of models <math display="inline">\theta^{(t)}</math> by minimizing an empirical approximation of the risk. With ERM, for instance, the model at each round is chosen by minimizing the loss on the observed data:<br />
<br />
\begin{align}<br />
\theta^{(t)} = \arg\min_{\theta \in \Theta} \sum_i \ell(\theta; Z_i^{(t)})<br />
\end{align}<br />
<br />
However, ERM fails to prevent disparity amplification. Because ERM minimizes the average loss, which is dominated by the majority group, minority groups experience higher loss and are less likely to return to the system. As a result, the population proportions <math display="inline">\alpha_k^{(t)}</math> shift, and certain minority groups contribute even less to the system. This is mirrored in the expected user count <math display="inline">\lambda^{(t)}</math> at each optimization point. In their paper, Hashimoto et al. show that, under ERM, <math display="inline">\lambda^{(t)}</math> is unstable because it loses its fair fixed point (i.e. the population fraction where risk minimization maintains the same population fraction over time). Therefore, ERM fails to control minority risk over time, and is considered unfair.<br />
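<br />
The feedback loop described above can be sketched with a small simulation. Everything specific in it is an assumption made for illustration rather than the paper's experimental setup: the function name <code>simulate_erm</code>, the two-group scalar problem with squared loss (for which the population-level ERM solution is the proportion-weighted mean of the group targets), the retention function <math display="inline">\nu(\mathcal{R}) = e^{-\mathcal{R}}</math>, and all numeric values.<br />
<br />
```python
import math

def simulate_erm(lam, mu, b, rounds):
    """Iterate lambda_k <- lambda_k * nu(R_k(theta)) + b_k under population-level ERM.

    lam: initial expected user counts per group
    mu:  per-group target values; the loss is the squared error (theta - mu_k)^2
    b:   expected number of new users per group and round
    Returns the share of the last group in the population after each round.
    """
    shares = []
    for _ in range(rounds):
        total = sum(lam)
        alpha = [l / total for l in lam]                # current proportions alpha_k
        theta = sum(a * m for a, m in zip(alpha, mu))   # ERM solution for squared loss
        risk = [(theta - m) ** 2 for m in mu]           # group risks R_k(theta)
        # Assumed retention function nu(R) = exp(-R), decreasing in the risk.
        lam = [l * math.exp(-r) + bk for l, r, bk in zip(lam, risk, b)]
        shares.append(lam[-1] / sum(lam))
    return shares

# Hypothetical starting point: a 60/40 split with equal inflow of new users.
shares = simulate_erm(lam=[60.0, 40.0], mu=[0.0, 1.0], b=[10.0, 10.0], rounds=200)
print(f"minority share after round 1:   {shares[0]:.3f}")
print(f"minority share after round 200: {shares[-1]:.3f}")
```
<br />
In this toy model the ERM solution sits closer to the majority target, so the minority group has the higher risk, retains fewer users, and its share of the population shrinks from round to round even though both groups receive the same number of new users.<br />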
<br />
=Distributionally Robust Optimization (DRO)=<br />
<br />
To overcome the unfairness of ERM, Hashimoto et al. use distributionally robust optimization (DRO). At this point the goal is still to minimize the worst-case group risk over a single time step <math display="inline">\mathcal{R}_{max} (\theta^{(t)}) </math> (time-step superscripts are omitted in this section's formulas). As previously mentioned, this is difficult because neither the population proportions <math display="inline">\alpha_k </math> nor the group distributions <math display="inline">P_k </math> are known. Therefore, Hashimoto et al. developed an optimization technique that is robust "against '''''all''''' directions around the data generating distribution". This refers to the notion that DRO is robust to any group distribution <math display="inline">P_k </math> whose loss other optimization techniques such as ERM might try to optimize. To create this distributional robustness, the optimization's risk function <math display="inline">\mathcal{R}_{dro} </math> has to "up-weigh" data <math display="inline">Z</math> that cause high loss <math display="inline">\ell(\theta; Z)</math>. In other words, for groups that suffer high loss, the risk function has to over-represent the mixture components (i.e. group distributions <math display="inline">P_k </math>) relative to their original mixture weights (i.e. the population proportions <math display="inline">\alpha_k </math>). <br />
<br />
To do this, Hashimoto et al. consider the worst-case loss (i.e. the highest risk) over all perturbations <math display="inline">P_k </math> around <math display="inline">P</math> within a certain limit (because not every outlier should be up-weighted). This limit is described by the <math display="inline">\chi^2</math>-divergence (roughly speaking, a distance) between probability distributions. For two distributions <math display="inline">P</math> and <math display="inline">Q</math> the divergence is defined as <math display="inline">D_{\chi^2} (P || Q):= \int (\frac{dP}{dQ} - 1)^2 dQ</math>. Using the <math display="inline">\chi^2</math>-divergence, Hashimoto et al. define the chi-squared ball <math display="inline">\mathcal{B}(P,r)</math> around the probability distribution <math display="inline">P</math> as <math display="inline">\mathcal{B}(P,r) := \{Q \ll P : D_{\chi^2} (Q || P) \leq r \}</math>. The worst-case loss over all perturbations that lie inside this ball (i.e. within a reasonable range of <math display="inline">P</math>) is given by<br />
<br />
\begin{align}<br />
\mathcal{R}_{dro}(\theta; r) := \sup_{Q \in \mathcal{B}(P,r)} \mathbb{E}_Q [\ell(\theta;Z)]<br />
\end{align}<br />
<br />
which, for <math display="inline">P:= \sum_{k \in [K]} \alpha_k P_k</math>, all models <math display="inline">\theta \in \Theta</math> and radius <math display="inline">r_k := (1/\alpha_k -1)^2</math>, bounds the risk of each group: <math display="inline">\mathcal{R}_k(\theta) \leq \mathcal{R}_{dro} (\theta; r_k)</math>. Furthermore, if a lower bound on the group proportions <math display="inline">\alpha_{min} \leq \min_{k \in [K]} \alpha_k</math> is specified, and the radius is defined as <math display="inline">r_{max} := (1/\alpha_{min} -1)^2</math>, then <math display="inline">\mathcal{R}_{dro} (\theta; r_{max}) </math> is an upper bound on the worst-case risk <math display="inline">\mathcal{R}_{max} (\theta) </math> that can be minimized.<br />
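<br />
As a worked example with an assumed lower bound (the value is chosen here for illustration only), suppose that every group makes up at least 20% of the population, i.e. <math display="inline">\alpha_{min} = 0.2</math>. The corresponding radius is<br />
<br />
\begin{align}<br />
r_{max} = \left(\frac{1}{0.2} - 1\right)^2 = (5 - 1)^2 = 16<br />
\end{align}<br />
<br />
A smaller bound enlarges the ball quickly: for <math display="inline">\alpha_{min} = 0.05</math> the radius is <math display="inline">(20 - 1)^2 = 361</math>, so guarding against smaller minority groups means taking the worst case over a much larger set of distributions.<br />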
<br />
==Optimization of DRO==<br />
<br />
To minimize <math display="inline">\mathcal{R}_{dro}(\theta; r) := \sup_{Q \in \mathcal{B}(P,r)} \mathbb{E}_Q [\ell(\theta;Z)]</math>, Hashimoto et al. look at the dual of this maximization problem (i.e. an equivalent minimization problem that attains the same optimal value). This dual is given by the minimization problem<br />
<br />
\begin{align}<br />
\mathcal{R}_{dro}(\theta; r) = \inf_{\eta \in \mathbb{R}} \left\{ F(\theta; \eta):= C\left(\mathbb{E}_P \left[ [\ell(\theta;Z) - \eta]_+^2 \right] \right)^\frac{1}{2} + \eta \right\}<br />
\end{align}<br />
<br />
with <math display="inline">C = (2(1/\alpha_{min} - 1)^2 + 1)^{1/2}</math>. Here <math display="inline">\eta</math> is the dual variable (i.e. the variable introduced when forming the dual). Since <math display="inline">F(\theta; \eta)</math> involves only an expectation <math display="inline">\mathbb{E}_P</math> over the data generating distribution <math display="inline">P</math>, <math display="inline">F(\theta; \eta)</math> can be minimized directly. For convex losses <math display="inline">\ell(\theta;Z)</math>, <math display="inline">F(\theta; \eta)</math> is convex, and can be minimized by performing a binary search over <math display="inline">\eta</math>. In their paper, Hashimoto et al. further show that optimizing <math display="inline">\mathcal{R}_{dro}(\theta; r_{max})</math> at each time step controls the ''future'' worst-case risk <math display="inline">\mathcal{R}_{max} (\theta) </math>, and therefore retention rates. That means that if the initial group proportions satisfy <math display="inline">\alpha_k^{(0)} \geq \alpha_{min}</math>, and <math display="inline">\mathcal{R}_{dro}(\theta; r_{max})</math> is optimized at every time step (and therefore <math display="inline">\mathcal{R}_{max} (\theta) </math> is bounded), the worst-case risk <math display="inline">\mathcal{R}_{max}^T (\theta) </math> over all time steps is controlled. In other words, optimizing <math display="inline">\mathcal{R}_{dro}(\theta; r_{max})</math> at every time step is enough to avoid disparity amplification.</div>Mpaflahttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Fairness_Without_Demographics_in_Repeated_Loss_Minimization&diff=37037Fairness Without Demographics in Repeated Loss Minimization2018-10-22T21:30:07Z<p>Mpafla: /* Representation Disparity */</p>
<hr />
<div>This page contains a summary of the paper "[http://proceedings.mlr.press/v80/hashimoto18a.html Fairness Without Demographics in Repeated Loss Minimization]" by Hashimoto, T. B., Srivastava, M., Namkoong, H., & Liang, P., which was published at the International Conference on Machine Learning (ICML) in 2018. <br />
<br />
=Introduction=<br />
<br />
Usually, machine learning models are trained to minimize their average loss in order to achieve high overall accuracy. While this works well for the majority, minority groups that use the system suffer high error rates because they contribute less data to the model. This phenomenon is known as '''''representation disparity''''' and has been observed in many models that, for instance, recognize faces or identify language. The disparity grows further because minority users who suffer higher error rates are less likely to use the system in the future. As a consequence, minority groups shrink further, and less data is available for the next optimization of the model. With less data the disparity becomes even worse, a phenomenon referred to as '''''disparity amplification'''''. <br />
<br />
In this paper, Hashimoto et al. first show that standard '''''empirical risk minimization (ERM)''''' does not control the loss of minority groups, and thus causes representation disparity and its amplification over time (even if the model is fair in the beginning). Second, the researchers try to mitigate this unfairness by proposing the use of '''''distributionally robust optimization (DRO)'''''. Indeed, Hashimoto et al. are able to show that DRO can bound the loss for minority groups and remains fair in settings where ERM becomes unfair. <br />
<br />
===Note on Fairness===<br />
<br />
Hashimoto et al. follow the ''difference principle'' to achieve and measure fairness. It is defined as the maximization of the welfare of the worst-off group rather than the whole group (cf. utilitarianism).<br />
<br />
=Representation Disparity=<br />
<br />
If a user makes a query <math display="inline">Z \sim P</math>, the model <math display="inline">\theta \in \Theta</math> makes a prediction, and the user experiences loss <math display="inline">\ell (\theta; Z)</math>. The ''expected'' loss of a model <math display="inline">\theta</math> is denoted as the risk <math display="inline">\mathcal{R}(\theta) = \mathbb{E}_{Z \sim P} [\ell (\theta; Z)] </math>. While the queries (i.e. users using the system and providing data to the model) are made by users from <math display="inline">K</math> latent groups such that <math display="inline">Z \sim P := \sum_{k \in [K]} \alpha_kP_k</math>, neither the actual population proportions <math display="inline">\alpha_k</math> nor the group distributions <math display="inline">P_k</math> are known. Therefore, Hashimoto et al.'s goal is to control the worst-case risk over ''all'' groups <math display="inline">K</math> (and not just the risk of the worst-off group):<br />
<br />
\begin{align}<br />
\mathcal{R}_{max}(\theta) := \max_{k \in [K]} \mathcal{R}_k(\theta), \qquad \mathcal{R}_k(\theta) := \mathbb{E}_{P_k} [\ell (\theta; Z)]<br />
\end{align}<br />
<br />
There is high representation disparity if the expected loss of the model <math display="inline">\mathcal{R}(\theta)</math> is low, but the worst-case risk <math display="inline">\mathcal{R}_{max}(\theta)</math> of one group is high. This means that a model with high representation disparity performs well on average (i.e. has low overall loss), but fails to represent some groups <math display="inline">k</math> (i.e. the risk for the worst-off group is high). In order to make models with high representation disparity fairer, Hashimoto et al. try to minimize this worst-case risk <math display="inline">\mathcal{R}_{max}(\theta)</math> in general.<br />
<br />
=Disparity Amplification=<br />
<br />
Representation disparity can amplify as time passes and loss is minimized. Over <math display="inline">t = 1, 2, ..., T</math> minimization rounds, the group proportions <math display="inline">\alpha_k^{(t)}</math> vary depending on past losses. At each round the expected number of users <math display="inline">\lambda_k^{(t+1)}</math> from group <math display="inline">k</math> is determined by <br />
<br />
\begin{align}<br />
\lambda_k^{(t+1)} := \lambda_k^{(t)} \nu(\mathcal{R}_k(\theta^{(t)})) + b_k<br />
\end{align}<br />
<br />
where <math display="inline">\lambda_k^{(t)} \nu(\mathcal{R}_k(\theta^{(t)}))</math> describes the expected number of users retained from the previous optimization (the retention function <math display="inline">\nu</math> gives the fraction of users that stay) and <math display="inline">b_k</math> is the number of new users of group <math display="inline">k</math>. Put simply, the expected number of users of a group depends on the number of new users of that group and on the fraction of users from the previous optimization step that continue to use the system. If fewer and fewer users from minority groups return to the model (i.e. the model has a low retention rate for minority group users), Hashimoto et al. argue that the representation disparity amplifies. <br />
<br />
==Empirical Risk Minimization (ERM)==<br />
<br />
Without knowledge of the population proportions <math display="inline">\alpha_k^{(t)}</math>, the new user rate <math display="inline">b_k</math>, and the retention function <math display="inline">\nu</math>, it is hard in practice to control the worst-case risk over all time periods <math display="inline">\mathcal{R}_{max}^T</math> for all groups. The standard approach is therefore to fit a sequence of models <math display="inline">\theta^{(t)}</math> by minimizing an empirical approximation of the risk. With ERM, for instance, the model at each round is chosen by minimizing the loss on the observed data:<br />
<br />
\begin{align}<br />
\theta^{(t)} = \arg\min_{\theta \in \Theta} \sum_i \ell(\theta; Z_i^{(t)})<br />
\end{align}<br />
<br />
However, ERM fails to prevent disparity amplification. Because ERM minimizes the average loss, which is dominated by the majority group, minority groups experience higher loss and are less likely to return to the system. As a result, the population proportions <math display="inline">\alpha_k^{(t)}</math> shift, and certain minority groups contribute even less to the system. This is mirrored in the expected user count <math display="inline">\lambda^{(t)}</math> at each optimization point. In their paper, Hashimoto et al. show that, under ERM, <math display="inline">\lambda^{(t)}</math> is unstable because it loses its fair fixed point (i.e. the population fraction where risk minimization maintains the same population fraction over time). Therefore, ERM fails to control minority risk over time, and is considered unfair.<br />
<br />
=Distributionally Robust Optimization (DRO)=<br />
<br />
To overcome the unfairness of ERM, Hashimoto et al. use distributionally robust optimization (DRO). At this point the goal is still to minimize the worst-case group risk over a single time step <math display="inline">\mathcal{R}_{max} (\theta^{(t)}) </math> (time-step superscripts are omitted in this section's formulas). As previously mentioned, this is difficult because neither the population proportions <math display="inline">\alpha_k </math> nor the group distributions <math display="inline">P_k </math> are known. Therefore, Hashimoto et al. developed an optimization technique that is robust "against '''''all''''' directions around the data generating distribution". This refers to the notion that DRO is robust to any group distribution <math display="inline">P_k </math> whose loss other optimization techniques such as ERM might try to optimize. To create this distributional robustness, the optimization's risk function <math display="inline">\mathcal{R}_{dro} </math> has to "up-weigh" data <math display="inline">Z</math> that cause high loss <math display="inline">\ell(\theta; Z)</math>. In other words, for groups that suffer high loss, the risk function has to over-represent the mixture components (i.e. group distributions <math display="inline">P_k </math>) relative to their original mixture weights (i.e. the population proportions <math display="inline">\alpha_k </math>). <br />
<br />
To do this, Hashimoto et al. consider the worst-case loss (i.e. the highest risk) over all perturbations <math display="inline">P_k </math> around <math display="inline">P</math> within a certain limit (because not every outlier should be up-weighted). This limit is described by the <math display="inline">\chi^2</math>-divergence (roughly speaking, a distance) between probability distributions. For two distributions <math display="inline">P</math> and <math display="inline">Q</math> the divergence is defined as <math display="inline">D_{\chi^2} (P || Q):= \int (\frac{dP}{dQ} - 1)^2 dQ</math>. Using the <math display="inline">\chi^2</math>-divergence, Hashimoto et al. define the chi-squared ball <math display="inline">\mathcal{B}(P,r)</math> around the probability distribution <math display="inline">P</math> as <math display="inline">\mathcal{B}(P,r) := \{Q \ll P : D_{\chi^2} (Q || P) \leq r \}</math>. The worst-case loss over all perturbations that lie inside this ball (i.e. within a reasonable range of <math display="inline">P</math>) is given by<br />
<br />
\begin{align}<br />
\mathcal{R}_{dro}(\theta; r) := \sup_{Q \in \mathcal{B}(P,r)} \mathbb{E}_Q [\ell(\theta;Z)]<br />
\end{align}<br />
<br />
which, for <math display="inline">P:= \sum_{k \in [K]} \alpha_k P_k</math>, all models <math display="inline">\theta \in \Theta</math> and radius <math display="inline">r_k := (1/\alpha_k -1)^2</math>, bounds the risk of each group: <math display="inline">\mathcal{R}_k(\theta) \leq \mathcal{R}_{dro} (\theta; r_k)</math>. Furthermore, if a lower bound on the group proportions <math display="inline">\alpha_{min} \leq \min_{k \in [K]} \alpha_k</math> is specified, and the radius is defined as <math display="inline">r_{max} := (1/\alpha_{min} -1)^2</math>, then <math display="inline">\mathcal{R}_{dro} (\theta; r_{max}) </math> is an upper bound on the worst-case risk <math display="inline">\mathcal{R}_{max} (\theta) </math> that can be minimized.<br />
<br />
==Optimization of DRO==<br />
<br />
To minimize <math display="inline">\mathcal{R}_{dro}(\theta; r) := \sup_{Q \in \mathcal{B}(P,r)} \mathbb{E}_Q [\ell(\theta;Z)]</math>, Hashimoto et al. look at the dual of this maximization problem (i.e. an equivalent minimization problem that attains the same optimal value). This dual is given by the minimization problem<br />
<br />
\begin{align}<br />
\mathcal{R}_{dro}(\theta; r) = \inf_{\eta \in \mathbb{R}} \left\{ F(\theta; \eta):= C\left(\mathbb{E}_P \left[ [\ell(\theta;Z) - \eta]_+^2 \right] \right)^\frac{1}{2} + \eta \right\}<br />
\end{align}<br />
<br />
with <math display="inline">C = (2(1/\alpha_{min} - 1)^2 + 1)^{1/2}</math>. Here <math display="inline">\eta</math> is the dual variable (i.e. the variable introduced when forming the dual). Since <math display="inline">F(\theta; \eta)</math> involves only an expectation <math display="inline">\mathbb{E}_P</math> over the data generating distribution <math display="inline">P</math>, <math display="inline">F(\theta; \eta)</math> can be minimized directly. For convex losses <math display="inline">\ell(\theta;Z)</math>, <math display="inline">F(\theta; \eta)</math> is convex, and can be minimized by performing a binary search over <math display="inline">\eta</math>. In their paper, Hashimoto et al. further show that optimizing <math display="inline">\mathcal{R}_{dro}(\theta; r_{max})</math> at each time step controls the ''future'' worst-case risk <math display="inline">\mathcal{R}_{max} (\theta) </math>, and therefore retention rates. That means that if the initial group proportions satisfy <math display="inline">\alpha_k^{(0)} \geq \alpha_{min}</math>, and <math display="inline">\mathcal{R}_{dro}(\theta; r_{max})</math> is optimized at every time step (and therefore <math display="inline">\mathcal{R}_{max} (\theta) </math> is bounded), the worst-case risk <math display="inline">\mathcal{R}_{max}^T (\theta) </math> over all time steps is controlled. In other words, optimizing <math display="inline">\mathcal{R}_{dro}(\theta; r_{max})</math> at every time step is enough to avoid disparity amplification.</div>Mpaflahttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Fairness_Without_Demographics_in_Repeated_Loss_Minimization&diff=37036Fairness Without Demographics in Repeated Loss Minimization2018-10-22T21:23:16Z<p>Mpafla: /* Distributonally Robust Optimization (DRO) */</p>
<hr />
<div>This page contains a summary of the paper "[http://proceedings.mlr.press/v80/hashimoto18a.html Fairness Without Demographics in Repeated Loss Minimization]" by Hashimoto, T. B., Srivastava, M., Namkoong, H., & Liang, P., which was published at the International Conference on Machine Learning (ICML) in 2018. <br />
<br />
=Introduction=<br />
<br />
Usually, machine learning models are trained to minimize their average loss in order to achieve high overall accuracy. While this works well for the majority, minority groups that use the system suffer high error rates because they contribute less data to the model. This phenomenon is known as '''''representation disparity''''' and has been observed in many models that, for instance, recognize faces or identify language. The disparity grows further because minority users who suffer higher error rates are less likely to use the system in the future. As a consequence, minority groups shrink further, and less data is available for the next optimization of the model. With less data the disparity becomes even worse, a phenomenon referred to as '''''disparity amplification'''''. <br />
<br />
In this paper, Hashimoto et al. first show that standard '''''empirical risk minimization (ERM)''''' does not control the loss of minority groups, and thus causes representation disparity and its amplification over time (even if the model is fair in the beginning). Second, the researchers try to mitigate this unfairness by proposing the use of '''''distributionally robust optimization (DRO)'''''. Indeed, Hashimoto et al. are able to show that DRO can bound the loss for minority groups and remains fair in settings where ERM becomes unfair. <br />
<br />
===Note on Fairness===<br />
<br />
Hashimoto et al. follow the ''difference principle'' to achieve and measure fairness. It is defined as the maximization of the welfare of the worst-off group rather than the whole group (cf. utilitarianism).<br />
<br />
=Representation Disparity=<br />
<br />
If a user makes a query <math display="inline">Z \sim P</math>, the model <math display="inline">\theta \in \Theta</math> makes a prediction, and the user experiences loss <math display="inline">\ell (\theta; Z)</math>. The ''expected'' loss of a model <math display="inline">\theta</math> is denoted as the risk <math display="inline">\mathcal{R}(\theta) = \mathbb{E}_{Z \sim P} [\ell (\theta; Z)] </math>. While the queries (i.e. users using the system and providing data to the model) are made by users from <math display="inline">K</math> latent groups such that <math display="inline">Z \sim P := \sum_{k \in [K]} \alpha_kP_k</math>, neither the actual population proportions <math display="inline">\alpha_k</math> nor the group distributions <math display="inline">P_k</math> are known. Therefore, Hashimoto et al.'s goal is to control the worst-case risk over ''all'' groups <math display="inline">K</math> (and not just the risk of the worst-off group):<br />
<br />
\begin{align}<br />
\mathcal{R}_{max}(\theta) := \underset{k \in [K]}{\max} \mathcal{R}_k(\theta), \qquad \mathcal{R}_k(\theta) := \mathbb{E}_{P_k} [\ell (\theta; Z)]<br />
\end{align}<br />
<br />
There is high representation disparity if the expected loss of the model <math display="inline">\mathcal{R}(\theta)</math> is low, but the worst-case risk <math display="inline">\mathcal{R}_{max}(\theta)</math> of one group is high. This means that a model with high representation disparity performs well on average (i.e. has low overall loss), but fails to represent some groups <math display="inline">k</math> (i.e. the risk for the worst-off group is high). In order to make models with high representation disparity fairer, Hashimoto et al. try to minimize this worst-case risk <math display="inline">\mathcal{R}_{max}(\theta)</math> in general.<br />
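The gap between average risk and worst-case group risk can be illustrated with a small sketch. The group names and loss values below are made up for illustration and are not taken from the paper; the group labels are only available here because the data is synthetic, whereas in the setting of the paper they are latent.

```python
from statistics import mean

# Hypothetical per-user losses, keyed by (normally latent) group.
losses_by_group = {
    "majority": [0.05] * 90,  # 90 users with low loss
    "minority": [0.60] * 10,  # 10 users with high loss
}


def group_risks(losses):
    """Risk R_k of each group: the mean loss within that group."""
    return {group: mean(values) for group, values in losses.items()}


def average_risk(losses):
    """Overall risk R: the mean loss over all users, ignoring groups."""
    return mean(x for values in losses.values() for x in values)


def worst_case_risk(losses):
    """Worst-case risk R_max: the largest group risk."""
    return max(group_risks(losses).values())


print(average_risk(losses_by_group), worst_case_risk(losses_by_group))
```

In this toy setting the average risk stays low because the majority dominates the mean, while the worst-case risk equals the minority group's risk.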
<br />
=Disparity Amplification=<br />
<br />
Representation disparity can amplify as time passes and loss is minimized. Over <math display="inline">t = 1, 2, ..., T</math> minimization rounds the group proportions <math display="inline">\alpha_k^{(t)}</math> vary dependent on past losses. At each round the expected number of users <math display="inline">\lambda_k^{(t+1)}</math> from group <math display="inline">k</math> is determined by <br />
<br />
\begin{align}<br />
\lambda_k^{(t+1)} := \lambda_k^{(t)} \nu(\mathcal{R}_k(\theta^{(t)})) + b_k<br />
\end{align}<br />
<br />
where <math display="inline">\lambda_k^{(t)} \nu(\mathcal{R}_k(\theta^{(t)}))</math> describes the number of users retained from the previous optimization round (the retention function <math display="inline">\nu</math> gives the fraction of users who stay, given the risk they experienced) and <math display="inline">b_k</math> is the number of new users of group <math display="inline">k</math>. Put simply, the expected number of users of a group depends on the number of new users of that group and on the fraction of users that continue to use the system from the previous optimization step. If fewer and fewer users from minority groups return to the system (i.e. the model has a low retention rate for minority group users), Hashimoto et al. argue that the representation disparity amplifies. <br />
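The update rule can be sketched in a few lines. The retention function used here (one minus the risk, clipped at zero) is an assumption made for illustration, not the one used by Hashimoto et al., and the group risk is held fixed across rounds, whereas in the paper it depends on the model fitted in each round.

```python
def retention(risk):
    """Hypothetical retention function nu: fraction of users kept, decreasing in risk."""
    return max(0.0, 1.0 - risk)


def next_user_count(lam, risk, new_users):
    """One round of the update: lambda^(t+1) = lambda^(t) * nu(R_k) + b_k."""
    return lam * retention(risk) + new_users


def simulate(lam0, risk, new_users, rounds):
    """Iterate the update for a fixed group risk and return the whole trajectory."""
    lam = lam0
    history = [lam]
    for _ in range(rounds):
        lam = next_user_count(lam, risk, new_users)
        history.append(lam)
    return history


# Two groups with the same inflow of new users but different (fixed) risks.
low_risk_group = simulate(lam0=100.0, risk=0.1, new_users=10.0, rounds=50)
high_risk_group = simulate(lam0=100.0, risk=0.5, new_users=10.0, rounds=50)
print(low_risk_group[-1], high_risk_group[-1])
```

Under these assumptions each group's count approaches <math display="inline">b_k / (1 - \nu)</math>, so a group that experiences higher risk settles at a smaller expected user count even when both groups receive the same number of new users.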
<br />
==Empirical Risk Minimization (ERM)==<br />
<br />
Without the knowledge of population proportions <math display="inline">\alpha_k^{(t)}</math>, the new user rate <math display="inline">b_k</math>, and the retention function <math display="inline">\nu</math> it is hard in practice, however, to control the worst-case risk over all time periods <math display="inline">\mathcal{R}_{max}^T</math> for all groups. That is why it is the standard approach to fit a sequence of models <math display="inline">\theta^{(t)}</math> by empirically approximating them. Using ERM, for instance, the optimal model is approached by minimizing the loss of the model:<br />
<br />
\begin{align}<br />
\theta^{(t)} = \underset{\theta \in \Theta}{\arg\min} \sum_i \ell(\theta; Z_i^{(t)})<br />
\end{align}<br />
<br />
However, ERM fails to prevent disparity amplification. By minimizing the expected loss of the model, minority groups experience higher loss (because the loss of the majority group is minimized), and do not return to use the system. In doing so, the population proportions <math display="inline">\alpha_k^{(t)}</math> shift, and certain minority groups contribute even less to the system. This is mirrored in the expected user count <math display="inline">\lambda^{(t)}</math> at each optimization point. In their paper Hashimoto et al. show that, if using ERM, <math display="inline">\lambda^{(t)}</math> is unstable because it loses its fair fixed point (i.e. the population fraction where risk minimization maintains the same population fraction over time). Therefore, ERM fails to control minority risk over time, and is considered unfair.<br />
<br />
=Distributionally Robust Optimization (DRO)=<br />
<br />
To overcome the unfairness of ERM, Hashimoto et al. propose distributionally robust optimization (DRO). At this point the goal is still to minimize the worst-case group risk over a single time-step <math display="inline">\mathcal{R}_{max} (\theta^{(t)}) </math> (time steps are omitted in this section's formulas). As previously mentioned, this is difficult to do because neither the population proportions <math display="inline">\alpha_k </math> nor group distributions <math display="inline">P_k </math> are known. Therefore, Hashimoto et al. developed an optimization technique that is robust "against '''''all''''' directions around the data generating distribution". This refers to the notion that DRO is robust to any group distribution <math display="inline">P_k </math> whose loss other optimization techniques such as ERM might try to optimize. To create this distributional robustness, the risk function <math display="inline">\mathcal{R}_{dro} </math> has to "up-weight" data <math display="inline">Z</math> that cause high loss <math display="inline">\ell(\theta; Z)</math>. In other words, the risk function has to over-represent mixture components (i.e. group distributions <math display="inline">P_k </math>) in relation to their original mixture weights (i.e. the population proportions <math display="inline">\alpha_k </math>) for groups that suffer high loss. <br />
<br />
To do this, Hashimoto et al. considered the worst-case loss (i.e. the highest risk) over all perturbations <math display="inline">P_k </math> around <math display="inline">P</math> within a certain limit (because obviously not every outlier should be up-weighted). This limit is described by the <math display="inline">\chi^2</math>-divergence (i.e. the distance, roughly speaking) between probability distributions. For two distributions <math display="inline">P</math> and <math display="inline">Q</math> the divergence is defined as <math display="inline">D_{\chi^2} (P || Q):= \int \left(\frac{dP}{dQ} - 1\right)^2 dQ</math>. With the help of the <math display="inline">\chi^2</math>-divergence, Hashimoto et al. defined the chi-squared ball <math display="inline">\mathcal{B}(P,r)</math> around the probability distribution <math display="inline">P</math>. This ball is defined so that <math display="inline">\mathcal{B}(P,r) := \{Q \ll P : D_{\chi^2} (Q || P) \leq r \}</math>. With the help of this ball the worst-case loss (i.e. the highest risk) over all perturbations <math display="inline">P_k </math> that lie inside the ball (i.e. within reasonable range) around the probability distribution <math display="inline">P</math> can be considered. This loss is given by<br />
<br />
\begin{align}<br />
\mathcal{R}_{dro}(\theta, r) := \underset{Q \in \mathcal{B}(P,r)}{\sup} \mathbb{E}_Q [\ell(\theta;Z)]<br />
\end{align}<br />
<br />
which, for <math display="inline">P:= \sum_{k \in [K]} \alpha_k P_k</math> and <math display="inline">r_k := (1/\alpha_k -1)^2</math>, bounds the risk of each group, <math display="inline">\mathcal{R}_k(\theta) \leq \mathcal{R}_{dro} (\theta; r_k)</math>, for all models <math display="inline">\theta \in \Theta</math>. Furthermore, if a lower bound on the group proportions <math display="inline">\alpha_{min} \leq \min_{k \in [K]} \alpha_k</math> is specified, and the radius is defined as <math display="inline">r_{max} := (1/\alpha_{min} -1)^2</math>, the worst-case risk <math display="inline">\mathcal{R}_{max} (\theta) </math> is controlled by <math display="inline">\mathcal{R}_{dro} (\theta; r_{max}) </math>, which forms an upper bound that can be minimized.<br />
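As a rough numerical illustration, the robust risk of a fixed model can be evaluated on a finite sample of losses through the dual form discussed in the section on optimizing DRO, using the constant <math display="inline">C = (2r + 1)^{1/2}</math> as stated in this summary. The sketch below is not the authors' implementation: the function names are hypothetical, the losses are made-up numbers, and a ternary search over <math display="inline">\eta</math> is swapped in for the binary search mentioned in the text (both rely on the objective being convex in <math display="inline">\eta</math>).

```python
import math


def dual_objective(losses, eta, r):
    """Empirical F(eta) = C * sqrt(mean([loss - eta]_+^2)) + eta with C = sqrt(2r + 1)."""
    c = math.sqrt(2.0 * r + 1.0)
    mean_sq = sum(max(loss - eta, 0.0) ** 2 for loss in losses) / len(losses)
    return c * math.sqrt(mean_sq) + eta


def robust_risk(losses, r, iters=200):
    """Approximate inf over eta of F(eta) by ternary search (F is convex in eta)."""
    if r <= 0:
        raise ValueError("radius r must be positive")
    c = math.sqrt(2.0 * r + 1.0)
    mean_loss = sum(losses) / len(losses)
    hi = max(losses)  # for eta >= max loss, F(eta) = eta, which is increasing
    # F(eta) >= C * (mean - eta) + eta, so any minimizer lies at or above this value.
    lo = (c * mean_loss - hi) / (c - 1.0)
    for _ in range(iters):
        m1 = lo + (hi - lo) / 3.0
        m2 = hi - (hi - lo) / 3.0
        if dual_objective(losses, m1, r) < dual_objective(losses, m2, r):
            hi = m2
        else:
            lo = m1
    return dual_objective(losses, (lo + hi) / 2.0, r)


# Hypothetical sample: 90 low losses and 10 high losses.
sample_losses = [0.05] * 90 + [0.60] * 10
alpha_min = 0.1  # assumed lower bound on the smallest group proportion
r_max = (1.0 / alpha_min - 1.0) ** 2
print(robust_risk(sample_losses, r_max))
```

The value returned lies between the average loss and the largest single loss, and it grows with the radius, which reflects how a larger ball puts more weight on the high-loss part of the sample.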
<br />
==Optimization of DRO==<br />
<br />
To minimize <math display="inline">\mathcal{R}_{dro}(\theta, r) := \underset{Q \in \mathcal{B}(P,r)}{sup} \mathbb{E}_Q [\ell(\theta;Z)]</math> Hashimoto et al. look at the dual of this maximization problem (i.e. every maximization problem can be transformed into a minimization problem and vice-versa). This dual is given by the minimization problem<br />
<br />
\begin{align}<br />
\mathcal{R}_{dro}(\theta, r) = \underset{\eta \in \mathbb{R}}{\inf} \left\{ F(\theta; \eta):= C\left(\mathbb{E}_P \left[ [\ell(\theta;Z) - \eta]_+^2 \right] \right)^\frac{1}{2} + \eta \right\}<br />
\end{align}<br />
<br />
with <math display="inline">C = (2(1/\alpha_{min} - 1)^2 + 1)^{1/2}</math>. <math display="inline">\eta</math> describes the dual variable (i.e. the variable that appears in creating the dual). Since <math display="inline">F(\theta; \eta)</math> involves an expectation <math display="inline">\mathbb{E}_P</math> over the data generating distribution <math display="inline">P</math>, <math display="inline">F(\theta; \eta)</math> can be directly minimized. For convex losses <math display="inline">\ell(\theta;Z)</math>, <math display="inline">F(\theta; \eta)</math> is convex, and can be minimized by performing a binary search over <math display="inline">\eta</math>. In their paper, Hashimoto et al. further show that optimizing <math display="inline">\mathcal{R}_{dro}(\theta, r_{max})</math> at each time step controls the ''future'' worst-case risk <math display="inline">\mathcal{R}_{max} (\theta) </math>, and therefore retention rates. That means if the initial group proportions satisfy <math display="inline">\alpha_k^{(0)} \geq \alpha_{min}</math>, and <math display="inline">\mathcal{R}_{dro}(\theta, r_{max})</math> is optimized for every time step (and therefore <math display="inline">\mathcal{R}_{max} (\theta) </math> is minimized), <math display="inline">\mathcal{R}_{max}^T (\theta) </math> over all time steps is controlled. In other words, optimizing <math display="inline">\mathcal{R}_{dro}(\theta, r_{max})</math> every time step is enough to avoid disparity amplification.</div>Mpaflahttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Fairness_Without_Demographics_in_Repeated_Loss_Minimization&diff=37035Fairness Without Demographics in Repeated Loss Minimization2018-10-22T21:20:05Z<p>Mpafla: /* Distributonally Robust Optimization (DRO) */</p>
<hr />
<div>This page contains the summary of the paper "[http://proceedings.mlr.press/v80/hashimoto18a.html Fairness Without Demographics in Repeated Loss Minimization]" by Hashimoto, T. B., Srivastava, M., Namkoong, H., & Liang, P. which was published at the International Conference of Machine Learning (ICML) in 2018. <br />
<br />
=Introduction=<br />
<br />
Usually, machine learning models are trained to minimize their average loss in order to achieve high overall accuracy. While this works well for the majority, minority groups that use the system suffer high error rates because they contribute less data to the model. This phenomenon is known as '''''representation disparity''''' and has been observed in many models that, for instance, recognize faces or identify language. The disparity grows further because minority users who suffer higher error rates are less likely to use the system in the future. As a consequence, minority groups shrink further, and less data is available for the next optimization of the model. With less data the disparity becomes even worse, a phenomenon referred to as '''''disparity amplification'''''. <br />
<br />
In this paper, Hashimoto et al. first show that standard '''''empirical risk minimization (ERM)''''' does not control the loss of minority groups, and thus causes representation disparity and its amplification over time (even if the model is fair in the beginning). Second, the researchers try to mitigate this unfairness by proposing the use of '''''distributionally robust optimization (DRO)'''''. Indeed Hashimoto et al. are able to show that DRO can bound the loss for minority groups and is fair on models that ERM turns unfair. <br />
<br />
===Note on Fairness===<br />
<br />
Hashimoto et al. follow the ''difference principle'' to achieve and measure fairness. It is defined as the maximization of the welfare of the worst-off group rather than the whole group (cf. utilitarianism).<br />
<br />
=Representation Disparity=<br />
<br />
If a user makes a query <math display="inline">Z \sim P</math>, the model <math display="inline">\theta \in \Theta</math> makes a prediction, and the user experiences loss <math display="inline">\ell (\theta; Z)</math>. The ''expected'' loss of a model <math display="inline">\theta</math> is denoted as the risk <math display="inline">\mathcal{R}(\theta) = \mathbb{E}_{Z \sim P} [\ell (\theta; Z)] </math>. While the queries (i.e. users using the system and providing data to the model) are made by users from <math display="inline">K</math> latent groups such that <math display="inline">Z \sim P := \sum_{k \in [K]} \alpha_kP_k</math>, neither the actual population proportions <math display="inline">\alpha_k</math> nor the group distributions <math display="inline">P_k</math> are known. Therefore, Hashimoto et al.'s goal is to control the worst-case risk over ''all'' <math display="inline">K</math> groups (without knowing in advance which group is worst off):<br />
<br />
\begin{align}<br />
\mathcal{R}_{max}(\theta) := \underset{k \in [K]}{\max} \mathcal{R}_k(\theta), \qquad \mathcal{R}_k(\theta) := \mathbb{E}_{P_k} [\ell (\theta; Z)]<br />
\end{align}<br />
<br />
There is high representation disparity if the expected loss of the model <math display="inline">\mathcal{R}(\theta)</math> is low, but the worst-case risk <math display="inline">\mathcal{R}_{max}(\theta)</math> of one group is high. This means that a model with high representation disparity performs well on average (i.e. has low overall loss), but fails to represent some groups <math display="inline">k</math> (i.e. the risk for the worst-off group is high). In order to make models with high representation disparity fairer, Hashimoto et al. try to minimize this worst-case risk <math display="inline">\mathcal{R}_{max}(\theta)</math> in general.<br />
<br />
=Disparity Amplification=<br />
<br />
Representation disparity can amplify as time passes and loss is minimized. Over <math display="inline">t = 1, 2, ..., T</math> minimization rounds the group proportions <math display="inline">\alpha_k^{(t)}</math> vary dependent on past losses. At each round the expected number of users <math display="inline">\lambda_k^{(t+1)}</math> from group <math display="inline">k</math> is determined by <br />
<br />
\begin{align}<br />
\lambda_k^{(t+1)} := \lambda_k^{(t)} \nu(\mathcal{R}_k(\theta^{(t)})) + b_k<br />
\end{align}<br />
<br />
where <math display="inline">\lambda_k^{(t)} \nu(\mathcal{R}_k(\theta^{(t)}))</math> describes the number of users retained from the previous optimization round (the retention function <math display="inline">\nu</math> gives the fraction of users who stay, given the risk they experienced) and <math display="inline">b_k</math> is the number of new users of group <math display="inline">k</math>. Put simply, the expected number of users of a group depends on the number of new users of that group and on the fraction of users that continue to use the system from the previous optimization step. If fewer and fewer users from minority groups return to the system (i.e. the model has a low retention rate for minority group users), Hashimoto et al. argue that the representation disparity amplifies. <br />
<br />
==Empirical Risk Minimization (ERM)==<br />
<br />
Without the knowledge of population proportions <math display="inline">\alpha_k^{(t)}</math>, the new user rate <math display="inline">b_k</math>, and the retention function <math display="inline">\nu</math> it is hard in practice, however, to control the worst-case risk over all time periods <math display="inline">\mathcal{R}_{max}^T</math> for all groups. That is why it is the standard approach to fit a sequence of models <math display="inline">\theta^{(t)}</math> by empirically approximating them. Using ERM, for instance, the optimal model is approached by minimizing the loss of the model:<br />
<br />
\begin{align}<br />
\theta^{(t)} = \underset{\theta \in \Theta}{\arg\min} \sum_i \ell(\theta; Z_i^{(t)})<br />
\end{align}<br />
<br />
However, ERM fails to prevent disparity amplification. By minimizing the expected loss of the model, minority groups experience higher loss (because the loss of the majority group is minimized), and do not return to use the system. In doing so, the population proportions <math display="inline">\alpha_k^{(t)}</math> shift, and certain minority groups contribute even less to the system. This is mirrored in the expected user count <math display="inline">\lambda^{(t)}</math> at each optimization point. In their paper Hashimoto et al. show that, if using ERM, <math display="inline">\lambda^{(t)}</math> is unstable because it loses its fair fixed point (i.e. the population fraction where risk minimization maintains the same population fraction over time). Therefore, ERM fails to control minority risk over time, and is considered unfair.<br />
<br />
=Distributionally Robust Optimization (DRO)=<br />
<br />
To overcome the unfairness of ERM, Hashimoto et al. propose distributionally robust optimization (DRO). At this point the goal is still to minimize the worst-case group risk over a single time-step <math display="inline">\mathcal{R}_{max} (\theta^{(t)}) </math> (time steps are omitted in this section's formulas). As previously mentioned, this is difficult to do because neither the population proportions <math display="inline">\alpha_k </math> nor group distributions <math display="inline">P_k </math> are known. Therefore, Hashimoto et al. developed an optimization technique that is robust "against '''''all''''' directions around the data generating distribution". This refers to the notion that DRO is robust to any group distribution <math display="inline">P_k </math> whose loss other optimization techniques such as ERM might try to optimize. To create this distributional robustness, the risk function <math display="inline">\mathcal{R}_{dro} </math> has to "up-weight" data <math display="inline">Z</math> that cause high loss <math display="inline">\ell(\theta; Z)</math>. In other words, the risk function has to over-represent mixture components (i.e. group distributions <math display="inline">P_k </math>) in relation to their original mixture weights (i.e. the population proportions <math display="inline">\alpha_k </math>) for groups that suffer high loss. <br />
<br />
To do this, Hashimoto et al. considered the worst-case loss (i.e. the highest risk) over all perturbations <math display="inline">P_k </math> around <math display="inline">P</math> within a certain limit (because obviously not every outlier should be up-weighted). This limit is described by the <math display="inline">\chi^2</math>-divergence (i.e. the distance, roughly speaking) between probability distributions. For two distributions <math display="inline">P</math> and <math display="inline">Q</math> the divergence is defined as <math display="inline">D_{\chi^2} (P || Q):= \int \left(\frac{dP}{dQ} - 1\right)^2 dQ</math>. With the help of the <math display="inline">\chi^2</math>-divergence, Hashimoto et al. defined the chi-squared ball <math display="inline">\mathcal{B}(P,r)</math> around the probability distribution <math display="inline">P</math>. This ball is defined so that <math display="inline">\mathcal{B}(P,r) := \{Q \ll P : D_{\chi^2} (Q || P) \leq r \}</math>. With the help of this ball the worst-case loss (i.e. the highest risk) over all perturbations <math display="inline">P_k </math> that lie inside the ball (i.e. within reasonable range) around the probability distribution <math display="inline">P</math> can be considered. This loss is given by<br />
<br />
\begin{align}<br />
\mathcal{R}_{dro}(\theta, r) := \underset{Q \in \mathcal{B}(P,r)}{\sup} \mathbb{E}_Q [\ell(\theta;Z)]<br />
\end{align}<br />
<br />
which, for <math display="inline">P:= \sum_{k \in [K]} \alpha_k P_k</math> and <math display="inline">r_k := (1/\alpha_k -1)^2</math>, bounds the risk of each group, <math display="inline">\mathcal{R}_k(\theta) \leq \mathcal{R}_{dro} (\theta; r_k)</math>, for all models <math display="inline">\theta \in \Theta</math>. Furthermore, if we specify a lower bound on the group proportions <math display="inline">\alpha_{min} \leq \min_{k \in [K]} \alpha_k</math>, and define <math display="inline">r_{max} := (1/\alpha_{min} -1)^2</math>, the worst-case risk <math display="inline">\mathcal{R}_{max} (\theta) </math> is controlled by <math display="inline">\mathcal{R}_{dro} (\theta; r_{max}) </math>, which forms an upper bound that can be minimized.<br />
<br />
==Optimization of DRO==<br />
<br />
To minimize <math display="inline">\mathcal{R}_{dro}(\theta, r) := \underset{Q \in \mathcal{B}(P,r)}{sup} \mathbb{E}_Q [\ell(\theta;Z)]</math> Hashimoto et al. look at the dual of this maximization problem (i.e. every maximization problem can be transformed into a minimization problem and vice-versa). This dual is given by the minimization problem<br />
<br />
\begin{align}<br />
\mathcal{R}_{dro}(\theta, r) = \underset{\eta \in \mathbb{R}}{\inf} \left\{ F(\theta; \eta):= C\left(\mathbb{E}_P \left[ [\ell(\theta;Z) - \eta]_+^2 \right] \right)^\frac{1}{2} + \eta \right\}<br />
\end{align}<br />
<br />
with <math display="inline">C = (2(1/\alpha_{min} - 1)^2 + 1)^{1/2}</math>. <math display="inline">\eta</math> describes the dual variable (i.e. the variable that appears in creating the dual). Since <math display="inline">F(\theta; \eta)</math> involves an expectation <math display="inline">\mathbb{E}_P</math> over the data generating distribution <math display="inline">P</math>, <math display="inline">F(\theta; \eta)</math> can be directly minimized. For convex losses <math display="inline">\ell(\theta;Z)</math>, <math display="inline">F(\theta; \eta)</math> is convex, and can be minimized by performing a binary search over <math display="inline">\eta</math>. In their paper, Hashimoto et al. further show that optimizing <math display="inline">\mathcal{R}_{dro}(\theta, r_{max})</math> at each time step controls the ''future'' worst-case risk <math display="inline">\mathcal{R}_{max} (\theta) </math>, and therefore retention rates. That means if the initial group proportions satisfy <math display="inline">\alpha_k^{(0)} \geq \alpha_{min}</math>, and <math display="inline">\mathcal{R}_{dro}(\theta, r_{max})</math> is optimized for every time step (and therefore <math display="inline">\mathcal{R}_{max} (\theta) </math> is minimized), <math display="inline">\mathcal{R}_{max}^T (\theta) </math> over all time steps is controlled. In other words, optimizing <math display="inline">\mathcal{R}_{dro}(\theta, r_{max})</math> every time step is enough to avoid disparity amplification.</div>Mpaflahttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Fairness_Without_Demographics_in_Repeated_Loss_Minimization&diff=37034Fairness Without Demographics in Repeated Loss Minimization2018-10-22T21:13:33Z<p>Mpafla: /* Distributonally Robust Optimization (DRO) */</p>
<hr />
<div>This page contains the summary of the paper "[http://proceedings.mlr.press/v80/hashimoto18a.html Fairness Without Demographics in Repeated Loss Minimization]" by Hashimoto, T. B., Srivastava, M., Namkoong, H., & Liang, P. which was published at the International Conference of Machine Learning (ICML) in 2018. <br />
<br />
=Introduction=<br />
<br />
Usually, machine learning models are trained to minimize their average loss in order to achieve high overall accuracy. While this works well for the majority, minority groups that use the system suffer high error rates because they contribute less data to the model. This phenomenon is known as '''''representation disparity''''' and has been observed in many models that, for instance, recognize faces or identify language. The disparity grows further because minority users who suffer higher error rates are less likely to use the system in the future. As a consequence, minority groups shrink further, and less data is available for the next optimization of the model. With less data the disparity becomes even worse, a phenomenon referred to as '''''disparity amplification'''''. <br />
<br />
In this paper, Hashimoto et al. first show that standard '''''empirical risk minimization (ERM)''''' does not control the loss of minority groups, and thus causes representation disparity and its amplification over time (even if the model is fair in the beginning). Second, the researchers try to mitigate this unfairness by proposing the use of '''''distributionally robust optimization (DRO)'''''. Indeed Hashimoto et al. are able to show that DRO can bound the loss for minority groups and is fair on models that ERM turns unfair. <br />
<br />
===Note on Fairness===<br />
<br />
Hashimoto et al. follow the ''difference principle'' to achieve and measure fairness. It is defined as the maximization of the welfare of the worst-off group rather than the whole group (cf. utilitarianism).<br />
<br />
=Representation Disparity=<br />
<br />
If a user makes a query <math display="inline">Z \sim P</math>, the model <math display="inline">\theta \in \Theta</math> makes a prediction, and the user experiences loss <math display="inline">\ell (\theta; Z)</math>. The ''expected'' loss of a model <math display="inline">\theta</math> is denoted as the risk <math display="inline">\mathcal{R}(\theta) = \mathbb{E}_{Z \sim P} [\ell (\theta; Z)] </math>. While the queries (i.e. users using the system and providing data to the model) are made by users from <math display="inline">K</math> latent groups such that <math display="inline">Z \sim P := \sum_{k \in [K]} \alpha_kP_k</math>, neither the actual population proportions <math display="inline">\alpha_k</math> nor the group distributions <math display="inline">P_k</math> are known. Therefore, Hashimoto et al.'s goal is to control the worst-case risk over ''all'' <math display="inline">K</math> groups (without knowing in advance which group is worst off):<br />
<br />
\begin{align}<br />
\mathcal{R}_{max}(\theta) := \underset{k \in [K]}{\max} \mathcal{R}_k(\theta), \qquad \mathcal{R}_k(\theta) := \mathbb{E}_{P_k} [\ell (\theta; Z)]<br />
\end{align}<br />
<br />
There is high representation disparity if the expected loss of the model <math display="inline">\mathcal{R}(\theta)</math> is low, but the worst-case risk <math display="inline">\mathcal{R}_{max}(\theta)</math> of one group is high. This means that a model with high representation disparity performs well on average (i.e. has low overall loss), but fails to represent some groups <math display="inline">k</math> (i.e. the risk for the worst-off group is high). In order to make models with high representation disparity fairer, Hashimoto et al. try to minimize this worst-case risk <math display="inline">\mathcal{R}_{max}(\theta)</math> in general.<br />
<br />
=Disparity Amplification=<br />
<br />
Representation disparity can amplify as time passes and loss is minimized. Over <math display="inline">t = 1, 2, ..., T</math> minimization rounds the group proportions <math display="inline">\alpha_k^{(t)}</math> vary dependent on past losses. At each round the expected number of users <math display="inline">\lambda_k^{(t+1)}</math> from group <math display="inline">k</math> is determined by <br />
<br />
\begin{align}<br />
\lambda_k^{(t+1)} := \lambda_k^{(t)} \nu(\mathcal{R}_k(\theta^{(t)})) + b_k<br />
\end{align}<br />
<br />
where <math display="inline">\lambda_k^{(t)} \nu(\mathcal{R}_k(\theta^{(t)}))</math> describes the number of users retained from the previous optimization round (the retention function <math display="inline">\nu</math> gives the fraction of users who stay, given the risk they experienced) and <math display="inline">b_k</math> is the number of new users of group <math display="inline">k</math>. Put simply, the expected number of users of a group depends on the number of new users of that group and on the fraction of users that continue to use the system from the previous optimization step. If fewer and fewer users from minority groups return to the system (i.e. the model has a low retention rate for minority group users), Hashimoto et al. argue that the representation disparity amplifies. <br />
<br />
==Empirical Risk Minimization (ERM)==<br />
<br />
Without the knowledge of population proportions <math display="inline">\alpha_k^{(t)}</math>, the new user rate <math display="inline">b_k</math>, and the retention function <math display="inline">\nu</math> it is hard in practice, however, to control the worst-case risk over all time periods <math display="inline">\mathcal{R}_{max}^T</math> for all groups. That is why it is the standard approach to fit a sequence of models <math display="inline">\theta^{(t)}</math> by empirically approximating them. Using ERM, for instance, the optimal model is approached by minimizing the loss of the model:<br />
<br />
\begin{align}<br />
\theta^{(t)} = \underset{\theta \in \Theta}{\arg\min} \sum_i \ell(\theta; Z_i^{(t)})<br />
\end{align}<br />
<br />
However, ERM fails to prevent disparity amplification. By minimizing the expected loss of the model, minority groups experience higher loss (because the loss of the majority group is minimized), and do not return to use the system. In doing so, the population proportions <math display="inline">\alpha_k^{(t)}</math> shift, and certain minority groups contribute even less to the system. This is mirrored in the expected user count <math display="inline">\lambda^{(t)}</math> at each optimization point. In their paper Hashimoto et al. show that, if using ERM, <math display="inline">\lambda^{(t)}</math> is unstable because it loses its fair fixed point (i.e. the population fraction where risk minimization maintains the same population fraction over time). Therefore, ERM fails to control minority risk over time, and is considered unfair.<br />
<br />
=Distributionally Robust Optimization (DRO)=<br />
<br />
To overcome the unfairness of ERM, Hashimoto et al. propose distributionally robust optimization (DRO). At this point the goal is still to minimize the worst-case group risk over a single time-step <math display="inline">\mathcal{R}_{max} (\theta^{(t)}) </math> (time steps are omitted in this section's formulas). As previously mentioned, this is difficult to do because neither the population proportions <math display="inline">\alpha_k </math> nor group distributions <math display="inline">P_k </math> are known. Therefore, Hashimoto et al. developed an optimization technique that is robust "against '''''all''''' directions around the data generating distribution". This refers to the notion that DRO is robust to any group distribution <math display="inline">P_k </math> whose loss other optimization techniques such as ERM might try to optimize. To create this distributional robustness, the risk function <math display="inline">\mathcal{R}_{dro} </math> has to "up-weight" data <math display="inline">Z</math> that cause high loss <math display="inline">\ell(\theta; Z)</math>. In other words, the risk function has to over-represent mixture components (i.e. group distributions <math display="inline">P_k </math>) in relation to their original mixture weights (i.e. the population proportions <math display="inline">\alpha_k </math>) for groups that suffer high loss. <br />
<br />
To do this, Hashimoto et al. considered the worst-case loss (i.e. the highest risk) over all perturbations <math display="inline">P_k </math> around <math display="inline">P</math> within a certain limit. This limit is described by the <math display="inline">\chi^2</math>-divergence (i.e. the distance, roughly speaking) between probability distributions. For two distributions <math display="inline">P</math> and <math display="inline">Q</math> the divergence is defined as <math display="inline">D_{\chi^2} (P || Q):= \int \left(\frac{dP}{dQ} - 1\right)^2 dQ</math>. With the help of the <math display="inline">\chi^2</math>-divergence, Hashimoto et al. defined the chi-squared ball <math display="inline">\mathcal{B}(P,r)</math> around the probability distribution <math display="inline">P</math>. This ball is defined so that <math display="inline">\mathcal{B}(P,r) := \{Q \ll P : D_{\chi^2} (Q || P) \leq r \}</math>. With the help of this ball the worst-case loss (i.e. the highest risk) over all perturbations <math display="inline">P_k </math> that lie inside the ball (i.e. within reasonable range) around the probability distribution <math display="inline">P</math> can be considered. This loss is given by<br />
<br />
\begin{align}<br />
\mathcal{R}_{dro}(\theta; r) := \underset{Q \in \mathcal{B}(P,r)}{\sup} \mathbb{E}_Q [\ell(\theta;Z)]<br />
\end{align}<br />
<br />
For a mixture <math display="inline">P:= \sum_{k \in [K]} \alpha_k P_k</math> and all models <math display="inline">\theta \in \Theta</math>, this quantity bounds the risk of each group: <math display="inline">\mathcal{R}_k(\theta) \leq \mathcal{R}_{dro} (\theta; r_k)</math>, where <math display="inline">r_k := (1/\alpha_k -1)^2</math>. Furthermore, if we specify a lower bound <math display="inline">\alpha_{min} \leq \min_{k \in [K]} \alpha_k</math> on the group proportions and define <math display="inline">r_{max} := (1/\alpha_{min} -1)^2</math>, then <math display="inline">\mathcal{R}_{dro} (\theta; r_{max}) </math> is an upper bound on the worst-case risk <math display="inline">\mathcal{R}_{max} (\theta) </math>, and minimizing this bound controls <math display="inline">\mathcal{R}_{max} (\theta) </math>.<br />
<br />
==Optimization of DRO==<br />
<br />
To minimize <math display="inline">\mathcal{R}_{dro}(\theta; r) := \underset{Q \in \mathcal{B}(P,r)}{\sup} \mathbb{E}_Q [\ell(\theta;Z)]</math>, Hashimoto et al. consider the dual of this maximization problem, which replaces the supremum over distributions <math display="inline">Q</math> with an infimum over a single scalar variable <math display="inline">\eta</math>. This dual is given by the minimization problem<br />
<br />
\begin{align}<br />
\mathcal{R}_{dro}(\theta; r) = \underset{\eta \in \mathbb{R}}{\inf} \left\{ F(\theta; \eta):= C\left(\mathbb{E}_P \left[ [\ell(\theta;Z) - \eta]_+^2 \right] \right)^\frac{1}{2} + \eta \right\}<br />
\end{align}<br />
<br />
with <math display="inline">C = (2(1/\alpha_{min} - 1)^2 + 1)^{1/2}</math> for <math display="inline">r = r_{max}</math>. Here <math display="inline">\eta</math> is the dual variable (i.e. the variable introduced when forming the dual). Since <math display="inline">F(\theta; \eta)</math> involves only an expectation <math display="inline">\mathbb{E}_P</math> over the data generating distribution <math display="inline">P</math>, it can be minimized directly, without knowledge of the groups. For convex losses <math display="inline">\ell(\theta;Z)</math>, <math display="inline">F(\theta; \eta)</math> is convex, and the minimization over <math display="inline">\eta</math> can be carried out by binary search. In their paper, Hashimoto et al. further show that optimizing <math display="inline">\mathcal{R}_{dro}(\theta; r_{max})</math> at each time step controls the ''future'' worst-case risk <math display="inline">\mathcal{R}_{max} (\theta) </math>, and therefore retention rates. That is, if the initial group proportions satisfy <math display="inline">\alpha_k^{(0)} \geq \alpha_{min}</math> and <math display="inline">\mathcal{R}_{dro}(\theta; r_{max})</math> is optimized at every time step (so that <math display="inline">\mathcal{R}_{max} (\theta) </math> is bounded at every step), then the worst-case risk over all time steps, <math display="inline">\mathcal{R}_{max}^T (\theta) </math>, is controlled. In other words, optimizing <math display="inline">\mathcal{R}_{dro}(\theta; r_{max})</math> at every time step is enough to avoid disparity amplification.</div>Mpaflahttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Fairness_Without_Demographics_in_Repeated_Loss_Minimization&diff=37033Fairness Without Demographics in Repeated Loss Minimization2018-10-22T21:10:15Z<p>Mpafla: /* Distributonally Robust Optimization (DRO) */</p>
<hr />
<div>This page contains the summary of the paper "[http://proceedings.mlr.press/v80/hashimoto18a.html Fairness Without Demographics in Repeated Loss Minimization]" by Hashimoto, T. B., Srivastava, M., Namkoong, H., & Liang, P., which was published at the International Conference on Machine Learning (ICML) in 2018. <br />
<br />
=Introduction=<br />
<br />
Machine learning models are usually trained to minimize average loss in order to achieve high overall accuracy. While this works well for the majority, minority groups that use the system suffer high error rates because they contribute less data to the model. This phenomenon is known as '''''representation disparity''''' and has been observed in many models that, for instance, recognize faces or identify language. The disparity grows over time: minority users who suffer higher error rates are less likely to use the system in the future, so minority groups shrink further and even less of their data is available for the next optimization of the model. With less data the disparity becomes even worse, a phenomenon referred to as '''''disparity amplification'''''. <br />
<br />
In this paper, Hashimoto et al. first show that standard '''''empirical risk minimization (ERM)''''' does not control the loss of minority groups, and thus causes representation disparity and its amplification over time (even if the model is fair in the beginning). Second, the researchers try to mitigate this unfairness by proposing the use of '''''distributionally robust optimization (DRO)'''''. Indeed, Hashimoto et al. are able to show that DRO can bound the loss for minority groups and remains fair in settings where ERM becomes unfair. <br />
<br />
===Note on Fairness===<br />
<br />
Hashimoto et al. follow the ''difference principle'' to achieve and measure fairness. Under this principle, the goal is to maximize the welfare of the worst-off group rather than the welfare of the population as a whole (cf. utilitarianism).<br />
<br />
=Representation Disparity=<br />
<br />
If a user makes a query <math display="inline">Z \sim P</math>, the model <math display="inline">\theta \in \Theta</math> makes a prediction, and the user experiences loss <math display="inline">\ell (\theta; Z)</math>. The ''expected'' loss of a model <math display="inline">\theta</math> is denoted as the risk <math display="inline">\mathcal{R}(\theta) = \mathbb{E}_{Z \sim P} [\ell (\theta; Z)] </math>. While the queries (i.e. users using the system and providing data to the model) are made by users from <math display="inline">K</math> latent groups such that <math display="inline">Z \sim P := \sum_{k \in [K]} \alpha_kP_k</math>, neither the actual population proportions <math display="inline">\alpha_k</math> nor the group distributions <math display="inline">P_k</math> are known. Therefore, Hashimoto et al.'s goal is to control the worst-case risk over ''all'' <math display="inline">K</math> groups, i.e. the risk of whichever group is worst off, rather than the average risk:<br />
<br />
\begin{align}<br />
\mathcal{R}_{max}(\theta) := \underset{k \in [K]}{\max} \mathcal{R}_k(\theta), \qquad \mathcal{R}_k(\theta) := \mathbb{E}_{P_k} [\ell (\theta; Z)]<br />
\end{align}<br />
<br />
There is high representation disparity if the expected loss of the model <math display="inline">\mathcal{R}(\theta)</math> is low but the worst-case risk <math display="inline">\mathcal{R}_{max}(\theta)</math> is high. A model with high representation disparity thus performs well on average (i.e. has low overall loss) but fails to represent some group <math display="inline">k</math> (i.e. the risk for the worst-off group is high). To make such models fairer, Hashimoto et al. aim to minimize the worst-case risk <math display="inline">\mathcal{R}_{max}(\theta)</math>.<br />
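<br />
As a minimal illustration of the gap between the two quantities, the following Python sketch computes <math display="inline">\mathcal{R}(\theta)</math> and <math display="inline">\mathcal{R}_{max}(\theta)</math> from per-group loss samples. The loss values, the 90/10 split, and the function name are made up for illustration and do not come from the paper. In the paper's setting the groups are latent, so a computation like this is only possible as a diagnostic when group membership happens to be known.<br />

```python
def overall_and_worst_case_risk(group_losses, proportions):
    """Return (R(theta), R_max(theta)) estimated from per-group loss samples.

    group_losses: one list of observed losses per group (hypothetical data).
    proportions:  the mixture weights alpha_k, assumed known here for illustration.
    """
    # Per-group risk R_k: average loss within each group.
    group_risks = [sum(losses) / len(losses) for losses in group_losses]
    # Overall risk R: mixture of the group risks with weights alpha_k.
    overall = sum(a * r for a, r in zip(proportions, group_risks))
    # Worst-case risk R_max: the largest group risk.
    return overall, max(group_risks)


# Hypothetical losses: the majority (90%) is served well, the minority (10%) badly.
majority = [0.05, 0.10, 0.15]
minority = [0.80, 0.90, 1.00]
overall, worst = overall_and_worst_case_risk([majority, minority], [0.9, 0.1])
print(f"overall risk: {overall:.2f}, worst-case group risk: {worst:.2f}")
```

The overall risk stays low because the minority carries little weight in the mixture, while the worst-case risk exposes the group that is poorly served.<br />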
<br />
=Disparity Amplification=<br />
<br />
Representation disparity can amplify over time as the loss is repeatedly minimized. Over minimization rounds <math display="inline">t = 1, 2, ..., T</math>, the group proportions <math display="inline">\alpha_k^{(t)}</math> vary depending on past losses. At each round, the expected number of users <math display="inline">\lambda_k^{(t+1)}</math> from group <math display="inline">k</math> is determined by <br />
<br />
\begin{align}<br />
\lambda_k^{(t+1)} := \lambda_k^{(t)} \nu(\mathcal{R}_k(\theta^{(t)})) + b_k<br />
\end{align}<br />
<br />
where <math display="inline">\lambda_k^{(t)} \nu(\mathcal{R}_k(\theta^{(t)}))</math> is the expected number of users retained from the previous round (<math display="inline">\nu</math> being the retention function, i.e. the fraction of users who return given the risk they experienced) and <math display="inline">b_k</math> is the number of new users of group <math display="inline">k</math>. Put simply, the expected number of users of a group depends on the number of new users of that group and on the fraction of existing users who continue to use the system after the previous optimization step. If fewer and fewer users from minority groups return (i.e. the model has a low retention rate for minority users), Hashimoto et al. argue, the representation disparity amplifies. <br />
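<br />
The recursion above can be simulated in a few lines. The following Python sketch is a toy model only: it assumes a retention function <math display="inline">\nu(\mathcal{R}) = 1 - \mathcal{R}</math> and a risk that equals one minus the group's current share of users, neither of which is taken from the paper. The function name and all numbers are hypothetical.<br />

```python
def simulate_user_counts(initial_counts, new_users, rounds):
    """Iterate lambda_k^(t+1) = lambda_k^(t) * nu(R_k) + b_k and track the smallest share."""
    counts = list(initial_counts)
    smallest_share = []
    for _ in range(rounds):
        total = sum(counts)
        shares = [c / total for c in counts]   # group proportions alpha_k^(t)
        smallest_share.append(min(shares))
        # ASSUMPTION (toy model): a group's risk is higher the smaller its share.
        risks = [1.0 - a for a in shares]
        # ASSUMPTION (toy model): retention function nu(R) = 1 - R.
        retention = [1.0 - r for r in risks]
        counts = [c * v + b for c, v, b in zip(counts, retention, new_users)]
    return smallest_share


# Two groups starting at 60/40 with equal numbers of new users per round.
shares = simulate_user_counts([60.0, 40.0], [10.0, 10.0], rounds=50)
print(f"minority share: first round {shares[0]:.3f}, last round {shares[-1]:.3f}")
```

Even though both groups receive the same number of new users, the minority share shrinks in this toy model because its members are retained at a lower rate.<br />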
<br />
==Empirical Risk Minimization (ERM)==<br />
<br />
Without knowledge of the population proportions <math display="inline">\alpha_k^{(t)}</math>, the new user rate <math display="inline">b_k</math>, and the retention function <math display="inline">\nu</math>, however, it is hard in practice to control the worst-case risk over all time periods <math display="inline">\mathcal{R}_{max}^T</math>. The standard approach is therefore to fit a sequence of models <math display="inline">\theta^{(t)}</math>, one per round, from the data observed in that round. With ERM, for instance, each model minimizes the total loss over the observed samples:<br />
<br />
\begin{align}<br />
\theta^{(t)} = \underset{\theta \in \Theta}{\arg\min} \sum_i \ell(\theta; Z_i^{(t)})<br />
\end{align}<br />
<br />
However, ERM fails to prevent disparity amplification. Because ERM minimizes the average loss, which is dominated by the majority group, minority groups experience higher loss and are less likely to return to the system. As a result, the population proportions <math display="inline">\alpha_k^{(t)}</math> shift, and certain minority groups contribute even less to the system. This is mirrored in the expected user count <math display="inline">\lambda^{(t)}</math> at each optimization point. In their paper, Hashimoto et al. show that under ERM <math display="inline">\lambda^{(t)}</math> is unstable because it loses its fair fixed point (i.e. the population fraction at which risk minimization maintains the same population fraction over time). Therefore, ERM fails to control minority risk over time, and is considered unfair.<br />
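<br />
A small worked example shows why pooled ERM favours the majority. The Python sketch below assumes a scalar model <math display="inline">\theta</math> and the squared loss <math display="inline">\ell(\theta; Z) = (Z - \theta)^2</math>, for which the ERM solution is the mean of the pooled sample. The data and function names are hypothetical and not from the paper.<br />

```python
def erm_squared_loss(samples):
    """ERM for a scalar theta under squared loss: the pooled sample mean."""
    return sum(samples) / len(samples)


def group_risk(theta, samples):
    """Average squared loss of theta on one group's samples."""
    return sum((z - theta) ** 2 for z in samples) / len(samples)


# Hypothetical data: nine majority users centred at 0, one minority user at 1.
majority = [0.0] * 9
minority = [1.0]
theta = erm_squared_loss(majority + minority)
risk_majority = group_risk(theta, majority)
risk_minority = group_risk(theta, minority)
print(f"theta={theta:.2f}, majority risk={risk_majority:.2f}, minority risk={risk_minority:.2f}")
```

The fitted parameter sits close to the majority, so the minority's risk is far higher even though the average loss is small.<br />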
<br />
=Distributionally Robust Optimization (DRO)=<br />
<br />
To overcome the unfairness of ERM, Hashimoto et al. propose distributionally robust optimization (DRO). At this point the goal is still to minimize the worst-case group risk over a single time step, <math display="inline">\mathcal{R}_{max} (\theta^{(t)}) </math>. As previously mentioned, this is difficult because neither the population proportions <math display="inline">\alpha_k </math> nor the group distributions <math display="inline">P_k </math> are known. Therefore, Hashimoto et al. use an optimization technique that is robust "against '''''all''''' directions around the data generating distribution". In other words, DRO is meant to be robust with respect to any group distribution <math display="inline">P_k </math>, not only those whose loss an average-loss method such as ERM happens to optimize. To achieve this distributional robustness, the risk function <math display="inline">\mathcal{R}_{dro} </math> has to "up-weight" data <math display="inline">Z</math> that cause high loss <math display="inline">\ell(\theta; Z)</math>. Put differently, for groups that suffer high loss, the risk function has to over-represent the mixture components (i.e. the group distributions <math display="inline">P_k </math>) relative to their original mixture weights (i.e. the population proportions <math display="inline">\alpha_k </math>). <br />
<br />
To do this, Hashimoto et al. consider the worst-case loss (i.e. the highest risk) over all perturbations <math display="inline">Q</math> of <math display="inline">P</math> within a certain limit. This limit is described by the <math display="inline">\chi^2</math>-divergence (roughly speaking, a distance) between probability distributions. For two distributions <math display="inline">P</math> and <math display="inline">Q</math> the divergence is defined as <math display="inline">D_{\chi^2} (P || Q):= \int \left(\frac{dP}{dQ} - 1\right)^2 dQ</math>. With the help of the <math display="inline">\chi^2</math>-divergence, Hashimoto et al. define the chi-squared ball <math display="inline">\mathcal{B}(P,r)</math> of radius <math display="inline">r</math> around the probability distribution <math display="inline">P</math> as <math display="inline">\mathcal{B}(P,r) := \{Q \ll P : D_{\chi^2} (Q || P) \leq r \}</math>. The worst-case loss is then taken over all perturbations <math display="inline">Q</math> that lie inside this ball (i.e. within a reasonable range of <math display="inline">P</math>). It is given by<br />
<br />
\begin{align}<br />
\mathcal{R}_{dro}(\theta; r) := \underset{Q \in \mathcal{B}(P,r)}{\sup} \mathbb{E}_Q [\ell(\theta;Z)]<br />
\end{align}<br />
<br />
For a mixture <math display="inline">P:= \sum_{k \in [K]} \alpha_k P_k</math> and all models <math display="inline">\theta \in \Theta</math>, this quantity bounds the risk of each group: <math display="inline">\mathcal{R}_k(\theta) \leq \mathcal{R}_{dro} (\theta; r_k)</math>, where <math display="inline">r_k := (1/\alpha_k -1)^2</math>. Furthermore, if we specify a lower bound <math display="inline">\alpha_{min} \leq \min_{k \in [K]} \alpha_k</math> on the group proportions and define <math display="inline">r_{max} := (1/\alpha_{min} -1)^2</math>, then <math display="inline">\mathcal{R}_{dro} (\theta; r_{max}) </math> is an upper bound on the worst-case risk <math display="inline">\mathcal{R}_{max} (\theta) </math>, and minimizing this bound controls <math display="inline">\mathcal{R}_{max} (\theta) </math>.<br />
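<br />
To make the bound concrete, the following Python sketch evaluates an empirical version of <math display="inline">\mathcal{R}_{dro}(\theta; r)</math> for a fixed model. It relies on the dual form stated in the "Optimization of DRO" subsection and finds the dual variable <math display="inline">\eta</math> by bisection on the slope of the dual objective. Several things here are assumptions rather than the authors' implementation: the expectation under <math display="inline">P</math> is replaced by a sample average, the constant is taken as <math display="inline">C = (2r + 1)^{1/2}</math> (which matches the constant given in the text when <math display="inline">r = r_{max}</math>), and the function name and loss values are hypothetical.<br />

```python
import math


def dro_risk(losses, r, iters=100):
    """Empirical chi-square DRO risk of a fixed model, computed through the dual.

    losses: observed per-sample losses l(theta; Z_i) for the fixed model.
    r:      radius of the chi-square ball (must be positive).
    """
    if r <= 0:
        raise ValueError("r must be positive")
    n = len(losses)
    c = math.sqrt(2.0 * r + 1.0)  # ASSUMPTION: C = (2r + 1)^(1/2)

    def objective(eta):
        # F(eta) = C * sqrt(mean([l - eta]_+^2)) + eta
        second = sum(max(l - eta, 0.0) ** 2 for l in losses) / n
        return c * math.sqrt(second) + eta

    def slope(eta):
        # Derivative of F with respect to eta (nondecreasing, since F is convex).
        first = sum(max(l - eta, 0.0) for l in losses) / n
        second = sum(max(l - eta, 0.0) ** 2 for l in losses) / n
        if second == 0.0:
            return 1.0  # eta >= max loss, where F(eta) = eta
        return 1.0 - c * first / math.sqrt(second)

    # Bracket the minimizer: the slope is positive at hi and non-positive at lo.
    hi = max(losses)
    width = (hi - min(losses)) or 1.0
    lo = min(losses) - width
    while slope(lo) > 0.0:
        width *= 2.0
        lo -= width
    # Bisection on the sign of the slope.
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if slope(mid) > 0.0:
            hi = mid
        else:
            lo = mid
    return objective(0.5 * (lo + hi))


# Hypothetical losses: 80% of users at loss 0.1, 20% at loss 1.0.
losses = [0.1] * 8 + [1.0] * 2
alpha_min = 0.2
r_max = (1.0 / alpha_min - 1.0) ** 2
print(f"average risk:     {sum(losses) / len(losses):.3f}")
print(f"DRO risk (r_max): {dro_risk(losses, r_max):.3f}")
```

In this example the DRO risk at <math display="inline">r_{max}</math> is at least as large as the risk of the hypothetical minority group, although the function never sees group labels.<br />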
<br />
==Optimization of DRO==<br />
<br />
To minimize <math display="inline">\mathcal{R}_{dro}(\theta; r) := \underset{Q \in \mathcal{B}(P,r)}{\sup} \mathbb{E}_Q [\ell(\theta;Z)]</math>, Hashimoto et al. consider the dual of this maximization problem, which replaces the supremum over distributions <math display="inline">Q</math> with an infimum over a single scalar variable <math display="inline">\eta</math>. This dual is given by the minimization problem<br />
<br />
\begin{align}<br />
\mathcal{R}_{dro}(\theta; r) = \underset{\eta \in \mathbb{R}}{\inf} \left\{ F(\theta; \eta):= C\left(\mathbb{E}_P \left[ [\ell(\theta;Z) - \eta]_+^2 \right] \right)^\frac{1}{2} + \eta \right\}<br />
\end{align}<br />
<br />
with <math display="inline">C = (2(1/\alpha_{min} - 1)^2 + 1)^{1/2}</math> for <math display="inline">r = r_{max}</math>. Here <math display="inline">\eta</math> is the dual variable (i.e. the variable introduced when forming the dual). Since <math display="inline">F(\theta; \eta)</math> involves only an expectation <math display="inline">\mathbb{E}_P</math> over the data generating distribution <math display="inline">P</math>, it can be minimized directly, without knowledge of the groups. For convex losses <math display="inline">\ell(\theta;Z)</math>, <math display="inline">F(\theta; \eta)</math> is convex, and the minimization over <math display="inline">\eta</math> can be carried out by binary search. In their paper, Hashimoto et al. further show that optimizing <math display="inline">\mathcal{R}_{dro}(\theta; r_{max})</math> at each time step controls the ''future'' worst-case risk <math display="inline">\mathcal{R}_{max} (\theta) </math>, and therefore retention rates. That is, if the initial group proportions satisfy <math display="inline">\alpha_k^{(0)} \geq \alpha_{min}</math> and <math display="inline">\mathcal{R}_{dro}(\theta; r_{max})</math> is optimized at every time step (so that <math display="inline">\mathcal{R}_{max} (\theta) </math> is bounded at every step), then the worst-case risk over all time steps, <math display="inline">\mathcal{R}_{max}^T (\theta) </math>, is controlled. In other words, optimizing <math display="inline">\mathcal{R}_{dro}(\theta; r_{max})</math> at every time step is enough to avoid disparity amplification.</div>Mpaflahttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Fairness_Without_Demographics_in_Repeated_Loss_Minimization&diff=37032Fairness Without Demographics in Repeated Loss Minimization2018-10-22T19:11:31Z<p>Mpafla: /* Distributonally Robust Optimization (DRO) */</p>
Mpaflahttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Fairness_Without_Demographics_in_Repeated_Loss_Minimization&diff=37031Fairness Without Demographics in Repeated Loss Minimization2018-10-22T19:10:39Z<p>Mpafla: /* Distributonally Robust Optimization (DRO) */</p>
<hr />
<div>This page contains the summary of the paper "[http://proceedings.mlr.press/v80/hashimoto18a.html Fairness Without Demographics in Repeated Loss Minimization]" by Hashimoto, T. B., Srivastava, M., Namkoong, H., & Liang, P., which was published at the International Conference on Machine Learning (ICML) in 2018. <br />
<br />
=Introduction=<br />
<br />
Machine learning models are usually trained to minimize average loss in order to achieve high overall accuracy. While this works well for the majority, minority groups that use the system suffer high error rates because they contribute less data to the model. This phenomenon is known as '''''representation disparity''''' and has been observed in many models that, for instance, recognize faces or identify language. The disparity grows over time: minority users who suffer higher error rates are less likely to use the system in the future, so minority groups shrink further and even less of their data is available for the next optimization of the model. With less data the disparity becomes even worse, a phenomenon referred to as '''''disparity amplification'''''. <br />
<br />
In this paper, Hashimoto et al. first show that standard '''''empirical risk minimization (ERM)''''' does not control the loss of minority groups, and thus causes representation disparity and its amplification over time (even if the model is fair in the beginning). Second, the researchers try to mitigate this unfairness by proposing the use of '''''distributionally robust optimization (DRO)'''''. Indeed, Hashimoto et al. are able to show that DRO can bound the loss for minority groups and remains fair in settings where ERM becomes unfair. <br />
<br />
===Note on Fairness===<br />
<br />
Hashimoto et al. follow the ''difference principle'' to achieve and measure fairness. Under this principle, the goal is to maximize the welfare of the worst-off group rather than the welfare of the population as a whole (cf. utilitarianism).<br />
<br />
=Representation Disparity=<br />
<br />
If a user makes a query <math display="inline">Z \sim P</math>, the model <math display="inline">\theta \in \Theta</math> makes a prediction, and the user experiences loss <math display="inline">\ell (\theta; Z)</math>. The ''expected'' loss of a model <math display="inline">\theta</math> is denoted as the risk <math display="inline">\mathcal{R}(\theta) = \mathbb{E}_{Z \sim P} [\ell (\theta; Z)] </math>. While the queries (i.e. users using the system and providing data to the model) are made by users from <math display="inline">K</math> latent groups such that <math display="inline">Z \sim P := \sum_{k \in [K]} \alpha_kP_k</math>, neither the actual population proportions <math display="inline">\alpha_k</math> nor the group distributions <math display="inline">P_k</math> are known. Therefore, Hashimoto et al.'s goal is to control the worst-case risk over ''all'' <math display="inline">K</math> groups, i.e. the risk of whichever group is worst off, rather than the average risk:<br />
<br />
\begin{align}<br />
\mathcal{R}_{max}(\theta) := \underset{k \in [K]}{\max} \mathcal{R}_k(\theta), \qquad \mathcal{R}_k(\theta) := \mathbb{E}_{P_k} [\ell (\theta; Z)]<br />
\end{align}<br />
<br />
There is high representation disparity if the expected loss of the model <math display="inline">\mathcal{R}(\theta)</math> is low, but the worst-case risk <math display="inline">\mathcal{R}_{max}(\theta)</math> of one group is high. This means that a model with high representation disparity performs well on average (i.e. has low overall loss), but fails to represent some groups <math display="inline">k</math> (i.e. the risk for the worst-off group is high). In order to make models with high representation disparity fairer, Hashimoto et al. try to minimize this worst-case risk <math display="inline">\mathcal{R}_{max}(\theta)</math> in general.<br />
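<br />
The gap between average risk and worst-case risk can be illustrated with a small numerical sketch. The group proportions and per-group risks below are made-up values chosen for illustration (in the paper's setting both are unknown), and the function names are hypothetical.<br />

```python
def overall_risk(proportions, group_risks):
    """Risk under the mixture P = sum_k alpha_k * P_k, i.e. sum_k alpha_k * R_k."""
    return sum(a * r for a, r in zip(proportions, group_risks))


def worst_case_risk(group_risks):
    """R_max: the risk of whichever group is worst off."""
    return max(group_risks)


# Illustrative values only: a 90% majority with low risk, a 10% minority with high risk.
alpha = [0.9, 0.1]
risks = [0.05, 0.60]

print("overall risk:   ", overall_risk(alpha, risks))
print("worst-case risk:", worst_case_risk(risks))
```

With these assumed numbers the overall risk stays close to the majority's risk, while the worst-case risk equals the minority's risk, which is the situation described above as high representation disparity.<br />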
<br />
=Disparity Amplification=<br />
<br />
Representation disparity can amplify as time passes and loss is repeatedly minimized. Over <math display="inline">t = 1, 2, ..., T</math> minimization rounds, the group proportions <math display="inline">\alpha_k^{(t)}</math> vary depending on past losses. At each round, the expected number of users <math display="inline">\lambda_k^{(t+1)}</math> from group <math display="inline">k</math> is determined by <br />
<br />
\begin{align}<br />
\lambda_k^{(t+1)} := \lambda_k^{(t)} \nu(\mathcal{R}_k(\theta^{(t)})) + b_k<br />
\end{align}<br />
<br />
where <math display="inline">\lambda_k^{(t)} \nu(\mathcal{R}_k(\theta^{(t)}))</math> describes the expected number of users retained from the previous optimization (with <math display="inline">\nu</math> the retention function, which gives the fraction of users that stay) and <math display="inline">b_k</math> is the number of new users of group <math display="inline">k</math>. Put simply, the expected number of users of a group depends on the number of new users of that group, and on the fraction of users that continue to use the system from the previous optimization step. If fewer and fewer users from minority groups return to the system (i.e. the model has a low retention rate for minority group users), Hashimoto et al. argue that the representation disparity amplifies. <br />
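<br />
The update rule above can be simulated in a few lines. This is only a sketch under stated assumptions: the retention function <math display="inline">\nu</math> is not specified in this summary, so the linear form used here is hypothetical, and the per-group risks are held fixed across rounds instead of being recomputed from a refitted model.<br />

```python
def retention(risk):
    """Hypothetical retention function nu: fraction of users who stay, decreasing in risk."""
    return min(1.0, max(0.0, 1.0 - risk))


def next_expected_users(lam, group_risks, new_users):
    """One round of lambda_k^(t+1) = lambda_k^(t) * nu(R_k) + b_k for every group k."""
    return [l * retention(r) + b for l, r, b in zip(lam, group_risks, new_users)]


def proportions(lam):
    """Group proportions alpha_k implied by the expected user counts."""
    total = sum(lam)
    return [l / total for l in lam]


# Illustrative values only: group 0 is the majority, group 1 the minority.
lam = [100.0, 20.0]
group_risks = [0.1, 0.5]  # assumed fixed here; in the paper they depend on theta^(t)
new_users = [10.0, 2.0]

for t in range(4):
    print(t, [round(a, 3) for a in proportions(lam)])
    lam = next_expected_users(lam, group_risks, new_users)
```

Under these assumptions the minority proportion shrinks from round to round, because the group with the higher risk retains a smaller fraction of its users.<br />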
<br />
==Empirical Risk Minimization (ERM)==<br />
<br />
Without knowledge of the population proportions <math display="inline">\alpha_k^{(t)}</math>, the new user rate <math display="inline">b_k</math>, and the retention function <math display="inline">\nu</math>, however, it is hard in practice to control the worst-case risk over all time periods <math display="inline">\mathcal{R}_{max}^T</math> for all groups. That is why the standard approach is to fit a sequence of models <math display="inline">\theta^{(t)}</math> empirically, one per round. Using ERM, for instance, the model at round <math display="inline">t</math> is obtained by minimizing the total loss on the data observed in that round:<br />
<br />
\begin{align}<br />
\theta^{(t)} = \underset{\theta \in \Theta}{\arg\min} \sum_i \ell(\theta; Z_i^{(t)})<br />
\end{align}<br />
<br />
However, ERM fails to prevent disparity amplification. Because ERM minimizes the average loss, which is dominated by the majority group, minority groups experience higher loss and are less likely to return to the system. As a result, the population proportions <math display="inline">\alpha_k^{(t)}</math> shift, and certain minority groups contribute even less data to the system. This is mirrored in the expected user count <math display="inline">\lambda^{(t)}</math> at each optimization point. In their paper, Hashimoto et al. show that, under ERM, <math display="inline">\lambda^{(t)}</math> is unstable because it loses its fair fixed point (i.e. the population fraction where risk minimization maintains the same population fraction over time). Therefore, ERM fails to control minority risk over time, and is considered unfair.<br />
<br />
=Distributionally Robust Optimization (DRO)=<br />
<br />
To overcome the unfairness of ERM, Hashimoto et al. propose the use of distributionally robust optimization (DRO). At this point the goal is still to minimize the worst-case group risk over a single time-step <math display="inline">\mathcal{R}_{max} (\theta^{(t)}) </math>. As previously mentioned, this is difficult to do because neither the population proportions <math display="inline">\alpha_k </math> nor the group distributions <math display="inline">P_k </math> are known. Therefore, Hashimoto et al. use an optimization technique that is robust "against '''''all''''' directions around the data generating distribution". This refers to the notion that DRO is robust to any group distribution <math display="inline">P_k </math> whose loss other optimization techniques such as ERM might try to optimize. To create this distributional robustness, the optimization's risk function <math display="inline">\mathcal{R}_{dro} </math> has to "up-weight" data <math display="inline">Z</math> that cause high loss <math display="inline">\ell(\theta; Z)</math>. In other words, the risk function has to over-represent mixture components (i.e. group distributions <math display="inline">P_k </math>) in relation to their original mixture weights (i.e. the population proportions <math display="inline">\alpha_k </math>) for groups that suffer high loss. <br />
<br />
To do this Hashimoto et al. considered the worst-case loss (i.e. the highest risk) over all perturbations <math display="inline">Q</math> around <math display="inline">P</math> within a certain limit. This limit is described by the <math display="inline">\chi^2</math>-divergence (i.e. the distance, roughly speaking) between probability distributions. For two distributions <math display="inline">P</math> and <math display="inline">Q</math> the divergence is defined as <math display="inline">D_{\chi^2} (P || Q):= \int (\frac{dP}{dQ} - 1)^2 dQ</math>. With the help of the <math display="inline">\chi^2</math>-divergence, Hashimoto et al. define the chi-squared ball <math display="inline">\mathcal{B}(P,r)</math> around the probability distribution <math display="inline">P</math>. This ball is defined so that <math display="inline">\mathcal{B}(P,r) := \{Q \ll P : D_{\chi^2} (Q || P) \leq r \}</math>. It is exactly this ball that gives us the opportunity to consider the worst-case loss (i.e. the highest risk) over all perturbations <math display="inline">Q</math> that lie inside the ball (i.e. within reasonable range) around the probability distribution <math display="inline">P</math>. This loss is given by<br />
<br />
\begin{align}<br />
\mathcal{R}_{dro}(\theta, r) := \underset{Q \in \mathcal{B}(P,r)}{\sup} \mathbb{E}_Q [\ell(\theta;Z)]<br />
\end{align}<br />
<br />
For <math display="inline">P:= \sum_{k \in [K]} \alpha_k P_k</math> and <math display="inline">r_k := (1/\alpha_k -1)^2</math>, this quantity bounds the risk of each group for all models <math display="inline">\theta \in \Theta</math>: <math display="inline">\mathcal{R}_k(\theta) \leq \mathcal{R}_{dro} (\theta; r_k)</math>. Furthermore, if we specify a lower bound on the group proportions <math display="inline">\alpha_{min} \leq \min_{k \in [K]} \alpha_k</math>, and define <math display="inline">r_{max} := (1/\alpha_{min} -1)^2</math>, then <math display="inline">\mathcal{R}_{dro} (\theta; r_{max}) </math> is an upper bound on the worst-case risk <math display="inline">\mathcal{R}_{max} (\theta) </math>, and this upper bound can be minimized.<br />
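<br />
As a rough numerical sketch, the robust risk can be evaluated on a sample of losses through the dual form given in the "Optimization of DRO" section of this summary, using the constant <math display="inline">C</math> exactly as stated there. The function names are hypothetical, the expectation is replaced by a sample mean, and the search over <math display="inline">\eta</math> uses a ternary search (a substitute for the binary search mentioned in this summary, relying on convexity in <math display="inline">\eta</math>). This is not the authors' implementation.<br />

```python
import numpy as np


def dual_objective(losses, eta, alpha_min):
    """F(theta; eta) evaluated on a sample of losses, with C as stated in this summary."""
    c = np.sqrt(2.0 * (1.0 / alpha_min - 1.0) ** 2 + 1.0)
    excess = np.maximum(np.asarray(losses, dtype=float) - eta, 0.0)  # [loss - eta]_+
    return c * np.sqrt(np.mean(excess ** 2)) + eta


def dro_risk(losses, alpha_min, iters=200):
    """Approximate the infimum of F over eta by ternary search (F is convex in eta)."""
    if not 0.0 < alpha_min < 1.0:
        raise ValueError("alpha_min must lie strictly between 0 and 1")
    losses = np.asarray(losses, dtype=float)
    c = np.sqrt(2.0 * (1.0 / alpha_min - 1.0) ** 2 + 1.0)
    hi = losses.max()                          # F(eta) = eta for every eta >= max loss
    lo = (c * losses.mean() - hi) / (c - 1.0)  # for eta below this, F(eta) >= F(hi)
    for _ in range(iters):
        m1 = lo + (hi - lo) / 3.0
        m2 = hi - (hi - lo) / 3.0
        if dual_objective(losses, m1, alpha_min) <= dual_objective(losses, m2, alpha_min):
            hi = m2  # a minimizer lies in [lo, m2]
        else:
            lo = m1  # a minimizer lies in [m1, hi]
    return dual_objective(losses, 0.5 * (lo + hi), alpha_min)


# Illustrative sample: most losses are small, one is large.
sample_losses = [0.1, 0.2, 0.3, 2.0]
print("average loss:          ", np.mean(sample_losses))
print("robust, alpha_min=0.5: ", dro_risk(sample_losses, 0.5))
print("robust, alpha_min=0.2: ", dro_risk(sample_losses, 0.2))
```

The robust value lies between the average loss and the largest loss in the sample, and it moves toward the largest loss as the assumed <math display="inline">\alpha_{min}</math> decreases, which reflects the up-weighting of high-loss data described above.<br />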
<br />
==Optimization of DRO==<br />
<br />
To minimize <math display="inline">\mathcal{R}_{dro}(\theta, r) := \underset{Q \in \mathcal{B}(P,r)}{\sup} \mathbb{E}_Q [\ell(\theta;Z)]</math>, Hashimoto et al. look at the dual of this maximization problem (i.e. the maximization over <math display="inline">Q</math> is rewritten as a minimization over a dual variable). This dual is given by the minimization problem<br />
<br />
\begin{align}<br />
\mathcal{R}_{dro}(\theta, r) = \underset{\eta \in \mathbb{R}}{\inf} \left\{ F(\theta; \eta):= C\left(\mathbb{E}_P \left[ [\ell(\theta;Z) - \eta]_+^2 \right] \right)^\frac{1}{2} + \eta \right\}<br />
\end{align}<br />
<br />
with <math display="inline">C = (2(1/\alpha_{min} - 1)^2 + 1)^{1/2}</math>. Here <math display="inline">\eta</math> is the dual variable (i.e. the variable that appears in creating the dual). Since <math display="inline">F(\theta; \eta)</math> only involves an expectation <math display="inline">\mathbb{E}_P</math> over the data generating distribution <math display="inline">P</math>, <math display="inline">F(\theta; \eta)</math> can be directly minimized. For convex losses <math display="inline">\ell(\theta;Z)</math>, <math display="inline">F(\theta; \eta)</math> is convex, and can be minimized by performing a binary search over <math display="inline">\eta</math>. In their paper, Hashimoto et al. further show that optimizing <math display="inline">\mathcal{R}_{dro}(\theta, r_{max})</math> at each time step controls the ''future'' worst-case risk <math display="inline">\mathcal{R}_{max} (\theta) </math>, and therefore retention rates. That means if the initial group proportions satisfy <math display="inline">\alpha_k^{(0)} \geq \alpha_{min}</math>, and <math display="inline">\mathcal{R}_{dro}(\theta, r_{max})</math> is optimized for every time step (and therefore <math display="inline">\mathcal{R}_{max} (\theta) </math> is bounded), <math display="inline">\mathcal{R}_{max}^T (\theta) </math> over all time steps is controlled. In other words, optimizing <math display="inline">\mathcal{R}_{dro}(\theta, r_{max})</math> every time step is enough to avoid disparity amplification.</div>Mpaflahttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Fairness_Without_Demographics_in_Repeated_Loss_Minimization&diff=37030Fairness Without Demographics in Repeated Loss Minimization2018-10-22T19:09:49Z<p>Mpafla: /* Distributonally Robust Optimization (DRO) */</p>
<hr />
<div>This page contains the summary of the paper "[http://proceedings.mlr.press/v80/hashimoto18a.html Fairness Without Demographics in Repeated Loss Minimization]" by Hashimoto, T. B., Srivastava, M., Namkoong, H., & Liang, P. which was published at the International Conference on Machine Learning (ICML) in 2018. <br />
<br />
=Introduction=<br />
<br />
Usually, machine learning models are trained to minimize their average loss in order to achieve high overall accuracy. While this works well for the majority, minority groups that use the system suffer high error rates because they contribute less data to the model. This phenomenon is known as '''''representation disparity''''' and has been observed in many models that, for instance, recognize faces, or identify language. This disparity even increases as minority users suffer higher error rates, and therefore, are less likely to use the system in the future. As a consequence minority groups further shrink, and therefore, less data is available for the next optimization of the model. With less data the disparity becomes even worse - a phenomenon referred to as '''''disparity amplification'''''. <br />
<br />
In this paper, Hashimoto et al. first show that standard '''''empirical risk minimization (ERM)''''' does not control the loss of minority groups, and thus causes representation disparity and its amplification over time (even if the model is fair in the beginning). Second, the researchers try to mitigate this unfairness by proposing the use of '''''distributionally robust optimization (DRO)'''''. Indeed Hashimoto et al. are able to show that DRO can bound the loss for minority groups and is fair on models that ERM turns unfair. <br />
<br />
===Note on Fairness===<br />
<br />
Hashimoto et al. follow the ''difference principle'' to achieve and measure fairness. It is defined as the maximization of the welfare of the worst-off group rather than the whole group (cf. utilitarianism).<br />
<br />
=Representation Disparity=<br />
<br />
If a user makes a query <math display="inline">Z \sim P</math>, the model <math display="inline">\theta \in \Theta</math> makes a prediction, and the user experiences loss <math display="inline">\ell (\theta; Z)</math>. The ''expected'' loss of a model <math display="inline">\theta</math> is denoted as the risk <math display="inline">\mathcal{R}(\theta) = \mathbb{E}_{Z \sim P} [\ell (\theta; Z)] </math>. While the queries (i.e. users using the system and providing data to the model) are made by users from <math display="inline">K</math> latent groups such that <math display="inline">Z \sim P := \sum_{k \in [K]} \alpha_kP_k</math>, neither the actual population proportions <math display="inline">\alpha_k</math> nor the group distributions <math display="inline">P_k</math> are known. Therefore, Hashimoto et al.'s goal is to control the worst-case risk over ''all'' <math display="inline">K</math> groups (i.e. the risk of whichever group is worst off, without knowing in advance which group that is):<br />
<br />
\begin{align}<br />
\mathcal{R}_{max}(\theta) := \underset{k \in [K]}{\max} \mathcal{R}_k(\theta), \qquad \mathcal{R}_k(\theta) := \mathbb{E}_{P_k} [\ell (\theta; Z)]<br />
\end{align}<br />
<br />
There is high representation disparity if the expected loss of the model <math display="inline">\mathcal{R}(\theta)</math> is low, but the worst-case risk <math display="inline">\mathcal{R}_{max}(\theta)</math> of one group is high. This means that a model with high representation disparity performs well on average (i.e. has low overall loss), but fails to represent some groups <math display="inline">k</math> (i.e. the risk for the worst-off group is high). In order to make models with high representation disparity fairer, Hashimoto et al. try to minimize this worst-case risk <math display="inline">\mathcal{R}_{max}(\theta)</math> in general.<br />
<br />
=Disparity Amplification=<br />
<br />
Representation disparity can amplify as time passes and loss is repeatedly minimized. Over <math display="inline">t = 1, 2, ..., T</math> minimization rounds, the group proportions <math display="inline">\alpha_k^{(t)}</math> vary depending on past losses. At each round, the expected number of users <math display="inline">\lambda_k^{(t+1)}</math> from group <math display="inline">k</math> is determined by <br />
<br />
\begin{align}<br />
\lambda_k^{(t+1)} := \lambda_k^{(t)} \nu(\mathcal{R}_k(\theta^{(t)})) + b_k<br />
\end{align}<br />
<br />
where <math display="inline">\lambda_k^{(t)} \nu(\mathcal{R}_k(\theta^{(t)}))</math> describes the expected number of users retained from the previous optimization (with <math display="inline">\nu</math> the retention function, which gives the fraction of users that stay) and <math display="inline">b_k</math> is the number of new users of group <math display="inline">k</math>. Put simply, the expected number of users of a group depends on the number of new users of that group, and on the fraction of users that continue to use the system from the previous optimization step. If fewer and fewer users from minority groups return to the system (i.e. the model has a low retention rate for minority group users), Hashimoto et al. argue that the representation disparity amplifies. <br />
<br />
==Empirical Risk Minimization (ERM)==<br />
<br />
Without knowledge of the population proportions <math display="inline">\alpha_k^{(t)}</math>, the new user rate <math display="inline">b_k</math>, and the retention function <math display="inline">\nu</math>, however, it is hard in practice to control the worst-case risk over all time periods <math display="inline">\mathcal{R}_{max}^T</math> for all groups. That is why the standard approach is to fit a sequence of models <math display="inline">\theta^{(t)}</math> empirically, one per round. Using ERM, for instance, the model at round <math display="inline">t</math> is obtained by minimizing the total loss on the data observed in that round:<br />
<br />
\begin{align}<br />
\theta^{(t)} = \underset{\theta \in \Theta}{\arg\min} \sum_i \ell(\theta; Z_i^{(t)})<br />
\end{align}<br />
<br />
However, ERM fails to prevent disparity amplification. Because ERM minimizes the average loss, which is dominated by the majority group, minority groups experience higher loss and are less likely to return to the system. As a result, the population proportions <math display="inline">\alpha_k^{(t)}</math> shift, and certain minority groups contribute even less data to the system. This is mirrored in the expected user count <math display="inline">\lambda^{(t)}</math> at each optimization point. In their paper, Hashimoto et al. show that, under ERM, <math display="inline">\lambda^{(t)}</math> is unstable because it loses its fair fixed point (i.e. the population fraction where risk minimization maintains the same population fraction over time). Therefore, ERM fails to control minority risk over time, and is considered unfair.<br />
<br />
=Distributionally Robust Optimization (DRO)=<br />
<br />
To overcome the unfairness of ERM, Hashimoto et al. propose the use of distributionally robust optimization (DRO). At this point the goal is still to minimize the worst-case group risk over a single time-step <math display="inline">\mathcal{R}_{max} (\theta^{(t)}) </math>. As previously mentioned, this is difficult to do because neither the population proportions <math display="inline">\alpha_k </math> nor the group distributions <math display="inline">P_k </math> are known. Therefore, Hashimoto et al. use an optimization technique that is robust "against '''''all''''' directions around the data generating distribution". This refers to the notion that DRO is robust to any group distribution <math display="inline">P_k </math> whose loss other optimization techniques such as ERM might try to optimize. To create this distributional robustness, the optimization's risk function <math display="inline">\mathcal{R}_{dro} </math> has to "up-weight" data <math display="inline">Z</math> that cause high loss <math display="inline">\ell(\theta; Z)</math>. In other words, the risk function has to over-represent mixture components (i.e. group distributions <math display="inline">P_k </math>) in relation to their original mixture weights (i.e. the population proportions <math display="inline">\alpha_k </math>) for groups that suffer high loss. <br />
<br />
To do this we need to consider the worst-case loss (i.e. the highest risk) over all perturbations <math display="inline">Q</math> around <math display="inline">P</math> within a certain limit. This limit is described by the <math display="inline">\chi^2</math>-divergence (i.e. the distance, roughly speaking) between probability distributions. For two distributions <math display="inline">P</math> and <math display="inline">Q</math> the divergence is defined as <math display="inline">D_{\chi^2} (P || Q):= \int (\frac{dP}{dQ} - 1)^2 dQ</math>. With the help of the <math display="inline">\chi^2</math>-divergence, Hashimoto et al. define the chi-squared ball <math display="inline">\mathcal{B}(P,r)</math> around the probability distribution <math display="inline">P</math>. This ball is defined so that <math display="inline">\mathcal{B}(P,r) := \{Q \ll P : D_{\chi^2} (Q || P) \leq r \}</math>. It is exactly this ball that gives us the opportunity to consider the worst-case loss (i.e. the highest risk) over all perturbations <math display="inline">Q</math> that lie inside the ball (i.e. within reasonable range) around the probability distribution <math display="inline">P</math>. This loss is given by<br />
<br />
\begin{align}<br />
\mathcal{R}_{dro}(\theta, r) := \underset{Q \in \mathcal{B}(P,r)}{\sup} \mathbb{E}_Q [\ell(\theta;Z)]<br />
\end{align}<br />
<br />
For <math display="inline">P:= \sum_{k \in [K]} \alpha_k P_k</math> and <math display="inline">r_k := (1/\alpha_k -1)^2</math>, this quantity bounds the risk of each group for all models <math display="inline">\theta \in \Theta</math>: <math display="inline">\mathcal{R}_k(\theta) \leq \mathcal{R}_{dro} (\theta; r_k)</math>. Furthermore, if we specify a lower bound on the group proportions <math display="inline">\alpha_{min} \leq \min_{k \in [K]} \alpha_k</math>, and define <math display="inline">r_{max} := (1/\alpha_{min} -1)^2</math>, then <math display="inline">\mathcal{R}_{dro} (\theta; r_{max}) </math> is an upper bound on the worst-case risk <math display="inline">\mathcal{R}_{max} (\theta) </math>, and this upper bound can be minimized.<br />
<br />
==Optimization of DRO==<br />
<br />
To minimize <math display="inline">\mathcal{R}_{dro}(\theta, r) := \underset{Q \in \mathcal{B}(P,r)}{\sup} \mathbb{E}_Q [\ell(\theta;Z)]</math>, Hashimoto et al. look at the dual of this maximization problem (i.e. the maximization over <math display="inline">Q</math> is rewritten as a minimization over a dual variable). This dual is given by the minimization problem<br />
<br />
\begin{align}<br />
\mathcal{R}_{dro}(\theta, r) = \underset{\eta \in \mathbb{R}}{\inf} \left\{ F(\theta; \eta):= C\left(\mathbb{E}_P \left[ [\ell(\theta;Z) - \eta]_+^2 \right] \right)^\frac{1}{2} + \eta \right\}<br />
\end{align}<br />
<br />
with <math display="inline">C = (2(1/\alpha_{min} - 1)^2 + 1)^{1/2}</math>. Here <math display="inline">\eta</math> is the dual variable (i.e. the variable that appears in creating the dual). Since <math display="inline">F(\theta; \eta)</math> only involves an expectation <math display="inline">\mathbb{E}_P</math> over the data generating distribution <math display="inline">P</math>, <math display="inline">F(\theta; \eta)</math> can be directly minimized. For convex losses <math display="inline">\ell(\theta;Z)</math>, <math display="inline">F(\theta; \eta)</math> is convex, and can be minimized by performing a binary search over <math display="inline">\eta</math>. In their paper, Hashimoto et al. further show that optimizing <math display="inline">\mathcal{R}_{dro}(\theta, r_{max})</math> at each time step controls the ''future'' worst-case risk <math display="inline">\mathcal{R}_{max} (\theta) </math>, and therefore retention rates. That means if the initial group proportions satisfy <math display="inline">\alpha_k^{(0)} \geq \alpha_{min}</math>, and <math display="inline">\mathcal{R}_{dro}(\theta, r_{max})</math> is optimized for every time step (and therefore <math display="inline">\mathcal{R}_{max} (\theta) </math> is bounded), <math display="inline">\mathcal{R}_{max}^T (\theta) </math> over all time steps is controlled. In other words, optimizing <math display="inline">\mathcal{R}_{dro}(\theta, r_{max})</math> every time step is enough to avoid disparity amplification.</div>Mpaflahttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Fairness_Without_Demographics_in_Repeated_Loss_Minimization&diff=37029Fairness Without Demographics in Repeated Loss Minimization2018-10-22T19:09:02Z<p>Mpafla: /* Distributonally Robust Optimization (DRO) */</p>
<hr />
<div>This page contains the summary of the paper "[http://proceedings.mlr.press/v80/hashimoto18a.html Fairness Without Demographics in Repeated Loss Minimization]" by Hashimoto, T. B., Srivastava, M., Namkoong, H., & Liang, P. which was published at the International Conference on Machine Learning (ICML) in 2018. <br />
<br />
=Introduction=<br />
<br />
Usually, machine learning models are trained to minimize their average loss in order to achieve high overall accuracy. While this works well for the majority, minority groups that use the system suffer high error rates because they contribute less data to the model. This phenomenon is known as '''''representation disparity''''' and has been observed in many models that, for instance, recognize faces, or identify language. This disparity even increases as minority users suffer higher error rates, and therefore, are less likely to use the system in the future. As a consequence minority groups further shrink, and therefore, less data is available for the next optimization of the model. With less data the disparity becomes even worse - a phenomenon referred to as '''''disparity amplification'''''. <br />
<br />
In this paper, Hashimoto et al. first show that standard '''''empirical risk minimization (ERM)''''' does not control the loss of minority groups, and thus causes representation disparity and its amplification over time (even if the model is fair in the beginning). Second, the researchers try to mitigate this unfairness by proposing the use of '''''distributionally robust optimization (DRO)'''''. Indeed Hashimoto et al. are able to show that DRO can bound the loss for minority groups and is fair on models that ERM turns unfair. <br />
<br />
===Note on Fairness===<br />
<br />
Hashimoto et al. follow the ''difference principle'' to achieve and measure fairness. It is defined as the maximization of the welfare of the worst-off group rather than the whole group (cf. utilitarianism).<br />
<br />
=Representation Disparity=<br />
<br />
If a user makes a query <math display="inline">Z \sim P</math>, the model <math display="inline">\theta \in \Theta</math> makes a prediction, and the user experiences loss <math display="inline">\ell (\theta; Z)</math>. The ''expected'' loss of a model <math display="inline">\theta</math> is denoted as the risk <math display="inline">\mathcal{R}(\theta) = \mathbb{E}_{Z \sim P} [\ell (\theta; Z)] </math>. While the queries (i.e. users using the system and providing data to the model) are made by users from <math display="inline">K</math> latent groups such that <math display="inline">Z \sim P := \sum_{k \in [K]} \alpha_kP_k</math>, neither the actual population proportions <math display="inline">\alpha_k</math> nor the group distributions <math display="inline">P_k</math> are known. Therefore, Hashimoto et al.'s goal is to control the worst-case risk over ''all'' <math display="inline">K</math> groups (i.e. the risk of whichever group is worst off, without knowing in advance which group that is):<br />
<br />
\begin{align}<br />
\mathcal{R}_{max}(\theta) := \underset{k \in [K]}{\max} \mathcal{R}_k(\theta), \qquad \mathcal{R}_k(\theta) := \mathbb{E}_{P_k} [\ell (\theta; Z)]<br />
\end{align}<br />
<br />
There is high representation disparity if the expected loss of the model <math display="inline">\mathcal{R}(\theta)</math> is low, but the worst-case risk <math display="inline">\mathcal{R}_{max}(\theta)</math> of one group is high. This means that a model with high representation disparity performs well on average (i.e. has low overall loss), but fails to represent some groups <math display="inline">k</math> (i.e. the risk for the worst-off group is high). In order to make models with high representation disparity fairer, Hashimoto et al. try to minimize this worst-case risk <math display="inline">\mathcal{R}_{max}(\theta)</math> in general.<br />
<br />
=Disparity Amplification=<br />
<br />
Representation disparity can amplify as time passes and loss is repeatedly minimized. Over <math display="inline">t = 1, 2, ..., T</math> minimization rounds, the group proportions <math display="inline">\alpha_k^{(t)}</math> vary depending on past losses. At each round, the expected number of users <math display="inline">\lambda_k^{(t+1)}</math> from group <math display="inline">k</math> is determined by <br />
<br />
\begin{align}<br />
\lambda_k^{(t+1)} := \lambda_k^{(t)} \nu(\mathcal{R}_k(\theta^{(t)})) + b_k<br />
\end{align}<br />
<br />
where <math display="inline">\lambda_k^{(t)} \nu(\mathcal{R}_k(\theta^{(t)}))</math> describes the expected number of users retained from the previous optimization (with <math display="inline">\nu</math> the retention function, which gives the fraction of users that stay) and <math display="inline">b_k</math> is the number of new users of group <math display="inline">k</math>. Put simply, the expected number of users of a group depends on the number of new users of that group, and on the fraction of users that continue to use the system from the previous optimization step. If fewer and fewer users from minority groups return to the system (i.e. the model has a low retention rate for minority group users), Hashimoto et al. argue that the representation disparity amplifies. <br />
<br />
==Empirical Risk Minimization (ERM)==<br />
<br />
Without knowledge of the population proportions <math display="inline">\alpha_k^{(t)}</math>, the new user rate <math display="inline">b_k</math>, and the retention function <math display="inline">\nu</math>, however, it is hard in practice to control the worst-case risk over all time periods <math display="inline">\mathcal{R}_{max}^T</math> for all groups. That is why the standard approach is to fit a sequence of models <math display="inline">\theta^{(t)}</math> empirically, one per round. Using ERM, for instance, the model at round <math display="inline">t</math> is obtained by minimizing the total loss on the data observed in that round:<br />
<br />
\begin{align}<br />
\theta^{(t)} = \underset{\theta \in \Theta}{\arg\min} \sum_i \ell(\theta; Z_i^{(t)})<br />
\end{align}<br />
<br />
However, ERM fails to prevent disparity amplification. Because ERM minimizes the average loss, which is dominated by the majority group, minority groups experience higher loss and are less likely to return to the system. As a result, the population proportions <math display="inline">\alpha_k^{(t)}</math> shift, and certain minority groups contribute even less data to the system. This is mirrored in the expected user count <math display="inline">\lambda^{(t)}</math> at each optimization point. In their paper, Hashimoto et al. show that, under ERM, <math display="inline">\lambda^{(t)}</math> is unstable because it loses its fair fixed point (i.e. the population fraction where risk minimization maintains the same population fraction over time). Therefore, ERM fails to control minority risk over time, and is considered unfair.<br />
<br />
=Distributionally Robust Optimization (DRO)=<br />
<br />
To overcome the unfairness of ERM, Hashimoto et al. propose the use of distributionally robust optimization (DRO). At this point the goal is still to minimize the worst-case group risk over a single time-step <math display="inline">\mathcal{R}_{max} (\theta^{(t)}) </math>. As previously mentioned, this is difficult to do because neither the population proportions <math display="inline">\alpha_k </math> nor the group distributions <math display="inline">P_k </math> are known. Therefore, Hashimoto et al. use an optimization technique that is robust "against '''''all''''' directions around the data generating distribution". This refers to the notion that DRO is robust to any group distribution <math display="inline">P_k </math> whose loss other optimization techniques such as ERM might try to optimize. To create this distributional robustness, the optimization's risk function <math display="inline">\mathcal{R}_{dro} </math> has to "up-weight" data <math display="inline">Z</math> that cause high loss <math display="inline">\ell(\theta; Z)</math>. In other words, the risk function has to over-represent mixture components (i.e. group distributions <math display="inline">\{P_k\} </math>) in relation to their original mixture weights (i.e. the population proportions <math display="inline">\{\alpha_k\} </math>) for groups that suffer high loss. <br />
<br />
To do this we need to consider the worst-case loss (i.e. the highest risk) over all perturbations <math display="inline">Q</math> around <math display="inline">P</math> within a certain limit. This limit is described by the <math display="inline">\chi^2</math>-divergence (i.e. the distance, roughly speaking) between probability distributions. For two distributions <math display="inline">P</math> and <math display="inline">Q</math> the divergence is defined as <math display="inline">D_{\chi^2} (P || Q):= \int (\frac{dP}{dQ} - 1)^2 dQ</math>. With the help of the <math display="inline">\chi^2</math>-divergence, Hashimoto et al. define the chi-squared ball <math display="inline">\mathcal{B}(P,r)</math> around the probability distribution <math display="inline">P</math>. This ball is defined so that <math display="inline">\mathcal{B}(P,r) := \{Q \ll P : D_{\chi^2} (Q || P) \leq r \}</math>. It is exactly this ball that gives us the opportunity to consider the worst-case loss (i.e. the highest risk) over all perturbations <math display="inline">Q</math> that lie inside the ball (i.e. within reasonable range) around the probability distribution <math display="inline">P</math>. This loss is given by<br />
<br />
\begin{align}<br />
\mathcal{R}_{dro}(\theta, r) := \underset{Q \in \mathcal{B}(P,r)}{\sup} \mathbb{E}_Q [\ell(\theta;Z)]<br />
\end{align}<br />
<br />
For <math display="inline">P:= \sum_{k \in [K]} \alpha_k P_k</math> and <math display="inline">r_k := (1/\alpha_k -1)^2</math>, this quantity bounds the risk of each group for all models <math display="inline">\theta \in \Theta</math>: <math display="inline">\mathcal{R}_k(\theta) \leq \mathcal{R}_{dro} (\theta; r_k)</math>. Furthermore, if we specify a lower bound on the group proportions <math display="inline">\alpha_{min} \leq \min_{k \in [K]} \alpha_k</math>, and define <math display="inline">r_{max} := (1/\alpha_{min} -1)^2</math>, then <math display="inline">\mathcal{R}_{dro} (\theta; r_{max}) </math> is an upper bound on the worst-case risk <math display="inline">\mathcal{R}_{max} (\theta) </math>, and this upper bound can be minimized.<br />
<br />
==Optimization of DRO==<br />
<br />
To minimize <math display="inline">\mathcal{R}_{dro}(\theta; r) := \underset{Q \in \mathcal{B}(P,r)}{sup} \mathbb{E}_Q [\ell(\theta;Z)]</math>, Hashimoto et al. use the dual of this maximization problem, which expresses the supremum over distributions <math display="inline">Q</math> as a minimization over a single scalar. This dual is given by the minimization problem<br />
<br />
\begin{align}<br />
\mathcal{R}_{dro}(\theta; r) = \underset{\eta \in \mathbb{R}}{\inf} \left\{ F(\theta; \eta):= C\left(\mathbb{E}_P \left[ [\ell(\theta;Z) - \eta]_+^2 \right] \right)^\frac{1}{2} + \eta \right\}<br />
\end{align}<br />
<br />
with <math display="inline">C = (2(1/\alpha_{min} - 1)^2 + 1)^{1/2}</math>, where <math display="inline">\eta</math> is the dual variable (i.e. the variable introduced in forming the dual). Since <math display="inline">F(\theta; \eta)</math> involves only an expectation <math display="inline">\mathbb{E}_P</math> over the data generating distribution <math display="inline">P</math>, <math display="inline">F(\theta; \eta)</math> can be minimized directly. For convex losses <math display="inline">\ell(\theta;Z)</math>, <math display="inline">F(\theta; \eta)</math> is convex, and can be minimized by performing a binary search over <math display="inline">\eta</math>. In their paper, Hashimoto et al. further show that optimizing <math display="inline">\mathcal{R}_{dro}(\theta; r_{max})</math> at each time step controls the ''future'' worst-case risk <math display="inline">\mathcal{R}_{max} (\theta) </math>, and therefore retention rates. That is, if the initial group proportions satisfy <math display="inline">\alpha_k^{(0)} \geq \alpha_{min}</math>, and <math display="inline">\mathcal{R}_{dro}(\theta; r_{max})</math> is optimized at every time step (so that <math display="inline">\mathcal{R}_{max} (\theta) </math> is kept under control), then <math display="inline">\mathcal{R}_{max}^T (\theta) </math> over all time steps is controlled. In other words, optimizing <math display="inline">\mathcal{R}_{dro}(\theta; r_{max})</math> at every time step is enough to avoid disparity amplification.</div>Mpaflahttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Fairness_Without_Demographics_in_Repeated_Loss_Minimization&diff=37028Fairness Without Demographics in Repeated Loss Minimization2018-10-22T19:08:31Z<p>Mpafla: /* Distributonally Robust Optimization (DRO) */</p>
<hr />
<div>This page contains the summary of the paper "[http://proceedings.mlr.press/v80/hashimoto18a.html Fairness Without Demographics in Repeated Loss Minimization]" by Hashimoto, T. B., Srivastava, M., Namkoong, H., & Liang, P. which was published at the International Conference of Machine Learning (ICML) in 2018. <br />
<br />
=Introduction=<br />
<br />
Machine learning models are usually trained to minimize their average loss in order to achieve high overall accuracy. While this works well for the majority, minority groups that use the system can suffer high error rates because they contribute less data to the model. This phenomenon is known as '''''representation disparity''''' and has been observed in many models, for instance in face recognition and language identification. The disparity can grow over time: minority users who suffer higher error rates are less likely to use the system in the future, so minority groups shrink further and less of their data is available for the next optimization of the model. With less data the disparity becomes even worse, a phenomenon referred to as '''''disparity amplification'''''. <br />
<br />
In this paper, Hashimoto et al. first show that standard '''''empirical risk minimization (ERM)''''' does not control the loss of minority groups, and thus causes representation disparity and its amplification over time (even if the model is fair in the beginning). Second, the researchers mitigate this unfairness by proposing the use of '''''distributionally robust optimization (DRO)'''''. Hashimoto et al. show that DRO can bound the loss for minority groups and remains fair in settings where ERM becomes unfair. <br />
<br />
===Note on Fairness===<br />
<br />
Hashimoto et al. follow the ''difference principle'' to achieve and measure fairness. It is defined as the maximization of the welfare of the worst-off group rather than the whole group (cf. utilitarianism).<br />
<br />
=Representation Disparity=<br />
<br />
If a user makes a query <math display="inline">Z \sim P</math>, the model <math display="inline">\theta \in \Theta</math> makes a prediction, and the user experiences loss <math display="inline">\ell (\theta; Z)</math>. The ''expected'' loss of a model <math display="inline">\theta</math> is called the risk <math display="inline">\mathcal{R}(\theta) = \mathbb{E}_{Z \sim P} [\ell (\theta; Z)] </math>. The queries (i.e. users using the system and providing data to the model) are made by users from <math display="inline">K</math> latent groups such that <math display="inline">Z \sim P := \sum_{k \in [K]} \alpha_kP_k</math>, but neither the actual population proportions <math display="inline">\alpha_k</math> nor the group distributions <math display="inline">P_k</math> are known. Therefore, Hashimoto et al.'s goal is to control the worst-case risk over ''all'' <math display="inline">K</math> groups (and not just the average risk):<br />
<br />
\begin{align}<br />
\mathcal{R}_{max}(\theta) := \underset{k \in [K]}{\max} \mathcal{R}_k(\theta), \qquad \mathcal{R}_k(\theta) := \mathbb{E}_{P_k} [\ell (\theta; Z)]<br />
\end{align}<br />
<br />
Representation disparity is high if the expected loss of the model <math display="inline">\mathcal{R}(\theta)</math> is low, but the worst-case group risk <math display="inline">\mathcal{R}_{max}(\theta)</math> is high. Such a model performs well on average (i.e. has low overall loss), but fails to represent some groups <math display="inline">k</math> (i.e. the risk for the worst-off group is high). To make models with high representation disparity fairer, Hashimoto et al. minimize this worst-case risk <math display="inline">\mathcal{R}_{max}(\theta)</math>.<br />
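<br />
As a small worked illustration with made-up numbers (they are assumptions for this example, not values from the paper), consider two groups with proportions <math display="inline">\alpha_1 = 0.9</math> and <math display="inline">\alpha_2 = 0.1</math> and group risks <math display="inline">\mathcal{R}_1(\theta) = 0.05</math> and <math display="inline">\mathcal{R}_2(\theta) = 0.60</math>. Because <math display="inline">P</math> is a mixture of the group distributions, the overall risk is the proportion-weighted average of the group risks:<br />
<br />
\begin{align}<br />
\mathcal{R}(\theta) &= \sum_{k \in [K]} \alpha_k \mathcal{R}_k(\theta) = 0.9 \cdot 0.05 + 0.1 \cdot 0.60 = 0.105, \\<br />
\mathcal{R}_{max}(\theta) &= \max \{0.05, 0.60\} = 0.60<br />
\end{align}<br />
<br />
The average risk looks small even though one user in ten belongs to a group with a risk of 0.60, which is the situation described as high representation disparity.<br />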
<br />
=Disparity Amplification=<br />
<br />
Representation disparity can amplify over time as the loss is repeatedly minimized. Over <math display="inline">t = 1, 2, ..., T</math> minimization rounds, the group proportions <math display="inline">\alpha_k^{(t)}</math> vary depending on past losses. At each round, the expected number of users <math display="inline">\lambda_k^{(t+1)}</math> from group <math display="inline">k</math> is determined by <br />
<br />
\begin{align}<br />
\lambda_k^{(t+1)} := \lambda_k^{(t)} \nu(\mathcal{R}_k(\theta^{(t)})) + b_k<br />
\end{align}<br />
<br />
where <math display="inline">\lambda_k^{(t)} \nu(\mathcal{R}_k(\theta^{(t)}))</math> is the expected number of users retained from the previous optimization (<math display="inline">\nu</math> is the retention function, i.e. the fraction of users who stay given the risk they experienced) and <math display="inline">b_k</math> is the number of new users of group <math display="inline">k</math>. Put simply, the expected number of users of a group depends on the number of new users of that group and on the number of users retained from the previous optimization step. If fewer and fewer users from minority groups return (i.e. the model has a low retention rate for minority group users), Hashimoto et al. argue that the representation disparity amplifies. <br />
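<br />
The recurrence can be made concrete with a short sketch. The retention function <code>retention</code> below (one minus the risk, clipped to [0, 1]), the starting counts, and the risk values are illustrative assumptions and are not taken from the paper; the group risk is also held fixed across rounds, which the real dynamics do not do.<br />

```python
def retention(risk):
    """Hypothetical retention function nu: fraction of users who stay.

    Assumed to be 1 - risk, clipped to [0, 1]; the paper only needs nu
    to be some function of the group risk.
    """
    return max(0.0, min(1.0, 1.0 - risk))


def next_expected_users(lam, risk, new_users):
    """One step of lambda^(t+1) = lambda^(t) * nu(risk) + b."""
    return lam * retention(risk) + new_users


def simulate(lam0, risk, new_users, rounds):
    """Iterate the recurrence with the group risk held fixed."""
    lam = lam0
    for _ in range(rounds):
        lam = next_expected_users(lam, risk, new_users)
    return lam


# Both groups start with 100 users and gain 10 new users per round;
# only the risk they experience differs (illustrative values).
majority_users = simulate(100.0, risk=0.1, new_users=10.0, rounds=50)
minority_users = simulate(100.0, risk=0.6, new_users=10.0, rounds=50)
print(majority_users, minority_users)
```

With a fixed risk the recurrence settles at <code>new_users / (1 - retention(risk))</code>, so under these assumptions the group that experiences the higher risk ends up with far fewer expected users even though both groups receive the same number of new users.<br />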
<br />
==Empirical Risk Minimization (ERM)==<br />
<br />
Without knowledge of the population proportions <math display="inline">\alpha_k^{(t)}</math>, the new user rate <math display="inline">b_k</math>, and the retention function <math display="inline">\nu</math>, it is hard in practice to control the worst-case risk over all time periods <math display="inline">\mathcal{R}_{max}^T</math> for all groups. The standard approach is therefore to fit a sequence of models <math display="inline">\theta^{(t)}</math> to the data observed in each round. With ERM, for instance, each model is obtained by minimizing the total loss on that round's data:<br />
<br />
\begin{align}<br />
\theta^{(t)} = \underset{\theta \in \Theta}{\arg\min} \sum_i \ell(\theta; Z_i^{(t)})<br />
\end{align}<br />
<br />
However, ERM fails to prevent disparity amplification. Because ERM minimizes the average loss, which is dominated by the majority group, minority groups experience higher loss and are less likely to return to the system. As a result, the population proportions <math display="inline">\alpha_k^{(t)}</math> shift, and certain minority groups contribute even less to the system. This is mirrored in the expected user count <math display="inline">\lambda^{(t)}</math> at each optimization point. In their paper Hashimoto et al. show that, if using ERM, <math display="inline">\lambda^{(t)}</math> is unstable because it loses its fair fixed point (i.e. the population fraction where risk minimization maintains the same population fraction over time). Therefore, ERM fails to control minority risk over time, and is considered unfair.<br />
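<br />
The feedback loop can be illustrated with a deliberately simple toy model. Everything specific in it is an assumption made for this sketch rather than the paper's setup: two groups whose queries are the constants 0 and 1, squared loss, ERM computed on expected counts (so the ERM solution is the proportion-weighted mean), and a retention function equal to one minus the risk.<br />

```python
def simulate_erm(users, new_users, rounds):
    """Toy repeated ERM with two groups and squared loss (theta - z)^2.

    Assumptions (illustrative, not from the paper): group 0 always
    queries z = 0, group 1 always queries z = 1, and the retention
    function is nu(risk) = 1 - risk.
    Returns the minority (group 1) fraction and risk seen in each round.
    """
    lam = list(users)
    minority_fractions = []
    minority_risks = []
    for _ in range(rounds):
        alpha1 = lam[1] / (lam[0] + lam[1])
        # Under squared loss, ERM returns the proportion-weighted mean
        # of the queries, which here equals the group-1 proportion.
        theta = alpha1
        group_risks = [(theta - 0.0) ** 2, (theta - 1.0) ** 2]
        minority_fractions.append(alpha1)
        minority_risks.append(group_risks[1])
        # User dynamics: retained users plus new users for each group.
        lam = [lam[k] * (1.0 - group_risks[k]) + new_users for k in range(2)]
    return minority_fractions, minority_risks


minority_fractions, minority_risks = simulate_erm(
    users=(44.0, 36.0), new_users=10.0, rounds=30
)
print(round(minority_fractions[0], 3), round(minority_fractions[-1], 3))
```

In this toy model, equal group sizes of 40 users each are a fixed point with equal risks. Starting slightly off balance, the model moves toward the larger group, the smaller group's risk rises, and its share of the population shrinks from round to round.<br />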
<br />
=Distributionally Robust Optimization (DRO)=<br />
<br />
To overcome the unfairness of ERM, Hashimoto et al. propose distributionally robust optimization (DRO). The goal is still to minimize the worst-case group risk over a single time step <math display="inline">\mathcal{R}_{max} (\theta^{(t)}) </math>. As previously mentioned, this is difficult because neither the population proportions <math display="inline">\{\alpha_k\} </math> nor the group distributions <math display="inline">\{P_k\} </math> are known. Therefore, Hashimoto et al. use an optimization technique that is robust "against '''''all''''' directions around the data generating distribution". This means that DRO is robust to any group distribution <math display="inline">P_k </math>, including those whose loss other optimization techniques such as ERM would neglect. To achieve this distributional robustness, the risk function <math display="inline">\mathcal{R}_{dro} </math> has to up-weight data <math display="inline">Z</math> that cause high loss <math display="inline">\ell(\theta; Z)</math>. In other words, for groups that suffer high loss, the risk function has to over-represent mixture components (i.e. group distributions <math display="inline">\{P_k\} </math>) relative to their original mixture weights (i.e. the population proportions <math display="inline">\{\alpha_k\} </math>). <br />
<br />
To do this, we consider the worst-case loss (i.e. the highest risk) over all perturbations <math display="inline">Q</math> of <math display="inline">P</math> within a certain limit. This limit is described by the <math display="inline">\chi^2</math>-divergence (roughly speaking, a distance) between probability distributions. For two distributions <math display="inline">P</math> and <math display="inline">Q</math> the divergence is defined as <math display="inline">D_{\chi^2} (P || Q):= \int (\frac{dP}{dQ} - 1)^2 dQ</math>. With the help of the <math display="inline">\chi^2</math>-divergence, Hashimoto et al. define the chi-squared ball <math display="inline">\mathcal{B}(P,r)</math> around the probability distribution <math display="inline">P</math> as <math display="inline">\mathcal{B}(P,r) := \{Q \ll P : D_{\chi^2} (Q || P) \leq r \}</math>. This ball lets us consider the worst-case loss over all perturbations <math display="inline">Q</math> that lie inside the ball (i.e. within a bounded range) around <math display="inline">P</math>. This loss is given by<br />
<br />
\begin{align}<br />
\mathcal{R}_{dro}(\theta; r) := \underset{Q \in \mathcal{B}(P,r)}{\sup} \mathbb{E}_Q [\ell(\theta;Z)]<br />
\end{align}<br />
<br />
For <math display="inline">P:= \sum_{k \in [K]} \alpha_k P_k</math> and all models <math display="inline">\theta \in \Theta</math>, this quantity bounds the risk of each group: <math display="inline">\mathcal{R}_k(\theta) \leq \mathcal{R}_{dro} (\theta; r_k)</math>, where <math display="inline">r_k := (1/\alpha_k -1)^2</math>. Furthermore, if we specify a lower bound on the group proportions <math display="inline">\alpha_{min} \leq min_{k \in [K]} \alpha_k</math>, and define <math display="inline">r_{max} := (1/\alpha_{min} -1)^2</math>, then <math display="inline">\mathcal{R}_{dro} (\theta; r_{max}) </math> is an upper bound on the worst-case risk <math display="inline">\mathcal{R}_{max} (\theta) </math> that can be minimized.<br />
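<br />
The bound can be checked numerically on a toy sample. The sketch below evaluates the DRO risk through the dual form stated in the "Optimization of DRO" section of this summary, using a simple ternary search over the dual variable. The function name <code>dro_dual_risk</code>, the synthetic losses, and the choice of search interval are assumptions made for illustration; this is not the authors' implementation.<br />

```python
import math
import random


def dro_dual_risk(losses, alpha_min, steps=200):
    """Approximate R_dro(theta; r_max) from per-example losses.

    Uses the dual F(eta) = C * sqrt(mean(max(loss - eta, 0)^2)) + eta
    with C = sqrt(2 * r_max + 1) and r_max = (1 / alpha_min - 1)^2,
    minimised over eta by ternary search (F is convex in eta).
    """
    r_max = (1.0 / alpha_min - 1.0) ** 2
    c = math.sqrt(2.0 * r_max + 1.0)
    n = len(losses)

    def f(eta):
        second_moment = sum(max(loss - eta, 0.0) ** 2 for loss in losses) / n
        return c * math.sqrt(second_moment) + eta

    # Generous search interval (an assumption of this sketch).
    lo = min(losses) - 10.0 * (max(losses) - min(losses) + 1.0)
    hi = max(losses)
    for _ in range(steps):
        m1 = lo + (hi - lo) / 3.0
        m2 = hi - (hi - lo) / 3.0
        if f(m1) <= f(m2):
            hi = m2
        else:
            lo = m1
    return f(0.5 * (lo + hi))


# Synthetic per-example losses: 80% majority with low loss, 20% minority
# with high loss. Group membership is used only to check the bound.
random.seed(0)
majority_losses = [abs(random.gauss(0.2, 0.1)) for _ in range(800)]
minority_losses = [abs(random.gauss(1.0, 0.3)) for _ in range(200)]
all_losses = majority_losses + minority_losses

average_risk = sum(all_losses) / len(all_losses)
majority_risk = sum(majority_losses) / len(majority_losses)
minority_risk = sum(minority_losses) / len(minority_losses)
dro_risk = dro_dual_risk(all_losses, alpha_min=0.2)
print(average_risk, minority_risk, dro_risk)
```

Note that <code>dro_dual_risk</code> never sees the group labels: it only needs the pooled losses and the assumed lower bound <code>alpha_min</code>, yet its value lies above the risk of either group, whereas the average risk lies well below the minority risk.<br />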
<br />
==Optimization of DRO==<br />
<br />
To minimize <math display="inline">\mathcal{R}_{dro}(\theta; r) := \underset{Q \in \mathcal{B}(P,r)}{sup} \mathbb{E}_Q [\ell(\theta;Z)]</math>, Hashimoto et al. use the dual of this maximization problem, which expresses the supremum over distributions <math display="inline">Q</math> as a minimization over a single scalar. This dual is given by the minimization problem<br />
<br />
\begin{align}<br />
\mathcal{R}_{dro}(\theta; r) = \underset{\eta \in \mathbb{R}}{\inf} \left\{ F(\theta; \eta):= C\left(\mathbb{E}_P \left[ [\ell(\theta;Z) - \eta]_+^2 \right] \right)^\frac{1}{2} + \eta \right\}<br />
\end{align}<br />
<br />
with <math display="inline">C = (2(1/\alpha_{min} - 1)^2 + 1)^{1/2}</math>, where <math display="inline">\eta</math> is the dual variable (i.e. the variable introduced in forming the dual). Since <math display="inline">F(\theta; \eta)</math> involves only an expectation <math display="inline">\mathbb{E}_P</math> over the data generating distribution <math display="inline">P</math>, <math display="inline">F(\theta; \eta)</math> can be minimized directly. For convex losses <math display="inline">\ell(\theta;Z)</math>, <math display="inline">F(\theta; \eta)</math> is convex, and can be minimized by performing a binary search over <math display="inline">\eta</math>. In their paper, Hashimoto et al. further show that optimizing <math display="inline">\mathcal{R}_{dro}(\theta; r_{max})</math> at each time step controls the ''future'' worst-case risk <math display="inline">\mathcal{R}_{max} (\theta) </math>, and therefore retention rates. That is, if the initial group proportions satisfy <math display="inline">\alpha_k^{(0)} \geq \alpha_{min}</math>, and <math display="inline">\mathcal{R}_{dro}(\theta; r_{max})</math> is optimized at every time step (so that <math display="inline">\mathcal{R}_{max} (\theta) </math> is kept under control), then <math display="inline">\mathcal{R}_{max}^T (\theta) </math> over all time steps is controlled. In other words, optimizing <math display="inline">\mathcal{R}_{dro}(\theta; r_{max})</math> at every time step is enough to avoid disparity amplification.</div>Mpaflahttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Fairness_Without_Demographics_in_Repeated_Loss_Minimization&diff=37027Fairness Without Demographics in Repeated Loss Minimization2018-10-22T19:06:16Z<p>Mpafla: /* Empirical Risk Minimization (ERM) */</p>
<hr />
<div>This page contains the summary of the paper "[http://proceedings.mlr.press/v80/hashimoto18a.html Fairness Without Demographics in Repeated Loss Minimization]" by Hashimoto, T. B., Srivastava, M., Namkoong, H., & Liang, P. which was published at the International Conference of Machine Learning (ICML) in 2018. <br />
<br />
=Introduction=<br />
<br />
Machine learning models are usually trained to minimize their average loss in order to achieve high overall accuracy. While this works well for the majority, minority groups that use the system can suffer high error rates because they contribute less data to the model. This phenomenon is known as '''''representation disparity''''' and has been observed in many models, for instance in face recognition and language identification. The disparity can grow over time: minority users who suffer higher error rates are less likely to use the system in the future, so minority groups shrink further and less of their data is available for the next optimization of the model. With less data the disparity becomes even worse, a phenomenon referred to as '''''disparity amplification'''''. <br />
<br />
In this paper, Hashimoto et al. first show that standard '''''empirical risk minimization (ERM)''''' does not control the loss of minority groups, and thus causes representation disparity and its amplification over time (even if the model is fair in the beginning). Second, the researchers mitigate this unfairness by proposing the use of '''''distributionally robust optimization (DRO)'''''. Hashimoto et al. show that DRO can bound the loss for minority groups and remains fair in settings where ERM becomes unfair. <br />
<br />
===Note on Fairness===<br />
<br />
Hashimoto et al. follow the ''difference principle'' to achieve and measure fairness. It is defined as the maximization of the welfare of the worst-off group rather than the whole group (cf. utilitarianism).<br />
<br />
=Representation Disparity=<br />
<br />
If a user makes a query <math display="inline">Z \sim P</math>, the model <math display="inline">\theta \in \Theta</math> makes a prediction, and the user experiences loss <math display="inline">\ell (\theta; Z)</math>. The ''expected'' loss of a model <math display="inline">\theta</math> is called the risk <math display="inline">\mathcal{R}(\theta) = \mathbb{E}_{Z \sim P} [\ell (\theta; Z)] </math>. The queries (i.e. users using the system and providing data to the model) are made by users from <math display="inline">K</math> latent groups such that <math display="inline">Z \sim P := \sum_{k \in [K]} \alpha_kP_k</math>, but neither the actual population proportions <math display="inline">\alpha_k</math> nor the group distributions <math display="inline">P_k</math> are known. Therefore, Hashimoto et al.'s goal is to control the worst-case risk over ''all'' <math display="inline">K</math> groups (and not just the average risk):<br />
<br />
\begin{align}<br />
\mathcal{R}_{max}(\theta) := \underset{k \in [K]}{\max} \mathcal{R}_k(\theta), \qquad \mathcal{R}_k(\theta) := \mathbb{E}_{P_k} [\ell (\theta; Z)]<br />
\end{align}<br />
<br />
Representation disparity is high if the expected loss of the model <math display="inline">\mathcal{R}(\theta)</math> is low, but the worst-case group risk <math display="inline">\mathcal{R}_{max}(\theta)</math> is high. Such a model performs well on average (i.e. has low overall loss), but fails to represent some groups <math display="inline">k</math> (i.e. the risk for the worst-off group is high). To make models with high representation disparity fairer, Hashimoto et al. minimize this worst-case risk <math display="inline">\mathcal{R}_{max}(\theta)</math>.<br />
<br />
=Disparity Amplification=<br />
<br />
Representation disparity can amplify over time as the loss is repeatedly minimized. Over <math display="inline">t = 1, 2, ..., T</math> minimization rounds, the group proportions <math display="inline">\alpha_k^{(t)}</math> vary depending on past losses. At each round, the expected number of users <math display="inline">\lambda_k^{(t+1)}</math> from group <math display="inline">k</math> is determined by <br />
<br />
\begin{align}<br />
\lambda_k^{(t+1)} := \lambda_k^{(t)} \nu(\mathcal{R}_k(\theta^{(t)})) + b_k<br />
\end{align}<br />
<br />
where <math display="inline">\lambda_k^{(t)} \nu(\mathcal{R}_k(\theta^{(t)}))</math> is the expected number of users retained from the previous optimization (<math display="inline">\nu</math> is the retention function, i.e. the fraction of users who stay given the risk they experienced) and <math display="inline">b_k</math> is the number of new users of group <math display="inline">k</math>. Put simply, the expected number of users of a group depends on the number of new users of that group and on the number of users retained from the previous optimization step. If fewer and fewer users from minority groups return (i.e. the model has a low retention rate for minority group users), Hashimoto et al. argue that the representation disparity amplifies. <br />
<br />
==Empirical Risk Minimization (ERM)==<br />
<br />
Without knowledge of the population proportions <math display="inline">\alpha_k^{(t)}</math>, the new user rate <math display="inline">b_k</math>, and the retention function <math display="inline">\nu</math>, it is hard in practice to control the worst-case risk over all time periods <math display="inline">\mathcal{R}_{max}^T</math> for all groups. The standard approach is therefore to fit a sequence of models <math display="inline">\theta^{(t)}</math> to the data observed in each round. With ERM, for instance, each model is obtained by minimizing the total loss on that round's data:<br />
<br />
\begin{align}<br />
\theta^{(t)} = \underset{\theta \in \Theta}{\arg\min} \sum_i \ell(\theta; Z_i^{(t)})<br />
\end{align}<br />
<br />
However, ERM fails to prevent disparity amplification. Because ERM minimizes the average loss, which is dominated by the majority group, minority groups experience higher loss and are less likely to return to the system. As a result, the population proportions <math display="inline">\alpha_k^{(t)}</math> shift, and certain minority groups contribute even less to the system. This is mirrored in the expected user count <math display="inline">\lambda^{(t)}</math> at each optimization point. In their paper Hashimoto et al. show that, if using ERM, <math display="inline">\lambda^{(t)}</math> is unstable because it loses its fair fixed point (i.e. the population fraction where risk minimization maintains the same population fraction over time). Therefore, ERM fails to control minority risk over time, and is considered unfair.<br />
<br />
=Distributionally Robust Optimization (DRO)=<br />
<br />
At this point the goal is to minimize the worst-case group risk over a single time step <math display="inline">\mathcal{R}_{max} (\theta^{(t)}) </math>. As previously mentioned, this is difficult because neither the population proportions <math display="inline">\{\alpha_k\} </math> nor the group distributions <math display="inline">\{P_k\} </math> are known. Therefore, Hashimoto et al. use an optimization technique that is robust "against '''''all''''' directions around the data generating distribution". This means that DRO is robust to any group distribution <math display="inline">P_k </math>, including those whose loss other optimization techniques such as ERM would neglect. To achieve this distributional robustness, the risk function <math display="inline">\mathcal{R}_{dro} </math> has to up-weight data <math display="inline">Z</math> that cause high loss <math display="inline">\ell(\theta; Z)</math>. In other words, for groups that suffer high loss, the risk function has to over-represent mixture components (i.e. group distributions <math display="inline">\{P_k\} </math>) relative to their original mixture weights (i.e. the population proportions <math display="inline">\{\alpha_k\} </math>). <br />
<br />
To do this, we consider the worst-case loss (i.e. the highest risk) over all perturbations <math display="inline">Q</math> of <math display="inline">P</math> within a certain limit. This limit is described by the <math display="inline">\chi^2</math>-divergence (roughly speaking, a distance) between probability distributions. For two distributions <math display="inline">P</math> and <math display="inline">Q</math> the divergence is defined as <math display="inline">D_{\chi^2} (P || Q):= \int (\frac{dP}{dQ} - 1)^2 dQ</math>. With the help of the <math display="inline">\chi^2</math>-divergence, Hashimoto et al. define the chi-squared ball <math display="inline">\mathcal{B}(P,r)</math> around the probability distribution <math display="inline">P</math> as <math display="inline">\mathcal{B}(P,r) := \{Q \ll P : D_{\chi^2} (Q || P) \leq r \}</math>. This ball lets us consider the worst-case loss over all perturbations <math display="inline">Q</math> that lie inside the ball (i.e. within a bounded range) around <math display="inline">P</math>. This loss is given by<br />
<br />
\begin{align}<br />
\mathcal{R}_{dro}(\theta; r) := \underset{Q \in \mathcal{B}(P,r)}{\sup} \mathbb{E}_Q [\ell(\theta;Z)]<br />
\end{align}<br />
<br />
For <math display="inline">P:= \sum_{k \in [K]} \alpha_k P_k</math> and all models <math display="inline">\theta \in \Theta</math>, this quantity bounds the risk of each group: <math display="inline">\mathcal{R}_k(\theta) \leq \mathcal{R}_{dro} (\theta; r_k)</math>, where <math display="inline">r_k := (1/\alpha_k -1)^2</math>. Furthermore, if we specify a lower bound on the group proportions <math display="inline">\alpha_{min} \leq min_{k \in [K]} \alpha_k</math>, and define <math display="inline">r_{max} := (1/\alpha_{min} -1)^2</math>, then <math display="inline">\mathcal{R}_{dro} (\theta; r_{max}) </math> is an upper bound on the worst-case risk <math display="inline">\mathcal{R}_{max} (\theta) </math> that can be minimized.<br />
<br />
==Optimization of DRO==<br />
<br />
To minimize <math display="inline">\mathcal{R}_{dro}(\theta; r) := \underset{Q \in \mathcal{B}(P,r)}{sup} \mathbb{E}_Q [\ell(\theta;Z)]</math>, Hashimoto et al. use the dual of this maximization problem, which expresses the supremum over distributions <math display="inline">Q</math> as a minimization over a single scalar. This dual is given by the minimization problem<br />
<br />
\begin{align}<br />
\mathcal{R}_{dro}(\theta; r) = \underset{\eta \in \mathbb{R}}{\inf} \left\{ F(\theta; \eta):= C\left(\mathbb{E}_P \left[ [\ell(\theta;Z) - \eta]_+^2 \right] \right)^\frac{1}{2} + \eta \right\}<br />
\end{align}<br />
<br />
with <math display="inline">C = (2(1/\alpha_{min} - 1)^2 + 1)^{1/2}</math>, where <math display="inline">\eta</math> is the dual variable (i.e. the variable introduced in forming the dual). Since <math display="inline">F(\theta; \eta)</math> involves only an expectation <math display="inline">\mathbb{E}_P</math> over the data generating distribution <math display="inline">P</math>, <math display="inline">F(\theta; \eta)</math> can be minimized directly. For convex losses <math display="inline">\ell(\theta;Z)</math>, <math display="inline">F(\theta; \eta)</math> is convex, and can be minimized by performing a binary search over <math display="inline">\eta</math>. In their paper, Hashimoto et al. further show that optimizing <math display="inline">\mathcal{R}_{dro}(\theta; r_{max})</math> at each time step controls the ''future'' worst-case risk <math display="inline">\mathcal{R}_{max} (\theta) </math>, and therefore retention rates. That is, if the initial group proportions satisfy <math display="inline">\alpha_k^{(0)} \geq \alpha_{min}</math>, and <math display="inline">\mathcal{R}_{dro}(\theta; r_{max})</math> is optimized at every time step (so that <math display="inline">\mathcal{R}_{max} (\theta) </math> is kept under control), then <math display="inline">\mathcal{R}_{max}^T (\theta) </math> over all time steps is controlled. In other words, optimizing <math display="inline">\mathcal{R}_{dro}(\theta; r_{max})</math> at every time step is enough to avoid disparity amplification.</div>Mpaflahttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Fairness_Without_Demographics_in_Repeated_Loss_Minimization&diff=37026Fairness Without Demographics in Repeated Loss Minimization2018-10-22T19:05:18Z<p>Mpafla: /* Empirical Risk Minimization (ERM) */</p>
<hr />
<div>This page contains the summary of the paper "[http://proceedings.mlr.press/v80/hashimoto18a.html Fairness Without Demographics in Repeated Loss Minimization]" by Hashimoto, T. B., Srivastava, M., Namkoong, H., & Liang, P. which was published at the International Conference of Machine Learning (ICML) in 2018. <br />
<br />
=Introduction=<br />
<br />
Machine learning models are usually trained to minimize their average loss in order to achieve high overall accuracy. While this works well for the majority, minority groups that use the system can suffer high error rates because they contribute less data to the model. This phenomenon is known as '''''representation disparity''''' and has been observed in many models, for instance in face recognition and language identification. The disparity can grow over time: minority users who suffer higher error rates are less likely to use the system in the future, so minority groups shrink further and less of their data is available for the next optimization of the model. With less data the disparity becomes even worse, a phenomenon referred to as '''''disparity amplification'''''. <br />
<br />
In this paper, Hashimoto et al. first show that standard '''''empirical risk minimization (ERM)''''' does not control the loss of minority groups, and thus causes representation disparity and its amplification over time (even if the model is fair in the beginning). Second, the researchers mitigate this unfairness by proposing the use of '''''distributionally robust optimization (DRO)'''''. Hashimoto et al. show that DRO can bound the loss for minority groups and remains fair in settings where ERM becomes unfair. <br />
<br />
===Note on Fairness===<br />
<br />
Hashimoto et al. follow the ''difference principle'' to achieve and measure fairness. It is defined as the maximization of the welfare of the worst-off group rather than the whole group (cf. utilitarianism).<br />
<br />
=Representation Disparity=<br />
<br />
If a user makes a query <math display="inline">Z \sim P</math>, the model <math display="inline">\theta \in \Theta</math> makes a prediction, and the user experiences loss <math display="inline">\ell (\theta; Z)</math>. The ''expected'' loss of a model <math display="inline">\theta</math> is called the risk <math display="inline">\mathcal{R}(\theta) = \mathbb{E}_{Z \sim P} [\ell (\theta; Z)] </math>. The queries (i.e. users using the system and providing data to the model) are made by users from <math display="inline">K</math> latent groups such that <math display="inline">Z \sim P := \sum_{k \in [K]} \alpha_kP_k</math>, but neither the actual population proportions <math display="inline">\alpha_k</math> nor the group distributions <math display="inline">P_k</math> are known. Therefore, Hashimoto et al.'s goal is to control the worst-case risk over ''all'' <math display="inline">K</math> groups (and not just the average risk):<br />
<br />
\begin{align}<br />
\mathcal{R}_{max}(\theta) := \underset{k \in [K]}{\max} \mathcal{R}_k(\theta), \qquad \mathcal{R}_k(\theta) := \mathbb{E}_{P_k} [\ell (\theta; Z)]<br />
\end{align}<br />
<br />
Representation disparity is high if the expected loss of the model <math display="inline">\mathcal{R}(\theta)</math> is low, but the worst-case group risk <math display="inline">\mathcal{R}_{max}(\theta)</math> is high. Such a model performs well on average (i.e. has low overall loss), but fails to represent some groups <math display="inline">k</math> (i.e. the risk for the worst-off group is high). To make models with high representation disparity fairer, Hashimoto et al. minimize this worst-case risk <math display="inline">\mathcal{R}_{max}(\theta)</math>.<br />
<br />
=Disparity Amplification=<br />
<br />
Representation disparity can amplify over time as the loss is repeatedly minimized. Over <math display="inline">t = 1, 2, ..., T</math> minimization rounds, the group proportions <math display="inline">\alpha_k^{(t)}</math> vary depending on past losses. At each round, the expected number of users <math display="inline">\lambda_k^{(t+1)}</math> from group <math display="inline">k</math> is determined by <br />
<br />
\begin{align}<br />
\lambda_k^{(t+1)} := \lambda_k^{(t)} \nu(\mathcal{R}_k(\theta^{(t)})) + b_k<br />
\end{align}<br />
<br />
where <math display="inline">\nu(\mathcal{R}_k(\theta^{(t)}))</math> is the fraction of users retained from the previous round, so that <math display="inline">\lambda_k^{(t)} \nu(\mathcal{R}_k(\theta^{(t)}))</math> is the expected number of retained users, and <math display="inline">b_k</math> is the number of new users of group <math display="inline">k</math>. Put simply, the expected number of users of a group depends on the number of new users of that group and on the fraction of users that continue to use the system after the previous optimization step. If fewer and fewer users from minority groups return (i.e. the model has a low retention rate for minority group users), Hashimoto et al. argue that the representation disparity amplifies. <br />
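The update rule can be sketched in a few lines. This is a simplified illustration, not the paper's experiment: the retention function <code>retention</code> (a linear, decreasing <math display="inline">\nu</math>) is an assumption, and the per-group risks are held fixed across rounds, whereas in the paper they depend on the model fitted at each round.<br />

```python
def retention(risk):
    """Hypothetical retention function nu: fewer users stay when the risk is higher."""
    return max(0.0, 1.0 - risk)


def step(user_counts, group_risks, new_users):
    """One round of lambda_k^(t+1) = lambda_k^(t) * nu(R_k) + b_k for every group k."""
    return [
        lam * retention(risk) + b
        for lam, risk, b in zip(user_counts, group_risks, new_users)
    ]


def proportions(user_counts):
    """Group proportions alpha_k implied by the expected user counts."""
    total = sum(user_counts)
    return [lam / total for lam in user_counts]


# Illustrative values: both groups start equal and gain the same number of new users,
# but the second group experiences a higher (fixed) risk.
user_counts = [30.0, 30.0]
group_risks = [0.2, 0.5]
new_users = [10.0, 10.0]

for _ in range(50):
    user_counts = step(user_counts, group_risks, new_users)

final_proportions = proportions(user_counts)
print(final_proportions)
```

Even with identical inflow of new users, the group with the higher risk ends up as a smaller share of the user base, because fewer of its users are retained each round.<br />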
<br />
==Empirical Risk Minimization (ERM)==<br />
<br />
Without knowledge of the population proportions <math display="inline">\alpha_k^{(t)}</math>, the new user rate <math display="inline">b_k</math>, and the retention function <math display="inline">\nu</math>, it is hard in practice to control the worst-case risk over all time periods <math display="inline">\mathcal{R}_{max}^T</math> for all groups. The standard approach is therefore to fit a sequence of models <math display="inline">\theta^{(t)}</math> to the data observed at each round. With ERM, for instance, each model is chosen by minimizing the empirical loss:<br />
<br />
\begin{align}<br />
\theta^{(t)} = \underset{\theta \in \Theta}{\arg\min} \sum_i \ell(\theta; Z_i^{(t)})<br />
\end{align}<br />
<br />
However, ERM fails to prevent disparity amplification. Because ERM minimizes the average loss, minority groups experience higher loss and are less likely to return to the system. As a result, the population proportions <math display="inline">\alpha_k^{(t)}</math> shift, and minority groups contribute even less data to the system. This is mirrored in the expected user count <math display="inline">\lambda^{(t)}</math> at each optimization step. In their paper, Hashimoto et al. show that, under ERM, <math display="inline">\lambda^{(t)}</math> is unstable because it loses its fair fixed point (i.e. the population fraction where risk minimization maintains the same population fraction over time). Therefore, ERM fails to control minority risk over time and is considered unfair.<br />
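A toy sketch of why minimizing the average loss favors the majority: the task (estimating a single number under squared loss) and the data are assumptions made for illustration and do not come from the paper.<br />

```python
def erm_mean(samples):
    """ERM under squared loss (theta - z)^2: the minimizer is the sample mean."""
    return sum(samples) / len(samples)


def group_risk(theta, samples):
    """Average squared loss of the model theta on one group's samples."""
    return sum((theta - z) ** 2 for z in samples) / len(samples)


# Illustrative data: nine majority queries centered at 0, one minority query at 1.
majority = [0.0] * 9
minority = [1.0]

theta = erm_mean(majority + minority)
majority_risk = group_risk(theta, majority)
minority_risk = group_risk(theta, minority)

print(f"theta = {theta:.2f}")
print(f"majority risk = {majority_risk:.2f}, minority risk = {minority_risk:.2f}")
```

The fitted parameter sits close to the majority data, so the minority group carries almost all of the loss, which in the dynamics above lowers its retention.<br />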
<br />
=Distributionally Robust Optimization (DRO)=<br />
<br />
At this point, the goal is to minimize the worst-case group risk over a single time step <math display="inline">\mathcal{R}_{max} (\theta^{(t)}) </math>. As previously mentioned, this is difficult because neither the population proportions <math display="inline">\{\alpha_k\} </math> nor the group distributions <math display="inline">\{P_k\} </math> are known. Therefore, Hashimoto et al. develop an optimization technique that is robust "against '''''all''''' directions around the data generating distribution". That is, DRO guards against high loss on any group distribution <math display="inline">P_k </math>, not only on the overall distribution <math display="inline">P</math> that techniques such as ERM optimize. To achieve this distributional robustness, the risk function <math display="inline">\mathcal{R}_{dro} </math> has to up-weight data <math display="inline">Z</math> that cause high loss <math display="inline">\ell(\theta; Z)</math>. In other words, for groups that suffer high loss, the risk function has to over-represent the mixture components (i.e. the group distributions <math display="inline">\{P_k\} </math>) relative to their original mixture weights (i.e. the population proportions <math display="inline">\{\alpha_k\} </math>). <br />
<br />
To do this, we consider the worst-case loss (i.e. the highest risk) over all perturbations of <math display="inline">P</math> within a certain limit. This limit is measured by the <math display="inline">\chi^2</math>-divergence (roughly speaking, a distance) between probability distributions. For two distributions <math display="inline">P</math> and <math display="inline">Q</math>, the divergence is defined as <math display="inline">D_{\chi^2} (P || Q):= \int (\frac{dP}{dQ} - 1)^2 dQ</math>. Using the <math display="inline">\chi^2</math>-divergence, Hashimoto et al. define the chi-squared ball <math display="inline">\mathcal{B}(P,r)</math> around the probability distribution <math display="inline">P</math> as <math display="inline">\mathcal{B}(P,r) := \{Q \ll P : D_{\chi^2} (Q || P) \leq r \}</math>. This ball lets us consider the worst-case loss (i.e. the highest risk) over all perturbations <math display="inline">Q</math> that lie inside the ball around <math display="inline">P</math>, which for a suitable radius include the group distributions <math display="inline">P_k </math>. This loss is given by<br />
<br />
\begin{align}<br />
\mathcal{R}_{dro}(\theta; r) := \underset{Q \in \mathcal{B}(P,r)}{\sup} \mathbb{E}_Q [\ell(\theta;Z)]<br />
\end{align}<br />
<br />
For <math display="inline">P:= \sum_{k \in [K]} \alpha_k P_k</math> and for all models <math display="inline">\theta \in \Theta</math>, this quantity bounds the risk of each group: <math display="inline">\mathcal{R}_k(\theta) \leq \mathcal{R}_{dro} (\theta; r_k)</math>, where <math display="inline">r_k := (1/\alpha_k -1)^2</math>. Furthermore, if we specify a lower bound on the group proportions <math display="inline">\alpha_{min} \leq \min_{k \in [K]} \alpha_k</math> and define <math display="inline">r_{max} := (1/\alpha_{min} -1)^2</math>, then <math display="inline">\mathcal{R}_{dro} (\theta; r_{max}) </math> is an upper bound on the worst-case risk <math display="inline">\mathcal{R}_{max} (\theta) </math> that can be minimized.<br />
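The bound can be checked numerically on a small empirical distribution. The sketch below is illustrative only: the loss values are made up, the function names are hypothetical, and the DRO risk is evaluated through the dual form stated under "Optimization of DRO", assuming the constant <math display="inline">C = (2r + 1)^{1/2}</math> for a general radius <math display="inline">r</math> (which matches the constant given there when <math display="inline">r = r_{max}</math>). A ternary search stands in for the binary search over <math display="inline">\eta</math>; it relies on the dual objective being convex in <math display="inline">\eta</math> and requires <math display="inline">r > 0</math>.<br />

```python
import math


def chi2_divergence(q, p):
    """D(Q || P) = sum_i p_i * (q_i / p_i - 1)^2 for discrete distributions with p_i > 0."""
    return sum(pi * (qi / pi - 1.0) ** 2 for qi, pi in zip(q, p))


def dual_objective(losses, eta, r):
    """F(theta; eta) = C * sqrt(E_P[(loss - eta)_+^2]) + eta with C = sqrt(2r + 1)."""
    c = math.sqrt(2.0 * r + 1.0)
    second_moment = sum(max(l - eta, 0.0) ** 2 for l in losses) / len(losses)
    return c * math.sqrt(second_moment) + eta


def dro_risk(losses, r, iterations=200):
    """Approximate inf over eta of the dual objective by ternary search (needs r > 0)."""
    span = max(losses) - min(losses)
    lo = min(losses) - span / math.sqrt(2.0 * r)  # the minimizer cannot lie below this
    hi = max(losses)                              # for eta >= max loss, F(eta) = eta
    for _ in range(iterations):
        m1 = lo + (hi - lo) / 3.0
        m2 = hi - (hi - lo) / 3.0
        if dual_objective(losses, m1, r) <= dual_objective(losses, m2, r):
            hi = m2
        else:
            lo = m1
    return dual_objective(losses, 0.5 * (lo + hi), r)


# Illustrative per-query losses: 18 majority queries, 2 minority queries (alpha_k = 0.1).
majority_losses = [0.0] * 9 + [0.1] * 9
minority_losses = [0.4, 0.8]
losses = majority_losses + minority_losses

alpha_minority = len(minority_losses) / len(losses)
r_k = (1.0 / alpha_minority - 1.0) ** 2

# Empirical P is uniform over all queries; the minority distribution P_k is uniform
# over the minority queries only.
p = [1.0 / len(losses)] * len(losses)
q = [0.0] * len(majority_losses) + [1.0 / len(minority_losses)] * len(minority_losses)

divergence = chi2_divergence(q, p)
minority_risk = sum(minority_losses) / len(minority_losses)
robust_risk = dro_risk(losses, r_k)

print(f"D(P_k || P) = {divergence:.2f} <= r_k = {r_k:.2f}")
print(f"R_k = {minority_risk:.2f} <= R_dro = {robust_risk:.2f}")
```

In this example the minority distribution lies inside the ball of radius <math display="inline">r_k</math>, and the DRO risk upper-bounds the minority risk without using any group labels during its computation.<br />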
<br />
==Optimization of DRO==<br />
<br />
To minimize <math display="inline">\mathcal{R}_{dro}(\theta; r) := \underset{Q \in \mathcal{B}(P,r)}{\sup} \mathbb{E}_Q [\ell(\theta;Z)]</math>, Hashimoto et al. consider the dual of this maximization problem, in which the supremum over distributions is rewritten as an infimum over a scalar dual variable. This dual is given by the minimization problem<br />
<br />
\begin{align}<br />
\mathcal{R}_{dro}(\theta; r) = \underset{\eta \in \mathbb{R}}{\inf} \left\{ F(\theta; \eta):= C\left(\mathbb{E}_P \left[ [\ell(\theta;Z) - \eta]_+^2 \right] \right)^\frac{1}{2} + \eta \right\}<br />
\end{align}<br />
<br />
with <math display="inline">C = (2r + 1)^{1/2}</math>, which for <math display="inline">r = r_{max}</math> equals <math display="inline">(2(1/\alpha_{min} - 1)^2 + 1)^{1/2}</math>. Here <math display="inline">\eta</math> is the dual variable (i.e. the variable introduced when forming the dual). Since <math display="inline">F(\theta; \eta)</math> involves only an expectation <math display="inline">\mathbb{E}_P</math> over the data generating distribution <math display="inline">P</math>, it can be minimized directly. For convex losses <math display="inline">\ell(\theta;Z)</math>, <math display="inline">F(\theta; \eta)</math> is convex, and the optimal <math display="inline">\eta</math> can be found by a binary search. In their paper, Hashimoto et al. further show that optimizing <math display="inline">\mathcal{R}_{dro}(\theta; r_{max})</math> at each time step controls the ''future'' worst-case risk <math display="inline">\mathcal{R}_{max} (\theta) </math>, and therefore retention rates. That means if the initial group proportions satisfy <math display="inline">\alpha_k^{(0)} \geq \alpha_{min}</math>, and <math display="inline">\mathcal{R}_{dro}(\theta; r_{max})</math> is optimized at every time step (and therefore <math display="inline">\mathcal{R}_{max} (\theta) </math> is bounded), <math display="inline">\mathcal{R}_{max}^T (\theta) </math> over all time steps is controlled. In other words, optimizing <math display="inline">\mathcal{R}_{dro}(\theta; r_{max})</math> at every time step is enough to avoid disparity amplification.</div>Mpaflahttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Fairness_Without_Demographics_in_Repeated_Loss_Minimization&diff=37025Fairness Without Demographics in Repeated Loss Minimization2018-10-22T19:04:33Z<p>Mpafla: /* Disparity Amplification */</p>
<hr />
<div>This page contains the summary of the paper "[http://proceedings.mlr.press/v80/hashimoto18a.html Fairness Without Demographics in Repeated Loss Minimization]" by Hashimoto, T. B., Srivastava, M., Namkoong, H., & Liang, P., which was published at the International Conference on Machine Learning (ICML) in 2018. <br />
<br />
=Introduction=<br />
<br />
Usually, machine learning models are trained to minimize their average loss in order to achieve high overall accuracy. While this works well for the majority, minority groups that use the system suffer high error rates because they contribute less data to the model. This phenomenon is known as '''''representation disparity''''' and has been observed in many models that, for instance, recognize faces or identify language. The disparity grows further because minority users who suffer higher error rates are less likely to use the system in the future. As a consequence, minority groups shrink further, and less data is available for the next optimization of the model. With less data the disparity becomes even worse, a phenomenon referred to as '''''disparity amplification'''''. <br />
<br />
In this paper, Hashimoto et al. first show that standard '''''empirical risk minimization (ERM)''''' does not control the loss of minority groups, and thus causes representation disparity and its amplification over time (even if the model is fair in the beginning). Second, the researchers mitigate this unfairness by proposing the use of '''''distributionally robust optimization (DRO)'''''. Hashimoto et al. show that DRO can bound the loss for minority groups and remains fair on models that ERM turns unfair. <br />
<br />
===Note on Fairness===<br />
<br />
Hashimoto et al. follow the ''difference principle'' to achieve and measure fairness. It is defined as maximizing the welfare of the worst-off group rather than the welfare of the population as a whole (cf. utilitarianism).<br />
<br />
=Representation Disparity=<br />
<br />
If a user makes a query <math display="inline">Z \sim P</math>, the model <math display="inline">\theta \in \Theta</math> makes a prediction, and the user experiences loss <math display="inline">\ell (\theta; Z)</math>. The ''expected'' loss of of a model <math display="inline">\theta</math> is denoted as the risk <math display="inline">\mathcal{R}(\theta) = \mathbb{E}_{Z \sim P} [\ell (\theta; Z)] </math>. While the queries (i.e. users using the system and providing data to the model) are made by users from <math display="inline">K</math> latent groups such that <math display="inline">Z \sim P := \sum_{k \in [k]} \alpha_kP_k</math>, neither the actual population proportions <math display="inline">\alpha_k</math> nor the group distributions <math display="inline">P_k</math> are known. Therefore, Hashimoto et al.'s goal is to control the worst-case risk over ''all'' groups <math display="inline">K</math> (and not just the risk of the worst-off group):<br />
<br />
\begin{align}<br />
\mathcal{R}_{max}(\theta) := \underset{k \in [K]}{\max} \mathcal{R}_k(\theta), \qquad \mathcal{R}_k(\theta) := \mathbb{E}_{P_k} [\ell (\theta; Z)]<br />
\end{align}<br />
<br />
There is high representation disparity if the expected loss of the model <math display="inline">\mathcal{R}(\theta)</math> is low, but the worst-case risk <math display="inline">\mathcal{R}_{max}(\theta)</math> is high. Such a model performs well on average (i.e. has low overall loss), but fails to represent some group <math display="inline">k</math> (i.e. the risk for the worst-off group is high). To make such models fairer, Hashimoto et al. aim to minimize the worst-case risk <math display="inline">\mathcal{R}_{max}(\theta)</math>.<br />
<br />
=Disparity Amplification=<br />
<br />
Representation disparity can amplify as time passes and loss is minimized. Over <math display="inline">t = 1, 2, ..., T</math> minimization rounds the group proportions <math display="inline">\alpha_k^{(t)}</math> vary dependent on past losses. At each round the expected number of users <math display="inline">\lambda_k^{(t+1)}</math> from group <math display="inline">k</math> is determined by <br />
<br />
\begin{align}<br />
\lambda_k^{(t+1)} := \lambda_k^{(t)} \nu(\mathcal{R}_k(\theta^{(t)})) + b_k<br />
\end{align}<br />
<br />
where <math display="inline">\nu(\mathcal{R}_k(\theta^{(t)}))</math> is the fraction of users retained from the previous round, so that <math display="inline">\lambda_k^{(t)} \nu(\mathcal{R}_k(\theta^{(t)}))</math> is the expected number of retained users, and <math display="inline">b_k</math> is the number of new users of group <math display="inline">k</math>. Put simply, the expected number of users of a group depends on the number of new users of that group and on the fraction of users that continue to use the system after the previous optimization step. If fewer and fewer users from minority groups return (i.e. the model has a low retention rate for minority group users), Hashimoto et al. argue that the representation disparity amplifies. <br />
<br />
==Empirical Risk Minimization (ERM)==<br />
<br />
Without knowledge of the population proportions <math display="inline">\alpha_k^{(t)}</math>, the new user rate <math display="inline">b_k</math>, and the retention function <math display="inline">\nu</math>, it is hard in practice to control the worst-case risk over all time periods <math display="inline">\mathcal{R}_{max}^T</math> for all groups. The standard approach is therefore to fit a sequence of models <math display="inline">\theta^{(t)}</math> to the data observed at each round. With ERM, for instance, each model is chosen by minimizing the empirical loss:<br />
<br />
\begin{align}<br />
\theta^{(t)} = \underset{\theta \in \Theta}{\arg\min} \sum_i \ell(\theta; Z_i^{(t)})<br />
\end{align}<br />
<br />
However, ERM fails to prevent disparity amplification. Because ERM minimizes the average loss, minority groups experience higher loss and are less likely to return to the system. As a result, the population proportions <math display="inline">\{\alpha_k^{(t)}\}</math> shift, and minority groups contribute even less data to the system. This is mirrored in the expected user count <math display="inline">\lambda^{(t)}</math> at each optimization step. In their paper, Hashimoto et al. show that, under ERM, <math display="inline">\lambda^{(t)}</math> is unstable because it loses its fair fixed point (i.e. the population fraction where risk minimization maintains the same population fraction over time). Therefore, ERM fails to control minority risk over time and is considered unfair.<br />
<br />
=Distributionally Robust Optimization (DRO)=<br />
<br />
At this point our goal is to minimize the worst-case group risk over a single time-step <math display="inline">\mathcal{R}_{max} (\theta^{(t)}) </math>. As previously mentioned, this is difficult to do because neither the population proportions <math display="inline">\{\alpha_k\} </math> nor group distributions <math display="inline">\{P_k\} </math> are known. Therefore, Hashimoto et al. developed an optimization technique that is robust "against '''''all''''' directions around the data generating distribution". This refers to the notion that DRO is robust to any group distribution <math display="inline">P_k </math> whose loss other optimization techniques such as ERM might try to optimize. To create this distributionally robustness, the optimizations risk function <math display="inline">\mathcal{R}_{dro} </math> has to "up-weigh" data <math display="inline">Z</math> that cause high loss <math display="inline">\ell(\theta, Z)</math>. In other words, the risk function has to over-represent mixture components (i.e. group distributions <math display="inline">\{P_k\} </math>) in relation to their original mixture weights (i.e. the population proportions <math display="inline">\{\alpha_k\} </math>) for groups that suffer high loss. <br />
<br />
To do this we need to consider the worst-case loss (i.e. the highest risk) over all perturbations <math display="inline">P_k </math> around <math display="inline">P</math> within a certain limit. This limit is described by the <math display="inline">\chi^2</math>-divergence (i.e. the distance, roughly speaking) between probability distributions. For two distributions <math display="inline">P</math> and <math display="inline">Q</math> the divergence is defined as <math display="inline">D_{\chi^2} (P || Q):= \int (\frac{dP}{dQ} - 1)^2</math>. With the help of the <math display="inline">\chi^2</math>-divergence, Hashimoto et al. define the chi-squared ball <math display="inline">\mathcal{B}(P,r)</math> around the probability distribution P. This ball is defined so that <math display="inline">\mathcal{B}(P,r) := \{Q \ll P : D_{\chi^2} (Q || P) \leq r \}</math>. It is exactly this ball that gives us the opportunity to consider the worst-case loss (i.e. the highest risk) over all perturbations <math display="inline">P_k </math> that lie inside the ball (i.e. within reasonable range) around the probability distribution <math display="inline">P</math>. This loss is given by<br />
<br />
\begin{align}<br />
\mathcal{R}_{dro}(\theta; r) := \underset{Q \in \mathcal{B}(P,r)}{\sup} \mathbb{E}_Q [\ell(\theta;Z)]<br />
\end{align}<br />
<br />
which for <math display="inline">P:= \sum_{k \in [K]} \alpha_k P_k</math> for all models <math display="inline">\theta \in \Theta</math> where <math display="inline">r_k := (1/a_k -1)^2</math> bounds the risk <math display="inline">\mathcal{R}_k(\theta) \leq \mathcal{R}_{dro} (\theta; r_k)</math> for each group with risk <math display="inline">\mathcal{R}_k(\theta)</math>. Furthermore, if we specify a lower bound on the group proportions <math display="inline">\alpha_{min} \leq min_{k \in [K]} \alpha_k</math>, and define <math display="inline">r_{max} := (1/\alpha_{min} -1)^2</math>, the worst-case risk <math display="inline">\mathcal{R}_{max} (\theta) </math> can be controlled by <math display="inline">\mathcal{R}_{dro} (\theta; r_{max}) </math> by forming an upper bound that can be minimized.<br />
<br />
==Optimization of DRO==<br />
<br />
To minimize <math display="inline">\mathcal{R}_{dro}(\theta, r) := \underset{Q \in \mathcal{B}(P,r)}{sup} \mathbb{E}_Q [\ell(\theta;Z)]</math> Hashimoto et al. look at the dual of this maximization problem (i.e. every maximization problem can be transformed into a minimization problem and vice-versa). This dual is given by the minimization problem<br />
<br />
\begin{align}<br />
\mathcal{R}_{dro}(\theta; r) = \underset{\eta \in \mathbb{R}}{\inf} \left\{ F(\theta; \eta):= C\left(\mathbb{E}_P \left[ [\ell(\theta;Z) - \eta]_+^2 \right] \right)^\frac{1}{2} + \eta \right\}<br />
\end{align}<br />
<br />
with <math display="inline">C = (2(1/a_{min} - 1)^2 + 1)^{1/2}</math>. <math display="inline">\eta</math> describes the dual variable (i.e. the variable that appears in creating the dual). Since <math display="inline">F(\theta; \eta)</math> involves an expectation <math display="inline">\mathbb{E}_P</math> over the data generating distribution <math display="inline">P</math>, <math display="inline">F(\theta; \eta)</math> can be directly minimized. For convex losses <math display="inline">\ell(\theta;Z)</math>, <math display="inline">F(\theta; \eta)</math> is convex, and can be minimized by performing a binary search over <math display="inline">\eta</math>. In their paper, Hashimoto et al. further show that optimizing <math display="inline">\mathcal{R}_{dro}(\theta, r_{max})</math> at each time step controls the ''future'' worst-case risk <math display="inline">\mathcal{R}_{max} (\theta) </math>, and therefore retention rates. That means if the initial group proportions satisfy <math display="inline">\alpha_k^{(0)} \geq a_{min}</math>, and <math display="inline">\mathcal{R}_{dro}(\theta, r_{max})</math> is optimized for every time step (and therefore <math display="inline">\mathcal{R}_{max} (\theta) </math> is minimized), <math display="inline">\mathcal{R}_{max}^T (\theta) </math> over all time steps is controlled. In other words, optimizing <math display="inline">\mathcal{R}_{dro}(\theta, r_{max})</math> every time step is enough to avoid disparity amplification.</div>Mpaflahttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Fairness_Without_Demographics_in_Repeated_Loss_Minimization&diff=37024Fairness Without Demographics in Repeated Loss Minimization2018-10-22T19:03:34Z<p>Mpafla: /* Disparity Amplification */</p>
<hr />
<div>This page contains the summary of the paper "[http://proceedings.mlr.press/v80/hashimoto18a.html Fairness Without Demographics in Repeated Loss Minimization]" by Hashimoto, T. B., Srivastava, M., Namkoong, H., & Liang, P., which was published at the International Conference on Machine Learning (ICML) in 2018. <br />
<br />
=Introduction=<br />
<br />
Usually, machine learning models are minimized in their average loss to achieve high overall accuracy. While this works well for the majority, minority groups that use the system suffer high error rates because they contribute less data to the model. This phenomenon is known as '''''representation disparity''''' and has been observed in many models that, for instance, recognize faces, or identify language. This disparity even increases as minority users suffer higher error rates, and therefore, are less likely to use the system in the future. As a consequence minority groups further shrink, and therefore, less data is available for the next optimization of the model. With less data the disparity becomes even worse - a phenomenon referred to as '''''disparity amplification'''''. <br />
<br />
In this paper, Hashimoto et al. first show that standard '''''empirical risk minimization (ERM)''''' does not control the loss of minority groups, and thus causes representation disparity and its amplification over time (even if the model is fair in the beginning). Second, the researchers try to mitigate this unfairness by proposing the use of '''''distributionally robust optimization (DRO)'''''. Indeed Hashimoto et al. are able to show that DRO can bound the loss for minority groups and is fair on models that ERM turns unfair. <br />
<br />
===Note on Fairness===<br />
<br />
Hashimoto et al. follow the ''difference principle'' to achieve and measure fairness. It is defined as the maximization of the welfare of the worst-off group rather than the whole group (cf. utilitarianism).<br />
<br />
=Representation Disparity=<br />
<br />
If a user makes a query <math display="inline">Z \sim P</math>, the model <math display="inline">\theta \in \Theta</math> makes a prediction, and the user experiences loss <math display="inline">\ell (\theta; Z)</math>. The ''expected'' loss of of a model <math display="inline">\theta</math> is denoted as the risk <math display="inline">\mathcal{R}(\theta) = \mathbb{E}_{Z \sim P} [\ell (\theta; Z)] </math>. While the queries (i.e. users using the system and providing data to the model) are made by users from <math display="inline">K</math> latent groups such that <math display="inline">Z \sim P := \sum_{k \in [k]} \alpha_kP_k</math>, neither the actual population proportions <math display="inline">\alpha_k</math> nor the group distributions <math display="inline">P_k</math> are known. Therefore, Hashimoto et al.'s goal is to control the worst-case risk over ''all'' groups <math display="inline">K</math> (and not just the risk of the worst-off group):<br />
<br />
\begin{align}<br />
\mathcal{R}_{max}(\theta) := \underset{k \in [K]}{\max} \mathcal{R}_k(\theta), \qquad \mathcal{R}_k(\theta) := \mathbb{E}_{P_k} [\ell (\theta; Z)]<br />
\end{align}<br />
<br />
There is high representation disparity if the expected loss of the model <math display="inline">\mathcal{R}(\theta)</math> is low, but the worst-case risk <math display="inline">\mathcal{R}_{max}(\theta)</math> is high. Such a model performs well on average (i.e. has low overall loss), but fails to represent some group <math display="inline">k</math> (i.e. the risk for the worst-off group is high). To make such models fairer, Hashimoto et al. aim to minimize the worst-case risk <math display="inline">\mathcal{R}_{max}(\theta)</math>.<br />
<br />
=Disparity Amplification=<br />
<br />
Representation disparity can amplify as time passes and loss is minimized. Over <math display="inline">t = 1, 2, ..., T</math> minimization rounds the group proportions <math display="inline">\alpha_k^{(t)}</math> vary dependent on past losses. At each round the expected number of users <math display="inline">\lambda_k^{(t+1)}</math> from group <math display="inline">k</math> is determined by <br />
<br />
\begin{align}<br />
\lambda_k^{(t+1)} := \lambda_k^{(t)} \nu(\mathcal{R}_k(\theta^{(t)})) + b_k<br />
\end{align}<br />
<br />
where <math display="inline">\nu(\mathcal{R}_k(\theta^{(t)}))</math> is the fraction of users retained from the previous round, so that <math display="inline">\lambda_k^{(t)} \nu(\mathcal{R}_k(\theta^{(t)}))</math> is the expected number of retained users, and <math display="inline">b_k</math> is the number of new users of group <math display="inline">k</math>. Put simply, the expected number of users of a group depends on the number of new users of that group and on the fraction of users that continue to use the system after the previous optimization step. If fewer and fewer users from minority groups return (i.e. the model has a low retention rate of users), Hashimoto et al. argue that the representation disparity amplifies. <br />
<br />
==Empirical Risk Minimization (ERM)==<br />
<br />
Without the knowledge of population proportions <math display="inline">\alpha_k^{(t)}</math>, the new user rate <math display="inline">b_k</math>, and the retention function <math display="inline">\nu</math> it is hard in practice, however, to control the worst-case risk over all time periods <math display="inline">\mathcal{R}_{max}^T</math> for all groups. That is why it is the standard approach to fit a sequence of models <math display="inline">\theta^{(t)}</math> by empirically approximating them. Using ERM, for instance, the optimal model is approached by minimizing the loss of the model:<br />
<br />
\begin{align}<br />
\theta^{(t)} = \underset{\theta \in \Theta}{\arg\min} \sum_i \ell(\theta; Z_i^{(t)})<br />
\end{align}<br />
<br />
However, ERM fails to prevent disparity amplification. By minimizing the expected loss of the model, minority groups experience higher loss, and do not return to use the system. In doing so, the population proportions <math display="inline">\{\alpha_k^{(t)}\}</math> shift, and certain minority groups contribute even less to the system. This is mirrored in the expected user count <math display="inline">\lambda^{(t)}</math> at each optimization point. In their paper Hashimoto et al. show that, if using ERM, <math display="inline">\lambda^{(t)}</math> is unstable because it loses its fair fixed point (i.e. the population fraction where risk minimization maintains the same population fraction over time). Therefore, ERM fails to control minority risk over time, and is considered unfair.<br />
<br />
=Distributionally Robust Optimization (DRO)=<br />
<br />
At this point our goal is to minimize the worst-case group risk over a single time-step <math display="inline">\mathcal{R}_{max} (\theta^{(t)}) </math>. As previously mentioned, this is difficult to do because neither the population proportions <math display="inline">\{\alpha_k\} </math> nor group distributions <math display="inline">\{P_k\} </math> are known. Therefore, Hashimoto et al. developed an optimization technique that is robust "against '''''all''''' directions around the data generating distribution". This refers to the notion that DRO is robust to any group distribution <math display="inline">P_k </math> whose loss other optimization techniques such as ERM might try to optimize. To create this distributionally robustness, the optimizations risk function <math display="inline">\mathcal{R}_{dro} </math> has to "up-weigh" data <math display="inline">Z</math> that cause high loss <math display="inline">\ell(\theta, Z)</math>. In other words, the risk function has to over-represent mixture components (i.e. group distributions <math display="inline">\{P_k\} </math>) in relation to their original mixture weights (i.e. the population proportions <math display="inline">\{\alpha_k\} </math>) for groups that suffer high loss. <br />
<br />
To do this we need to consider the worst-case loss (i.e. the highest risk) over all perturbations <math display="inline">P_k </math> around <math display="inline">P</math> within a certain limit. This limit is described by the <math display="inline">\chi^2</math>-divergence (i.e. the distance, roughly speaking) between probability distributions. For two distributions <math display="inline">P</math> and <math display="inline">Q</math> the divergence is defined as <math display="inline">D_{\chi^2} (P || Q):= \int (\frac{dP}{dQ} - 1)^2</math>. With the help of the <math display="inline">\chi^2</math>-divergence, Hashimoto et al. define the chi-squared ball <math display="inline">\mathcal{B}(P,r)</math> around the probability distribution P. This ball is defined so that <math display="inline">\mathcal{B}(P,r) := \{Q \ll P : D_{\chi^2} (Q || P) \leq r \}</math>. It is exactly this ball that gives us the opportunity to consider the worst-case loss (i.e. the highest risk) over all perturbations <math display="inline">P_k </math> that lie inside the ball (i.e. within reasonable range) around the probability distribution <math display="inline">P</math>. This loss is given by<br />
<br />
\begin{align}<br />
\mathcal{R}_{dro}(\theta; r) := \underset{Q \in \mathcal{B}(P,r)}{\sup} \mathbb{E}_Q [\ell(\theta;Z)]<br />
\end{align}<br />
<br />
which for <math display="inline">P:= \sum_{k \in [K]} \alpha_k P_k</math> for all models <math display="inline">\theta \in \Theta</math> where <math display="inline">r_k := (1/a_k -1)^2</math> bounds the risk <math display="inline">\mathcal{R}_k(\theta) \leq \mathcal{R}_{dro} (\theta; r_k)</math> for each group with risk <math display="inline">\mathcal{R}_k(\theta)</math>. Furthermore, if we specify a lower bound on the group proportions <math display="inline">\alpha_{min} \leq min_{k \in [K]} \alpha_k</math>, and define <math display="inline">r_{max} := (1/\alpha_{min} -1)^2</math>, the worst-case risk <math display="inline">\mathcal{R}_{max} (\theta) </math> can be controlled by <math display="inline">\mathcal{R}_{dro} (\theta; r_{max}) </math> by forming an upper bound that can be minimized.<br />
<br />
==Optimization of DRO==<br />
<br />
To minimize <math display="inline">\mathcal{R}_{dro}(\theta, r) := \underset{Q \in \mathcal{B}(P,r)}{sup} \mathbb{E}_Q [\ell(\theta;Z)]</math> Hashimoto et al. look at the dual of this maximization problem (i.e. every maximization problem can be transformed into a minimization problem and vice-versa). This dual is given by the minimization problem<br />
<br />
\begin{align}<br />
\mathcal{R}_{dro}(\theta; r) = \underset{\eta \in \mathbb{R}}{\inf} \left\{ F(\theta; \eta):= C\left(\mathbb{E}_P \left[ [\ell(\theta;Z) - \eta]_+^2 \right] \right)^\frac{1}{2} + \eta \right\}<br />
\end{align}<br />
<br />
with <math display="inline">C = (2(1/a_{min} - 1)^2 + 1)^{1/2}</math>. <math display="inline">\eta</math> describes the dual variable (i.e. the variable that appears in creating the dual). Since <math display="inline">F(\theta; \eta)</math> involves an expectation <math display="inline">\mathbb{E}_P</math> over the data generating distribution <math display="inline">P</math>, <math display="inline">F(\theta; \eta)</math> can be directly minimized. For convex losses <math display="inline">\ell(\theta;Z)</math>, <math display="inline">F(\theta; \eta)</math> is convex, and can be minimized by performing a binary search over <math display="inline">\eta</math>. In their paper, Hashimoto et al. further show that optimizing <math display="inline">\mathcal{R}_{dro}(\theta, r_{max})</math> at each time step controls the ''future'' worst-case risk <math display="inline">\mathcal{R}_{max} (\theta) </math>, and therefore retention rates. That means if the initial group proportions satisfy <math display="inline">\alpha_k^{(0)} \geq a_{min}</math>, and <math display="inline">\mathcal{R}_{dro}(\theta, r_{max})</math> is optimized for every time step (and therefore <math display="inline">\mathcal{R}_{max} (\theta) </math> is minimized), <math display="inline">\mathcal{R}_{max}^T (\theta) </math> over all time steps is controlled. In other words, optimizing <math display="inline">\mathcal{R}_{dro}(\theta, r_{max})</math> every time step is enough to avoid disparity amplification.</div>Mpaflahttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Fairness_Without_Demographics_in_Repeated_Loss_Minimization&diff=37023Fairness Without Demographics in Repeated Loss Minimization2018-10-22T19:02:39Z<p>Mpafla: /* Representation Disparity */</p>
<hr />
<div>This page contains a summary of the paper "[http://proceedings.mlr.press/v80/hashimoto18a.html Fairness Without Demographics in Repeated Loss Minimization]" by Hashimoto, T. B., Srivastava, M., Namkoong, H., & Liang, P., which was published at the International Conference on Machine Learning (ICML) in 2018. <br />
<br />
=Introduction=<br />
<br />
Usually, machine learning models are trained to minimize their average loss in order to achieve high overall accuracy. While this works well for the majority, minority groups that use the system suffer high error rates because they contribute less data to the model. This phenomenon is known as '''''representation disparity''''' and has been observed in many models that, for instance, recognize faces or identify language. The disparity can even grow over time: minority users who suffer higher error rates are less likely to use the system in the future. As a consequence, minority groups shrink further, and less data from them is available for the next optimization of the model. With less data the disparity becomes even worse, a phenomenon referred to as '''''disparity amplification'''''. <br />
<br />
In this paper, Hashimoto et al. first show that standard '''''empirical risk minimization (ERM)''''' does not control the loss of minority groups, and thus causes representation disparity and its amplification over time (even if the model is fair in the beginning). Second, the researchers try to mitigate this unfairness by proposing the use of '''''distributionally robust optimization (DRO)'''''. Indeed, Hashimoto et al. are able to show that DRO can bound the loss for minority groups and remains fair in settings where ERM becomes unfair. <br />
<br />
===Note on Fairness===<br />
<br />
Hashimoto et al. follow the ''difference principle'' to achieve and measure fairness. It calls for maximizing the welfare of the worst-off group rather than the aggregate welfare of the whole population (cf. utilitarianism).<br />
<br />
=Representation Disparity=<br />
<br />
If a user makes a query <math display="inline">Z \sim P</math>, the model <math display="inline">\theta \in \Theta</math> makes a prediction, and the user experiences loss <math display="inline">\ell (\theta; Z)</math>. The ''expected'' loss of a model <math display="inline">\theta</math> is denoted as the risk <math display="inline">\mathcal{R}(\theta) = \mathbb{E}_{Z \sim P} [\ell (\theta; Z)] </math>. While the queries (i.e. users using the system and providing data to the model) are made by users from <math display="inline">K</math> latent groups such that <math display="inline">Z \sim P := \sum_{k \in [K]} \alpha_kP_k</math>, neither the actual population proportions <math display="inline">\alpha_k</math> nor the group distributions <math display="inline">P_k</math> are known. Therefore, Hashimoto et al.'s goal is to control the worst-case risk over ''all'' <math display="inline">K</math> groups (i.e. the risk of whichever group is worst off, without knowing in advance which group that is):<br />
<br />
\begin{align}<br />
\mathcal{R}_{max}(\theta) := \underset{k \in [K]}{\max} \mathcal{R}_k(\theta), \qquad \mathcal{R}_k(\theta) := \mathbb{E}_{P_k} [\ell (\theta; Z)]<br />
\end{align}<br />
<br />
There is high representation disparity if the expected loss of the model <math display="inline">\mathcal{R}(\theta)</math> is low, but the worst-case risk <math display="inline">\mathcal{R}_{max}(\theta)</math> is high. This means that a model with high representation disparity performs well on average (i.e. has low overall loss), but fails to represent some groups <math display="inline">k</math> (i.e. the risk for the worst-off group is high). In order to make models with high representation disparity fairer, Hashimoto et al. try to minimize this worst-case risk <math display="inline">\mathcal{R}_{max}(\theta)</math>.<br />
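<br />
The gap between the average risk and the worst-case group risk can be illustrated with a small numerical sketch. The group names, proportions, and per-sample losses below are made-up values for illustration only; in the paper's setting the group memberships are latent, so the group labels are assumed known here purely to compare the two quantities:<br />

```python
# Toy illustration: average risk R(theta) vs. worst-case group risk R_max(theta).
# All numbers are hypothetical; group labels are assumed known here
# only so that the two quantities can be compared.

def mean(values):
    return sum(values) / len(values)

# Per-sample losses for two hypothetical groups (90% majority, 10% minority).
losses_by_group = {
    "majority": [0.1] * 90,
    "minority": [0.9] * 10,
}

# Average risk: mean loss over all samples, regardless of group.
all_losses = [loss for group in losses_by_group.values() for loss in group]
average_risk = mean(all_losses)

# Group risks R_k(theta) and worst-case risk R_max(theta) = max_k R_k(theta).
group_risks = {name: mean(values) for name, values in losses_by_group.items()}
worst_case_risk = max(group_risks.values())

print(f"average risk    = {average_risk:.2f}")
print(f"worst-case risk = {worst_case_risk:.2f}")
```

In this toy sample the average risk stays low (about 0.18) even though the minority group's risk is about 0.9, which is the situation described above as high representation disparity.<br />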
<br />
=Disparity Amplification=<br />
<br />
Representation disparity can amplify as time passes and loss is repeatedly minimized. Formally, over <math display="inline">t = 1, 2, \ldots, T</math> minimization rounds, the group proportions <math display="inline">\alpha_k^{(t)}</math> vary depending on past losses. At each round, the expected number of users <math display="inline">\lambda_k^{(t+1)}</math> from group <math display="inline">k</math> is determined by <br />
<br />
\begin{align}<br />
\lambda_k^{(t+1)} := \lambda_k^{(t)} \nu(\mathcal{R}_k(\theta^{(t)})) + b_k<br />
\end{align}<br />
<br />
where <math display="inline">\nu(\mathcal{R}_k(\theta^{(t)}))</math> is the fraction of users of group <math display="inline">k</math> that is retained from the previous round, so that <math display="inline">\lambda_k^{(t)} \nu(\mathcal{R}_k(\theta^{(t)}))</math> is the expected number of retained users, and <math display="inline">b_k</math> is the number of new users of group <math display="inline">k</math>. Put simply, the expected number of users of a group depends on the number of new users of that group and on the fraction of existing users that continue to use the system after the previous optimization step. If fewer and fewer users from minority groups return to the system (i.e. the model has a low retention rate for these users), Hashimoto et al. argue that the representation disparity amplifies. <br />
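<br />
The update rule can be made concrete with a toy simulation. The sketch below is not the paper's experiment: the retention function <math display="inline">\nu(\mathcal{R}) = 1 - \mathcal{R}</math>, the assumption that a group's risk is <math display="inline">1 - \alpha_k</math> (so smaller groups are served worse), and all starting values are hypothetical choices made only to show how the minority share can shrink:<br />

```python
# Toy simulation of lambda_k^(t+1) = lambda_k^(t) * nu(R_k(theta^(t))) + b_k.
# Assumptions (not from the paper): two groups, retention nu(R) = 1 - R,
# and a group's risk falls as its population share grows: R_k = 1 - alpha_k.

def retention(risk):
    """Hypothetical retention function: lower risk keeps more users."""
    return 1.0 - risk

def simulate(initial_users, new_users, rounds):
    """Return the share of the last group (the minority) after each round."""
    users = list(initial_users)
    shares = []
    for _ in range(rounds):
        total = sum(users)
        proportions = [u / total for u in users]   # alpha_k^(t)
        risks = [1.0 - a for a in proportions]     # toy R_k(theta^(t))
        users = [u * retention(r) + b              # the update rule
                 for u, r, b in zip(users, risks, new_users)]
        shares.append(users[-1] / sum(users))
    return shares

minority_shares = simulate(initial_users=[60.0, 40.0],
                           new_users=[10.0, 10.0], rounds=10)
print([round(share, 3) for share in minority_shares])
```

Under these toy assumptions the minority starts at a 40% share and its share falls from round to round, even though both groups receive the same number of new users.<br />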
<br />
==Empirical Risk Minimization (ERM)==<br />
<br />
Without knowledge of the population proportions <math display="inline">\alpha_k^{(t)}</math>, the new user rate <math display="inline">b_k</math>, and the retention function <math display="inline">\nu</math>, however, it is hard in practice to control the worst-case risk over all time periods <math display="inline">\mathcal{R}_{max}^T</math> for all groups. That is why the standard approach is to fit a sequence of models <math display="inline">\theta^{(t)}</math>, each estimated from the data observed at round <math display="inline">t</math>. Using ERM, for instance, each model is obtained by minimizing the total loss over the observed samples:<br />
<br />
\begin{align}<br />
\theta^{(t)} = \underset{\theta \in \Theta}{\operatorname{arg\,min}} \sum_i \ell(\theta; Z_i^{(t)})<br />
\end{align}<br />
<br />
However, ERM fails to prevent disparity amplification. Because ERM minimizes the average loss of the model, minority groups can experience higher loss, and their users are then less likely to return to the system. As a result, the population proportions <math display="inline">\{\alpha_k^{(t)}\}</math> shift, and certain minority groups contribute even less to the system. This is mirrored in the expected user count <math display="inline">\lambda^{(t)}</math> at each optimization point. In their paper, Hashimoto et al. show that, under ERM, the fair fixed point of <math display="inline">\lambda^{(t)}</math> (i.e. the population fraction where risk minimization maintains the same population fraction over time) can be unstable. Therefore, ERM fails to control minority risk over time, and is considered unfair.<br />
<br />
=Distributionally Robust Optimization (DRO)=<br />
<br />
At this point our goal is to minimize the worst-case group risk over a single time step <math display="inline">\mathcal{R}_{max} (\theta^{(t)}) </math>. As previously mentioned, this is difficult to do because neither the population proportions <math display="inline">\{\alpha_k\} </math> nor the group distributions <math display="inline">\{P_k\} </math> are known. Therefore, Hashimoto et al. propose an optimization technique that is robust "against '''''all''''' directions around the data generating distribution". This refers to the notion that DRO is robust to any group distribution <math display="inline">P_k </math> whose loss other optimization techniques such as ERM might try to optimize. To create this distributional robustness, the risk function <math display="inline">\mathcal{R}_{dro} </math> of the optimization has to "up-weight" data <math display="inline">Z</math> that cause high loss <math display="inline">\ell(\theta; Z)</math>. In other words, the risk function has to over-represent mixture components (i.e. group distributions <math display="inline">\{P_k\} </math>) in relation to their original mixture weights (i.e. the population proportions <math display="inline">\{\alpha_k\} </math>) for groups that suffer high loss. <br />
<br />
To do this we need to consider the worst-case loss (i.e. the highest risk) over all perturbations <math display="inline">Q</math> around <math display="inline">P</math> within a certain limit. This limit is described by the <math display="inline">\chi^2</math>-divergence (i.e. the distance, roughly speaking) between probability distributions. For two distributions <math display="inline">P</math> and <math display="inline">Q</math> the divergence is defined as <math display="inline">D_{\chi^2} (P || Q):= \int (\frac{dP}{dQ} - 1)^2 dQ</math>. With the help of the <math display="inline">\chi^2</math>-divergence, Hashimoto et al. define the chi-squared ball <math display="inline">\mathcal{B}(P,r)</math> around the probability distribution <math display="inline">P</math> as <math display="inline">\mathcal{B}(P,r) := \{Q \ll P : D_{\chi^2} (Q || P) \leq r \}</math>. It is exactly this ball that allows us to consider the worst-case loss (i.e. the highest risk) over all perturbations that lie inside the ball (i.e. within a reasonable range) around the probability distribution <math display="inline">P</math>; for a suitable radius <math display="inline">r</math>, these include the group distributions <math display="inline">P_k</math>. This loss is given by<br />
<br />
\begin{align}<br />
\mathcal{R}_{dro}(\theta, r) := \underset{Q \in \mathcal{B}(P,r)}{\sup} \mathbb{E}_Q [\ell(\theta;Z)]<br />
\end{align}<br />
<br />
which, for <math display="inline">P:= \sum_{k \in [K]} \alpha_k P_k</math> and all models <math display="inline">\theta \in \Theta</math>, bounds the risk of each group: with <math display="inline">r_k := (1/\alpha_k -1)^2</math>, we have <math display="inline">\mathcal{R}_k(\theta) \leq \mathcal{R}_{dro} (\theta; r_k)</math>. Furthermore, if we specify a lower bound on the group proportions <math display="inline">\alpha_{min} \leq \min_{k \in [K]} \alpha_k</math>, and define <math display="inline">r_{max} := (1/\alpha_{min} -1)^2</math>, the worst-case risk <math display="inline">\mathcal{R}_{max} (\theta) </math> can be controlled through <math display="inline">\mathcal{R}_{dro} (\theta; r_{max}) </math>, which forms an upper bound on it that can be minimized.<br />
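<br />
On a finite sample, this upper bound can be evaluated through the dual form stated in the section "Optimization of DRO", <math display="inline">\inf_{\eta} \{ C (\mathbb{E}_P [ [\ell - \eta]_+^2 ])^{1/2} + \eta \}</math> with <math display="inline">C = (2 r_{max} + 1)^{1/2}</math>. The following is a minimal sketch under stated assumptions, not the authors' implementation: the function name <code>dro_risk</code>, the sample losses, and the search interval are illustrative choices, and a ternary search over <math display="inline">\eta</math> is used in place of the binary search mentioned in the summary, since it relies only on the objective being convex in <math display="inline">\eta</math>:<br />

```python
import math

def dro_risk(losses, alpha_min, iterations=200):
    """Evaluate inf_eta C * sqrt(mean([loss - eta]_+^2)) + eta on a sample.

    C = sqrt(2 * r_max + 1) with r_max = (1 / alpha_min - 1)^2, following
    the formulas in this summary. Assumes 0 < alpha_min < 1.
    """
    r_max = (1.0 / alpha_min - 1.0) ** 2
    c = math.sqrt(2.0 * r_max + 1.0)

    def objective(eta):
        second_moment = sum(max(loss - eta, 0.0) ** 2 for loss in losses) / len(losses)
        return c * math.sqrt(second_moment) + eta

    # The objective is convex in eta. Above max(losses) it equals eta, and
    # below `lo` its slope is non-positive, so the minimizer lies in [lo, hi].
    avg = sum(losses) / len(losses)
    var = sum((loss - avg) ** 2 for loss in losses) / len(losses)
    lo = min(min(losses), avg - math.sqrt(var / (2.0 * r_max)))
    hi = max(losses)
    for _ in range(iterations):  # ternary search on a convex function
        m1 = lo + (hi - lo) / 3.0
        m2 = hi - (hi - lo) / 3.0
        if objective(m1) < objective(m2):
            hi = m2
        else:
            lo = m1
    return objective((lo + hi) / 2.0)

# Hypothetical sample: 90% of queries with loss 0.1, 10% with loss 0.9.
sample_losses = [0.1] * 90 + [0.9] * 10
bound = dro_risk(sample_losses, alpha_min=0.1)
print(f"average loss = {sum(sample_losses) / len(sample_losses):.2f}")
print(f"DRO bound    = {bound:.2f}")
```

For this hypothetical sample the bound comes out close to 0.9, the loss of the high-loss 10% of queries, whereas the average loss is about 0.18. The bound is computed without using any group labels; only the assumed lower bound <math display="inline">\alpha_{min}</math> is needed.<br />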
<br />
==Optimization of DRO==<br />
<br />
To minimize <math display="inline">\mathcal{R}_{dro}(\theta, r) := \underset{Q \in \mathcal{B}(P,r)}{\sup} \mathbb{E}_Q [\ell(\theta;Z)]</math>, Hashimoto et al. look at the dual of this inner maximization problem (i.e. an equivalent minimization problem over a dual variable that attains the same optimal value). This dual is given by the minimization problem<br />
<br />
\begin{align}<br />
\mathcal{R}_{dro}(\theta, r) = \underset{\eta \in \mathbb{R}}{\inf} \left\{ F(\theta; \eta):= C\left(\mathbb{E}_P \left[ [\ell(\theta;Z) - \eta]_+^2 \right] \right)^\frac{1}{2} + \eta \right\}<br />
\end{align}<br />
<br />
with <math display="inline">C = (2(1/\alpha_{min} - 1)^2 + 1)^{1/2}</math>. Here <math display="inline">\eta</math> is the dual variable (i.e. the variable introduced when forming the dual). Since <math display="inline">F(\theta; \eta)</math> only involves an expectation <math display="inline">\mathbb{E}_P</math> over the data generating distribution <math display="inline">P</math>, which can be estimated from samples without knowing the group memberships, <math display="inline">F(\theta; \eta)</math> can be minimized directly. For convex losses <math display="inline">\ell(\theta;Z)</math>, <math display="inline">F(\theta; \eta)</math> is convex, and can be minimized by performing a binary search over <math display="inline">\eta</math>. In their paper, Hashimoto et al. further show that optimizing <math display="inline">\mathcal{R}_{dro}(\theta, r_{max})</math> at each time step controls the ''future'' worst-case risk <math display="inline">\mathcal{R}_{max} (\theta) </math>, and therefore retention rates. That is, if the initial group proportions satisfy <math display="inline">\alpha_k^{(0)} \geq \alpha_{min}</math>, and <math display="inline">\mathcal{R}_{dro}(\theta, r_{max})</math> is optimized at every time step (so that its upper bound on <math display="inline">\mathcal{R}_{max} (\theta) </math> is kept small), the worst-case risk <math display="inline">\mathcal{R}_{max}^T (\theta) </math> over all time steps is controlled. In other words, optimizing <math display="inline">\mathcal{R}_{dro}(\theta, r_{max})</math> at every time step is enough to avoid disparity amplification.</div>Mpaflahttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Fairness_Without_Demographics_in_Repeated_Loss_Minimization&diff=37022Fairness Without Demographics in Repeated Loss Minimization2018-10-22T19:02:07Z<p>Mpafla: /* Representation Disparity */</p>
<hr />
<div>This page contains a summary of the paper "[http://proceedings.mlr.press/v80/hashimoto18a.html Fairness Without Demographics in Repeated Loss Minimization]" by Hashimoto, T. B., Srivastava, M., Namkoong, H., & Liang, P., which was published at the International Conference on Machine Learning (ICML) in 2018. <br />
<br />
=Introduction=<br />
<br />
Usually, machine learning models are minimized in their average loss to achieve high overall accuracy. While this works well for the majority, minority groups that use the system suffer high error rates because they contribute less data to the model. This phenomenon is known as '''''representation disparity''''' and has been observed in many models that, for instance, recognize faces, or identify language. This disparity even increases as minority users suffer higher error rates, and therefore, are less likely to use the system in the future. As a consequence minority groups further shrink, and therefore, less data is available for the next optimization of the model. With less data the disparity becomes even worse - a phenomenon referred to as '''''disparity amplification'''''. <br />
<br />
In this paper, Hashimoto et al. first show that standard '''''empirical risk minimization (ERM)''''' does not control the loss of minority groups, and thus causes representation disparity and its amplification over time (even if the model is fair in the beginning). Second, the researchers try to mitigate this unfairness by proposing the use of '''''distributionally robust optimization (DRO)'''''. Indeed Hashimoto et al. are able to show that DRO can bound the loss for minority groups and is fair on models that ERM turns unfair. <br />
<br />
===Note on Fairness===<br />
<br />
Hashimoto et al. follow the ''difference principle'' to achieve and measure fairness. It is defined as the maximization of the welfare of the worst-off group rather than the whole group (cf. utilitarianism).<br />
<br />
=Representation Disparity=<br />
<br />
If a user makes a query <math display="inline">Z \sim P</math>, the model <math display="inline">\theta \in \Theta</math> makes a prediction, and the user experiences loss <math display="inline">\ell (\theta; Z)</math>. The ''expected'' loss of a model <math display="inline">\theta</math> is denoted as the risk <math display="inline">\mathcal{R}(\theta) = \mathbb{E}_{Z \sim P} [\ell (\theta; Z)] </math>. While the queries (i.e. users using the system and providing data to the model) are made by users from <math display="inline">K</math> latent groups such that <math display="inline">Z \sim P := \sum_{k \in [K]} \alpha_kP_k</math>, neither the actual population proportions <math display="inline">\alpha_k</math> nor the group distributions <math display="inline">P_k</math> are known. Therefore, Hashimoto et al.'s goal is to control the worst-case risk over ''all'' <math display="inline">K</math> groups (i.e. the risk of whichever group is worst off, without knowing in advance which group that is):<br />
<br />
\begin{align}<br />
\mathcal{R}_{max}(\theta) := \underset{k \in [K]}{\max} \mathcal{R}_k(\theta), \qquad \mathcal{R}_k(\theta) := \mathbb{E}_{P_k} [\ell (\theta; Z)]<br />
\end{align}<br />
<br />
There is high representation disparity if the expected loss of the model <math display="inline">\mathcal{R}(\theta)</math> is low, but the worst-case risk <math display="inline">\mathcal{R}_{max}(\theta)</math> is high. This means that a model with high representation disparity performs well on average (i.e. has low overall loss), but fails to represent some groups <math display="inline">k</math> (i.e. the risk for the worst-off group is high). In order to make models fairer, Hashimoto et al. try to minimize this worst-case risk <math display="inline">\mathcal{R}_{max}(\theta)</math>.<br />
<br />
=Disparity Amplification=<br />
<br />
Representation disparity can amplify as time passes and loss is minimized. Formally, over <math display="inline">t = 1, 2, ..., T</math> minimization rounds the group proportions <math display="inline">\alpha_k^{(t)}</math> vary dependent on past losses. At each round the expected number of users <math display="inline">\lambda_k^{(t+1)}</math> from group <math display="inline">k</math> is determined by <br />
<br />
\begin{align}<br />
\lambda_k^{(t+1)} := \lambda_k^{(t)} \nu(\mathcal{R}_k(\theta^{(t)})) + b_k<br />
\end{align}<br />
<br />
where <math display="inline">\nu(\mathcal{R}_k(\theta^{(t)}))</math> is the fraction of users of group <math display="inline">k</math> that is retained from the previous round, so that <math display="inline">\lambda_k^{(t)} \nu(\mathcal{R}_k(\theta^{(t)}))</math> is the expected number of retained users, and <math display="inline">b_k</math> is the number of new users of group <math display="inline">k</math>. Put simply, the expected number of users of a group depends on the number of new users of that group and on the fraction of existing users that continue to use the system after the previous optimization step. If fewer and fewer users from minority groups return to the system (i.e. the model has a low retention rate for these users), Hashimoto et al. argue that the representation disparity amplifies. <br />
<br />
==Empirical Risk Minimization (ERM)==<br />
<br />
Without the knowledge of population proportions <math display="inline">\alpha_k^{(t)}</math>, the new user rate <math display="inline">b_k</math>, and the retention function <math display="inline">\nu</math> it is hard in practice, however, to control the worst-case risk over all time periods <math display="inline">\mathcal{R}_{max}^T</math> for all groups. That is why it is the standard approach to fit a sequence of models <math display="inline">\theta^{(t)}</math> by empirically approximating them. Using ERM, for instance, the optimal model is approached by minimizing the loss of the model:<br />
<br />
\begin{align}<br />
\theta^{(t)} = \underset{\theta \in \Theta}{\operatorname{arg\,min}} \sum_i \ell(\theta; Z_i^{(t)})<br />
\end{align}<br />
<br />
However, ERM fails to prevent disparity amplification. By minimizing the expected loss of the model, minority groups experience higher loss, and do not return to use the system. In doing so, the population proportions <math display="inline">\{\alpha_k^{(t)}\}</math> shift, and certain minority groups contribute even less to the system. This is mirrored in the expected user count <math display="inline">\lambda^{(t)}</math> at each optimization point. In their paper Hashimoto et al. show that, if using ERM, <math display="inline">\lambda^{(t)}</math> is unstable because it loses its fair fixed point (i.e. the population fraction where risk minimization maintains the same population fraction over time). Therefore, ERM fails to control minority risk over time, and is considered unfair.<br />
<br />
=Distributionally Robust Optimization (DRO)=<br />
<br />
At this point our goal is to minimize the worst-case group risk over a single time-step <math display="inline">\mathcal{R}_{max} (\theta^{(t)}) </math>. As previously mentioned, this is difficult to do because neither the population proportions <math display="inline">\{\alpha_k\} </math> nor group distributions <math display="inline">\{P_k\} </math> are known. Therefore, Hashimoto et al. developed an optimization technique that is robust "against '''''all''''' directions around the data generating distribution". This refers to the notion that DRO is robust to any group distribution <math display="inline">P_k </math> whose loss other optimization techniques such as ERM might try to optimize. To create this distributionally robustness, the optimizations risk function <math display="inline">\mathcal{R}_{dro} </math> has to "up-weigh" data <math display="inline">Z</math> that cause high loss <math display="inline">\ell(\theta, Z)</math>. In other words, the risk function has to over-represent mixture components (i.e. group distributions <math display="inline">\{P_k\} </math>) in relation to their original mixture weights (i.e. the population proportions <math display="inline">\{\alpha_k\} </math>) for groups that suffer high loss. <br />
<br />
To do this we need to consider the worst-case loss (i.e. the highest risk) over all perturbations <math display="inline">Q</math> around <math display="inline">P</math> within a certain limit. This limit is described by the <math display="inline">\chi^2</math>-divergence (i.e. the distance, roughly speaking) between probability distributions. For two distributions <math display="inline">P</math> and <math display="inline">Q</math> the divergence is defined as <math display="inline">D_{\chi^2} (P || Q):= \int (\frac{dP}{dQ} - 1)^2 dQ</math>. With the help of the <math display="inline">\chi^2</math>-divergence, Hashimoto et al. define the chi-squared ball <math display="inline">\mathcal{B}(P,r)</math> around the probability distribution <math display="inline">P</math> as <math display="inline">\mathcal{B}(P,r) := \{Q \ll P : D_{\chi^2} (Q || P) \leq r \}</math>. It is exactly this ball that allows us to consider the worst-case loss (i.e. the highest risk) over all perturbations that lie inside the ball (i.e. within a reasonable range) around the probability distribution <math display="inline">P</math>; for a suitable radius <math display="inline">r</math>, these include the group distributions <math display="inline">P_k</math>. This loss is given by<br />
<br />
\begin{align}<br />
\mathcal{R}_{dro}(\theta, r) := \underset{Q \in \mathcal{B}(P,r)}{\sup} \mathbb{E}_Q [\ell(\theta;Z)]<br />
\end{align}<br />
<br />
which, for <math display="inline">P:= \sum_{k \in [K]} \alpha_k P_k</math> and all models <math display="inline">\theta \in \Theta</math>, bounds the risk of each group: with <math display="inline">r_k := (1/\alpha_k -1)^2</math>, we have <math display="inline">\mathcal{R}_k(\theta) \leq \mathcal{R}_{dro} (\theta; r_k)</math>. Furthermore, if we specify a lower bound on the group proportions <math display="inline">\alpha_{min} \leq \min_{k \in [K]} \alpha_k</math>, and define <math display="inline">r_{max} := (1/\alpha_{min} -1)^2</math>, the worst-case risk <math display="inline">\mathcal{R}_{max} (\theta) </math> can be controlled through <math display="inline">\mathcal{R}_{dro} (\theta; r_{max}) </math>, which forms an upper bound on it that can be minimized.<br />
<br />
==Optimization of DRO==<br />
<br />
To minimize <math display="inline">\mathcal{R}_{dro}(\theta, r) := \underset{Q \in \mathcal{B}(P,r)}{\sup} \mathbb{E}_Q [\ell(\theta;Z)]</math>, Hashimoto et al. look at the dual of this inner maximization problem (i.e. an equivalent minimization problem over a dual variable that attains the same optimal value). This dual is given by the minimization problem<br />
<br />
\begin{align}<br />
\mathcal{R}_{dro}(\theta, r) = \underset{\eta \in \mathbb{R}}{\inf} \left\{ F(\theta; \eta):= C\left(\mathbb{E}_P \left[ [\ell(\theta;Z) - \eta]_+^2 \right] \right)^\frac{1}{2} + \eta \right\}<br />
\end{align}<br />
<br />
with <math display="inline">C = (2(1/\alpha_{min} - 1)^2 + 1)^{1/2}</math>. Here <math display="inline">\eta</math> is the dual variable (i.e. the variable introduced when forming the dual). Since <math display="inline">F(\theta; \eta)</math> only involves an expectation <math display="inline">\mathbb{E}_P</math> over the data generating distribution <math display="inline">P</math>, which can be estimated from samples without knowing the group memberships, <math display="inline">F(\theta; \eta)</math> can be minimized directly. For convex losses <math display="inline">\ell(\theta;Z)</math>, <math display="inline">F(\theta; \eta)</math> is convex, and can be minimized by performing a binary search over <math display="inline">\eta</math>. In their paper, Hashimoto et al. further show that optimizing <math display="inline">\mathcal{R}_{dro}(\theta, r_{max})</math> at each time step controls the ''future'' worst-case risk <math display="inline">\mathcal{R}_{max} (\theta) </math>, and therefore retention rates. That is, if the initial group proportions satisfy <math display="inline">\alpha_k^{(0)} \geq \alpha_{min}</math>, and <math display="inline">\mathcal{R}_{dro}(\theta, r_{max})</math> is optimized at every time step (so that its upper bound on <math display="inline">\mathcal{R}_{max} (\theta) </math> is kept small), the worst-case risk <math display="inline">\mathcal{R}_{max}^T (\theta) </math> over all time steps is controlled. In other words, optimizing <math display="inline">\mathcal{R}_{dro}(\theta, r_{max})</math> at every time step is enough to avoid disparity amplification.</div>Mpaflahttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Fairness_Without_Demographics_in_Repeated_Loss_Minimization&diff=37021Fairness Without Demographics in Repeated Loss Minimization2018-10-22T19:01:52Z<p>Mpafla: /* Representation Disparity */</p>
<hr />
<div>This page contains a summary of the paper "[http://proceedings.mlr.press/v80/hashimoto18a.html Fairness Without Demographics in Repeated Loss Minimization]" by Hashimoto, T. B., Srivastava, M., Namkoong, H., & Liang, P., which was published at the International Conference on Machine Learning (ICML) in 2018. <br />
<br />
=Introduction=<br />
<br />
Usually, machine learning models are minimized in their average loss to achieve high overall accuracy. While this works well for the majority, minority groups that use the system suffer high error rates because they contribute less data to the model. This phenomenon is known as '''''representation disparity''''' and has been observed in many models that, for instance, recognize faces, or identify language. This disparity even increases as minority users suffer higher error rates, and therefore, are less likely to use the system in the future. As a consequence minority groups further shrink, and therefore, less data is available for the next optimization of the model. With less data the disparity becomes even worse - a phenomenon referred to as '''''disparity amplification'''''. <br />
<br />
In this paper, Hashimoto et al. first show that standard '''''empirical risk minimization (ERM)''''' does not control the loss of minority groups, and thus causes representation disparity and its amplification over time (even if the model is fair in the beginning). Second, the researchers try to mitigate this unfairness by proposing the use of '''''distributionally robust optimization (DRO)'''''. Indeed Hashimoto et al. are able to show that DRO can bound the loss for minority groups and is fair on models that ERM turns unfair. <br />
<br />
===Note on Fairness===<br />
<br />
Hashimoto et al. follow the ''difference principle'' to achieve and measure fairness. It is defined as the maximization of the welfare of the worst-off group rather than the whole group (cf. utilitarianism).<br />
<br />
=Representation Disparity=<br />
<br />
If a user makes a query <math display="inline">Z \sim P</math>, the model <math display="inline">\theta \in \Theta</math> makes a prediction, and the user experiences loss <math display="inline">\ell (\theta; Z)</math>. The ''expected'' loss of a model <math display="inline">\theta</math> is denoted as the risk <math display="inline">\mathcal{R}(\theta) = \mathbb{E}_{Z \sim P} [\ell (\theta; Z)] </math>. While the queries (i.e. users using the system and providing data to the model) are made by users from <math display="inline">K</math> latent groups such that <math display="inline">Z \sim P := \sum_{k \in [K]} \alpha_kP_k</math>, neither the actual population proportions <math display="inline">\alpha_k</math> nor the group distributions <math display="inline">P_k</math> are known. Therefore, Hashimoto et al.'s goal is to control the worst-case risk over ''all'' <math display="inline">K</math> groups (i.e. the risk of whichever group is worst off, without knowing in advance which group that is):<br />
<br />
\begin{align}<br />
\mathcal{R}_{max}(\theta) := \underset{k \in [K]}{\max} \mathcal{R}_k(\theta), \qquad \mathcal{R}_k(\theta) := \mathbb{E}_{P_k} [\ell (\theta; Z)]<br />
\end{align}<br />
<br />
There is high representation disparity if the expected loss of the model <math display="inline">\mathcal{R}(\theta)</math> is low, but the worst-case risk <math display="inline">\mathcal{R}_{max}(\theta)</math> is high. This means that a model with high representation disparity performs well on average (i.e. has low overall loss), but fails to represent some groups <math display="inline">k</math> (i.e. the risk for the worst-off group is high). In order to make models fairer, Hashimoto et al. try to minimize this worst-case risk <math display="inline">\mathcal{R}_{max}(\theta)</math>.<br />
<br />
=Disparity Amplification=<br />
<br />
Representation disparity can amplify as time passes and loss is minimized. Formally, over <math display="inline">t = 1, 2, ..., T</math> minimization rounds the group proportions <math display="inline">\alpha_k^{(t)}</math> vary dependent on past losses. At each round the expected number of users <math display="inline">\lambda_k^{(t+1)}</math> from group <math display="inline">k</math> is determined by <br />
<br />
\begin{align}<br />
\lambda_k^{(t+1)} := \lambda_k^{(t)} \nu(\mathcal{R}_k(\theta^{(t)})) + b_k<br />
\end{align}<br />
<br />
where <math display="inline">\nu(\mathcal{R}_k(\theta^{(t)}))</math> is the fraction of users of group <math display="inline">k</math> that is retained from the previous round, so that <math display="inline">\lambda_k^{(t)} \nu(\mathcal{R}_k(\theta^{(t)}))</math> is the expected number of retained users, and <math display="inline">b_k</math> is the number of new users of group <math display="inline">k</math>. Put simply, the expected number of users of a group depends on the number of new users of that group and on the fraction of existing users that continue to use the system after the previous optimization step. If fewer and fewer users from minority groups return to the system (i.e. the model has a low retention rate for these users), Hashimoto et al. argue that the representation disparity amplifies. <br />
<br />
==Empirical Risk Minimization (ERM)==<br />
<br />
Without the knowledge of population proportions <math display="inline">\alpha_k^{(t)}</math>, the new user rate <math display="inline">b_k</math>, and the retention function <math display="inline">\nu</math> it is hard in practice, however, to control the worst-case risk over all time periods <math display="inline">\mathcal{R}_{max}^T</math> for all groups. That is why it is the standard approach to fit a sequence of models <math display="inline">\theta^{(t)}</math> by empirically approximating them. Using ERM, for instance, the optimal model is approached by minimizing the loss of the model:<br />
<br />
\begin{align}<br />
\theta^{(t)} = \underset{\theta \in \Theta}{\operatorname{arg\,min}} \sum_i \ell(\theta; Z_i^{(t)})<br />
\end{align}<br />
<br />
However, ERM fails to prevent disparity amplification. By minimizing the expected loss of the model, minority groups experience higher loss, and do not return to use the system. In doing so, the population proportions <math display="inline">\{\alpha_k^{(t)}\}</math> shift, and certain minority groups contribute even less to the system. This is mirrored in the expected user count <math display="inline">\lambda^{(t)}</math> at each optimization point. In their paper Hashimoto et al. show that, if using ERM, <math display="inline">\lambda^{(t)}</math> is unstable because it loses its fair fixed point (i.e. the population fraction where risk minimization maintains the same population fraction over time). Therefore, ERM fails to control minority risk over time, and is considered unfair.<br />
<br />
=Distributionally Robust Optimization (DRO)=<br />
<br />
At this point our goal is to minimize the worst-case group risk over a single time-step <math display="inline">\mathcal{R}_{max} (\theta^{(t)}) </math>. As previously mentioned, this is difficult to do because neither the population proportions <math display="inline">\{\alpha_k\} </math> nor group distributions <math display="inline">\{P_k\} </math> are known. Therefore, Hashimoto et al. developed an optimization technique that is robust "against '''''all''''' directions around the data generating distribution". This refers to the notion that DRO is robust to any group distribution <math display="inline">P_k </math> whose loss other optimization techniques such as ERM might try to optimize. To create this distributionally robustness, the optimizations risk function <math display="inline">\mathcal{R}_{dro} </math> has to "up-weigh" data <math display="inline">Z</math> that cause high loss <math display="inline">\ell(\theta, Z)</math>. In other words, the risk function has to over-represent mixture components (i.e. group distributions <math display="inline">\{P_k\} </math>) in relation to their original mixture weights (i.e. the population proportions <math display="inline">\{\alpha_k\} </math>) for groups that suffer high loss. <br />
<br />
To do this we need to