Understanding Image Motion with Group Representations
== Introduction ==
Motion perception is a key component of computer vision. It is critical to problems such as optical flow and visual odometry, where a sequence of images is used to calculate either the pixel-level (local) motion or the motion of the entire scene (global). The smooth image transformations caused by camera motion form a subspace of all possible image transformations. Here, we are interested in realistic transformations caused by motion; unrealistic transformations such as those produced by face swapping are not considered.
Supervised learning of 3D motion is challenging since explicit motion labels are not trivial to obtain. The proposed learning method does not need labelled data. Instead, the method constrains learning by using the properties of the motion space. The paper presents a general model of visual motion and shows how the motion space properties of associativity and invertibility can be used to constrain the learning of a deep neural network. The results show evidence that the learned model captures motion in both 2D and 3D settings.
== Related Work ==
The most common global representations of motion come from structure from motion (SfM) and simultaneous localization and mapping (SLAM), which use poses in the special Euclidean group <math> SE(3) </math> to represent a sequence of motions. However, these cannot represent non-rigid or independent motions. Another approach to representing motion is spatiotemporal features (STFs), which are flexible enough to represent non-rigid motions.
There are also works using CNNs to learn optical flow with brightness constancy assumptions and/or photometric local constraints. Learning-based stereo depth estimation has also shown promising results. For image sequences, there are works on shuffling the order of images to learn representations of their content, as well as on learning representations equivariant to the egomotion of the camera.
== Approach ==
The proposed method is based on the observation that 3D motions, equipped with composition, form a group. By learning the underlying mapping that captures the motion transformations, we approximate the latent motion of the scene. The method is designed to capture group associativity and invertibility.
Consider a latent structure space <math>S</math> whose elements generate images via a projection <math>\pi:S\rightarrow I</math>, and a latent motion space <math>M</math>, which is some closed subgroup of the set of homeomorphisms on <math>S</math>. For <math>s \in S</math>, a continuous motion sequence <math> \{m_t \in M \mid t \geq 0\} </math> generates a continuous image sequence <math> \{i_t \in I \mid t \geq 0\} </math> where <math> i_t=\pi(m_t(s)) </math>. Writing this as a hidden Markov model gives <math> i_t=\pi(m_{\Delta t}(s_{t-1})) </math>, where the current state is obtained by a change applied to the previous one. Since <math> M </math> is a closed group on <math> S </math>, it is associative, has inverses, and contains the identity; <math> SE(3) </math> is an example of such a group.
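For concreteness, the following is a minimal sketch (not from the paper) of this model with <math>M = SE(2)</math> acting on a toy structure space of 2D points; the names and the trivial projection are illustrative assumptions.

<syntaxhighlight lang="python">
import numpy as np

def se2(theta, tx, ty):
    """Homogeneous 3x3 matrix for a rotation plus translation in SE(2)."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s, tx],
                     [s,  c, ty],
                     [0,  0,  1]])

# A toy latent "structure": two 2D points in homogeneous coordinates.
structure = np.array([[1.0, 0.0, 1.0],
                      [0.0, 2.0, 1.0]]).T

m1 = se2(np.pi / 4, 2.0, 0.0)
m2 = se2(-np.pi / 4, 0.0, 1.0)

# The group properties the learning constraints are built on:
assert np.allclose((m1 @ m2) @ m1, m1 @ (m2 @ m1))     # associativity
assert np.allclose(m1 @ np.linalg.inv(m1), np.eye(3))  # invertibility
assert np.allclose(m1 @ np.eye(3), m1)                 # identity

# A motion sequence generates an image sequence i_t = pi(m_t(s)); here the
# "projection" pi simply reads off the 2D coordinates.
pi = lambda s: s[:2]
i0, i1 = pi(structure), pi(m1 @ structure)
</syntaxhighlight>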
=== Learning Motion by Group Properties ===
The goal is to learn a function <math> \Phi : I \times I \rightarrow \overline{M} </math>, where <math> \overline{M} </math> denotes the representation of <math> M </math>, as well as a composition operator <math> \diamond : \overline{M} \times \overline{M} \rightarrow \overline{M} </math> that represents composition in <math> M </math>. For all sequences, it is assumed that <math> t_0 < t_1 < t_2 ... </math> The learned representation should satisfy the group properties:
# Associativity: <math> \Phi(I_{t_0}, I_{t_2}) \diamond \Phi(I_{t_2}, I_{t_3}) = (\Phi(I_{t_0}, I_{t_1}) \diamond \Phi(I_{t_1}, I_{t_2})) \diamond \Phi(I_{t_2}, I_{t_3}) = \Phi(I_{t_0}, I_{t_1}) \diamond (\Phi(I_{t_1}, I_{t_2}) \diamond \Phi(I_{t_2}, I_{t_3})) = \Phi(I_{t_0}, I_{t_1}) \diamond \Phi(I_{t_1}, I_{t_3}) </math>
# Identity: <math> \Phi(I_{t_0}, I_{t_1}) \diamond e = \Phi(I_{t_0}, I_{t_1}) = e \diamond \Phi(I_{t_0}, I_{t_1}) </math> where <math> e=\Phi(I_{t}, I_{t}) \; \forall t </math>
# Invertibility: <math> \Phi(I_{t_0}, I_{t_1}) \diamond \Phi(I_{t_1}, I_{t_0}) = e </math>
An embedding loss is used to approximately enforce associativity and invertibility among subsequences sampled from an image sequence. Associativity is encouraged by pushing sequences with the same final motion but different intermediate transitions to the same representation. Invertibility is encouraged by pushing representations of the same motion traversed in opposite directions away from each other, and by pushing loops to the same representation. Uniqueness of the identity is encouraged by pushing loops away from non-identity representations, while loops from different sequences are pushed to the same (identity) representation.
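As a sketch of how these constraints translate into training targets (illustrative pseudocode, not the paper's code; <code>phi</code> and <code>compose</code> are stand-ins for the learned CNN and RNN):

<syntaxhighlight lang="python">
def group_consistency_pairs(phi, compose, I0, I1, I2):
    """Form the pull-together / push-apart pairs described above
    from three frames of one sequence (t0 < t1 < t2)."""
    two_step = compose(phi(I0, I1), phi(I1, I2))  # path t0 -> t1 -> t2
    direct   = phi(I0, I2)                        # path t0 -> t2
    loop     = compose(phi(I0, I1), phi(I1, I0))  # motion then its inverse
    identity = phi(I0, I0)                        # a trivial loop
    # Training pulls (two_step, direct) together (associativity) and pulls
    # (loop, identity) together, while pushing a motion away from its
    # reverse (invertibility and uniqueness of the identity).
    positives = [(two_step, direct), (loop, identity)]
    negatives = [(phi(I0, I1), phi(I1, I0))]
    return positives, negatives
</syntaxhighlight>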
These constraints hold for any type of transformation resulting from image motion. This places little restriction on the learning problem and allows all features relevant to the motion structure to be captured.
With this method it is also possible that multiple representations <math> \overline{M} </math> are learned from a single <math> M </math>, so the learned representation is not necessarily unique. In addition, the scenes are assumed to be free of rapidly changing content, scene cuts, and long-term occlusions.
=== Sequence Learning with Neural Networks ===
The functions <math> \Phi </math> and <math> \diamond </math> are approximated by a CNN and an RNN (an LSTM), respectively. The input to the network is a sequence of images <math> I_t = \{I_1,...,I_t\} </math>. The CNN processes pairs of images into intermediate representations, and the LSTM operates over the sequence of CNN outputs to produce an embedding sequence <math> R_t = \{R_{1,2},...,R_{t-1,t}\} </math>. Only the embedding at the final timestep is used in the loss. The network is trained to minimize a hinge loss over the embeddings of pairs of sequences. The cost function is:
<center><math>L(R^1,R^2) = \begin{cases} d(R^1,R^2), & \text{if positive pair} \\ \max(0, m - d(R^1,R^2)), & \text{if negative pair} \end{cases}</math></center>
<center><math> d_{cosine}(R^1,R^2)=1-\frac{\langle R^1,R^2 \rangle}{\lVert R^1 \rVert \lVert R^2 \rVert} </math></center>
where <math> d(R^1,R^2) </math> measures the distance between the embeddings of the two training sequences, chosen to be the cosine distance, and <math> m </math> is a fixed margin, set to 0.5. Positive pairs are training examples in which the two sequences have the same final motion; negative pairs are examples in which the two sequences have exactly opposite final motions. Using the L2 distance yields results similar to the cosine distance.
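A PyTorch-style sketch of this loss (the paper used Torch; the tensor shapes and names here are assumptions):

<syntaxhighlight lang="python">
import torch
import torch.nn.functional as F

def sequence_hinge_loss(r1, r2, positive, margin=0.5):
    """r1, r2: final-timestep embeddings of two subsequences, shape (B, D)."""
    d = 1.0 - F.cosine_similarity(r1, r2, dim=1)    # cosine distance
    if positive:                                    # same final motion
        return d.mean()
    return torch.clamp(margin - d, min=0.0).mean()  # opposite final motion
</syntaxhighlight>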
Each training sequence is decomposed into 6 subsequences: two forward, two backward, and two identity. To prevent the network from only looking at static differences, subsequence pairs are sampled such that they have the same start and end frames but different motions in between. Sequences of varying lengths are also used so the model generalizes across temporal scales. Training the network with only one input image per timestep was also tried, but consistently yielded worse results than image pairs.
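A sketch of one way to construct such subsequences (the exact sampling scheme is my reading of the text, not released code):

<syntaxhighlight lang="python">
import random

def sample_subsequences(frames):
    """Build forward / backward / identity subsequences from one sequence,
    with pairs sharing start and end frames but differing in between."""
    n = len(frames)
    i, j = sorted(random.sample(range(n), 2))
    k = random.randint(i, j)                    # an intermediate frame
    fwd_a = [frames[i], frames[k], frames[j]]   # two forward paths
    fwd_b = [frames[i], frames[j]]              # same endpoints, other motion
    bwd_a, bwd_b = fwd_a[::-1], fwd_b[::-1]     # two backward paths
    loop_a = [frames[i], frames[j], frames[i]]  # two identity loops
    loop_b = [frames[i], frames[k], frames[i]]
    return (fwd_a, fwd_b), (bwd_a, bwd_b), (loop_a, loop_b)
</syntaxhighlight>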
== Experimentation ==
The network was trained on a rotated and translated MNIST dataset as well as on the KITTI dataset.
* Implemented in Torch
* Adam for optimization, with a decay schedule of 30 epochs and the learning rate chosen by random search (see the sketch after this list)
* Batch size of 50-60 for MNIST, 25-30 for KITTI
* Dilated convolutions with ReLU and batch normalization
* Two LSTM cells per layer, with 256 hidden units each
* Sequence lengths of 3-5 images
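A PyTorch-flavoured sketch of this optimization setup (the paper used Torch; the model stub, base learning rate, decay factor, and epoch count are assumptions):

<syntaxhighlight lang="python">
import torch

# Stand-in model: an LSTM with 256 hidden units, as in the list above.
model = torch.nn.LSTM(input_size=128, hidden_size=256, num_layers=2)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)  # lr via random search
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=30, gamma=0.1)

for epoch in range(90):      # number of epochs is a placeholder
    # ... one training epoch over MNIST / KITTI subsequence pairs ...
    scheduler.step()         # decay the learning rate every 30 epochs
</syntaxhighlight>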
=== Rigid Motion in 2D ===
* MNIST data rotated <math>[0, 360)</math> degrees and translated <math>[-10, 10] </math> pixels, i.e. <math>SE(2)</math> transformations (a data-generation sketch follows this list)
* Visualized the representation using t-SNE
** Clear clustering by translation and rotation, but not by object class
** Suggests the representation captures the motion properties of the dataset but is independent of image content
* Visualized the image-conditioned saliency maps
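A sketch of how such <math>SE(2)</math>-transformed MNIST sequences could be generated (parameter ranges from the first bullet above; the fixed per-sequence step and the scipy calls are assumptions):

<syntaxhighlight lang="python">
import numpy as np
from scipy.ndimage import rotate, shift

def random_se2_sequence(digit, length=4, rng=np.random.default_rng()):
    """Apply a random SE(2) step repeatedly to a digit image (2D array)."""
    dtheta = rng.uniform(0.0, 360.0)         # rotation step in degrees
    dxy = rng.uniform(-10.0, 10.0, size=2)   # translation step in pixels
    frames, img = [digit], digit
    for _ in range(length - 1):
        img = shift(rotate(img, dtheta, reshape=False), dxy)
        frames.append(img)
    return frames
</syntaxhighlight>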
=== Real World Motion in 3D ===
== Conclusion ==
The authors presented a new model of motion and a method for learning motion representations. It is shown that enforcing group properties makes it possible to learn motion representations that generalize between scenes with disparate content. The results can be useful for navigation, prediction, and other behavioural tasks relying on motion. Because the method does not require labelled data, it can be applied to a large variety of tasks.
== Criticism ==
Although this method does not require any labelled data, it still learns by supervision through the defined group constraints. The authors showed that