Difference between revisions of "Understanding Image Motion with Group Representations"


Revision as of 12:49, 5 March 2018

Introduction

Motion perception is a key component of computer vision. It is critical to problems such as optical flow and visual odometry, where a sequence of images is used to calculate either the pixel-level (local) motion or the motion of the entire scene (global). The smooth image transformations caused by camera motion form a subspace of all possible image transformations. Here, we are interested only in realistic transformations caused by motion; unrealistic transformations caused by, say, face swapping are not considered.

Supervised learning of 3D motion is challenging since explicit motion labels are not trivial to obtain. The proposed learning method does not need labeled data. Instead, it constrains learning by using the properties of the motion space. The paper presents a general model of visual motion and shows how the motion space properties of associativity and invertibility can be used to constrain the learning of a deep neural network. The results show evidence that the learned model captures motion in both 2D and 3D settings.

<Image>

Related Work

The most common global representations of motion come from structure from motion (SfM) and simultaneous localization and mapping (SLAM), which represent a sequence of motions as poses in the special Euclidean group SE(3). However, these cannot be used to represent non-rigid or independent motions. Another approach to representing motion is spatiotemporal features (STFs), which are flexible enough to represent non-rigid motions.

There are also works that use CNNs to learn optical flow under brightness constancy assumptions and/or local photometric constraints. Learning-based work on stereo depth estimation has also shown results. Regarding image sequences, there are works on shuffling the order of images to learn representations of their contents, as well as on learning representations equivariant to the egomotion of the camera.

Approach

The proposed method is based on the observation that 3D motions, equipped with composition, form a group. By learning the underlying mapping that captures the motion transformations, we approximate the latent motion of the scene. The method is designed to capture group associativity and invertibility.
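
The two group properties named above can be checked numerically on a simple case. The following is a minimal sketch, not the paper's method: it uses planar rigid motions (SE(2)) as a stand-in for 3D motions to keep the matrices small, and the helper names se2, compose, inverse and close are hypothetical, chosen only for this illustration.

```python
import math

def se2(theta, tx, ty):
    """Homogeneous 3x3 matrix of a planar rigid motion (rotate by theta, then translate)."""
    c, s = math.cos(theta), math.sin(theta)
    return [[c, -s, tx], [s, c, ty], [0.0, 0.0, 1.0]]

def compose(a, b):
    """Group operation: matrix product a*b, i.e. apply b first, then a."""
    return [[sum(a[i][k] * b[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]

def inverse(m):
    """Closed-form inverse of a rigid motion: rotation R^T, translation -R^T t."""
    c, s, tx, ty = m[0][0], m[1][0], m[0][2], m[1][2]
    return [[c, s, -(c * tx + s * ty)],
            [-s, c, -(-s * tx + c * ty)],
            [0.0, 0.0, 1.0]]

def close(a, b, tol=1e-9):
    """Entrywise comparison up to floating-point tolerance."""
    return all(abs(a[i][j] - b[i][j]) < tol for i in range(3) for j in range(3))

IDENTITY = se2(0.0, 0.0, 0.0)

# Three arbitrary example motions (values are illustrative only).
m1, m2, m3 = se2(0.3, 1.0, -2.0), se2(-1.1, 0.5, 0.25), se2(2.0, -3.0, 4.0)

# Associativity: (m1 m2) m3 == m1 (m2 m3)
associative = close(compose(compose(m1, m2), m3), compose(m1, compose(m2, m3)))

# Invertibility: m1 m1^{-1} == m1^{-1} m1 == identity
invertible = (close(compose(m1, inverse(m1)), IDENTITY)
              and close(compose(inverse(m1), m1), IDENTITY))

print(associative, invertible)
```

A learned motion representation is not a matrix group by construction, which is why these properties have to be encouraged during training rather than assumed.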

Consider a latent structure space [math]S[/math], whose elements generate images via a projection [math]\pi:S\rightarrow I[/math], and a latent motion space [math]M[/math], which is some closed subgroup of the set of homeomorphisms on [math]S[/math]. For [math]s \in S[/math], a continuous motion sequence [math] \{m_t \in M | t \geq 0\} [/math] generates a continuous image sequence [math] \{i_t \in I | t \geq 0\} [/math], where [math] i_t=\pi(m_t(s)) [/math].
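
The generative model [math] i_t=\pi(m_t(s)) [/math] can be made concrete with a toy example. The sketch below is purely illustrative and makes several assumptions that are not in the source: the structure [math]s[/math] is a small 3D point set, [math]m_t[/math] is a rigid motion (rotation about the y-axis plus a shift along x), and [math]\pi[/math] is a pinhole projection with unit focal length. The function names motion and project are hypothetical.

```python
import math

def motion(t):
    """Hypothetical m_t: rotate about the y-axis by 0.1*t rad, then shift 0.2*t along x."""
    c, s = math.cos(0.1 * t), math.sin(0.1 * t)

    def apply(points):
        return [(c * x + s * z + 0.2 * t, y, -s * x + c * z) for x, y, z in points]

    return apply

def project(points, focal=1.0):
    """Hypothetical pi: pinhole projection of 3D points onto 2D image coordinates."""
    return [(focal * x / z, focal * y / z) for x, y, z in points]

# Latent structure s: three points in front of the camera (illustrative values).
structure = [(0.0, 0.0, 5.0), (1.0, 0.5, 6.0), (-1.0, -0.5, 4.0)]

# Image sequence i_t = pi(m_t(s)), sampled at t = 0, 1, 2, 3.
images = [project(motion(t)(structure)) for t in range(4)]

for t, image in enumerate(images):
    print(t, image)
```

Here the "image" is only a list of projected point coordinates. In the setting described above, the structure and motion are latent and only the image sequence is observed.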