Robot Learning in Homes: Improving Generalization and Reducing Dataset Bias

Introduction

The use of data-driven approaches in robotics has grown over the last decade. Instead of relying on hand-designed models, these approaches train on large-scale datasets and learn policies that map high-dimensional observations to actions. Because collecting data with a physical robot in real time is very expensive, most data-driven approaches in robotics collect their data in simulators. The concern is whether policies learned this way are robust enough to domain shift to be used on real-world data, since there is a wide reality gap between simulators and the real world.

On the other hand, the declining cost of hardware has pushed the robotics community to collect real-world physical data for a variety of tasks. This approach has been quite successful for tasks such as grasping, pushing, poking and imitation learning. However, the major problem is that the performance of these learned models is not good enough and tends to plateau quickly. Furthermore, robotic action data has not led to gains comparable to those seen in other areas such as computer vision and natural language processing. The paper claims that the solution to these obstacles is to use “real data”: current robotic datasets lack diversity of environment, so learning-based approaches need to move out of simulators and labs and into real environments such as homes, where they can learn from real datasets.

Collecting real data and working with it has several challenges. First, cheap and compact robots are needed to collect data in homes, but current industrial robots (e.g. Sawyer and Baxter) are too expensive. Second, cheap robots are not accurate enough to collect reliable data. Third, data collection in homes cannot be supervised at all times. These challenges, together with other external factors, result in noisy data. The paper presents a first systematic effort to collect a dataset inside homes, which consists of the following parts:

- A cheap robot that is suitable for use in homes

- Training data collected in 6 different homes and test data collected in 3 homes

- An approach for modeling the noise in the labeled data


Overview

This paper emphasizes the importance of diversifying the data used for robot learning in order to achieve better generalization. A diverse dataset also helps remove biases in the data. The paper argues that, even for simple tasks like grasping, datasets collected in labs suffer from strong biases such as simple backgrounds and identical environment dynamics. As a result, models learned from them do not generalize well to real-world data.

Collecting large-scale data inside a large number of homes requires a low-cost robot. For this reason, the authors built a customized mobile manipulator: a Dobot Magician robotic arm mounted on a Kobuki low-cost mobile base. The resulting robot arm has five degrees of freedom (DOF). An Intel R200 RGBD camera is mounted on the robot at a height of 1 m above the ground, and an on-board laptop with an Intel Core i5 processor performs all the processing. The whole system can run for 1.5 hours on a single charge.

This low cost comes with a trade-off in control accuracy. Because the robot is built from cheaper components than expensive setups such as Baxter and Sawyer, it suffers from higher calibration and execution errors. As a result, the dataset collected with it is diverse and large but has noisy labels. To illustrate, suppose the robot is commanded to grasp at location [math]\displaystyle{ {(x, y)} }[/math]. Because of execution noise, it may actually perform the grasp at [math]\displaystyle{ {(x + \delta_{x}, y+ \delta_{y})} }[/math], so the success or failure label of the action is assigned to the wrong location. To solve this problem, the authors use an approach that learns from noisy data: they model the noise as a latent variable and use two networks, one for predicting the noise and one for predicting the action to execute.
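
The mislabeling effect can be illustrated with a small simulation. This is only a sketch: the zero-mean Gaussian noise, the pixel units and the helper names `execute_grasp` and `record_datapoint` are assumptions made for illustration, not the noise model or code used in the paper.

```python
import random


def execute_grasp(x, y, sigma, rng):
    """Return the location actually reached when (x, y) is commanded.

    Assumption: zero-mean Gaussian execution noise (delta_x, delta_y).
    """
    delta_x = rng.gauss(0.0, sigma)
    delta_y = rng.gauss(0.0, sigma)
    return (x + delta_x, y + delta_y)


def record_datapoint(commanded, executed, success):
    """Store the label against the commanded location.

    The outcome was produced at the executed location, so the label is
    shifted by `label_offset` relative to where it really belongs.
    """
    offset = ((executed[0] - commanded[0]) ** 2
              + (executed[1] - commanded[1]) ** 2) ** 0.5
    return {"location": commanded, "label": success, "label_offset": offset}


rng = random.Random(0)            # fixed seed so the example is repeatable
commanded = (120.0, 80.0)         # hypothetical grasp location
executed = execute_grasp(commanded[0], commanded[1], sigma=5.0, rng=rng)
datapoint = record_datapoint(commanded, executed, success=True)
```

The stored `location` is the commanded one, while `label_offset` measures how far away the grasp that produced the label actually happened.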


Learning on low cost robot data

The architecture presented in the paper uses the patch grasping framework. As mentioned above, datasets collected by cheap, inaccurate robots are prone to noisy labels. The noise can come from hardware execution error, inaccurate kinematics, camera calibration, proprioception, wear and tear, etc. The parts of the architecture are explained below:


- Grasping Formulation

The architecture focuses on planar grasping, which means that all objects are grasped at the same height and perpendicular to the ground. The goal is to find [math]\displaystyle{ {(x, y, \theta)} }[/math] given an observation [math]\displaystyle{ {I} }[/math] of the object, where [math]\displaystyle{ {x} }[/math] and [math]\displaystyle{ {y} }[/math] are the translational degrees of freedom and [math]\displaystyle{ {\theta} }[/math] is the rotational degree of freedom. For the purpose of comparison, they used a model which does not predict [math]\displaystyle{ {(x, y, \theta)} }[/math] directly from the image [math]\displaystyle{ {I} }[/math], but instead samples several smaller patches [math]\displaystyle{ {I_{P}} }[/math] at different locations [math]\displaystyle{ {(x, y)} }[/math]; the grasp angle [math]\displaystyle{ {\theta} }[/math] is then predicted from these patches. To allow multimodal predictions, the angle [math]\displaystyle{ {\theta} }[/math] is discretized into steps, denoted [math]\displaystyle{ {\theta_{D}} }[/math].

Hence, each datapoint consists of an image [math]\displaystyle{ {I} }[/math], the executed grasp [math]\displaystyle{ {(x, y, \theta)} }[/math] and the grasp success/failure label [math]\displaystyle{ {g} }[/math]. The image [math]\displaystyle{ {I} }[/math] and the angle [math]\displaystyle{ {\theta} }[/math] are then converted to an image patch [math]\displaystyle{ {I_{P}} }[/math] and a discrete angle [math]\displaystyle{ {\theta_{D}} }[/math], and a binary cross-entropy loss is used to minimize the classification error.
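
The conversion from a continuous angle to a discrete one and the resulting loss can be sketched in plain Python. The number of angle bins, the choice to take angles modulo π and the function names are assumptions made for this illustration; the text above does not specify them.

```python
import math

NUM_ANGLE_BINS = 18  # assumption: the number of discrete angle steps is not given above


def discretize_angle(theta, num_bins=NUM_ANGLE_BINS):
    """Map a grasp angle theta (radians) to a discrete bin index theta_D.

    Assumption: angles are taken modulo pi and split into equal-width bins.
    """
    theta = theta % math.pi
    return min(int(theta / math.pi * num_bins), num_bins - 1)


def binary_cross_entropy(p, g, eps=1e-7):
    """Binary cross-entropy between a predicted probability p and a label g."""
    p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
    return -(g * math.log(p) + (1 - g) * math.log(1.0 - p))


def grasp_loss(predicted_probs, theta, g):
    """Loss for one datapoint.

    predicted_probs holds one success probability per angle bin for a patch.
    Only the bin of the executed angle has a label, so only it is penalized.
    """
    theta_d = discretize_angle(theta, len(predicted_probs))
    return binary_cross_entropy(predicted_probs[theta_d], g)


# A patch with an uninformative prediction of 0.5 for every angle bin.
loss = grasp_loss([0.5] * NUM_ANGLE_BINS, theta=0.3, g=1)
```

Because each datapoint carries a label for only one executed angle, the loss touches a single bin while the model still outputs a prediction for every bin, which is what makes the predictions multimodal.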


- Modeling noise as latent variable

To tackle the problem of inaccurate position control, they identified structure in the noise that depends on the robot and the design. They modeled this structure as a latent variable, as shown in Figure 2:



The grasp success probability for image patch [math]\displaystyle{ {I_{P}} }[/math] at angle [math]\displaystyle{ {\theta_{D}} }[/math] is represented as [math]\displaystyle{ {P(g|I_{P},\theta_{D};R )} }[/math] where [math]\displaystyle{ {R} }[/math] represents environment variables that can add noise to the system.

The conditional probability of grasping for this model is computed by:


\[ P(g \mid I_{P}, \theta_{D}, R) = \sum_{\hat{I}_{P} \in \mathcal{P}} P(g \mid z = \hat{I}_{P}, \theta_{D}, R) \cdot P(z = \hat{I}_{P} \mid \theta_{D}, I_{P}, R) \]


Here, [math]\displaystyle{ {z} }[/math] models the latent variable of the actual patch executed, and [math]\displaystyle{ {\hat{I}_{P}} }[/math] belongs to a set of possible neighboring patches [math]\displaystyle{ {\mathcal{P}} }[/math]. [math]\displaystyle{ {P(z = \hat{I}_{P} \mid \theta_{D}, I_{P}, R)} }[/math] represents the noise that can be caused by the variables [math]\displaystyle{ {R} }[/math] and is implemented as the Noise Modelling Network (NMN). [math]\displaystyle{ {P(g \mid z = \hat{I}_{P}, \theta_{D}, R)} }[/math] represents the grasp prediction probability given the true patch and is implemented as the Grasp Prediction Network (GPN). The overall Robust-Grasp model is computed by marginalizing GPN and NMN.
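
The marginalization can be sketched as a weighted sum over candidate patches. The function name and the example probabilities below are hypothetical; in the paper the two lists would be the outputs of the GPN and the NMN.

```python
def marginal_grasp_probability(gpn_probs, nmn_probs):
    """Combine the two network outputs into one grasp success probability.

    gpn_probs[i]: success probability if candidate patch i was the one
                  actually executed (GPN output for the given angle).
    nmn_probs[i]: probability that candidate patch i was the one
                  actually executed (NMN output).
    """
    if len(gpn_probs) != len(nmn_probs):
        raise ValueError("both lists must cover the same candidate patches")
    if abs(sum(nmn_probs) - 1.0) > 1e-6:
        raise ValueError("nmn_probs must be a probability distribution")
    # Sum over the latent variable z (the patch that was really executed).
    return sum(p_g * p_z for p_g, p_z in zip(gpn_probs, nmn_probs))


# Hypothetical values for three candidate patches.
gpn_probs = [0.9, 0.6, 0.3]
nmn_probs = [0.5, 0.3, 0.2]
p_success = marginal_grasp_probability(gpn_probs, nmn_probs)
```

If the noise distribution puts all of its mass on the commanded patch, the result reduces to the plain GPN prediction for that patch, which corresponds to a robot with no execution noise.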


- Learning the latent noise model

They assume that [math]\displaystyle{ {z} }[/math] is conditionally independent of the local patch-specific variables [math]\displaystyle{ {(I_{P}, \theta_{D})} }[/math] given the environment variables [math]\displaystyle{ {R} }[/math]. To estimate the latent variable [math]\displaystyle{ {z} }[/math], they used direct optimization to learn both NMN and GPN with noisy labels. The entire image of the scene and the environment information are the inputs of the NMN, and its output is a probability distribution over the actual patches where the grasps are executed. Finally, a binary cross-entropy loss is applied to the marginalized output of these two networks and the true grasp label [math]\displaystyle{ {g} }[/math].
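
Under the independence assumption, the noise term no longer depends on the patch or the angle, and the training objective can be written out explicitly. The following is a sketch based on the description above rather than a formula quoted from the paper: [math]\displaystyle{ {\hat{I}_{P}} }[/math] denotes a candidate patch, [math]\displaystyle{ {\mathcal{P}} }[/math] the set of neighboring patches, and the symbols [math]\displaystyle{ {p} }[/math] and [math]\displaystyle{ {\mathcal{L}} }[/math] are introduced here only for illustration.

\[ p = \sum_{\hat{I}_{P} \in \mathcal{P}} P(g = 1 \mid z = \hat{I}_{P}, \theta_{D}, R) \cdot P(z = \hat{I}_{P} \mid R), \qquad \mathcal{L} = -\left[ g \log p + (1 - g) \log (1 - p) \right] \]

Since the loss is computed on the marginal [math]\displaystyle{ {p} }[/math], its gradient reaches both networks, which is how the noise distribution can be learned without any direct label for [math]\displaystyle{ {z} }[/math].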


- Training details

They implemented their model in PyTorch using a pretrained ResNet-18 model. For the NMN, they concatenated the 512-dimensional ResNet feature with a one-hot vector of the robot ID and the raw pixel location of the grasp. The inputs of the GPN are the original noisy patch plus 8 other patches equidistant from it. Training starts with the GPN alone for 5 epochs over the data. The NMN and the marginalization operator are then added to the model, and the NMN and GPN are trained simultaneously for the remaining 25 epochs.
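
The input layout and the two-stage schedule can be sketched without any deep learning library. Everything below is illustrative: the function names, the number of robots, the patch spacing and the choice to place the 8 neighbors on a circle around the original patch are assumptions, not details confirmed by the text.

```python
import math

GPN_ONLY_EPOCHS = 5   # stage 1: only the GPN is trained
JOINT_EPOCHS = 25     # stage 2: NMN and GPN are trained together


def training_schedule():
    """Return a list of (epoch, networks trained in that epoch)."""
    schedule = []
    for epoch in range(GPN_ONLY_EPOCHS + JOINT_EPOCHS):
        if epoch < GPN_ONLY_EPOCHS:
            schedule.append((epoch, ("GPN",)))
        else:
            schedule.append((epoch, ("GPN", "NMN")))
    return schedule


def candidate_patch_centers(x, y, spacing):
    """Return the original patch center followed by 8 neighbors.

    Assumption: "equidistant" is read as 8 centers on a circle of radius
    `spacing` around (x, y), placed at 45-degree steps.
    """
    centers = [(x, y)]
    for k in range(8):
        angle = 2.0 * math.pi * k / 8
        centers.append((x + spacing * math.cos(angle),
                        y + spacing * math.sin(angle)))
    return centers


def nmn_input(resnet_feature, robot_id, num_robots, grasp_pixel):
    """Concatenate image feature, one-hot robot ID and raw grasp pixel location."""
    one_hot = [1.0 if i == robot_id else 0.0 for i in range(num_robots)]
    return (list(resnet_feature) + one_hot
            + [float(grasp_pixel[0]), float(grasp_pixel[1])])


schedule = training_schedule()
centers = candidate_patch_centers(100.0, 100.0, spacing=10.0)
# Hypothetical setup: 5 robots, a zero-filled 512-dimensional feature.
features = nmn_input([0.0] * 512, robot_id=2, num_robots=5, grasp_pixel=(64, 48))
```

With these assumed values the NMN input has 512 + 5 + 2 entries, and the schedule covers 30 epochs in total, the first 5 of which update only the GPN.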