Robot Learning in Homes: Improving Generalization and Reducing Dataset Bias


Introduction

The use of data-driven approaches in robotics has increased in the last decade. Instead of relying on hand-designed models, these approaches use large-scale datasets to learn policies that map high-dimensional observations to actions. Since collecting data with an actual robot in real time is very expensive, most data-driven approaches in robotics use simulators to collect data. The concern that arises is whether models trained this way are robust to domain shift and can be used on real-world data. It is an undeniable fact that there is a wide reality gap between simulators and the real world.

On the other hand, declining hardware costs have pushed the robotics community to collect real-world physical data for a variety of tasks. This approach has been quite successful at tasks such as grasping, pushing, poking, and imitation learning. However, the performance of these learned models is still not good enough and tends to plateau quickly. Furthermore, robotic action data has not led to gains comparable to those seen in other areas such as computer vision and natural language processing. As the paper argues, the key to overcoming these obstacles is "real data": current robotic datasets lack environmental diversity. Learning-based approaches need to move out of the simulators in labs and into real environments, such as real homes, so that they can learn from real datasets.

Like every other process, collecting and working with real data poses several challenges. First, cheap and compact robots are needed to collect data in homes, while current industrial robots (e.g., Sawyer and Baxter) are too expensive. Second, cheap robots are not accurate enough to collect reliable data. Moreover, data collection in homes cannot be supervised at all times. These challenges, together with other external factors, result in noisy data. This paper presents a first systematic effort at collecting a dataset inside homes, which comprises:

- A cheap robot suitable for use in homes

- Training data collected in 6 different homes and testing data collected in 3 homes

- An approach for modeling the noise in the labeled data


Overview

This paper emphasizes the importance of diversifying the data for robotic learning in order to achieve greater generalization. A diverse dataset also makes it possible to remove biases from the data. The paper argues that even for simple tasks like grasping, datasets collected in labs suffer from strong biases, such as simple backgrounds and identical environment dynamics. Hence, models learned on such data fail to generalize and do not work well on real datasets.

Looking ahead, collecting large-scale data inside a huge number of homes will require a low-cost robot. For this reason, the authors introduce a customized mobile manipulator: a Dobot Magician robotic arm mounted on a Kobuki, a low-cost mobile base. The resulting robot arm has five degrees of freedom (DOF). An Intel R200 RGBD camera is mounted at a height of 1 m above the ground, and an on-board laptop with an Intel Core i5 processor performs all the processing. The whole system can run for 1.5 hours on a single charge.

As always, there is a trade-off: gaining a low-cost robot means losing control accuracy. The low-cost robot, built from cheaper components than expensive setups such as Baxter and Sawyer, suffers from higher calibration and execution errors. This means that the dataset collected with this approach is diverse and huge, but its labels are noisy. To illustrate, consider a robot that intends to grasp at location [math]\displaystyle{ {(x, y)} }[/math]. Because of execution noise, it may actually grasp at location [math]\displaystyle{ {(x + \delta_{x}, y+ \delta_{y})} }[/math], so the success or failure label of the action is assigned to the wrong place. To solve this problem, they use an approach that learns from noisy data: the noise is modeled as a latent variable, and two networks are used, one to predict the noise and one to predict the action to execute.


Learning on low-cost robot data

The architecture presented in the paper builds on the patch grasping framework. As mentioned before, datasets collected by cheap, inaccurate robots are highly prone to noisy labels. The label noise can stem from hardware execution error, inaccurate kinematics, camera calibration, proprioception, wear and tear, etc. The different parts of the architecture are explained below:


Grasping Formulation

Planar grasping is the task of interest in this architecture: all objects are grasped at the same height, with the gripper perpendicular to the ground. The goal is to find [math]\displaystyle{ {(x, y, \theta)} }[/math] given an observation [math]\displaystyle{ {I} }[/math] of the object, where [math]\displaystyle{ {x} }[/math] and [math]\displaystyle{ {y} }[/math] are the translational degrees of freedom and [math]\displaystyle{ {\theta} }[/math] is the rotational degree of freedom. Following prior work, the model does not predict [math]\displaystyle{ {(x, y, \theta)} }[/math] directly from the image [math]\displaystyle{ {I} }[/math]; instead, it samples several smaller patches [math]\displaystyle{ {I_{P}} }[/math] at different locations [math]\displaystyle{ {(x, y)} }[/math], and the angle of grasp [math]\displaystyle{ {\theta} }[/math] is predicted from these patches. To allow multimodal predictions, the angle [math]\displaystyle{ {\theta} }[/math] is discretized into bins [math]\displaystyle{ {\theta_{D}} }[/math].

Hence, each datapoint consists of an image [math]\displaystyle{ {I} }[/math], the executed grasp [math]\displaystyle{ {(x, y, \theta)} }[/math], and the grasp success/failure label [math]\displaystyle{ {g} }[/math]. The image [math]\displaystyle{ {I} }[/math] and the angle [math]\displaystyle{ {\theta} }[/math] are converted to an image patch [math]\displaystyle{ {I_{P}} }[/math] and a discrete angle [math]\displaystyle{ {\theta_{D}} }[/math]. A binary cross-entropy loss is then minimized to reduce the classification error.
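
To make this concrete, below is a minimal PyTorch sketch of converting one datapoint into a training example. The patch size, the number of angle bins, and the `gpn` network are illustrative assumptions; only the overall recipe (crop a patch, discretize the angle, apply binary cross-entropy) comes from the paper.

```python
import math

import torch
import torch.nn.functional as F

NUM_ANGLE_BINS = 18   # assumed discretization of θ into θ_D; the bin count is not given here
PATCH_SIZE = 224      # assumed patch size for a ResNet-style backbone

def make_example(image, x, y, theta):
    """Crop the patch I_P around the executed grasp (x, y) and discretize θ.
    Assumes the grasp point lies at least half a patch away from the border."""
    half = PATCH_SIZE // 2
    patch = image[:, y - half:y + half, x - half:x + half]         # I_P, (C, H, W)
    theta_d = int((theta % math.pi) / (math.pi / NUM_ANGLE_BINS))  # bin index θ_D
    return patch, theta_d

def grasp_bce_loss(gpn, patch, theta_d, success):
    """Binary cross-entropy between the predicted success of the executed
    angle bin and the observed grasp label g (success = 1.0 or 0.0)."""
    logits = gpn(patch.unsqueeze(0))        # (1, NUM_ANGLE_BINS), one logit per bin
    target = torch.tensor([success])
    return F.binary_cross_entropy_with_logits(logits[0, theta_d:theta_d + 1], target)
```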


Modeling noise as latent variable

To tackle the problem of inaccurate position control, the authors observe that the noise has a structure that depends on the robot and its design. They model this structure as a latent variable, as shown in Figure 2:



The grasp success probability for image patch [math]\displaystyle{ {I_{P}} }[/math] at angle [math]\displaystyle{ {\theta_{D}} }[/math] is represented as [math]\displaystyle{ {P(g|I_{P},\theta_{D};R)} }[/math], where [math]\displaystyle{ {R} }[/math] represents environment variables that can add noise to the system.

The conditional probability of grasping for this model is computed by:


\[ P(g|I_{P},\theta_{D},R) = \sum_{\hat{I}_{P} \in P} P(g|z=\hat{I}_{P},\theta_{D},R) \cdot P(z=\hat{I}_{P}|I_{P},\theta_{D},R) \]



Here, [math]\displaystyle{ {z} }[/math] is a latent variable modeling the patch at which the grasp was actually executed, and [math]\displaystyle{ {\hat{I}_{P}} }[/math] ranges over a set of possible neighboring patches [math]\displaystyle{ {P} }[/math]. [math]\displaystyle{ {P(z=\hat{I}_{P}|I_{P},\theta_{D},R)} }[/math] models the noise caused by the environment variables [math]\displaystyle{ {R} }[/math] and is implemented as the Noise Modelling Network (NMN). [math]\displaystyle{ {P(g|z=\hat{I}_{P},\theta_{D},R)} }[/math] is the grasp success probability given the true patch and is implemented as the Grasp Prediction Network (GPN). The overall Robust-Grasp model is computed by marginalizing GPN over NMN.
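
In code, this marginalization is a single weighted sum. A minimal sketch, assuming the GPN yields per-candidate success probabilities and the NMN a normalized distribution over the candidate patches:

```python
import torch

def marginalize(gpn_probs, nmn_probs):
    """P(g|I_P, θ_D, R) = Σ over Î_P of P(g|z=Î_P, θ_D, R) · P(z=Î_P|I_P, θ_D, R).

    gpn_probs: (B, K) success probability for each of the K candidate patches Î_P
    nmn_probs: (B, K) NMN distribution over the candidate patches (rows sum to 1)
    """
    return (gpn_probs * nmn_probs).sum(dim=1)   # (B,) marginal success probability
```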

Learning the latent noise model

They assume that [math]\displaystyle{ {z} }[/math] is conditionally independent of the local patch-specific variables [math]\displaystyle{ {(I_{P}, \theta_{D})} }[/math]. Rather than explicitly estimating the latent variable [math]\displaystyle{ {z} }[/math], they use direct optimization to learn both NMN and GPN from the noisy labels. The NMN takes as input the entire image of the scene together with the environment information, and outputs a probability distribution over the patches at which the grasp could actually have been executed. Finally, a binary cross-entropy loss is applied to the marginalized output of the two networks and the observed grasp label [math]\displaystyle{ {g} }[/math].
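
A sketch of the resulting objective, with the same hypothetical tensor shapes as above (the NMN is assumed to output raw logits over the K candidate patches):

```python
import torch
import torch.nn.functional as F

def robust_grasp_loss(gpn_probs, nmn_logits, g):
    """Direct optimization of both networks: binary cross-entropy between the
    marginalized success probability and the (possibly noisy) label g."""
    nmn_probs = torch.softmax(nmn_logits, dim=1)    # P(z = Î_P | ·), rows sum to 1
    p_success = (gpn_probs * nmn_probs).sum(dim=1)  # marginal P(g = 1 | I_P, θ_D, R)
    return F.binary_cross_entropy(p_success, g)     # gradients flow into NMN and GPN
```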


Training details

They implemented their model in PyTorch using a pretrained ResNet-18. For the NMN, the 512-dimensional ResNet feature is concatenated with a one-hot vector of the robot ID and the raw pixel location of the grasp. The inputs to the GPN are the original noisy patch plus 8 other patches equidistant from it. Training starts with the GPN alone for the first 5 epochs of the data; then the NMN and the marginalization operator are added, and NMN and GPN are trained simultaneously for the remaining 25 epochs.
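
The sketch below illustrates how the NMN inputs could be assembled and how the two-stage schedule reads in code. The number of robots, the candidate-patch count, and the overall wiring are assumptions consistent with the description above, not the authors' exact code.

```python
import torch
import torch.nn as nn
from torchvision import models

NUM_ROBOTS = 6   # size of the robot-ID one-hot vector; assumed, not stated
K = 9            # the original noisy patch plus 8 equidistant neighboring patches

class NMN(nn.Module):
    """Noise Modelling Network: 512-d ResNet-18 features of the full scene image,
    concatenated with the robot-ID one-hot vector and the raw (x, y) pixel
    location of the grasp, mapped to logits over the K candidate patches."""
    def __init__(self):
        super().__init__()
        resnet = models.resnet18(weights="IMAGENET1K_V1")
        self.backbone = nn.Sequential(*list(resnet.children())[:-1])  # (B, 512, 1, 1)
        self.head = nn.Linear(512 + NUM_ROBOTS + 2, K)

    def forward(self, scene, robot_onehot, grasp_xy):
        feats = self.backbone(scene).flatten(1)   # (B, 512)
        return self.head(torch.cat([feats, robot_onehot, grasp_xy], dim=1))

# Two-stage schedule: train the GPN alone for the first 5 epochs on the noisy
# patches, then add the NMN and the marginalization operator and train both
# networks jointly for the remaining 25 epochs.
```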

Results

In the results section, the authors show that collecting a dataset in homes is essential for generalizing to unseen environments, and that modeling the noise of their Low Cost Arm (LCA) improves grasping performance. They collected data in parallel using multiple robots in 6 different homes, as shown in Figure 3. Because the home environments are unstructured and the LCA has limited memory and computational capabilities, they used a lightweight object detector (tiny-YOLO) to locate objects. Different objects were scattered within a 2 m area, to prevent collisions of the robot with obstacles, and the robot moved randomly and grasped objects. In total, they collected a dataset of 28K grasp results.


To evaluate their approach more quantitatively, they used three test settings:

- The first is binary classification on held-out data. The test set is collected by performing random grasps on objects, and performance is measured by predicting the success or failure of a grasp given a location and an angle. Binary classification makes it possible to evaluate many models without running them on real robots. They collected two held-out datasets with the LCA (in the lab and in homes), along with a dataset from the Baxter robot.

- The second is Real Low Cost Arm (Real-LCA). Here, the model is evaluated by running it in three unseen homes, grasping 20 new objects placed in different orientations. Since both the objects and the environments are completely new, this test measures the generalization of the model.

- The third is Real Sawyer (Real-Sawyer). The model is run on the Sawyer robot, which is more accurate than the LCA. Testing in the lab environment shows that training models on datasets collected in homes can improve performance even in lab settings.

They used baselines for both their data, which is collected in homes (Home-LCA), and their model, Robust-Grasp. Two baseline datasets are considered: one collected by a Baxter robot in the lab (Lab-Baxter) and one collected by their LCA in the lab (Lab-LCA). They compared their model with the noise-independent patch grasping model (Patch-Grasp), and with DexNet-3.0 (DexNet) as a strong real-world grasping baseline.


Experiment 1: Performance on held-out data

Table 1 shows that the models trained on lab data cannot generalize to the Home-LCA environment, whereas the model trained on Home-LCA performs well on both the lab data and the home environment.


Experiment 2: Performance on Real LCA Robot

In Table 2, the performance of the model trained on Home-LCA is compared against a pretrained DexNet and a model trained on Lab-Baxter. Training on the Home-LCA dataset performs 43.7% better than training on Lab-Baxter and 33% better than DexNet. The low performance of DexNet can be explained by noise in the depth images caused by natural light: DexNet requires high-quality depth sensing and therefore cannot perform well in these conditions. Because the LCA uses cheap commodity RGBD cameras and does not depend on depth quality, the noise in the depth images is not a concern for their model.


Experiment 3: Performance on Real Sawyer

To compare the performance of the Robust-Grasp model against the Patch-Grasp model, they also used data from Lab-Baxter, an accurate robot. The Sawyer robot is used for testing to ensure that the testing robot differs from both training robots. As shown in Table 3, the Robust-Grasp model trained on Home-LCA outperforms the Patch-Grasp model, achieving 77.5% accuracy. Furthermore, the visualizations of the predicted noise corrections in Figure 4 show that the corrections depend on both the pixel location of the noisy grasp and the robot.


Conclusion

In summary, the paper presents an approach for collecting large-scale robot data in real home environments. The approach is implemented with a mobile manipulator that is far cheaper than existing industrial robots, and a dataset of 28K grasps was collected in six different homes. To deal with the noisy labels caused by the inaccurate robot, the authors present a framework that factors the noise out of the data. The model was tested by physically grasping 20 new objects in three new homes and in the lab. The model trained on the home dataset showed a 43.7% improvement over models trained on lab data, and it improved grasping performance even in lab environments. They also demonstrated that their architecture for modeling the noise improved performance by about 10%.