Robust Probabilistic Modeling with Bayesian Data Reweighting

From statwiki
Revision as of 02:08, 15 November 2018 by Cmstranc (talk | contribs)

This is a summary of the paper Robust Probabilistic Modeling with Bayesian Data Reweighting by Wang, Kucukelbir, and Blei.

Robust Probabilistic Modeling (RPM) through Bayesian Data Reweighting is an attempt to build probabilistic models that remain robust when some observations do not follow the same distribution as the rest of the data. That is, RPM aims to create models that accurately detect which observations are anomalies and give them less weight when fitting a distribution to the data.

Presented By

  • Qingxi Huo
  • Jiaqi Wang
  • Colin Stranc
  • Aditya Maheshwari
  • Yanmin Yang
  • Yuanjing Cai
  • Philomene Bobichon
  • Zepeng An

Introduction

Probabilistic modeling attempts to find a probability distribution under which the observed data set is as likely as possible. This works well when the data naturally follow a manageable probability distribution. However, typical probabilistic modeling methods can be thrown off quite quickly by even a few anomalous observations, which may arise for any number of reasons, including data collection or entry errors.

Bayesian Data Reweighting (BDR) is an attempt to build probabilistic models that gracefully handle anomalies arising from separate, unrelated distributions. BDR raises each observation's likelihood term to the power of a latent weight variable; the inferred weights indicate which observations are likely to have come from a separate distribution, and therefore which observations the model can safely downweight or ignore.

Motivation

Imagine a Netflix account belonging to a young child. She has watched many animated kids movies. Netflix accurately recommends other animated kids movies for her. One day her parents forget to switch to their Netflix account and watch a horror movie.

Recommendation models such as Poisson factorization struggle with this kind of corrupted data: they begin to recommend horror movies.

Graphically, the blue diamonds represent kids movies and the green circles horror movies. The kids movies lie close to each other along some axis. If they were the only observations, the original model would have no trouble identifying a satisfactory distribution. The addition of the horror movies, however, pulls the original model so it is centered at [math]\displaystyle{ \approx0.6 }[/math] instead of [math]\displaystyle{ 0 }[/math].

The reweighted model does not have this problem. It chooses to ignore the horror movies and accurately portrays the underlying distribution of the kids movies.

Likelihood Function

The goal of this probabilistic model is to maximize the likelihood function, given an assumed distribution family for the data and a prior on both the latent parameters and the weights.

[math]\displaystyle{ p \left( \beta, w | y \right) = \frac{1}{Z} p_{w}\left( w \right) p_{\beta}\left( \beta \right) \prod_{n=1}^{N} \left[ l \left( y_n | \beta \right)^{w_n} \right] }[/math]

We assume the data follow a distribution with parameters [math]\displaystyle{ \beta }[/math]. We then choose a prior for those [math]\displaystyle{ \beta }[/math]'s, and finally a prior for the weights as well. The prior on the weights must concentrate around [math]\displaystyle{ w_n \approx 1 }[/math]. Choices include [math]\displaystyle{ n }[/math] Beta distributions, a scaled Dirichlet distribution, or [math]\displaystyle{ n }[/math] Gamma distributions.
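The reweighted posterior above translates directly into code. The sketch below evaluates the unnormalized log of [math]\displaystyle{ p(\beta, w | y) }[/math] for a Normal(mu, 1) likelihood with a standard normal prior on mu and independent Beta(2, 1) priors on the weights; these concrete choices are illustrative, not the paper's experimental setup.

```python
import numpy as np
from scipy.stats import norm, beta

def reweighted_log_posterior(mu, w, y, a=2.0, b=1.0):
    """Unnormalized log p(beta, w | y) from the formula above, for a
    Normal(mu, 1) likelihood. Hypothetical priors: mu ~ N(0, 1) and
    w_n ~ Beta(a, b), an increasing density that favours weights near 1."""
    log_prior_mu = norm.logpdf(mu)                  # p_beta(beta)
    log_prior_w = beta.logpdf(w, a, b).sum()        # p_w(w)
    log_lik = (w * norm.logpdf(y, loc=mu)).sum()    # sum_n w_n * log l(y_n | beta)
    return log_prior_mu + log_prior_w + log_lik
```

Downweighting an implausible observation trades a small penalty under the weight prior for a large gain in the weighted likelihood, which is how the model can "ignore" anomalies.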

Inference and Computation

The posterior does not have a closed-form solution in all but the simplest of cases. Optimal values for [math]\displaystyle{ \beta }[/math] and [math]\displaystyle{ w }[/math] are therefore found using numerical optimization or approximate inference. The paper suggests using automated inference in the probabilistic programming system Stan.
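As a rough stand-in for Stan's automated inference, a MAP estimate of [math]\displaystyle{ (\beta, w) }[/math] can be obtained with a generic optimizer. The sketch below uses a hypothetical reweighted Normal(mu, 1) model with Beta(2, 1) weight priors (not the paper's setup) and parameterizes the weights on the logit scale so the optimization is unconstrained.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.special import expit
from scipy.stats import norm, beta

def fit_reweighted_map(y, a=2.0, b=1.0):
    """MAP estimate of (mu, w) for a reweighted Normal(mu, 1) model.
    A rough stand-in for the paper's automated inference in Stan;
    all prior choices here are illustrative."""
    n = len(y)

    def neg_log_post(theta):
        mu, w = theta[0], expit(theta[1:])           # logits -> weights in (0, 1)
        lp = norm.logpdf(mu)                         # prior on mu
        lp += beta.logpdf(w, a, b).sum()             # weight prior favouring w ~ 1
        lp += (w * norm.logpdf(y, loc=mu)).sum()     # reweighted log-likelihood
        return -lp

    theta0 = np.concatenate([[np.median(y)], np.full(n, 2.0)])  # start w near 0.88
    res = minimize(neg_log_post, theta0, method="L-BFGS-B")
    return res.x[0], expit(res.x[1:])
```

On data with a single gross outlier, the fitted weight of the outlier collapses toward 0 while the estimate of mu stays close to the bulk of the data.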

Example: Ignoring Outliers

Our observations are a router's wait times for packets. These wait times follow a [math]\displaystyle{ POIS\left( 5 \right) }[/math] distribution. The network can fail, in which case wait times follow a [math]\displaystyle{ POIS\left( 50 \right) }[/math] distribution instead.

A Gamma prior is chosen for the rate. The network is set to fail [math]\displaystyle{ 25 }[/math]% of the time.
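A toy reconstruction of this setup (the simulation and prior details here are illustrative, not the paper's exact configuration): draw wait times from POIS(5), replace 25% of them with POIS(50) failures, and fit a reweighted Poisson model by MAP with a Gamma prior on the rate and Beta(2, 1) priors on the weights.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.special import expit
from scipy.stats import poisson, gamma, beta

rng = np.random.default_rng(1)
n = 200
failed = rng.random(n) < 0.25                      # network fails 25% of the time
y = np.where(failed, rng.poisson(50.0, n), rng.poisson(5.0, n))

def neg_log_post(theta, a=2.0, b=1.0):
    """Reweighted Poisson model: Gamma prior on the rate (hyperparameters
    illustrative) and Beta(a, b) priors pulling each weight toward 1."""
    lam, w = np.exp(theta[0]), expit(theta[1:])    # rate kept positive via exp
    lp = gamma.logpdf(lam, 2.0, scale=2.0)         # Gamma prior on the rate
    lp += beta.logpdf(w, a, b).sum()
    lp += (w * poisson.logpmf(y, lam)).sum()       # reweighted log-likelihood
    return -lp

theta0 = np.concatenate([[np.log(np.median(y))], np.full(n, 2.0)])
res = minimize(neg_log_post, theta0, method="L-BFGS-B")
lam_hat, w_hat = np.exp(res.x[0]), expit(res.x[1:])
```

The failure observations receive weights near 0, so the recovered rate lands close to the true value of 5 rather than near the contaminated sample mean of roughly 16.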

Note that the reweighted models accurately recovered the rate, whereas the other models did not. Notice also that the reweighted models had a much smaller spread than most of the other models.

This shows that BDR can handle data from an unrelated distribution.

Example: Handling Missing Latent Groups

We attempt to predict the number of people who are colour blind, but we do not know whether each individual is male or female. Men are inherently more likely to be colour blind than women, so without gender information a standard logistic regression would misrepresent both groups. Bayesian data reweighting identifies the different distributions and reports only on the dominant group.

For this example we simulate data points from different Bernoulli distributions for men and women, and examine the results as the female fraction of the population varies.

Note here that the RPM model always contains the true mean in its [math]\displaystyle{ 95 }[/math]% credible interval. The localized model also contains the true mean, but its predictions are much less certain.

We can see that the RPM model successfully ignored the observations which came from the minority distribution, without any information on its existence.

Example: Lung Cancer Risk Study

We consider three models of lung cancer risk as a function of tobacco usage and obesity; in each, the true model and the assumed model differ by some form of covariate misspecification.

Note that RPM yields better estimates of [math]\displaystyle{ \beta_1 }[/math] in the first two models, but gives a result similar to the original model in the third, where obesity is ignored in the misspecified model.

This shows that RPM leverages the data points that are useful for estimating [math]\displaystyle{ \beta_1 }[/math], and that RPMs can only make use of the information available to them.

Example: Real Data (MovieLens 1M)

To test the model on real data, we use the MovieLens data set. It contains [math]\displaystyle{ 6000 }[/math] users' ratings of a total of [math]\displaystyle{ 4000 }[/math] movies. We train an RPM model on the clean data, then add varying degrees of random corruption and observe how the RPM handles the corrupted data.

Notice that the clean data yields weights almost entirely near [math]\displaystyle{ 1 }[/math]. Once corrupted data is introduced, we view only the weights for those corrupted data points; notice that their weights decrease as their ratings become more corrupted.
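The qualitative behaviour of the weights can be reproduced on toy data. The sketch below uses a hypothetical reweighted Normal model standing in for the paper's Poisson factorization: one "rating" is corrupted by a growing shift, and its inferred MAP weight falls accordingly.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.special import expit
from scipy.stats import norm, beta

rng = np.random.default_rng(3)
clean = rng.normal(3.5, 1.0, 50)                   # toy ratings centred at 3.5

def corrupted_point_weight(shift, a=2.0, b=1.0):
    """MAP weight of a single point shifted away from the bulk, under a
    reweighted Normal(mu, 1) model (all modelling choices illustrative)."""
    y = np.append(clean, 3.5 + shift)

    def neg_log_post(theta):
        mu, w = theta[0], expit(theta[1:])
        lp = norm.logpdf(mu, loc=3.5, scale=10.0)  # weak prior on mu
        lp += beta.logpdf(w, a, b).sum()           # weight prior favouring w ~ 1
        lp += (w * norm.logpdf(y, loc=mu)).sum()   # reweighted log-likelihood
        return -lp

    theta0 = np.concatenate([[np.median(y)], np.full(len(y), 2.0)])
    res = minimize(neg_log_post, theta0, method="L-BFGS-B")
    return expit(res.x[-1])                        # weight of the corrupted point

weights = [corrupted_point_weight(s) for s in (0.0, 3.0, 6.0)]
```

The more a point is corrupted, the smaller its inferred weight, mirroring the pattern reported for the MovieLens experiment.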