# Streaming Bayesian Inference for Crowdsourced Classification

Group 4 Paper Presentation Summary

By Jonathan Chow, Nyle Dharani, Ildar Nasirov

## Motivation

Crowdsourcing can be a useful tool for data generation in classification projects. Often this takes the form of online questions which many respondents will manually answer for payment. One example of this is Amazon’s Mechanical Turk.

The primary limitation with this form of acquiring data is that respondents are liable to submit incorrect responses.

The integrity of the data is then limited by how well ground-truth can be determined. The primary method for doing so is probabilistic inference. However, current methods are computationally expensive, lack theoretical guarantees, or are limited to specific settings.

## Dawid-Skene Model for Crowdsourcing

The one-coin Dawid-Skene model is popular for contextualizing crowdsourcing problems. For task $i$ in set $M$, let the ground-truth be the binary $y_i = {\pm 1}$. We get labels $X = {x_{ij}}$ where $j \in N$ is the index for that worker.

At each time step $t$, a worker $j = a(t)$ provides their label for an assigned task $i$. We denote responses up to time $t$ via superscript.

We let $x_{ij} = 0$ if worker $j$ has not completed task $i$. We assume that $P(x_{ij} = y_i) = p_j$. This implies that each worker is independent and has equal probability of correct labelling regardless of task. In crowdsourcing the data, we must determine how workers are assigned to tasks. We introduce two methods.

Under uniform sampling, workers are allocated to tasks such that each task is completed by the same number of workers, rounded to the nearest integer, and no worker completes a task more than once. This policy is given by
$\pi_{uni}(t) = argmin_{i \notin M_{a(t)}^t}\{ | N_i^t | \}$
Under uncertainty sampling, we assign more workers to tasks that are less certain. Assuming, we are able to estimate the posterior probability of ground-truth, we can allocate workers to the task with the lowest probability of falling into the predicted class. This policy is given by
$\pi_{us}(t) = argmin_{i \notin M_{a(t)}^t}\{ (max_{k \in \{\pm 1\}} ( P(y_i = k | X^t) ) \}$

We then need to aggregate the data. The simple method of majority voting makes predictions for a given task based on the class the most workers have assigned it, $\hat{y}_i = sign\{\sum_{j \in N_i} x_{ij}\}$.

## Streaming Bayesian Inference for Crowdsourced Classification

The aim of the SBIC algorithm is to estimate the posterior probability, $P(y, p | X^t, \theta)$ where $X^t$ are the observed responses at time $t$ and $\theta$ is our prior. We can then generate predictions $\hat{y}^t$ as the marginal probability over each $y_i$ given $X^t$, and $\theta$.

We factor $P(y, p | X^t, \theta)$ as $\prod_{I \in M} \mu_i^t (y_i) \prod_{j \in N} \nu_j^t (p_j)$ where $\mu_i^t$ corresponds to each task and $\nu_j^t$ to each worker.

We then sequentially optimize the factors $\mu^t$ and $\nu^t$. We begin by assuming that the worker accuracy follows a beta distribution with parameters $\alpha$ and $\beta$. Initialize the task factors $\mu_i^0(+1) = q$ and $\mu_i^0(-1) = 1 – q$ for all $i$.

When a new label is observed at time $t$, we update the $\nu_j^t$ of worker $j$. We then update $\mu_i$. These updates are given by

$\nu_j^t(p_j) \sim Beta(\sum_{i \in M_j^{t - 1}} \mu_i^{t - 1}(x_{ij}) + \alpha, \sum_{i \in M_j^{t - 1}} \mu_i^{t - 1}(-x_{ij}) + \beta)$
$\mu_i^t(y_i) \propto \begin{cases} \mu_i^{t - 1}(y_i)\overline{p}_j^t & x_{ij} = y_i \\ \mu_i^{t - 1}(y_i)(1 - \overline{p}_j^t) & x_{ij} \ne y_i \end{cases}$

where $\overline{p}_j^t = \frac{\sum_{i \in M_j^{t - 1}} \mu_i^{t - 1}(x_ij) + \alpha}{|M_j^{t - 1}| + \alpha + \beta }$.

We choose our predictions to be the maximum $\mu_i^t(k)$ for $k=-1,1$.

Depending on our ordering of labels $X$, we can select for different applications.

## Fast SBIC

The pseudocode for Fast SBIC is shown below.

As the name implies, the goal with this algorithm is speed. To facilitate this, we leave the order of $X$ unchanged.

We express $\mu_i^t$ in terms of its log-odds

$z_i^t = log(\frac{\mu_i^t(+1)}{ \mu_i^t(-1)}) = z_i^{t - 1} + x_{ij} log(\frac{\overline{p}_j^t}{1 - \overline{p}_j^t })$

where $x_i^0 = log(\frac{q}{1 - q})$.

The product chain then becomes a summation and removes the need to normalize each $\mu_i^t$. We use these log-odds to compute worker accuracy.

$\overline{p}_j^t = \frac{\sum_{i \in M_j^{t - 1}} sig(x_{ij} z_i^{t-1}) + \alpha}{|M_j^{t - 1}| + \alpha + \beta}$

where $sig(z_i^{t-1}) := \frac{1}{1 + exp(-z_i^{t - 1})} = \mu_i^{t - 1}(+1)$

The final predictions are made by choosing class $\hat{y}_i^T = sign(z_i^T)$. We see later that Fast SBIC has similar computational speed to majority voting.

## Sorted SBIC

To increase the accuracy of the SBIC algorithm in exchange for computational efficiency, we run the algorithm in parallel giving labels in different orders. The pseudocode for this algorithm is given below.

From the general discussion of SBIC, we know that predictions on task $i$ are more accurate toward the end of the collection process. This is a result of observing more data points and having run more updates on $\mu_i^t$ and $\nu_j^t$ to move them further from their prior. This means that task $i$ is predicted more accurately when its corresponding labels are seen closer to the end of the process.

We take advantage of this property by maintaining a distinct “view” of the log-odds for each task. When a label is observed, we update views for all tasks except the one for which the label was observed. At the end of the collection process, we process skipped labels. When run online, this process must be repeated at every timestep.

We see that sorted SBIC is slower than Fast SBIC by a factor of M, the number of tasks.

## Theoretical Analysis

The authors prove an exponential relationship between the error probability and number of labels per task. The two theorems, for the different sampling regimes, are presented below.

## Empirical Analysis

The SBIC algorithm is run on five real-world binary classification datasets. The results can be found in the table below. Other algorithms in the comparison are, from left to right, majority voting, expectation-maximization, mean-field, belief-propagation, montecarlo sampling, and triangular estimation.

In bold are the best performing algorithms for each dataset. We see that both versions of the SBIC algorithm are competitive, having similar prediction error to EM, AMF, and MC. All are considered state-of-the-art Bayesian algorithms.

The figure below shows the average time required to simulate predictions on synthetic data under an uncertainty sampling policy. We see that Fast SBIC is comparable to majority voting and significantly faster than the other algorithms. This speed improvement, coupled with comparable accuracy, makes the Fast SBIC algorithm powerful.

## Conclusion and Future Research

In conclusion, we have seen that SBIC is computationally efficient, accurate in practice, and has theoretical guarantees. In future, the authors intend to extend the algorithm to the multi-class case.

## Critique

In crowdsourcing data, the cost associated with collecting additional labels is not usually prohibitively expensive. As a result, if there is concern over ground-truth, paying for additional data to ensure $X$ is sufficiently dense may be the desired response as opposed to sacrificing ground-truth accuracy. This may result in the SBIC algorithm being less practically useful than intended.

## References

[1] Manino, Tran-Thanh, and Jennings. Streaming Bayesian Inference for Crowdsourced Classification. 33rd Conference on Neural Information Processing Systems, 2019