Fairness Without Demographics in Repeated Loss Minimization: Difference between revisions
Line 15: | Line 15: | ||
=Representation Disparity= | =Representation Disparity= | ||
If a user makes a query <math display="inline">Z \sim P</math>, the model <math display="inline">\theta \in \Theta</math> makes a prediction, and the user experiences loss <math display="inline">\ell (\theta; Z)</math>. The ''expected'' loss of of a model <math display="inline">\theta</math> is denoted as the risk <math display="inline">\mathcal{R}(\theta) = \mathbb{E}_{Z \sim P} [\ell (\theta; Z)] </math>. While the queries (i.e. | If a user makes a query <math display="inline">Z \sim P</math>, the model <math display="inline">\theta \in \Theta</math> makes a prediction, and the user experiences loss <math display="inline">\ell (\theta; Z)</math>. The ''expected'' loss of of a model <math display="inline">\theta</math> is denoted as the risk <math display="inline">\mathcal{R}(\theta) = \mathbb{E}_{Z \sim P} [\ell (\theta; Z)] </math>. While the queries (i.e. provide data to the model) are made by users from <math display="inline">K</math> latent groups such that <math display="inline">Z \sim P := \sum_{k \in [k]} \alpha_kP_k</math>, neither the actual population proportions <math display="inline">\{\alpha_k\}</math> nor the group distributions <math display="inline">\{P_k\}</math> are known. Therefore, Hashimoto et al.'s goal is to control the worst-case risk over ''all'' groups <math display="inline">K</math>: | ||
\begin{align} | \begin{align} |
Revision as of 10:42, 22 October 2018
This page contains the summary of the paper "Fairness Without Demographics in Repeated Loss Minimization" by Hashimoto, T. B., Srivastava, M., Namkoong, H., & Liang, P. which was published at the International Conference of Machine Learning (ICML) in 2018. In the following, an
Overview of the Paper
Introduction
Usually, machine learning models aim to minimize their average loss in order to achieve high overall accuracy. While this works well for the majority, minority groups suffer high error rates because they contribute less data to the model. This phenomenon is known as representation disparity and has been observed in many models that, for instance, recognize faces, or identify language. This disparity even increases as minority users suffer higher error rates, and therefore, are less likely to use the system in the future. This retention of users from minority groups from the system shrinks these groups further, and provide less data to the system. With less data the disparity becomes even worse - a phenomenon referred to as disparity amplification.
In this paper, Hashimoto et al. first show that empirical risk minimization (ERM) does not control the loss of minority groups, and thus causes representation disparity even if the model has been fair before. Second, the researchers try to mitigate representation disparity and its amplification by proposing the use of distributionally robust optimization (DRO). Indeed Hashimoto et al. are able to show that DRO can bound the loss for minority groups and is fair on models that ERM turns unfair.
Fairness
Hashimoto et al. follow the difference principle to achieve and measure fairness. It is defined as the maximization of the welfare of the worst-off group rather than the whole group (cf. utilitarianism).
Representation Disparity
If a user makes a query [math]\displaystyle{ Z \sim P }[/math], the model [math]\displaystyle{ \theta \in \Theta }[/math] makes a prediction, and the user experiences loss [math]\displaystyle{ \ell (\theta; Z) }[/math]. The expected loss of of a model [math]\displaystyle{ \theta }[/math] is denoted as the risk [math]\displaystyle{ \mathcal{R}(\theta) = \mathbb{E}_{Z \sim P} [\ell (\theta; Z)] }[/math]. While the queries (i.e. provide data to the model) are made by users from [math]\displaystyle{ K }[/math] latent groups such that [math]\displaystyle{ Z \sim P := \sum_{k \in [k]} \alpha_kP_k }[/math], neither the actual population proportions [math]\displaystyle{ \{\alpha_k\} }[/math] nor the group distributions [math]\displaystyle{ \{P_k\} }[/math] are known. Therefore, Hashimoto et al.'s goal is to control the worst-case risk over all groups [math]\displaystyle{ K }[/math]:
\begin{align} \mathcal{R}_{max}(\theta) := \underset{k \in [k]}{max} \mathcal{R}_k(\theta), \qquad \mathcal{R}_k(\theta) := \mathbb{E}_{P_k} [\ell (\theta; Z)] \end{align}
We say there is high representation disparity if the expected loss of the model [math]\displaystyle{ \mathcal{R}(\theta) }[/math] is low, but the worst-case risk [math]\displaystyle{ \mathcal{R}_{max}(\theta) }[/math] of one group is high.
Why Empirical Risk Minimization (ERM) does not work
Distributonally Robust Optimization (DRO)
At this point our goal is to minimize the worst-case group risk over a single time-step [math]\displaystyle{ \mathcal{R}_{max} (\theta^{(t)}) }[/math]. As previously mentioned, this is difficult to do because neither the population proportions [math]\displaystyle{ \{\alpha_k\} }[/math] nor group distributions [math]\displaystyle{ \{P_k\} }[/math] are known. Therefore, Hashimoto et al. developed an optimization technique that is robust "against all directions around the data generating distribution". This refers to the notion that DRO is robust to any group distribution [math]\displaystyle{ P_k }[/math] whose loss other optimization techniques such as ERM might try to optimize. To create this distributionally robustness, the optimizations risk function [math]\displaystyle{ \mathcal{R}_{dro} }[/math] has to "up-weigh" data [math]\displaystyle{ Z }[/math] that cause high loss [math]\displaystyle{ \ell(\theta, Z) }[/math]. In other words, the risk function has to over-represent mixture components (i.e. group distributions [math]\displaystyle{ \{P_k\} }[/math]) in relation to their original mixture weights (i.e. the population proportions [math]\displaystyle{ \{\alpha_k\} }[/math]) for groups that suffer high loss.
To do this we need to consider the worst-case loss (i.e. the highest risk) over all perturbations [math]\displaystyle{ P_k }[/math] around [math]\displaystyle{ P }[/math] within a certain limit. This limit is described by the [math]\displaystyle{ \chi^2 }[/math]-divergence (i.e. the distance, roughly speaking) between probability distributions. For two distributions [math]\displaystyle{ P }[/math] and [math]\displaystyle{ Q }[/math] the divergence is defined as [math]\displaystyle{ D_{\chi^2} (P || Q):= \int (\frac{dP}{dQ} - 1)^2 }[/math]. With the help of the [math]\displaystyle{ \chi^2 }[/math]-divergence, Hashimoto et al. define the chi-squared ball [math]\displaystyle{ \mathcal{B}(P,r) }[/math] around the probability distribution P. This ball is defined so that [math]\displaystyle{ \mathcal{B}(P,r) := \{Q \ll P : D_{\chi^2} (Q || P) \leq r \} }[/math]. It is exactly this ball that gives us the opportunity to consider the worst-case loss (i.e. the highest risk) over all perturbations [math]\displaystyle{ P_k }[/math] that lie inside the ball (i.e. within reasonable range) around the probability distribution [math]\displaystyle{ P }[/math]. This loss is given by
\begin{align} \mathcal{R}_{dro}(\theta, r) := \underset{Q \in \mathcal{B}(P,r)}{sup} \mathbb{E}_Q [\ell(\theta;Z)] \end{align}
which for [math]\displaystyle{ P:= \sum_{k \in [K]} \alpha_k P_k }[/math] for all models [math]\displaystyle{ \theta \in \Theta }[/math] where [math]\displaystyle{ r_k := (1/a_k -1)^2 }[/math] bounds the risk [math]\displaystyle{ \mathcal{R}_k(\theta) \leq \mathcal{R}_{dro} (\theta; r_k) }[/math] for each group with risk [math]\displaystyle{ \mathcal{R}_k(\theta) }[/math]. Furthermore, if we specify a lower bound on the group proportions [math]\displaystyle{ \alpha_{min} \leq min_{k \in [K]} \alpha_k }[/math], and define [math]\displaystyle{ r_{max} := (1/\alpha_{min} -1)^2 }[/math], the worst-case risk [math]\displaystyle{ \mathcal{R}_{max} (\theta) }[/math] can be controlled by [math]\displaystyle{ \mathcal{R}_{dro} (\theta; r_{max}) }[/math] by forming an upper bound that can be minimized.
Optimization of DRO
To minimize [math]\displaystyle{ \mathcal{R}_{dro}(\theta, r) := \underset{Q \in \mathcal{B}(P,r)}{sup} \mathbb{E}_Q [\ell(\theta;Z)] }[/math] Hashimoto et al. look at the dual of this maximization problem (i.e. every maximization problem can become transformed into a minimization problem and vice-versa). This dual is given by the minimization problem
\begin{align} \mathcal{R}_{dro}(\theta, r) = \underset{\eta \in \mathbb{R}}{inf} \left\{ F(\theta; \eta):= C\left(\mathbb{E}_P \left[ [\ell(\theta;Z) - \eta]_+^2 \right] \right)^\frac{1}{2} + \eta \right\} \end{align}
with [math]\displaystyle{ C = (2(1/a_{min} - 1)^2 + 1)^{1/2} }[/math]. [math]\displaystyle{ \eta }[/math] describes the dual variable (i.e. the variable that appears in creating the dual). Since [math]\displaystyle{ F(\theta; \eta) }[/math] involves an expectation [math]\displaystyle{ \mathbb{E}_P }[/math] over the data generating distribution [math]\displaystyle{ P }[/math], [math]\displaystyle{ F(\theta; \eta) }[/math] can be directly minimized. For convex losses [math]\displaystyle{ \ell(\theta;Z) }[/math], [math]\displaystyle{ F(\theta; \eta) }[/math] is convex, and can be minimized by performing a binary search over [math]\displaystyle{ \eta }[/math]. In their paper, Hashimoto et al. further show that optimizing [math]\displaystyle{ \mathcal{R}_{dro}(\theta, r_{max}) }[/math] at each time step controls the future worst-case risk [math]\displaystyle{ \mathcal{R}_{max} (\theta) }[/math], and therefore retention rates. That means if the initial group proportions satisfy [math]\displaystyle{ \alpha_k^{(0)} \geq a_{min} }[/math], and [math]\displaystyle{ \mathcal{R}_{dro}(\theta, r_{max}) }[/math] is optimized for every time step (and therefore [math]\displaystyle{ \mathcal{R}_{max} (\theta) }[/math] is minimized), [math]\displaystyle{ \mathcal{R}_{max}^T (\theta) }[/math] over all time steps is controlled. In other words, optimizing [math]\displaystyle{ \mathcal{R}_{dro}(\theta, r_{max}) }[/math] every time step is enough to avoid disparity amplification.