Fairness Without Demographics in Repeated Loss Minimization
This page contains the summary of the paper "Fairness Without Demographics in Repeated Loss Minimization" by Hashimoto, T. B., Srivastava, M., Namkoong, H., & Liang, P. which was published at the International Conference of Machine Learning (ICML) in 2018. In the following, an
Overview of the Paper
Introduction
Fairness
Example and Problem Setup
Why Empirical Risk Minimization (ERM) does not work
Distributonally Robust Optimization (DRO)
Risk Bounding Over Unknown Groups
At this point our goal is to minimize the worst-case group risk over a single time-step [math]\displaystyle{ \mathcal{R}_{max} (\theta^{(t)}) }[/math]. As previously mentioned, this is difficult to do because neither the population proportions [math]\displaystyle{ \{\alpha_k\} }[/math] nor group distributions [math]\displaystyle{ \{P_k\} }[/math] are known. Therefore, Hashimoto et al. developed an optimization technique that is robust "against all directions around the data generating distribution". This refers to the notion that DRO is robust to any group distribution [math]\displaystyle{ P_k }[/math] whose loss other optimization techniques such as ERM might try to optimize. To create this distributionally robustness, the optimizations risk function [math]\displaystyle{ \mathcal{R}_{dro} }[/math] has to "up-weigh" data [math]\displaystyle{ Z }[/math] that cause high loss [math]\displaystyle{ \ell(\theta, Z) }[/math]. In other words, the risk function has to over-represent mixture components (i.e. group distributions [math]\displaystyle{ \{P_k\} }[/math]) in relation to their original mixture weights (i.e. the population proportions [math]\displaystyle{ \{\alpha_k\} }[/math]) for groups that suffer high loss.
To do this we need to consider the worst-case loss (i.e. the highest risk) over all perturbations [math]\displaystyle{ P_k }[/math] around [math]\displaystyle{ P }[/math] within a certain limit. This limit is described by the [math]\displaystyle{ \chi^2 }[/math]-divergence (i.e. the distance, roughly speaking) between probability distributions. For two distributions [math]\displaystyle{ P }[/math] and [math]\displaystyle{ Q }[/math] the divergence is defined as [math]\displaystyle{ D_{\chi^2} (P || Q):= \int (\frac{dP}{dQ} - 1)^2 }[/math]. With the help of the [math]\displaystyle{ \chi^2 }[/math]-divergence, Hashimoto et al. define the chi-squared ball [math]\displaystyle{ \mathcal{B}(P,r) }[/math] around the probability distribution P. This ball is defined so that [math]\displaystyle{ \mathcal{B}(P,r) := \{Q \ll P : D_{\chi^2} (Q || P) \leq r \} }[/math]. It is exactly this ball that gives us the opportunity to consider the worst-case loss (i.e. the highest risk) over all perturbations [math]\displaystyle{ P_k }[/math] that lie inside the ball (i.e. within reasonable range) around the probability distribution [math]\displaystyle{ P }[/math]. This loss is given by
\begin{align} \mathcal{R}_{dro}(\theta, r) := \underset{Q \in \mathcal{B}(P,r)}{sup} \mathbb{E}_Q [\ell(\theta;Z)] \end{align}
which for [math]\displaystyle{ P:= \sum_{k \in [K]} \alpha_k P_k }[/math] for all models [math]\displaystyle{ \theta \in \Theta }[/math] where [math]\displaystyle{ r_k := (1/a_k -1)^2 }[/math] bounds the risk [math]\displaystyle{ \mathcal{R}_k(\theta) \leq \mathcal{R}_{dro} (\theta; r_k) }[/math] for each group with risk [math]\displaystyle{ \mathcal{R}_k(\theta) }[/math]. Furthermore, if we specify a lower bound on the group proportions [math]\displaystyle{ \alpha_{min} \leq min_{k \in [K]} \alpha_k }[/math], and define [math]\displaystyle{ r_{max} := (1/\alpha_{min} -1)^2 }[/math], the worst-case risk [math]\displaystyle{ \mathcal{R}_{max} (\theta) }[/math] can be controlled by [math]\displaystyle{ \mathcal{R}_{dro} (\theta; r_{max}) }[/math] by forming an upper bound that can be minimized.