Fairness Without Demographics in Repeated Loss Minimization

This page contains a summary of the paper "Fairness Without Demographics in Repeated Loss Minimization" by Hashimoto, T. B., Srivastava, M., Namkoong, H., & Liang, P., which was published at the International Conference on Machine Learning (ICML) in 2018. In the following, an overview of the paper is given.


=Distributionally Robust Optimization (DRO)=
==Risk Bounding Over Unknown Groups==


At this point our goal is to minimize the worst-case group risk over a single time-step <math display="inline">\mathcal{R}_{max} (\theta^{(t)}) </math>. As previously mentioned, this is difficult because neither the population proportions <math display="inline">\{\alpha_k\} </math> nor the group distributions <math display="inline">\{P_k\} </math> are known. Therefore, Hashimoto et al. developed an optimization technique that is robust "against '''''all''''' directions around the data generating distribution". This means that DRO is robust with respect to any group distribution <math display="inline">P_k </math>, including those whose losses other optimization techniques such as ERM would fail to control. To achieve this distributional robustness, the risk function <math display="inline">\mathcal{R}_{dro} </math> has to "up-weight" data <math display="inline">Z</math> that cause high loss <math display="inline">\ell(\theta, Z)</math>. In other words, the risk function has to over-represent mixture components (i.e. the group distributions <math display="inline">\{P_k\} </math>) relative to their original mixture weights (i.e. the population proportions <math display="inline">\{\alpha_k\} </math>) for groups that suffer high loss.
To do this we need to consider the worst-case loss (i.e. the highest risk) over all perturbations <math display="inline">P_k </math> around <math display="inline">P </math> within a certain limit. This limit is described by the <math display="inline">\chi^2</math>-divergence (roughly speaking, a distance) between probability distributions. For two distributions <math display="inline">P </math> and <math display="inline">Q </math> the divergence is defined as <math display="inline">D_{\chi^2} (P \| Q):= \int \left( \frac{dP}{dQ} - 1 \right)^2 dQ</math>. With the help of the <math display="inline">\chi^2</math>-divergence, Hashimoto et al. define the chi-squared ball <math display="inline">\mathcal{B}(P,r) := \{Q \ll P : D_{\chi^2} (Q \| P) \leq r \}</math> around the probability distribution <math display="inline">P </math>. It is exactly this ball that gives us the opportunity to consider the worst-case loss (i.e. the highest risk) over all perturbations <math display="inline">P_k </math> that lie inside the ball (i.e. within reasonable range) around <math display="inline">P </math>. This loss is given by
\begin{align}
\mathcal{R}_{dro}(\theta; r) := \underset{Q \in \mathcal{B}(P,r)}{\sup} \mathbb{E}_Q [\ell(\theta;Z)]
\end{align}
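To make the ball concrete, the following is a minimal sketch (not from the paper; the discrete distributions and the helper <code>chi2_divergence</code> are hypothetical) that evaluates <math display="inline">D_{\chi^2} (Q \| P)</math> for discrete distributions given as NumPy arrays and checks membership in <math display="inline">\mathcal{B}(P,r)</math>:
<syntaxhighlight lang="python">
import numpy as np

def chi2_divergence(q, p):
    """D_chi2(Q || P) = sum_z p(z) * (q(z)/p(z) - 1)^2 for discrete Q << P."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    support = p > 0
    assert np.all(q[~support] == 0), "Q must be absolutely continuous w.r.t. P"
    return float(np.sum(p[support] * (q[support] / p[support] - 1.0) ** 2))

# Hypothetical mixture: a majority group (80%) and a minority group (20%).
p = np.array([0.8, 0.2])                  # data generating distribution P
q = np.array([0.5, 0.5])                  # perturbation that up-weights the minority
r_k = (1 / 0.2 - 1) ** 2                  # radius r_k for a 20% group: 16
print(chi2_divergence(q, p))              # 0.5625
print(chi2_divergence(q, p) <= r_k)       # True: Q lies inside B(P, r_k)
</syntaxhighlight>
Even the perturbation that isolates the minority group entirely, <math display="inline">Q = (0, 1)</math>, has divergence <math display="inline">1/0.2 - 1 = 4 \leq r_k = 16</math>, so a ball of radius <math display="inline">r_k</math> is comfortably large enough to contain it.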


For the mixture <math display="inline">P:= \sum_{k \in [K]} \alpha_k P_k</math>, this bounds the risk of every group: <math display="inline">\mathcal{R}_k(\theta) \leq \mathcal{R}_{dro} (\theta; r_k)</math> holds for all models <math display="inline">\theta \in \Theta</math>, where <math display="inline">r_k := (1/\alpha_k -1)^2</math> and <math display="inline">\mathcal{R}_k(\theta)</math> is the risk of group <math display="inline">k</math>. Furthermore, if we specify a lower bound on the group proportions <math display="inline">\alpha_{min} \leq \min_{k \in [K]} \alpha_k</math> and define <math display="inline">r_{max} := (1/\alpha_{min} -1)^2</math>, the worst-case risk <math display="inline">\mathcal{R}_{max} (\theta) </math> can be controlled by <math display="inline">\mathcal{R}_{dro} (\theta; r_{max}) </math>, which forms an upper bound that can be minimized.
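As a quick numerical illustration (the value of <math display="inline">\alpha_{min}</math> here is hypothetical): if every group is assumed to make up at least 10% of the population, i.e. <math display="inline">\alpha_{min} = 0.1</math>, then
\begin{align}
r_{max} = (1/0.1 - 1)^2 = 9^2 = 81,
\end{align}
so guarding ever-smaller minorities requires a rapidly growing chi-squared ball, and hence a more conservative (higher) value of <math display="inline">\mathcal{R}_{dro}(\theta; r_{max})</math>.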
==Optimization of DRO==
To minimize <math display="inline">\mathcal{R}_{dro}(\theta; r) := \underset{Q \in \mathcal{B}(P,r)}{\sup} \mathbb{E}_Q [\ell(\theta;Z)]</math>, Hashimoto et al. look at the dual of this maximization problem (every maximization problem can be transformed into a minimization problem and vice-versa). This dual is given by the minimization problem
\begin{align}
\mathcal{R}_{dro}(\theta; r) = \underset{\eta \in \mathbb{R}}{\inf} \left\{ F(\theta; \eta):= C\left(\mathbb{E}_P \left[ [\ell(\theta;Z) - \eta]_+^2 \right] \right)^\frac{1}{2} + \eta \right\}
\end{align}
with <math display="inline">C = (2(1/\alpha_{min} - 1)^2 + 1)^{1/2}</math> and dual variable <math display="inline">\eta \in \mathbb{R}</math> (the variable introduced by the duality transformation). Since <math display="inline">F(\theta; \eta)</math> only involves an expectation <math display="inline">\mathbb{E}_P</math> over the data generating distribution <math display="inline">P</math>, it can be estimated from samples and minimized directly. For convex losses <math display="inline">\ell(\theta;Z)</math>, <math display="inline">F(\theta; \eta)</math> is convex, and can be minimized by performing a binary search over <math display="inline">\eta</math>. In their paper, Hashimoto et al. further show that optimizing <math display="inline">\mathcal{R}_{dro}(\theta; r_{max})</math> at each time step controls the ''future'' worst-case risk <math display="inline">\mathcal{R}_{max} (\theta) </math>, and therefore retention rates. That means if the initial group proportions satisfy <math display="inline">\alpha_k^{(0)} \geq \alpha_{min}</math>, and <math display="inline">\mathcal{R}_{dro}(\theta; r_{max})</math> is optimized at every time step (and therefore <math display="inline">\mathcal{R}_{max} (\theta) </math> is minimized), the worst-case risk <math display="inline">\mathcal{R}_{max}^T (\theta) </math> over all <math display="inline">T</math> time steps is controlled. In other words, optimizing <math display="inline">\mathcal{R}_{dro}(\theta; r_{max})</math> at every time step is enough to avoid disparity amplification.
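The following is a minimal sketch of this inner minimization (assuming per-example losses are available as a NumPy array; the helper names <code>dual_objective</code> and <code>dro_risk</code> and the bracketing scheme are our own, not the authors' implementation). It evaluates <math display="inline">F(\theta; \eta)</math> and minimizes it over <math display="inline">\eta</math> by exploiting convexity:
<syntaxhighlight lang="python">
import numpy as np

def dual_objective(losses, eta, C):
    """F(eta) = C * sqrt(E[(l - eta)_+^2]) + eta  (dual form of the DRO risk)."""
    excess = np.maximum(losses - eta, 0.0)        # [l - eta]_+
    return C * np.sqrt(np.mean(excess ** 2)) + eta

def dro_risk(losses, alpha_min, tol=1e-6):
    """Minimize F over eta; F is convex in eta, so a bracketing search suffices."""
    C = np.sqrt(2.0 * (1.0 / alpha_min - 1.0) ** 2 + 1.0)
    lo, hi = losses.min() - 1.0, losses.max() + 1.0
    # Extend the bracket left while F still decreases to the left of lo.
    while dual_objective(losses, lo - 1.0, C) < dual_objective(losses, lo, C):
        lo -= 2.0 * (hi - lo)
    lo -= 1.0                                     # minimizer now lies in [lo, hi]
    while hi - lo > tol:                          # ternary search on the convex F
        m1, m2 = lo + (hi - lo) / 3, hi - (hi - lo) / 3
        if dual_objective(losses, m1, C) <= dual_objective(losses, m2, C):
            hi = m2
        else:
            lo = m1
    return dual_objective(losses, (lo + hi) / 2, C)

# Hypothetical per-example losses under the current model theta.
losses = np.array([0.1, 0.2, 0.15, 2.5, 3.0])    # two examples suffer high loss
print(dro_risk(losses, alpha_min=0.4))           # exceeds the average loss of 1.19
</syntaxhighlight>
In training, one would alternate this inner minimization over <math display="inline">\eta</math> with gradient steps on <math display="inline">\theta</math>; the term <math display="inline">[\ell(\theta;Z) - \eta]_+</math> makes clear that only examples with loss above <math display="inline">\eta</math> contribute gradient signal, which is exactly the up-weighting of high-loss groups described above.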
