# Fairness Without Demographics in Repeated Loss Minimization

This page contains a summary of the paper "Fairness Without Demographics in Repeated Loss Minimization" by Hashimoto, T. B., Srivastava, M., Namkoong, H., & Liang, P., which was published at the International Conference on Machine Learning (ICML) in 2018. In the following, an

## Contents

# Overview of the Paper

# Introduction

## Fairness

# Example and Problem Setup

# Why Empirical Risk Minimization (ERM) does not work

# Distributionally Robust Optimization (DRO)

## Risk Bounding Over Unknown Groups

At this point our goal is to minimize the worst-case group risk over a single time-step, [math]\mathcal{R}_{max} (\theta^{(t)}) [/math]. As previously mentioned, this is difficult because neither the population proportions [math]\{\alpha_k\} [/math] nor the group distributions [math]\{P_k\} [/math] are known. Therefore, Hashimoto et al. developed an optimization technique that is robust "against *all* directions around the data generating distribution". That is, DRO protects against any group distribution [math]P_k [/math] whose loss other optimization techniques, such as ERM, might neglect. To achieve this distributional robustness, the risk function [math]\mathcal{R}_{dro} [/math] has to "up-weight" data [math]Z[/math] that cause high loss [math]\ell(\theta; Z)[/math]. In other words, for groups that suffer high loss, the risk function has to over-represent their mixture components (i.e. the group distributions [math]\{P_k\} [/math]) relative to their original mixture weights (i.e. the population proportions [math]\{\alpha_k\} [/math]).

To do this we need to consider the worst-case loss (i.e. the highest risk) over all perturbations of [math]P[/math] within a certain limit. This limit is expressed through the [math]\chi^2[/math]-divergence, which, roughly speaking, measures the distance between two probability distributions. For two distributions [math]P[/math] and [math]Q[/math] the divergence is defined as [math]D_{\chi^2} (P || Q):= \int (\frac{dP}{dQ} - 1)^2 dQ[/math]. With the help of the [math]\chi^2[/math]-divergence, Hashimoto et al. define the chi-squared ball [math]\mathcal{B}(P,r)[/math] of radius [math]r[/math] around the probability distribution [math]P[/math] as [math]\mathcal{B}(P,r) := \{Q \ll P : D_{\chi^2} (Q || P) \leq r \}[/math]. This ball lets us consider the worst-case loss over all distributions [math]Q[/math] that lie inside it, i.e. all perturbations that stay within a reasonable range of [math]P[/math]. This loss is given by
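To make the definitions of the divergence and the ball concrete, the following minimal sketch evaluates them for discrete distributions, where the integral reduces to a sum. The function names `chi2_divergence` and `in_chi2_ball` and the example probabilities are illustrative assumptions and do not come from the paper.

```python
def chi2_divergence(p, q):
    """D_chi2(P || Q) = sum_i q_i * (p_i / q_i - 1)^2 for discrete P and Q.

    p and q are lists of probabilities over the same finite outcome set.
    Returns infinity when P is not absolutely continuous w.r.t. Q.
    """
    total = 0.0
    for p_i, q_i in zip(p, q):
        if q_i == 0.0:
            if p_i > 0.0:
                return float("inf")  # P puts mass where Q has none
            continue  # outcome has zero mass under both distributions
        total += q_i * (p_i / q_i - 1.0) ** 2
    return total


def in_chi2_ball(q, p, r):
    """True if Q lies in B(P, r), i.e. Q << P and D_chi2(Q || P) <= r."""
    return chi2_divergence(q, p) <= r


# Hypothetical example: P is the data distribution, Q a perturbation
# that shifts mass towards the second (rarer) outcome.
p = [0.8, 0.2]
q = [0.5, 0.5]
print(chi2_divergence(q, p))    # divergence of Q from P
print(in_chi2_ball(q, p, 1.0))  # is Q within radius 1 of P?
```

Note that the divergence is not symmetric: the ball is defined through [math]D_{\chi^2} (Q || P)[/math], so the perturbation is the first argument and the data distribution the second.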

\begin{align} \mathcal{R}_{dro}(\theta; r) := \sup_{Q \in \mathcal{B}(P,r)} \mathbb{E}_Q [\ell(\theta;Z)] \end{align}
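As a toy illustration of this supremum, consider the special case where [math]Z[/math] takes only two values. For [math]P = (p, 1-p)[/math] and [math]Q = (q, 1-q)[/math] the divergence simplifies to [math]D_{\chi^2}(Q || P) = (q-p)^2 / (p(1-p))[/math], so the worst case moves as much mass as the ball allows onto the high-loss outcome. The sketch below implements this closed form; it is a worked special case derived here under that two-outcome assumption, not the procedure used in the paper, and the function name and numbers are hypothetical.

```python
import math


def dro_risk_two_point(p_high, loss_high, loss_low, r):
    """Worst-case expected loss over B(P, r) for a two-outcome Z.

    P puts mass p_high on the outcome with loss loss_high (>= loss_low).
    The ball constraint is |q - p_high| <= sqrt(r * p_high * (1 - p_high)),
    and the supremum shifts mass towards the high-loss outcome.
    """
    assert 0.0 < p_high < 1.0
    assert loss_high >= loss_low
    assert r >= 0.0
    # Largest admissible mass on the high-loss outcome, capped at 1.
    q = min(1.0, p_high + math.sqrt(r * p_high * (1.0 - p_high)))
    return q * loss_high + (1.0 - q) * loss_low


# Hypothetical numbers: 20% of the data incur loss 1.0, the rest loss 0.1.
print(dro_risk_two_point(0.2, 1.0, 0.1, 0.0))   # r = 0 recovers the average risk
print(dro_risk_two_point(0.2, 1.0, 0.1, 1.0))   # moderate radius up-weights high loss
print(dro_risk_two_point(0.2, 1.0, 0.1, 16.0))  # large radius: all mass on high loss
```

With radius zero the ball contains only [math]P[/math] itself and the value is the ordinary expected loss; as the radius grows, the high-loss outcome is weighted more heavily until it receives all of the mass.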

For the mixture [math]P:= \sum_{k \in [K]} \alpha_k P_k[/math], Hashimoto et al. show that the DRO risk bounds the risk [math]\mathcal{R}_k(\theta)[/math] of every group: with the radius [math]r_k := (1/\alpha_k -1)^2[/math], we have [math]\mathcal{R}_k(\theta) \leq \mathcal{R}_{dro} (\theta; r_k)[/math] for all models [math]\theta \in \Theta[/math]. Furthermore, if we specify a lower bound on the group proportions [math]\alpha_{min} \leq \min_{k \in [K]} \alpha_k[/math] and define [math]r_{max} := (1/\alpha_{min} -1)^2[/math], then [math]r_k \leq r_{max}[/math] for every group. Because a larger ball can only increase the supremum, it follows that [math]\mathcal{R}_{max} (\theta) \leq \mathcal{R}_{dro} (\theta; r_{max}) [/math]. The worst-case risk is thus controlled by an upper bound that can be minimized and that depends on the groups only through [math]\alpha_{min}[/math].
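The relation between the group proportions and the radii can be checked with a few lines of arithmetic. The sketch below computes [math]r_k[/math] and [math]r_{max}[/math] from the formulas above; the function name `robustness_radius`, the proportions, and the chosen lower bound are made-up values for illustration only.

```python
def robustness_radius(alpha):
    """Radius r = (1 / alpha - 1)^2 associated with a group proportion alpha."""
    assert 0.0 < alpha <= 1.0
    return (1.0 / alpha - 1.0) ** 2


# Hypothetical group proportions (unknown in practice) and an assumed
# lower bound alpha_min that the practitioner is willing to specify.
alphas = [0.7, 0.2, 0.1]
alpha_min = 0.05

radii = [robustness_radius(a) for a in alphas]
r_max = robustness_radius(alpha_min)

for a, r in zip(alphas, radii):
    print(f"alpha_k = {a:.2f} -> r_k = {r:.4f}")
print(f"alpha_min = {alpha_min:.2f} -> r_max = {r_max:.4f}")
```

Smaller groups receive larger radii, and any [math]\alpha_{min}[/math] that does not exceed the smallest proportion yields a radius [math]r_{max}[/math] at least as large as every [math]r_k[/math].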