Task Understanding from Confusing Multi-task Data


Presented By

Qianlin Song, William Loh, Junyue Bai, Phoebe Choi

= Introduction =

= Related Work =

= Confusing Supervised Learning =

== Description of the Problem ==

Confusing supervised learning (CSL) offers a solution to this problem. The key improvement lies in the choice of risk measure. In traditional supervised learning, assuming the risk measure is mean squared error (MSE), the expected risk functional is

$$ R(g) = \int_x (f(x) - g(x))^2 p(x) \; \mathrm{d}x $$

where <math>p(x)</math> is the prior distribution of the input variable <math>x</math>. In practice, model optimization is performed using the empirical risk over the <math>m</math> training samples

$$ R_e(g) = \sum_{i=1}^m (y_i - g(x_i))^2 $$

When the problem involves different tasks, the model should optimize for each data point depending on the given task. Let <math>f_j(x)</math> be the ground-truth function for task <math>j</math>. Then, for an input <math>x_i</math> belonging to task <math>j</math>, an ideal model <math>g</math> would predict <math>g(x_i) = f_j(x_i)</math>. With this in mind, the risk functional of traditional supervised learning can be modified to account for multiple tasks.

$$ R(g) = \int_x \sum_{j=1}^n (f_j(x) - g(x))^2 p(f_j) p(x) \; \mathrm{d}x $$

We call <math>(f_j(x) - g(x))^2 p(f_j)</math> the '''confusing multiple mappings'''. Under this risk functional, the optimal solution <math>g^*(x)</math> is the weighted average <math>\bar{f}(x) = \sum_{j=1}^n p(f_j) f_j(x)</math>. However, this optimal solution is not conditioned on the specific task at hand; it averages over all of the ground-truth functions. Therefore, for every non-trivial set of tasks where <math>f_u(x) \neq f_v(x)</math> for some input <math>x</math> and <math>u \neq v</math>, we have <math>R(g^*) > 0</math>, which implies an unavoidable confusion risk.
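
To make the unavoidable confusion risk concrete, here is a minimal numerical sketch (not from the paper; the two tasks <math>f_1 = \sin</math>, <math>f_2 = \cos</math>, the uniform task priors, and the uniform input distribution are illustrative assumptions). It evaluates the risk of the best single mapping <math>\bar{f}</math>, which stays strictly positive because the two tasks disagree on most inputs.

<syntaxhighlight lang="python">
# Minimal sketch of the confusion risk; the tasks, priors, and input
# distribution are illustrative assumptions, not taken from the paper.
import numpy as np

rng = np.random.default_rng(0)

f = [np.sin, np.cos]        # two hypothetical ground-truth tasks f_1, f_2
p = np.array([0.5, 0.5])    # task priors p(f_j)

x = rng.uniform(-np.pi, np.pi, size=100_000)   # samples from p(x)

# Under MSE, the minimizer of R(g) is the task-averaged function f_bar.
f_bar = sum(pj * fj(x) for pj, fj in zip(p, f))

# R(f_bar) = sum_j p(f_j) * E_x[(f_j(x) - f_bar(x))^2], the confusion risk.
confusion_risk = sum(pj * np.mean((fj(x) - f_bar) ** 2) for pj, fj in zip(p, f))
print(f"confusion risk of the best single mapping: {confusion_risk:.3f}")  # about 0.25, not 0
</syntaxhighlight>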

== Learning Functions of CSL ==

To overcome this issue, the authors introduce two types of learning functions:

* '''Deconfusing function''' &mdash; allocation of which samples come from the same task
* '''Mapping function''' &mdash; mapping relation from input to output of every learned task

Suppose there are <math>n</math> ground-truth mappings <math>\{f_j : 1 \leq j \leq n\}</math> that we wish to approximate with a set of learning functions <math>\{g_k : 1 \leq k \leq l\}</math>. The authors define the deconfusing function as an indicator function <math>h(x, y, g_k)</math> which takes some sample <math>(x,y)</math> and determines whether the sample is assigned to task <math>g_k</math>. Under the CSL framework, the risk functional (mean squared loss) is

$$ R(g,h) = \int_x \sum_{j,k} (f_j(x) - g_k(x))^2 \; h(x, f_j(x), g_k) \;p(f_j) \; p(x) \;\mathrm{d}x $$

which can be estimated empirically with

$$R_e(g,h) = \sum_{i=1}^m \sum_{k=1}^n |y_i - g_k(x_i)|^2 \cdot h(x_i, y_i, g_k) $$
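
The empirical risk suggests a simple alternating scheme for intuition: fix the mapping functions, let the deconfusing function assign each sample to the mapping with the smallest loss, then refit each mapping on its assigned samples. Below is a minimal sketch of this idea with two hypothetical linear tasks and a hard-assignment <math>h</math>; it only illustrates <math>R_e(g,h)</math> and is not the authors' CSL-Net training procedure.

<syntaxhighlight lang="python">
import numpy as np

def empirical_csl_risk(x, y, predictors):
    """R_e(g, h) where h(x_i, y_i, g_k) = 1 iff g_k has the smallest loss on sample i."""
    losses = np.stack([(y - g(x)) ** 2 for g in predictors], axis=1)  # losses[i, k]
    assign = losses.argmin(axis=1)                                    # deconfusing function h
    return losses[np.arange(len(y)), assign].sum(), assign

# Confusing data: two hypothetical linear tasks mixed without task labels.
rng = np.random.default_rng(1)
x = rng.uniform(-1, 1, 600)
task = rng.integers(0, 2, 600)
y = np.where(task == 0, 2.0 * x, -1.0 * x + 0.5) + 0.05 * rng.normal(size=600)

# Alternate between assigning samples (h) and refitting each mapping g_k.
coefs = [np.array([1.0, 0.0]), np.array([-0.5, 0.0])]   # initial (slope, intercept) guesses
for _ in range(20):
    gs = [lambda t, c=c: c[0] * t + c[1] for c in coefs]
    risk, assign = empirical_csl_risk(x, y, gs)
    coefs = [np.polyfit(x[assign == k], y[assign == k], 1) for k in range(2)]

print("empirical CSL risk after alternation:", risk)
print("fitted (slope, intercept) per mapping:", [c.round(2).tolist() for c in coefs])
</syntaxhighlight>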

== Theoretical Results ==

This framework yields some theoretical results that show the viability of its construction.

'''Theorem 1 (Existence of Solution)''' ''With the confusing supervised learning framework, there is an optimal solution''

$$h^*(x, f_j(x), g_k) = \mathbb{I}[j=k]$$

$$g_k^*(x) = f_k(x)$$

''for each <math>k=1, ..., n</math> that makes the expected risk function of the CSL problem zero.''
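
This follows directly by substituting <math>h^*</math> and <math>g_k^*</math> into the CSL risk functional: the indicator keeps only the <math>j = k</math> terms, each of which vanishes.

$$ R(g^*, h^*) = \int_x \sum_{j,k} (f_j(x) - f_k(x))^2 \; \mathbb{I}[j=k] \; p(f_j) \; p(x) \; \mathrm{d}x = \int_x \sum_{j=1}^n (f_j(x) - f_j(x))^2 \; p(f_j) \; p(x) \; \mathrm{d}x = 0 $$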

'''Theorem 2 (Error Bound of CSL)''' ''With probability at least <math>1 - \eta</math>, simultaneously for all parameter choices <math>\alpha</math> of the CSL learning framework with finite VC dimension <math>\tau</math>, the risk measure is bounded by''

$$R(\alpha) \leq R_e(\alpha) + \frac{B\epsilon(m)}{2} \left(1 + \sqrt{1 + \frac{4R_e(\alpha)}{B\epsilon(m)}}\right)$$

''where <math>\alpha</math> denotes the parameters of the learning functions <math>g, h</math>, <math>B</math> is the upper bound of one sample's risk, and''

$$\epsilon(m) = 4 \; \frac{\tau (\ln \frac{2m}{\tau} + 1) - \ln \frac{\eta}{4}}{m}$$
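
To see how this bound behaves, the short sketch below evaluates <math>\epsilon(m)</math> and the resulting upper bound on <math>R(\alpha)</math> for assumed values of <math>\tau</math>, <math>m</math>, <math>\eta</math>, <math>B</math>, and <math>R_e(\alpha)</math> (all hypothetical). As <math>m</math> grows, <math>\epsilon(m) \to 0</math> and the bound tightens toward the empirical risk.

<syntaxhighlight lang="python">
import math

# All values are hypothetical, chosen only to illustrate the bound.
tau, m, eta = 50, 10_000, 0.05     # VC dimension, sample count, confidence level
B, R_e = 1.0, 0.10                 # per-sample risk bound, observed empirical risk

eps = 4 * (tau * (math.log(2 * m / tau) + 1) - math.log(eta / 4)) / m
bound = R_e + (B * eps / 2) * (1 + math.sqrt(1 + 4 * R_e / (B * eps)))
print(f"epsilon(m) = {eps:.4f}, R(alpha) <= {bound:.4f}")
</syntaxhighlight>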

= CSL-Net =

= Experiment =

= Conclusion =

= Critique =

= References =

Su, Xin, et al. "Task Understanding from Confusing Multi-task Data."