Task Understanding from Confusing Multi-task Data

From statwiki
Revision as of 07:17, 15 November 2020 by P6choi (talk | contribs) (→‎References)
Jump to navigation Jump to search

Presented By

Qianlin Song, William Loh, Junyue Bai, Phoebe Choi

Introduction

Related Work: Types of machine learning problems

Multi-task learning

Multi-task learning aims to learn multiple tasks simultaneously using a shared feature representation. By exploiting similarities and differences between tasks, the learning from one task can improve the learning of another task. (Caruana, 1997) This results in improved learning efficiency. Multi-task learning is used in disciplines like computer vision, natural language processing, and reinforcement learning. Multi-task learning requires manual task annotation to learn and this paper is interested in machine learning without a clear task definition and manual task annotation.


Multi-label learning

Multi-label learning aims to assign an input to a set of classes/labels. It is a generalization of multi-class classification, which classifies an input into one class. In multi-label learning, an input can be classified into more than one class. Unlike multi-task learning, multi-label does not consider the relationship between different label judgments.

Confusing Supervised Learning

Description of the Problem

Confusing supervised learning (CSL) offers a solution to the issue at hand. A major area of improvement can be seen in the choice of risk measure. In traditional supervised learning, assuming the risk measure is mean squared error (MSE), the expected risk functional is

$$ R(g) = \int_x (f(x) - g(x))^2 p(x) \; \mathrm{d}x $$

where [math]\displaystyle{ p(x) }[/math] is the prior distribution of the input variable [math]\displaystyle{ x }[/math]. In practice, model optimizations are performed using the empirical risk

$$ R_e(g) = \sum_{i=1}^n (y_i - g(x_i))^2 $$

When the problem involves different tasks, the model should optimize for each data point depending on the given task. Let [math]\displaystyle{ f_j(x) }[/math] be the true ground-truth function for each task [math]\displaystyle{ j }[/math]. Therefore, for some input variable [math]\displaystyle{ x_i }[/math], an ideal model [math]\displaystyle{ g }[/math] would predict [math]\displaystyle{ g(x_i) = f_j(x_i) }[/math]. With this, the risk functional can be modified to fit this new task for traditional supervised learning methods.

$$ R(g) = \int_x \sum_{j=1}^n (f_j(x) - g(x))^2 p(f_j) p(x) \; \mathrm{d}x $$

We call [math]\displaystyle{ (f_j(x) - g(x))^2 p(f_j) }[/math] the confusing multiple mappings. Then the optimal solution [math]\displaystyle{ g^*(x) }[/math] to the mapping is [math]\displaystyle{ \bar{f}(x) = \sum_{j=1}^n p(f_j) f_j(x) }[/math] under this risk functional. However, the optimal solution is not conditional on the specific task at hand but rather on the entire ground-truth functions. Therefore, for every non-trivial set of tasks where [math]\displaystyle{ f_u(x) \neq f_v(x) }[/math] for some input [math]\displaystyle{ x }[/math] and [math]\displaystyle{ u \neq v }[/math], [math]\displaystyle{ R(g^*) \gt 0 }[/math] which implies that there is an unavoidable confusion risk.

Learning Functions of CSL

To overcome this issue, the authors introduce two types of learning functions:

  • Deconfusing function — allocation of which samples come from the same task
  • Mapping function — mapping relation from input to output of every learned task

Suppose there are [math]\displaystyle{ n }[/math] ground-truth mappings [math]\displaystyle{ \{f_j : 1 \leq j \leq n\} }[/math] that we wish to approximate with a set of mapping functions [math]\displaystyle{ \{g_k : 1 \leq k \leq l\} }[/math]. The authors define the deconfusing function as an indicator function [math]\displaystyle{ h(x, y, g_k) }[/math] which takes some sample [math]\displaystyle{ (x,y) }[/math] and determines whether the sample is assigned to task [math]\displaystyle{ g_k }[/math]. Under the CSL framework, the risk functional (mean squared loss) is

$$ R(g,h) = \int_x \sum_{j,k} (f_j(x) - g_k(x))^2 \; h(x, f_j(x), g_k) \;p(f_j) \; p(x) \;\mathrm{d}x $$

which can be estimated empirically with

$$R_e(g,h) = \sum_{i=1}^m \sum_{k=1}^n |y_i - g_k(x_i)|^2 \cdot h(x_i, y_i, g_k) $$

Theoretical Results

This novel framework yields some theoretical results to show the viability of its construction.

Theorem 1 (Existence of Solution) With the confusing supervised learning framework, there is an optimal solution $$h^*(x, f_j(x), g_k) = \mathbb{I}[j=k]$$

$$g_k^*(x) = f_k(x)$$

for each [math]\displaystyle{ k=1,..., n }[/math] that makes the expected risk function of the CSL problem zero.

Theorem 2 (Error Bound of CSL) With probability at least [math]\displaystyle{ 1 - \eta }[/math] simultaneously with finite VC dimension [math]\displaystyle{ \tau }[/math] of CSL learning framework, the risk measure is bounded by

$$R(\alpha) \leq R_e(\alpha) + \frac{B\epsilon(m)}{2} \left(1 + \sqrt{1 + \frac{4R_e(\alpha)}{B\epsilon(m)}}\right)$$

where [math]\displaystyle{ \alpha }[/math] is the total parameters of learning functions [math]\displaystyle{ g, h }[/math], [math]\displaystyle{ B }[/math] is the upper bound of one sample's risk, [math]\displaystyle{ m }[/math] is the size of training data and $$\epsilon(m) = 4 \; \frac{\tau (\ln \frac{2m}{\tau} + 1) - \ln \eta / 4}{m}$$

CSL-Net

In this section the authors describe how to implement and train a network for CSL.

The Structure of CSL-Net

Two neural networks, deconfusing-net and mapping-net are trained to implement two learning function variables in empirical risk. The optimization target of the training algorithm is: $$\min_{g, h} R_e = \sum_{i=1}^{m}\sum_{k=1}^{n} (y_i - g_k(x_i))^2 \cdot h(x_k, y_k; g_k)$$

The mapping-net is corresponding to functions set [math]\displaystyle{ g_k }[/math], where [math]\displaystyle{ y_k = g_k(x) }[/math] represents the output of one certain task. The deconfusing-net is corresponding to function h, whose input is a sample [math]\displaystyle{ (x,y) }[/math] and output is an n-dimensional one-hot vector. This output vector determines which task the sample [math]\displaystyle{ (x,y) }[/math] should be assigned to. The core difficulty of this algorithm is that the risk function cannot be optimized by gradient back-propagation due to the constraint of one-hot output from deconfusing-net. Approximation of softmax will lead the deconfusing-net output into a non-one-hot form, which resulting in meaningless trivial solutions.


Iterative Deconfusing Algorithm

To overcome the training difficulty, the authors divide the empirical risk minimization into two local optimization problems. In each single-network optimization step, the parameters of one network is updated while the parameters of another remain fixed. With one network's parameters unchanged, the problem can be solved by a gradient descent method of neural networks.

Training of Mapping-Net: With function h from deconfusing-net being determined, the goal is to train every mapping function [math]\displaystyle{ g_k }[/math] with its corresponding sample [math]\displaystyle{ (x_i^k, y_j^k) }[/math]. The optimization problem becomes: [math]\displaystyle{ \displaystyle \min_{g_k} L_{map}(g_k) = \sum_{i=1}^{m_k} \mid y_i^k - g_k(x_i^k)\mid^2 }[/math]. Back-propagation algorithm can be applied to solve this optimization problem.

Training of Deconfusing-Net: The task allocation is re-evaluated during the training phase while the parameters of the mapping-net remain fixed. To minimize the original risk, every sample [math]\displaystyle{ (x, y) }[/math] will be assigned to [math]\displaystyle{ g_k }[/math] that is closest to label y among all different [math]\displaystyle{ k }[/math]s. Mapping-net thus provides a temporary solution for deconfusing-net: [math]\displaystyle{ \hat{h}(x_i, y_i) = arg \displaystyle\min_{k} \mid y_i - g_k(x_i)\mid^2 }[/math]. The optimization becomes: [math]\displaystyle{ \displaystyle \min_{h} L_{dec}(h) = \sum_{i=1}^{m} \mid {h}(x_i, y_i) - \hat{h}(x_i, y_i)\mid^2 }[/math]. Similarly, the optimization problem can be solved by updating the deconfusing-net with a back-propagation algorithm.

The two optimization stages are carried out alternately until the solution converges.

Experiment

Conclusion

This paper proposed the CSL method for tackling the multi-task learning problem with manual task annotations in the input data. The model obtains a basic task concept by differentiating multiple mappings. This paper believes that the CSL method is an important step to moving from Narrow AI towards General AI for multi-task learning.

Critique

References

Su, Xin, et al. "Task Understanding from Confusing Multi-task Data."

Caruana, R. (1997) "Multi-task learning"