Multi-Task Feature Learning


Introduction

It has been shown, both empirically and theoretically, that learning multiple related tasks simultaneously often significantly improves performance compared to learning each task independently. Learning multiple related tasks simultaneously is especially beneficial when we only have a few data points per task, and this benefit comes from pooling together data across many related tasks. One way that tasks can be related to each other is that some tasks share a common underlying representation; for example, people make product choices (e.g. of books, music CDs, etc.) using a common set of features describing these products. In this paper, the authors explored a way to learn a low-dimensional representation that is shared across multiple related tasks.

Learning sparse multi-task representations

Notations

Let [math]\displaystyle{ \,R }[/math] be the set of real numbers and [math]\displaystyle{ \,R_{+} \; (R_{++}) }[/math] the non-negative (positive) ones. Let [math]\displaystyle{ \,T }[/math] be the number of tasks and index the tasks by [math]\displaystyle{ \,N_T := \{1,\dots,T\} }[/math]. For each task [math]\displaystyle{ \,t \in N_T }[/math], we have [math]\displaystyle{ \,m }[/math] input/output examples [math]\displaystyle{ \,(x_{t1}, y_{t1}),\dots,(x_{tm}, y_{tm}) \in R^d \times R }[/math]. The goal is then to estimate [math]\displaystyle{ \,T }[/math] functions [math]\displaystyle{ \,f_t : R^d \mapsto R, t \in N_T }[/math], that approximate the training data well and are also statistically predictive for new data.

If [math]\displaystyle{ \,w, u \in R^d }[/math], [math]\displaystyle{ \,\lt w, u\gt \; := \sum_{i=1}^d w_i u_i }[/math] is their standard inner product in [math]\displaystyle{ \,R^d }[/math]. For every [math]\displaystyle{ \,p \ge 1 }[/math], [math]\displaystyle{ \,||w||_p \; := \; (\sum_{i=1}^d |w_i|^p)^\frac{1}{p} }[/math] is the [math]\displaystyle{ \,p }[/math]-norm of vector [math]\displaystyle{ \,w }[/math]. If [math]\displaystyle{ \,A }[/math] is a [math]\displaystyle{ \,d \times T }[/math] matrix, then [math]\displaystyle{ \,a^i \in R^T }[/math] and [math]\displaystyle{ \,a_j \in R^d }[/math] are the [math]\displaystyle{ \,i }[/math]th row and the [math]\displaystyle{ \,j }[/math]th column of [math]\displaystyle{ \,A }[/math], respectively. For every [math]\displaystyle{ \,r, \; p \ge 1 }[/math], [math]\displaystyle{ \,||A||_{r,p} := (\sum_{i=1}^d ||a^i||_r^p)^\frac{1}{p} }[/math] is the [math]\displaystyle{ \,(r,p) }[/math]-norm of [math]\displaystyle{ \,A }[/math].

[math]\displaystyle{ \,S^d }[/math] is the set of [math]\displaystyle{ \,d \times d }[/math] real symmetric matrices and [math]\displaystyle{ \,S_{+}^d }[/math] is the subset of positive semidefinite ones. If [math]\displaystyle{ \,D }[/math] is a [math]\displaystyle{ \,d \times d }[/math] matrix, then trace([math]\displaystyle{ \,D }[/math]) [math]\displaystyle{ \; :=\sum_{i=1}^d D_{ii} }[/math]. If [math]\displaystyle{ \,X }[/math] is a [math]\displaystyle{ \,p \times q }[/math] real matrix, then range([math]\displaystyle{ \,X }[/math]) is the set [math]\displaystyle{ \,\{x \in R^p \; : \; x = Xz }[/math], for some [math]\displaystyle{ \,z \in R^q\} }[/math]. [math]\displaystyle{ \,O^d }[/math] is the set of [math]\displaystyle{ \,d \times d }[/math] orthogonal matrices. [math]\displaystyle{ \,D^+ }[/math] is the pseudoinverse of a matrix [math]\displaystyle{ \,D }[/math].
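As a concrete illustration (not part of the paper), the following numpy sketch computes the [math]\displaystyle{ \,(r,p) }[/math]-norm of a small matrix; the function name rp_norm and the example matrix are invented for this example.

 import numpy as np

 def rp_norm(A, r, p):
     # (r,p)-norm: take the r-norm of each row a^i of A,
     # then the p-norm of the vector of row norms.
     row_norms = np.linalg.norm(A, ord=r, axis=1)
     return np.linalg.norm(row_norms, ord=p)

 A = np.array([[1.0, 2.0],
               [0.0, 0.0],
               [3.0, -1.0]])
 print(rp_norm(A, r=2, p=1))  # (2,1)-norm: sum of the Euclidean norms of the rows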

Problem formulation

It is assumed that the functions [math]\displaystyle{ \,f_t }[/math] are related to each other, so they all share a small set of features. Formally, the hypothesis is that the functions [math]\displaystyle{ \,f_t }[/math] can be expressed as: [math]\displaystyle{ \, f_t(x) = \sum_{i=1}^d a_{it}h_i(x), \;\; t \in N_T \;\;\; (1) }[/math]

In [math]\displaystyle{ \,(1) }[/math], [math]\displaystyle{ \,h_i : R^d \mapsto R }[/math] are the features and [math]\displaystyle{ \,a_{it} \in R }[/math] are the regression parameters.


The main assumption is that the vast majority of the features have zero coefficients across all the tasks.


For simplicity, the authors focused on linear features, i.e. [math]\displaystyle{ \,h_i(x) = \lt u_i, x\gt }[/math], where [math]\displaystyle{ \,u_i \in R^d }[/math] and the [math]\displaystyle{ \,u_i }[/math] are assumed to be orthonormal. They noted that extensions to non-linear features may be obtained, for example, by using kernels; however, they did not focus on these extensions in their paper. Furthermore, the functions [math]\displaystyle{ \,f_t }[/math] are also linear, i.e. [math]\displaystyle{ \,f_t(x) = \lt w_t, x\gt }[/math], where [math]\displaystyle{ \, w_t = \sum_i \;a_{it} u_i }[/math].
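To see that the two expressions agree, here is a small numpy check (our own sketch, not from the paper) that [math]\displaystyle{ \,\sum_i a_{it} \lt u_i, x\gt = \lt w_t, x\gt }[/math], with the orthonormal [math]\displaystyle{ \,u_i }[/math] taken as the columns of a matrix obtained from a QR factorization; all values are randomly generated for illustration.

 import numpy as np

 d = 4
 rng = np.random.default_rng(0)

 # Orthonormal feature vectors u_1, ..., u_d as the columns of U (via QR factorization).
 U, _ = np.linalg.qr(rng.standard_normal((d, d)))

 a_t = rng.standard_normal(d)   # regression parameters a_{it} for one task t
 x = rng.standard_normal(d)     # one input vector

 w_t = U @ a_t                  # w_t = sum_i a_{it} u_i

 f_via_features = sum(a_t[i] * (U[:, i] @ x) for i in range(d))  # sum_i a_{it} <u_i, x>
 f_via_w = w_t @ x                                               # <w_t, x>
 print(np.isclose(f_via_features, f_via_w))                      # True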


Let [math]\displaystyle{ \,W }[/math] be the [math]\displaystyle{ \,d \times T }[/math] matrix whose columns are the vectors [math]\displaystyle{ \,w_t }[/math], [math]\displaystyle{ \,A }[/math] the [math]\displaystyle{ \,d \times T }[/math] matrix whose elements are the [math]\displaystyle{ \,a_{it} }[/math], and [math]\displaystyle{ \,U }[/math] the [math]\displaystyle{ \,d \times d }[/math] matrix whose columns are the vectors [math]\displaystyle{ \,u_i }[/math]. The authors then obtained that [math]\displaystyle{ \,W = UA }[/math].

The assumption that the tasks share a small set of features implies that [math]\displaystyle{ \,A }[/math] has many rows which are identically equal to zero. This, in turn, implies that the corresponding features, which are the columns of [math]\displaystyle{ \,U }[/math], will not be used to represent the task parameters, which are the columns of [math]\displaystyle{ \,W }[/math]. Thus, [math]\displaystyle{ \,W }[/math] is a low-rank matrix.
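The following numpy sketch (ours, for illustration only) builds a row-sparse [math]\displaystyle{ \,A }[/math] and checks that the rank of [math]\displaystyle{ \,W = UA }[/math] is at most the number of nonzero rows of [math]\displaystyle{ \,A }[/math]; the dimensions and random data are made up.

 import numpy as np

 d, T = 6, 4
 rng = np.random.default_rng(1)

 # Orthonormal features as the columns of U.
 U, _ = np.linalg.qr(rng.standard_normal((d, d)))

 # A has only two nonzero rows: every task uses the same two features.
 A = np.zeros((d, T))
 A[0, :] = rng.standard_normal(T)
 A[3, :] = rng.standard_normal(T)

 W = U @ A                            # task parameter vectors w_t are the columns of W
 print(np.linalg.matrix_rank(W))      # at most 2, so W is low rank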


The authors' goal was to find the feature vectors [math]\displaystyle{ \,u_i }[/math] and the parameters [math]\displaystyle{ \,a_{it} }[/math].


As a starting point, the authors considered the problem in which there is only one task (referred to as task [math]\displaystyle{ \,t }[/math]) and the features [math]\displaystyle{ \,u_i }[/math] are fixed. In this problem, the aim was to learn the parameter vector [math]\displaystyle{ \,a_t \in R^d }[/math] from data [math]\displaystyle{ \,\{(x_{ti}, y_{ti})\}_{i=1}^m }[/math]. The authors expressed this unconstrained problem as:

[math]\displaystyle{ \, \min \; \{ \; \sum_{i=1}^m L(y_{ti}, \lt a_t,U^T x_{ti}\gt ) + \gamma ||a_t||_1^2 \; : \; a_t \; \in \; R^d \; \} \;\;\; (2) }[/math]

In [math]\displaystyle{ \,(2) }[/math], [math]\displaystyle{ \,L }[/math] is a prescribed loss function (e.g. the square loss) and [math]\displaystyle{ \,\gamma \gt 0 }[/math] is the regularization parameter.

It is well-known that using the 1-norm regularization results in sparse solutions to [math]\displaystyle{ \,(2) }[/math], i.e. many elements of the learned [math]\displaystyle{ \,a_t }[/math] are zero.
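As a rough illustration only (this is not the optimization method used in the paper), problem [math]\displaystyle{ \,(2) }[/math] with the square loss can be handed to a generic derivative-free solver; the data, the choice of [math]\displaystyle{ \,U = I }[/math], and the value of [math]\displaystyle{ \,\gamma }[/math] below are all made up for the sketch.

 import numpy as np
 from scipy.optimize import minimize

 rng = np.random.default_rng(2)
 d, m, gamma = 5, 30, 0.5

 U = np.eye(d)                                  # fixed features; identity for simplicity
 X = rng.standard_normal((m, d))                # inputs x_{t1}, ..., x_{tm} as rows
 a_true = np.array([2.0, 0.0, 0.0, -1.5, 0.0])  # only two features are actually used
 y = X @ a_true + 0.1 * rng.standard_normal(m)

 def objective(a):
     # square loss plus the squared 1-norm penalty of (2)
     residuals = y - (X @ U) @ a                # <a_t, U^T x_{ti}> for every example
     return np.sum(residuals ** 2) + gamma * np.sum(np.abs(a)) ** 2

 # Nelder-Mead is derivative-free, so the non-smooth 1-norm term poses no difficulty here.
 res = minimize(objective, x0=np.zeros(d), method="Nelder-Mead",
                options={"maxiter": 20000, "xatol": 1e-9, "fatol": 1e-9})
 print(np.round(res.x, 3))   # coefficients for the unused features are pushed toward zero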