multi-Task Feature Learning: Difference between revisions

From statwiki
Jump to navigation Jump to search
m (Conversion script moved page Multi-Task Feature Learning to multi-Task Feature Learning: Converting page titles to lowercase)
 
(124 intermediate revisions by 6 users not shown)
Line 1: Line 1:
=Introduction=
=Introduction=
<!--
Re-wrote the introduction since it was very similar to that of the original paper
-->
Imagine that a user wishes to perform some '''task''', such as shopping for a book.
If the user does not have much experience shopping for books, he or she may draw upon his or her knowledge of shopping for some other product, like music CDs.
The tasks are not exactly the same, but there are similarities.
The user may recognize that books are made by authors, music by artists.
Both have different genres, and perhaps appeal to different demographics.
Multi-task feature learning aims to model this paradigm.
Statistically, we wonder if there is some common set of features a user may apply in order to do some abstract task (returning to the example, the common task is '''shopping''').
And, specifically, can we learn multiple related tasks simultaneously, as opposed to independently?
Multi-task feature learning is especially relevant to situations where the number of observations may be small.
This creates a need for finding shared structures between the tasks, so that observations can be aggregated to learn many tasks simultaneously.
The search for these shared structures can be seen as a [http://en.wikipedia.org/wiki/Dimension_reduction dimensionality reduction], or alternatively a [http://en.wikipedia.org/wiki/Regularization_(mathematics) regularization], problem.
The authors generalize the idea of 1-norm regularization that presents a low dimensional representation of a single task to learn the shared representation across multiple tasks. Regularization parameter controls the number of learned common features. They present an equivalent convex optimization problem to their original formulation, and propose an algorithm to solve this problem.
The learning part of the algorithm simultaneously learns both the features and the task functions through two
alternating steps. The first step which is supervised includes independently learning the parameters of the tasks
[http://en.wikipedia.org/wiki/Regression regression] or classification functions. In the second step, which is unsupervised, a low-dimensional representation for these task parameters is found, which the authors show is equivalent to learning common features across the tasks. 


It has been both empirically as well as theoretically shown that learning multiple related tasks simultaneously often significantly improves performance as compared to learning each task independently. Learning multiple related tasks simultaneously is especially beneficial when we only have a few data per task; and this benefit comes from pooling together data across many related tasks. One way that tasks can be related to each other is that some tasks share a common underlying representation; for example, people make product choices (e.g. of books, music CDs, etc.) using a common set of features describing these products. In this paper, the authors explored a way to learn a low-dimensional representation this is shared across multiple related tasks.
=Learning sparse multi-task representations=
=Learning sparse multi-task representations=


==Notations==
==Notation==
For simplicity, we will first define the following notation:
* Let <math>\,R</math> be the set of real numbers,  <math>\,R_{+}</math> the non-negative ones and <math>\,R_{++}</math> the positive ones.
*Let <math>\,T</math> be the number of tasks and let the set of tasks be <math>\,N_T := {1,\dots,T}</math>.
*For each task <math>\,t \in N_T</math>, we have <math>\,m</math> input/output examples <math>\,(x_{t1}, y_{t1}),\dots ,(x_{tm}, y_{tm}) \in R^d \times R</math>.
*The goal is then to estimate <math>\,T</math> functions <math>\,f_t : R^d \mapsto R, t \in N_T</math>, that approximate the training data well and are also statistically predictive for new data.
*If <math>\,w, u \in R^d</math>, <math>\,<w, u> \; := \sum_{i=1}^d w_i u_i</math> is their standard inner product in <math>\,R^d</math>.
*For every <math>\,p \ge 1</math>, <math>\,||w||_p \; := \; (\sum_{i=1}^d |w_i|^p)^\frac{1}{p}</math> is the <math>\,p</math>-norm of vector <math>\,w</math>.
*If <math>\,A</math> is a <math>\,d \times T</math> matrix, then <math>\,a^i \in R^T</math> and <math>\,a_j \in R^d</math> are the <math>\,i</math>th row and the <math>\,j</math>th column of <math>\,A</math>, respectively.
*For every <math>\,r, \; p \ge 1</math>, <math>\,||A||_{r,p} := (\sum_{i=1}^d ||a^i||_r^p)^\frac{1}{p}</math> is the <math>\,(r,p)</math> -norm of <math>\,A</math>.
*<math>\,S^d</math> is the set of <math>\,d \times d</math> real symmetric matrices and <math>\,S_{+}^d</math> is the subset of positive semidefinite ones.
*If <math>\,D</math> is a <math>\,d \times d</math> matrix, then <math>\,tr(D):=\sum_{i=1}^d D_{ii}</math>.
*If <math>\,X</math> is a <math>\,p \times q</math> real matrix, then range(<math>\,X</math>) is the set  <math>\,\{x \in R^p \; : \; x = Xz</math>, for some <math>\,z \in R^q\}</math>.
*<math>\,O^d</math> is the set of <math>\,d \times d</math> orthogonal matrices.
*<math>\,D^+</math> is the [http://mathworld.wolfram.com/Pseudoinverse.html pseudoinverse] of a matrix <math>\,D</math>.
 
==Problem formulation==
It is assumed that the functions <math>\,f_t</math> are related to each other, so they all share a small set of features. Formally, the hypothesis is that the functions <math>\,f_t</math> can be expressed as:
<center><math>\, f_t(x) = \sum_{i=1}^d a_{it}h_i(x), \;\;  t \in N_T \;\;\; (1)</math></center>
 
In the above equation, <math>\,h_i : R^d \mapsto R</math> are the features and <math>\,a_{it} \in R</math> are the regression parameters.
 
The main assumption is that the vast majority of the features have zero coefficients across ''all'' the tasks.
 
For simplicity, the authors focused on linear features, i.e. <math>\,h_i(x) = <u_i, x></math>, where <math>\,u_i \in R^d</math> and <math>\,u_i</math> are assumed to be orthonormal. They noted that extensions to non-linear functions may be done, for example, by using kernels; however, they did not focus on these extensions in their paper.  Furthermore, the functions <math>\,f_t</math> are also linear, i.e. <math>\,f_t(x) = <w_t, x></math>, where <math>\, w_t = \Sigma_i \;a_{it} u_i</math>.
 
<math>\,W</math> is the <math>\,d \times T</math> matrix whose columns are the vectors <math>\,w_t</math>. <math>\,A</math> is the <math>\,d \times T</math> matrix whose elements are the <math>\,a_{it}</math>. The authors then obtained that <math>\,W = UA</math> where <math>U \in O^d</math> .
 
The assumption that the tasks share a ''small'' set of features implies that <math>\,A</math> has ''many'' rows which are identically equal to zero. This, in turn, implies that the corresponding features, which are the columns of <math>\,U</math>, will not be used to represent the task parameters, which are the columns of <math>\,W</math>. Thus, <math>\,W</math> is a low-[http://mathworld.wolfram.com/MatrixRank.html rank] matrix.
 
The authors' goal was to find the feature vectors <math>\,u_i</math> and the parameters <math>\,a_{it}</math>.
 
As a starting point, the authors considered the problem in which there is only one task (referred to as task <math>\,t</math>) and the features <math>\,u_i</math> are fixed. In this problem, the aim was to learn the parameter vector <math>\,a_t \in R^d</math> from data <math>\,\{(x_{ti}, y_{ti})\}_{i=1}^m</math>.
 
This unconstrained problem is expressed as:
 
<center><math>\, \min_{a_t} \; \{ \; \sum_{i=1}^m L(y_{ti}, <a_t,U^T x_{ti}>) + \gamma ||a_t||_1^2 \; : \; a_t \; \in \; R^d \; \} \;\;\; (2)</math></center>
 
In which, <math>\,\gamma > 0</math> is the [http://en.wikipedia.org/wiki/Regularization regularization] parameter. The fact that using the 1-norm [http://en.wikipedia.org/wiki/Regularization regularization] results in a sparse solution to <math>\,(2)</math> (i.e. many elements of the learned <math>\,a_t</math> are zero) is well known.
 
To expand the problem to the multi-task case, the authors introduced the following regularization error function:
 
<center><math>\, \mathcal{E}(A,U) = \sum_{t=1}^T \sum_{i=1}^m L(y_{ti}, <a_t,U^T x_{ti}>) + \gamma ||A||_{2,1}^2 \;\;\; (3)</math></center>
 
The first term in <math>\,(3)</math> is the average empirical error across the tasks, and the second term is a regularization term that penalizes the <math>\,(2, 1)</math> -norm of <math>\,A</math>.  The <math>\,(2, 1)</math> -norm of <math>\,A</math> is computed by first computing the 2-norm of the rows of <math>\,A</math> and then the 1-norm of the vector <math>\,b(A) = (\|a^1\|_2,\ldots,\|a^d\|_2)</math>.  The regularization term in <math>\,(3)</math> ensures that <math>\,b(A)</math> will likely be sparse and since each row of <math>\,A</math> represents a feature this regularization limits the number of common features.
 
However, since the authors did not simply want to select the features, but rather they also wanted to learn them, they further minimized the function <math>\,\mathcal{E}</math> over <math>\,U</math>, i.e. they considered the following optimization problem:
 
<center><math>\,\min \; \{\; \mathcal{E}(A,U) \; : \; U \in O^d, A \in \mathbb{R}^{d \times T} \; \} \;\;\; (4)</math></center>
 
Solving the above [http://en.wikipedia.org/wiki/Optimization optimization] problem, results in learning a low-dimensional representation which is shared across the tasks.
 
In addition, when <math>\,U</math> is not learned and is set to <math>\,I_{d \times d}</math>, solving <math>\,(4)</math> finds a common set of variables across the tasks. Under this condition for <math>\,U</math>, <math>\,(4)</math> can be re-expressed as the following [http://en.wikipedia.org/wiki/Convex_optimization convex] optimization problem:
 
<center><math>\, \min \; \{\; \sum_{t=1}^T \sum_{i=1}^m L(y_{ti}, <a_t, x_{ti}>) + \gamma ||A||_{2,1}^2 \; : \; A \in \mathbb{R}^{d \times T} \; \} \;\;\; (5)</math></center>
 
=Equivalent convex optimization formulation=
Equation <math>\,4</math> is difficult to solve due to the following two reasons:
 
* It is not convex ( although it is separately convex in each of <math>\,\{A, U\}</math> ).
* <math>\,||A||_{2,1}</math> is not smooth, which makes optimization more difficult.
 
However, <math>\,(4)</math> can be re-expressed as an equivalent convex problem.
 
Before doing so, for every <math>\,W \in \mathbb{R}^{d \times T}</math> and <math>\,D \in S_+^d</math>, the authors defined the following function:
 
<center><math>\, R(W,D) = \sum_{t=1}^T \sum_{i=1}^m \; L(y_{ti} , <w_t,x_ti)>) + \gamma \sum_{t=1}^T \; <w_t,D^+w_t> \;\;\; (6)</math></center>
 
'''Theorem 3.1''' (A detailed proof is given in the authors' paper listed in Reference) ''Problem <math>\,(4)</math> is equivalent to the following problem:
<br /><br />
<center><math>\, \min \; \{ \; R(W,D) \; : \; W \in \mathbb{R}^{d \times T}, \; D \in S_+^d, \; tr(D) \le 1, \; range(W) \subseteq range(D) \; \} \;\;\; (7)</math></center> 
<br />
''That is <math>(\hat{A},\hat(U))</math> is an optimal solution for <math>\,(4)</math> if and only if <math>(\hat{W},\hat{D}) = (\hat{U}\hat{A},\hat{U}\textrm{Diag}(\hat{\lambda})\hat{U}^T)
</math> is an optimal solution for <math>\,(7)</math>, where''
<br /><br />
<math> \hat{\lambda} := \frac{\|a^i\|_2}{\|\hat{A}\|_{2,1}}. </math>
<br /><br />
Since the rank of <math>\,D</math> is determined by the number of nonzero eigenvalues, we can see from the above definition of <math>\hat{\lambda}</math> that the rank of <math>\,D</math> gives the number of common relevant features shared by the tasks.
 
By defining the function
:<center><math>\,f(w,D) =
\begin{cases}
w^TD^+w &\text{if } w \in \textrm{range}(D) \;\; \textrm{and} \;\; D \in S_+^d , \\
+\infty &\text{otherwise } .
\end{cases}
</math></center>
and showing that it is jointly convex in both <math>\,W</math> and <math>\,D</math> the authors show (details are available in the authors' paper listed in the Reference section) that the function <math>\,R</math> in <math>\,(6)</math> is jointly convex in both <math>\,W</math> and <math>\,D</math>.
 
=Learning algorithm=
 
The authors solved <math>\,(7)</math> by alternately minimizing the function <math>\,R</math> with respect to <math>\,D</math> and the <math>\,w_t</math> (
(the <math>\,t</math>-th column of <math>\,W</math>).
 
When <math>\,D</math> is held fixed, <math>\,w_t</math> is found by learning its parameters independently using a regularization method such as an SVM. When
<math>\,w_t</math> is held fixed, <math>\,D</math> is found by solving the following optimization problem:
 
<center><math>\,\min \; \{ \; \sum_{t=1}^T \; <w_t , D^+ w_t> \; : \; D \in S_+^d, \; tr(D) \le 1, \; \textrm{range}(W) \subseteq \textrm{range}(D) \; \} \;\;\; (8) </math></center>
 
'''Theorem 4.1''' (The proof is given in the authors' paper listed in the Reference section) ''The optimal solution of <math>\,(8)</math> is <math>\,D = \frac {(WW^T)^\frac{1}{2}} {tr(WW^T)^\frac{1}{2}}. \;\;\; (9)</math> 
 
Now, we have enough theory to describe an algorithm.
===Algorithm 1:''' Multi-Task Feature Learning'''===
*'''Input''': <math>\{x_{ti}, y_{ti}\}_{i=1}^m, t \in N_t </math>
* '''Parameters''': <math>\gamma</math>
*'''Output''' <math>d \times d</math> matrix <math>D</math> and <math>d \times T</math> matrix <math>W</math>
*'''Initial Value''': <math>D = \frac{I_d}{d}</math>
*'''while''' convergence condition not met:
**'''for''' <math>t = 1,\dots, T</math>:
*** find <math>w_t = \arg\min \{\sum_{i=1}^m  L(y_{ti} , <w_t,x_{ti})>) + \gamma <w_t,D^+w_t></math> with <math>w \in R^d</math> and <math>w \in \textrm{range}(D)</math>
**'''end for'''
*set <math>D = \frac{(WW^T)^{1/2}}{tr(WW^T)^{1/2}}</math>
*'''end while'''
 
In <math>\,(9)</math>, <math>\,tr(WW^T)^\frac{1}{2}</math> is the sum of the singular values of <math>\,W</math>, and thus it is sometimes called the ''[http://en.wikipedia.org/wiki/Trace_norm trace norm]'' (this [http://people.csail.mit.edu/jrennie/writing/traceEquivalence.pdf work] by Jason D. M. Rennie shows how the trace norm of a matrix can be expressed in a number of different ways). Using the trace norm, <math>\,(7)</math> is reduced to a regularization problem that depends only on <math>\,W</math>. However, the [http://en.wikipedia.org/wiki/Trace_norm trace norm] is not smooth. As a result, the authors argue that their algorithm ('''Algorithm 1''') that uses an alternating minimization strategy is beneficial in that it is:
* simple to implement
* has a natural interpretation
 
The authors further note that '''Algorithm 1''' alternately performs a supervised step and an unsupervised step; in the unsupervised step, common representations across the tasks are learned and, in the supervised step, task-specific functions are learned using these representations.


Let <math>\,R</math> be the set of real numbers and <math>\,R_{+} (R_{++})</math> the non-negative ones. Let T be the number of tasks and let the tasks as <math>\,N_T := {1,\dots,T}</math>. For each task <math>\,t \in N_T</math>, we have <math>\,m</math> input/output examples <math>\,(x_{t1}; y_{t1}),\dots,(x_{tm}; y_{tm}) \in R^d \times R</math>. The goal is then to estimate <math>\,T</math> functions <math>\,f_t : R^d \mapsto R, t \in N_T</math>, that approximate well the training data and are also statistically predictive for new data.  
The authors further note that if <math>\,D</math> in <math>\,(7)</math> is additionally constrained to be diagonal, then <math>\,(7)</math> reduces to <math>\,(5)</math>.


If <math>\,w, u \in R^d</math>, <math>\,<w, u> \; := \sum_{i=1}^d w_i u_i</math> is their standard inner product in <math>\,R^d</math>. For every <math>\,p \ge 1</math>, <math>\,||w||_p \; := \; (\sum_{i=1}^d |w_i|^p)^\frac{1}{p}</math> is the <math>\,p</math>-norm of vector <math>\,w</math>. If <math>\,A</math> is a <math>\,d \times T</math> matrix, then <math>\,a^i \in R^T</math> and <math>\,a_j \in R^d</math> are the <math>\,i</math>th row and the <math>\,j</math>th column of <math>\,A</math>, respectively. For every <math>\,r, \; p \ge 1</math>, <math>\,||A||_{r,p} := (\sum_{i=1}^d ||a^i||_r^p)^\frac{1}{p}</math> is the <math>\,(r,p)</math>-norm of <math>\,A</math>.
=Experiments=


<math>\,S^d</math> is the set of <math>\,d \times d</math> real symmetric matrices and <math>\,S_{+}^d</math> is the subset of positive semidefinite ones. If <math>\,D</math> is a <math>\,d \times d</math> matrix, then trace(<math>\,D</math>) <math> \; :=\sum_{i=1}^d D_{ii}</math>. If <math>\,X</math> is a <math>\,p \times q</math> real matrix, then range(<math>\,X</math>) is the set  <math>\,\{x \in R^p \; : \; x = Xz</math>, for some <math>\,z \in R^q\}</math>. <math>\,O^d</math> is the set of <math>\,d \times d</math> orthogonal matrices. <math>\,D^+</math> is the [http://mathworld.wolfram.com/Pseudoinverse.html pseudoinverse] of a matrix <math>\,D</math>.
The authors performed their experiments on a [http://en.wikipedia.org/wiki/Synthetic_data synthetic data] set and a real data set, and they used the square loss function and tuned the regularization parameter using leave-one-out cross validation.


==Problem formulation==
==Synthetic Experiments==


It is assumed that the functions <math>\,f_t</math> are related to each other, so they all share a small set of features. Formally, the hypothesis is that the functions <math>\,f_t</math> can be expressed as:
The authors created [http://en.wikipedia.org/wiki/Synthetic_data synthetic data] sets by generating <math>\,T</math> = 200 task parameters <math>\,w_t</math> from a <math>\,5</math>-dimensional [http://en.wikipedia.org/wiki/Gaussian_distribution Gaussian distribution] with zero mean and covariance equal to <math>\,Diag(1, 0.25, 0.1, 0.05, 0.01)</math>.
<math>\, f_t(x) = \sum_{i=1}^d a_{it}h_i(x), \;\;  t \in N_T \;\;\; (1)</math>
More details regarding the authors' synthetic data can be found in the paper listed in Reference.


In <math>\,(1)</math>, <math>\,h_i : R^d \mapsto R</math> are the features and <math>\,a_{it} \in R</math> are the regression parameters.
The following figure (taken from the authors' paper listed in Reference) shows the performance of the authors' algorithm for <math>\,T =</math> 10, 25, 100, and 200 tasks and the performance of 200 independent standard [http://en.wikipedia.org/wiki/Tikhonov_regularization ridge regressions] on the data:




The main assumption is that the vast majority of the features have zero coefficients across ''all'' the tasks.
[[File:ps2fig2.jpg]]




For simplicity, the authors focused on linear features, i.e. <math>\,h_i(x) = <u_i, x></math>, where <math>\,u_i \in R^d</math> and <math>\,u_i</math> are assumed to be orthonormal. They noted that extensions to non-linear functions may be done, for example, by using kernels; however, they did not focus on these extensions in their paper. Furthermore, the functions <math>\,f_t</math> are also linear, i.e. <math>\,f_t(x) = <w_t, x></math>, where <math>\, w_t = \Sigma_i \;a_{it} u_i</math>.
As is also demonstrated by past empirical and theoretical works, the authors' experiment on [http://en.wikipedia.org/wiki/Synthetic_data synthetic data] demonstrated that learning multiple
tasks together results in significant improvements over learning the tasks independently. Furthermore, this experiment demonstrated the following two key points:
* The performance of the authors' algorithm improves as the number of tasks increases, i.e. adding more tasks results in better estimates of the underlying features.
* The performance of the authors' algorithm is moderate for low-dimensionality data, but it increases as the number of irrelevant dimensions increases.


==Conjoint analysis experiment==


<math>\,W</math> the <math>\,d \times T</math> matrix whose columns are the vectors <math>\,w_t</math>. <math>\,A</math> the <math>\,d \times T</math> matrix whose elements are the <math>\,a_{it}</math>. The authors then obtained that <math>\,W = UA</math>.
The authors then tested their method using a real data set that contained 180 persons' ratings regarding their propensities for purchasing one of 20 different personal computers. More details regarding this data can be found in the paper listed in Reference. In this data set, the people corresponded to tasks and the PC models corresponded to training instances.


The assumption that the tasks share a ''small'' set of features implies that <math>\,A</math> has ''many'' rows which are identically equal to zero. This, in turn, implies that the corresponding features, which are the columns of <math>\,U</math>, will not be used to represent the task parameters, which are the columns of <math>\,W</math>. <math>\,W</math>, thus, is a low-[http://mathworld.wolfram.com/MatrixRank.html rank] matrix.
The following figure (taken from the authors' paper listed in Reference) shows that the performance of the authors' algorithm improves as the number of tasks increases.




The authors' goal was to find the feature vectors <math>\,u_i</math> and the parameters <math>\,a_{it}</math>.
[[File:ps2fig3.jpg]]




As a starting point, the authors considered the problem in which there is only one task (referred to as task <math>\,t</math>) and the features <math>\,u_i</math> are fixed. In this problem, the aim was to learn the parameter vector <math>\,a_t \in R^d</math> from data <math>\,\{(x_{ti}, y_{ti})\}_{i=1}^m</math>. The authors expressed this unconstrained problem as:
The authors also showed that their algorithm performs much better than independent ridge regressions.


<center><math>\, min \; \{ \; \sum_{i=1}^m L(y_{ti}, <a_t,U^T x_{ti}>) + \gamma ||a_t||_1^2 \; : \; a_t \; \in \; R^d \; \} \;\;\; (2)</math></center>
The plot in the middle shows that there is a single most important feature that is shared by all people. This feature seems to capture how people weigh the technical characteristics of a computer (such as RAM, CPU, and CD-ROM) against the computer's price. The authors thus concluded that their algorithm is able to find interesting patterns in people’s purchasing decision processes.


In <math>\,(2)</math>, <math>\,\gamma > 0</math> is the regularization parameter.
<br>


It is well-known that using the 1-norm regularization results in a sparse solution to <math>\,(2)</math>, i.e. many elements of the learned <math>\,a_t</math> are zero.
=Conclusion=


The authors have presented an algorithm that learns common sparse function representations across a pool of related tasks. In this algorithm, the number of learned features is not a parameter, and it is instead controlled by the algorithm's regularization parameter. This algorithm also appears to be the first [http://en.wikipedia.org/wiki/Convex_optimization convex optimization] formulation for multi-task feature learning.


Before generalizing <math>\,(2)</math> to the multi-task case, the authors introduced the following regularization error function:
<br>


<center><math>\, \Epsilon(A,U) = \sum_{t=1}^T \sum_{i=1}^m L(y_{ti}, <a_t,U^T x_{ti}>) + \gamma ||A||_{2,1}^2 \;\;\; (3)</math></center>
=Reference=


In <math>\,(3)</math>, the first term is the average empirical error across the tasks, and the second term is a regularization term that penalizes the <math>\,(2, 1)</math>-norm of <math>\,A<math>.
Andreas Argyriou. Theodoros Evgeniou. Massimiliano Pontil. Multi-Task Feature Learning.

Latest revision as of 08:45, 30 August 2017

Introduction

Imagine that a user wishes to perform some task, such as shopping for a book. If the user does not have much experience shopping for books, he or she may draw upon his or her knowledge of shopping for some other product, like music CDs. The tasks are not exactly the same, but there are similarities. The user may recognize that books are made by authors, music by artists. Both have different genres, and perhaps appeal to different demographics.

Multi-task feature learning aims to model this paradigm. Statistically, we wonder if there is some common set of features a user may apply in order to do some abstract task (returning to the example, the common task is shopping). And, specifically, can we learn multiple related tasks simultaneously, as opposed to independently?

Multi-task feature learning is especially relevant to situations where the number of observations may be small. This creates a need for finding shared structures between the tasks, so that observations can be aggregated to learn many tasks simultaneously. The search for these shared structures can be seen as a dimensionality reduction, or alternatively a regularization, problem.

The authors generalize the idea of 1-norm regularization that presents a low dimensional representation of a single task to learn the shared representation across multiple tasks. Regularization parameter controls the number of learned common features. They present an equivalent convex optimization problem to their original formulation, and propose an algorithm to solve this problem.

The learning part of the algorithm simultaneously learns both the features and the task functions through two alternating steps. The first step which is supervised includes independently learning the parameters of the tasks regression or classification functions. In the second step, which is unsupervised, a low-dimensional representation for these task parameters is found, which the authors show is equivalent to learning common features across the tasks.

Learning sparse multi-task representations

Notation

For simplicity, we will first define the following notation:

  • Let [math]\displaystyle{ \,R }[/math] be the set of real numbers, [math]\displaystyle{ \,R_{+} }[/math] the non-negative ones and [math]\displaystyle{ \,R_{++} }[/math] the positive ones.
  • Let [math]\displaystyle{ \,T }[/math] be the number of tasks and let the set of tasks be [math]\displaystyle{ \,N_T := {1,\dots,T} }[/math].
  • For each task [math]\displaystyle{ \,t \in N_T }[/math], we have [math]\displaystyle{ \,m }[/math] input/output examples [math]\displaystyle{ \,(x_{t1}, y_{t1}),\dots ,(x_{tm}, y_{tm}) \in R^d \times R }[/math].
  • The goal is then to estimate [math]\displaystyle{ \,T }[/math] functions [math]\displaystyle{ \,f_t : R^d \mapsto R, t \in N_T }[/math], that approximate the training data well and are also statistically predictive for new data.
  • If [math]\displaystyle{ \,w, u \in R^d }[/math], [math]\displaystyle{ \,\lt w, u\gt \; := \sum_{i=1}^d w_i u_i }[/math] is their standard inner product in [math]\displaystyle{ \,R^d }[/math].
  • For every [math]\displaystyle{ \,p \ge 1 }[/math], [math]\displaystyle{ \,||w||_p \; := \; (\sum_{i=1}^d |w_i|^p)^\frac{1}{p} }[/math] is the [math]\displaystyle{ \,p }[/math]-norm of vector [math]\displaystyle{ \,w }[/math].
  • If [math]\displaystyle{ \,A }[/math] is a [math]\displaystyle{ \,d \times T }[/math] matrix, then [math]\displaystyle{ \,a^i \in R^T }[/math] and [math]\displaystyle{ \,a_j \in R^d }[/math] are the [math]\displaystyle{ \,i }[/math]th row and the [math]\displaystyle{ \,j }[/math]th column of [math]\displaystyle{ \,A }[/math], respectively.
  • For every [math]\displaystyle{ \,r, \; p \ge 1 }[/math], [math]\displaystyle{ \,||A||_{r,p} := (\sum_{i=1}^d ||a^i||_r^p)^\frac{1}{p} }[/math] is the [math]\displaystyle{ \,(r,p) }[/math] -norm of [math]\displaystyle{ \,A }[/math].
  • [math]\displaystyle{ \,S^d }[/math] is the set of [math]\displaystyle{ \,d \times d }[/math] real symmetric matrices and [math]\displaystyle{ \,S_{+}^d }[/math] is the subset of positive semidefinite ones.
  • If [math]\displaystyle{ \,D }[/math] is a [math]\displaystyle{ \,d \times d }[/math] matrix, then [math]\displaystyle{ \,tr(D):=\sum_{i=1}^d D_{ii} }[/math].
  • If [math]\displaystyle{ \,X }[/math] is a [math]\displaystyle{ \,p \times q }[/math] real matrix, then range([math]\displaystyle{ \,X }[/math]) is the set [math]\displaystyle{ \,\{x \in R^p \; : \; x = Xz }[/math], for some [math]\displaystyle{ \,z \in R^q\} }[/math].
  • [math]\displaystyle{ \,O^d }[/math] is the set of [math]\displaystyle{ \,d \times d }[/math] orthogonal matrices.
  • [math]\displaystyle{ \,D^+ }[/math] is the pseudoinverse of a matrix [math]\displaystyle{ \,D }[/math].

Problem formulation

It is assumed that the functions [math]\displaystyle{ \,f_t }[/math] are related to each other, so they all share a small set of features. Formally, the hypothesis is that the functions [math]\displaystyle{ \,f_t }[/math] can be expressed as:

[math]\displaystyle{ \, f_t(x) = \sum_{i=1}^d a_{it}h_i(x), \;\; t \in N_T \;\;\; (1) }[/math]

In the above equation, [math]\displaystyle{ \,h_i : R^d \mapsto R }[/math] are the features and [math]\displaystyle{ \,a_{it} \in R }[/math] are the regression parameters.

The main assumption is that the vast majority of the features have zero coefficients across all the tasks.

For simplicity, the authors focused on linear features, i.e. [math]\displaystyle{ \,h_i(x) = \lt u_i, x\gt }[/math], where [math]\displaystyle{ \,u_i \in R^d }[/math] and [math]\displaystyle{ \,u_i }[/math] are assumed to be orthonormal. They noted that extensions to non-linear functions may be done, for example, by using kernels; however, they did not focus on these extensions in their paper. Furthermore, the functions [math]\displaystyle{ \,f_t }[/math] are also linear, i.e. [math]\displaystyle{ \,f_t(x) = \lt w_t, x\gt }[/math], where [math]\displaystyle{ \, w_t = \Sigma_i \;a_{it} u_i }[/math].

[math]\displaystyle{ \,W }[/math] is the [math]\displaystyle{ \,d \times T }[/math] matrix whose columns are the vectors [math]\displaystyle{ \,w_t }[/math]. [math]\displaystyle{ \,A }[/math] is the [math]\displaystyle{ \,d \times T }[/math] matrix whose elements are the [math]\displaystyle{ \,a_{it} }[/math]. The authors then obtained that [math]\displaystyle{ \,W = UA }[/math] where [math]\displaystyle{ U \in O^d }[/math] .

The assumption that the tasks share a small set of features implies that [math]\displaystyle{ \,A }[/math] has many rows which are identically equal to zero. This, in turn, implies that the corresponding features, which are the columns of [math]\displaystyle{ \,U }[/math], will not be used to represent the task parameters, which are the columns of [math]\displaystyle{ \,W }[/math]. Thus, [math]\displaystyle{ \,W }[/math] is a low-rank matrix.

The authors' goal was to find the feature vectors [math]\displaystyle{ \,u_i }[/math] and the parameters [math]\displaystyle{ \,a_{it} }[/math].

As a starting point, the authors considered the problem in which there is only one task (referred to as task [math]\displaystyle{ \,t }[/math]) and the features [math]\displaystyle{ \,u_i }[/math] are fixed. In this problem, the aim was to learn the parameter vector [math]\displaystyle{ \,a_t \in R^d }[/math] from data [math]\displaystyle{ \,\{(x_{ti}, y_{ti})\}_{i=1}^m }[/math].

This unconstrained problem is expressed as:

[math]\displaystyle{ \, \min_{a_t} \; \{ \; \sum_{i=1}^m L(y_{ti}, \lt a_t,U^T x_{ti}\gt ) + \gamma ||a_t||_1^2 \; : \; a_t \; \in \; R^d \; \} \;\;\; (2) }[/math]

In which, [math]\displaystyle{ \,\gamma \gt 0 }[/math] is the regularization parameter. The fact that using the 1-norm regularization results in a sparse solution to [math]\displaystyle{ \,(2) }[/math] (i.e. many elements of the learned [math]\displaystyle{ \,a_t }[/math] are zero) is well known.

To expand the problem to the multi-task case, the authors introduced the following regularization error function:

[math]\displaystyle{ \, \mathcal{E}(A,U) = \sum_{t=1}^T \sum_{i=1}^m L(y_{ti}, \lt a_t,U^T x_{ti}\gt ) + \gamma ||A||_{2,1}^2 \;\;\; (3) }[/math]

The first term in [math]\displaystyle{ \,(3) }[/math] is the average empirical error across the tasks, and the second term is a regularization term that penalizes the [math]\displaystyle{ \,(2, 1) }[/math] -norm of [math]\displaystyle{ \,A }[/math]. The [math]\displaystyle{ \,(2, 1) }[/math] -norm of [math]\displaystyle{ \,A }[/math] is computed by first computing the 2-norm of the rows of [math]\displaystyle{ \,A }[/math] and then the 1-norm of the vector [math]\displaystyle{ \,b(A) = (\|a^1\|_2,\ldots,\|a^d\|_2) }[/math]. The regularization term in [math]\displaystyle{ \,(3) }[/math] ensures that [math]\displaystyle{ \,b(A) }[/math] will likely be sparse and since each row of [math]\displaystyle{ \,A }[/math] represents a feature this regularization limits the number of common features.

However, since the authors did not simply want to select the features, but rather they also wanted to learn them, they further minimized the function [math]\displaystyle{ \,\mathcal{E} }[/math] over [math]\displaystyle{ \,U }[/math], i.e. they considered the following optimization problem:

[math]\displaystyle{ \,\min \; \{\; \mathcal{E}(A,U) \; : \; U \in O^d, A \in \mathbb{R}^{d \times T} \; \} \;\;\; (4) }[/math]

Solving the above optimization problem, results in learning a low-dimensional representation which is shared across the tasks.

In addition, when [math]\displaystyle{ \,U }[/math] is not learned and is set to [math]\displaystyle{ \,I_{d \times d} }[/math], solving [math]\displaystyle{ \,(4) }[/math] finds a common set of variables across the tasks. Under this condition for [math]\displaystyle{ \,U }[/math], [math]\displaystyle{ \,(4) }[/math] can be re-expressed as the following convex optimization problem:

[math]\displaystyle{ \, \min \; \{\; \sum_{t=1}^T \sum_{i=1}^m L(y_{ti}, \lt a_t, x_{ti}\gt ) + \gamma ||A||_{2,1}^2 \; : \; A \in \mathbb{R}^{d \times T} \; \} \;\;\; (5) }[/math]

Equivalent convex optimization formulation

Equation [math]\displaystyle{ \,4 }[/math] is difficult to solve due to the following two reasons:

  • It is not convex ( although it is separately convex in each of [math]\displaystyle{ \,\{A, U\} }[/math] ).
  • [math]\displaystyle{ \,||A||_{2,1} }[/math] is not smooth, which makes optimization more difficult.

However, [math]\displaystyle{ \,(4) }[/math] can be re-expressed as an equivalent convex problem.

Before doing so, for every [math]\displaystyle{ \,W \in \mathbb{R}^{d \times T} }[/math] and [math]\displaystyle{ \,D \in S_+^d }[/math], the authors defined the following function:

[math]\displaystyle{ \, R(W,D) = \sum_{t=1}^T \sum_{i=1}^m \; L(y_{ti} , \lt w_t,x_ti)\gt ) + \gamma \sum_{t=1}^T \; \lt w_t,D^+w_t\gt \;\;\; (6) }[/math]

Theorem 3.1 (A detailed proof is given in the authors' paper listed in Reference) Problem [math]\displaystyle{ \,(4) }[/math] is equivalent to the following problem:

[math]\displaystyle{ \, \min \; \{ \; R(W,D) \; : \; W \in \mathbb{R}^{d \times T}, \; D \in S_+^d, \; tr(D) \le 1, \; range(W) \subseteq range(D) \; \} \;\;\; (7) }[/math]


That is [math]\displaystyle{ (\hat{A},\hat(U)) }[/math] is an optimal solution for [math]\displaystyle{ \,(4) }[/math] if and only if [math]\displaystyle{ (\hat{W},\hat{D}) = (\hat{U}\hat{A},\hat{U}\textrm{Diag}(\hat{\lambda})\hat{U}^T) }[/math] is an optimal solution for [math]\displaystyle{ \,(7) }[/math], where

[math]\displaystyle{ \hat{\lambda} := \frac{\|a^i\|_2}{\|\hat{A}\|_{2,1}}. }[/math]

Since the rank of [math]\displaystyle{ \,D }[/math] is determined by the number of nonzero eigenvalues, we can see from the above definition of [math]\displaystyle{ \hat{\lambda} }[/math] that the rank of [math]\displaystyle{ \,D }[/math] gives the number of common relevant features shared by the tasks.

By defining the function

[math]\displaystyle{ \,f(w,D) = \begin{cases} w^TD^+w &\text{if } w \in \textrm{range}(D) \;\; \textrm{and} \;\; D \in S_+^d , \\ +\infty &\text{otherwise } . \end{cases} }[/math]

and showing that it is jointly convex in both [math]\displaystyle{ \,W }[/math] and [math]\displaystyle{ \,D }[/math] the authors show (details are available in the authors' paper listed in the Reference section) that the function [math]\displaystyle{ \,R }[/math] in [math]\displaystyle{ \,(6) }[/math] is jointly convex in both [math]\displaystyle{ \,W }[/math] and [math]\displaystyle{ \,D }[/math].

Learning algorithm

The authors solved [math]\displaystyle{ \,(7) }[/math] by alternately minimizing the function [math]\displaystyle{ \,R }[/math] with respect to [math]\displaystyle{ \,D }[/math] and the [math]\displaystyle{ \,w_t }[/math] ( (the [math]\displaystyle{ \,t }[/math]-th column of [math]\displaystyle{ \,W }[/math]).

When [math]\displaystyle{ \,D }[/math] is held fixed, [math]\displaystyle{ \,w_t }[/math] is found by learning its parameters independently using a regularization method such as an SVM. When [math]\displaystyle{ \,w_t }[/math] is held fixed, [math]\displaystyle{ \,D }[/math] is found by solving the following optimization problem:

[math]\displaystyle{ \,\min \; \{ \; \sum_{t=1}^T \; \lt w_t , D^+ w_t\gt \; : \; D \in S_+^d, \; tr(D) \le 1, \; \textrm{range}(W) \subseteq \textrm{range}(D) \; \} \;\;\; (8) }[/math]

Theorem 4.1 (The proof is given in the authors' paper listed in the Reference section) The optimal solution of [math]\displaystyle{ \,(8) }[/math] is [math]\displaystyle{ \,D = \frac {(WW^T)^\frac{1}{2}} {tr(WW^T)^\frac{1}{2}}. \;\;\; (9) }[/math]

Now, we have enough theory to describe an algorithm.

Algorithm 1: Multi-Task Feature Learning

  • Input: [math]\displaystyle{ \{x_{ti}, y_{ti}\}_{i=1}^m, t \in N_t }[/math]
  • Parameters: [math]\displaystyle{ \gamma }[/math]
  • Output [math]\displaystyle{ d \times d }[/math] matrix [math]\displaystyle{ D }[/math] and [math]\displaystyle{ d \times T }[/math] matrix [math]\displaystyle{ W }[/math]
  • Initial Value: [math]\displaystyle{ D = \frac{I_d}{d} }[/math]
  • while convergence condition not met:
    • for [math]\displaystyle{ t = 1,\dots, T }[/math]:
      • find [math]\displaystyle{ w_t = \arg\min \{\sum_{i=1}^m L(y_{ti} , \lt w_t,x_{ti})\gt ) + \gamma \lt w_t,D^+w_t\gt }[/math] with [math]\displaystyle{ w \in R^d }[/math] and [math]\displaystyle{ w \in \textrm{range}(D) }[/math]
    • end for
  • set [math]\displaystyle{ D = \frac{(WW^T)^{1/2}}{tr(WW^T)^{1/2}} }[/math]
  • end while

In [math]\displaystyle{ \,(9) }[/math], [math]\displaystyle{ \,tr(WW^T)^\frac{1}{2} }[/math] is the sum of the singular values of [math]\displaystyle{ \,W }[/math], and thus it is sometimes called the trace norm (this work by Jason D. M. Rennie shows how the trace norm of a matrix can be expressed in a number of different ways). Using the trace norm, [math]\displaystyle{ \,(7) }[/math] is reduced to a regularization problem that depends only on [math]\displaystyle{ \,W }[/math]. However, the trace norm is not smooth. As a result, the authors argue that their algorithm (Algorithm 1) that uses an alternating minimization strategy is beneficial in that it is:

  • simple to implement
  • has a natural interpretation

The authors further note that Algorithm 1 alternately performs a supervised step and an unsupervised step; in the unsupervised step, common representations across the tasks are learned and, in the supervised step, task-specific functions are learned using these representations.

The authors further note that if [math]\displaystyle{ \,D }[/math] in [math]\displaystyle{ \,(7) }[/math] is additionally constrained to be diagonal, then [math]\displaystyle{ \,(7) }[/math] reduces to [math]\displaystyle{ \,(5) }[/math].

Experiments

The authors performed their experiments on a synthetic data set and a real data set, and they used the square loss function and tuned the regularization parameter using leave-one-out cross validation.

Synthetic Experiments

The authors created synthetic data sets by generating [math]\displaystyle{ \,T }[/math] = 200 task parameters [math]\displaystyle{ \,w_t }[/math] from a [math]\displaystyle{ \,5 }[/math]-dimensional Gaussian distribution with zero mean and covariance equal to [math]\displaystyle{ \,Diag(1, 0.25, 0.1, 0.05, 0.01) }[/math]. More details regarding the authors' synthetic data can be found in the paper listed in Reference.

The following figure (taken from the authors' paper listed in Reference) shows the performance of the authors' algorithm for [math]\displaystyle{ \,T = }[/math] 10, 25, 100, and 200 tasks and the performance of 200 independent standard ridge regressions on the data:


File:ps2fig2.jpg


As is also demonstrated by past empirical and theoretical works, the authors' experiment on synthetic data demonstrated that learning multiple tasks together results in significant improvements over learning the tasks independently. Furthermore, this experiment demonstrated the following two key points:

  • The performance of the authors' algorithm improves as the number of tasks increases, i.e. adding more tasks results in better estimates of the underlying features.
  • The performance of the authors' algorithm is moderate for low-dimensionality data, but it increases as the number of irrelevant dimensions increases.

Conjoint analysis experiment

The authors then tested their method using a real data set that contained 180 persons' ratings regarding their propensities for purchasing one of 20 different personal computers. More details regarding this data can be found in the paper listed in Reference. In this data set, the people corresponded to tasks and the PC models corresponded to training instances.

The following figure (taken from the authors' paper listed in Reference) shows that the performance of the authors' algorithm improves as the number of tasks increases.


File:ps2fig3.jpg


The authors also showed that their algorithm performs much better than independent ridge regressions.

The plot in the middle shows that there is a single most important feature that is shared by all people. This feature seems to capture how people weigh the technical characteristics of a computer (such as RAM, CPU, and CD-ROM) against the computer's price. The authors thus concluded that their algorithm is able to find interesting patterns in people’s purchasing decision processes.


Conclusion

The authors have presented an algorithm that learns common sparse function representations across a pool of related tasks. In this algorithm, the number of learned features is not a parameter, and it is instead controlled by the algorithm's regularization parameter. This algorithm also appears to be the first convex optimization formulation for multi-task feature learning.


Reference

Andreas Argyriou. Theodoros Evgeniou. Massimiliano Pontil. Multi-Task Feature Learning.