relevant Component Analysis
First paper: Shental et al., 2002 [1]
Irrelevant data variability often causes difficulties in classification and clustering tasks. For example, when data variability is dominated by environment conditions, such as global illumination, nearest-neighbour classification in the original feature space may be very unreliable. The goal of Relevant Component Analysis (RCA) is to find a transformation that amplifies relevant variability and suppresses irrelevant variability.
- Definition of irrelevant variability: We say that data variability is correlated with a specific task "if the removal of this variability from the data deteriorates (on average) the results of clustering or retrieval" [1]. Variability is irrelevant if it is "maintained in the data" but "not correlated with the specific task" [1].
To achieve this goal, Shental et al. introduced the idea of chunklets – "small sets of data points, in which the class label is constant, but unknown" [1]. As we will see, chunklets allow irrelevant variability to be suppressed without needing fully labelled training data. Since the data come unlabelled, the chunklets "must be defined naturally by the data": for example, in speaker identification, "short utterances of speech are likely to come from a single speaker" [1]. The authors coin the term adjustment learning to describe learning using chunklets; adjustment learning can be viewed as falling somewhere between unsupervised learning and supervised learning.
Relevant Component Analysis tries to find a linear transformation W of the feature space such that the effect of irrelevant variability is reduced in the transformed space. That is, we wish to rescale the feature space and reduce the weights of irrelevant directions. The main premise of RCA is that we can reduce irrelevant variability by reducing the within-class variability. Intuitively, a direction which exhibits high variability among samples of the same class is unlikely to be useful for classification or clustering.
RCA assumes that the class covariances are all equal. If we allow this assumption, it makes sense to rescale the feature space using a whitening transformation based on the common class covariance Σ. This gives the familiar transformation W = VΛ-1/2, where V and Λ can be found by the singular value decomposition of Σ.
With labelled data estimating Σ is straightforward, but in RCA labelled data is not available and an approximation is calculated using chunklets. The chunklet scatter matrix is calculated by
- [math]\displaystyle{ S_{ch} = \frac{1}{|\Omega|}\sum_{n=1}^N|H_n|Cov(H_n) }[/math]
where |Ω| is the size of the data set, Hn is the nth chunklet, |Hn| is the size of the nth chunklet, and N is the number of chunklets.
Intuitively, this is a weighted average of the chunklet covariances, with weight proportional to the size of the chunklet.
The steps of the RCA algorithm are as follows:
- "1. Calculate Sch... Let r denote its effective rank (the number of singular values of Sch which are significantly larger than 0).
- 2. Compute the total covariance (scatter) matrix of the original data ST, and project the data using PCA to its r largest dimensions.
- 3. Project Sch onto the reduced dimensional space, and compute the corresponding whitening transformation W.
- 4. Apply W to the original data (in the reduced space)." [1]
Experimental Results: Face Recognition
The authors demonstrated the performance of RCA for the task of face recognition using the yaleA database. The database contains 155 face images of 15 people; lighting conditions and facial expression are varied across images. RCA is compared with the Eigenface method (based on PCA) and the Fisherface method (based on Fisher’s Linear Discriminant) for both nearest neighbour classification and clustering-based classification. In this dataset, the data is not naturally divided into chunklets, so the authors randomly sample chunklets given the ground-truth class (for example, if an individual is represented in 10 images, two chunklets may be formed by randomly partitioning the images into two groups of 5 images.)
For nearest neighbour classification, RCA outperforms Eigenface but does slightly worse than Fisherface. For clustering, RCA performs better than Eigenface and comparably to Fisherface. The authors pointed out that these experimental results are encouraging as Fisherface is a supervised method.
Experimental Results: Surveillance
In a second experiment, the authors used surveillance video footage divided into discrete clips in which a single person is featured. The same person can appear in multiple clips, and the task was to retrieve all clips in which a query person appears. A colour histogram is used to represent a person. Sources of irrelevant variation include reflections, occlusions, and illumination. In this experiment, the data does come naturally in chunklets: each clip features a single person, so frames in the same clip from a chunklet. Figure 7 in the paper shows the results of k-nearest neighbour classification (not reproduced here for copyright reasons).
Second Paper: Bar-Hillel et al., 2003 [2]
In a subsequent work [2], Bar-Hillel et al. described how RCA can be shown to optimize an information theoretic criterion, and compared the performance of RCA with the approach proposed by Xing et al. [3].
Information Maximization
According to information theory, "when an input X is transformed into a new representation Y, we should seek to maximize the mutual information I(X, Y) between X and Y under suitable constraints" [2]. In adjustment learning, we can think of the objective to be to keep chunklet points close to each other in the transformed space. More formally:
- [math]\displaystyle{ \max_{f \in F}I(X,Y) \quad s.t. \quad \frac{1}{p}\sum_{j=1}^k\sum_{i=1}^{n_j}||y_{ji} - m_j^y||^2 \le K }[/math]
where f is a transformation function, mjy is the mean of chunklet j in the transformed space, p is the total number of chunklet points, and K is a constant.
To maximize I(X,Y), we can simply maximize the entropy of Y, H(Y). This is because I(X,Y) = H(Y) – H(Y|X), and H(Y|X) is constant since the transformation is deterministic. Intuitively, since the transformation is deterministic there is no uncertainty in Y if X is known.
Now we would like to express H(Y) in terms of H(X). If the transformation is invertible, we have py(y) = px(x) / |J(x)|, where J(x) is the Jacobian of the transformation. Therefore,
- [math]\displaystyle{ \begin{align} H(Y) & = -\int_y p(y)\log p(y)\, dy \\ & = -\int_x p(x) \log \frac{p(x)}{|J(x)|} \, dx \\ & = H(X) + \langle \log |J(x)| \rangle_x \end{align} }[/math]
Assuming a linear transformation Y = AX, the Jacobian is simply a constant |A|. So to maximize I(X,Y), we can maximize H(Y), and maximizing H(Y) amounts to maximizing |A|. Hence, the optimization objective can be updated as
- [math]\displaystyle{ \max_A |A| \quad s.t. \quad \frac{1}{p}\sum_{j=1}^k\sum_{i=1}^{n_j}||x_{ji} - m_j||^2_{A^tA} \le K }[/math]
This can also be expressed in terms of the Mahalanobis distance matrix B = AtA as follows, noting that log |A| = (1/2) log |B|.
- [math]\displaystyle{ \max_B |B| \quad s.t. \quad \frac{1}{p}\sum_{j=1}^k\sum_{i=1}^{n_j}||x_{ji} - m_j||^2_B \le K , \quad B \gt 0 }[/math]
The solution to this problem is [math]\displaystyle{ B = \tfrac{K}{N} \hat{C}^{-1} }[/math], where [math]\displaystyle{ \hat{C} }[/math] is the chunklet scatter matrix calculated in Step 1 of RCA. Thus, RCA gives the optimal Mahalanobis distance matrix up to a scale factor.
Within-Chunklet Distance Minimization
In addition, RCA minimizes the sum of within-chunklet squared distances. If we consider the optimization problem
- [math]\displaystyle{ \min_B \frac{1}{p}\sum_{j=1}^k\sum_{i=1}^{n_j}||x_{ji} - m_j||^2_B \quad s.t. \quad |B| \ge 1 }[/math]
then it can be shown that RCA once again gives the optimal Mahalanobis distance matrix up to a scale factor. This property suggests a natural comparison with Xing et al.’s method, which similarly learns a distance metric based on similarity side information. Xing et al.’s method assumes side information in the form of pairwise similarities and dissimilarities, and seeks to optimize
- [math]\displaystyle{ \min_B \sum_{(x_1,x_2) \in S} ||x_1 - x_2||^2_B \quad s.t. \sum_{(x_1,x_2) \in D} ||x_1 - x_2||_B \ge 1 , \quad B \ge 0 }[/math]
where S contains similar pairs and D contains dissimilar pairs. Comparing to the preceding optimization problem, if all chunklets have size 2 (i.e. the chunklets are just pairwise similarities), the objective function is the same up to a scale factor.
The authors compared the clustering performance of RCA with Xing et al.’s method using six of the UC Irvine datasets. Clustering performance was measured using a normalized accuracy score defined as
- [math]\displaystyle{ \sum_{i \gt j}\frac{1 \lbrace 1 \lbrace c_i = c_j \rbrace = 1 \lbrace \hat{c}_i = \hat{c}_j \rbrace \rbrace}{0.5m(m-1)} }[/math]
where 1{ } is the indicator function, [math]\displaystyle{ \hat{c} }[/math] is the assigned cluster, and c is the true cluster. The score may be interpreted as the probability of correctly assigning two randomly drawn points.
Overall, RCA yielded an improvement over regular K-means and showed comparable performance to Xing et al.’s method, however RCA is more computationally efficient as it works with closed-form expressions while Xing et al.’s method requires iterative gradient descent.
Suggestions/Critique
- RCA makes effective use of limited side information in the form of chunklets, however in most applications the data does not naturally come in chunklets. Indeed, in the face recognition experiments, the authors had to make use of prior information to artificially create chunklets. It may be useful if the authors provided additional examples of applications where data is naturally partitioned into chunklets, to further motivate the applicability of RCA.
- RCA also assumes equal class covariances, which might limit its performance on many real-world datasets.
- In the UC Irvine experiments, RCA shows similar performance to Xing et al.’s method, but the authors noted that RCA is more computationally efficient. While they make a sensible logical argument (iterative gradient descent tends to be computationally expensive), providing experimental running times may help support and quantify this claim.
Kernel RCA
Although RCA, computationally and technically, has significant advantages, there are some kind of situations for real problems that RCA fails to deal with them, i.e there are some restrictions along with RCA.
(i)- RCA only considers linear transformations and fails for nonlinear transformations (even for simple ones)
(ii)- since RCA acts in the input space, its number of parameters depends on the dimensionality of the feature vectors
(iii)- RCA requires the vectorial representation of data, which may not be possible for some kind of data to be naturally in this form; like protein sequences.
To overcome this restrictions Tesang and colleagues (2005) suggested to use kernel in RCA and showed how one can kernelize RCA.
Kernelizing RCA
For [math]\displaystyle{ k }[/math] given chunklets, each containing [math]\displaystyle{ n_{i} }[/math] patterns [math]\displaystyle{ \left\{x_{i,1},...,x_{i,n_{i}} \right\} }[/math] the covariance matrix of centered patterns is as follow:
[math]\displaystyle{ C=\frac{1}{n}\sum_{i=1}^{k}\sum_{j=1}^{n_{i}} \left(x_{i,j}-\bar{x}_{i}\right)\left(x_{i,j}-\bar{x}_{i}\right)^{'} }[/math]
and the associated whitening transform is as
[math]\displaystyle{ x\stackrel{}{\rightarrow}C^{-\frac{1}{2}}x }[/math]
Now let [math]\displaystyle{ \left\{x_{1,1},x_{1,2},...,x_{1,n_{1}},...,x_{k,1},...,x_{k,n_{k}} \right\} }[/math] then C can be written as:
[math]\displaystyle{ C=\frac{1}{n}\sum_{i=1}^{k}\sum_{j=1}^{n_{i}} \left(x_{i,j}-\frac{1}{n_{i}}X1{i}\right)\left(x_{1,i}-\frac{1}{n_{i}}X1{i}\right)^{'} }[/math]
where [math]\displaystyle{ 1_{i} }[/math] is [math]\displaystyle{ n \times 1 }[/math] vector such that:
[math]\displaystyle{ [1_{i}]_{j}= \left\{\begin{matrix} 1 & \text{patern} j \in \text{chunklet} i \\ 0 & \text{otherwise} \end{matrix}\right. }[/math]
and [math]\displaystyle{ I_{i}=diag\left(1_{i}\right) }[/math].
using the above notations C can be simplified to the form [math]\displaystyle{ C=\frac{1}{n}XHX^{'} }[/math]
where [math]\displaystyle{ H=\sum_{i=1}^{k}\left(I_{i}-\frac{1}{n_{i}}1_{i}1_{i}^{'}\right) }[/math]
for the issue of non-singularity, for small [math]\displaystyle{ \epsilon }[/math] let [math]\displaystyle{ \hat{C}=C+\epsilon I }[/math] then we can find the inverse of [math]\displaystyle{ \hat{C} }[/math] which is
[math]\displaystyle{ \hat{C}^{-1}=\frac{1}{\epsilon}I-\frac{1}{n \epsilon^{2}}XH \left(X^{'}XH \right)^{-1}X^{'} }[/math]
Therefor the the inner product of transformed [math]\displaystyle{ x }[/math] and [math]\displaystyle{ y }[/math] is
<math> \left(\hat{C}^{-\frac{1}{2}}x\right)
References
[1] N. Shental, T. Hertz, D. Weinshall, and M. Pavel, "Adjustment Learning and Relevant Component Analysis," Proc. European Conference on Computer Vision (ECCV), 2002, pp. 776-790.
[2] A. Bar-Hillel, T. Hertz, N. Shental, and D. Weinshall, "Learning Distance Functions using Equivalence Relations," Proc. International Conference on Machine Learning (ICML), 2003, pp. 11-18.
[3] E. Xing, A. Ng, M. Jordan, and S. Russell, "Distance metric learning with application to clustering with side-information", Advances in Neural Information Processing Systems, 2002.