Independent Component Analysis: algorithms and applications
Revision as of 11:00, 7 July 2009
Motivation
Imagine a room where two people are speaking at the same time and two microphones are used to record the speech signals. Denoting the speech signals by [math]\displaystyle{ s_1(t) \, }[/math] and [math]\displaystyle{ s_2(t)\, }[/math] and the recorded signals by [math]\displaystyle{ x_1(t) \, }[/math] and [math]\displaystyle{ x_2(t) \, }[/math], we can assume the linear relation [math]\displaystyle{ x = As \, }[/math], where [math]\displaystyle{ A \, }[/math] is a parameter matrix that depends on the distances of the microphones from the speakers. The interesting problem of estimating both [math]\displaystyle{ A\, }[/math] and [math]\displaystyle{ s\, }[/math] using only the recorded signals [math]\displaystyle{ x\, }[/math] is called the cocktail-party problem, which is the signature problem for ICA.
Introduction
ICA shows, perhaps surprisingly, that the cocktail-party problem can be solved by imposing two rather weak (and often realistic) assumptions, namely that the source signals are statistically independent and have non-Gaussian distributions. Note that PCA and classical factor analysis cannot solve the cocktail-party problem because such methods seek components that are merely uncorrelated, a condition much weaker than independence.
ICA has many applications in science and engineering. For example, it can be used to find the original components of brain activity by analyzing the electrical recordings given by an electroencephalogram (EEG). Another important application is finding efficient representations of multimedia data for compression or denoising.
Definition of ICA
The ICA model assumes a linear mixing model [math]\displaystyle{ x = As \, }[/math], where [math]\displaystyle{ x \, }[/math] is a random vector of observed signals, [math]\displaystyle{ A \, }[/math] is a square matrix of constant parameters, and [math]\displaystyle{ s \, }[/math] is a random vector of statistically independent source signals. Note that the restriction of [math]\displaystyle{ A \, }[/math] to a square matrix is not theoretically necessary and is imposed only to simplify the presentation. Also keep in mind that the mixing model does not assume any particular distribution for the independent components.
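As a small illustration of the mixing model, the following Python sketch builds two non-Gaussian source signals and mixes them with a fixed square matrix. The function name `mix_sources`, the choice of waveforms, and the entries of `A` are illustrative assumptions and do not come from the text above.

```python
import numpy as np

def mix_sources(n_samples=1000, seed=0):
    """Generate two source signals S and observations X = A @ S."""
    rng = np.random.default_rng(seed)
    t = np.linspace(0.0, 1.0, n_samples)
    s1 = np.sign(np.sin(2 * np.pi * 5 * t))   # square wave (non-Gaussian)
    s2 = rng.uniform(-1.0, 1.0, n_samples)    # uniform noise (non-Gaussian)
    S = np.vstack([s1, s2])                   # shape (2, n_samples)
    A = np.array([[1.0, 0.5],
                  [0.3, 1.0]])                # hypothetical mixing matrix
    X = A @ S                                 # each row is one observed signal
    return A, S, X

A, S, X = mix_sources()
print(X.shape)  # (2, 1000)
```

In this sketch the sources can be recovered exactly with `np.linalg.solve(A, X)` because `A` is known; the ICA problem is to do the same when only `X` is available.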
Ambiguities of ICA
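Because both [math]\displaystyle{ A \, }[/math] and [math]\displaystyle{ s \, }[/math] are unknown, the variances, the signs and the order of the independent components cannot be determined: any scalar multiplier of a source can be cancelled by dividing the corresponding column of [math]\displaystyle{ A \, }[/math] by the same scalar, and the sources can be reordered together with the columns of [math]\displaystyle{ A \, }[/math]. Fortunately, such ambiguities are often insignificant in practice, and ICA can simply assume unit variance for the components and fix their signs arbitrarily.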
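The scaling and ordering ambiguities can be checked numerically. The sketch below, with the hypothetical helper `rescale_and_permute`, builds a different pair of mixing matrix and sources that produces exactly the same observations; the specific numbers are arbitrary.

```python
import numpy as np

def rescale_and_permute(A, S, scales, perm):
    """Return (A2, S2) with A2 @ S2 == A @ S but rescaled, reordered sources."""
    D = np.diag(scales)                 # rescales (and possibly flips) sources
    P = np.eye(len(perm))[perm]         # permutation: (P @ S)[i] = S[perm[i]]
    S2 = D @ P @ S
    A2 = A @ P.T @ np.linalg.inv(D)     # compensating change in the mixing
    return A2, S2

rng = np.random.default_rng(0)
A = np.array([[1.0, 0.5],
              [0.3, 1.0]])
S = rng.uniform(-1.0, 1.0, size=(2, 1000))

A2, S2 = rescale_and_permute(A, S, scales=[-2.0, 0.5], perm=[1, 0])
print(np.allclose(A2 @ S2, A @ S))  # True: the observations are identical
```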
Why Gaussian variables are forbidden
In this section we show that ICA cannot resolve independent components which have Gaussian distributions.
To see this, assume that the two source signals [math]\displaystyle{ s_1 \, }[/math] and [math]\displaystyle{ s_2 \, }[/math] are independent standard Gaussian variables (zero mean, unit variance) and that the mixing matrix [math]\displaystyle{ A\, }[/math] is orthogonal. Then the observed signals [math]\displaystyle{ x_1 \, }[/math] and [math]\displaystyle{ x_2 \, }[/math] have joint density [math]\displaystyle{ p(x_1,x_2)=\frac{1}{2 \pi}\exp\left(-\frac{x_1^2+x_2^2}{2}\right) }[/math], which is rotationally symmetric. In other words, the joint density is the same for any orthogonal mixing matrix, so the observations carry no information about which orthogonal matrix was used. This means that in the case of Gaussian variables, ICA can only determine the mixing matrix up to an orthogonal transformation.
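A quick numerical check of this argument, as a sketch: rotating a pair of independent Gaussian sources leaves the marginal distribution unchanged, whereas rotating a pair of uniform sources changes it. Kurtosis (in the form [math]\displaystyle{ E\{y^4\} - 3(E\{y^2\})^2 \, }[/math]) is used here only as a convenient summary of the marginal shape; the helper names, sample size and rotation angle are assumptions made for illustration.

```python
import numpy as np

def rotate(S, theta):
    """Apply a 2-D rotation (an orthogonal mixing matrix) to the rows of S."""
    c, s = np.cos(theta), np.sin(theta)
    R = np.array([[c, -s],
                  [s,  c]])
    return R @ S

def kurtosis(y):
    """Kurtosis of a centred sample: E[y^4] - 3 (E[y^2])^2."""
    y = y - y.mean()
    return np.mean(y**4) - 3.0 * np.mean(y**2) ** 2

rng = np.random.default_rng(0)
n = 200_000
gauss = rng.standard_normal((2, n))                        # Gaussian sources
unif = rng.uniform(-np.sqrt(3), np.sqrt(3), size=(2, n))   # unit-variance uniform

for name, S in [("gaussian", gauss), ("uniform", unif)]:
    before = kurtosis(S[0])
    after = kurtosis(rotate(S, np.pi / 4)[0])
    print(f"{name}: kurtosis before = {before:.2f}, after rotation = {after:.2f}")
```

For the Gaussian pair both values stay close to zero, so the rotation leaves no visible trace; for the uniform pair the value moves toward zero after rotation, which is what makes the rotation detectable.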
The fact that ICA cannot be used on Gaussian variables is a primary reason for ICA's late emergence in the research literature, because classical factor analysis assumes Gaussian random variables.
Independence is a much stronger requirement than uncorrelatedness. Of particular interest to ICA theory are the following two results, which show that under additional assumptions uncorrelatedness is equivalent to independence.
Result 1: Two random variables [math]\displaystyle{ X \, }[/math] and [math]\displaystyle{ Y \, }[/math] are independent if and only if [math]\displaystyle{ g(X) \, }[/math] and [math]\displaystyle{ h(Y) \, }[/math] are uncorrelated for all bounded continuous functions [math]\displaystyle{ g \, }[/math] and [math]\displaystyle{ h \, }[/math].
Result 2: Two jointly Gaussian random variables [math]\displaystyle{ X \, }[/math] and [math]\displaystyle{ Y \, }[/math] are independent if and only if they are uncorrelated.
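The gap between uncorrelatedness and independence can be seen in a small numerical sketch. Here [math]\displaystyle{ Y = X^2 \, }[/math] is completely determined by [math]\displaystyle{ X \, }[/math], yet the two are (nearly) uncorrelated in the sample; a suitable nonlinear function of [math]\displaystyle{ X \, }[/math] exposes the dependence, in the spirit of Result 1. The example variables and the choice of function are assumptions made for illustration.

```python
import numpy as np

def corr(u, v):
    """Sample correlation coefficient between two 1-D arrays."""
    return np.corrcoef(u, v)[0, 1]

rng = np.random.default_rng(0)
x = rng.uniform(-1.0, 1.0, 100_000)
y = x**2                                  # fully dependent on x

# g is bounded and continuous, and equals x^2 on the support [-1, 1].
g = lambda u: np.minimum(u**2, 1.0)

linear_corr = corr(x, y)                  # close to 0: uncorrelated
nonlinear_corr = corr(g(x), y)            # close to 1: dependence revealed
print(f"corr(X, Y)    = {linear_corr:.3f}")
print(f"corr(g(X), Y) = {nonlinear_corr:.3f}")
```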
ICA Estimation Principles
Principle 1: Nonlinear decorrelation
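From the above discussion, we see that we can estimate the mixing matrix [math]\displaystyle{ A \, }[/math] by finding an unmixing matrix [math]\displaystyle{ W \, }[/math] such that the components of [math]\displaystyle{ y = Wx \, }[/math] satisfy the following: for any [math]\displaystyle{ i \neq j \, }[/math], and suitable nonlinear functions [math]\displaystyle{ g \, }[/math] and [math]\displaystyle{ h \, }[/math], [math]\displaystyle{ g(y_i) \, }[/math] and [math]\displaystyle{ h(y_j) \, }[/math] are uncorrelated. The matrix [math]\displaystyle{ W \, }[/math] then serves as an estimate of [math]\displaystyle{ A^{-1} \, }[/math], up to the scaling and ordering ambiguities.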
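The following sketch illustrates the nonlinear decorrelation criterion on synthetic data. It compares the true unmixing matrix with a rotated version of it: both give linearly uncorrelated outputs, but only the true one also decorrelates [math]\displaystyle{ g(y_1) \, }[/math] and [math]\displaystyle{ h(y_2) \, }[/math]. The choices [math]\displaystyle{ g(u) = u^3 \, }[/math] and [math]\displaystyle{ h(u) = u \, }[/math], the mixing matrix and the rotation angle are assumptions made for the example, and the sketch only evaluates the criterion rather than optimizing it.

```python
import numpy as np

def nonlinear_cov(Y, g=lambda u: u**3, h=lambda u: u):
    """Sample covariance between g(y_1) and h(y_2) for Y of shape (2, n)."""
    a, b = g(Y[0]), h(Y[1])
    return np.mean(a * b) - a.mean() * b.mean()

rng = np.random.default_rng(0)
S = rng.uniform(-np.sqrt(3), np.sqrt(3), size=(2, 100_000))  # unit variance
A = np.array([[1.0, 0.5],
              [0.3, 1.0]])                                   # hypothetical mixing
X = A @ S

theta = np.pi / 8
R = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])
W_good = np.linalg.inv(A)      # true unmixing matrix
W_bad = R @ W_good             # wrong by a rotation

c_good = nonlinear_cov(W_good @ X)
c_bad = nonlinear_cov(W_bad @ X)
lin_bad = np.cov(W_bad @ X)[0, 1]   # ordinary covariance of the wrong outputs

print(f"nonlinear cov, true W:    {c_good:.3f}")
print(f"nonlinear cov, rotated W: {c_bad:.3f}")
print(f"linear cov, rotated W:    {lin_bad:.3f}")
```

The rotated matrix passes the ordinary (linear) decorrelation test but fails the nonlinear one, which is why nonlinear functions are needed to pin down the rotation.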
Principle 2: Maximizing Non-Gaussianity
Loosely speaking, the Central Limit Theorem says that a sum of independent, identically distributed non-Gaussian random variables is closer to Gaussian than the original variables. Because of this, any mixture of the non-Gaussian independent components is more Gaussian than the original signals [math]\displaystyle{ s \, }[/math]. Using this observation, we can find the original signals from the observed signals [math]\displaystyle{ x \, }[/math] as follows: find weighting vectors [math]\displaystyle{ w \, }[/math] such that the projections [math]\displaystyle{ w^T x \, }[/math] are as non-Gaussian as possible.
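A minimal two-dimensional sketch of this principle is given below. It whitens the observations and then scans unit vectors [math]\displaystyle{ w \, }[/math] over a grid of angles, keeping the one whose projection has the largest absolute kurtosis. The whitening step, the grid search and all function names are assumptions made for illustration; practical algorithms such as FastICA use more efficient iterations than a grid search.

```python
import numpy as np

def whiten(X):
    """Centre X (shape (2, n)) and transform it to identity covariance."""
    Xc = X - X.mean(axis=1, keepdims=True)
    cov = Xc @ Xc.T / Xc.shape[1]
    eigval, eigvec = np.linalg.eigh(cov)
    V = eigvec @ np.diag(eigval ** -0.5) @ eigvec.T   # cov^(-1/2)
    return V @ Xc

def kurtosis(y):
    """Kurtosis of a zero-mean sample: E[y^4] - 3 (E[y^2])^2."""
    return np.mean(y**4) - 3.0 * np.mean(y**2) ** 2

def most_nongaussian_direction(Z, n_angles=360):
    """Grid search over unit vectors w for the largest |kurtosis(w^T z)|."""
    thetas = np.linspace(0.0, np.pi, n_angles, endpoint=False)
    W = np.stack([np.cos(thetas), np.sin(thetas)], axis=1)  # (n_angles, 2)
    scores = [abs(kurtosis(w @ Z)) for w in W]
    return W[int(np.argmax(scores))]

rng = np.random.default_rng(0)
S = rng.uniform(-np.sqrt(3), np.sqrt(3), size=(2, 50_000))  # true sources
A = np.array([[1.0, 0.5],
              [0.3, 1.0]])                                  # hypothetical mixing
Z = whiten(A @ S)

w = most_nongaussian_direction(Z)
y = w @ Z                                                   # one recovered component

# The recovered component should match one source up to sign.
match = max(abs(np.corrcoef(y, S[0])[0, 1]),
            abs(np.corrcoef(y, S[1])[0, 1]))
print(f"best |correlation| with a true source: {match:.3f}")
```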
Measures of non-Gaussianity
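Kurtosis
Kurtosis is the classical measure of non-Gaussianity. For a zero-mean random variable [math]\displaystyle{ y \, }[/math] it is defined by [math]\displaystyle{ kurt(y) = E\{y^4\} - 3(E\{y^2\})^2 \, }[/math], which is zero for a Gaussian variable. Positive kurtosis typically implies a pdf with a sharp peak at zero and heavy tails (e.g. the Laplace distribution); negative kurtosis typically implies a flat pdf that is nearly constant near zero and very small at the two ends (e.g. a uniform distribution with finite support).
As a computational measure of non-Gaussianity, kurtosis has the merit that it is easy to compute and has nice linearity properties: for independent [math]\displaystyle{ x_1 \, }[/math] and [math]\displaystyle{ x_2 \, }[/math], [math]\displaystyle{ kurt(x_1 + x_2) = kurt(x_1) + kurt(x_2) \, }[/math], and [math]\displaystyle{ kurt(\alpha x) = \alpha^4 kurt(x) \, }[/math] for any scalar [math]\displaystyle{ \alpha \, }[/math]. On the other hand, it is not robust, because the sample kurtosis, even for a large sample, can be significantly affected by a few outliers.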
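The sensitivity to outliers is easy to reproduce. The sketch below estimates kurtosis from a Gaussian sample and then again after appending a single extreme value; the sample size and the outlier value are arbitrary choices for the illustration.

```python
import numpy as np

def kurtosis(y):
    """Sample kurtosis E[y^4] - 3 (E[y^2])^2 after centring."""
    y = np.asarray(y, dtype=float)
    y = y - y.mean()
    return np.mean(y**4) - 3.0 * np.mean(y**2) ** 2

rng = np.random.default_rng(0)
clean = rng.standard_normal(10_000)        # Gaussian: kurtosis near 0
contaminated = np.append(clean, 20.0)      # the same sample plus one outlier

k_clean = kurtosis(clean)
k_contaminated = kurtosis(contaminated)
print(f"kurtosis without outlier: {k_clean:.2f}")
print(f"kurtosis with one outlier: {k_contaminated:.2f}")
```

A single point out of roughly ten thousand is enough to move the estimate from near zero to a large positive value, because the fourth power amplifies extreme observations.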
Negentropy
Intuitive explanation
Before understanding negentropy, we first have to understand entropy, which is a key concept in information theory. Loosely speaking, entropy is a measure of how "distributed" a random variable is, and a rule of thumb is that a "more distributed" pdf has a higher entropy. An important theorem in information theory states that the Gaussian distribution has the largest entropy among all distributions with the same variance. In informal language, this means the Gaussian distribution is the most "distributed" pdf. Negentropy measures non-Gaussianity by the difference between the entropy of the corresponding Gaussian distribution and the entropy of the pdf in question; this is made precise in the following technical explanation.
Technical explanation
The entropy of a discrete random variable [math]\displaystyle{ X \, }[/math] with possible values [math]\displaystyle{ \{x_1, x_2, ..., x_n\} \, }[/math] is defined as [math]\displaystyle{ H(X) = -\sum_{i=1}^n {p(x_i) \log p(x_i)} }[/math]
The (differential) entropy of a continuous random variable [math]\displaystyle{ X \, }[/math] with probability density function [math]\displaystyle{ f \, }[/math] is similarly defined as [math]\displaystyle{ H[X] = -\int\limits_{-\infty}^{\infty} f(x) \log f(x)\, dx }[/math]
It is obvious how the definition of differential entropy can be extended to higher dimensions.
For a random vector [math]\displaystyle{ y\, }[/math] with covariance matrix [math]\displaystyle{ C \, }[/math], its negentropy is defined as [math]\displaystyle{ J(y) = H(Gaussian_C) - H(y) \, }[/math], where [math]\displaystyle{ Gaussian_C \, }[/math] denotes the Gaussian distribution with covariance matrix [math]\displaystyle{ C \, }[/math]. Note that negentropy is always non-negative and equals zero for a Gaussian distribution.
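For a few one-dimensional distributions the differential entropy has a closed form, so negentropy can be computed exactly. The sketch below does this for the uniform and Laplace distributions at unit variance, using the standard closed-form entropies (natural logarithm); the function names are chosen for this example.

```python
import math

def gaussian_entropy(sigma=1.0):
    """Differential entropy of N(0, sigma^2): 0.5 * log(2 * pi * e * sigma^2)."""
    return 0.5 * math.log(2 * math.pi * math.e * sigma**2)

def uniform_entropy(sigma=1.0):
    """Uniform on [-a, a] has variance a^2 / 3, so a = sqrt(3) * sigma."""
    a = math.sqrt(3.0) * sigma
    return math.log(2 * a)

def laplace_entropy(sigma=1.0):
    """Laplace with scale b has variance 2 * b^2, so b = sigma / sqrt(2)."""
    b = sigma / math.sqrt(2.0)
    return 1.0 + math.log(2 * b)

def negentropy(entropy_fn, sigma=1.0):
    """J = H(Gaussian with the same variance) - H(distribution)."""
    return gaussian_entropy(sigma) - entropy_fn(sigma)

print(f"uniform:  {negentropy(uniform_entropy):.4f}")   # approximately 0.1765
print(f"laplace:  {negentropy(laplace_entropy):.4f}")   # approximately 0.0724
print(f"gaussian: {negentropy(gaussian_entropy):.4f}")  # 0.0000
```

Both non-Gaussian distributions have strictly positive negentropy, and the value does not depend on the chosen variance, since rescaling shifts both entropies by the same amount.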
Empirical estimation of negentropy
In practice, negentropy has to be estimated from a finite sample. There are two main ways to do this. The first approach is to Taylor expand negentropy and keep the lower-order terms. This results in an estimate of negentropy expressed in terms of higher moments (third order and higher) of the pdf. As the estimate involves higher moments, it suffers from the same non-robustness problem faced by kurtosis. The second, and more robust, approach finds the distribution with the maximum entropy that is compatible with the observed sample, and estimates the negentropy of the real (and unknown) distribution by the negentropy of this "entropy-maximizing" distribution. While the second approach is more robust, it is also more computationally involved.
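As a sketch of the first, moment-based approach: one commonly cited approximation for a standardized (zero mean, unit variance) variable is [math]\displaystyle{ J(y) \approx \frac{1}{12} E\{y^3\}^2 + \frac{1}{48} kurt(y)^2 \, }[/math]. This particular formula is not given in the text above and is used here as an assumption, as is the function name; the result is a rough estimate and can differ noticeably from the exact negentropy.

```python
import numpy as np

def negentropy_moment_approx(y):
    """Approximate negentropy from third and fourth sample moments.

    Assumed formula: J(y) ~ E[y^3]^2 / 12 + kurt(y)^2 / 48
    for y standardized to zero mean and unit variance.
    """
    y = np.asarray(y, dtype=float)
    y = (y - y.mean()) / y.std()          # standardize the sample
    skew_term = np.mean(y**3) ** 2 / 12.0
    kurt = np.mean(y**4) - 3.0            # kurtosis of a unit-variance sample
    return skew_term + kurt**2 / 48.0

rng = np.random.default_rng(0)
j_gauss = negentropy_moment_approx(rng.standard_normal(100_000))
j_laplace = negentropy_moment_approx(rng.laplace(0.0, 1.0, 100_000))
print(f"Gaussian sample: {j_gauss:.5f}")
print(f"Laplace sample:  {j_laplace:.5f}")
```

The estimate is close to zero for the Gaussian sample and clearly positive for the Laplace sample. Because it is built from third and fourth powers, it inherits the outlier sensitivity of kurtosis.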
A brief history of ICA
The technique of ICA was first introduced in 1982 in a simplified model of motion coding in muscle contraction, where the original signals were the angular position and velocity of a moving joint and the observed signals were the measurements from two types of sensors measuring muscle contraction. Throughout the 1980s, ICA was mostly known among French researchers but not among the international research community. Many ICA algorithms were developed from the early 1990s onward, though ICA remained a small and narrow research area until the mid-1990s. The breakthrough happened between the mid-1990s and the late 1990s, when a number of very fast ICA algorithms, of which FastICA was one, were developed so that ICA could be applied to large-scale problems. Since 2000, many international workshops and papers have been devoted to ICA research, and ICA has become an established and mature field of research.
Kernel ICA <ref>Bach and Jordan (2002). Kernel Independent Component Analysis. Journal of Machine Learning Research, 3, 1-48.</ref>
Bach and Jordan (2002) extended ICA to use functions in a reproducing kernel Hilbert space (RKHS), rather than a single nonlinear function as in the earliest works. To do so, they used canonical correlation, that is, the correlation of feature maps of the multivariate random variables under the kernel associated with the RKHS, rather than the direct correlation of the random variables themselves.
References
<references/>