# independent Component Analysis: algorithms and applications

## Contents

## Motivation

Imagine a room where two people are speaking at the same time and two microphones are used to record the speech signals. Denoting the speech signals by [math]s_1(t) \,[/math] and [math]s_2(t)\,[/math] and the recorded signals by [math] x_1(t) \,[/math] and [math]x_2(t) \,[/math], we can assume the linear relation [math]x = As \,[/math], where [math]A \,[/math] is a parameter matrix that depends on the distances of the microphones from the speakers. The interesting problem of estimating both [math]A\,[/math] and [math]s\,[/math] using only the recorded signals [math]x\,[/math] is called the *cocktail-party problem*, which is the signature problem for **ICA**.

## Introduction

**ICA** shows, perhaps surprisingly, that the *cocktail-party problem* can be solved by imposing two rather weak (and often realistic) assumptions, namely that the source signals are statistically independent and have non-Gaussian distributions. Note that PCA and classical factor analysis cannot solve the *cocktail-party problem* because such methods seek components that are merely uncorrelated, a condition much weaker than independence. The independent assumption gives us an advantage that singals obtained form non-linear transformation of the source signals are uncorrelated While it is not true when source signals are merely uncorrelated. These two assumptions also give us an objective in finding matrix [math]\ A[/math], that is, we want to find components which are as statistically independent and non-Gaussian as possible.

**ICA** has a lot of applications in science and engineering. For example, it can be used to find the original components of brain activity by analyzing electrical recordings of brain activity given by electroencephalogram (EEG). Another important application is to efficient representations of multimedia data for compression or denoising.

**Relationship with Dimension Reduction**<ref>A. Hyvärinen, J. Karhunen, E. Oja (2001): Independent Component Analysis, New York: Wiley, ISBN 978-0-471-40540-5 Introductory chapter</ref>

Suppose we have [math]n[/math] oberserved signals [math]\ x_i[/math] where [math]\ i=1,...,n[/math] from mixing [math]\ m[/math] source signals [math]\ y_i[/math], where[math]\ i=1,...,m[/math],

we want to find such a transformation matrix [math]\ W[/math], that for a given number of dimensions [math]\ d[/math]

[math]\ y'=Wx[/math], where [math]\ y'[/math] is a [math]\ d \times 1[/math] vector.

the transformed variable [math]\ y'_i[/math] is considered the component explaning the essential structure of the observed data. These components should contain as much as possible information of the observed data.

**Concerns**

The *cocktail-party problem* or *blind source separation problem* means that we don't have information about the source signal. In the ICA setting, it seems that the number of observed signals and the number of source signals are equal. However, in general, the number of sensors could be less than the number of sources. In an extreme case, we can have only one sensor but several sources. For example, we can have one microphone recording two speeches. Given a mixed signal, could we separate it? This is one of the applicaitons of the paper by Francis R. Bach and Michael I. Jordan Learning Spectral Clustering, With Application To Speech Separation . One of the concern of ICA is that if this is the case, where the matrix [math]\ A[/math] is not square, can it demixs the siganls? The other is that if the observed signals are quite different from each other, will it cause difficulty in applying ICA?

### Definition of ICA

The **ICA** model assumes a linear mixing model [math] x = As \,[/math], where [math]x \,[/math] is a random vector of observed signals, [math]A \,[/math] is a square matrix of constant parameters, and [math]s \,[/math] is a random vector of statistically independent source signals. Each component of [math]s[/math] is a source signal. Note that the restriction of [math] A \,[/math] being square matrix is not theoretically necessary and is imposed only to simplify the presentation. Also keep in mind that in the mixing model we do not assume any distributions for the independent components.

### Ambiguities of ICA

Because both [math]A \,[/math] and [math]s \,[/math] are unknown, it is easy to see that the variances, the sign or the order of the independent components cannot be determined. That's because any change to the scale or order of sources can be canceled by appropriate changes to the matrix [math]A[/math]. Fortunately such ambiguities are often insignificant in practice and **ICA** can as well just fix the sign and assume unit variance of the components.

### Why Gaussian variables are forbidden

In this section we show that **ICA** cannot resolve independent components which have Gaussian distributions.

To see this, assume that the two source signals [math]s_1 \,[/math] and [math]s_2 \,[/math] are Gaussian and the mixing matrix [math]A\,[/math] is orthogonal. Then the observed signals [math]x_1 \,[/math] and [math]x_2 \,[/math] will have joint density given by [math]p(x_1,x_2)=\frac{1}{2 \pi}\exp(-\frac{x_1^2+x_2^2}{2})[/math], which is rotationally symmetric. In other words, the joint density is be the same for **any** orthogonal mixing matrix. This means that in the case of Gaussian variables, **ICA** can only determine the mixing matrix up to an orthogonal transformation.

The fact that **ICA** cannot be used on Gaussian variables is a primary reason of ICA's late emergence in the research literature because classical factor analysis assumes Gaussian random variables.

In the real world, we may face a distribution close to the Gaussian distribution such as Student t distribution. The question is what will happen to the ICA in these situations? If it cannot resolve these problems, isn't it too restrictive?

Two random variables [math]\,y_1, y_2[/math] are independent if information on [math]\,y_1[/math] doesn't give any information on [math]\,y_2[/math], and vice versa. In math words, [math]\,y_1, y_2[/math] are independent if the joint probability density function can be written as the multiplication of each probability denity function:

[math]\,p(y_1, y_2) = p_1(y_1)*p_2(y_2)[/math]

An important property of independent variables [math]y_1[/math] and [math]y_2[/math] is that for any two functions [math]h_1[/math] and [math]h_2[/math] we have [math]E\{h_1(y_1)h_2(y_2)\}=E\{h_1(y_1)\}E\{h_2(y_2)\}[/math].

Two random variables [math]\,y_1, y_2[/math] are uncorrelated if the covariance is zero:

[math]\,E(y_1y_2) - E(y_1)E(y_2)=0[/math]

Independence is a much stronger requirement than uncorrelatedness. Of particular interest to ICA theory is the following two results which show that with additional assumptions, uncorrelatedness is equivalent to independence.

**Result 1:** Two random variables [math]X \,[/math] and [math]Y \,[/math] are independent if and only if any bounded continuous functions of [math]X \,[/math] and [math]Y \,[/math] are uncorrelated.

**Result 2:** Two Gaussian random variables [math]X \,[/math] and [math]Y \,[/math] are independent if and only if they are uncorrelated.

### Data Whitening

Data whitening is a transformation to change the covariance matrix of a set of samples into the identity matrix. In other words, it decorrelates the random variables of the samples. These random variables have the same variance as the originals.

## Fuether preprocessing

The success of ICA for a given data set may depend crucially on performing some application-dependent preprocessing steps. For example, if the data consists of time-signals, some band-pass filtering may be very useful. Note that if we filter linearly the observed signals [math] x_i^*(t)[/math], the ICA model still holds for [math] x_i^*(t)[/math], with the same maixing matrix.

This can be seen as follows. Denote by [math] \textbf{X} [/math] the matrix that contains the observations [math] \textbf{x(1), x(2), ..., x(T)}[/math] as its columns, and similarly for [math] \textbf{S}[/math]. Then the ICA model can be expressed as:

No, time filtering of [math]\textbf{X} [/math] corresponds to multiplying [math]\textbf{X} [/math] from the right by a matrix, let us call it [math]\textbf{M} [/math]. This gives

showing that the ICA model remains valid.

### ICA Estimation Principles

#### Principle 1: Nonlinear decorrelation

From the above discussion, we see that we can estimate the mixing matrix [math]A \,[/math] by finding a matrix [math]W \,[/math] such that for any [math] i \neq j \,[/math], and suitable nonlinear functions [math]g \,[/math] and [math]h \,[/math], [math]g(y_i) \,[/math] and [math]h(y_j) \,[/math] are uncorrelated.

#### Principle 2: Maximizing Non-gaussanity

Loosely speaking, the Central Limit Theorem says that the sum of identically distributed non-gaussian random variables are closer to gaussian than the original ones. Because of this, any mixing of the identically distributed non-gaussian independent components would be more gaussian than the original signals [math] s \,[/math]. Using this observation, we can find the original signals from the observed signals [math]x \,[/math] as follows: find the weighting vectors [math]w \,[/math] such that the [math]w^T x \,[/math] are the most non-gaussian.

## Measures of non-Gaussianity

### kurtosis

Kurtosis is the classical measure of non-Gaussianity which is defined by [math]kurt(y) = E\{y^4\} - 3(E\{y^2\})^2. \,[/math]. Positive kurtosis typically implies a spiky pdf near zero and heavy tails at the two ends. (e.g. Laplace distribution); Negative kurtosis typically implies a flat pdf which is rather constant near zero, and very small at the two ends. (e.g. uniform distribution with finite support)

As a computational measure for non-gaussanity, kurtosis, on one hand, has the merit that it is easy to compute and has nice linearity properties. On the other hand, it is non-robust because kurtosis for a large sample size can be significantly affected by a few outliers in the sample.

### negentropy

#### Intuitive explanation

Before understanding negentropy, we have to first understand entropy, which is a key concept in information theory. Loosely speaking, entropy is a measure of how "distributed" a random variable is, and a rule of thumb is that a "more distributed" pdf has a higher entropy. An important theorem in information theory states that the Gaussian distribution has the largest entropy among all distributions with the same variance. In informal language, this means the Gaussian distribution is the most "distributed" pdf. Negentropy measures non-gaussianity by the differences in entropy of a pdf with the corresponding Gaussian distribution - this would be make precise in the following technical explanation.

#### Technical explanation

The entropy of a discrete random variable [math]X \,[/math] with possible values [math]\{x_1, x_2, ..., x_n\} \,[/math] is defined as [math]H(X) = -\sum_{i=1}^n {p(x_i) \log p(x_i)}[/math]

The (differential) entropy of a continuous random variable [math]X \,[/math] with probability density function [math]f \,[/math] is similarly defined as [math]H[X] = -\int\limits_{-\infty}^{\infty} f(x) \log f(x)\, dx[/math]

It is obvious how the definition of differential entropy can be extended to higher dimensions.

For a random vector [math]y\,[/math] with covariance matrix [math]C \,[/math], its negentropy is defined as [math] J(y) = H(Gaussian_C) - H(y) \,[/math], where [math]Gaussian_C \,[/math] denotes the Gaussian distribution with covariance matrix [math]C \,[/math]. Note that Negentropy is always non-negative and equals zero for a Gaussian distribution.

#### Empirical estimation of negentropy

In practice, negentropy has to be estimated from a finite sample. There are two main ways to do this. The first approach is to Taylor expand negentropy and take the lower order-terms. This would result in an estimation of negentropy expressed in higher moments(3rd degree and higher) of the pdf. As the estimation involves higher moments, this suffers from the same non-robustness problem faced by kurtosis. The second, and more robust, approach finds the distribution with the maximum entropy that is compatible with the observed sample, and estimates the negentropy of the real (and unknown) distribution by the negentropy of the "entropy-maximizing" distribution. While the second approach is more robust, it is also more computationally involved.

## A brief history of ICA

The technique of ICA was first introduced in 1982 in a simplified model of motion coding in muscle contraction, where the original signals were the angular position and velocity of a moving joint and the observed signals were the measurements from two types of sensors measuring muscle contraction. Throughout the 1980s, ICA was mostly known among French researchers but not among the international research community. A lot of ICA algorithms got developed since early 1990s, though ICA still remained a small and narrow research area until mid-1990s. The breakthrough happened between mid-1990s and late-1990s during which a number of very fast ICA algorithms, of which FastICA was one, were developed so that ICA can be applied to large scaled problem. After 2000, a lot of international workshops and papers have been devoted to ICA research and ICA has now become an established and mature field of research.

## Kernel ICA <ref> Bach and Jordan,(2002); Kernel Independent Component Analysis. Journal of Machine Learning Research, 3; 1-48</ref>

Bach and Jordan (2002) extended the ICA to functions in Reproducing kernel Hilbert Space (RKHS) rather than a single nonlinear function; as it was considered in the earliest works. To do so, they used Canonical Correlation - correlation of feature maps of multivariate random variable using kernel associated with the RKHS - rather than direct correlation of the considered random variables.

## Applications

ICA has been applied to lins source separation problem in signal processing but tit is also an important research topic in many areas such as biomedical engineering,medical imaging, speech enhancement,remote sensing, communication systems, exploration seismology, geophysics, econometrics, data mining, etc.

Suppose we have the cashflow of several stores belonging to the same retail chain. The goal is to find the fundamental factors that are common to all stroes and affect the cashflow. In this case, factors like seasonal variation, and prize changes of various coomodities affects on all stores **independently**. This is a work from Kiviluoto and Oja(1998)<ref>Kiviluoto, K. & Oja, E. (1998). Independent component analysis for parallel financial time series. Proceeding of the international comference on neural information processing(ICONIP'98) Vol.2 (pp. 895-898). Tokyo, Japan.</ref> applying ICA on the cashflow problem. However, I (as a general contributor) am still thinking that factor independency is a strong and so unreal assumption in this particular case. Imagine a case where prize changes of variuos commodities is mixed with seasonal variations.

## References

<references/>