Graph Laplacian Regularization for Large-Scale Semidefinite Programming


Introduction

This paper<ref>K. Q. Weinberger et al. Graph Laplacian Regularization for Large-Scale Semidefinite Programming, Advances in Neural Information Processing Systems, 2007.</ref> introduces a new approach for discovering low-dimensional representations of high-dimensional data when, as in many cases, local proximity measurements are also available. Sensor localization is a well-known example of this setting. Existing approaches solve semidefinite programs (SDPs) whose low-rank solutions are obtained by convex optimization. However, the SDP approach does not scale well to large inputs. The main contribution of this paper is a matrix factorization that turns such problems into much smaller and faster SDPs than previous approaches. The factorization comes from expanding the solution of the original problem in terms of the bottom eigenvectors of a graph Laplacian. As the smaller SDPs obtained from this factorization only approximate the original problem, their solutions can be refined by gradient descent. The approach is illustrated on the localization of large-scale sensor networks.

Sensor localization

Assuming sensors can only estimate their local pairwise distances to nearby sensors via radio transmission, the problem is to recover the whole network topology. In other words, given n sensors and estimates [math]\displaystyle{ d_{ij} }[/math] of the local distances between adjacent sensors i and j, the desired output is the planar coordinates [math]\displaystyle{ x_1, x_2, ..., x_n \in R^2 }[/math] of the sensors.
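To make the setup concrete, the following is a minimal sketch (in Python with numpy; the function name, radius, and noise level are illustrative choices, not taken from the paper) of how such an input might be simulated: sensors placed in the unit square, with noisy distance estimates available only for nearby pairs.

<pre>
import numpy as np

def simulate_network(n=500, radius=0.1, noise=0.1, seed=0):
    """Place n sensors uniformly in the unit square and return noisy
    local distance estimates d[i, j] for pairs within `radius`."""
    rng = np.random.default_rng(seed)
    x = rng.uniform(0.0, 1.0, size=(n, 2))           # true planar coordinates
    diff = x[:, None, :] - x[None, :, :]
    dist = np.linalg.norm(diff, axis=2)               # true pairwise distances
    adjacent = (dist > 0) & (dist <= radius)          # i ~ j neighborhood graph
    # corrupt each local measurement with Gaussian noise proportional to it
    d = np.where(adjacent, dist * (1 + noise * rng.standard_normal((n, n))), 0.0)
    d = np.triu(d) + np.triu(d, 1).T                  # symmetrize the estimates
    return x, adjacent, d
</pre>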

Previous work

Previous work on this problem starts by minimizing the sum-of-squares loss function
[math]\displaystyle{ \,\min_{x_1,...,x_n}\Sigma_{i\sim j}{(\|x_i-x_j\|^2-d_{ij}^2)^2} }[/math] (1)
and adding a centering constraint (assuming no sensor location is known in advance):
[math]\displaystyle{ \,\|\Sigma_i{x_i}\|^2 = 0 }[/math] (2)
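Eq. (1) is straightforward to write down directly. Below is a minimal numpy sketch (building on the simulated data above; the names are ours) that is also reused later as the target of the gradient-based refinement.

<pre>
def loss(x, adjacent, d):
    """Sum-of-squares loss of eq. (1): x is an (n, 2) array of candidate
    sensor locations; `adjacent` and `d` are as returned by simulate_network."""
    i, j = np.nonzero(np.triu(adjacent))              # each edge i ~ j counted once
    sq = np.sum((x[i] - x[j]) ** 2, axis=1)           # ||x_i - x_j||^2
    return np.sum((sq - d[i, j] ** 2) ** 2)
</pre>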
The problem with this formulation is that the optimization is not convex and is likely to be trapped in local minima. To address this, an [math]\displaystyle{ n \times n }[/math] inner product matrix X is defined by [math]\displaystyle{ X_{ij} = x_i \cdot x_j }[/math], and by relaxing the constraint that the sensor locations [math]\displaystyle{ x_i }[/math] lie in the [math]\displaystyle{ R^2 }[/math] plane, the following convex optimization is obtained:
Minimize: [math]\displaystyle{ \,\Sigma_{i\sim j}{(X_{ii}-2X_{ij}+X_{jj}-d_{ij}^2)^2} }[/math] (3)
subject to: (i) [math]\displaystyle{ \,\Sigma_{ij}{X_{ij}=0} }[/math] and (ii) [math]\displaystyle{ X \succeq 0 }[/math]
The vectors [math]\displaystyle{ \,x_i }[/math] will lie in a subspace with dimensionality equal to the rank of the solution X. Projecting the [math]\displaystyle{ x_i }[/math]s onto their two-dimensional subspace of maximum variance, obtained from the top 2 eigenvectors of X, yields planar coordinates. However, the higher the rank of X, the greater the information loss after this projection. To keep the projection error small we therefore prefer low-rank solutions, which are encouraged by maximizing the variance (trace) of the solution. Accordingly, an extra term is added to favor solutions with high variance (high trace):
Maximize: [math]\displaystyle{ \,tr(X)-v\Sigma_{i\sim j}{(X_{ii}-2X_{ij}+X_{jj}-d_{ij}^2)^2} }[/math] (4)
subject to: (i) [math]\displaystyle{ \,\Sigma_{ij}{X_{ij}=0} }[/math] and (ii) [math]\displaystyle{ \,X \succeq 0 }[/math]
where the parameter [math]\displaystyle{ v\gt 0 }[/math] balances the trade-off between maximizing variance and preserving local distances. This is the maximum variance unfolding (MVU) formulation.
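For small networks, the relaxation of eqs. (3)-(4) can be prototyped directly with an off-the-shelf SDP modelling tool. The sketch below assumes cvxpy and the `adjacent` and `d` arrays from the earlier snippets, with `nu` standing in for the parameter v of eq. (4); it is only practical for small n, which is exactly the scalability problem the paper addresses.

<pre>
import cvxpy as cp

def mvu_sdp(adjacent, d, nu=1.0):
    """Solve the MVU relaxation of eq. (4) over the full n x n Gram matrix X."""
    n = adjacent.shape[0]
    X = cp.Variable((n, n), PSD=True)                 # constraint (ii): X is PSD
    stress = []
    for i, j in zip(*np.nonzero(np.triu(adjacent))):
        stress.append(cp.square(X[i, i] - 2 * X[i, j] + X[j, j] - d[i, j] ** 2))
    objective = cp.Maximize(cp.trace(X) - nu * sum(stress))
    constraints = [cp.sum(X) == 0]                    # constraint (i): centering
    cp.Problem(objective, constraints).solve()
    return X.value
</pre>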

Matrix factorization

Assume G is the neighborhood graph defined by the sensor network, and view the sensor locations as a function defined over the nodes of this graph. Functions on a graph can be approximated using the eigenvectors of the graph's Laplacian matrix as basis functions (spectral graph theory).

The graph Laplacian is defined by:

[math]\displaystyle{ L_{i,j}= \left\{\begin{matrix} deg(v_i) & \text{if } i=j \\ -1 & \text{if } i\neq j \text{ and } v_i \text{ adjacent } v_j \\ 0 & \text{ otherwise}\end{matrix}\right. }[/math]
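In code, the (unnormalized) Laplacian of the neighborhood graph is a one-liner; the sketch below assumes the boolean adjacency matrix `adjacent` from the earlier snippet.

<pre>
def graph_laplacian(adjacent):
    """Unnormalized graph Laplacian L = D - W of the neighborhood graph."""
    W = adjacent.astype(float)
    return np.diag(W.sum(axis=1)) - W
</pre>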


The sensor locations can be approximated using the m bottom eigenvectors of the Laplacian matrix of G. Expanding the locations in this basis yields a matrix factorization for X: [math]\displaystyle{ x_i\approx\Sigma_{\alpha = 1}^m Q_{i\alpha}y_{\alpha} }[/math]
where Q is the [math]\displaystyle{ n\times m }[/math] matrix whose columns are the m bottom eigenvectors of the Laplacian matrix, and the vectors [math]\displaystyle{ y_{\alpha} }[/math] are unknown and depend on [math]\displaystyle{ d_{ij} }[/math]. Now, if we define the inner products of these vectors as [math]\displaystyle{ Y_{\alpha\beta} = y_{\alpha}\cdot y_{\beta} }[/math], we obtain the factorization [math]\displaystyle{ X\approx QYQ^T }[/math] (6)
Using this approximation, we can solve an optimization over Y that is much smaller than the one over X. Since Q stores mutually orthonormal eigenvectors, we have [math]\displaystyle{ tr(QYQ^T)=tr(Y) }[/math]. In addition, [math]\displaystyle{ QYQ^T }[/math] automatically satisfies the centering constraint because the uniform eigenvector is not included. Therefore, the optimization becomes:

  Maximize:	[math]\displaystyle{ tr(Y)-v\Sigma_{i\sim j}{((QYQ^T)_{ii}-2(QYQ^T)_{ij}+(QYQ^T)_{jj}-d_{ij}^2)^2} }[/math] (objective function)
subject to: [math]\displaystyle{ Y \succeq 0 }[/math]
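A sketch of how Q might be formed (assuming scipy and a connected graph, so that exactly one eigenvector, the uniform one, has eigenvalue zero and is discarded); the names are ours.

<pre>
from scipy.linalg import eigh

def bottom_eigenvectors(L, m=10):
    """Return the n x m matrix Q of the m bottom (smoothest) eigenvectors of L,
    skipping the uniform eigenvector so that Q Y Q^T stays centered."""
    eigvals, eigvecs = eigh(L)                # eigenvalues in ascending order
    return eigvecs[:, 1:m + 1]                # drop the uniform eigenvector

# Given the reduced solution Y (an m x m matrix), the Gram matrix is recovered as
# X = Q @ Y @ Q.T, and planar coordinates come from the top 2 eigenvectors of X.
</pre>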

Formulation as SDP

Recall that our ultimate aim is to formulate the optimization problem as an SDP. However, the objective function, as formulated in the last section, is a quadratic function of the elements of the matrix [math]\displaystyle{ Y \, }[/math]. In this subsection, we show how to transform the quadratic objective into a linear one via the Schur complement lemma.

First of all, note that by concatenating all the columns of the m-by-m matrix [math]\displaystyle{ Y \, }[/math] into a vector [math]\displaystyle{ y \in R^{m^2}\, }[/math], the objective function takes the form [math]\displaystyle{ \,b^Ty - y^TAy \, }[/math] (up to an additive constant), where the matrix [math]\displaystyle{ A\, }[/math] collects all the quadratic coefficients of the objective function and the vector [math]\displaystyle{ b \, }[/math] collects all the linear coefficients. Note that the matrix [math]\displaystyle{ A\, }[/math] is positive semidefinite because the quadratic term in the objective function is non-negative for every [math]\displaystyle{ y \, }[/math].
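One way to assemble A and b is to note that [math]\displaystyle{ (QYQ^T)_{ii}-2(QYQ^T)_{ij}+(QYQ^T)_{jj} = (q_i-q_j)^T Y (q_i-q_j) }[/math], where [math]\displaystyle{ q_i }[/math] denotes the i-th row of Q, which is linear in the entries of Y. The sketch below follows this derivation (it is our own reconstruction, not code from the paper) and builds on the earlier snippets.

<pre>
def assemble_quadratic(Q, adjacent, d, nu=1.0):
    """Assemble b and A so that the objective of the previous section equals
    b^T y - y^T A y (up to an additive constant), with y = vec(Y)."""
    n, m = Q.shape
    A = np.zeros((m * m, m * m))
    b = np.eye(m).flatten(order="F")             # tr(Y) = vec(I)^T vec(Y)
    for i, j in zip(*np.nonzero(np.triu(adjacent))):
        q = Q[i] - Q[j]
        c = np.outer(q, q).flatten(order="F")    # (q_i - q_j)^T Y (q_i - q_j) = c^T vec(Y)
        A += nu * np.outer(c, c)                 # quadratic coefficients
        b += 2 * nu * d[i, j] ** 2 * c           # linear coefficients from the cross term
    return A, b
</pre>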

By introducing an auxiliary variable [math]\displaystyle{ l \, }[/math] that bounds the quadratic piece from above, the optimization problem can be written as

  Maximize:   [math]\displaystyle{ \,b^Ty - l }[/math] 
subject to: (i) [math]\displaystyle{ Y \succeq 0 }[/math] and (ii) [math]\displaystyle{ l \geq y^TAy }[/math].

To complete the formulation as an SDP, it remains to rewrite the second constraint [math]\displaystyle{ l \geq y^TAy }[/math] as a linear matrix inequality (a positive semidefinite constraint). This is accomplished by the Schur complement lemma.

Schur Complement Lemma

In this section, we illustrate how the lemma is used rather than prove it. Readers are referred to "Introduction to Algorithms" for the formal definition of the Schur complement and a proof of the Schur complement lemma.

Suppose

[math]\displaystyle{ Z=\left[\begin{matrix} I & A^{\frac{1}{2}}y \\ (A^{\frac{1}{2}}y)^T & l \end{matrix}\right] }[/math]

is a positive semi-definite matrix, then by definition, [math]\displaystyle{ x^T Z x \geq 0 \, }[/math] for any vector [math]\displaystyle{ x\, }[/math]. Let us break [math]\displaystyle{ x\, }[/math] into two subvectors [math]\displaystyle{ x_1 \, }[/math] and [math]\displaystyle{ x_2 \, }[/math], where [math]\displaystyle{ x_1 \, }[/math] has a dimension compatible with the matrix [math]\displaystyle{ I\, }[/math] in [math]\displaystyle{ Z \, }[/math] and [math]\displaystyle{ x_2 \, }[/math] is a real number. Then,

[math]\displaystyle{ \begin{align} x^T Z x & = ( {x_1}^T \; x_2 ) \left[\begin{matrix} I & A^{\frac{1}{2}}y \\ (A^{\frac{1}{2}}y)^T & l \end{matrix}\right] \left[\begin{matrix} x_1 \\ x_2 \end{matrix}\right] \\ & = ( {x_1}^T \; x_2 ) \left[\begin{matrix} x_1 + (A^{\frac{1}{2}}y) x_2 \\ (A^{\frac{1}{2}}y)^T x_1 + l x_2 \end{matrix}\right] \\ & ={x_1}^T x_1 + {x_1}^T (A^{\frac{1}{2}}y) x_2 + x_2 (A^{\frac{1}{2}}y)^T x_1 + l \,{x_2}^2 \\ & = \left(x_1 + (A^{\frac{1}{2}}y) x_2\right)^T \left(x_1 + (A^{\frac{1}{2}}y) x_2\right) + {x_2}^2 \left( l - (A^{\frac{1}{2}}y)^T (A^{\frac{1}{2}}y)\right) \end{align} }[/math]

Since [math]\displaystyle{ (A^{\frac{1}{2}}y)^T (A^{\frac{1}{2}}y) = y^T A y \, }[/math], choosing [math]\displaystyle{ x_1 = -(A^{\frac{1}{2}}y) x_2 \, }[/math] with [math]\displaystyle{ x_2 \neq 0 }[/math] shows that [math]\displaystyle{ Z \succeq 0 }[/math] forces [math]\displaystyle{ l \geq y^T A y }[/math]; conversely, if [math]\displaystyle{ l \geq y^T A y }[/math], both terms above are non-negative for every [math]\displaystyle{ x }[/math], so [math]\displaystyle{ Z \succeq 0 }[/math]. The constraint [math]\displaystyle{ l \geq y^TAy }[/math] is therefore equivalent to the positive semidefiniteness of [math]\displaystyle{ Z }[/math].
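As a quick numerical sanity check of the lemma (not part of the paper; a small random instance with numpy and scipy), Z is positive semidefinite exactly when [math]\displaystyle{ l \geq y^TAy }[/math]:

<pre>
import numpy as np
from scipy.linalg import sqrtm

rng = np.random.default_rng(1)
B = rng.standard_normal((4, 4))
A = B @ B.T                                    # a positive semidefinite A
y = rng.standard_normal(4)
Ay = np.real(sqrtm(A)) @ y                     # A^{1/2} y

def min_eig_Z(l):
    Z = np.block([[np.eye(4), Ay[:, None]], [Ay[None, :], np.array([[l]])]])
    return np.linalg.eigvalsh(Z).min()

q = y @ A @ y                                  # y^T A y
print(min_eig_Z(q + 0.1) >= -1e-9)             # True:  l >= y^T A y  =>  Z is PSD
print(min_eig_Z(q - 0.1) < 0)                  # True:  l <  y^T A y  =>  Z is not PSD
</pre>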




Our goal is to cast the required optimization as an SDP over small matrices with few constraints. Let [math]\displaystyle{ y\in R^{m^2} }[/math] be the vector obtained by concatenating all the columns of Y, [math]\displaystyle{ A \in R^{m^2 \times m^2} }[/math] be the positive semidefinite matrix collecting all the quadratic coefficients of the objective function, [math]\displaystyle{ b\in R^{m^2} }[/math] be the vector collecting all the linear coefficients of the objective function, and [math]\displaystyle{ l }[/math] be an upper bound on the quadratic piece [math]\displaystyle{ y^TAy }[/math] of the objective function. Using the Schur complement lemma to express this bound as a linear matrix inequality, we obtain the SDP:

Maximize: [math]\displaystyle{ \,b^Ty - l }[/math] (9)
subject to: (i) [math]\displaystyle{ Y \succeq 0 }[/math] and (ii) [math]\displaystyle{ \begin{bmatrix} I & A^{1/2}y \\ (A^{1/2}y)^T & l \end{bmatrix} \succeq 0 }[/math]
where I is the [math]\displaystyle{ m^2 \times m^2 }[/math] identity matrix and [math]\displaystyle{ A^{1/2} }[/math] is the matrix square root of A.

Putting the problem in this form, the only variables of the SDP are the [math]\displaystyle{ m(m+1)/2 }[/math] elements of the symmetric matrix Y and the unknown scalar l. The constraints reduce to the positive semidefiniteness of Y and one linear matrix inequality of size [math]\displaystyle{ (m^2+1) \times (m^2+1) }[/math]. It is worth noting that the size of the SDP does not depend on the number of nodes (n) or edges in the network.
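Putting the pieces together, the reduced SDP can be prototyped as follows (a sketch assuming cvxpy and the helpers from the earlier snippets). cvxpy accepts the bound [math]\displaystyle{ l \geq y^TAy }[/math] directly as [math]\displaystyle{ l \geq \|A^{1/2}y\|^2 }[/math], which is exactly what the linear matrix inequality of constraint (ii) encodes, so the block matrix need not be written out by hand.

<pre>
import cvxpy as cp
from scipy.linalg import sqrtm

def graph_regularized_sdp(A, b, m):
    """SDP of eq. (9): the variables are the m x m matrix Y and the scalar l."""
    Y = cp.Variable((m, m), PSD=True)            # constraint (i): Y is PSD
    l = cp.Variable()
    y = cp.vec(Y)                                # concatenate the columns of Y
    A_half = np.real(sqrtm(A))                   # matrix square root A^{1/2}
    # constraint (ii), stated as the equivalent bound l >= ||A^{1/2} y||^2 = y^T A y
    constraints = [l >= cp.sum_squares(A_half @ y)]
    cp.Problem(cp.Maximize(b @ y - l), constraints).solve()
    return Y.value

# Example pipeline (names from the earlier sketches):
#   L = graph_laplacian(adjacent); Q = bottom_eigenvectors(L, m=10)
#   A, b = assemble_quadratic(Q, adjacent, d)
#   Y = graph_regularized_sdp(A, b, 10)
#   X = Q @ Y @ Q.T                              # approximate Gram matrix of eq. (6)
</pre>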

Gradient-based improvement

As the matrix factorization only provides an approximation to the solution of the original problem, the result is refined by using it as the initial starting point for gradient-based optimization of the original objectives. This is done in two steps:
First, starting from the m-dimensional solution of eq. (6), use conjugate gradient methods to maximize the objective function in eq. (4).
Second, project the result of the previous step onto the [math]\displaystyle{ R^2 }[/math] plane and use conjugate gradient methods to minimize the loss function in eq. (1), as sketched below. The conjugate gradient method is an iterative method for minimizing a quadratic function whose Hessian matrix (the matrix of second partial derivatives) is positive definite; applied to non-quadratic objectives such as these, it serves as a nonlinear local optimization method.
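A sketch of the second refinement step, using scipy's nonlinear conjugate gradient routine on the loss of eq. (1) (the `loss` function from the earlier snippet; gradients are left to finite differences here for brevity, and `x_init` would come from projecting [math]\displaystyle{ QYQ^T }[/math] onto its top 2 eigenvectors).

<pre>
from scipy.optimize import minimize

def refine(x_init, adjacent, d):
    """Refine planar coordinates by conjugate gradient descent on eq. (1)."""
    n = x_init.shape[0]
    obj = lambda flat: loss(flat.reshape(n, 2), adjacent, d)
    result = minimize(obj, x_init.ravel(), method="CG")   # nonlinear conjugate gradient
    return result.x.reshape(n, 2)
</pre>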

Results

The figure<ref>The same paper, Figure 2.</ref> below shows the sensor locations inferred for the n = 1055 largest cities in the continental US. Local distances were estimated to up to 18 neighbors within radius r = 0.09, and the measurements were corrupted by 10% Gaussian noise over the true local distances. Using the m = 10 bottom eigenvectors of the graph Laplacian, the solution provides a good initial starting point (left picture) for gradient-based improvement. The right picture shows the sensor locations after this improvement.

File:fig2L.jpg
Initial Sensor locations
File:fig2R.jpg
Sensor locations after gradient improvement


The second simulated network, shown in the figure<ref>The same paper, Figure 3.</ref> below, placed nodes at n = 20,000 points sampled uniformly inside the unit square. Local distances were estimated to up to 20 other nodes within radius r = 0.06. Using the m = 10 bottom eigenvectors of the graph Laplacian, it took 19 s to construct and solve the SDP and 52 s for 100 iterations of conjugate gradient descent.

File:fig3.jpg
Results on a simulated network with n=20000 nodes distributed uniformly inside a centered unit square


For the simulated network with nodes at US cities, the figure<ref>The same paper, Figure 4.</ref> below plots the loss function of eq. (1) versus the number of eigenvectors, as well as the computation time versus the number of eigenvectors. The figure shows a trade-off between obtaining a better solution and increasing the computation time, and that [math]\displaystyle{ m\approx 10 }[/math] best manages this trade-off.

File:fig4.jpg
Left: the value of the loss function. Right: the computation time

Conclusion

An approach for inferring low-dimensional representations from local distance constraints using MVU was proposed.
The main idea of the approach is a matrix factorization computed from the bottom eigenvectors of the graph Laplacian.
The initial solution can be refined by local search methods.
The approach is suitable for large inputs, since the size of the resulting SDP does not depend on the number of nodes or edges in the network.

References

<references/>