When can Multi-Site Datasets be Pooled for Regression? Hypothesis Tests, l2-consistency and Neuroscience Applications: Summary


Main Contributions of the Research Article:

  1. The main result is a hypothesis test to evaluate whether pooling data across multiple sites for regression (before or after correcting for site-specific distributional shifts) can improve the estimation (in mean squared error) of the relevant coefficients, while allowing for the influence of a set of confounding variables.
  2. They show how pooling can be used even when the features differ across sites. For this, they derive an L2-consistency rate that supports the use of the sparse multi-task Lasso when sparsity patterns are not identical (a toy pooling simulation appears after this list).
  3. Experimental results demonstrating consistent acceptance power of the test for early Alzheimer's disease (AD) detection in humans.
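
To make the pooling benefit in contribution 2 concrete, here is a minimal simulation sketch (a toy example, not the paper's estimator or data): two sites share one sparse coefficient vector, and a Lasso fit on the pooled rows is compared with a fit on a single site in terms of l2 coefficient error. The helper make_site and all constants are hypothetical.

<pre>
# Toy simulation (not the paper's estimator or data): two sites share the
# same sparse coefficient vector; compare the l2 coefficient error of a
# Lasso fit on one site against a Lasso fit on the pooled rows.
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(42)
p = 20
beta = np.zeros(p)
beta[:4] = [2.0, -1.5, 1.0, 0.5]           # shared sparsity pattern

def make_site(n):                          # hypothetical helper: one site's data
    X = rng.standard_normal((n, p))
    return X, X @ beta + rng.standard_normal(n)

X1, y1 = make_site(40)                     # small site 1
X2, y2 = make_site(40)                     # small site 2
X_pool = np.vstack([X1, X2])               # pooled design matrix
y_pool = np.concatenate([y1, y2])

single = Lasso(alpha=0.1).fit(X1, y1)
pooled = Lasso(alpha=0.1).fit(X_pool, y_pool)
print("site-1-only l2 error:", np.linalg.norm(single.coef_ - beta))
print("pooled      l2 error:", np.linalg.norm(pooled.coef_ - beta))
# When the sites are compatible, pooling typically reduces the error --
# the benefit the paper's hypothesis test is designed to detect.
</pre>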

Introduction to some Basic Concepts and Issues

Regression Problems

Ridge and Lasso regression are powerful techniques generally used for creating parsimonious models in the presence of a 'large' number of features. Here 'large' typically means either of two things:

  • Large enough to increase the model's tendency to overfit (as few as 10 variables can cause overfitting)
  • Large enough to cause computational challenges; with modern systems, this situation might arise with millions or billions of features

Lasso Regression:

LASSO stands for Least Absolute Shrinkage and Selection Operator. Lasso regression performs L1 regularization, i.e., it adds a penalty proportional to the sum of the absolute values of the coefficients to the optimization objective. Thus, lasso regression optimizes the following (see the sketch after the list below).

  • Objective = RSS + α * (sum of absolute values of coefficients)
  1. α = 0: same coefficients as simple linear regression
  2. α = ∞: all coefficients zero (the penalty dominates the objective)
  3. 0 < α < ∞: coefficients fall between 0 and those of simple linear regression
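
A minimal sketch of the three α regimes above, assuming scikit-learn is available; the data and constants are invented for illustration. scikit-learn's Lasso discourages α = 0, so plain LinearRegression stands in for that case.

<pre>
# Sketch of the three alpha regimes (invented data; scikit-learn assumed).
import numpy as np
from sklearn.linear_model import Lasso, LinearRegression

rng = np.random.default_rng(0)
n, p = 100, 10
X = rng.standard_normal((n, p))
beta_true = np.array([3.0, -2.0, 1.5] + [0.0] * (p - 3))  # sparse truth
y = X @ beta_true + 0.5 * rng.standard_normal(n)

ols = LinearRegression().fit(X, y)         # the alpha = 0 case
print("alpha=0 (OLS):", np.round(ols.coef_, 2))

for alpha in (0.1, 1.0, 10.0):             # growing alpha shrinks coefficients
    lasso = Lasso(alpha=alpha).fit(X, y)
    print(f"alpha={alpha}:", np.round(lasso.coef_, 2))
# A large enough alpha drives every coefficient to exactly zero.
</pre>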

Hypothesis Testing

The hypothesis test to evaluate the statistical power improvement (e.g., in mean squared error) from running a regression model on a pooled dataset is discussed below. If β denotes the coefficient vector (i.e., the predictor weights), then the regression model is

<math>\min_{\beta} \frac{1}{n} \left\Vert y - X\beta \right\Vert_2^2</math>
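
As a quick illustration of this objective (a sketch on simulated data, not the paper's pipeline), the minimizer can be computed directly with NumPy's least-squares solver; the 1/n factor does not change the argmin.

<pre>
# Sketch: solve min_beta (1/n) ||y - X beta||_2^2 on simulated data.
import numpy as np

rng = np.random.default_rng(1)
n, p = 200, 5
X = rng.standard_normal((n, p))
beta = rng.standard_normal(p)
y = X @ beta + 0.3 * rng.standard_normal(n)

# np.linalg.lstsq minimizes ||y - X b||_2; the 1/n scaling in the
# objective does not change the minimizer.
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
print("estimation error:", np.round(beta_hat - beta, 3))
print("objective value :", np.mean((y - X @ beta_hat) ** 2))
</pre>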