When can Multi-Site Datasets be Pooled for Regression? Hypothesis Tests, l2-consistency and Neuroscience Applications: Summary

Revision as of 23:34, 24 October 2017

Main Contributions of the Research Article

Summary still under construction. This page is a summary of this ICML 2017 paper.

  1. The main result is a hypothesis test that evaluates whether pooling data across multiple sites for regression (before or after correcting for site-specific distributional shifts) can improve the estimation (in terms of mean squared error) of the relevant coefficients, while permitting an influence from a set of confounding variables.
  2. The authors show how pooling can be used even when the features differ across sites. For this they derive an L2-consistency rate, which supports the use of sparse multi-task Lasso when sparsity patterns are not identical.
  3. Experimental results show consistent acceptance power for early detection of Alzheimer’s disease (AD) in humans.

Introduction to some Basic Concepts and Issues

Regression Problems

Ridge and Lasso regression are powerful techniques generally used for creating parsimonious models in the presence of a ‘large’ number of features. Here ‘large’ can typically mean either of two things:

  • Large enough to enhance the tendency of a model to overfit (as few as 10 variables might cause overfitting)
  • Large enough to cause computational challenges. With modern systems, this situation might arise with millions or billions of features

Lasso Regression:

LASSO stands for Least Absolute Shrinkage and Selection Operator. Lasso regression performs L1 regularization, i.e. it adds a penalty proportional to the sum of the absolute values of the coefficients to the optimization objective. Thus, lasso regression optimizes the following, where RSS is the residual sum of squares.

  • Objective = RSS + α * (sum of absolute value of coefficients)
  1. α = 0: Same coefficients as simple linear regression
  2. α = ∞: All coefficients zero, since any nonzero coefficient would incur an infinite penalty
  3. 0 < α < ∞: Coefficients shrunk toward 0 relative to those of simple linear regression
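The three regimes of α can be checked numerically. The following is a minimal sketch, assuming no intercept and using plain cyclic coordinate descent on the objective exactly as written above. The helper names `soft_threshold` and `lasso_cd` and the synthetic data are illustrative and do not come from the paper.

```python
import numpy as np

def soft_threshold(z, t):
    """Shrink z toward zero by t, returning 0 when |z| <= t."""
    return np.sign(z) * max(abs(z) - t, 0.0)

def lasso_cd(X, y, alpha, n_iter=500):
    """Minimize RSS + alpha * sum(|b_j|) by cyclic coordinate descent (no intercept)."""
    n, p = X.shape
    b = np.zeros(p)
    col_sq = (X ** 2).sum(axis=0)  # x_j' x_j for each column
    for _ in range(n_iter):
        for j in range(p):
            # Residual with feature j's current contribution added back in
            r_j = y - X @ b + X[:, j] * b[j]
            rho = X[:, j] @ r_j
            # The factor 1/2 comes from differentiating the unscaled RSS term
            b[j] = soft_threshold(rho, alpha / 2.0) / col_sq[j]
    return b

# Synthetic data (illustrative only)
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
true_b = np.array([3.0, -2.0, 0.0, 0.0, 1.0])
y = X @ true_b + rng.normal(scale=0.5, size=200)

ols = np.linalg.lstsq(X, y, rcond=None)[0]
b_zero = lasso_cd(X, y, alpha=0.0)    # regime 1: matches least squares
b_mid = lasso_cd(X, y, alpha=50.0)    # regime 3: shrunk coefficients
# For this objective, every coefficient is already zero once alpha >= 2 * max|X'y|
alpha_max = 2.0 * np.max(np.abs(X.T @ y))
b_big = lasso_cd(X, y, alpha=alpha_max + 1.0)  # regime 2 in practice

print("OLS      :", np.round(ols, 3))
print("alpha=0  :", np.round(b_zero, 3))
print("alpha=50 :", np.round(b_mid, 3))
print("alpha big:", np.round(b_big, 3))
```

Note that library implementations often scale the RSS term differently (for example by 1/(2n)), so their α values are not directly comparable to the α used here.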

Bias-Variance Trade-Off

Bias is the error that comes from erroneous assumptions in the learning algorithm. High bias can cause an algorithm to miss the relevant relations between features and target outputs (underfitting). Variance is the error that comes from sensitivity to small fluctuations in the training set. High variance can cause an algorithm to model the random noise in the training data, rather than the intended outputs (overfitting).
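The trade-off can be made concrete with the identity MSE = bias² + variance. The sketch below is a toy simulation, not an experiment from the paper: it estimates a mean with a shrunken sample mean, where the shrinkage factor `c` is a hypothetical knob, and shows that shrinking lowers variance while raising bias.

```python
import numpy as np

rng = np.random.default_rng(1)
true_mean = 2.0
n, n_trials = 20, 5000

# Each row is one simulated training set of n noisy observations.
samples = rng.normal(loc=true_mean, scale=3.0, size=(n_trials, n))
sample_means = samples.mean(axis=1)

def decompose(estimates, target):
    """Return (mse, squared bias, variance) of an estimator over repeated trials."""
    mse = np.mean((estimates - target) ** 2)
    bias_sq = (np.mean(estimates) - target) ** 2
    variance = np.var(estimates)  # population variance, so the identity is exact
    return mse, bias_sq, variance

# c = 1.0 is the unbiased sample mean; c < 1 shrinks the estimate toward zero.
results = {c: decompose(c * sample_means, true_mean) for c in (1.0, 0.8, 0.5)}
for c, (mse, bias_sq, variance) in results.items():
    print(f"c={c}: mse={mse:.3f}  bias^2={bias_sq:.3f}  variance={variance:.3f}")
```

The same accounting underlies the pooling question studied in the paper: pooling is worthwhile when the reduction in variance outweighs the bias that it introduces.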

Hypothesis Testing

The hypothesis test to evaluate statistical power improvements (e.g., mean squared error) when running a regression model on a pooled dataset is discussed below. If X denotes the predictors, y the responses, n the number of samples, and β the coefficient vector (i.e., predictor weights), then the regression model is

  • [math]\displaystyle{ \min_{\beta} \frac{1}{n}\left\Vert y-X\beta \right\Vert_2^2 }[/math] ...... (1)

If k denotes the number of sites, a domain adaptation scheme first needs to be applied to account for the distributional shifts between the k different sets of predictors [math]\displaystyle{ \lbrace X_i \rbrace_{i=1}^{k} }[/math], after which a regression model is run. If the underlying “concept” (i.e., the relationship between predictors and responses) can be assumed to be the same across the different sites, then it is reasonable to impose the same β for all sites. For example, the influence of CSF protein measurements on cognitive scores of an individual may be invariant to demographics. If the distributional mismatch correction is imperfect, we may define [math]\displaystyle{ \Delta\beta_i = \beta_i - \beta^* }[/math] for i ∈ {1,...,k} as the residual difference between the site-specific coefficients and the true shared coefficient vector (in the ideal case, [math]\displaystyle{ \Delta\beta_i = 0 }[/math]). This leads to the multi-site regression problem (2), where [math]\displaystyle{ \tau_i }[/math] is the weighting parameter for each site:

  • [math]\displaystyle{ \min_{\beta} \sum_{i=1}^k \tau_i^2\left\Vert y_i-X_i\beta \right\Vert_2^2 }[/math] ......(2)
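Problem (2) is a weighted least-squares problem, so it has a closed-form solution via the normal equations. The sketch below is a minimal illustration under assumed synthetic data. The function name `pooled_regression`, the site sizes, and the weights `taus` are hypothetical and are not values from the paper.

```python
import numpy as np

def pooled_regression(Xs, ys, taus):
    """Solve min_beta sum_i tau_i^2 * ||y_i - X_i beta||^2 via the normal equations."""
    p = Xs[0].shape[1]
    gram = np.zeros((p, p))
    moment = np.zeros(p)
    for X_i, y_i, tau_i in zip(Xs, ys, taus):
        gram += tau_i ** 2 * (X_i.T @ X_i)
        moment += tau_i ** 2 * (X_i.T @ y_i)
    return np.linalg.solve(gram, moment)

rng = np.random.default_rng(2)
beta_true = np.array([1.0, -0.5, 2.0])
sizes = [40, 60, 100]      # n_i for k = 3 hypothetical sites
taus = [1.0, 0.7, 0.4]     # illustrative weights only
Xs = [rng.normal(size=(n_i, 3)) for n_i in sizes]
ys = [X_i @ beta_true + rng.normal(scale=0.3, size=len(X_i)) for X_i in Xs]

beta_pooled = pooled_regression(Xs, ys, taus)

# Equivalent view: scale each site's rows by tau_i, stack, and run ordinary least squares.
X_stack = np.vstack([t * X_i for t, X_i in zip(taus, Xs)])
y_stack = np.concatenate([t * y_i for t, y_i in zip(taus, ys)])
beta_stacked = np.linalg.lstsq(X_stack, y_stack, rcond=None)[0]

print("pooled :", np.round(beta_pooled, 3))
print("stacked:", np.round(beta_stacked, 3))
```

The two estimates agree because multiplying a site's rows by [math]\displaystyle{ \tau_i }[/math] multiplies its squared residuals by [math]\displaystyle{ \tau_i^2 }[/math].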

Separate Regression or Shared Regression?

Since the underlying relationship between predictors and responses is assumed to be the same across the different datasets (from which the data are pooled), the estimates of [math]\displaystyle{ \beta_i }[/math] across all k sites are restricted to be the same. Without this constraint, (3) is equivalent to fitting a regression separately on each site. To explore whether this constraint improves estimation, the Mean Squared Error (MSE) needs to be examined. Using site 1 as the reference, setting [math]\displaystyle{ \tau_1 = 1 }[/math] in (2), and considering [math]\displaystyle{ \beta^*=\beta_1 }[/math] gives

  • [math]\displaystyle{ \min_{\beta} \frac{1}{n}\left\Vert y_1-X_1\beta \right\Vert_2^2 + \sum_{i=2}^k \tau_i^2\left\Vert y_i-X_i\beta \right\Vert_2^2 }[/math] .........(3)

To evaluate whether MSE is reduced, we first need to quantify the change in the bias and variance of (3) compared to (1).

Case 1: Sharing all [math]\displaystyle{ \beta }[/math]s

[math]\displaystyle{ n_i }[/math]: sample size of site i
[math]\displaystyle{ \hat{\beta}_i }[/math]: regression estimate from site i
[math]\displaystyle{ \Delta\beta^T }[/math]: a vector of length kp


Lemma 2.2 bounds the increase in bias and the reduction in variance. Theorem 2.3 is the authors' main test result. Although [math]\displaystyle{ \sigma_i }[/math] is typically unknown, it can easily be replaced by its site-specific estimate. Theorem 2.3 implies that we can conduct a test against a non-central [math]\displaystyle{ \chi^2 }[/math] distribution based on the test statistic.


Theorem 2.3 implies that the sites, in fact, do not even need to share the full dataset to assess whether pooling will be useful. Instead, the test only requires very high-level statistical information such as [math]\displaystyle{ \hat{\beta}_i,\hat{\Sigma}_i,\sigma_i }[/math] and [math]\displaystyle{ n_i }[/math] for all participating sites – which can be transferred without computational overhead.
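To illustrate how little needs to be exchanged, the sketch below computes the four summaries per site. It does not implement the test statistic of Theorem 2.3. Two choices are assumptions made here rather than definitions from the paper: [math]\displaystyle{ \hat{\Sigma}_i }[/math] is taken as the uncentered sample second-moment matrix, and [math]\displaystyle{ \sigma_i }[/math] is estimated from the residuals. The name `site_summary` and the data are hypothetical.

```python
import numpy as np

def site_summary(X_i, y_i):
    """Compute the summary a site would share: beta_hat, Sigma_hat, sigma_hat, n."""
    n_i, p = X_i.shape
    beta_hat = np.linalg.lstsq(X_i, y_i, rcond=None)[0]
    resid = y_i - X_i @ beta_hat
    # Assumption: noise level estimated from residuals with n_i - p degrees of freedom.
    sigma_hat = float(np.sqrt(resid @ resid / (n_i - p)))
    # Assumption: Sigma_hat taken as the (uncentered) sample second-moment matrix.
    Sigma_hat = X_i.T @ X_i / n_i
    return {"beta_hat": beta_hat, "Sigma_hat": Sigma_hat, "sigma_hat": sigma_hat, "n": n_i}

rng = np.random.default_rng(3)
beta_true = np.array([0.5, 1.5, -1.0])
sites = []
for n_i in (80, 120, 200):  # hypothetical site sizes
    X_i = rng.normal(size=(n_i, 3))
    y_i = X_i @ beta_true + rng.normal(scale=0.4, size=n_i)
    sites.append((X_i, y_i))

summaries = [site_summary(X_i, y_i) for X_i, y_i in sites]

# Values transferred per site: p (beta_hat) + p*p (Sigma_hat) + 2 scalars,
# versus n_i * (p + 1) raw values.
shared_count = [s["beta_hat"].size + s["Sigma_hat"].size + 2 for s in summaries]
raw_count = [X_i.size + y_i.size for X_i, y_i in sites]

# Sanity check: with these definitions, X_i' y_i = n_i * Sigma_hat_i @ beta_hat_i, so the
# equally weighted pooled least-squares estimate can be rebuilt from the summaries alone.
gram = sum(s["n"] * s["Sigma_hat"] for s in summaries)
moment = sum(s["n"] * s["Sigma_hat"] @ s["beta_hat"] for s in summaries)
beta_from_summaries = np.linalg.solve(gram, moment)
beta_from_raw = np.linalg.lstsq(
    np.vstack([X_i for X_i, _ in sites]),
    np.concatenate([y_i for _, y_i in sites]),
    rcond=None,
)[0]

print("shared values per site:", shared_count)
print("raw values per site   :", raw_count)
```

The size of each summary depends only on the number of predictors p and not on the sample size, which is why the exchange stays cheap as sites grow.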

Case 2: Sharing a subset of [math]\displaystyle{ \beta }[/math]s

Some predictors may be associated with the response differently from site to site. For example, socio-economic status may (or may not) have a significant association with a health outcome (response) depending on the country of the study (e.g., insurance coverage policies). Unlike in Case 1, [math]\displaystyle{ \beta }[/math] cannot be considered to be the same across all sites. The model in (3) will now include another design matrix of predictors [math]\displaystyle{ Z\in \mathbb{R}^{n \times q} }[/math] and corresponding coefficients [math]\displaystyle{ \gamma_i }[/math] for each site i:


[math]\displaystyle{ \min_{\beta,\gamma} \sum_{i=1}^{k}\tau_i^2\left\Vert y_i-X_i\beta-Z_i\gamma_i \right\Vert_2^2 }[/math] ... (9)


While evaluating whether the MSE of [math]\displaystyle{ \beta }[/math] is reduced, the change in the MSE of [math]\displaystyle{ \gamma }[/math] is ignored because these coefficients correspond to site-specific variables. If [math]\displaystyle{ \hat{\beta} }[/math] is close to the “true” [math]\displaystyle{ \beta^* }[/math], it will also enable a better estimation of the site-specific variables.
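Problem (9) can be solved as a single least-squares problem by giving the shared [math]\displaystyle{ \beta }[/math] one set of columns and each [math]\displaystyle{ \gamma_i }[/math] its own block of columns. The sketch below is one straightforward way to do this under assumed synthetic data, not necessarily how the authors compute it. The function name `fit_shared_and_site_specific` and all numeric values are hypothetical.

```python
import numpy as np

def fit_shared_and_site_specific(Xs, Zs, ys, taus):
    """Solve min over beta, gamma_1..gamma_k of sum_i tau_i^2 ||y_i - X_i beta - Z_i gamma_i||^2."""
    k = len(Xs)
    p = Xs[0].shape[1]
    q = Zs[0].shape[1]
    rows, targets = [], []
    for i, (X_i, Z_i, y_i, tau_i) in enumerate(zip(Xs, Zs, ys, taus)):
        n_i = X_i.shape[0]
        # Columns: [shared beta | gamma_1 | ... | gamma_k]; only site i's gamma block is nonzero.
        block = np.zeros((n_i, p + k * q))
        block[:, :p] = X_i
        block[:, p + i * q : p + (i + 1) * q] = Z_i
        rows.append(tau_i * block)
        targets.append(tau_i * y_i)
    coef = np.linalg.lstsq(np.vstack(rows), np.concatenate(targets), rcond=None)[0]
    beta = coef[:p]
    gammas = coef[p:].reshape(k, q)
    return beta, gammas

rng = np.random.default_rng(4)
beta_true = np.array([1.0, -2.0])                                # shared across sites
gammas_true = np.array([[0.5, 0.0], [-1.0, 0.3], [2.0, -0.7]])   # one row per site
sizes = [50, 80, 120]
taus = [1.0, 0.8, 0.6]                                           # illustrative weights only

Xs = [rng.normal(size=(n_i, 2)) for n_i in sizes]
Zs = [rng.normal(size=(n_i, 2)) for n_i in sizes]
ys = [
    X_i @ beta_true + Z_i @ g_i + rng.normal(scale=0.05, size=len(X_i))
    for X_i, Z_i, g_i in zip(Xs, Zs, gammas_true)
]

beta_hat, gammas_hat = fit_shared_and_site_specific(Xs, Zs, ys, taus)
print("shared beta:", np.round(beta_hat, 3))
print("site gammas:")
print(np.round(gammas_hat, 3))
```

Only the first p entries of the solution are shared across sites, so they are the ones whose MSE the pooling test is concerned with.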
