=="Advanced Data Analysis from an Elementary Point of View" by Cosma Shalizi==


''(This page summarizes my notes from this great course. The entire course is available [http://www.stat.cmu.edu/~cshalizi/uADA/12/ online], so you may prefer to refer to it directly.)''
* '''Concepts to know:'''
** Random variable; population, sample. Cumulative distribution function, probability mass function, probability density function. Specific distributions: Bernoulli, binomial, Poisson, geometric, Gaussian, exponential, t, Gamma. Expectation value. Variance, standard deviation. Sample mean, sample variance. Median, mode. Quartile, percentile, quantile. Inter-quartile range. Histograms.
** Joint distribution functions. Conditional distributions; conditional expectations and variances. Statistical independence and dependence. Covariance and correlation; why dependence is not the same thing as correlation. Rules for arithmetic with expectations, variances and covariances. Laws of total probability, [http://en.wikipedia.org/wiki/Law_of_total_expectation total expectation], total variation. Contingency tables; odds ratio, log odds ratio.
** Sequences of random variables. Stochastic process. [http://en.wikipedia.org/wiki/Law_of_large_numbers Law of large numbers]. [http://en.wikipedia.org/wiki/Central_limit_theorem Central limit theorem].
** Parameters; estimator functions and point estimates. Sampling distribution. Bias of an estimator. Standard error of an estimate; standard error of the mean; how and why the standard error of the mean differs from the standard deviation. Confidence intervals and interval estimates.
** Hypothesis tests. Tests for differences in means and in proportions; Z and t tests; degrees of freedom. Size, significance, power. Relation between hypothesis tests and confidence intervals. χ² test of independence for contingency tables; degrees of freedom. KS test for goodness-of-fit to distributions.
** Linear regression. Meaning of the linear regression function. Fitted values and residuals of a regression. Interpretation of regression coefficients. Least-squares estimate of coefficients. Matrix formula for estimating the coefficients; the hat matrix. R²; why adding more predictor variables never reduces R². The t-test for the significance of individual coefficients given other coefficients. The F-test and partial F-test for the significance of regression models. Degrees of freedom for residuals. Examination of residuals. Confidence intervals for parameters. Confidence intervals for fitted values. Prediction intervals.
** Likelihood. Likelihood functions. Maximum likelihood estimates. Relation between maximum likelihood, least squares, and Gaussian distributions. Relation between confidence intervals and the likelihood function. Likelihood ratio test.
* '''I. Regression and Its Generalizations. 1. Regression basics'''
'''1.1 Statistics, Data Analysis, Regression'''
'''1.2 Guessing the Value of a Random Variable'''
Use the mean squared error to measure how badly we do when guessing the value of Y with a constant a:
<math>MSE(a) = E[(Y-a)^2]</math>
<math>MSE(a) = (E[Y-a])^2 + V[Y-a]</math>
<math>MSE(a) = (E[Y]-a)^2 + V[Y]</math>
<math>\frac{dMSE}{da}(a) = -2(E[Y]-a)</math>
<math>\frac{dMSE}{da}(r) = 0 \Leftrightarrow r = E[Y]</math>
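A quick numerical check of this result (a Python sketch with numpy; the Gamma sample and the grid of candidate values are made up for illustration):
<pre>
import numpy as np

rng = np.random.default_rng(1859)
y = rng.gamma(shape=2.0, scale=3.0, size=10000)  # an arbitrary skewed sample

def mse(a, y):
    # empirical mean squared error of guessing the constant a
    return np.mean((y - a)**2)

grid = np.linspace(y.min(), y.max(), 1000)
best = grid[np.argmin([mse(a, y) for a in grid])]
print(best, y.mean())  # the minimizer is (approximately) the sample mean
</pre>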
'''1.2.1 Estimating the Expected Value'''
Sample mean: <math>\hat{r} = \frac{1}{n} \sum_{i=1}^n y_i</math>
If the <math>(y_i)</math> are iid, the law of large numbers says <math>\hat{r} \rightarrow E[Y] = r</math> and the central limit theorem indicates how fast the convergence is (the squared error is about <math>V[Y] / n</math>).
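A small simulation (a sketch on an arbitrary Exponential example) illustrating that the squared error of the sample mean behaves like <math>V[Y]/n</math>:
<pre>
import numpy as np

rng = np.random.default_rng(2012)
true_mean, true_var = 5.0, 25.0  # Exponential with scale 5: mean 5, variance 25

for n in (10, 100, 1000):
    # distribution of the sample mean over 2000 replicate samples of size n
    means = np.array([rng.exponential(scale=5.0, size=n).mean() for _ in range(2000)])
    print(n, np.mean((means - true_mean)**2), true_var / n)
# the empirical squared error of the sample mean tracks V[Y]/n
</pre>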
'''1.3 The Regression Function'''
Use X (predictor, independent variable, covariate, or input) to predict Y (dependent variable, output, or response). How badly are we doing when using f(X) to predict Y?
<math>MSE(f(X)) = E[(Y-f(X))^2]</math>
Use law of total expectation (<math>E[U]=E[E[U|V]]</math>):
<math>MSE(f(X)) = E[E[(Y-f(X))^2|X]]</math>
<math>MSE(f(X)) = E[V[Y|X] + (E[Y-f(X)|X])^2]</math>
Regression function: <math>r(x) = E[Y|X=x]</math>
'''1.3.1 Some Disclaimers'''
Usually we observe <math>Y|X = r(X) + \eta(X)</math>, i.e. <math>\eta</math> (a noise variable with mean 0 and variance <math>\sigma_X^2</math>) depends on X...
'''1.4 Estimating the Regression Function'''
Use conditional sample means: <math>\hat{r}(x) = \frac{1}{\#\{i:x_i=x\}} \sum_{i:x_i=x} y_i</math>
Works only when X is discrete.
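A sketch of this estimator on made-up data with a discrete X (plain numpy, nothing specific to the course assumed):
<pre>
import numpy as np

rng = np.random.default_rng(0)
x = rng.integers(low=0, high=5, size=500)        # discrete predictor taking values 0..4
y = 2.0 * x + rng.normal(scale=1.0, size=500)    # noisy response, true r(x) = 2x

# conditional sample mean of y at each observed value of x
r_hat = {v: y[x == v].mean() for v in np.unique(x)}
print(r_hat)  # each estimate should be close to 2x
</pre>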
'''1.4.1 The Bias-Variance Tradeoff'''
<math>MSE(\hat{r}(x)) = E[(Y-\hat{r}(x))^2]</math>
<math>MSE(\hat{r}(x)) = E[(Y-r(x) + r(x)-\hat{r}(x))^2]</math>
<math>MSE(\hat{r}(x)) = E[(Y-r(x))^2 + 2(Y-r(x))(r(x)-\hat{r}(x)) + (r(x)-\hat{r}(x))^2]</math>
The cross term vanishes because <math>E[Y-r(x)|X=x]=0</math>, so
<math>MSE(\hat{r}(x)) = \sigma_x^2 + (r(x)-\hat{r}(x))^2</math>
In fact, we have analyzed <math>MSE(\hat{R}_n(x)|\hat{R}_n=\hat{r})</math> where <math>\hat{R}_n</math> is a random regression function estimated using n random pairs <math>(x_i,y_i)</math>.
<math>MSE(\hat{R}_n(x)) = E[(Y-\hat{R}_n(X))^2|X=x]</math>
<math>MSE(\hat{R}_n(x)) = E[E[(Y-\hat{R}_n(X))^2|X=x,\hat{R}_n=\hat{r}]|X=x]</math>
<math>MSE(\hat{R}_n(x)) = E[\sigma_x^2 + (r(x)-\hat{R}_n(x))^2]</math>
<math>MSE(\hat{R}_n(x)) = \sigma_x^2 + E[(r(x)-E[\hat{R}_n(x)]+E[\hat{R}_n(x)]-\hat{R}_n(x))^2]</math>
<math>MSE(\hat{R}_n(x)) = \sigma_x^2 + (r(x)-E[\hat{R}_n(x)])^2 + V[\hat{R}_n(x)]</math>
Even if our method is unbiased (<math>r(x) = E[\hat{R}_n(x)]</math>, no approximation bias), we can still have a lot of variance in our estimates (<math>V[\hat{R}_n(x)]</math> large).
A method is '''consistent''' (for r) when both the approximation bias and the estimation variance go to 0 as we get more and more data.
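A small simulation of the decomposition above (a sketch under made-up settings), comparing a biased but low-variance estimator (the overall sample mean, which ignores X) with an unbiased but higher-variance one (the conditional sample mean at x):
<pre>
import numpy as np

rng = np.random.default_rng(42)
n, x0, sigma = 200, 3, 1.0
r = lambda x: 2.0 * x                      # made-up true regression function

est_flat, est_cond = [], []
for _ in range(5000):
    x = rng.integers(0, 5, size=n)
    y = r(x) + rng.normal(scale=sigma, size=n)
    est_flat.append(y.mean())              # ignores x: biased at x0, low variance
    est_cond.append(y[x == x0].mean())     # conditional sample mean: unbiased, more variance

for name, est in (("flat", np.array(est_flat)), ("conditional", np.array(est_cond))):
    bias2 = (r(x0) - est.mean())**2
    var = est.var()
    print(name, "bias^2:", bias2, "variance:", var, "MSE at x0:", sigma**2 + bias2 + var)
</pre>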
'''1.4.2 The Bias-Variance Trade-Off in Action'''
'''1.4.3 Ordinary Least Squares Linear Regression as Smoothing'''
Assume X is one-dimensional and both X and Y are centered. Choose to approximate r(x) by <math>\alpha+\beta x</math>; we need to find the values a and b of <math>\alpha</math> and <math>\beta</math> that minimize the MSE.
<math>MSE(\alpha,\beta) = E[(Y-\alpha-\beta X)^2]</math>
<math>MSE(\alpha,\beta) = E[E[(Y-\alpha-\beta X)^2|X]]</math>
<math>MSE(\alpha,\beta) = E[V[Y|X] + (E[Y-\alpha-\beta X|X])^2]</math>
<math>MSE(\alpha,\beta) = E[V[Y|X]] + E[(E[Y-\alpha-\beta X|X])^2]</math>
<math>\frac{\partial MSE}{\partial \alpha} = E[2(-1)(Y-\alpha-\beta X)]</math>
<math>\frac{\partial MSE}{\partial \alpha} = 0 \Leftrightarrow a = E[Y] - b E[X] = 0</math> (the last equality holds because X and Y are centered)
<math>\frac{\partial MSE}{\partial \beta} = E[2(-X)(Y-\alpha-\beta X)]</math>
<math>\frac{\partial MSE}{\partial \beta} = 0 \Leftrightarrow E[XY]-bE[X^2] = 0 \Leftrightarrow b = \frac{Cov[X,Y]}{V[X]}</math> (since X and Y are centered, <math>E[XY]=Cov[X,Y]</math> and <math>E[X^2]=V[X]</math>)
Now, estimate a and b from the data (replacing population values by sample values, or minimizing the residual sum of squares):
<math>\hat{a} = 0</math> and <math>\hat{b} = \frac{\sum_i y_i x_i}{\sum_i x_i^2}</math>
Least-squares linear regression is thus a smoothing of the data:
<math>\hat{r}(x) = \hat{b}x = \sum_i y_i \frac{x_i}{n s_X^2} x</math>
Indeed, the prediction is a weighted average of the observed values <math>y_i</math>, where the weights are proportional to how far <math>x_i</math> is from the center of the data, relative to the variance, and proportional to the magnitude of x.
Note that the weight of a data point depends on how far it is from the center of all the data, not how far it is from the point at which we are trying to predict.
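A sketch checking on made-up centered data that the least-squares prediction at a point is indeed this weighted sum of the <math>y_i</math>:
<pre>
import numpy as np

rng = np.random.default_rng(7)
n = 300
x = rng.normal(size=n); x -= x.mean()                        # center X
y = 1.5 * x + rng.normal(scale=0.5, size=n); y -= y.mean()   # center Y

b_hat = np.sum(y * x) / np.sum(x**2)      # least-squares slope (no intercept needed)

x_new = 0.8
s2 = np.mean(x**2)                        # sample variance of centered x (1/n convention)
weights = x * x_new / (n * s2)            # w(x_i, x_new) = x_i * x_new / (n s_X^2)
print(b_hat * x_new, np.sum(weights * y)) # the two predictions agree
</pre>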
'''1.5 Linear Smoothers'''
<math>\hat{r}(x) = \sum_i y_i \hat{w}(x_i,x)</math>
Sample mean: <math>\hat{w}(x_i,x) = 1/n</math>
Ordinary linear regression: <math>\hat{w}(x_i,x) = (x_i/ns_X^2)x</math>
'''1.5.1 k-Nearest-Neighbor Regression'''
<math>\hat{w}(x_i,x) = 1/k</math> if <math>x_i</math> is one of the k nearest neighbors of x, 0 otherwise
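A minimal k-nearest-neighbor regression sketch using these weights (illustrative data, k chosen arbitrarily):
<pre>
import numpy as np

def knn_predict(x_new, x, y, k=5):
    # weight 1/k for the k nearest neighbors of x_new, 0 for all other points
    idx = np.argsort(np.abs(x - x_new))[:k]
    return y[idx].mean()

rng = np.random.default_rng(3)
x = rng.uniform(0, 10, size=200)
y = np.sin(x) + rng.normal(scale=0.3, size=200)
print(knn_predict(2.0, x, y), np.sin(2.0))  # prediction vs true r(2)
</pre>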
'''1.5.2 Kernel Smoothers'''
For instance, take <math>K(x_i,x)</math> to be a Gaussian kernel in <math>x_i - x</math> with mean 0 and standard deviation <math>\sqrt{h}</math>, where h is the bandwidth, so that <math>\hat{w}(x_i,x) = \frac{K(x_i,x)}{\sum_j K(x_j,x)}</math>
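A sketch of a Gaussian-kernel smoother matching these weights (illustrative data; the bandwidth value is arbitrary):
<pre>
import numpy as np

def kernel_predict(x_new, x, y, h=0.5):
    # Gaussian kernel in (x_i - x_new) with variance h (bandwidth)
    K = np.exp(-(x - x_new)**2 / (2.0 * h))
    w = K / K.sum()                 # w(x_i, x) = K(x_i, x) / sum_j K(x_j, x)
    return np.sum(w * y)

rng = np.random.default_rng(4)
x = rng.uniform(0, 10, size=200)
y = np.sin(x) + rng.normal(scale=0.3, size=200)
print(kernel_predict(2.0, x, y), np.sin(2.0))  # prediction vs true r(2)
</pre>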
'''1.6 Exercises'''
What minimizes the mean absolute error?
<math>MAE(a) = E[|Y-a|]</math>
<math>MAE(a) = - \int_l^a (Y-a) p(Y) dY + \int_a^u (Y-a) p(Y) dY</math> where l and u are the lower and upper bounds of the support of Y
Using Leibniz rule for differentiation under the integral:
<math>\frac{dMAE}{da}(a) = \int_l^a p(Y) dY - \int_a^u p(Y) dY</math>
<math>\frac{dMAE}{da}(a) = 2 \int_l^a p(Y) dY - 1</math>
<math>\frac{dMAE}{da}(a) = 2 P(Y \le a) - 1</math>
<math>\frac{dMAE}{da}(a) = 0 \Leftrightarrow P(Y \le r) = \frac{1}{2}</math>
The median minimizes the MAE.
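A numerical check (a sketch on an arbitrary skewed sample) that the median indeed minimizes the mean absolute error:
<pre>
import numpy as np

rng = np.random.default_rng(11)
y = rng.lognormal(mean=0.0, sigma=1.0, size=10000)   # an arbitrary skewed sample

grid = np.linspace(y.min(), y.max(), 2000)
mae = [np.mean(np.abs(y - a)) for a in grid]
print(grid[np.argmin(mae)], np.median(y))            # the minimizer is (about) the median
</pre>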

