User:Timothee Flutre/Notebook/Postdoc/2012/03/04
From OpenWetWare
<!-- ##### DO NOT edit above this line unless you know what you are doing. ##### -->
=="Advanced Data Analysis from an Elementary Point of View" by Cosma Shalizi==

''(This page summarizes my notes on this great course. The whole course is available [http://www.stat.cmu.edu/~cshalizi/ADAfaEPoV/ online], so you may prefer to refer to it directly.)''

* '''Concepts to know:'''
** Random variable; population, sample. Cumulative distribution function, probability mass function, probability density function. Specific distributions: Bernoulli, binomial, Poisson, geometric, Gaussian, exponential, t, Gamma. Expectation value. Variance, standard deviation. Sample mean, sample variance. Median, mode. Quartile, percentile, quantile. Inter-quartile range. Histograms.
** Joint distribution functions. Conditional distributions; conditional expectations and variances. Statistical independence and dependence. Covariance and correlation; why dependence is not the same thing as correlation. Rules for arithmetic with expectations, variances and covariances. Laws of total probability, [http://en.wikipedia.org/wiki/Law_of_total_expectation total expectation], total variation. Contingency tables; odds ratio, log odds ratio.
** Sequences of random variables. Stochastic process. [http://en.wikipedia.org/wiki/Law_of_large_numbers Law of large numbers]. [http://en.wikipedia.org/wiki/Central_limit_theorem Central limit theorem].
** Parameters; estimator functions and point estimates. Sampling distribution. Bias of an estimator. Standard error of an estimate; standard error of the mean; how and why the standard error of the mean differs from the standard deviation. Confidence intervals and interval estimates.
** Hypothesis tests. Tests for differences in means and in proportions; Z and t tests; degrees of freedom. Size, significance, power. Relation between hypothesis tests and confidence intervals. χ<sup>2</sup> test of independence for contingency tables; degrees of freedom. KS test for goodness-of-fit to distributions.
** Linear regression. Meaning of the linear regression function. Fitted values and residuals of a regression. Interpretation of regression coefficients. Least-squares estimate of coefficients. Matrix formula for estimating the coefficients; the hat matrix. R<sup>2</sup>; why adding more predictor variables never reduces R<sup>2</sup>. The t-test for the significance of individual coefficients given other coefficients. The F-test and partial F-test for the significance of regression models. Degrees of freedom for residuals. Examination of residuals. Confidence intervals for parameters. Confidence intervals for fitted values. Prediction intervals.
** Likelihood. Likelihood functions. Maximum likelihood estimates. Relation between maximum likelihood, least squares, and Gaussian distributions. Relation between confidence intervals and the likelihood function. Likelihood ratio test.

* '''I. Regression and Its Generalizations. 1. Regression basics'''

'''1.1 Statistics, Data Analysis, Regression'''

'''1.2 Guessing the Value of a Random Variable'''

Use the mean squared error to measure how badly we do when guessing the value of Y with a constant a:

<math>MSE(a) = E[(Y-a)^2]</math>

<math>MSE(a) = (E[Y-a])^2 + V[Y-a]</math>

<math>MSE(a) = (E[Y]-a)^2 + V[Y]</math>

<math>\frac{dMSE}{da}(a) = -2(E[Y]-a)</math>

<math>\frac{dMSE}{da}(r) = 0 \Leftrightarrow r = E[Y]</math>
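''(A quick numerical sanity check, not from the course: a Python/numpy sketch that scans constant guesses a and confirms the empirical MSE is minimized at the sample mean.)''

```python
import numpy as np

rng = np.random.default_rng(1)
y = rng.gamma(shape=2.0, scale=3.0, size=20_000)  # a skewed Y with E[Y] = 6

def mse(a, y):
    """Empirical mean squared error of the constant guess a."""
    return np.mean((y - a) ** 2)

# Scan candidate constant guesses; the minimizer should sit at the sample mean.
grid = np.linspace(0.0, 12.0, 1201)
best = grid[np.argmin([mse(a, y) for a in grid])]
print(best, y.mean())  # the two values nearly coincide
```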

'''1.2.1 Estimating the Expected Value'''

Sample mean: <math>\hat{r} = \frac{1}{n} \sum_{i=1}^n y_i</math>

If the <math>(y_i)</math> are iid, the law of large numbers says <math>\hat{r} \rightarrow E[Y] = r</math> and the central limit theorem indicates how fast the convergence is (the squared error is about <math>V[Y] / n</math>).
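''(This rate can be checked by simulation — a Python/numpy sketch, with an exponential Y chosen so that V[Y] = 4.)''

```python
import numpy as np

rng = np.random.default_rng(0)
n, reps = 200, 20_000
y = rng.exponential(scale=2.0, size=(reps, n))  # E[Y] = 2, V[Y] = 4

means = y.mean(axis=1)                  # one sample mean per replicate
emp_mse = np.mean((means - 2.0) ** 2)   # squared error of the sample mean
print(emp_mse, 4.0 / n)                 # empirical MSE vs V[Y]/n = 0.02
```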

'''1.3 The Regression Function'''

Use X (predictor, independent variable, covariate or input) to predict Y (dependent variable, response or output). How badly do we do when using f(X) to predict Y?

<math>MSE(f(X)) = E[(Y-f(X))^2]</math>

Use the law of total expectation (<math>E[U]=E[E[U|V]]</math>):

<math>MSE(f(X)) = E[E[(Y-f(X))^2|X]]</math>

<math>MSE(f(X)) = E[V[Y|X] + (E[Y-f(X)|X])^2]</math>

Regression function: <math>r(x) = E[Y|X=x]</math>

'''1.3.1 Some Disclaimers'''

Usually we observe <math>Y|X = r(X) + \eta(X)</math>, i.e. <math>\eta</math> (a noise variable with mean 0 and variance <math>\sigma_X^2</math>) depends on X...

'''1.4 Estimating the Regression Function'''

Use conditional sample means: <math>\hat{r}(x) = \frac{1}{\sharp \{i:x_i=x\}} \sum_{i:x_i=x} y_i</math>

This works only when X is discrete.
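''(A minimal sketch of the conditional-sample-means estimator in Python/numpy; the simulated setup with true r(x) = 2x is just an illustration.)''

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.integers(0, 4, size=5000)          # discrete predictor in {0,1,2,3}
y = 2.0 * x + rng.normal(0, 1, size=5000)  # true r(x) = 2x plus noise

def r_hat(x0, x, y):
    """Conditional sample mean: average the y_i whose x_i equals x0."""
    mask = (x == x0)
    return y[mask].mean()

print([round(r_hat(v, x, y), 2) for v in range(4)])  # close to [0, 2, 4, 6]
```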

'''1.4.1 The Bias-Variance Tradeoff'''

For a fixed estimate <math>\hat{r}</math>, conditioning on <math>X=x</math>:

<math>MSE(\hat{r}(x)) = E[(Y-\hat{r}(x))^2|X=x]</math>

<math>MSE(\hat{r}(x)) = E[(Y-r(x) + r(x)-\hat{r}(x))^2|X=x]</math>

<math>MSE(\hat{r}(x)) = E[(Y-r(x))^2 + 2(Y-r(x))(r(x)-\hat{r}(x)) + (r(x)-\hat{r}(x))^2|X=x]</math>

The cross term vanishes because <math>E[Y-r(x)|X=x]=0</math>, leaving:

<math>MSE(\hat{r}(x)) = \sigma_x^2 + (r(x)-\hat{r}(x))^2</math>

In fact, we have analyzed <math>MSE(\hat{R}_n(x)|\hat{R}_n=\hat{r})</math> where <math>\hat{R}_n</math> is a random regression function estimated using n random pairs <math>(x_i,y_i)</math>.

<math>MSE(\hat{R}_n(x)) = E[(Y-\hat{R}_n(X))^2|X=x]</math>

<math>MSE(\hat{R}_n(x)) = E[E[(Y-\hat{R}_n(X))^2|X=x,\hat{R}_n=\hat{r}]|X=x]</math>

<math>MSE(\hat{R}_n(x)) = E[\sigma_x^2 + (r(x)-\hat{R}_n(x))^2]</math>

<math>MSE(\hat{R}_n(x)) = \sigma_x^2 + E[(r(x)-E[\hat{R}_n(x)]+E[\hat{R}_n(x)]-\hat{R}_n(x))^2]</math>

<math>MSE(\hat{R}_n(x)) = \sigma_x^2 + (r(x)-E[\hat{R}_n(x)])^2 + V[\hat{R}_n(x)]</math>

Even if our method is unbiased (<math>r(x) = E[\hat{R}_n(x)]</math>, no approximation bias), we can still have a lot of variance in our estimates (<math>V[\hat{R}_n(x)]</math> large).

A method is '''consistent''' (for r) when both the approximation bias and the estimation variance go to 0 as we get more and more data.
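''(The decomposition <math>MSE = \sigma^2 + bias^2 + variance</math> can be verified by Monte Carlo. A Python/numpy sketch; the quadratic true r and the deliberately biased estimator — the plain sample mean of the y_i — are illustrative choices, not from the course.)''

```python
import numpy as np

rng = np.random.default_rng(3)
sigma, n, reps, x0 = 1.0, 50, 20_000, 1.0
r = lambda x: x ** 2          # true regression function
# Estimator: ignore x and predict with the sample mean of the y_i (biased at x0).
x = rng.uniform(-2, 2, size=(reps, n))
y = r(x) + rng.normal(0, sigma, size=(reps, n))
r_hat = y.mean(axis=1)        # one estimate per simulated dataset

bias2 = (r(x0) - r_hat.mean()) ** 2
var = r_hat.var()
# Direct Monte Carlo MSE at x0: a fresh Y drawn at X = x0 for each dataset.
y_new = r(x0) + rng.normal(0, sigma, size=reps)
mse = np.mean((y_new - r_hat) ** 2)
print(mse, sigma**2 + bias2 + var)  # the two values nearly agree
```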

'''1.4.2 The Bias-Variance Trade-Off in Action'''

'''1.4.3 Ordinary Least Squares Linear Regression as Smoothing'''

Assume X is one-dimensional and both X and Y are centered. Choose to approximate r(x) by <math>\alpha+\beta x</math>. We need to find the values a and b of <math>\alpha</math> and <math>\beta</math> that minimize the MSE.

<math>MSE(\alpha,\beta) = E[(Y-\alpha-\beta X)^2]</math>

<math>MSE(\alpha,\beta) = E[E[(Y-\alpha-\beta X)^2|X]]</math>

<math>MSE(\alpha,\beta) = E[V[Y|X] + (E[Y-\alpha-\beta X|X])^2]</math>

<math>MSE(\alpha,\beta) = E[V[Y|X]] + E[(E[Y-\alpha-\beta X|X])^2]</math>

<math>\frac{\partial MSE}{\partial \alpha} = E[2(-1)(Y-\alpha-\beta X)]</math>

<math>\frac{\partial MSE}{\partial \alpha} = 0 \Leftrightarrow a = E[Y] - b E[X] = 0</math> (since X and Y are centered)

<math>\frac{\partial MSE}{\partial \beta} = E[2(-X)(Y-\alpha-\beta X)]</math>

<math>\frac{\partial MSE}{\partial \beta} = 0 \Leftrightarrow E[XY]-bE[X^2] = 0 \Leftrightarrow b = \frac{Cov[X,Y]}{V[X]}</math>

Now, estimate a and b from the data (replacing population values by sample values, or minimizing the residual sum of squares):

<math>\hat{a} = 0</math> and <math>\hat{b} = \frac{\sum_i y_i x_i}{\sum_i x_i^2}</math>

Least-squares linear regression is thus a smoothing of the data:

<math>\hat{r}(x) = \hat{b}x = \sum_i y_i \frac{x_i}{n s_X^2} x</math>

Indeed, the prediction is a weighted average of the observed values <math>y_i</math>, where the weights are proportional to how far <math>x_i</math> is from the center of the data, relative to the variance, and proportional to the magnitude of x.

Note that the weight of a data point depends on how far it is from the center of all the data, not on how far it is from the point at which we are trying to predict.
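''(The smoothing view can be checked numerically — a Python/numpy sketch: the weighted average with weights <math>x_i x / (n s_X^2)</math> returns exactly the OLS prediction <math>\hat{b}x</math>.)''

```python
import numpy as np

rng = np.random.default_rng(4)
n = 100
x = rng.normal(0, 1, n); x -= x.mean()              # center X
y = 3.0 * x + rng.normal(0, 0.5, n); y -= y.mean()  # center Y

s2 = np.mean(x ** 2)                 # sample variance of the centered X
x_new = 0.7
w = (x / (n * s2)) * x_new           # smoother weights w(x_i, x_new)
pred_smooth = np.sum(y * w)          # weighted average of the y_i
pred_ols = (np.sum(y * x) / np.sum(x ** 2)) * x_new  # b_hat * x_new
print(pred_smooth, pred_ols)         # identical up to rounding
```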

'''1.5 Linear Smoothers'''

<math>\hat{r}(x) = \sum_i y_i \hat{w}(x_i,x)</math>

Sample mean: <math>\hat{w}(x_i,x) = 1/n</math>

Ordinary linear regression: <math>\hat{w}(x_i,x) = (x_i/ns_X^2)x</math>

'''1.5.1 k-Nearest-Neighbor Regression'''

<math>\hat{w}(x_i,x) = 1/k</math> if <math>x_i</math> is one of the k nearest neighbors of x, 0 otherwise
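''(A minimal k-nearest-neighbor sketch in Python/numpy; the sine test function is illustrative.)''

```python
import numpy as np

rng = np.random.default_rng(5)
x = np.sort(rng.uniform(0, 10, 300))
y = np.sin(x) + rng.normal(0, 0.2, 300)

def knn_predict(x0, x, y, k):
    """Average the y_i of the k training points closest to x0."""
    idx = np.argsort(np.abs(x - x0))[:k]
    return y[idx].mean()

print(round(knn_predict(np.pi / 2, x, y, k=15), 2))  # close to sin(pi/2) = 1
```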

'''1.5.2 Kernel Smoothers'''

For instance, use a Gaussian kernel with bandwidth h, <math>K(x_i,x) = e^{-(x_i-x)^2/2h^2}</math>, so that <math>\hat{w}(x_i,x) = \frac{K(x_i,x)}{\sum_j K(x_j,x)}</math>
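''(A minimal Nadaraya-Watson-style sketch of such a kernel smoother in Python/numpy; the Gaussian kernel and sine test function are illustrative.)''

```python
import numpy as np

rng = np.random.default_rng(6)
x = rng.uniform(0, 10, 400)
y = np.sin(x) + rng.normal(0, 0.2, 400)

def kernel_predict(x0, x, y, h):
    """Kernel-weighted average of the y_i with a Gaussian kernel of bandwidth h."""
    k = np.exp(-((x - x0) ** 2) / (2 * h ** 2))
    w = k / k.sum()                      # normalized weights w(x_i, x0)
    return np.sum(w * y)

print(round(kernel_predict(np.pi / 2, x, y, h=0.3), 2))  # close to 1
```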

'''1.6 Exercises'''

What minimizes the mean absolute error?

<math>MAE(a) = E[|Y-a|]</math>

<math>MAE(a) = - \int_l^a (Y-a) p(Y) dY + \int_a^u (Y-a) p(Y) dY</math> (where l and u are the lower and upper bounds of the support of Y)

Using the Leibniz rule for differentiation under the integral sign:

<math>\frac{dMAE}{da}(a) = \int_l^a p(Y) dY - \int_a^u p(Y) dY</math>

<math>\frac{dMAE}{da}(a) = 2 \int_l^a p(Y) dY - 1</math>

<math>\frac{dMAE}{da}(a) = 2 P(Y \le a) - 1</math>

<math>\frac{dMAE}{da}(a) = 0 \Leftrightarrow P(Y \le r) = \frac{1}{2}</math>

The median minimizes the MAE.
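''(This too can be checked numerically — a Python/numpy sketch: minimizing the empirical MAE over a grid lands on the sample median, not the sample mean.)''

```python
import numpy as np

rng = np.random.default_rng(7)
y = rng.exponential(scale=1.0, size=20_000)  # skewed, so mean != median

def mae(a):
    """Empirical mean absolute error of the constant guess a."""
    return np.mean(np.abs(y - a))

grid = np.linspace(0.0, 2.0, 1001)
best = grid[np.argmin([mae(a) for a in grid])]
print(best, np.median(y), y.mean())  # the minimizer matches the median
```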
<!-- ##### DO NOT edit below this line unless you know what you are doing. ##### -->