User:Timothee Flutre/Notebook/Postdoc/2012/03/04
"Advanced Data Analysis from an Elementary Point of View" by Cosma Shalizi (This page summarizes my notes about this great course. The whole course is available online, so you may prefer to refer to it directly.)
1.1 Statistics, Data Analysis, Regression

1.2 Guessing the Value of a Random Variable

Use the mean squared error to see how badly we do when guessing the value of Y with a constant a:

[math]\displaystyle{ MSE(a) = E[(Y-a)^2] }[/math]

[math]\displaystyle{ MSE(a) = (E[Y-a])^2 + V[Y-a] }[/math]

[math]\displaystyle{ MSE(a) = (E[Y]-a)^2 + V[Y] }[/math]

[math]\displaystyle{ \frac{dMSE}{da}(a) = -2(E[Y]-a) }[/math]

[math]\displaystyle{ \frac{dMSE}{da}(r) = 0 \Leftrightarrow r = E[Y] }[/math]

1.2.1 Estimating the Expected Value

Sample mean: [math]\displaystyle{ \hat{r} = \frac{1}{n} \sum_{i=1}^n y_i }[/math]

If the [math]\displaystyle{ (y_i) }[/math] are iid, the law of large numbers says [math]\displaystyle{ \hat{r} \rightarrow E[Y] = r }[/math], and the central limit theorem indicates how fast the convergence is (the squared error is about [math]\displaystyle{ V[Y] / n }[/math]).

1.3 The Regression Function

Use X (predictor, independent variable, covariate or input) to predict Y (dependent variable, output or response). How badly do we do when using f(X) to predict Y?

[math]\displaystyle{ MSE(f(X)) = E[(Y-f(X))^2] }[/math]

Use the law of total expectation ([math]\displaystyle{ E[U]=E[E[U|V]] }[/math]):

[math]\displaystyle{ MSE(f(X)) = E[E[(Y-f(X))^2|X]] }[/math]

[math]\displaystyle{ MSE(f(X)) = E[V[Y|X] + (E[Y-f(X)|X])^2] }[/math]

Regression function: [math]\displaystyle{ r(x) = E[Y|X=x] }[/math]

1.3.1 Some Disclaimers

Usually we observe [math]\displaystyle{ Y|X = r(X) + \eta(X) }[/math], i.e. [math]\displaystyle{ \eta }[/math] (a noise variable with mean 0 and variance [math]\displaystyle{ \sigma_X^2 }[/math]) depends on X...

1.4 Estimating the Regression Function

Use conditional sample means: [math]\displaystyle{ \hat{r}(x) = \frac{1}{\sharp \{i:x_i=x\}} \sum_{i:x_i=x} y_i }[/math]

This works only when X is discrete.
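The two ideas above (the mean minimizes the MSE; conditional sample means estimate r(x) for discrete X) can be checked numerically. A minimal sketch, with all data and parameters illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulate Y ~ N(3, 2^2) and check that the sample mean minimizes the
# empirical mean squared error among constant guesses a.
y = rng.normal(loc=3.0, scale=2.0, size=100_000)

def mse(a, y):
    return np.mean((y - a) ** 2)

grid = np.linspace(0.0, 6.0, 601)
best = grid[np.argmin([mse(a, y) for a in grid])]
assert abs(best - y.mean()) < 0.01  # minimizer is (close to) the sample mean

# Estimating the regression function r(x) = E[Y|X=x] by conditional
# sample means when X is discrete (here X in {0, 1, 2}, true r(x) = 2x):
x = rng.integers(0, 3, size=100_000)
y2 = 2.0 * x + rng.normal(size=x.size)
r_hat = {v: y2[x == v].mean() for v in (0, 1, 2)}
```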
1.4.1 The Bias-Variance Tradeoff

[math]\displaystyle{ MSE(\hat{r}(x)) = E[(Y-\hat{r}(x))^2] }[/math]

[math]\displaystyle{ MSE(\hat{r}(x)) = E[(Y-r(x) + r(x)-\hat{r}(x))^2] }[/math]

[math]\displaystyle{ MSE(\hat{r}(x)) = E[(Y-r(x))^2 + 2(Y-r(x))(r(x)-\hat{r}(x)) + (r(x)-\hat{r}(x))^2] }[/math]

[math]\displaystyle{ MSE(\hat{r}(x)) = \sigma_x^2 + (r(x)-\hat{r}(x))^2 }[/math]

(the cross term vanishes because, with [math]\displaystyle{ \hat{r}(x) }[/math] treated as fixed, [math]\displaystyle{ E[Y-r(x)|X=x] = 0 }[/math])

In fact, we have analyzed [math]\displaystyle{ MSE(\hat{R}_n(x)|\hat{R}_n=\hat{r}) }[/math] where [math]\displaystyle{ \hat{R}_n }[/math] is a random regression function estimated using n random pairs [math]\displaystyle{ (x_i,y_i) }[/math].

[math]\displaystyle{ MSE(\hat{R}_n(x)) = E[(Y-\hat{R}_n(X))^2|X=x] }[/math]

[math]\displaystyle{ MSE(\hat{R}_n(x)) = E[E[(Y-\hat{R}_n(X))^2|X=x,\hat{R}_n=\hat{r}]|X=x] }[/math]

[math]\displaystyle{ MSE(\hat{R}_n(x)) = E[\sigma_x^2 + (r(x)-\hat{R}_n(x))^2] }[/math]

[math]\displaystyle{ MSE(\hat{R}_n(x)) = \sigma_x^2 + E[(r(x)-E[\hat{R}_n(x)]+E[\hat{R}_n(x)]-\hat{R}_n(x))^2] }[/math]

[math]\displaystyle{ MSE(\hat{R}_n(x)) = \sigma_x^2 + (r(x)-E[\hat{R}_n(x)])^2 + V[\hat{R}_n(x)] }[/math]

Even if our method is unbiased ([math]\displaystyle{ r(x) = E[\hat{R}_n(x)] }[/math], no approximation bias), we can still have a lot of variance in our estimates ([math]\displaystyle{ V[\hat{R}_n(x)] }[/math] large). A method is consistent (for r) when both the approximation bias and the estimation variance go to 0 as we get more and more data.

1.4.2 The Bias-Variance Trade-Off in Action

1.4.3 Ordinary Least Squares Linear Regression as Smoothing

Assume X is one-dimensional and both X and Y are centered. Choose to approximate r(x) by [math]\displaystyle{ \alpha+\beta x }[/math]. We need to find the values a and b of these coefficients minimizing the MSE.
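The decomposition [math]\displaystyle{ MSE = \sigma_x^2 + bias^2 + variance }[/math] can be verified by simulation, using a deliberately biased estimator (a straight line through the origin fit to data from [math]\displaystyle{ r(x)=x^2 }[/math]). All parameters here are illustrative:

```python
import numpy as np

rng = np.random.default_rng(1)

# Monte Carlo check of MSE(R_hat_n(x)) = sigma^2 + bias^2 + variance,
# at a fixed point x0, for a misspecified (hence biased) estimator.
sigma, n, n_rep, x0 = 0.5, 50, 20_000, 0.8
r = lambda x: x ** 2          # true regression function

preds = np.empty(n_rep)
sq_err = np.empty(n_rep)
for i in range(n_rep):
    x = rng.uniform(0, 1, size=n)
    y = r(x) + rng.normal(scale=sigma, size=n)
    b = np.sum(x * y) / np.sum(x * x)        # least-squares slope, no intercept
    preds[i] = b * x0                        # R_hat_n(x0) for this dataset
    y_new = r(x0) + rng.normal(scale=sigma)  # fresh observation at x0
    sq_err[i] = (y_new - preds[i]) ** 2

bias2 = (preds.mean() - r(x0)) ** 2
var = preds.var()
mse = sq_err.mean()
# mse should be close to sigma**2 + bias2 + var
```

Here the approximation bias does not vanish with n (the model class is wrong), while the estimation variance shrinks like 1/n.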
[math]\displaystyle{ MSE(\alpha,\beta) = E[(Y-\alpha-\beta X)^2] }[/math]

[math]\displaystyle{ MSE(\alpha,\beta) = E[E[(Y-\alpha-\beta X)^2|X]] }[/math]

[math]\displaystyle{ MSE(\alpha,\beta) = E[V[Y|X] + (E[Y-\alpha-\beta X|X])^2] }[/math]

[math]\displaystyle{ MSE(\alpha,\beta) = E[V[Y|X]] + E[(E[Y-\alpha-\beta X|X])^2] }[/math]

[math]\displaystyle{ \frac{\partial MSE}{\partial \alpha} = E[2(-1)(Y-\alpha-\beta X)] }[/math]

[math]\displaystyle{ \frac{\partial MSE}{\partial \alpha} = 0 \Leftrightarrow a = E[Y] - b E[X] = 0 }[/math]

(the optimal intercept is 0 because X and Y are centered)

[math]\displaystyle{ \frac{\partial MSE}{\partial \beta} = E[2(-X)(Y-\alpha-\beta X)] }[/math]

[math]\displaystyle{ \frac{\partial MSE}{\partial \beta} = 0 \Leftrightarrow E[XY]-bE[X^2] = 0 \Leftrightarrow b = \frac{Cov[X,Y]}{V[X]} }[/math]

Now, estimate a and b from the data (replacing population values by sample values, or minimizing the residual sum of squares): [math]\displaystyle{ \hat{a} = 0 }[/math] and [math]\displaystyle{ \hat{b} = \frac{\sum_i y_i x_i}{\sum_i x_i^2} }[/math]

Least-squares linear regression is thus a smoothing of the data:

[math]\displaystyle{ \hat{r}(x) = \hat{b}x = \sum_i y_i \frac{x_i}{n s_X^2} x }[/math]

Indeed, the prediction is a weighted average of the observed values [math]\displaystyle{ y_i }[/math], where the weights are proportional to how far [math]\displaystyle{ x_i }[/math] is from the center of the data (relative to the variance) and proportional to the magnitude of x. Note that the weight of a data point depends on how far it is from the center of all the data, not on how far it is from the point at which we are trying to predict.
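The "OLS as a linear smoother" identity is exact and easy to verify: the prediction at x from the slope estimate equals the weighted average of the [math]\displaystyle{ y_i }[/math] with weights [math]\displaystyle{ x_i x / (n s_X^2) }[/math]. A minimal sketch on simulated centered data:

```python
import numpy as np

rng = np.random.default_rng(2)

# OLS on centered data as a linear smoother.
n = 200
x = rng.normal(size=n); x -= x.mean()             # center X
y = 1.7 * x + rng.normal(size=n); y -= y.mean()   # center Y

b_hat = np.sum(y * x) / np.sum(x ** 2)            # least-squares slope

x_new = 0.5
s2 = np.mean(x ** 2)                              # sample variance of centered X
weights = x * x_new / (n * s2)                    # smoother weights w(x_i, x_new)
pred_smoother = np.sum(weights * y)               # weighted-average form
pred_direct = b_hat * x_new
assert np.isclose(pred_smoother, pred_direct)     # the two forms agree exactly
```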
1.5 Linear Smoothers

[math]\displaystyle{ \hat{r}(x) = \sum_i y_i \hat{w}(x_i,x) }[/math]

Sample mean: [math]\displaystyle{ \hat{w}(x_i,x) = 1/n }[/math]

Ordinary linear regression: [math]\displaystyle{ \hat{w}(x_i,x) = (x_i/ns_X^2)x }[/math]

1.5.1 k-Nearest-Neighbor Regression

[math]\displaystyle{ \hat{w}(x_i,x) = 1/k }[/math] if [math]\displaystyle{ x_i }[/math] is one of the k nearest neighbors of x, 0 otherwise.

1.5.2 Kernel Smoothers

For instance, take [math]\displaystyle{ K(x_i,x) }[/math] to be a Gaussian density [math]\displaystyle{ N(0,\sqrt{h}) }[/math] evaluated at [math]\displaystyle{ x_i - x }[/math], where h is the bandwidth, so that [math]\displaystyle{ \hat{w}(x_i,x) = \frac{K(x_i,x)}{\sum_j K(x_j,x)} }[/math]

1.6 Exercises

What minimizes the mean absolute error? [math]\displaystyle{ MAE(a) = E[|Y-a|] }[/math]

[math]\displaystyle{ MAE(a) = - \int_l^a (Y-a) p(Y) dY + \int_a^u (Y-a) p(Y) dY }[/math]

Using the Leibniz rule for differentiation under the integral sign:

[math]\displaystyle{ \frac{dMAE}{da}(a) = \int_l^a p(Y) dY - \int_a^u p(Y) dY }[/math]

[math]\displaystyle{ \frac{dMAE}{da}(a) = 2 \int_l^a p(Y) dY - 1 }[/math]

[math]\displaystyle{ \frac{dMAE}{da}(a) = 2 P(Y \le a) - 1 }[/math]

[math]\displaystyle{ \frac{dMAE}{da}(a) = 0 \Leftrightarrow P(Y \le r) = \frac{1}{2} }[/math]

The median minimizes the MAE.
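Both ideas above admit a short numerical sketch: a Nadaraya-Watson-style kernel smoother with a Gaussian kernel (here the kernel's standard deviation plays the role of the bandwidth; data and parameters are illustrative), and an empirical check that the median minimizes the MAE on a skewed sample where mean and median differ:

```python
import numpy as np

rng = np.random.default_rng(3)

# Kernel smoother: r_hat(x) = sum_i y_i K(x_i, x) / sum_j K(x_j, x),
# with a Gaussian kernel of bandwidth h.
def kernel_smoother(x_train, y_train, x, h):
    k = np.exp(-((x_train - x) ** 2) / (2 * h ** 2))  # unnormalized Gaussian
    return np.sum(y_train * k) / np.sum(k)

n = 2000
x = rng.uniform(-2, 2, size=n)
y = np.sin(x) + rng.normal(scale=0.3, size=n)
pred = kernel_smoother(x, y, 1.0, h=0.2)   # should be close to sin(1.0)

# Exercise check: the MAE-minimizing constant is the median, not the mean.
z = rng.exponential(scale=2.0, size=100_000)
grid = np.linspace(0.0, 6.0, 1201)
best = grid[np.argmin([np.mean(np.abs(z - a)) for a in grid])]
# best is (close to) np.median(z), which differs from z.mean()
```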