Physics307L F09:Help/Fitting a line
Take-home message from this class
There are statistically sound methods for obtaining the maximum-likelihood slope and intercept that fit a set of data of the form [math]\displaystyle{ (x_i,y_i) }[/math]. This really is the take-home message...I want you to remember enough to know that you can do it and to be able to quickly find the resources so you can remind yourself of the necessary assumptions about the data and the formulas (or algorithms) for calculating the best-fit values, along with their uncertainties. Two good resources:
- Chapter 6 ("Least-squares fit to a straight line") of Bevington and Robinson second edition.
- Chapter 8 ("Least-squares fitting") of Taylor second edition.
In order to leave the class with this confidence (knowing you can do it and where to find material to refresh your memory), you'll need to practice the techniques during your labs! There are plenty of labs (in fact a majority of them) where least-squares fitting to a line can and should be implemented.
Theoretical background
Assumptions
It is beyond the scope of this class to describe the methods with the fewest possible assumptions. For example, you can do least-squares fitting when uncertainties in both x and y are important, but here we'll assume uncertainty only in y. We're also only talking about a linear fit ([math]\displaystyle{ y = A + Bx }[/math])...extension to quadratic and higher-order fits is not too difficult, but we're not doing that here.
- Assume that the data should follow a linear relationship. You can assess this assumption by examining the residuals of the best-fit line (see the sketch after this list).
- Assume that the uncertainty in each [math]\displaystyle{ y_i }[/math] is normally distributed, with a standard deviation of [math]\displaystyle{ \sigma _i }[/math].
- Sometimes, for clarity, we'll assume that there is one common σ for all data points...and many of the built-in algorithms make this assumption. (If your algorithm in MATLAB or Excel does not ask you for an array of uncertainties, then you know it's assuming a fixed uncertainty!)
- If your [math]\displaystyle{ y_i }[/math] are each the mean of a bunch of independent measurements drawn from a constant parent distribution, then the central limit theorem says this mean will be normally distributed.
- If your [math]\displaystyle{ y_i }[/math] are single measurements, then a normal distribution may still be valid...provided central limit theorem "version 2" applies: that the error in each [math]\displaystyle{ y_i }[/math] results from the accumulation of many independent sources of random error.
- If your [math]\displaystyle{ y_i }[/math] measurements arise from processing another variable that has normally distributed error, then you may need to challenge this assumption.
- Assume the principle of maximum likelihood is valid.
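As a concrete way to check the linearity assumption above, here is a minimal sketch in Python (assuming numpy is available; the data arrays here are hypothetical placeholders, not from the original page). It fits a line and prints the residuals:

```python
import numpy as np

# Hypothetical data arrays; substitute your own (x_i, y_i) measurements.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

# Quick unweighted fit (constant-sigma assumption) just to get residuals.
B, A = np.polyfit(x, y, 1)  # polyfit returns highest power first: slope, then intercept
residuals = y - (A + B * x)

# If the linear model is valid, the residuals should scatter randomly
# about zero with no systematic trend in x.
for xi, ri in zip(x, residuals):
    print(f"x = {xi:3.1f}   residual = {ri:+.3f}")
```

A systematic bow or trend in the residuals, rather than random scatter, suggests the linear model (or the uncertainty assumptions) needs a second look.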
Derivation
See the Bevington or Taylor books for derivations. For the special case of a fixed σ for all [math]\displaystyle{ y_i }[/math], you can see the derivation here.
Formulas for the best-fit (maximum-likelihood) parameters
- [math]\displaystyle{ y = A + Bx }[/math]
General case, individual [math]\displaystyle{ \sigma_i }[/math]
- [math]\displaystyle{ A=\frac{\sum \frac{x_i^2}{\sigma_i^2} \sum \frac{y_i}{\sigma_i^2} - \sum \frac{x_i}{\sigma_i^2} \sum \frac{x_i y_i}{\sigma_i^2}}{\Delta} }[/math] [math]\displaystyle{ \mbox{,}~~~~~~~~~~~~~~~~~ \sigma_A^2 = \frac{1}{\Delta} \sum \frac {x_i^2}{\sigma_i^2} }[/math]
- [math]\displaystyle{ B=\frac{\sum \frac{1}{\sigma_i^2} \sum \frac{x_i y_i}{\sigma_i^2} - \sum \frac{x_i}{\sigma_i^2} \sum \frac{y_i}{\sigma_i^2}}{\Delta} }[/math] [math]\displaystyle{ \mbox{,}~~~~~~~~~~~~~~~~~ \sigma_B^2 = \frac{1}{\Delta} \sum \frac{1}{\sigma_i^2} }[/math]
- [math]\displaystyle{ \Delta=\sum \frac{1}{\sigma_i^2} \sum \frac{x_i^2}{\sigma_i^2} - \left (\sum \frac{x_i}{\sigma_i^2} \right)^2 }[/math]
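To make these formulas concrete, here is a minimal sketch of the general-case fit in Python with numpy (the function name weighted_line_fit is my own for illustration, not from any library):

```python
import numpy as np

def weighted_line_fit(x, y, sigma):
    """Maximum-likelihood A, B (and their uncertainties) for y = A + B*x,
    given an individual uncertainty sigma_i for each y_i."""
    w = 1.0 / sigma**2                 # weights 1/sigma_i^2
    S = w.sum()                        # sum of 1/sigma_i^2
    Sx = (w * x).sum()                 # sum of x_i/sigma_i^2
    Sy = (w * y).sum()                 # sum of y_i/sigma_i^2
    Sxx = (w * x**2).sum()             # sum of x_i^2/sigma_i^2
    Sxy = (w * x * y).sum()            # sum of x_i*y_i/sigma_i^2
    Delta = S * Sxx - Sx**2
    A = (Sxx * Sy - Sx * Sxy) / Delta  # intercept
    B = (S * Sxy - Sx * Sy) / Delta    # slope
    sigma_A = np.sqrt(Sxx / Delta)
    sigma_B = np.sqrt(S / Delta)
    return A, B, sigma_A, sigma_B
```

Each line maps directly onto one of the sums in the formulas above, so you can check the implementation term by term.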
Special case, constant [math]\displaystyle{ \sigma_y }[/math] (note: [math]\displaystyle{ \Delta_{fixed} }[/math] has different units)
- [math]\displaystyle{ A=\frac{\sum x_i^2 \sum y_i - \sum x_i \sum x_i y_i}{\Delta_{fixed}} }[/math] [math]\displaystyle{ \mbox{,}~~~~~~~~~~~~~~~~~ \sigma_A^2 = \frac{\sigma_y^2}{\Delta_{fixed}} \sum x_i^2 }[/math]
- [math]\displaystyle{ B=\frac{N\sum x_i y_i - \sum x_i \sum y_i}{\Delta_{fixed}} }[/math] [math]\displaystyle{ \mbox{,}~~~~~~~~~~~~~~~~~ \sigma_B^2 = N \frac{\sigma_y^2}{\Delta_{fixed}} }[/math] (Since [math]\displaystyle{ \Delta_{fixed} = N^2 \sigma_x^2 }[/math], as noted below, this is the same as [math]\displaystyle{ \frac{1}{N} \frac{\sigma_y^2}{\sigma_x^2} }[/math], where [math]\displaystyle{ \sigma_x^2 }[/math] is the population variance of the experimental x values, not the variance of an individual x measurement.)
- [math]\displaystyle{ \Delta_{fixed}=N \sum x_i^2 - \left ( \sum x_i \right )^2 }[/math] (This is actually [math]\displaystyle{ N^2 }[/math] times the population variance of the x values...not sure if that helps in any kind of understanding, though.)
[math]\displaystyle{ \sigma_y }[/math] can be inferred from the chi-squared value and the number of degrees of freedom. If you have an independent estimate of [math]\displaystyle{ \sigma_y }[/math] (e.g., the SEM of several independent measurements), it should be consistent with the inferred uncertainty. LINEST (Excel's linear-fitting function) uses the implied value when calculating the uncertainty of the fit parameters.
- [math]\displaystyle{ \sigma_{y~implied}^2 = \frac {1}{N-2} \sum (y_i - A - Bx_i)^2 }[/math] [math]\displaystyle{ \mbox{,}~~~~~~~\mbox{recall,}~~ \chi^2 = \frac {1}{\sigma_y^2} \sum (y_i - A - Bx_i)^2 }[/math]
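Here is the corresponding sketch for the constant-σ special case, again in Python with numpy (the function name is hypothetical). It infers [math]\displaystyle{ \sigma_y }[/math] from the residuals exactly as in the formula above, which is also what LINEST does:

```python
import numpy as np

def line_fit_constant_sigma(x, y):
    """Fit y = A + B*x assuming one common sigma_y for all points,
    inferred from the scatter of the residuals."""
    N = len(x)
    Sx, Sy = x.sum(), y.sum()
    Sxx, Sxy = (x**2).sum(), (x * y).sum()
    Delta = N * Sxx - Sx**2                    # Delta_fixed in the formulas above
    A = (Sxx * Sy - Sx * Sxy) / Delta          # intercept
    B = (N * Sxy - Sx * Sy) / Delta            # slope
    # Implied sigma_y^2 from the residuals; N - 2 degrees of freedom
    # because two parameters (A and B) were fit.
    sigma_y2 = ((y - A - B * x) ** 2).sum() / (N - 2)
    sigma_A = np.sqrt(sigma_y2 * Sxx / Delta)
    sigma_B = np.sqrt(N * sigma_y2 / Delta)
    return A, B, sigma_A, sigma_B
```

Comparing sigma_A and sigma_B from this routine against LINEST's output on the same data is a good consistency check.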
Example Excel Sheet
I have used the values from Table 6.1 in Chapter 6 of Bevington, 2nd edition. This is the Excel sheet I showed in class on November 3, 2008. I've tried to make it a little easier to read, but the sheet will not make any sense to you if you don't look at the formulas on this page and play around with the numbers a bit.
Notes
- Steve Koch 16:37, 9 November 2008 (EST): Thanks to Justin for finding a typo in the implied sigma!