User:Andy Maloney/Notebook/Lab Notebook of Andy Maloney/2009/12/29

From OpenWetWare
Lab notebook of Andy Maloney


Median filtering

Finally back in action. Today I learned a little bit about median filtering and how we use it for our microtubule tracking software. It's pretty amazing and I'm super excited about learning how it works.

The best way to think about it is to just look at an example. If this is my data,

[math]\displaystyle{ \text{Data} = \{2,5,5,6,90,4,3,3\} }[/math]

then you can see that there is an outlier point at 90. Keeping the datum at 90 could cause problems in analysis, so we need to get rid of it. Of course, you should never get rid of data unless you are absolutely positive it should not be there. The best way to be sure that this is not a good data point is to repeat the experiment several times, so that you are justified in getting rid of it.

Okay, so the next thing we do is choose a window size that is an odd number, so 3, 5, 7, ... are acceptable. We then scan this window over our data and get subsets of the original data. In this case, with a window size of 3, I have that

[math]\displaystyle{ \begin{align} y_1 &= \{2,5,5\}\\ y_2 &= \{5,5,6\}\\ y_3 &= \{5,6,90\}\\ y_4 &= \{6,90,4\}\\ y_5 &= \{90,4,3\}\\ y_6 &= \{4,3,3\}. \end{align} }[/math]

I then take the median of each of these new subsets,

[math]\displaystyle{ \begin{align} y_1 &\rightarrow 5\\ y_2 &\rightarrow 5\\ y_3 &\rightarrow 6\\ y_4 &\rightarrow 6\\ y_5 &\rightarrow 4\\ y_6 &\rightarrow 3\\ \text{Median data} &= \{5,5,6,6,4,3\}. \end{align} }[/math]
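The windowing and median steps above can be sketched in a few lines of Python. This is only an illustration of the idea, not the tracking software itself; the function name `sliding_medians` is made up for this example.

```python
from statistics import median

def sliding_medians(data, window=3):
    """Median of every full window; no padding, so the output is shorter."""
    if window % 2 == 0:
        raise ValueError("window size must be odd")
    # One window starts at each index that still leaves room for a full window.
    return [median(data[i:i + window]) for i in range(len(data) - window + 1)]

data = [2, 5, 5, 6, 90, 4, 3, 3]
print(sliding_medians(data, window=3))  # [5, 5, 6, 6, 4, 3]
```

With 8 data points and a window of 3 there are only 6 full windows, which is why the result is two points shorter than the input.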

Now, it is important to note that the length of my original data is 8 while the length of my median filtered data is 6. To reconcile the length difference, the original data needs to be padded. In the above case, it needs to be padded by two numbers, one at each end, so that you get two more median filtered subsets. There are two different approaches one can take when padding. You can pad the data with zeros, or you can do what Koch called circularizing the data and pad each end with a copy of its end point. In this case, the end points of my original data set are 2 and 3. So, padding my original data by adding an extra 2 at the beginning and an extra 3 at the end gives me these two extra subsets,

[math]\displaystyle{ \begin{align} y_1^p &= \{2,2,5\} \rightarrow 2\\ y_2^p &= \{3,3,3\} \rightarrow 3\\ \text{Padded median data} &= \{2,5,5,6,6,4,3,3\} \end{align} }[/math]

Now my padded median data is the same length as my original data. It turns out that padding with the end-point values, rather than zeros, is important, as I will discuss later.
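Here is a minimal Python sketch of the end-point padding described above, assuming the pad on each side is half the window size (rounded down). The name `median_filter_edge` is hypothetical and this is not the LabVIEW code we actually use.

```python
from statistics import median

def median_filter_edge(data, window=3):
    """Median filter that pads each end by repeating the end-point value."""
    if window % 2 == 0:
        raise ValueError("window size must be odd")
    if not data:
        return []
    half = window // 2
    # Repeat the first value on the left and the last value on the right.
    padded = [data[0]] * half + list(data) + [data[-1]] * half
    return [median(padded[i:i + window]) for i in range(len(data))]

data = [2, 5, 5, 6, 90, 4, 3, 3]
print(median_filter_edge(data, window=3))  # [2, 5, 5, 6, 6, 4, 3, 3]
```

The output has the same length as the input because every original point now sits at the center of a full window.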

Okay, the whole point of going through a median filter is to get rid of the outlier point 90, which, as you can see, it did. But this is not where the tracking program finishes. Next, the average and standard deviation of the median filtered data are taken. We then use these values (average = 4.25, sample standard deviation = 1.49) to reduce the original data. Keeping only the points within 2 standard deviations of 4.25, that is, between about 1.27 and 7.23, retains all the original data points except the 90 datum. This is good because that is the one we want to get rid of, and it is done in an automated fashion.

This then gives the reduced data as:

[math]\displaystyle{ \text{Reduced data} = \{2,5,5,6,4,3,3\}. }[/math]

We can then find the average and standard deviation of this reduced data set, which is the result we are interested in.
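The whole reduction can be sketched in Python as follows. This is a rough sketch under a few assumptions: the standard deviation is the sample one (dividing by n - 1, which reproduces the 1.49 above), the cut is 2 standard deviations, and the names `median_filter_edge` and `reject_outliers` are invented for this example.

```python
from statistics import mean, median, stdev

def median_filter_edge(data, window=3):
    """Median filter that pads each end by repeating the end-point value."""
    half = window // 2
    padded = [data[0]] * half + list(data) + [data[-1]] * half
    return [median(padded[i:i + window]) for i in range(len(data))]

def reject_outliers(data, window=3, n_sigma=2.0):
    """Keep original points within n_sigma std of the filtered data's mean."""
    filtered = median_filter_edge(data, window)
    center = mean(filtered)
    spread = stdev(filtered)  # sample standard deviation (n - 1)
    kept = [x for x in data if abs(x - center) <= n_sigma * spread]
    return kept, center, spread

data = [2, 5, 5, 6, 90, 4, 3, 3]
kept, center, spread = reject_outliers(data)
print(kept)                                  # [2, 5, 5, 6, 4, 3, 3]
print(round(center, 2), round(spread, 2))    # 4.25 1.49
print(mean(kept), stdev(kept))               # statistics of the reduced data
```

Note that the threshold comes from the median filtered data, but the points that are kept come from the original data.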

LabVIEW pads the data for median filtering with zeros. This is not desirable. Koch and I ran a simulation comparing LabVIEW's median filtering with the circularized data median filtering. If you use an incredibly large window to median filter, LabVIEW will filter your data all the way to zero, while the circularized data median filter will filter the data to the average of the reduced data. This is definitely something to take note of.
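The zero-padding problem is easy to see in a toy Python comparison. The `"zero"` mode below is only a stand-in for zero padding in general; it is an assumption for illustration and not LabVIEW's actual implementation, and the function name `median_filter` is made up.

```python
from statistics import median

def median_filter(data, window, pad="edge"):
    """Median filter with either zero padding or end-point padding."""
    half = window // 2
    if pad == "zero":
        left, right = [0] * half, [0] * half
    else:
        left, right = [data[0]] * half, [data[-1]] * half
    padded = left + list(data) + right
    return [median(padded[i:i + window]) for i in range(len(data))]

data = [2, 5, 5, 6, 90, 4, 3, 3]
for window in (3, 9, 17):
    print(window,
          median_filter(data, window, pad="zero"),
          median_filter(data, window, pad="edge"))
```

Once the window is wide enough that more than half of every window is padding (17 for these 8 points), the zero-padded output is all zeros, while the end-point padded output stays within the range of the original data.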

Window size

Another issue is of window size. If I take my original data,

[math]\displaystyle{ \text{Data} = \{2,5,5,6,90,4,3,3\} }[/math]

and median filter it with a window size of 5 and circular data padding, I get

[math]\displaystyle{ \begin{align} y_1^p &= \{2,2,2,5,5\} \rightarrow 2\\ y_2^p &= \{2,2,5,5,6\} \rightarrow 5\\ y_1 &= \{2,5,5,6,90\} \rightarrow 5\\ y_2 &= \{5,5,6,90,4\} \rightarrow 5\\ y_3 &= \{5,6,90,4,3\} \rightarrow 5\\ y_4 &= \{6,90,4,3,3\} \rightarrow 4\\ y_3^p &= \{90,4,3,3,3\} \rightarrow 3\\ y_4^p &= \{4,3,3,3,3\} \rightarrow 3. \end{align} }[/math]

This gives a median filtered data set of

[math]\displaystyle{ \text{Median data} = \{2,5,5,5,5,4,3,3\}. }[/math]

This new data set has an average of 4 and a standard deviation of 1.20. I'm still able to reduce the original data set to {2,5,5,6,4,3,3} by using 2 standard deviations from the average of the median filtered data set. However, the window size of 3 gave an average of 4.25 and a standard deviation of 1.49, so which one do you use? There has to be some way of normalizing the median filtered data so that the window size doesn't matter.
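To see how the numbers shift with window size, here is a small Python sketch that repeats the calculation for several windows. It assumes end-point padding, the sample standard deviation, and a 2 standard deviation cut; the helper names are hypothetical.

```python
from statistics import mean, median, stdev

def median_filter_edge(data, window):
    """Median filter that pads each end by repeating the end-point value."""
    half = window // 2
    padded = [data[0]] * half + list(data) + [data[-1]] * half
    return [median(padded[i:i + window]) for i in range(len(data))]

def summarize(data, window, n_sigma=2.0):
    """Mean and std of the filtered data, plus the original points kept."""
    filtered = median_filter_edge(data, window)
    center, spread = mean(filtered), stdev(filtered)
    kept = [x for x in data if abs(x - center) <= n_sigma * spread]
    return center, spread, kept

data = [2, 5, 5, 6, 90, 4, 3, 3]
for window in (3, 5, 7):
    center, spread, kept = summarize(data, window)
    print(f"window={window} mean={center:.2f} std={spread:.2f} kept={kept}")
```

For windows of 3 and 5 the kept points are the same even though the average and standard deviation differ, which is exactly the ambiguity in the question above.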