Wilke:Information theory

Notice: The Wilke Lab page has moved to http://wilkelab.org.
The page you are looking at is kept for archival purposes and will not be further updated.
THE WILKE LAB

Concepts from information theory, such as entropy and mutual information, can provide useful insight in bioinformatics applications. One downside of these quantities is that they tend to be difficult to estimate without bias. For example, the naive estimator for entropy, $H=-(n_i/N) \sum_i \ln(n_i/N),$where $n_i$ are the counts of observations of type i and N is the total number of observations, tends to underestimate the true entropy for small samples.

However, estimators of entropy are by now relatively well understood, and we have a number of different estimators to choose from. The following python code implements four entropy estimators discussed in Grassberger 2003, Entropy Estimates from Insufficient Samplings, arXiv:physics/0307138.

#!/usr/bin/python

import math # for log function
from scipy.special.basic import * # for polygamma function, needed only by Entropy_H_psi(z).

EulerGamma = 0.57721566490153286060 # Euler Gamma constant

def Gi( n ):
'''Helper function needed for entropy estimation.
Defined by Grassberger 2003. http://arxiv.org/abs/physics/0307138
'''
if n == 0:
return 0
if n == 1:
return -EulerGamma-math.log(2)
if n == 2:
return 2-EulerGamma-math.log(2)
if (n % 2) == 1:
return Gi( n-1 )
return Gi( n-2 ) + 2./(n-1)

def Entropy_H_naive( z ):
'''Naive entropy estimator. Generally biased. Should be avoided.

z is a list of integer numbers, e.g., z=[1,4,3,2,2,6].
It counts the number of times a particular observation has
'''
N = sum(z) # total number of observations
return math.log( N ) - (1./N) * sum([n*math.log(n) for n in z])

def Entropy_H_Miller( z ):
'''Miller entropy estimator. Simplest correction to the
naive estimator.

z is a list of integer numbers, e.g., z=[1,4,3,2,2,6]. It counts
the number of times a particular observation has been made.
'''
N = sum(z) # total number of observations
return math.log( N ) - (1./N) * sum([n*(math.log(n)-0.5/n) for n in z])

def Entropy_H_psi( z ):
'''Entropy estimator according to Grassberger 1988.
Better then Miller estimator. Based on the digamma function psi.

z is a list of integer numbers, e.g., z=[1,4,3,2,2,6].
It counts the number of times a particular observation has
'''
N = sum(z) # total number of observations
return math.log( N ) - (1./N) * \
sum([n*(polygamma(0,n)+(-1.)**n/(n*(n+1))) for n in z])

def Entropy_H_G( z ):
'''Best entropy estimator according to Grassberger 2003,
http://arxiv.org/abs/physics/0307138

z is a list of integer numbers, e.g., z=[1,4,3,2,2,6]. It counts
the number of times a particular observation has been made.
'''
N = sum(z) # total number of observations
return math.log( N ) - (1./N) * sum([n*Gi(n) for n in z])

# ***********************************************
#         code to test the four functions
# ***********************************************
z = [1,1,1,1,1,1,1,2,2,2,3,3,5,6,7,8,10,24,147]
print "z =", z
print "Entropy estimates:", Entropy_H_naive( z ), Entropy_H_Miller( z ),\
Entropy_H_psi( z ), Entropy_H_G( z )


Example output:

z = [1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 3, 3, 5, 6, 7, 8, 10, 24, 147]
Entropy estimates: 1.47054411571 1.51257951394 1.52893481177 1.5321868862