Wilke:Information theory

Concepts from information theory, such as entropy and mutual information, can provide useful insight in bioinformatics applications. One downside of these quantities is that they tend to be difficult to estimate without bias. For example, the naive estimator for entropy, $$H=-\sum_i (n_i/N) \ln(n_i/N),$$ where $$n_i$$ is the number of observations of type i and N is the total number of observations, tends to underestimate the true entropy for small samples.
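To see this bias concretely, here is a small Python sketch (the function and variable names are illustrative, not part of the code further down) that averages the naive estimate over many small samples drawn from a fair eight-sided die, whose true entropy is $$\ln 8 \approx 2.08$$:

```python
import math
import random

def naive_entropy(counts):
    """Plug-in estimate H = -sum_i (n_i/N) ln(n_i/N)."""
    N = sum(counts)
    return -sum((n / N) * math.log(n / N) for n in counts if n > 0)

random.seed(0)
k = 8                      # number of observation types (a fair 8-sided die)
true_H = math.log(k)       # true entropy, ln(8) ~= 2.079

# Average the naive estimate over many small samples of size 10.
trials = 2000
sample_size = 10
total = 0.0
for _ in range(trials):
    counts = [0] * k
    for _ in range(sample_size):
        counts[random.randrange(k)] += 1
    total += naive_entropy(counts)
avg_H = total / trials

print("true entropy:       ", true_H)
print("mean naive estimate:", avg_H)  # systematically below true_H
```

The averaged naive estimate comes out well below $$\ln 8$$, which is exactly the small-sample bias the estimators below try to correct.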

However, estimators of entropy are by now relatively well understood, and a number of different estimators are available to choose from. The following Python code implements four entropy estimators discussed in Grassberger 2003, "Entropy Estimates from Insufficient Samplings," arXiv:physics/0307138.


#!/usr/bin/env python

import math                          # for the log function
from scipy.special import polygamma  # needed only by Entropy_H_psi(z)

EulerGamma = 0.57721566490153286060  # Euler-Mascheroni constant

def Gi( n ):
    '''Helper function needed for entropy estimation.
    Defined by Grassberger 2003, http://arxiv.org/abs/physics/0307138'''
    if n == 0:
        return 0
    if n == 1:
        return -EulerGamma - math.log(2)
    if n == 2:
        return 2 - EulerGamma - math.log(2)
    if (n % 2) == 1:
        return Gi( n-1 )
    return Gi( n-2 ) + 2./(n-1)

def Entropy_H_naive( z ):
    '''Naive entropy estimator. Generally biased; should be avoided.
    z is a list of integer numbers, e.g., z=[1,4,3,2,2,6]. It counts the
    number of times a particular observation has been made.'''
    N = sum(z)  # total number of observations
    return math.log( N ) - (1./N) * sum([n*math.log(n) for n in z])

def Entropy_H_Miller( z ):
    '''Miller entropy estimator. Simplest correction to the naive estimator.
    z is a list of integer numbers, e.g., z=[1,4,3,2,2,6]. It counts the
    number of times a particular observation has been made.'''
    N = sum(z)  # total number of observations
    return math.log( N ) - (1./N) * sum([n*(math.log(n)-0.5/n) for n in z])

def Entropy_H_psi( z ):
    '''Entropy estimator according to Grassberger 1988. Better than the
    Miller estimator. Based on the digamma function psi.
    z is a list of integer numbers, e.g., z=[1,4,3,2,2,6]. It counts the
    number of times a particular observation has been made.'''
    N = sum(z)  # total number of observations
    return math.log( N ) - (1./N) * \
        sum([n*(polygamma(0,n)+(-1.)**n/(n*(n+1))) for n in z])

def Entropy_H_G( z ):
    '''Best entropy estimator according to Grassberger 2003,
    http://arxiv.org/abs/physics/0307138
    z is a list of integer numbers, e.g., z=[1,4,3,2,2,6]. It counts the
    number of times a particular observation has been made.'''
    N = sum(z)  # total number of observations
    return math.log( N ) - (1./N) * sum([n*Gi(n) for n in z])

# code to test the four functions
z = [1,1,1,1,1,1,1,2,2,2,3,3,5,6,7,8,10,24,147]
print("z =", z)
print("Entropy estimates: %.12g %.12g %.12g %.12g"
      % (Entropy_H_naive( z ), Entropy_H_Miller( z ),
         Entropy_H_psi( z ), Entropy_H_G( z )))

Example output:

 z = [1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 3, 3, 5, 6, 7, 8, 10, 24, 147]
 Entropy estimates: 1.47054411571 1.51257951394 1.52893481177 1.5321868862