Information Theory — Entropy

Jan 26, 2022

Information Theory

A cornerstone of information theory is the idea of quantifying how much information there is in a message. More generally, the same idea can be used to quantify the information in an event or in a random variable. The latter quantity is called entropy, and both are calculated using probability.
Calculating information and entropy is a useful tool in machine learning and is used as the basis for techniques such as feature selection, building decision trees, and, more generally, fitting classification models. As such, a machine learning practitioner requires a strong understanding and intuition for information and entropy.
Entropy provides a measure of the average amount of information needed to represent an event drawn from a probability distribution for a random variable.
The mathematician Claude Shannon had the insight that the more predictable some information is, the less space is required to store it.

Information theory is concerned with representing data in a compact fashion (a task known as data compression or source coding), as well as with transmitting and storing it in a way that is robust to errors (a task known as error correction or channel coding).

— Page 56, Machine Learning: A Probabilistic Perspective, 2012.

Measurements of information are widely used in artificial intelligence and machine learning, such as in the construction of decision trees and the optimization of classifier models.

Why unify information theory and machine learning? Because they are two sides of the same coin. […] Information theory and machine learning still belong together. Brains are the ultimate compression and communication systems. And the state-of-the-art algorithms for both data compression and error-correcting codes use the same tools as machine learning.

— Page v, Information Theory, Inference, and Learning Algorithms, 2003.

Calculate the Information for an Event

The intuition behind quantifying information is the idea of measuring how much surprise there is in an event. Those events that are rare (low probability) are more surprising and therefore have more information than those events that are common (high probability).

Low Probability Event: High Information (surprising).
High Probability Event: Low Information (unsurprising).

The basic intuition behind information theory is that learning that an unlikely event has occurred is more informative than learning that a likely event has occurred.

— Page 73, Deep Learning, 2016.

A skewed probability distribution has less “surprise” and in turn a low entropy because likely events dominate. A balanced distribution is more surprising and in turn has a higher entropy because events are equally likely.

Skewed Probability Distribution (unsurprising): Low entropy.
Balanced Probability Distribution (surprising): High entropy.

We can calculate the amount of information there is in an event using the probability of the event. This is called “Shannon information,” “self-information,” or simply the “information,” and can be calculated for a discrete event x as follows:

information(x) = -log( p(x) )

Where log() is the base-2 logarithm and p(x) is the probability of the event x.

The choice of the base-2 logarithm means that the units of the information measure are in bits (binary digits). This can be directly interpreted in the information processing sense as the number of bits required to represent the event.
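
As a minimal sketch of the formula above, the information for a few events can be computed in Python. The function name information() and the example probabilities are illustrative choices, not part of any library.

```python
from math import log2

def information(p):
    """Shannon information, in bits, of an event with probability p (0 < p <= 1)."""
    if not 0.0 < p <= 1.0:
        raise ValueError("p must be in (0, 1]")
    return -log2(p)

# A fair coin flip carries exactly 1 bit.
print(information(0.5))              # 1.0
# One face of a fair six-sided die is rarer, so it carries more information.
print(round(information(1 / 6), 3))  # 2.585
# A likely event carries very little information.
print(round(information(0.9), 3))    # 0.152
```

The rarer the event, the larger the value, which matches the low probability / high information intuition.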

Calculate the Entropy for a Random Variable

In effect, calculating the information for a random variable is the same as calculating the information for the probability distribution of the events for the random variable.
Calculating the information for a random variable is called “information entropy,” “Shannon entropy,” or simply “entropy.” It is related to the idea of entropy from physics by analogy, in that both are concerned with uncertainty.

… the Shannon entropy of a distribution is the expected amount of information in an event drawn from that distribution. It gives a lower bound on the number of bits […] needed on average to encode symbols drawn from a distribution P.

— Page 74, Deep Learning, 2016.

Entropy can be calculated for a random variable X with k in K discrete states as follows:

H(X) = -sum(each k in K p(k) * log(p(k)))

That is, the negative of the sum of the probability of each event multiplied by the log of the probability of each event.
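
The calculation can be sketched in a few lines of Python. The function name entropy() and the two example distributions are assumptions made for illustration. They are chosen to contrast a balanced distribution with a skewed one.

```python
from math import log2

def entropy(probs):
    """Shannon entropy, in bits, of a discrete distribution given as a list of probabilities."""
    # Events with probability 0 are skipped, following the convention 0 * log(0) = 0.
    return -sum(p * log2(p) for p in probs if p > 0)

balanced = [0.25, 0.25, 0.25, 0.25]  # four equally likely events
skewed = [0.97, 0.01, 0.01, 0.01]    # one event dominates

print(entropy(balanced))          # 2.0
print(round(entropy(skewed), 3))  # about 0.242
```

The balanced distribution needs 2 bits per event on average, while the skewed one needs far less, in line with the skewed / balanced intuition described earlier.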

Applications

Shannon’s work found uses in data storage, spacecraft communication, and even communication over the internet. Even if we are not working in any of those fields, KL divergence is an idea derived from Shannon’s work that is frequently used in data science. It tells you how well one distribution approximates another by measuring the average number of extra bits needed to encode events from the first distribution using a code built for the second.
Communication and storage of information are what have made humans great, and Shannon’s work revolutionized the way we do both in the digital age.
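
To make the KL divergence idea concrete, here is a small sketch using its standard definition, KL(P || Q) = sum over x of P(x) * log(P(x) / Q(x)). The function name kl_divergence() and the two example distributions are illustrative assumptions, and the code assumes Q(x) > 0 wherever P(x) > 0.

```python
from math import log2

def kl_divergence(p, q):
    """KL(P || Q) in bits for two discrete distributions over the same events."""
    # Assumes q[i] > 0 wherever p[i] > 0; terms with p[i] == 0 contribute nothing.
    return sum(pi * log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.10, 0.40, 0.50]
q = [0.80, 0.15, 0.05]

print(round(kl_divergence(p, q), 3))  # 1.927
print(round(kl_divergence(q, p), 3))  # 2.022, so the measure is not symmetric
print(kl_divergence(p, p))            # 0.0, a distribution matches itself exactly
```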

Cross-Entropy

Cross-entropy is a measure from the field of information theory, building upon entropy and generally calculating the difference between two probability distributions. It is closely related to but different from KL divergence: KL divergence calculates the relative entropy between two probability distributions, whereas cross-entropy can be thought of as calculating the total entropy between the distributions.
Cross-entropy is also related to and often confused with logistic loss, also called log loss. Although the two measures are derived from different sources, when used as loss functions for classification models, both measures calculate the same quantity and can be used interchangeably.

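If we consider a target or underlying probability distribution P and an approximation of the target distribution Q, then the cross-entropy of Q from P is the average total number of bits needed to represent an event drawn from P when using a code optimized for Q instead of P. The additional bits over the entropy of P are what KL divergence measures.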
Cross-entropy can be calculated using the probabilities of the events from P and Q, as follows:

H(P, Q) = -sum(each x in X P(x) * log(Q(x)))

Where P(x) is the probability of the event x in P, Q(x) is the probability of event x in Q and log is the base-2 logarithm, meaning that the results are in bits. If the base-e or natural logarithm is used instead, the result will have the units called nats.
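
A minimal Python sketch of this calculation follows, using the standard definition in which each term is P(x) multiplied by the log of Q(x). The function names and the example distributions are illustrative assumptions. The code assumes Q(x) > 0 wherever P(x) > 0.

```python
from math import log2

def entropy(p):
    """Shannon entropy of P in bits."""
    return -sum(pi * log2(pi) for pi in p if pi > 0)

def cross_entropy(p, q):
    """H(P, Q) in bits: negative sum of P(x) * log2(Q(x)) over all events x."""
    # Assumes q[i] > 0 wherever p[i] > 0; terms with p[i] == 0 contribute nothing.
    return -sum(pi * log2(qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.10, 0.40, 0.50]  # target distribution P
q = [0.80, 0.15, 0.05]  # approximating distribution Q

h_p = entropy(p)
h_pq = cross_entropy(p, q)

print(round(h_p, 3))         # 1.361 bits: entropy of P
print(round(h_pq, 3))        # 3.288 bits: cross-entropy of Q from P
print(round(h_pq - h_p, 3))  # 1.927 bits: the extra cost, i.e. KL(P || Q)
```

The cross-entropy is never smaller than the entropy of P, and the two are equal only when Q matches P.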

Reference:

https://machinelearningmastery.com/