To measure the difference between two probability distributions over the same variable x, a measure, called the Kullback-Leibler divergence, or simply, the KL divergence has been popularly used in the data mining literature. The concept was originated in probability theory and information theory.
The KL divergence, which is closely related to relative entropy, information divergence, and information for discrimination, is a non-symmetric measure of the difference between two probability distributions p(x) and q(x). Specifically, the Kullback-Leibler (KL) divergence of q(x) from p(x), denoted DKL(p(x), q(x)), is a measure of the information lost when q(x) is used to approximate p(x).
Let p(x) and q(x) are two probability distributions of a discrete random variable x. That is, both p(x) and q(x) sum up to 1, and p(x) > 0 and q(x) > 0 for any x in X.
The KL divergence measures the expected number of extra bits required to code samples from p(x) when using a code based on q(x), rather than using a code based on p(x). Typically p(x) represents the “true” distribution of data, observations, or a precisely calculated theoretical distribution. The measure q(x) typically represents a theory, model, description, or approximation of p(x).
The continuous version of the KL divergence is:
Although the KL divergence measures the “distance” between two distributions, it is not a distance measure. This is because the KL divergence is not a metric measure. It is not symmetric: the KL from p(x) to q(x) is generally not the same as the KL from q(x) to p(x). Furthermore, it need not satisfy triangular inequality. Nevertheless, DKL(P||Q) is a non-negative measure. DKL(P||Q) ≥ 0 and DKL(P||Q) = 0 if and only if P = Q.
The Jensen-Shannon divergence, or JS divergence for short, is another way to quantify the difference (or similarity) between two probability distributions.
It uses the KL divergence to calculate a normalized score that is symmetrical. This means that the divergence of P from Q is the same as Q from P, or stated formally:
- JS(P || Q) == JS(Q || P)
The JS divergence can be calculated as follows:
- JS(P || Q) = 1/2 * KL(P || M) + 1/2 * KL(Q || M)
Where M is calculated as:
- M = 1/2 * (P + Q)
And KL() is calculated as the KL divergence described in the previous section.
It is more useful as a measure as it provides a smoothed and normalized version of KL divergence, with scores between 0 (identical) and 1 (maximally different), when using the base-2 logarithm.
The square root of the score gives a quantity referred to as the Jensen Shannon distance or JS distance for short.