In probability theory and information theory, the variation of information or shared information distance is a measure of the distance between two clusterings (partitions of elements). It is closely related to mutual information; indeed, it is a simple linear expression involving the mutual information. Unlike the mutual information, however, the variation of information is a true metric, in that it obeys the triangle inequality.[1][2][3]
Suppose we have two partitions [math]\displaystyle{ X }[/math] and [math]\displaystyle{ Y }[/math] of a set [math]\displaystyle{ A }[/math] into disjoint subsets, namely [math]\displaystyle{ X = \{X_{1}, X_{2}, \ldots, X_{k}\} }[/math] and [math]\displaystyle{ Y = \{Y_{1}, Y_{2}, \ldots, Y_{l}\} }[/math].
Let:
[math]\displaystyle{ n=\sum_{i}|X_{i}|=\sum_{j}|Y_{j}|=|A|, \qquad p_{i}=|X_{i}|/n, \qquad q_{j}=|Y_{j}|/n, \qquad r_{ij}=|X_{i}\cap Y_{j}|/n. }[/math]
Then the variation of information between the two partitions is:
[math]\displaystyle{ \mathrm{VI}(X;Y)\,=\,-\sum_{i,j} r_{ij}\left[\log\frac{r_{ij}}{p_{i}}+\log\frac{r_{ij}}{q_{j}}\right]. }[/math]
This is equivalent to the shared information distance between the random variables i and j with respect to the uniform probability measure on [math]\displaystyle{ A }[/math] defined by [math]\displaystyle{ \mu(B):=|B|/n }[/math] for [math]\displaystyle{ B\subseteq A }[/math].
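As a concrete illustration, the following is a minimal Python sketch of this definition, assuming each partition is represented as a list of disjoint sets over the same elements; the function name and the toy partitions are illustrative, not part of the original formulation.

```python
from math import log

def variation_of_information(X, Y):
    """VI(X; Y) for two partitions given as lists of disjoint sets,
    using p_i = |X_i|/n, q_j = |Y_j|/n and r_ij = |X_i ∩ Y_j|/n."""
    n = sum(len(block) for block in X)  # number of elements |A|
    vi = 0.0
    for Xi in X:
        p = len(Xi) / n
        for Yj in Y:
            r = len(Xi & Yj) / n
            if r > 0.0:  # empty intersections contribute nothing
                q = len(Yj) / n
                vi -= r * (log(r / p) + log(r / q))
    return vi

# Example: two partitions of {1, ..., 5}
X = [{1, 2, 3}, {4, 5}]
Y = [{1, 2}, {3, 4, 5}]
print(variation_of_information(X, Y))  # about 0.683 (natural logarithm)
```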
We can rewrite this definition in terms that explicitly highlight the information content of this metric.
The set of all partitions of a set forms a compact lattice in which the partial order induces two operations, the meet [math]\displaystyle{ \wedge }[/math] and the join [math]\displaystyle{ \vee }[/math], where the maximum [math]\displaystyle{ \overline{\mathrm{1}} }[/math] is the partition with only one block, i.e., all elements grouped together, and the minimum is [math]\displaystyle{ \overline{\mathrm{0}} }[/math], the partition consisting of all elements as singletons. The meet of two partitions [math]\displaystyle{ X }[/math] and [math]\displaystyle{ Y }[/math] is easy to understand as the partition formed by all pairwise intersections of one block [math]\displaystyle{ X_{i} }[/math] of [math]\displaystyle{ X }[/math] and one block [math]\displaystyle{ Y_{j} }[/math] of [math]\displaystyle{ Y }[/math]. It then follows that [math]\displaystyle{ X\wedge Y\subseteq X }[/math] and [math]\displaystyle{ X\wedge Y\subseteq Y }[/math].
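With the same list-of-sets representation as in the sketch above, the meet can be computed as the collection of non-empty pairwise block intersections; the helper name below is illustrative.

```python
def meet(X, Y):
    """Meet (common refinement) of two partitions given as lists of sets:
    all non-empty pairwise intersections X_i ∩ Y_j."""
    return [Xi & Yj for Xi in X for Yj in Y if Xi & Yj]

X = [{1, 2, 3}, {4, 5}]
Y = [{1, 2}, {3, 4, 5}]
print(meet(X, Y))  # [{1, 2}, {3}, {4, 5}], which refines both X and Y
```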
Let's define the entropy of a partition [math]\displaystyle{ X }[/math] as
[math]\displaystyle{ H(X)\,=\,-\sum_{i} p_{i}\log p_{i}, }[/math]
where [math]\displaystyle{ p_{i}=|X_i|/n }[/math]. Clearly, [math]\displaystyle{ H(\overline{\mathrm{1}})=0 }[/math] and [math]\displaystyle{ H(\overline{\mathrm{0}})=\log\,n }[/math]. The entropy of a partition is a monotone function on the lattice of partitions, in the sense that [math]\displaystyle{ X\subseteq Y\Rightarrow H(X) \geq H(Y) }[/math].
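A small Python sketch of this partition entropy, again with a partition given as a list of disjoint sets (the helper name is illustrative), also confirms the two extreme values on a four-element set.

```python
from math import log

def partition_entropy(X):
    """H(X) = -sum_i p_i log p_i with p_i = |X_i|/n, for a partition
    given as a list of disjoint sets."""
    n = sum(len(block) for block in X)
    return -sum((len(b) / n) * log(len(b) / n) for b in X)

A = {1, 2, 3, 4}
print(partition_entropy([A]))               # maximum (one block): 0.0
print(partition_entropy([{a} for a in A]))  # minimum (singletons): log 4, about 1.386
```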
Then the VI distance between [math]\displaystyle{ X }[/math] and [math]\displaystyle{ Y }[/math] is given by
[math]\displaystyle{ \mathrm{VI}(X,Y)\,=\,2H(X\wedge Y)-H(X)-H(Y). }[/math]
The difference [math]\displaystyle{ d(X,Y)\equiv |H\left(X\right)-H\left(Y\right)| }[/math] is a pseudo-metric, as [math]\displaystyle{ d(X,Y)=0 }[/math] doesn't necessarily imply that [math]\displaystyle{ X=Y }[/math]. From the definition of [math]\displaystyle{ \overline{\mathrm{1}} }[/math], it follows that [math]\displaystyle{ \mathrm{VI}(X,\overline{\mathrm{1}})\,=\,H\left(X\right) }[/math].
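The last identity can be read off the meet-based formula above, since [math]\displaystyle{ X\wedge\overline{\mathrm{1}}=X }[/math] and [math]\displaystyle{ H(\overline{\mathrm{1}})=0 }[/math]:
[math]\displaystyle{ \mathrm{VI}(X,\overline{\mathrm{1}})\,=\,2H(X\wedge\overline{\mathrm{1}})-H(X)-H(\overline{\mathrm{1}})\,=\,2H(X)-H(X)-0\,=\,H(X). }[/math]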
If in the Hasse diagram we draw an edge from every partition to the maximum [math]\displaystyle{ \overline{\mathrm{1}} }[/math] and assign it a weight equal to the VI distance between the given partition and [math]\displaystyle{ \overline{\mathrm{1}} }[/math], we can interpret the VI distance as basically an average of differences of edge weights to the maximum:
[math]\displaystyle{ \mathrm{VI}(X,Y)\,=\,|\mathrm{VI}(X,\overline{\mathrm{1}})-\mathrm{VI}(X\wedge Y,\overline{\mathrm{1}})|+|\mathrm{VI}(Y,\overline{\mathrm{1}})-\mathrm{VI}(X\wedge Y,\overline{\mathrm{1}})|\,=\,d(X,X\wedge Y)+d(Y,X\wedge Y). }[/math]
For [math]\displaystyle{ H(X) }[/math] as defined above, it holds that the joint information of two partitions coincides with the entropy of the meet:
[math]\displaystyle{ H(X,Y)\,=\,H(X\wedge Y), }[/math]
and we also have that [math]\displaystyle{ d(X,X\wedge Y)\,=\,H(X\wedge Y|X) }[/math] coincides with the conditional entropy of the meet (intersection) [math]\displaystyle{ X\wedge Y }[/math] relative to [math]\displaystyle{ X }[/math].
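Spelled out, this uses the previous identity [math]\displaystyle{ H(X,Y)=H(X\wedge Y) }[/math] together with the chain rule for conditional entropy:
[math]\displaystyle{ H(X\wedge Y|X)\,=\,H(X,Y)-H(X)\,=\,H(X\wedge Y)-H(X)\,=\,d(X,X\wedge Y), }[/math]
where the last step uses the monotonicity property [math]\displaystyle{ H(X\wedge Y)\geq H(X) }[/math].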
The variation of information satisfies
[math]\displaystyle{ \mathrm{VI}(X;Y)\,=\,H(X)+H(Y)-2I(X,Y), }[/math]
where [math]\displaystyle{ H(X) }[/math] is the entropy of [math]\displaystyle{ X }[/math], and [math]\displaystyle{ I(X, Y) }[/math] is the mutual information between [math]\displaystyle{ X }[/math] and [math]\displaystyle{ Y }[/math] with respect to the uniform probability measure on [math]\displaystyle{ A }[/math]. This can be rewritten as
[math]\displaystyle{ \mathrm{VI}(X;Y)\,=\,H(X,Y)-I(X,Y), }[/math]
where [math]\displaystyle{ H(X,Y) }[/math] is the joint entropy of [math]\displaystyle{ X }[/math] and [math]\displaystyle{ Y }[/math], or
[math]\displaystyle{ \mathrm{VI}(X;Y)\,=\,H(X|Y)+H(Y|X), }[/math]
where [math]\displaystyle{ H(X|Y) }[/math] and [math]\displaystyle{ H(Y|X) }[/math] are the respective conditional entropies.
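The three expressions can be checked numerically; the following self-contained Python sketch does so on a small example (the toy partitions and the helper H are illustrative).

```python
from math import log

# Two toy partitions of {1, ..., 5}, given as lists of blocks (sets).
X = [{1, 2, 3}, {4, 5}]
Y = [{1, 2}, {3, 4, 5}]
n = 5

def H(blocks):
    """Entropy of a partition, with block probabilities |B|/n."""
    return -sum((len(b) / n) * log(len(b) / n) for b in blocks)

common = [Xi & Yj for Xi in X for Yj in Y if Xi & Yj]  # the meet X ∧ Y
Hx, Hy, Hxy = H(X), H(Y), H(common)                    # H(X,Y) equals H(X ∧ Y)
I = Hx + Hy - Hxy                                      # mutual information I(X, Y)

# H(X)+H(Y)-2I,  H(X,Y)-I,  and  H(X|Y)+H(Y|X) all give the same value.
print(Hx + Hy - 2 * I, Hxy - I, (Hxy - Hy) + (Hxy - Hx))
```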
The variation of information can also be bounded, either in terms of the number of elements:
[math]\displaystyle{ \mathrm{VI}(X;Y)\leq \log(n), }[/math]
or with respect to a maximum number of clusters, [math]\displaystyle{ K^* }[/math]:
[math]\displaystyle{ \mathrm{VI}(X;Y)\leq 2\log(K^*). }[/math]
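Both bounds can be seen from the identities above: [math]\displaystyle{ \mathrm{VI}(X;Y)=2H(X,Y)-H(X)-H(Y)\leq H(X,Y)\leq\log(n) }[/math], since [math]\displaystyle{ H(X)+H(Y)\geq H(X,Y) }[/math] and the meet has at most [math]\displaystyle{ n }[/math] blocks; and [math]\displaystyle{ \mathrm{VI}(X;Y)=H(X|Y)+H(Y|X)\leq H(X)+H(Y)\leq 2\log(K^*) }[/math], since a partition into at most [math]\displaystyle{ K^* }[/math] clusters has entropy at most [math]\displaystyle{ \log(K^*) }[/math].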
To verify the triangle inequality [math]\displaystyle{ \mathrm{VI}(X; Z ) \leq \mathrm{VI}(X; Y ) + \mathrm{VI}(Y; Z) }[/math], expand using the identity [math]\displaystyle{ \mathrm{VI}(X; Y ) = H(X|Y) + H(Y|X) }[/math]. It simplifies to verifying [math]\displaystyle{ H(X | Z) \leq H(X | Y) + H(Y | Z) }[/math]. The right side is bounded below by [math]\displaystyle{ H(X, Y | Z) }[/math], which is no less than the left side. This is intuitive, as [math]\displaystyle{ X, Y | Z }[/math] contains no less randomness than [math]\displaystyle{ X|Z }[/math].
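Written out, the chain of inequalities is
[math]\displaystyle{ H(X|Y)+H(Y|Z)\,\geq\, H(X|Y,Z)+H(Y|Z)\,=\,H(X,Y|Z)\,\geq\, H(X|Z), }[/math]
where the first step uses the fact that conditioning does not increase entropy and the middle equality is the chain rule for conditional entropy.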
Original source: https://en.wikipedia.org/wiki/Variation of information.