Variation of information


Short description: Measure of distance between two clusterings related to mutual information

In probability theory and information theory, the variation of information or shared information distance is a measure of the distance between two clusterings (partitions of elements). It is closely related to mutual information; indeed, it is a simple linear expression involving the mutual information. Unlike the mutual information, however, the variation of information is a true metric, in that it obeys the triangle inequality.[1][2][3]

Information diagram illustrating the relation between information entropies, mutual information and variation of information.

Definition

Suppose we have two partitions $X$ and $Y$ of a set $A$ into disjoint subsets, namely $X = \{X_1, X_2, \ldots, X_k\}$ and $Y = \{Y_1, Y_2, \ldots, Y_l\}$.

Let:

$n = \sum_i |X_i| = \sum_j |Y_j| = |A|$
$p_i = |X_i| / n$ and $q_j = |Y_j| / n$
$r_{ij} = |X_i \cap Y_j| / n$

Then the variation of information between the two partitions is:

$\mathrm{VI}(X; Y) = -\sum_{i,j} r_{ij} \left[\log(r_{ij}/p_i) + \log(r_{ij}/q_j)\right].$

This is equivalent to the shared information distance between the random variables $i$ and $j$ with respect to the uniform probability measure on $A$ defined by $\mu(B) := |B|/n$ for $B \subseteq A$.
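
The double-sum definition translates directly into a computation over block sizes. The following is a minimal Python sketch, with the function name and toy labellings chosen for illustration, that evaluates $\mathrm{VI}(X; Y)$ for two clusterings given as per-element block labels (natural logarithms):

```python
import math
from collections import Counter

def variation_of_information(x_labels, y_labels):
    """VI(X; Y) for two partitions given as per-element block labels (natural log)."""
    n = len(x_labels)
    p = Counter(x_labels)                  # block sizes |X_i|
    q = Counter(y_labels)                  # block sizes |Y_j|
    r = Counter(zip(x_labels, y_labels))   # intersection sizes |X_i ∩ Y_j|
    vi = 0.0
    for (i, j), size in r.items():
        r_ij, p_i, q_j = size / n, p[i] / n, q[j] / n
        # -r_ij * [ log(r_ij / p_i) + log(r_ij / q_j) ]
        vi -= r_ij * (math.log(r_ij / p_i) + math.log(r_ij / q_j))
    return vi

# Toy example: two clusterings of A = {0, ..., 5}
X = [0, 0, 0, 1, 1, 1]   # blocks {0,1,2}, {3,4,5}
Y = [0, 0, 1, 1, 2, 2]   # blocks {0,1}, {2,3}, {4,5}
print(variation_of_information(X, Y))  # ≈ 0.868 nats
```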

Explicit information content

We can rewrite this definition in terms that explicitly highlight the information content of this metric.

The set of all partitions of a set forms a compact lattice in which the partial order induces two operations, the meet $\wedge$ and the join $\vee$. The maximum $\mathbf{1}$ is the partition with only one block, i.e., all elements grouped together, and the minimum is $\mathbf{0}$, the partition consisting of all elements as singletons. The meet of two partitions $X$ and $Y$ is easy to understand as the partition formed by all pairwise intersections of one block $X_i$ of $X$ and one block $Y_j$ of $Y$. It then follows that $X \wedge Y \subseteq X$ and $X \wedge Y \subseteq Y$.

Let us define the entropy of a partition $X$ as

$H(X) = -\sum_i p_i \log p_i,$

where $p_i = |X_i|/n$. Clearly, $H(\mathbf{1}) = 0$ and $H(\mathbf{0}) = \log n$. The entropy of a partition is a monotone function on the lattice of partitions, in the sense that $X \subseteq Y \Rightarrow H(X) \geq H(Y)$.

Then the VI distance between X and Y is given by

$\mathrm{VI}(X, Y) = 2H(X \wedge Y) - H(X) - H(Y).$

The difference $d(X, Y) := |H(X) - H(Y)|$ is a pseudo-metric, as $d(X, Y) = 0$ does not necessarily imply that $X = Y$. From the definition of $\mathbf{1}$, it follows that $\mathrm{VI}(X, \mathbf{1}) = H(X)$.

If in the Hasse diagram we draw an edge from every partition to the maximum $\mathbf{1}$ and assign it a weight equal to the VI distance between the given partition and $\mathbf{1}$, we can interpret the VI distance as basically an average of differences of edge weights to the maximum:

$\mathrm{VI}(X, Y) = |\mathrm{VI}(X, \mathbf{1}) - \mathrm{VI}(X \wedge Y, \mathbf{1})| + |\mathrm{VI}(Y, \mathbf{1}) - \mathrm{VI}(X \wedge Y, \mathbf{1})| = d(X, X \wedge Y) + d(Y, X \wedge Y).$

For $H(X)$ as defined above, it holds that the joint information of two partitions coincides with the entropy of the meet:

$H(X, Y) = H(X \wedge Y),$

and we also have that $d(X, X \wedge Y) = H(X \wedge Y \mid X)$ coincides with the conditional entropy of the meet (intersection) $X \wedge Y$ relative to $X$.
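
As a small illustration (the helper and variable names below are illustrative, not from the source), the meet of two labelled partitions can be encoded by pairing the labels element-wise, and the identity $\mathrm{VI}(X, Y) = 2H(X \wedge Y) - H(X) - H(Y)$ checked numerically:

```python
import math
from collections import Counter

def entropy_of_partition(labels):
    """H(X) = -sum_i p_i log p_i with p_i = |X_i| / n (natural log)."""
    n = len(labels)
    return -sum((c / n) * math.log(c / n) for c in Counter(labels).values())

# Two partitions of A = {0, ..., 5}, given as per-element block labels.
X = [0, 0, 0, 1, 1, 1]
Y = [0, 0, 1, 1, 2, 2]

# The blocks of the meet X ∧ Y are the nonempty intersections X_i ∩ Y_j,
# so labelling each element by its pair of labels realises the meet.
meet = list(zip(X, Y))

H_X, H_Y, H_meet = map(entropy_of_partition, (X, Y, meet))
print(H_meet)                  # ≈ 1.330, also the joint entropy H(X, Y)
print(2 * H_meet - H_X - H_Y)  # ≈ 0.868, the VI distance
```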

Identities

The variation of information satisfies

$\mathrm{VI}(X; Y) = H(X) + H(Y) - 2I(X, Y),$

where $H(X)$ is the entropy of $X$, and $I(X, Y)$ is the mutual information between $X$ and $Y$ with respect to the uniform probability measure on $A$. This can be rewritten as

$\mathrm{VI}(X; Y) = H(X, Y) - I(X, Y),$

where $H(X, Y)$ is the joint entropy of $X$ and $Y$, or

$\mathrm{VI}(X; Y) = H(X \mid Y) + H(Y \mid X),$

where $H(X \mid Y)$ and $H(Y \mid X)$ are the respective conditional entropies.
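
These identities are easy to verify numerically. A short Python sketch (the helper name H and the toy partitions are illustrative) computes all three expressions from the same contingency counts and obtains the same value:

```python
import math
from collections import Counter

def H(labels):
    """Shannon entropy (nats) of the empirical distribution of the labels."""
    n = len(labels)
    return -sum((c / n) * math.log(c / n) for c in Counter(labels).values())

X = [0, 0, 0, 1, 1, 1]
Y = [0, 0, 1, 1, 2, 2]
XY = list(zip(X, Y))          # joint labels (the meet)

H_X, H_Y, H_XY = H(X), H(Y), H(XY)
I = H_X + H_Y - H_XY          # mutual information
H_X_given_Y = H_XY - H_Y      # H(X|Y)
H_Y_given_X = H_XY - H_X      # H(Y|X)

# All three expressions for VI(X; Y) agree (≈ 0.868 nats here):
print(H_X + H_Y - 2 * I)
print(H_XY - I)
print(H_X_given_Y + H_Y_given_X)
```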

The variation of information can also be bounded, either in terms of the number of elements:

$\mathrm{VI}(X; Y) \leq \log(n),$

or with respect to a maximum number of clusters, $K$:

$\mathrm{VI}(X; Y) \leq 2\log(K).$
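
Both bounds are tight. The first is attained, for example, by $X = \mathbf{0}$ (all $n$ elements as singletons) and $Y = \mathbf{1}$ (a single block): then $I(X, Y) = 0$, $H(X) = \log n$ and $H(Y) = 0$, so $\mathrm{VI}(X; Y) = \log n$. The second is attained, for example, when $n = K^2$ and $X$ and $Y$ are the "rows" and "columns" of a $K \times K$ grid of elements: each partition has $K$ equal blocks, $I(X, Y) = 0$, and $\mathrm{VI}(X; Y) = 2\log K$.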

Triangle inequality

To verify the triangle inequality $\mathrm{VI}(X; Z) \leq \mathrm{VI}(X; Y) + \mathrm{VI}(Y; Z)$, expand using the identity $\mathrm{VI}(X; Y) = H(X \mid Y) + H(Y \mid X)$. It suffices to prove $H(X \mid Z) \leq H(X \mid Y) + H(Y \mid Z)$. The right side is bounded below by $H(X \mid Y, Z) + H(Y \mid Z) = H(X, Y \mid Z)$, which is no less than the left side. This is intuitive, as $(X, Y) \mid Z$ contains no less randomness than $X \mid Z$.
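
A quick empirical illustration of the inequality in Python (the helper names, block count, and sample sizes are arbitrary choices): sample random labellings and check each triple.

```python
import math
import random
from collections import Counter

def H(labels):
    """Shannon entropy (nats) of the empirical distribution of the labels."""
    n = len(labels)
    return -sum((c / n) * math.log(c / n) for c in Counter(labels).values())

def VI(a, b):
    """VI(a; b) = 2*H(a ∧ b) - H(a) - H(b), with the meet encoded by label pairs."""
    return 2 * H(list(zip(a, b))) - H(a) - H(b)

# Sample random 3-block labellings of 50 elements and check the triangle inequality.
random.seed(0)
for _ in range(1000):
    X, Y, Z = ([random.randrange(3) for _ in range(50)] for _ in range(3))
    assert VI(X, Z) <= VI(X, Y) + VI(Y, Z) + 1e-12
print("triangle inequality held on all sampled triples")
```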

References

  1. P. Arabie, S. A. Boorman, "Multidimensional scaling of measures of distance between partitions", Journal of Mathematical Psychology (1973), vol. 10, no. 2, pp. 148–203, doi:10.1016/0022-2496(73)90012-6
  2. W. H. Zurek, Nature, vol. 341, p. 119 (1989); W. H. Zurek, Physical Review A, vol. 40, p. 4731 (1989)
  3. Marina Meila, "Comparing Clusterings by the Variation of Information", Learning Theory and Kernel Machines (2003), vol. 2777, pp. 173–187, doi:10.1007/978-3-540-45167-9_14, Lecture Notes in Computer Science, ISBN:978-3-540-40720-1

Further reading

  • Arabie, P.; Boorman, S. A. (1973). "Multidimensional scaling of measures of distance between partitions". Journal of Mathematical Psychology 10 (2): 148–203. doi:10.1016/0022-2496(73)90012-6. 
  • Meila, Marina (2003). "Comparing Clusterings by the Variation of Information". Learning Theory and Kernel Machines. Lecture Notes in Computer Science 2777: 173–187. doi:10.1007/978-3-540-45167-9_14. ISBN 978-3-540-40720-1. 
  • Meila, M. (2007). "Comparing clusterings—an information based distance". Journal of Multivariate Analysis 98 (5): 873–895. doi:10.1016/j.jmva.2006.11.013. 
  • Kingsford, Carl (2009). "Information Theory Notes". http://www.cs.umd.edu/class/spring2009/cmsc858l/InfoTheoryHints.pdf. 
  • Kraskov, Alexander; Harald Stögbauer; Ralph G. Andrzejak; Peter Grassberger (2003). "Hierarchical Clustering Based on Mutual Information". arXiv:q-bio/0311039.

Licensed under CC BY-SA 3.0 | Source: https://handwiki.org/wiki/Variation_of_information