In probability theory and information theory, the variation of information or shared information distance is a measure of the distance between two clusterings (partitions of elements). It is closely related to mutual information; indeed, it is a simple linear expression involving the mutual information. Unlike the mutual information, however, the variation of information is a true metric, in that it obeys the triangle inequality.[1][2][3]
Suppose we have two partitions [math]\displaystyle{ X }[/math] and [math]\displaystyle{ Y }[/math] of a set [math]\displaystyle{ A }[/math] into disjoint subsets, namely [math]\displaystyle{ X = \{X_{1}, X_{2}, \ldots, X_{k}\} }[/math] and [math]\displaystyle{ Y = \{Y_{1}, Y_{2}, \ldots, Y_{l}\} }[/math].
Let:
[math]\displaystyle{ n=\sum_{i}|X_{i}|=\sum_{j}|Y_{j}|=|A|, \qquad p_{i}=|X_{i}|/n, \qquad q_{j}=|Y_{j}|/n, \qquad r_{ij}=|X_{i}\cap Y_{j}|/n. }[/math]
Then the variation of information between the two partitions is:
[math]\displaystyle{ \mathrm{VI}(X;Y)\,=\,-\sum_{i,j} r_{ij}\left[\log\frac{r_{ij}}{p_{i}}+\log\frac{r_{ij}}{q_{j}}\right]. }[/math]
This is equivalent to the shared information distance between the random variables i and j with respect to the uniform probability measure on [math]\displaystyle{ A }[/math] defined by [math]\displaystyle{ \mu(B):=|B|/n }[/math] for [math]\displaystyle{ B\subseteq A }[/math].
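As a concrete illustration, the following is a minimal Python sketch of this definition, assuming each partition is represented as a list of disjoint sets over the same elements; the function name and the toy partitions are illustrative, not part of the original formulation.

```python
from math import log

def variation_of_information(X, Y):
    """VI(X; Y) for two partitions given as lists of disjoint sets,
    using p_i = |X_i|/n, q_j = |Y_j|/n and r_ij = |X_i ∩ Y_j|/n."""
    n = sum(len(block) for block in X)  # number of elements |A|
    vi = 0.0
    for Xi in X:
        p = len(Xi) / n
        for Yj in Y:
            r = len(Xi & Yj) / n
            if r > 0.0:  # empty intersections contribute nothing
                q = len(Yj) / n
                vi -= r * (log(r / p) + log(r / q))
    return vi

# Example: two partitions of {1, ..., 5}
X = [{1, 2, 3}, {4, 5}]
Y = [{1, 2}, {3, 4, 5}]
print(variation_of_information(X, Y))  # about 0.683 (natural logarithm)
```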
We can rewrite this definition in terms that explicitly highlight the information content of this metric.
The set of all partitions of a set forms a compact lattice in which the partial order induces two operations, the meet [math]\displaystyle{ \wedge }[/math] and the join [math]\displaystyle{ \vee }[/math], where the maximum [math]\displaystyle{ \overline{\mathrm{1}} }[/math] is the partition with only one block, i.e., all elements grouped together, and the minimum is [math]\displaystyle{ \overline{\mathrm{0}} }[/math], the partition consisting of all elements as singletons. The meet of two partitions [math]\displaystyle{ X }[/math] and [math]\displaystyle{ Y }[/math] is easy to understand as the partition formed by all pairwise intersections of one block [math]\displaystyle{ X_{i} }[/math] of [math]\displaystyle{ X }[/math] and one block [math]\displaystyle{ Y_{j} }[/math] of [math]\displaystyle{ Y }[/math]. It then follows that [math]\displaystyle{ X\wedge Y\subseteq X }[/math] and [math]\displaystyle{ X\wedge Y\subseteq Y }[/math].
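With the same list-of-sets representation as in the sketch above, the meet can be computed as the collection of non-empty pairwise block intersections; the helper name below is illustrative.

```python
def meet(X, Y):
    """Meet (common refinement) of two partitions given as lists of sets:
    all non-empty pairwise intersections X_i ∩ Y_j."""
    return [Xi & Yj for Xi in X for Yj in Y if Xi & Yj]

X = [{1, 2, 3}, {4, 5}]
Y = [{1, 2}, {3, 4, 5}]
print(meet(X, Y))  # [{1, 2}, {3}, {4, 5}], which refines both X and Y
```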
Let's define the entropy of a partition [math]\displaystyle{ X }[/math] as
[math]\displaystyle{ H(X)\,=\,-\sum_{i} p_{i}\log p_{i}, }[/math]
where [math]\displaystyle{ p_{i}=|X_i|/n }[/math]. Clearly, [math]\displaystyle{ H(\overline{\mathrm{1}})=0 }[/math] and [math]\displaystyle{ H(\overline{\mathrm{0}})=\log\,n }[/math]. The entropy of a partition is a monotone function on the lattice of partitions, in the sense that [math]\displaystyle{ X\subseteq Y\Rightarrow H(X) \geq H(Y) }[/math].
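A small Python sketch of this partition entropy, again with a partition given as a list of disjoint sets (the helper name is illustrative), also confirms the two extreme values on a four-element set.

```python
from math import log

def partition_entropy(X):
    """H(X) = -sum_i p_i log p_i with p_i = |X_i|/n, for a partition
    given as a list of disjoint sets."""
    n = sum(len(block) for block in X)
    return -sum((len(b) / n) * log(len(b) / n) for b in X)

A = {1, 2, 3, 4}
print(partition_entropy([A]))               # maximum (one block): 0.0
print(partition_entropy([{a} for a in A]))  # minimum (singletons): log 4, about 1.386
```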
Then the VI distance between [math]\displaystyle{ X }[/math] and [math]\displaystyle{ Y }[/math] is given by
[math]\displaystyle{ \mathrm{VI}(X,Y)\,=\,2H(X\wedge Y)-H(X)-H(Y). }[/math]
The difference [math]\displaystyle{ d(X,Y)\equiv |H\left(X\right)-H\left(Y\right)| }[/math] is a pseudo-metric, as [math]\displaystyle{ d(X,Y)=0 }[/math] doesn't necessarily imply that [math]\displaystyle{ X=Y }[/math]. From the definition of [math]\displaystyle{ \overline{\mathrm{1}} }[/math], it follows that [math]\displaystyle{ \mathrm{VI}(X,\overline{\mathrm{1}})\,=\,H\left(X\right) }[/math].
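The last identity can be read off the meet-based formula above, since [math]\displaystyle{ X\wedge\overline{\mathrm{1}}=X }[/math] and [math]\displaystyle{ H(\overline{\mathrm{1}})=0 }[/math]:
[math]\displaystyle{ \mathrm{VI}(X,\overline{\mathrm{1}})\,=\,2H(X\wedge\overline{\mathrm{1}})-H(X)-H(\overline{\mathrm{1}})\,=\,2H(X)-H(X)-0\,=\,H(X). }[/math]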
If in the Hasse diagram we draw an edge from every partition to the maximum [math]\displaystyle{ \overline{\mathrm{1}} }[/math] and assign it a weight equal to the VI distance between the given partition and [math]\displaystyle{ \overline{\mathrm{1}} }[/math], we can interpret the VI distance as basically an average of differences of edge weights to the maximum:
[math]\displaystyle{ \mathrm{VI}(X,Y)\,=\,|\mathrm{VI}(X,\overline{\mathrm{1}})-\mathrm{VI}(X\wedge Y,\overline{\mathrm{1}})|+|\mathrm{VI}(Y,\overline{\mathrm{1}})-\mathrm{VI}(X\wedge Y,\overline{\mathrm{1}})|\,=\,d(X,X\wedge Y)+d(Y,X\wedge Y). }[/math]
For [math]\displaystyle{ H(X) }[/math] as defined above, it holds that the joint information of two partitions coincides with the entropy of the meet:
[math]\displaystyle{ H(X,Y)\,=\,H(X\wedge Y), }[/math]
and we also have that [math]\displaystyle{ d(X,X\wedge Y)\,=\,H(X\wedge Y|X) }[/math] coincides with the conditional entropy of the meet (intersection) [math]\displaystyle{ X\wedge Y }[/math] relative to [math]\displaystyle{ X }[/math].
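Spelled out, this uses the previous identity [math]\displaystyle{ H(X,Y)=H(X\wedge Y) }[/math] together with the chain rule for conditional entropy:
[math]\displaystyle{ H(X\wedge Y|X)\,=\,H(X,Y)-H(X)\,=\,H(X\wedge Y)-H(X)\,=\,d(X,X\wedge Y), }[/math]
where the last step uses the monotonicity property [math]\displaystyle{ H(X\wedge Y)\geq H(X) }[/math].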
The variation of information satisfies
[math]\displaystyle{ \mathrm{VI}(X;Y)\,=\,H(X)+H(Y)-2I(X,Y), }[/math]
where [math]\displaystyle{ H(X) }[/math] is the entropy of [math]\displaystyle{ X }[/math], and [math]\displaystyle{ I(X, Y) }[/math] is the mutual information between [math]\displaystyle{ X }[/math] and [math]\displaystyle{ Y }[/math] with respect to the uniform probability measure on [math]\displaystyle{ A }[/math]. This can be rewritten as
[math]\displaystyle{ \mathrm{VI}(X;Y)\,=\,H(X,Y)-I(X,Y), }[/math]
where [math]\displaystyle{ H(X,Y) }[/math] is the joint entropy of [math]\displaystyle{ X }[/math] and [math]\displaystyle{ Y }[/math], or
[math]\displaystyle{ \mathrm{VI}(X;Y)\,=\,H(X|Y)+H(Y|X), }[/math]
where [math]\displaystyle{ H(X|Y) }[/math] and [math]\displaystyle{ H(Y|X) }[/math] are the respective conditional entropies.
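The three expressions can be checked numerically; the following self-contained Python sketch does so on a small example (the toy partitions and the helper H are illustrative).

```python
from math import log

# Two toy partitions of {1, ..., 5}, given as lists of blocks (sets).
X = [{1, 2, 3}, {4, 5}]
Y = [{1, 2}, {3, 4, 5}]
n = 5

def H(blocks):
    """Entropy of a partition, with block probabilities |B|/n."""
    return -sum((len(b) / n) * log(len(b) / n) for b in blocks)

common = [Xi & Yj for Xi in X for Yj in Y if Xi & Yj]  # the meet X ∧ Y
Hx, Hy, Hxy = H(X), H(Y), H(common)                    # H(X,Y) equals H(X ∧ Y)
I = Hx + Hy - Hxy                                      # mutual information I(X, Y)

# H(X)+H(Y)-2I,  H(X,Y)-I,  and  H(X|Y)+H(Y|X) all give the same value.
print(Hx + Hy - 2 * I, Hxy - I, (Hxy - Hy) + (Hxy - Hx))
```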
The variation of information can also be bounded, either in terms of the number of elements:
[math]\displaystyle{ \mathrm{VI}(X;Y)\leq \log(n), }[/math]
or with respect to a maximum number of clusters, [math]\displaystyle{ K^* }[/math]:
[math]\displaystyle{ \mathrm{VI}(X;Y)\leq 2\log(K^*). }[/math]
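Both bounds can be seen from the identities above: [math]\displaystyle{ \mathrm{VI}(X;Y)=2H(X,Y)-H(X)-H(Y)\leq H(X,Y)\leq\log(n) }[/math], since [math]\displaystyle{ H(X)+H(Y)\geq H(X,Y) }[/math] and the meet has at most [math]\displaystyle{ n }[/math] blocks; and [math]\displaystyle{ \mathrm{VI}(X;Y)=H(X|Y)+H(Y|X)\leq H(X)+H(Y)\leq 2\log(K^*) }[/math], since a partition into at most [math]\displaystyle{ K^* }[/math] clusters has entropy at most [math]\displaystyle{ \log(K^*) }[/math].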
To verify the triangle inequality [math]\displaystyle{ \mathrm{VI}(X; Z ) \leq \mathrm{VI}(X; Y ) + \mathrm{VI}(Y; Z) }[/math], expand using the identity [math]\displaystyle{ \mathrm{VI}(X; Y ) = H(X|Y) + H(Y|X) }[/math]. It simplifies to verifying [math]\displaystyle{ H(X | Z) \leq H(X | Y) + H(Y | Z) }[/math]. The right side is bounded below by [math]\displaystyle{ H(X, Y | Z) }[/math], which is no less than the left side. This is intuitive, as [math]\displaystyle{ X, Y | Z }[/math] contains no less randomness than [math]\displaystyle{ X|Z }[/math].
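Written out, the chain of inequalities is
[math]\displaystyle{ H(X|Y)+H(Y|Z)\,\geq\, H(X|Y,Z)+H(Y|Z)\,=\,H(X,Y|Z)\,\geq\, H(X|Z), }[/math]
where the first step uses the fact that conditioning does not increase entropy and the middle equality is the chain rule for conditional entropy.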
Original source: https://en.wikipedia.org/wiki/Variation of information.