A basic concept in cybernetics. In cybernetics one studies machines and living organisms only from the point of view of their ability to absorb information given to them, to store information in a "memory" , to transmit it over a communication channel, and to transform it into "signals" . The intuitive picture of information relative to certain quantities or phenomena contained in certain data is developed in cybernetics.
In certain situations it is just as natural to be able to compare various groups of data by the information contained in it as it is to compare plane figures by their "areas" : Independent of the manner of measuring areas one can prove that a figure $ A $ does not have a larger area than $ B $ if $ A $ can be completely included in $ B $( cf. Examples 1–3 below). The deeper fact that it is possible to express area by a number and thereby comparing figures of arbitrary shape is a result of an extensive mathematical theory. The analogue of this fundamental result in information theory is the statement that under definite, very wide, assumptions one may disregard the qualitative peculiarities of information and express its amount by a number. This number only describes the possibility of transmitting information over a communication channel and of storing it in machines with a memory.
Example 1. Specifying the position and velocity of a particle moving in a force field provides information on its position at any future moment of time; this information is, moreover, complete: its position can be exactly predicted. Specifying the energy of a particle also provides information, but this information is incomplete, obviously.
Example 2. The equality
$$ \tag{1 } a = b $$
provides information about the relation between the variables $ a $ and $ b $. The equality
$$ \tag{2 } a ^ {2} = b ^ {2} $$
provides less information (since (1) implies (2), but they are not equivalent). Finally, the equality (for real numbers)
$$ \tag{3 } a ^ {3} = b ^ {3} , $$
is equivalent to (1) and provides the same information, i.e. (1) and (3) are different forms of specifying the same information.
Example 3. Results of measurements of some physical quantity, performed within certain errors, provide information on its exact value. By increasing the number of observations one changes this information.
Example $ 3a $. The arithmetical average of results of observations also contains certain information about the quantity being measured. As is shown in mathematical statistics, if the errors have a normal probability distribution with known variance, then the arithmetical average contains all information.
Example 4. Suppose that the result of a measurement is a random variable $ \xi $. By transmitting $ \xi $ over a communication channel, $ \xi $ is distorted, so that at the receiving end of the channel one obtains the variable
$$ \eta = \xi + \theta $$
where $ \theta $ is independent of $ \xi $( in the sense of probability theory). The "output" $ \eta $ provides information on the "input" $ \xi $, and it is natural to assume that this information is smaller because $ \theta $ has "scattered" values.
In each of the examples given, data are compared with respect to providing information which is more complete or less. In Examples 1–3 the meaning of this comparison is clear and leads to the analysis of the equivalence or non-equivalence of certain relations. In Examples 3a and 4 this meaning needs to be made more precise. This is provided in mathematical statistics and information theory (for which these examples are typical).
At the basis of information theory is a definition suggested in 1948 by C.E. Shannon, of measuring the amount of information contained in one random object (event, variable, function, etc.) with respect to another. It consists in expressing the amount of information by a number. It can be extremely well explained in the simplest case when the random objects considered are random variables taking only a finite number of values. Let $ \xi $ be a random variable taking values $ x _ {1} \dots x _ {n} $ with probabilities $ p _ {1} \dots p _ {n} $ and let $ \eta $ be a random variable taking values $ y _ {1} \dots y _ {m} $ with probabilities $ q _ {1} \dots q _ {m} $. Then the information $ I ( \xi , \eta ) $ contained in $ \xi $ with respect to $ \eta $ is defined by the formula
$$ I ( \xi , \eta ) = \ \sum _ {i , j } p _ {ij} \mathop{\rm log} _ {2} \ \frac{p _ {ij} }{p _ {i} q _ {j} } , $$
where $ p _ {ij} $ is the probability of joint occurrence of $ \xi = x _ {i} $ and $ \eta = y _ {j} $, and the logarithm is to base 2. The information $ I ( \xi , \eta ) $ has a number of properties that are naturally required for a measure of quantity of information. Thus, always $ I ( \xi , \eta ) \geq 0 $, and equality holds if only if $ p _ {ij} = p _ {i} q _ {j} $ for all $ i $ and $ j $, i.e. if and only if $ \xi $ and $ \eta $ are independent random variables. Further, $ I ( \xi , \eta ) \leq I ( \eta , \eta ) $ and equality holds only if $ \eta $ is a function of $ \xi $( e.g. $ \eta = \xi ^ {2} $, etc.). More surprising is the fact that $ I ( \xi , \eta ) = I ( \eta , \xi ) $.
The quantity $ H ( \xi ) = I ( \xi , \xi ) = \sum _ {i} p _ {i} \mathop{\rm log} _ {2} ( 1 / p _ {i} ) $ is called the entropy of $ \xi $. The concept of the entropy is basic in information theory. The amount of information and the entropy are related by
$$ \tag{5 } I ( \xi , \eta ) = \ H ( \xi ) + H ( \eta ) - H ( \xi , \eta ) , $$
where $ H ( \xi , \eta ) $ is the entropy of the pair $ ( \xi , \eta ) $, i.e.
$$ H ( \xi , \eta ) = \ \sum _ {i , j } p _ {ij} \mathop{\rm log} _ {2} \ \frac{1}{p _ {ij} } . $$
The entropy turns out to be the average number of binary symbols necessary for differentiation (or description) of the possible values of a random variable. This makes it possible to understand the role of the amount of information
in "storing" information in machines with a memory. If $ \xi $ and $ \eta $ are independent random variables, then one needs on the average $ H ( \xi ) $ binary symbols to write down the values of $ \xi $, $ H ( \eta ) $ binary symbols for those of $ \eta $, and $ H ( \xi , \eta ) $ binary symbols for those of the pair $ ( \xi , \eta ) $. If $ \xi $ and $ \eta $ are dependent, then the average number of binary symbols necessary for writing down the pair $ ( \xi , \eta ) $ is less than $ H ( \xi ) + H ( \eta ) $, since $ H ( \xi , \eta ) = H ( \xi ) + H ( \eta ) - I ( \xi , \eta ) $.
Using deeper theorems, the role of the amount of information
in problems of information transmission over communication channels can be explained. The basic information-theoretic characteristic of channels, their so-called capacity (cf. Transmission rate of a channel), is defined in terms of the concept of "information" .
If $ \xi $ and $ \eta $ may take an infinite set of values, then by limit transition one obtains from :
$$ \tag{6 } I ( \xi , \eta ) = \int\limits \int\limits p ( x , y ) \mathop{\rm log} _ {2} \ \frac{p ( x , y ) }{p ( x) q ( y) } \ d x d y , $$
where $ p $ and $ q $ denote the corresponding probability densities. The entropies $ H ( \xi ) $ and $ H ( \eta ) $ do not exist in this case, but there is the formula, analogous to (5),
$$ \tag{7 } I ( \xi , \eta ) = h ( \xi ) + h ( \eta ) - h ( \xi , \eta ) , $$
where
$$ h ( \xi ) = \int\limits p ( x) \mathop{\rm log} _ {2} \frac{1}{p ( x) } d x $$
is the differential entropy of $ \xi $( $ h ( \eta ) $ and $ h ( \xi , \eta ) $ are defined likewise).
Example 5. Suppose that under the conditions of Example 4 the random variables $ \xi $ and $ \theta $ have normal probability distributions with mean zero and with variances equal to, respectively, $ \sigma _ \xi ^ {2} $ and $ \sigma _ \theta ^ {2} $. Then, as may be inferred from (6) or (7): $ I ( \eta , \xi ) = I ( \xi , \eta ) = ( 1 / 2 ) \mathop{\rm log} _ {2} ( 1 + \sigma _ \xi ^ {2} / \sigma _ \theta ^ {2} ) $. Thus, the amount of information in the "received signal" $ \eta $ with respect to the "transmitted signal" $ \xi $ tends to zero as the level of "noise" $ \theta $ grows (i.e. as $ \sigma _ \xi ^ {2} \rightarrow \infty $), and grows without bound when the "noise" vanishes (i.e. as $ \sigma _ \theta ^ {2} \rightarrow 0 $).
The case when the random variables $ \xi $ and $ \eta $ in Example 4 or 5 are stochastic functions (or, as one says, stochastic processes) $ \xi ( t) $ and $ \eta ( t) $, describing the variation of a quantity at the input, respectively output, of the channel, is of special interest. The amount of information in $ \eta ( t) $ with respect to $ \xi ( t) $ for a given level of noise (in acoustic terminology) may serve as a criterion of the quality of the channel itself.
In problems in mathematical statistics one also uses the concept of information (cf. Examples 3 and 3a). However, both by its formal definition as by the name it has been given, it differs from the concept defined above (in information theory). Statistics deals with a large number of results of observations and usually replaces the complete listing of them by certain combined characteristics. In this replacement information is sometimes lost, but under certain conditions the combined characteristics contain all the information contained in the complete data (this statement is explained at the end of Example 6 below). The concept of information was introduced into statistics by R.A. Fisher in 1921.
Example 6. Let $ \xi _ {1} \dots \xi _ {n} $ be the results of $ n $ independent observations of some quantity, normally distributed with probability density
$$ p ( x ; a ; \sigma ^ {2} ) = \ \frac{1}{\sigma \sqrt {2 \pi } } \mathop{\rm exp} \ \left \{ - \frac{( x - a ) ^ {2} }{2 \sigma ^ {2} } \right \} , $$
where the parameters $ a $ and $ \sigma ^ {2} $( the mean and variance) are unknown and must be estimated using the results of observations. Sufficient statistics (i.e. functions in the results of observations containing complete information on the unknown parameters) for this case are provided by the arithmetical average
$$ \overline \xi \; = \frac{1}{n} \sum _ { i= } 1 ^ { n } \xi _ {i} , $$
and the so-called empirical variance
$$ s ^ {2} = \frac{1}{n} \sum _ { i= } 1 ^ { n } ( \xi _ {i} - \overline \xi \; ) ^ {2} . $$
If $ \sigma ^ {2} $ is known, then by itself $ \overline \xi \; $ is a sufficient statistic (cf. Example 3a).
The meaning of the term "complete information" can be clarified in the following way. Suppose one has a function of the unknown parameter $ \phi = \phi ( a , \sigma ^ {2} ) $, let $ \phi ^ {*} = \phi ^ {*} ( \xi _ {1} \dots \xi _ {n} ) $ be an estimator for it that is free of systematic errors. Suppose that the quality of the estimator (its exactness) is measured (as is usual in problems in mathematical statistics) by the variance of the difference $ \phi ^ {*} - \phi $. Then there exists another estimator $ \phi ^ {**} $, not depending on the remaining $ \xi _ {i} $ but only on $ \overline \xi \; $ and $ \sigma ^ {2} $, that is not worse (in the sense of the criterion mentioned above) than $ \phi ^ {*} $. Fisher has also proposed a measure of the (average) amount of information with respect to an unknown parameter contained in one observation. This concept is revealed in the theory of statistical estimation.
For references, see Information, transmission of.
[a1] | C.E. Shannon, "A mathematical theory of communication" Bell. System Techn. J. , 27 (1948) pp. 379–423; 623–656 |
[a2] | C.E. Shannon, W. Weaver, "The mathematical theory of communication" , Univ. Illinois Press (1949) |
[a3] | T. Berger, "Rate distortion theory" , Prentice-Hall (1970) |