The part of information theory related to the study of processes of information transfer from a source to a receiver (the addressee), cf. Information, source of. In the theory of information transmission one studies optimum and near-optimum methods of information transmission over a communication channel under the assumption that the methods of encoding the message into the input signal of the channel and of decoding the output signal of the channel into an output message may vary within wide ranges (cf. Coding and decoding).
The general scheme of information transmission systems, first considered by C. Shannon [1], can be described as follows. The source of information produces a message that has to be transmitted over the communication channel from the source to the receiver. It is usually assumed that this message is a random variable $ \xi $ defined on some probability space $ ( \Omega , \mathfrak A , {\mathsf P} ) $, taking values in a measurable space $ ( \mathfrak X , S _ {\mathfrak X } ) $ and having a probability distribution $ p ( \cdot ) $. Often $ \xi = \{ {\xi ( t) } : {t \in \Delta } \} $, where $ \Delta $ is the range of the parameter $ t $, is a stochastic process in discrete or continuous time with values in a measurable space $ ( X , S _ {X} ) $. E.g., in the case of discrete time $ \xi = \{ {\xi _ {k} } : {k = 1 , 2 ,\dots } \} $ or $ \xi = \{ {\xi _ {k} } : {\dots , - 1 , 0 , 1 ,\dots } \} $; the random variables $ \xi _ {k} $, taking values in $ ( X , S _ {X} ) $, are called the components of the input message and are often treated as the messages produced by the source at the moment of time $ k $. The vectors $ \xi ^ {n} = ( \xi _ {1} \dots \xi _ {n} ) $, taking values in the $ n $-fold product $ ( X ^ {n} , S _ {X ^ {n} } ) $ of $ ( X , S _ {X} ) $, are called segments of length $ n $ of the input message. The corresponding concepts are defined analogously when the message is a continuous-time stochastic process.
The output message, received by the receiver, is also a random variable $ \widetilde \xi $, defined on the same probability space $ ( \Omega , \mathfrak A , {\mathsf P} ) $ and taking values in a measurable space $ ( \widetilde{\mathfrak X} , S _ {\widetilde{\mathfrak X} } ) $ (in general, different from $ ( \mathfrak X , S _ {\mathfrak X } ) $). When $ \widetilde \xi $ is a stochastic process in discrete or continuous time, one introduces in an analogous way the space $ ( \widetilde{X} , S _ {\widetilde{X} } ) $ of values of the components at the output and the space $ ( \widetilde{X} {} ^ {n} , S _ {\widetilde{X} {} ^ {n} } ) $ of values of the segments of length $ n $ of the output message.
The exactness of reproducibility of the information serves as the measure for the quality of the transmission of the message over the communication channel (cf. Information, exactness of reproducibility of). As a rule, if the transmission is over a communication channel with noise, it is impossible to obtain absolute exactness, i.e. complete coincidence of the messages sent and received, even if the sets $ \mathfrak X $ and $ \widetilde{\mathfrak X} $ coincide. The exactness requirement is often treated statistically, by specifying a class $ W $ of admissible joint probability distributions for pairs $ ( \xi , \widetilde \xi ) $ of messages sent and received within the set of all probability measures on the product $ ( \mathfrak X \times \widetilde{\mathfrak X} , S _ {\mathfrak X } \times S _ {\widetilde{\mathfrak X} } ) $. The class $ W $ is often given by means of a non-negative measurable function $ \rho ( x , \widetilde{x} ) $, $ x \in \mathfrak X $, $ \widetilde{x} \in \widetilde{\mathfrak X} $, and a number $ a > 0 $: a probability distribution of $ ( \xi , \widetilde \xi ) $ is taken to belong to $ W $ if and only if
$$ \tag{1 } {\mathsf E} \rho ( \xi , \widetilde \xi ) \leq a . $$
Thus, the condition of exactness of reproducibility specifies by how much the message received may differ from the message sent.
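For illustration, a typical choice is the Hamming distortion $ \rho ( x , \widetilde{x} ) = 1 $ if $ x \neq \widetilde{x} $ and $ \rho ( x , \widetilde{x} ) = 0 $ otherwise. The following minimal sketch (in Python, with a hypothetical Bernoulli message whose reproduction is corrupted independently with a fixed probability; the numbers are chosen purely for the example) estimates $ {\mathsf E} \rho ( \xi , \widetilde \xi ) $ by Monte Carlo and checks condition (1).

```python
import numpy as np

rng = np.random.default_rng(0)

def rho(x, x_tilde):
    """Hamming distortion: 1 where the letters differ, 0 otherwise."""
    return (x != x_tilde).astype(float)

# Hypothetical example: a Bernoulli(1/2) message whose reproduction is
# corrupted independently with probability 0.05.
n = 100_000
xi = rng.integers(0, 2, size=n)
xi_tilde = xi ^ (rng.random(n) < 0.05)

a = 0.1                                  # admissible mean distortion
mean_rho = rho(xi, xi_tilde).mean()      # Monte Carlo estimate of E rho(xi, xi~)
print(mean_rho, mean_rho <= a)           # condition (1) holds for this pair
```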
Messages produced by a source are sent over a communication channel. A channel $ ( Q , V ) $ is specified by: two measurable spaces $ ( \mathfrak Y , S _ {\mathfrak Y } ) $ and $ ( \widetilde{\mathfrak Y} , S _ {\widetilde{\mathfrak Y} } ) $; a transition function $ Q ( y , A ) $, $ y \in \mathfrak Y $, $ A \in S _ {\widetilde{\mathfrak Y} } $, that is measurable with respect to the $ \sigma $-algebra $ S _ {\mathfrak Y } $ for fixed $ A \in S _ {\widetilde{\mathfrak Y} } $ and is a probability measure on $ ( \widetilde{\mathfrak Y} , S _ {\widetilde{\mathfrak Y} } ) $ for fixed $ y \in \mathfrak Y $; and a subset $ V $ of the space of all probability measures on $ ( \mathfrak Y , S _ {\mathfrak Y } ) $. The spaces $ ( \mathfrak Y , S _ {\mathfrak Y } ) $ and $ ( \widetilde{\mathfrak Y} , S _ {\widetilde{\mathfrak Y} } ) $ are called, respectively, the spaces of input and output signals of the channel, and $ V $ is a restriction on the distribution of the input signal. One says that two random variables $ \eta $ and $ \widetilde \eta $ (defined on a probability space $ ( \Omega , \mathfrak A , {\mathsf P} ) $) are related through the channel $ ( Q , V ) $ if they take values in $ ( \mathfrak Y , S _ {\mathfrak Y } ) $ and $ ( \widetilde{\mathfrak Y} , S _ {\widetilde{\mathfrak Y} } ) $, respectively, if for any $ A \in S _ {\widetilde{\mathfrak Y} } $ the conditional probability satisfies, with probability 1,
$$ \tag{2 } {\mathsf P} \{ \widetilde \eta \in A \mid \eta \} = \ Q ( \eta , A ) , $$
and the probability distribution of $ \eta $ belongs to $ V $. Most often $ V $ is given by a measurable function $ \pi ( y) $, $ y \in \mathfrak Y $, and a number $ b > 0 $: a probability distribution of $ \eta $ is taken to belong to $ V $ if and only if
$$ \tag{3 } {\mathsf E} \pi ( \eta ) \leq b . $$
In the case of discrete channels $ V $ usually coincides with the set of all probability distributions, i.e. there is no restriction.
In fact, $ \mathfrak Y $ is the set of signals emitted by the transmitter, and $ \widetilde{\mathfrak Y} $ is the set of signals received by the receiver (in applications $ \mathfrak Y $ and $ \widetilde{\mathfrak Y} $ often coincide). For a given value of the input signal $ \eta $, relation (2) determines the conditional distribution of the output signal $ \widetilde \eta $. The restriction $ V $ is introduced because in many applications the distribution of the input signal cannot be chosen arbitrarily (a typical case is the requirement that the mean square (the power) of the input signal does not exceed a fixed constant). The case when the input and output signals are discrete- or continuous-time stochastic processes $ \eta = \{ \eta ( t) \} $, $ \widetilde \eta = \{ \widetilde \eta ( t) \} $, defined on a certain finite or infinite (on one or both sides) interval of the real axis and taking values in measurable spaces $ ( Y , S _ {Y} ) $ and $ ( \widetilde{Y} , S _ {\widetilde{Y} } ) $, respectively, is important in applications. E.g., if $ \eta = \{ \eta _ {1} , \eta _ {2} ,\dots \} $ and $ \widetilde \eta = \{ \widetilde \eta _ {1} , \widetilde \eta _ {2} ,\dots \} $ are random sequences, then the communication channel for which $ \eta $ and $ \widetilde \eta $ serve as input and output signals is often regarded as a sequence of channels (in the sense described above), called segments of the given channel; the input and output signals of these segments are the vectors
$$ \eta ^ {n} = ( \eta _ {1} \dots \eta _ {n} ) \ \ \textrm{ and } \ \widetilde \eta {} ^ {n} = \ ( \widetilde \eta _ {1} \dots \widetilde \eta _ {n} ) ,\ \ n = 1 , 2 ,\dots . $$
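As a concrete illustration (a sketch with hypothetical parameters, not part of the general formalism): for a binary symmetric channel the transition function $ Q ( y , \cdot ) $ is given by a row-stochastic $ 2 \times 2 $ matrix, and a segment of length $ n $ of a memoryless channel acts on an input vector coordinatewise.

```python
import numpy as np

rng = np.random.default_rng(1)

# Transition function Q(y, .) of a binary symmetric channel with crossover
# probability p: row y is the distribution of the output given input y.
p = 0.1
Q = np.array([[1 - p, p],
              [p, 1 - p]])

def channel_segment(eta, rng):
    """Length-n segment of the memoryless channel: each output coordinate is
    drawn from Q(eta_k, .) independently of the others."""
    return np.array([rng.choice(2, p=Q[y]) for y in eta])

eta = rng.integers(0, 2, size=10)        # an input signal eta^10
eta_tilde = channel_segment(eta, rng)    # the corresponding output signal
print(eta)
print(eta_tilde)
```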
In order to convert the input message into a signal that can be transmitted over the communication channel, and to convert the signal received at the output of the channel into an output message, one performs operations of encoding and decoding. An encoding is a function $ f ( x) $ of $ x \in \mathfrak X $ with values in $ \mathfrak Y $, and a decoding is a function $ g ( \widetilde{y} ) $ of $ \widetilde{y} \in \widetilde{\mathfrak Y} $ with values in $ \widetilde{\mathfrak X} $. The set of values of $ f ( x) $, $ x \in \mathfrak X $, is called a code, and the individual elements of this set are called code words. Using an encoding $ f ( x) $ and a decoding $ g ( \widetilde{y} ) $ means that if the message takes the value $ x \in \mathfrak X $, then $ y = f ( x) $ is transmitted over the channel; if $ \widetilde{y} \in \widetilde{\mathfrak Y} $ is received at the output of the channel, then it is decoded into the output message $ \widetilde{x} = g ( \widetilde{y} ) $. In the theory of information transmission one often also considers random encodings, i.e. encodings in which the code words are chosen randomly in accordance with a certain probability distribution.
A message with probability distribution $ p ( \cdot ) $, produced by the source, can be transmitted with exactness of reproducibility $ W $ over a channel $ ( Q , V ) $ by means of an encoding $ f ( \cdot ) $ and a decoding $ g ( \cdot ) $ if random variables $ \xi $, $ \eta $, $ \widetilde \eta $, $ \widetilde \xi $ forming a Markov chain can be constructed such that $ \xi $ has probability distribution $ p ( \cdot ) $, the probability distribution of $ ( \xi , \widetilde \xi ) $ belongs to $ W $, the pair $ ( \eta , \widetilde \eta ) $ is related through $ ( Q , V ) $, and
$$ \tag{4 } \eta = f ( \xi ) ,\ \ \widetilde \xi = g ( \widetilde \eta ) . $$
The assumption that $ \xi $, $ \eta $, $ \widetilde \eta $, $ \widetilde \xi $ form a Markov chain reduces to the assumption that the conditional probability of $ \widetilde \eta $ for fixed values of $ \xi $ and $ \eta $ depends on $ \eta $ only, i.e. it means that the output signal depends only on the input signal and not on the value of the message encoded by it.
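The following sketch puts the pieces together under purely illustrative assumptions (a Bernoulli message, a threefold repetition code as the encoding $ f $, majority decoding as $ g $, and a binary symmetric channel): it realizes the chain $ \xi \rightarrow \eta \rightarrow \widetilde \eta \rightarrow \widetilde \xi $ of (4) and estimates $ {\mathsf E} \rho ( \xi , \widetilde \xi ) $ for the Hamming distortion. Whether such a pair $ ( \xi , \widetilde \xi ) $ satisfies a given exactness condition $ W $ then reduces to checking an inequality of the form (1).

```python
import numpy as np

rng = np.random.default_rng(2)
p = 0.1                                   # crossover probability of the channel

def f(x):
    """Encoding: repeat each message letter three times (code words 000 and 111)."""
    return np.repeat(x, 3)

def g(y_tilde):
    """Decoding: majority vote over each block of three received symbols."""
    return (y_tilde.reshape(-1, 3).sum(axis=1) >= 2).astype(int)

n = 100_000
xi = rng.integers(0, 2, size=n)                 # message xi with distribution p(.)
eta = f(xi)                                     # input signal eta = f(xi)
eta_tilde = eta ^ (rng.random(eta.size) < p)    # memoryless binary symmetric channel, cf. (2)
xi_tilde = g(eta_tilde)                         # output message xi~ = g(eta~)

# Monte Carlo estimate of E rho(xi, xi~); for this code it should be close to
# 3 p^2 - 2 p^3 = 0.028.
print((xi != xi_tilde).mean())
```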
The basic problem studied in the theory of information transmission is as follows. Suppose that the following are known and fixed: a source generating messages with probability distribution $ p ( \cdot ) $; a communication channel $ ( Q , V ) $; and a condition of exactness of reproducibility $ W $. The problem is to clarify under what conditions there exist an encoding $ f ( \cdot ) $ and a decoding $ g ( \cdot ) $ such that a message produced by the given source can be transmitted over $ ( Q , V ) $ with exactness of reproducibility $ W $. Solutions of this problem under various assumptions are called coding theorems, or Shannon theorems. Another naturally arising problem is, when transmission is possible, to construct the simplest and most efficient methods of encoding and decoding.
Shannon [1] introduced quantities that allow one to formulate an answer to the first problem posed. The amount of information, or simply the information, $ I ( \cdot , \cdot ) $ is the most important among these (cf. Information, amount of). If
$$ \tag{5 } C = C ( Q , V ) = \ \sup I ( \eta , \widetilde \eta ) $$
is the capacity of the channel $ ( Q , V ) $ (cf. Transmission rate of a channel), where the supremum is over all pairs $ ( \eta , \widetilde \eta ) $ related through $ ( Q , V ) $, and if the number
$$ \tag{6 } H _ {W} ( p) = \inf I ( \xi , \widetilde \xi ) $$
is the $ W $-entropy (cf. Entropy) of the message, where the infimum is over all pairs $ ( \xi , \widetilde \xi ) $ whose joint probability distribution belongs to $ W $ and for which $ \xi $ has probability distribution $ p ( \cdot ) $, then the following theorem of Shannon (the converse of the coding theorem) holds: If a message with probability distribution $ p ( \cdot ) $ can be transmitted over $ ( Q , V ) $ with exactness of reproducibility $ W $, then
$$ \tag{7 } H _ {W} ( p) \leq \ C ( Q , V ) . $$
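As a numerical illustration of (7), reading the inequality per source letter with one channel use per letter and using standard closed-form expressions (not derived here): for a binary symmetric channel with crossover probability $ p $ the capacity equals $ 1 - h ( p) $ bits per symbol, and for a Bernoulli($ q $) source with Hamming distortion level $ a < q \leq 1 / 2 $ the $ W $-entropy equals $ h ( q) - h ( a) $ bits per letter, $ h ( \cdot ) $ being the binary entropy function. A sketch with hypothetical parameter values:

```python
import numpy as np

def h2(x):
    """Binary entropy in bits, with h2(0) = h2(1) = 0."""
    x = np.clip(x, 1e-15, 1 - 1e-15)
    return -x * np.log2(x) - (1 - x) * np.log2(1 - x)

# Capacity of a binary symmetric channel with crossover probability p_ch
# (bits per channel symbol).
p_ch = 0.05
C = 1.0 - h2(p_ch)

# W-entropy (rate-distortion function) of a Bernoulli(q) source under the
# Hamming-distortion condition E rho <= a, for a < q <= 1/2 (bits per letter).
q, a = 0.5, 0.1
H_W = h2(q) - h2(a)

# Necessary condition (7), read per source letter with one channel use per letter.
print(f"H_W = {H_W:.3f} bits, C = {C:.3f} bits, (7) satisfied: {H_W <= C}")
```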
Sufficient conditions for the possibility of transmission of information are more difficult to obtain. Condition (7) turns out to be sufficient only in a certain asymptotic sense, the main assumption being that $ H _ {W} ( p) \rightarrow \infty $; roughly speaking, (7) is necessary and sufficient only when applied to the problem of transmitting a sufficiently large amount of information. The remaining necessary assumptions are of the nature of regularity assumptions, which in concrete cases are usually fulfilled. In order to formulate sufficient conditions for the possibility of transmission in exact terms it is necessary to introduce some supplementary concepts.
A sequence of pairs of random variables $ \{ ( \zeta ^ {t} , \widetilde \zeta {} ^ {t} ) \}_{t=1}^ \infty $ is called information-stable if $ 0 < I ( \zeta ^ {t} , \widetilde \zeta {} ^ {t} ) < \infty $ and
$$ \tag{8 } \lim\limits _ {t \rightarrow \infty } \ \frac{i _ {\zeta ^ {t} \widetilde \zeta {} ^ {t} } ( \zeta ^ {t} , \widetilde \zeta {} ^ {t} ) }{I ( \zeta ^ {t} , \widetilde \zeta {} ^ {t} ) } = 1 $$
in the sense of convergence in probability. Here $ i _ {\zeta ^ {t} \widetilde \zeta {} ^ {t} } ( \cdot , \cdot ) $ is the information density (cf. Information, amount of) of $ ( \zeta ^ {t} , \widetilde \zeta {} ^ {t} ) $. A sequence of channels $ \{ ( Q ^ {t} , V ^ {t} ) \}_{t=1}^ \infty $ with $ C ( Q ^ {t} , V ^ {t} ) < \infty $ is called information-stable if there exists an information-stable sequence of pairs $ ( \eta ^ {t} , \widetilde \eta {} ^ {t} ) $, related through $ ( Q ^ {t} , V ^ {t} ) $, such that
$$ \tag{9 } \lim\limits _ {t \rightarrow \infty } \ \frac{I ( \eta ^ {t} , \widetilde \eta {} ^ {t} ) }{C ( Q ^ {t} , V ^ {t} ) } = 1 . $$
A sequence of messages with probability distributions $ p ^ {t} ( \cdot ) $ and exactness conditions $ W ^ {t} $ with $ H _ {W ^ {t} } ( p ^ {t} ) < \infty $, $ t = 1 , 2 ,\dots, $ is called information-stable if there exists a sequence of pairs $ ( \xi ^ {t} , \widetilde \xi {} ^ {t} ) $ such that $ \xi ^ {t} $ has probability distribution $ p ^ {t} ( \cdot ) $, the probability distribution of $ ( \xi ^ {t} , \widetilde \xi {} ^ {t} ) $ belongs to $ W ^ {t} $, and
$$ \tag{10 } \lim\limits _ {t \rightarrow \infty } \ \frac{I ( \xi ^ {t} , \widetilde \xi {} ^ {t} ) }{H _ {W ^ {t} } ( p ^ {t} ) } = 1 . $$
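For example, if $ ( \zeta ^ {t} , \widetilde \zeta {} ^ {t} ) $ consists of $ t $ independent copies of a fixed pair of letters, then the information density of the pair is the sum of the per-letter densities and $ I ( \zeta ^ {t} , \widetilde \zeta {} ^ {t} ) = t I ( \zeta _ {1} , \widetilde \zeta _ {1} ) $, so the ratio in (8) tends to 1 by the law of large numbers. A sketch with a hypothetical per-letter law (a symmetric binary input passed through a binary symmetric channel):

```python
import numpy as np

rng = np.random.default_rng(3)

# Per-letter joint law: zeta ~ Bernoulli(1/2) sent through a binary symmetric
# channel with crossover probability p.
p = 0.1
P_joint = np.array([[0.5 * (1 - p), 0.5 * p],
                    [0.5 * p,       0.5 * (1 - p)]])   # P(zeta = x, zeta~ = y)
P_x = P_joint.sum(axis=1)
P_y = P_joint.sum(axis=0)

# Per-letter information density i(x, y) = log P(x, y) / (P(x) P(y)) and its mean I_1.
i1 = np.log(P_joint / np.outer(P_x, P_y))
I1 = float((P_joint * i1).sum())

for t in (10, 100, 10_000):
    x = rng.integers(0, 2, size=t)              # zeta^t: t independent input letters
    y = x ^ (rng.random(t) < p)                 # zeta~^t: the corresponding outputs
    ratio = i1[x, y].sum() / (t * I1)           # i(zeta^t, zeta~^t) / I(zeta^t, zeta~^t)
    print(t, round(float(ratio), 3))
```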
Let $ V _ \epsilon $ be the set of probability distributions for which (3) holds with $ b $ replaced by $ b + \epsilon $, and let $ W _ \epsilon $ be the exactness condition given by (1) with $ a $ replaced by $ a + \epsilon $, $ \epsilon > 0 $. The following coding theorem (Shannon) holds: Suppose one is given an information-stable sequence of messages with probability distributions $ p ^ {t} ( \cdot ) $ and exactness conditions $ W ^ {t} $, as well as an information-stable sequence of channels $ ( Q ^ {t} , V ^ {t} ) $, such that the functions $ \pi ^ {t} ( \cdot ) $ and $ \rho ^ {t} ( \cdot , \cdot ) $ are uniformly bounded in $ t $. Let $ H _ {W ^ {t} } ( p ^ {t} ) \rightarrow \infty $ as $ t \rightarrow \infty $ and let
$$ \tag{11 } \liminf _ {t \rightarrow \infty } \ \frac{C ( Q ^ {t} , V ^ {t} ) }{H _ {W ^ {t} } ( p ^ {t} ) } > 1 . $$
Then for any $ \epsilon > 0 $ there exists a $ t _ {0} $ such that for all $ t \geq t _ {0} $ the messages with probability distributions $ p ^ {t} ( \cdot ) $ can be transmitted over the channels $ ( Q ^ {t} , V _ \epsilon ^ {t} ) $ with exactness of reproducibility $ W _ \epsilon ^ {t} $.
This formulation of the direct coding theorem is the most general one. The assumption that $ \pi ^ {t} ( \cdot ) $ and $ \rho ^ {t} ( \cdot , \cdot ) $ are uniformly bounded in $ t $ can be substantially weakened. Information stability of the sequence of messages or channels holds in a large number of particular cases of practical interest. Finally, under certain conditions one may replace $ W _ \epsilon ^ {t} $ by $ W ^ {t} $ and $ V _ \epsilon ^ {t} $ by $ V ^ {t} $ in the formulation of the theorem.
In the description of real situations, the case of most interest is when the sequence of channels considered in the theorem is the sequence of segments of a fixed channel, while the sequence of messages is the sequence of segments of a message from a fixed source with a growing number of components; this corresponds to the functioning of a communication system in time. Below, some versions of the coding theorem and its converse are given for this situation in a somewhat different form; moreover, discrete stationary sources and memoryless communication channels are considered.
Suppose that a discrete stationary source $ U $ produces messages $ \xi = \{ {\xi _ {k} } : {\dots, - 1 , 0 , 1 ,\dots } \} $, where the individual components (the letters of the message) $ \xi _ {k} $ take values in a finite alphabet $ X $ of size $ M $, and that the source generates the letters of the message at the rate of one letter per unit of time. Let the components of the message $ \widetilde \xi = \{ {\widetilde \xi _ {k} } : {\dots , - 1 , 0 , 1 ,\dots } \} $ received by the addressee take values in the same alphabet $ X $ (i.e. $ \widetilde{X} = X $). Suppose further that a discrete memoryless channel is being used, that transmission over it proceeds at the rate of one symbol per time interval $ \tau $, and that there are no restrictions on the distribution of the input signal of the channel. Suppose that a segment of length $ L = L _ {N} $ of a message, $ \xi ^ {L} = ( \xi _ {1} \dots \xi _ {L} ) $, is transmitted over the channel by blocks of $ N = [ L / \tau ] $ channel symbols (where $ [ x] $ is the integer part of a number $ x $), using certain encoding and decoding methods of the type described above. Then if $ \widetilde \xi {} ^ {L} = ( \widetilde \xi _ {1} \dots \widetilde \xi _ {L} ) $ is the corresponding segment of the message obtained by the addressee and $ \widehat{P} _ {e} $ is the average probability of an error per source letter, defined by
$$ \tag{12 } \widehat{P} _ {e} = \frac{1}{L} \sum_{l=1}^ { L } {\mathsf P} \{ \widetilde \xi _ {l} \neq \xi _ {l} \} , $$
then the following theorem holds.
Converse of the coding theorem: Let $ H ( U) $ be the rate of creation of messages of the given discrete stationary source and let $ C $ be the transmission rate (per transmitted symbol) of the memoryless channel used. Then for all $ L $:
$$ \tag{13 } \widehat{P} _ {e} \mathop{\rm log} ( M - 1 ) + h ( \widehat{P} _ {e} ) \geq H ( U) - \frac{C}{\tau} , $$
where
$$ h ( x) = - x \mathop{\rm log} x - ( 1 - x ) \mathop{\rm log} ( 1 - x ) . $$
Thus, if the rate of creation of messages $ H ( U) $ is larger than $ C / \tau $ (the transmission rate of the channel per source letter), then the average probability of an error per source letter is bounded from below by a non-zero constant, whatever $ L $ and the methods of encoding and decoding; hence this probability does not tend to zero as $ L \rightarrow \infty $.
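A small numerical sketch (with hypothetical values of $ H ( U) $, $ C $, $ \tau $ and $ M $, chosen only for illustration, and all logarithms taken to base 2) evaluates the smallest value of $ \widehat{P} _ {e} $ compatible with (13):

```python
import numpy as np

def h2(x):
    """The function h(x) of (13), computed with base-2 logarithms."""
    x = np.clip(x, 1e-15, 1 - 1e-15)
    return -x * np.log2(x) - (1 - x) * np.log2(1 - x)

# Hypothetical numbers: a source with M = 4 letters producing H_U bits per letter,
# a channel of capacity C bits per symbol, one channel symbol per tau time units.
M, H_U, C, tau = 4, 1.8, 0.7, 0.5

deficit = H_U - C / tau                     # right-hand side of (13)
P = np.linspace(0.0, 0.5, 100_001)
lhs = P * np.log2(M - 1) + h2(P)            # left-hand side of (13)

# Smallest average error probability per source letter compatible with (13).
P_min = P[np.argmax(lhs >= deficit)] if deficit > 0 else 0.0
print(f"every code has average letter error probability >= {P_min:.4f}")
```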
In order to formulate the direct statement of a coding theorem one needs the quantity
$$ \tag{14 } R _ {N} = \ \frac{ \mathop{\rm log} _ {2} M ^ {L _ {N} } }{N} . $$
When $ M = 2 $ and $ L _ {N} / \tau $ is an integer one has $ R _ {N} = \tau $, i.e. $ R _ {N} $ is the number of binary symbols (bits, cf. Bit) produced by the source during the transmission time of one symbol over the channel. Moreover, $ R _ {N} $ coincides with $ H ( U) $ if the components are independent and identically distributed. The probability of an error and the average probability of an error in a block of the source are defined, respectively, by
$$ \tag{15 } P _ {e,x ^ {L} } = \ {\mathsf P} _ {x ^ {L} } \{ \widetilde \xi {} ^ {L} \neq \xi ^ {L} \} , $$
$$ \tag{16 } \overline{P}\; _ {e} = \sum _ {x ^ {L} } {\mathsf P} \{ \xi ^ {L} = x ^ {L} \} P _ {e,x ^ {L} } . $$
Here $ {\mathsf P} _ {x ^ {L} } ( \cdot ) $ denotes conditional probability under the condition $ \xi ^ {L} = x ^ {L} \in X ^ {L} $. The following coding theorem holds: For all $ N $ and any $ R < C $ there are encoding and decoding methods such that for $ R _ {N} \leq R $ and all $ x ^ {L} \in X ^ {L} $,
$$ \tag{17 } P _ {e,x ^ {L} } \leq e ^ {- N E ( R) } $$
(this also holds for $ \overline{P}\; _ {e} $). Moreover, for $ 0 \leq R < C $ the function $ E ( R) $ is convex, positive and decreases with increasing $ R $ (see also Erroneous decoding, probability of). Thus, the theorem also shows that for any $ R < C $ the probability of an error tends to zero exponentially fast as $ N \rightarrow \infty $.
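The theorem does not specify the function $ E ( R) $ in closed form; a classical explicit lower bound for it in the memoryless case is Gallager's random coding exponent $ E _ {r} ( R) = \max _ {0 \leq \rho \leq 1 } [ E _ {0} ( \rho ) - \rho R ] $ (see [4]). The sketch below evaluates it for a binary symmetric channel with uniform input distribution, with rates and exponents in natural units; the parameter values are illustrative.

```python
import numpy as np

def random_coding_exponent(R, p, grid=10_001):
    """Gallager's random coding exponent E_r(R) for a binary symmetric channel
    with crossover probability p and uniform input distribution (nats)."""
    rho = np.linspace(0.0, 1.0, grid)
    s = 1.0 / (1.0 + rho)
    # E_0(rho) = rho*log 2 - (1 + rho)*log(p^(1/(1+rho)) + (1-p)^(1/(1+rho)))
    E0 = rho * np.log(2.0) - (1.0 + rho) * np.log(p ** s + (1.0 - p) ** s)
    return float(np.max(E0 - rho * R))

p = 0.1
C = np.log(2.0) + p * np.log(p) + (1 - p) * np.log(1 - p)   # capacity in nats
for R in (0.5 * C, 0.75 * C, 0.9 * C):
    print(f"R = {R:.3f} nats/symbol, E_r(R) = {random_coding_exponent(R, p):.4f}")
```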
There are generalizations of Shannon theorems to the cases of so-called compound channels and messages with independent parameters. Such generalizations are of interest because in practice the statistical parameters of the source of messages and of the communication channel cannot be regarded as completely known, and these parameters can sometimes change in the process of transmission. Therefore, it is appropriate to assume only that the source of the messages and the communication channel belong to certain classes of possible sources and channels. One introduces, moreover, a minimax criterion for the quality of transmission, in which the quality of a given transmission method is estimated for the worst possible sources and channels belonging to the given classes.
There are also generalizations of Shannon theorems to the transmission of information over a channel with feedback. The presence of complete feedback means that at each moment of time $ t $ the exact values of the output signals at all moments $ t ^ \prime < t $ are known at the transmitting side of the channel (i.e. at its input). In particular, for a memoryless channel with feedback the basic result is that the presence of feedback does not increase the transmission rate of the channel, although it may substantially decrease the complexity of the encoding and decoding devices.
Other generalizations to be mentioned are the theory of information transmission over channels with synchronization errors, in which random losses of synchronization are possible that disturb the one-to-one correspondence between the input and output signals, as well as the theory of transmission over multi-way channels, when there are several sources and receivers of information and transmission may proceed in several directions at the same time.
[1] | C. Shannon, "A mathematical theory of communication" Bell System Techn. J. , 27 (1948) pp. 379–423; 623–656 |
[2] | R.L. Dobrushin, "A general formulation of the fundamental theorem of Shannon in information theory" Uspekhi Mat. Nauk , 14 : 4 (1959) pp. 3–104 (In Russian) |
[3] | J. Wolfowitz, "Coding theorems of information theory" , Springer (1964) |
[4] | R. Gallager, "Information theory and reliable communication" , Wiley (1968) |
[5] | A.A. Feinstein, "Foundations of information theory" , McGraw-Hill (1958) |
[6] | R.M. Fano, "Transmission of information. Statistical theory of communications" , M.I.T. (1963) |
[7] | A.A. Kharkevich, "Channels with noise" , Moscow (1965) (In Russian) |
[8] | J.M. Wozencraft, I.M. Jacobs, "Principles of communication engineering" , Wiley (1965) |
[9] | A.N. Kolmogorov, "Three approaches to the definition of the concept of "amount of information" " Probl. Peredachi Inform. , 1 : 1 (1965) pp. 3–11 (In Russian) |
[10] | M.S. Pinsker, "Information and informational stability of random variables and processes" , Holden-Day (1964) (Translated from Russian) |
[11] | B.R. Levin, "Theoretical foundations of statistical radio engineering" , Moscow (1974) (In Russian) |
[12] | A.N. Kolmogorov, "The theory of information transmission" , Meeting of the USSR Acad. Sci. on Scientific Problems of Automation of Production 15–20 Oct. 1956, Talin. , 1 , Moscow (1957) pp. 66–99 (In Russian) |
[13] | D. Slepian (ed.) , Key papers in the development of information theory , IEEE (1974) |
[14] | Proc. 1975 IEEE-USSR Joint Workshop Inform. Theory (Moscow, 15–19 Dec. 1975) , IEEE (1976) |