A general theory for the processing and use of statistical observations. In a broader interpretation of the term, statistical decision theory is the theory of choosing an optimal non-deterministic behaviour in incompletely known situations.
Inverse problems of probability theory are a subject of mathematical statistics. Suppose that a random phenomenon occurs, described qualitatively by the measure space $(\Omega, \mathcal{A})$ of all its elementary events $\omega$ and quantitatively by a probability distribution $P$ of the events. The statistician knows only this qualitative description, and has only incomplete information on $P$ of the type $P \in \mathcal{P}$, where $\mathcal{P}$ is a family of probability distributions. By making one or more observations of the phenomenon and processing the data thus obtained, the statistician has to make a decision on $P$ and choose the most profitable way to proceed (in particular, it may be decided that insufficient material has been collected and that the set of observations has to be extended before final inferences are made). In classical problems of mathematical statistics, the number of independent observations (the size of the sample) was fixed and optimal estimators of the unknown distribution $P$ were sought.

The general modern conception of a statistical decision is attributed to A. Wald (see [2]). It is assumed that every experiment has a cost which has to be paid for, and the statistician must meet the loss of a wrong decision by paying the "fine" corresponding to his error. Therefore, from the statistician's point of view, a decision rule (procedure) $\Pi$ is optimal when it minimizes the risk $R = R(P, \Pi)$, the mathematical expectation of his total loss. This approach was proposed by Wald as the basis of statistical sequential analysis and led to the creation, in statistical quality control, of procedures which, with the same accuracy of inference, use on average almost half the number of observations of the classical decision rule. In the formulation described, any statistical decision problem can be seen as a two-player game in the sense of J. von Neumann, in which the statistician is one of the players and nature is the other (see [3]). However, as early as 1820, P. Laplace had likewise described a statistical estimation problem as a game of chance in which the statistician is defeated if his estimates are bad.
The value of the risk $R(P, \Pi)$ depends both on the decision rule $\Pi$ and on the probability distribution $P$ that governs the distribution of the results of the observed phenomenon. As this "true" value of $P$ is unknown, the entire risk function $P \mapsto R(P, \Pi)$, considered on the given family $\mathcal{P}$, has to be minimized with respect to $\Pi$.

A decision rule $\Pi_1$ is said to be uniformly better than $\Pi_2$ if $R(P, \Pi_1) \le R(P, \Pi_2)$ for all $P \in \mathcal{P}$ and $R(P, \Pi_1) < R(P, \Pi_2)$ for at least one $P \in \mathcal{P}$. A decision rule $\Pi$ is said to be admissible if no uniformly-better decision rule exists. A class $C$ of decision rules is said to be complete (essentially complete) if for any decision rule $\Pi \notin C$ there is a uniformly-better (not worse) decision rule $\Pi' \in C$.
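When both the family $\mathcal{P}$ and the set of candidate rules are finite, these order relations can be checked mechanically. The sketch below uses made-up risk values purely for illustration; a rule is admissible when no other candidate is uniformly better:

```python
# Admissibility check for finitely many decision rules over a finite family P.
# Rows of `risks` are rules, columns are the risks R(P, rule) at each P in P.
# (The numbers are illustrative, not taken from any real problem.)

def uniformly_better(r1, r2):
    """True if the rule with risks r1 is uniformly better than the one with r2:
    r1 <= r2 at every P, with strict inequality for at least one P."""
    return all(a <= b for a, b in zip(r1, r2)) and any(a < b for a, b in zip(r1, r2))

def admissible(risks):
    """Indices of rules not uniformly dominated by any other candidate rule."""
    return [i for i, ri in enumerate(risks)
            if not any(uniformly_better(rj, ri)
                       for j, rj in enumerate(risks) if j != i)]

risks = [
    [1.0, 4.0],   # rule 0: good at P1, bad at P2
    [2.0, 2.0],   # rule 1: balanced
    [4.0, 1.0],   # rule 2: good at P2, bad at P1
    [3.0, 3.0],   # rule 3: uniformly dominated by rule 1
]
print(admissible(risks))  # rule 3 is inadmissible
```

Note that rules 0, 1 and 2 are mutually incomparable in this order, which is exactly why an additional functional (maximum risk, Bayesian risk) is needed to single out one of them.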
The most important is a minimal complete class of decision rules, which coincides (when it exists) with the set of all admissible decision rules. If the minimal complete class contains precisely one decision rule, then that rule is optimal. Generally, the risk functions corresponding to admissible decision rules must also be compared by the value of some other functional, for example, the maximum risk. The optimal decision rule $\Pi^*$ in this sense,

$$\max_{P} R(P, \Pi^*) = \min_{\Pi} \max_{P} R(P, \Pi),$$

is called the minimax rule. Comparison using the Bayesian risk is also possible:

$$r(\mu, \Pi) = \int_{\mathcal{P}} R(P, \Pi) \, \mu(dP),$$

that is, averaging the risk over an a priori probability distribution $\mu$ on the family $\mathcal{P}$.
This choice of functional is natural, especially when sets of experiments are repeated with a fixed marginal distribution $P_k$ in the $k$-th set, whereas the $P_k$ prove to be a random series of measures with unknown distribution $\mu$ (see Bayesian approach). The optimal decision rule in this sense,

$$r(\mu, \Pi_\mu) = \min_{\Pi} r(\mu, \Pi),$$

is called the Bayesian decision rule with a priori distribution $\mu$.
Finally, an a priori distribution $\mu_0$ is said to be least favourable (for the given problem) if

$$\inf_{\Pi} r(\mu_0, \Pi) = \sup_{\mu} \inf_{\Pi} r(\mu, \Pi).$$

Under very general assumptions it has been proved that: 1) for any a priori distribution $\mu$, a Bayesian decision rule exists; 2) the totality of all Bayesian decision rules and their limits forms a complete class; and 3) minimax decision rules exist and are Bayesian rules relative to the least-favourable a priori distribution, with

$$\max_{P} R(P, \Pi^*) = \sup_{\mu} \inf_{\Pi} r(\mu, \Pi)$$

(see [4]). The concrete form of optimal decision rules depends essentially on the type of statistical problem. However, in classical problems of statistical estimation the optimal decision rule for large samples depends only weakly on the chosen method of comparing risk functions.
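For a finite set of states of nature and a finite list of rules, the minimax and Bayesian criteria reduce to elementary optimization, and the least-favourable prior can be found by search. A minimal sketch with illustrative risk values (the rule names and numbers are hypothetical):

```python
# Minimax and Bayes rules for a toy problem: two states of nature P1, P2
# and three decision rules with illustrative risk values.

risks = {            # rule -> (R(P1, rule), R(P2, rule))
    "a": (1.0, 4.0),
    "b": (2.0, 2.0),
    "c": (4.0, 1.0),
}

# Minimax rule: minimize the worst-case risk max_P R(P, rule).
minimax_rule = min(risks, key=lambda k: max(risks[k]))

def bayes_risk(mu, r):
    """Bayesian risk r(mu, rule) for a prior mu on P1 (so 1 - mu on P2)."""
    return mu * r[0] + (1 - mu) * r[1]

def bayes_rule(mu):
    """Bayesian decision rule for the prior mu."""
    return min(risks, key=lambda k: bayes_risk(mu, risks[k]))

# Least-favourable prior: maximize the minimal Bayes risk (grid search).
grid = [i / 1000 for i in range(1001)]
mu0 = max(grid, key=lambda m: min(bayes_risk(m, r) for r in risks.values()))

print(minimax_rule)     # "b": its worst-case risk 2 beats 4 for "a" and "c"
print(bayes_rule(0.9))  # a prior concentrated on P1 favours rule "a"
```

With these numbers the minimal Bayes risk at `mu0` equals the minimax risk 2, illustrating assertion 3); the least-favourable prior is not unique here (any prior between 1/3 and 2/3 works).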
Decision rules in problems of statistical decision theory can be deterministic or randomized. Deterministic rules are defined by functions, for example by a measurable mapping of the space $\Omega^n$ of all samples $(\omega_1, \ldots, \omega_n)$ of size $n$ into a measurable space $(\Delta, \mathcal{B})$ of decisions $\delta$. Randomized rules are defined by Markov transition probability distributions of the form $\Pi(d\delta \mid \omega_1, \ldots, \omega_n)$ from $\Omega^n$ into $(\Delta, \mathcal{B})$, which describe the probability distribution according to which the selected value $\delta$ must additionally be independently "chosen" (see Statistical experiments, method of; Monte-Carlo method). Allowing randomized procedures makes the set of decision rules of the problem convex, which greatly facilitates theoretical analysis. Moreover, there exist problems in which the optimal decision rule is randomized. Even so, statisticians try to avoid randomized rules in practice whenever possible, since the use of tables or other sources of random numbers for "determining" inferences complicates the work and may even seem unscientific.
A statistical decision rule is by definition a transition probability distribution from a certain measurable space $(\Omega, \mathcal{A})$ of results of the experiment into a measurable space $(\Delta, \mathcal{B})$ of decisions. Conversely, every transition probability distribution $\Pi$ can be interpreted as a decision rule in any statistical decision problem with a measurable space $(\Omega, \mathcal{A})$ of results and a measurable space $(\Delta, \mathcal{B})$ of inferences (it can also be interpreted as a memoryless communication channel with input alphabet $\Omega$ and output alphabet $\Delta$).
The statistical decision rules form an algebraic category whose objects are the totalities of all probability distributions on measurable spaces $(\Omega, \mathcal{A})$, and whose morphisms are the transition probability distributions. The invariants and equivariants of this category define many natural concepts and laws of mathematical statistics (see [5]). For example, an invariant Riemannian metric, unique up to a factor, exists on the objects of this category; it is defined by the Fisher information matrix. The morphisms of the category generate equivalence and order relations for parametrized families of probability distributions and for statistical decision problems, which permits one to give a natural definition of a sufficient statistic. The Kullback non-symmetrical information deviation $I(Q : P)$, which characterizes the dissimilarity of the probability distributions $Q$ and $P$ (see Information distance), is a monotone invariant in the category:

$$I(Q : P) \ge I(Q' : P') \quad \text{if} \quad (Q', P') = (\Pi Q, \Pi P),$$

i.e. if $Q' = \Pi Q$ and $P' = \Pi P$ for a certain morphism $\Pi$.
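The monotonicity is easy to verify numerically on finite spaces, where a morphism is a stochastic matrix acting on distributions. A minimal sketch (the distributions and the matrix are arbitrary illustrative choices):

```python
# Monotonicity of the Kullback deviation I(Q : P) under a morphism Pi:
# pushing both distributions through the same transition matrix cannot increase I.

from math import log

def kl(q, p):
    """Kullback non-symmetrical deviation I(Q : P) = sum_i q_i log(q_i / p_i)."""
    return sum(qi * log(qi / pi) for qi, pi in zip(q, p) if qi > 0)

def push(dist, channel):
    """Image Q' = Pi Q of a distribution under a transition matrix."""
    return [sum(dist[i] * channel[i][j] for i in range(len(dist)))
            for j in range(len(channel[0]))]

Q = [0.7, 0.2, 0.1]
P = [0.3, 0.3, 0.4]
Pi = [[0.9, 0.1],        # rows are the conditional distributions Pi(. | omega_i)
      [0.5, 0.5],
      [0.2, 0.8]]

before = kl(Q, P)
after = kl(push(Q, Pi), push(P, Pi))
print(before, after)     # the deviation shrinks under the morphism
assert after <= before
```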
If in the problem of statistical estimation from a sample of fixed size $n$ there is a need to estimate the actual marginal probability distribution $P$ of the results of observations, which belongs a priori to a smooth family $\mathcal{P}$, then, given the choice of $2 I(\hat{P} : P)$ for an invariant loss function for the decision $\hat{P}$, the minimax risk proved to be

$$R_n = n^{-1} \dim \mathcal{P} \, (1 + o(1)), \qquad n \to \infty,$$

where $\dim \mathcal{P}$ is the dimension of the family.
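The order of this asymptotics (though not the minimax property itself) can be illustrated on the one-dimensional Bernoulli family, where the expected loss of a reasonable estimator is close to $\dim \mathcal{P} / n = 1/n$. The slightly smoothed estimator $\hat{p} = (k + 1/2)/(n + 1)$ below is a hypothetical choice, used only to keep the deviation finite at the boundary outcomes $k = 0$ and $k = n$:

```python
# Exact expected loss 2*E[I(P_hat : P)] for the Bernoulli family (dim = 1),
# computed by summing over all binomial outcomes k = 0, ..., n.

from math import comb, log

def kl_bern(q, p):
    """I(Q : P) for two Bernoulli laws with success probabilities q and p."""
    return q * log(q / p) + (1 - q) * log((1 - q) / (1 - p))

def expected_loss(n, p):
    total = 0.0
    for k in range(n + 1):
        pmf = comb(n, k) * p**k * (1 - p)**(n - k)   # binomial probability of k
        p_hat = (k + 0.5) / (n + 1)                  # smoothed estimate, never 0 or 1
        total += pmf * 2 * kl_bern(p_hat, p)
    return total

n, p = 500, 0.3
risk = expected_loss(n, p)
print(risk, 1 / n)   # the expected loss is close to dim P / n = 0.002
```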
The logic of quantum events is not Aristotelean; random phenomena of microphysics are therefore not a subject of classical probability theory. The formalism designed to describe them accepts the existence of non-commuting random variables and contains the classical theory as a degenerate commutative scheme. In the corresponding interpretation, many problems of the theory of quantum-mechanical measurements become non-commutative analogues of problems of statistical decision theory (see [6]).
References
[1] A. Wald, "Sequential analysis", Wiley (1947)
[2] A. Wald, "Statistical decision functions", Wiley (1950)
[3] J. von Neumann, O. Morgenstern, "The theory of games and economic behavior", Princeton Univ. Press (1944)
[4] E.L. Lehmann, "Testing statistical hypotheses", Wiley (1986)
[5] N.N. Chentsov, "Statistical decision rules and optimal inference", Amer. Math. Soc. (1982) (Translated from Russian)
[6] A.S. Kholevo, "Probabilistic and statistical aspects of quantum theory", North-Holland (1982) (Translated from Russian)