In Probability Theory, Large Deviations Theory concerns the asymptotic behaviour of remote tails of sequences of probability distributions. Some basic ideas of the theory can be traced back to Laplace and Cramér, although a clear, unified formal definition was introduced in 1966 by Varadhan[1]. Large Deviations Theory formalizes the heuristic ideas of concentration of measure and widely generalizes the notion of convergence of probability measures.
Roughly speaking, Large Deviations Theory concerns itself with the exponential decay of the probabilities of certain kinds of extreme or tail events as the number of observations grows arbitrarily large.
Consider a sequence of independent tosses of a fair coin. Each toss comes up either heads or tails. Let us denote the outcome of the i-th trial by <math> X_i </math>, where we encode heads as -1 and tails as 1. Now let <math> M_N </math> denote the mean value after <math> N </math> trials, namely
<math> M_N = \frac{1}{N}\sum_{i=1}^{N} X_i. </math>
Then <math> M_N </math> lies between -1 and 1. From the law of large numbers (and also from experience) we know that as <math> N </math> grows larger and larger, <math> M_N </math> gets closer and closer to <math> 0 </math> with increasing probability. Let us make this statement more precise. For a given value <math> x>0 </math>, let us compute the probability <math> P(M_N > x) </math> that <math> M_N </math> is greater than <math> x </math>. By Chernoff's inequality (Chebyshev's inequality alone only gives polynomial decay in <math> N </math>) it can be shown that <math> P(M_N > x) < \exp(-x^2N/2) </math>. This bound is rather sharp, in a suitable technical sense. In other words, the probability <math> P(M_N > x) </math> decays exponentially fast as <math> N </math> grows large, at a rate depending on <math> x </math>.
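The exponential decay can be checked directly, because for a fair coin the tail probability is a finite binomial sum. The following is a minimal sketch, not part of the original text; the function names `tail_probability` and `exponential_bound` are illustrative choices, and the value x = 0.2 is an arbitrary example.

```python
from math import comb, exp


def tail_probability(n, x):
    """Exact P(M_N > x) for n fair tosses encoded as -1 (heads) / +1 (tails)."""
    # With k outcomes equal to +1, the mean is M_N = (2k - n) / n.
    favourable = sum(comb(n, k) for k in range(n + 1) if 2 * k - n > x * n)
    return favourable / 2 ** n


def exponential_bound(n, x):
    """The bound exp(-x^2 N / 2) quoted in the text."""
    return exp(-x * x * n / 2)


if __name__ == "__main__":
    x = 0.2  # arbitrary illustrative threshold
    for n in (10, 50, 100, 500, 1000):
        print(n, tail_probability(n, x), exponential_bound(n, x))
```

Both columns shrink as the number of tosses grows, with the exact probability staying below the bound.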
In the coin-tossing example above we tacitly assumed that each toss is an independent trial and that the probability of getting heads or tails is the same for every toss. This makes the random variables <math> X_i </math> independent and identically distributed (i.i.d.). For i.i.d. variables whose common distribution satisfies a certain growth condition, large deviations theory states that the following limit exists:
<math>\lim_{N\to \infty} \frac{1}{N} \log P(M_N > x) = - I(x) </math>
The function <math> I(x) </math> is called the "rate function" or "Cramér function" or sometimes the "entropy function". Roughly speaking, the existence of this limit is what establishes the exponential decay mentioned above and allows us to conclude that for large <math>N</math>, <math> P(M_N >x) </math> takes the form
<math> P(M_N >x) \approx \exp[-NI(x) ],</math>
which is the basic result of Large Deviations Theory in this setting. Note that the inequality given in the first paragraph, as opposed to the asymptotic formula presented here, requires an additional argument.
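For the coin-tossing example the limit can be observed numerically. The sketch below is an illustration under stated assumptions: it uses the closed form <math> I(x) = \tfrac{1}{2}[(1+x)\log(1+x) + (1-x)\log(1-x)] </math> for the fair -1/+1 coin, which is not given in the text above but follows from the Legendre transform of <math> \log\cosh\theta </math>. The names `coin_rate` and `empirical_rate` are hypothetical.

```python
from math import comb, log


def coin_rate(x):
    """Closed-form rate function of the fair -1/+1 coin, valid for |x| < 1."""
    return 0.5 * ((1 + x) * log(1 + x) + (1 - x) * log(1 - x))


def empirical_rate(n, x):
    """-(1/N) log P(M_N > x), computed exactly from binomial counts."""
    favourable = sum(comb(n, k) for k in range(n + 1) if 2 * k - n > x * n)
    # Work with logarithms of integers so that tiny probabilities do not underflow.
    return -(log(favourable) - n * log(2)) / n


if __name__ == "__main__":
    x = 0.2  # arbitrary illustrative threshold
    print("I(x) =", coin_rate(x))
    for n in (100, 500, 2000):
        print(n, empirical_rate(n, x))
```

The finite-N values approach I(x) slowly, since the sub-exponential prefactor hidden by the "≈" sign contributes a correction of order (log N)/N.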
If the probability distribution of <math> X_i </math> is known, an explicit expression for the rate function can be obtained. It is given by a Legendre transform,
<math>I(x) = \sup_{\theta > 0} [\theta x - \lambda(\theta)]</math>
where the function <math> \lambda(\theta) </math> is called the "Cumulant Generating Function (CGF)", given by
<math> \lambda(\theta) = \log E[\exp(\theta X)] </math>
Here <math> E[\cdot] </math> denotes expectation with respect to the probability distribution of <math> X_i </math>, and <math> X </math> is any one of the <math> X_i </math>. If <math> X_i </math> follows a Gaussian distribution, the rate function is a parabola with its vertex at the mean of the Gaussian distribution.
If the i.i.d. condition is relaxed, in particular if the variables <math>X_i</math> are not independent but nevertheless satisfy the Markov property, the basic large deviations result stated above can be generalized.
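The Legendre transform can also be evaluated numerically when no closed form is at hand. The following is a rough sketch rather than a definitive method: it maximizes <math> \theta x - \lambda(\theta) </math> over <math> 0 < \theta < \theta_{\max} </math> by ternary search, which relies on the objective being concave in <math> \theta </math>. The cutoff `theta_max`, the function names, and the Gaussian parameters are assumptions made for illustration.

```python
from math import cosh, log


def legendre_transform(cgf, x, theta_max=50.0, iterations=200):
    """Approximate sup over 0 < theta < theta_max of theta * x - cgf(theta)."""
    def objective(theta):
        return theta * x - cgf(theta)

    lo, hi = 0.0, theta_max
    for _ in range(iterations):
        # Ternary search: discard the third of the interval that cannot hold the maximum.
        m1 = lo + (hi - lo) / 3
        m2 = hi - (hi - lo) / 3
        if objective(m1) < objective(m2):
            lo = m1
        else:
            hi = m2
    return objective((lo + hi) / 2)


def coin_cgf(theta):
    """CGF of the fair -1/+1 coin: log E[exp(theta X)] = log cosh(theta)."""
    return log(cosh(theta))


def gaussian_cgf(theta, mu=1.0, sigma=2.0):
    """CGF of a Gaussian with (hypothetical) mean mu and standard deviation sigma."""
    return mu * theta + 0.5 * sigma ** 2 * theta ** 2


if __name__ == "__main__":
    print(legendre_transform(coin_cgf, 0.2))      # rate function of the coin at x = 0.2
    print(legendre_transform(gaussian_cgf, 3.0))  # parabola (x - mu)^2 / (2 sigma^2) at x = 3
```

For the Gaussian case the numerical value can be compared with the parabola <math> (x-\mu)^2/(2\sigma^2) </math>, which is the sense in which the rate function "becomes a parabola".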
Given a Polish space <math>X</math>, let <math>\{ \mathbb{P}_N\}</math> be a sequence of Borel probability measures on <math>X</math>, let <math>\{a_N\}</math> be a sequence of positive real numbers such that <math>\lim_N a_N=+\infty</math>, and finally let <math>I:X\to [0,+\infty]</math> be a lower semicontinuous functional on <math>X</math>. The sequence <math>\{ \mathbb{P}_N\}</math> is said to satisfy a Large Deviations Principle with speed <math>\{a_N\}</math> and rate <math>I</math> if and only if for each Borel measurable set <math>E \subset X</math>
<math> -\inf_{x \in E^\circ} I(x) \le \liminf_N a_N^{-1} \log \mathbb{P}_N(E) \le \limsup_N a_N^{-1} \log \mathbb{P}_N(E) \le -\inf_{x \in \bar{E}} I(x) , </math>
where <math>\bar{E}</math> and <math>E^\circ</math> denote respectively the closure and interior of <math>E</math>.
The first rigorous results concerning Large Deviations are due to the Swedish mathematician Harald Cramér, who applied them to model the insurance business. From the point of view of an insurance company, income arrives at a constant rate per month (the monthly premium) but claims arrive randomly. For the company to be successful over a certain period of time (preferably many months), the total income should exceed the total claims. Thus, to set the premium, one has to ask the following question: "What should we choose as the premium <math> q </math> such that over <math> N </math> months the total claim <math> C = \sum X_i </math> is less than <math> Nq </math>?" This is clearly the same question asked by large deviations theory. Cramér gave a solution to this question for i.i.d. Gaussian random variables, where the rate function is expressed as a power series. The results we have quoted above were later obtained by H. Chernoff, among other people. A very incomplete list of mathematicians who have made important advances would include S.R.S. Varadhan (who has won the Abel prize), D. Ruelle and O.E. Lanford.
Establishing Large Deviations Principles is one of the most effective ways to extract information from a probabilistic model. Some of the best known applications of Large Deviations Theory arise in Statistical Mechanics, Quantum Mechanics, Information Theory and Risk Management.
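The premium question can be made concrete in a toy setting. The sketch below assumes, purely for illustration, that monthly claims are i.i.d. Gaussian with mean `mu` and standard deviation `sigma`, so that the rate function is the parabola <math> I(q) = (q-\mu)^2/(2\sigma^2) </math>, and it picks the premium for which the exponential estimate <math> \exp[-N I(q)] </math> equals a target level `epsilon`. The function names and all numbers are hypothetical and are not taken from Cramér's work.

```python
from math import exp, log, sqrt


def premium_for_target(mu, sigma, n_months, epsilon):
    """Premium q > mu solving exp(-N (q - mu)^2 / (2 sigma^2)) = epsilon."""
    return mu + sigma * sqrt(2 * log(1 / epsilon) / n_months)


def shortfall_estimate(mu, sigma, n_months, q):
    """Exponential estimate exp(-N I(q)) of the chance that total claims exceed N q."""
    return exp(-n_months * (q - mu) ** 2 / (2 * sigma ** 2))


if __name__ == "__main__":
    # Hypothetical figures: mean claim 100, standard deviation 20, horizon 12 months.
    q = premium_for_target(100.0, 20.0, 12, 0.01)
    print(q, shortfall_estimate(100.0, 20.0, 12, q))
```

The required margin above the mean claim shrinks like <math> 1/\sqrt{N} </math> as the horizon grows, which reflects the exponential decay in <math> N </math>.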
The rate function is related to the entropy in statistical mechanics. This can be seen heuristically in the following way. In statistical mechanics the entropy of a particular macro-state is related to the number of micro-states which correspond to this macro-state. In our coin-tossing example the mean value <math> M_N </math> could designate a particular macro-state, and a particular sequence of heads and tails which gives rise to that value of <math> M_N </math> constitutes a particular micro-state. Loosely speaking, a macro-state with a larger number of micro-states giving rise to it has higher entropy, and a state with higher entropy has a greater chance of being realised in actual experiments. With heads encoded as -1 and tails as 1, the macro-state with mean value zero (as many heads as tails) has the highest number of micro-states giving rise to it, and it is indeed the state with the highest entropy. In most practical situations we shall indeed obtain this macro-state for a large number of trials. The "rate function", on the other hand, measures the probability of appearance of a particular macro-state. The smaller the rate function, the higher is the chance of a macro-state appearing. In our coin-tossing example the value of the "rate function" for mean value equal to zero is zero. In this way one can see the "rate function" as the negative of the "entropy".
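The counting argument can be illustrated numerically for the -1/+1 encoding used in the coin-tossing example. In the sketch below, the number of micro-states with mean <math> x </math> is the binomial coefficient counting sequences with <math> N(1+x)/2 </math> outcomes equal to +1, and its logarithm per toss is compared with <math> \log 2 - I(x) </math>. The closed form of <math> I(x) </math> for the fair coin and the function names are assumptions introduced for this illustration.

```python
from math import comb, log


def coin_rate(x):
    """Closed-form rate function of the fair -1/+1 coin, valid for |x| < 1."""
    return 0.5 * ((1 + x) * log(1 + x) + (1 - x) * log(1 - x))


def entropy_per_toss(n, x):
    """(1/N) log of the number of micro-states whose mean is (approximately) x."""
    k = round(n * (1 + x) / 2)  # number of +1 outcomes in such a sequence
    return log(comb(n, k)) / n


if __name__ == "__main__":
    n = 2000
    for x in (0.0, 0.2, 0.5):
        print(x, entropy_per_toss(n, x), log(2) - coin_rate(x))
```

The macro-state with mean zero has the largest count, and the entropy per toss falls below <math> \log 2 </math> by approximately <math> I(x) </math>, so up to the additive constant <math> \log 2 </math> the rate function plays the role of a negative entropy.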