Some random distributions

Even random processes follow some rules, despite not being entirely predictable. Given a straightforward understanding of the nature of the process, it's often possible to deduce quite a lot about its behaviour. Much of what we can deduce about its behaviour can be expressed in terms of a probability distribution. In fact, two of these are sufficiently common and interesting to warrant their own pages: the gaussian (a.k.a. normal) and gamma distributions.

Simple distributions

The simplest case to describe is a distribution on pure numbers. Subject to, at most, a choice of scaling, this suffices to describe many one-parameter variates, such as the height distribution of a population, the length of time one waits at a bus stop or the number of people with whom you're going to have to share that big lottery win when you finally get it.

Where the pure number in question is actually a whole number (in which case there's a natural choice for scaling, if any is needed at all) it's usually (but not always) a natural number (i.e. non-negative); the analysis in this case is discrete, with each possible outcome having a probability and relevant quantities being computed by summing the products of these probabilities with various functions of the variate. Otherwise, one has a real-valued variate, each individual outcome has formally zero probability and we can only discuss probabilities for the variate falling in an interval (or union of intervals); these can be expressed in terms of a measure, which can usually be represented by a density function that we integrate over an interval to obtain the probability of the variate falling in that interval. Relevant quantities are then computed in the same way as for the discrete case, but substituting the density for the probabilities and integration for summation.
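As a minimal sketch (not part of the page itself) of those two recipes, here is a discrete expected value computed as a sum of probability times value, and a continuum one computed as an integral of density times value; the fair die and the uniform density on the unit interval are arbitrary illustrations.

    from fractions import Fraction

    # Discrete case: each outcome has a probability; sum probability times value.
    die = {face: Fraction(1, 6) for face in range(1, 7)}
    mean_die = sum(prob * face for face, prob in die.items())
    print(mean_die)                      # 7/2

    # Continuum case: integrate density times value; a crude midpoint rule
    # stands in for the integral, with density 1 on the unit interval.
    steps = 10000
    width = 1.0 / steps
    def density(x):
        return 1.0 if 0.0 <= x <= 1.0 else 0.0
    mean_uniform = sum(density((i + 0.5) * width) * ((i + 0.5) * width) * width
                       for i in range(steps))
    print(mean_uniform)                  # close to 0.5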

Geometric distribution

Suppose I select one face of a coin and toss it repeatedly until it comes down with that face upwards, counting how many tosses it takes. I could likewise pick one face of a die and roll it repeatedly until it lands with that face upwards, counting the number of rolls. In each case, I perform a sequence of trials (coin tosses, die rolls) and the outcome of each trial is independent of the outcomes of prior trials (although whether I bother to make the next trial may depend on the earlier trials). It's taken as given that each trial's probability of producing the waited-for outcome is the same as any other trial's. The geometric distribution describes how many trials I must do before the waited-for result arises.

In the general case, I have some random trial which, with probability p, yields a chosen outcome; and I count how many times I must repeat the trial before that outcome arises. The probability that the chosen outcome happens the very first time is simply p. The probability that it doesn't happen in the first n trials is power(n, 1−p), so the probability that it happens on trial 1+n is simply p.power(n, 1−p), with the very first trial being the special case with n = 0. Since sum(: power(n, q) ←n |{naturals}) is 1/(1−q) for 1>q>−1, taking q = 1−p yields p.sum(: power(n, 1−p) ←n |{naturals}) = 1, as required.
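A small sketch (my own check, not part of the original discussion) backs this up numerically: the probabilities p.power(n, 1−p) very nearly sum to 1 when truncated at a large n, and a direct simulation of repeated trials produces matching frequencies. The value p = 0.3 is an arbitrary choice.

    import random

    p = 0.3
    # P(N = 1+n) = p*(1-p)**n for n = 0, 1, 2, ...; truncate the sum at n = 200.
    print(sum(p * (1 - p) ** n for n in range(200)))    # very nearly 1

    def trials_until_success(p):
        """Count trials, each succeeding with probability p, up to the first success."""
        count = 1
        while random.random() >= p:
            count += 1
        return count

    # The simulated frequency of needing exactly three trials matches p*(1-p)**2.
    runs = 100000
    hits = sum(1 for _ in range(runs) if trials_until_success(p) == 3)
    print(hits / runs, p * (1 - p) ** 2)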

The expected number of trials I must make is then p.sum(: (1+n).power(n, 1−p) ←n |{naturals}), so consider f = (: sum(: power(n, t) ←n |{naturals}) ←t :) and observe that

f'(t)
= sum(: n.power(n−1, t) ←n |{naturals})
= 0 + sum(: (1+n).power(n, t) ←1+n |{naturals})

so p.f'(1−p) is exactly the expected value we were looking for. However, since (for 1>t>−1) f(t) = 1/(1−t) = power(−1, 1−t), we can compute f'(t) = power(−2, 1−t) and infer that our expected value is p/p/p = 1/p. This makes sense: the probability per trial of success is p, so we expect one success per roughly 1/p trials.
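As a quick numerical check (again my own, not the page's), truncating the series p.sum(: (1+n).power(n, 1−p) ←n |{naturals}) and evaluating p.f'(1−p) directly both land on 1/p; p = 0.3 remains an arbitrary choice.

    p = 0.3
    # Truncated series for the expected number of trials, p*sum((1+n)*(1-p)**n) ...
    series = p * sum((1 + n) * (1 - p) ** n for n in range(500))
    # ... against p*f'(1-p), where f'(t) = 1/(1-t)**2, which simplifies to 1/p.
    print(series, p / (1 - (1 - p)) ** 2, 1 / p)   # all about 3.333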

Let N be the number of trials we actually end up making; it's a random variate and the probability that N is 1+n is p.power(n, 1−p). For any natural m, the expected value of (N+m)!/(N−1)! is p.sum(: (n+1+m)!.power(n, 1−p)/n! ←n |{naturals}), which is p times the (1+m)-th derivative of f, evaluated at 1−p, namely (1+m)!/power(p, 1+m). The case m = 0 is simply the expected value of N, seen above. The case m = 1 gives us the expected value of N.(1+N) as 2/p/p, so the expected value of N.N is 2/p/p −1/p and the variance of N is (1−p)/p/p. Higher values of m imply expected values for the higher powers of N; the expected powers of N are also known as moments of N. Because the successive derivatives of f yield the moments, f is known as the moment-generating function for N.
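Here is a sketch (my own verification, not part of the page) comparing the truncated series for the expected value of (N+m)!/(N−1)! against (1+m)!/power(p, 1+m), then recovering the variance from the m = 1 case; p = 0.3 is an arbitrary choice.

    from math import factorial

    p = 0.3

    def expected_rising(m, terms=400):
        """Truncated series p*sum((n+1+m)!*(1-p)**n / n!) for E[(N+m)!/(N-1)!]."""
        return p * sum((factorial(n + 1 + m) // factorial(n)) * (1 - p) ** n
                       for n in range(terms))

    for m in range(3):
        print(expected_rising(m), factorial(1 + m) / p ** (1 + m))

    # From the m = 1 case: E[N*N] = 2/p/p - 1/p, so the variance is (1-p)/p/p.
    mean = 1 / p
    second_moment = 2 / p ** 2 - 1 / p
    print(second_moment - mean ** 2, (1 - p) / p ** 2)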

Poisson distribution

If some event has a certain fixed probability per unit time of happening, we can ask how many times it happens in any given interval of time; the Poisson distribution describes this count. We can also ask for the distribution of the intervals between the times when it does happen. Where the geometric distribution was discrete (its random variate could only take whole values), this waiting-time distribution is a continuum distribution (its random variate can take any positive real value, not just a whole value); the count of events in a given interval, by contrast, remains a discrete variate.
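A simulation sketch (my own illustration; the rate, step size and run length are arbitrary choices) models the event as having probability rate·dt in each short time step, then looks both at how many events land in each unit-length window and at the gaps between successive events.

    import random

    rate = 2.0            # expected events per unit time
    dt = 0.001            # length of one short time step
    steps = 1_000_000     # a thousand units of time in total

    # Times (in steps) at which the event happened.
    events = [t for t in range(steps) if random.random() < rate * dt]

    # Counts per unit-time window: a discrete variate whose mean is about rate.
    window = int(1 / dt)
    counts = [0] * (steps // window)
    for t in events:
        counts[t // window] += 1
    print(sum(counts) / len(counts))           # about 2.0

    # Gaps between successive events: a continuum variate, mean about 1/rate.
    gaps = [(b - a) * dt for a, b in zip(events, events[1:])]
    print(sum(gaps) / len(gaps))               # about 0.5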

Multi-dimensional distributions

Simple one-dimensional distributions are fine when we only have one thing to measure, but what about when we have several ? A natural answer is to model these using a vector space, with one co-ordinate for each variate; we then have a single vector-valued variate to encode our data. As ever, the mode of a distribution is simply wherever it takes its highest value; this definition works just fine in a vector space.

When a variate's values lie in a vector space, we can carry over the definition of the mean quite straightforwardly from the one-dimensional case; scale each possible outcome (now a vector value the variate may take, rather than just a number) by its probability (or by the variate's distribution's density for that outcome) and sum (or integrate) over possible outcomes. The result shall be a vector, since it's obtained by scaling a bunch of vectors and summing. We can subtract this mean from any value the variate is capable of taking; both the specification of variance and the gaussian distribution call for us to do this and to square the result. This obliges us, in the multi-dimensional case, to ask what the square of a vector might be.
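Before turning to squares, here is a small sketch of the mean recipe just described; the sample vectors, and the equal weights standing in for the probabilities or density, are made up purely for illustration.

    # Mean of a vector-valued variate: scale each possible (vector) value by its
    # weight and sum; the result is itself a vector.
    samples = [(1.0, 2.0), (3.0, 0.0), (2.0, 4.0), (0.0, 2.0)]
    weight = 1.0 / len(samples)          # equal weights stand in for probabilities
    dim = len(samples[0])
    mean = tuple(sum(weight * v[i] for v in samples) for i in range(dim))
    print(mean)                          # (1.5, 2.0)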

One common square of a vector is its inner product with itself, its squared length; but this depends on a choice of metric (or, equivalently, of basis). In general, a square of a vector is the result of supplying the vector as both inputs to some bilinear map on our vector space, V; and any bilinear map on V may be factorised via the tensor product of V with itself, V⊗V. Thus the most general square of a vector v in V that we can come up with is in fact the tensor v⊗v in V⊗V. Using this we can carry over the usual definition of the variance of our random variate, averaging (v−m)⊗(v−m) over possible (vector) values v of our variate, with m being the (vector) mean of our variate. The result is a tensor in V⊗V; as a scaled sum of squares, it is necessarily symmetric so there is some basis of V for which it is diagonal.
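Continuing the sketch above (same made-up samples and equal weights), the variance comes out as an array of components of a tensor in V⊗V, built by averaging the tensor squares (v−m)⊗(v−m):

    samples = [(1.0, 2.0), (3.0, 0.0), (2.0, 4.0), (0.0, 2.0)]
    weight = 1.0 / len(samples)
    dim = len(samples[0])
    mean = [sum(weight * v[i] for v in samples) for i in range(dim)]

    # Average of (v - mean) tensor (v - mean), as a dim-by-dim array of components.
    variance = [[0.0] * dim for _ in range(dim)]
    for v in samples:
        d = [v[i] - mean[i] for i in range(dim)]
        for i in range(dim):
            for j in range(dim):
                variance[i][j] += weight * d[i] * d[j]

    print(variance)    # symmetric, so diagonal in a suitable basis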

The variance of a vector-valued variate is thus a quadratic form on the dual of the vector space of values of the variate. Another such tensor quantity we can define, which encodes correlations among the components of the variate, is a double average, over two variables ranging over the vector space, of the product of their differences from the mean, scaled by the variate's density at each.

So much for mode, mean and variance: what about median ? For a simple variate, the median is the mid-point for which the distribution's totals on either side of it are equal. One can stretch the definition of median to apply to vector-valued variates, but there is then no guarantee that there exists a median: indeed, having a median is a strong symmetry constraint on a distribution. None the less, some distributions do have medians; most obviously, the multi-dimensional gaussian, whose median is its mean.


Written by Eddy.