Median: a statistical mid-point

A median of a distribution P on {reals} (or any sub-set of it) is defined to be a value m for which P({real x: x < m}) ≤ P({reals})/2 ≥ P({real x: x > m}). As such, it is only defined in so far as P({reals}) is finite.

A median is an average of the distribution: it gives an indication of a typical value, albeit a different one than is given by the mean (multiply each outcome by its frequency, sum the result and divide by the total frequency; or use P to integrate (: x ←x :{reals}) and divide the result by P({reals})) or a mode (a value of maximal relative frequency, or at which the density associated with the distribution is maximal). Like any simple single number, it contains vastly less information than the distribution it summarizes; if you really want to understand the system, you should look to the distribution itself, not to any summary datum.
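
To make the comparison concrete, here is a minimal Python sketch (the weighted sample is an invented example), reading P(U) as the total weight of sample values falling in U, that computes all three averages of a finite weighted data-set:

    # Minimal sketch: median, mean and mode of a small weighted sample.
    # The data are invented purely for illustration.
    from fractions import Fraction

    # (value, weight) pairs; P(U) is the sum of weights of values in U.
    sample = [(1, 3), (2, 5), (3, 1), (10, 1)]
    total = sum(w for _, w in sample)

    def P(pred):
        """Weight of the set {x: pred(x)}."""
        return sum(w for x, w in sample if pred(x))

    def is_median(m):
        # P({x < m}) <= P({reals})/2 >= P({x > m})
        return 2 * P(lambda x: x < m) <= total and 2 * P(lambda x: x > m) <= total

    mean = Fraction(sum(x * w for x, w in sample), total)
    mode = max(sample, key=lambda vw: vw[1])[0]
    medians = [x for x, _ in sample if is_median(x)]

    print("mean:", mean)        # 13/5, i.e. 2.6
    print("mode:", mode)        # 2, the value of largest weight
    print("medians among sample values:", medians)  # [2]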

Preliminary remarks

A distribution on a set S integrates (which may actually mean a combination of integrating and summation) a function to yield a value, the distribution's integral of the function. When the function's outputs lie in some real-linear space V, so does the distribution's integral of the function; in particular, {reals} is a real-linear space. The restriction of some function to some sub-set of S can be interpreted as taking the function's value on the sub-set and taking the value zero (in the same vector space as the function's outputs) on the rest of S. A distribution on S may be fussy about which functions it integrates over which sub-sets of S; where it does integrate some function f over some subset U of S, (:f:U) is described as measurable and, in the case where f is 1 ←x, so is U. In our case, I'm taking S = {reals} and the only functions I'm actually interested in integrating are the restrictions of 1 ←x to various sub-sets of {reals}. For any measurable sub-set U of {reals}, P(U) is the integral of 1 ←x over U; its value is a real.

I'll describe a subset of {reals} as an interval precisely if it's {reals}, {real x: x < a}, {real x: x > a}, for some real a, the complement of any of these within the reals or any intersection of such intervals. I require that intervals be measurable (i.e. the restriction of 1 ←x to them must be integrable), hence valid inputs to P.

Any half-way decent distribution on {reals} can in fact integrate all continuous functions on a broad class of sub-sets of {reals}, obtainable by taking unions of intervals and intersections of such unions. In particular, one of the continuous functions a half-way decent distribution can integrate is x ←x, whose integral over {reals} is the mean of the distribution, mentioned above. However, for the discussion of the median, we only actually need to integrate 1 ←x, the constant function with output 1 regardless of input, over intervals.

The distribution may represent an actual data-set of samples from some population of data, with P(U) being the number of samples that fall in the set U; or we may have the probability distribution of some random variate, with P(U) being the probability of the variate taking a value in U. In the latter case, P({real x: x < m}) is the probability that the variate takes a value less than m, P({real x: x > m}) is the probability that it takes a value greater than m; their sum falls short of P({reals}) by P({m}), which is the probability that the variate takes exactly the value m. For a continuum distribution, with a finite density function, any such single-point P({m}) is always zero; I'll refer to this as the continuum case, below. It is in the nature of distributions (generally) that P(U) ≥ 0 for every measurable U and that P(U∪V) +P(U∩V) = P(U) +P(V) whenever U, V, U∪V and U∩V are all measurable; in particular, P(U∪V) = P(U) +P(V) when U and V are disjoint.

In particular, for any real m, we can partition {reals} into the single-point set {m} and the sets of values on either side of it; these three are disjoint and the last two are what the definition above constrains, for m to be a median. For brevity in what follows, define L(m) = P({real x: x < m}) and G(m) = P({real x: x > m}).

Our partition about m yields L(m) +G(m) +P({m}) = P({reals}), with P({m}) ≥ 0; for a continuum distribution, furthermore, P({m}) = 0. We thus have L(m) +G(m) ≤ P({reals}) for all real m, with equality in the continuum case. For m to be a median, L(m) and G(m) must each be at most half of P({reals}); subtracting either from P({reals}) thus leaves at least half of P({reals}), and that remainder is the other plus P({m}); so L(m) +P({m}) and G(m) +P({m}) are each at least half of P({reals}). Thus, for m to be a median, each of L(m) and G(m) must be at most half of P({reals}) and within P({m}) below it.
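
A quick numeric check of this book-keeping, as a Python sketch on an invented discrete sample with an atom at its median:

    # Sketch: check L(m) + G(m) + P({m}) = P({reals}) and the median bounds
    # on an invented discrete sample of (value, weight) pairs.
    sample = [(0, 2), (1, 3), (2, 1)]            # P({reals}) = 6, so half is 3
    total = sum(w for _, w in sample)

    def P(pred):
        return sum(w for x, w in sample if pred(x))

    m = 1                                        # candidate median
    L, G, atom = P(lambda x: x < m), P(lambda x: x > m), P(lambda x: x == m)

    assert L + G + atom == total                 # the partition identity
    assert 2 * L <= total and 2 * G <= total     # m is a median
    # each of L, G is within P({m}) below half of P({reals}):
    assert total <= 2 * (L + atom) and total <= 2 * (G + atom)
    print(L, G, atom)                            # 2 1 3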

With the definition above, a distribution may have more than one median. If it has more than one, then every value between any two of them is also a median; with M = P({reals})/2, if a and b are medians with a < b and m lies between them, so a < m < b, then L(m) = L(b) −P({real x: m ≤ x < b}) ≤ L(b) ≤ M and likewise, G(m) ≤ G(a) ≤ M, so m is indeed a median. Furthermore, applying P(U∪V) +P(U∩V) = P(U) +P(V) with U = {real x: x < b} and V = {real x: x > a}, so that U∪V = {reals} and U∩V = {real x: a < x < b}, we get P({real x: a < x < b}) = L(b) +G(a) −2.M ≤ 0 and, as P(U) ≥ 0 for every measurable U, we conclude that P({real x: a < x < b}) = 0, i.e. our distribution's total weight strictly between any two medians is zero (although the medians themselves may carry weight).

When {medians of P} is a singleton, {m}, it's possible P({m}) is non-zero. Otherwise, {medians of P} is an interval whose interior P maps to zero, although its end-points may carry weight. In this case, it is conventional to define the median to be the mid-point of the interval (even though, since P gives the interval's interior no weight, it's not a possible outcome). For example, the numeric outcome of rolling a single fair six-sided die (with faces numbered from one to six) has probability half of being less than or equal to three and probability half of being greater than or equal to four; every value from three to four is a median, so it is conventional to take 3.5, the mid-point, as the median.
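
A short Python sketch of the die example (scanning candidates in steps of a half is just a convenient way to display the shape of the median set):

    # Sketch: medians of a fair six-sided die, and the conventional mid-point.
    from fractions import Fraction

    weights = {face: Fraction(1, 6) for face in range(1, 7)}
    total = sum(weights.values())                          # 1

    def is_median(m):
        low = sum(w for x, w in weights.items() if x < m)
        high = sum(w for x, w in weights.items() if x > m)
        return low <= total / 2 and high <= total / 2

    # Scan candidates in steps of 1/2 just to see the shape of the median set.
    candidates = [Fraction(k, 2) for k in range(2, 13)]    # 1, 1.5, ..., 6
    medians = [m for m in candidates if is_median(m)]
    print([float(m) for m in medians])                     # [3.0, 3.5, 4.0]
    print(float((min(medians) + max(medians)) / 2))        # 3.5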

The definition can straightforwardly be extended to any connected one-dimensional real manifold in which the complement of each single point is disconnected (i.e. deleting one point splits the space in two), notably any one-dimensional real vector space. It suffices that we can, for each point m, discuss the sets of values on either side of it, that P gives a value to each side of each point in the manifold and that such splitting into sides is equivalent to defining a strict ordering on the values. Such a manifold always has some global chart and the entire discussion can be transformed, via any such chart, into a discussion of a distribution on {reals}; it is not sensitive to which such chart is used. Indeed, in the real case, the analysis is not sensitive to any monotonic reparameterisation (strict except, optionally, on intervals U with P(U) = 0). It is thus sufficient to discuss in terms of {reals}; and this specificity makes for a clearer exposition.

Further, the definition given above can be applied to any distribution over a set of values among which there is a well-defined ordering. It is most usually employed where the ordering is strict, as happens for simple numbers; but it can equally be used with a partial ordering (which is apt to yield many medians, without any means to select a middle one among these). I leave the analysis of such cases as an exercise for anyone who has a specific example they care about.

Existence

It is all well and good to define what a median is, for a distribution on {reals}, but do we know that every distribution on {reals} has a median ?

As long as P(U) ≥ 0 for every measurable subset U of {reals}, L is necessarily a monotonically increasing function, and G monotonically decreasing. Let M = P({reals})/2, as before, and observe that L(x) +G(x) ≤ 2.M for all x, with equality unless P({x}) > 0. I take it as given that, for sufficiently high x, G(x) is arbitrarily close to zero and, for sufficiently low x, L(x) is likewise close to zero. For any x with G(x) tiny, each t > x has 0 ≤ P({t}) ≤ G(x) and 0 ≤ G(t) ≤ G(x) hence L(t) = 2.M −G(t) −P({t}) ≥ 2.(M −G(x)) and, for high enough t, L(t) is arbitrarily close below 2.M; likewise, for low enough t, G(t) is arbitrarily close below 2.M. In particular, provided P({reals}) > 0 (if it is zero, every real is trivially a median), for at least some (high) real x, we have G(x) < M < L(x) and, for at least some (low) real x, we have L(x) < M < G(x).

Thus L(m)−G(m) is negative for sufficiently low m, positive for sufficiently high m and (non-strictly) monotonically increasing. We can infer that {real x: L(x) < G(x)} is either {real x: x < b} or {real x: x ≤ b}, for some real b; that {real x: L(x) > G(x)} is either {real x: x > a} or {real x: x ≥ a}, for some real a; and that b ≤ a. For any m strictly between b and a, neither strict inequality holds, so L(m) = G(m); as L(m) +G(m) ≤ 2.M, each is at most M and m is a median; so there exists a median whenever a > b.

In the remaining case, a = b, consider m = a = b as candidate median. For each x < m, we have L(x) < G(x), so (as L(x) +G(x) ≤ 2.M) L(x) < M; likewise, each x > m has G(x) < L(x), so G(x) < M. Given how L and G are defined: L(m) is the limit, as x tends to m from below, of L(x); and G(m) is the limit, as x tends to m from above, of G(x); as M is an upper bound on each of these in the relevant range, each limit must be ≤ M; hence L(m) and G(m) are both ≤ M, so m is indeed a median.
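
The monotonicity of L that drives this argument also gives a practical way to locate a median when the distribution is given by its cumulative weight; a Python sketch, using the exponential distribution as an arbitrary example:

    # Sketch: find a median of a continuum distribution by bisection on
    # L(x) = P({t: t < x}).  Example: exponential distribution, rate 1,
    # whose median is log(2), roughly 0.6931.
    import math

    def L(x):                      # cumulative weight below x
        return 1.0 - math.exp(-x) if x > 0 else 0.0

    total = 1.0
    lo, hi = 0.0, 1e6              # chosen so that L(lo) < total/2 <= L(hi)
    for _ in range(200):           # bisect so that lo, hi straddle a median tightly
        mid = 0.5 * (lo + hi)
        if L(mid) < total / 2:
            lo = mid
        else:
            hi = mid
    print(hi, math.log(2))         # both roughly 0.6931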

Thus, for (non-negative) distributions on {reals}, there always exists at least one median. By extension, the same applies to any distribution on a one-dimensional real vector space. Which brings us to …

Multi-dimensional distributions

When we come to consider a distribution on a vector space – for example, the probability distribution for some event to happen at a specific position inside some experimental apparatus – we cannot apply the above definition directly, since there is no (natural) ordering of positions in a vector space of (real) dimension higher than 1.

In one dimension, a single value suffices to sub-divide the space of values in two; you can't (by continuous movement) get from one side of the value to the other without passing through the value. In two or more dimensions, however, you can always go round a single point, so it doesn't suffice to split the space in two. Instead, to split an n-dimensional space, one needs a hyper-plane of dimension n−1 or, as it's termed, of co-dimension 1. There are generally many of these through any given point. All the same, for any hyper-plane of co-dimension 1, we can use it to split the vector space in two; we can ask our distribution for its integrals of 1 ←x over each half; and we can ask whether the two integrals are both at most half the total for the distribution over the whole space; if they are, I'll say the hyper-plane splits the distribution evenly. The only reasonable way to extend the definition of median to vector-valued variates is then to call a value a median precisely if every hyper-plane of co-dimension 1 through that value splits the distribution evenly. Vector-valued variates' distributions do not, in general, have a median.
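
For a finite weighted point-set, this "splits evenly" test is easy to state in code; a Python sketch (the helper names and the square example are mine):

    # Sketch: does the hyper-plane {v: dot(w, v) == dot(w, p)} split a finite
    # weighted point-set evenly?  Points, weights and names are illustrative.

    def dot(u, v):
        return sum(a * b for a, b in zip(u, v))

    def splits_evenly(points, w, p):
        """points: list of (vector, weight); w: covector; p: point on the plane."""
        h = dot(w, p)
        total = sum(wt for _, wt in points)
        below = sum(wt for v, wt in points if dot(w, v) < h)
        above = sum(wt for v, wt in points if dot(w, v) > h)
        return 2 * below <= total and 2 * above <= total

    # Four unit weights at the corners of a square: every line through the
    # origin splits them evenly, so the origin is a median of this distribution.
    square = [((1, 1), 1), ((1, -1), 1), ((-1, 1), 1), ((-1, -1), 1)]
    print(splits_evenly(square, (1, 0), (0, 0)))   # True
    print(splits_evenly(square, (1, 2), (0, 0)))   # True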

Counter-example

In the standard two-dimensional space {lists ({reals}:|2)}, place two equal small weights on {[0, y]: y is real}, one on each side of the origin, e.g. at [0, −3] and [0, 5]; likewise, place two equal small weights (not necessarily equal to the previous pair) on {[x, 0]: x is real}, one on each side of the origin, e.g. at [−2, 0] and [4, 0]; place two equal large weights, one in each of the {lists ({positives}:|2)} and {lists ({negatives}:|2)} quadrants; finally, place two equal large weights (not necessarily equal to the previous two), one in each of the other pair of quadrants. Whether each weight is localized at a single point or some more diffuse distribution is not crucial, so long as each weight, taken on its own, does have a median, which is in the indicated location, and no weight's part of the distribution straddles an axis except in so far as the above specification places it on that axis.

For any given y ≠ 0, consider the line {[x, y]: x is real}; the two weights on {[x, 0]: x is real} lie at least mostly on one side of it, along with at least all of one member of each other pair, while on the other side it has at most one member of each of these other pairs; so the latter side necessarily has less weight on it and our {[x, y]: x is real} line does not split the distribution evenly. The line {[x, 0]: x is real}, on the other hand, has two weights on it, each of which it splits evenly, and exactly one of each of the other pairs wholly on each side of it, balancing one another; so this line does split the distribution evenly. Likewise, the line {[0, y]: y is real} splits the distribution evenly and no line parallel to it does so. So the only candidate point for the median is the origin, [0, 0].

However, the locations of the large weights in the four quadrants were not at all constrained within their quadrants, so we can easily arrange for some lines through [0, 0] to not split the distribution evenly. For example, place the {lists ({positives}:|2)} weight at [1, 1000] and the {lists ({negatives}:|2)} weight at [−1000, −1], so that they lie on the same side of {[x, x]: x is real}; we can easily use our remaining freedom of position in the other weights to ensure that none of them straddle this line, hence it splits each other pair evenly, so that having both of this pair on one side of it makes it manifestly not split the whole distribution evenly.
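
Here is a hedged Python sketch of such a configuration; all weight values, and the positions not pinned down above (notably those of the second large pair), are arbitrary choices consistent with the description:

    # Sketch: the eight-weight counter-example.  Small weights of 1 sit on the
    # axes as in the text; large weights of 10 sit in the quadrants, with the
    # (+,+) and (-,-) pair deliberately placed on the same side of the line y = x.
    # Every position or weight not fixed by the text is an arbitrary choice.

    def dot(u, v):
        return sum(a * b for a, b in zip(u, v))

    def splits_evenly(points, w, p):
        h = dot(w, p)
        total = sum(wt for _, wt in points)
        below = sum(wt for v, wt in points if dot(w, v) < h)
        above = sum(wt for v, wt in points if dot(w, v) > h)
        return 2 * below <= total and 2 * above <= total

    weights = [
        ((0, -3), 1), ((0, 5), 1),           # small pair on the y-axis
        ((-2, 0), 1), ((4, 0), 1),           # small pair on the x-axis
        ((1, 1000), 10), ((-1000, -1), 10),  # large pair, both above y = x
        ((3, -4), 10), ((-3, 4), 10),        # other large pair, one each side
    ]

    origin = (0, 0)
    print(splits_evenly(weights, (1, 0), origin))   # True: the y-axis
    print(splits_evenly(weights, (0, 1), origin))   # True: the x-axis
    print(splits_evenly(weights, (1, -1), origin))  # False: the line y = x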

One can extend the above analysis to a general distribution on the two-dimensional plane, as long as the total weight of each quadrant is equal to that of the opposite quadrant, each axis has equal weight on either side of the origin and each axis either has positive weight on each side of the origin or has, on each side of the origin, some interval such that every measurable open set of the plane that meets the interval has positive weight. This last condition just ensures that a line parallel to the axis but slightly offset from it has some weight between it and the axis, ensuring that only the axis splits the distribution evenly.

When is it possible ?

So let's look for the conditions under which there is a median.

For each hyper-plane of co-dimension 1, in a vector space V, there are members of dual(V) which map all displacements within the hyper-plane to zero; because the hyper-plane has co-dimension 1, there is a one-dimensional space of such members of dual(V), so selecting any non-zero one of these, w, determines the rest, by simple real scaling; {t.w: t is real}. Since w is linear (from V to {reals}) and maps displacements within our hyper-plane to zero, it maps every value in the hyper-plane to the same real value, say h, so our hyper-plane can be specified as {v in V: w(v) = h}. Had we chosen a different member of w's one-dimensional subspace of dual(V), t.w, we'd obtain t.h as its value on the hyper-plane, but t.w and t.h would specify the same hyper-plane. So choose a sub-set R of dual(V) that provides exactly one member of each one-dimensional sub-space of dual(V); e.g., for some metric on dual(V), half of the surface of its unit sphere (taking care to, likewise, only take half of the equator used to cut the sphere). For any p in V, there is then a one-to-one correspondence between members of R and hyper-planes of co-dimension 1 through p, (: {v in V: r(v) = r(p)} ←r :R).

Now, for any r in R, we have a one-dimensional family of hyper-planes {v in V: r(v) = t} with t varying over {reals}; because our distribution gives us total weights for each side, for each t, it induces a distribution on this family, which is one-dimensional so has a median. For each r in R, let h(r) be the set of reals m for which {v in V: r(v) = m} is a median in r's family; and let H = (: {v in V: r(v) in h(r)} ←r |R). When h(r) is a singleton, H(r) is a hyper-plane; otherwise, h(r) is an interval and H(r) is a slab bounded by two parallel hyper-planes from r's family. Any median of our distribution must lie in H(r) for every r in R; indeed, our distribution's medians are exactly the members of intersect(H).
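
For a finite weighted point-set, h(r) can be computed by pushing the weights forward along r and taking the median interval of the resulting one-dimensional distribution; a Python sketch (the helper names are mine):

    # Sketch: for a covector r, h(r) is the set of medians of the 1-D
    # distribution induced on {reals} by v -> r(v).  For a finite weighted
    # point-set the median set is a closed interval (possibly a single point)
    # whose end-points occur among the projected sample values.

    def dot(u, v):
        return sum(a * b for a, b in zip(u, v))

    def median_interval(values):
        """values: list of (real, weight); returns the closed interval of medians."""
        total = sum(w for _, w in values)
        def is_median(m):
            below = sum(w for x, w in values if x < m)
            above = sum(w for x, w in values if x > m)
            return 2 * below <= total and 2 * above <= total
        med = [x for x, _ in values if is_median(x)]
        return min(med), max(med)

    def h(points, r):
        return median_interval([(dot(r, v), w) for v, w in points])

    square = [((1, 1), 1), ((1, -1), 1), ((-1, 1), 1), ((-1, -1), 1)]
    print(h(square, (1, 0)))   # (-1, 1): every value from -1 to 1 is a median
    print(h(square, (1, 1)))   # (0, 0): only the value 0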

For any basis (dual(V):q|dim) of our vector space's dual, each q(i) defines a one-dimensional sub-space of dual(V), {t.q(i): t is real}, which has a representative member of R, so we have some ({non-zero reals}:s|dim) for which s(i).q(i) is in R, for each i in dim. We can also, as usual, obtain a dual basis (V:b|dim) for which q(i)·b(i) = 1 for each i in dim and q(j)·b(i) = 0 for each distinct i, j in dim. For any ({reals}: F |dim) with F(i) in h(s(i).q(i)) for each i in dim, we can define M(q, F) = sum(: F(i).b(i)/s(i) ←i |dim); this is the point of intersection of the hyper-planes {v in V: s(i).q(i, v) = F(i)} for i in dim, each of which splits the distribution evenly. Each F(i) ranges over h(s(i).q(i)), which is at worst an interval within {reals}, so the range of possible values of F gives a cuboid in which M(q, F) lies, which may have zero thickness (but is not empty) in some directions. Each point in this cuboid can be thought of as a co-ordinate median for our basis (of dual(V) or equivalently of V). Changes of basis which merely rescale some of the q(i) won't change which cuboid we get (although they'll change its co-ordinates with respect to the basis) but changes in direction of any of the q(i) may change the cuboid. The cuboid may be a single point, when its thickness in all directions is zero; and, for any given basis, it is non-empty.
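
To see the M(q, F) construction in co-ordinates, here is a Python sketch for a two-dimensional V; the particular basis q, scalings s and values F are arbitrary choices (in use, each F(i) would be drawn from h(s(i).q(i))):

    # Sketch: the intersection point M(q, F) of the hyper-planes
    # {v in V: s(i).q(i)(v) = F(i)} for a 2-D space, built from a basis q of
    # dual(V) and its dual basis b of V.  All the numbers are arbitrary choices,
    # purely to check the construction.

    def dot(u, v):
        return sum(a * c for a, c in zip(u, v))

    q = [(1.0, 1.0), (0.0, 2.0)]       # a (non-orthogonal) basis of dual(V)
    det = q[0][0] * q[1][1] - q[0][1] * q[1][0]
    b = [(q[1][1] / det, -q[1][0] / det),    # dual basis: dot(q[j], b[i]) is
         (-q[0][1] / det, q[0][0] / det)]    # 1 when i = j, else 0

    s = (0.5, 3.0)                     # scalings putting each s(i).q(i) into R
    F = (1.0, -2.0)                    # in use, F(i) would come from h(s(i).q(i))

    # M(q, F) = sum of F(i).b(i)/s(i):
    M = tuple(sum(F[i] * b[i][k] / s[i] for i in range(2)) for k in range(2))
    # each hyper-plane {v: s(i).q(i)(v) = F(i)} does pass through M:
    print([round(s[i] * dot(q[i], M), 9) for i in range(2)])   # [1.0, -2.0]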

If there is any basis whose cuboid of co-ordinate medians lies entirely on one side of H(r), for some r in R, then our distribution has no median: every median would have to lie in every such cuboid as well as in every H(r). As shown above, this situation can easily arise. But, whenever intersect(H) is non-empty, each point in it is a median.

Quartiles, deciles and their kin

The median splits a distribution evenly in two: an obvious generalization is to try to split a distribution evenly in some larger number of parts. I hope it is obvious that this is only possible in the one-dimensional case.

When we split evenly in four, the sub-divisions of the distribution's range are known as quartiles; the first quartile is the set of values x for which L(x) = P({real y: y < x}) is at most a quarter of P({reals}), the second has L(x) between a quarter and a half of P({reals}), the third between a half and three quarters and the last has L(x) at least three quarters of P({reals}). Likewise, quintiles split a distribution in five, deciles in ten and percentiles in one hundred; since combining a suitable classical word for a number with the suffix -iles gives each of these, I'll refer in general to n-iles for any positive integer n, so quartiles are 4-iles and so on. The median, then, is the boundary between the low and high 2-iles of the distribution.

When neither side of the median has total weight half (so the median has some weight), one can argue for including the median in each 2-ile, in neither or just one. Likewise, for general n-iles, the boundaries may or may not be included in the n-iles they bound; but I'll settle for just identifying where those boundaries fall.

The lowest and highest n-iles have, as low and high boundaries respectively, the relevant extreme of the distribution's range, or relevant infinity if unbounded; these are never of significant interest – all the same, they are boundaries of n-iles, so I'll aim to make the definition deal gracefully with them as well as with the more interesting boundaries in between. We thus have 1+n boundaries splitting the distribution's range into n intervals; I'll index them from 0 for the low end of the lowest n-ile to n for the high end of the highest n-ile, so that the interesting boundaries are numbered 1 through n−1.

So, to split a distribution P on {reals} into n-iles, we need a list ({reals}:b|1+n) of boundaries between n-iles, with b(0) being −∞ or the low end of P's range and b(n) being +∞ or the high end of P's range. We require that b(1+i) ≥ b(i) for each i in n and we would like, for each i in n, the value of P({real x: b(i) < x < b(1+i)}) to be P({reals})/n. We won't be able to do that in all cases, e.g. for a discrete distribution on fewer than n values, but we need a specification that's as sensible as possible.

For any positive natural n and i in 1+n, an i-th n-ile split for a distribution P on {reals} (or a subset of it), with B = P({reals})/n and j = n−i, is a real m for which L(m) = P({real x: x < m}) ≤ i.B and G(m) = P({real x: x > m}) ≤ j.B.

The median is just the case with n = 2, i = j = 1. In the boundary cases, i = 0 requires P({real x: x < m}) = 0 and i = n implies j = 0 so requires P({real x: x > m}) = 0, as intended. Thus every real less than the start of P's range is a 0-th n-ile split for every n; and every real greater than the end of P's range is an n-th n-ile split for every n. If the range is unbounded you have no 0-th or n-th n-ile split, unless you entertain ±∞ as values to allow for them.
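
Applying this definition to the fair die from the earlier example, with n = 4, a Python sketch lists which faces qualify as i-th 4-ile (quartile) splits:

    # Sketch: i-th n-ile splits of a discrete weighted distribution, straight
    # from the definition.  Example: a fair six-sided die, n = 4 (quartiles).
    from fractions import Fraction

    weights = {face: Fraction(1, 6) for face in range(1, 7)}
    total = sum(weights.values())

    def L(m):
        return sum(w for x, w in weights.items() if x < m)

    def G(m):
        return sum(w for x, w in weights.items() if x > m)

    def splits(n, i, candidates):
        B, j = total / n, n - i
        return [m for m in candidates if L(m) <= i * B and G(m) <= j * B]

    faces = list(weights)
    for i in range(5):
        print(i, splits(4, i, faces))
    # 0 [1]        (nothing below 1)
    # 1 [2]        L(2) = 1/6 <= 1/4 and G(2) = 4/6 <= 3/4
    # 2 [3, 4]     the medians among the faces
    # 3 [5]
    # 4 [6]        (nothing above 6)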

Exactly as we reasoned for the median's existence, at least when i and j are both positive, we find: b for which {real x: L(x) < i.B} is either {real x: x < b} or {real x: x ≤ b}; and a for which {real x: G(x) < j.B} is either {real x: x > a} or {real x: x ≥ a}; with b ≤ a. For all m, L(m) +G(m) ≤ n.B; again, any m strictly between b and a has m > b so L(m) ≥ i.B, hence G(m) ≤ n.B −L(m) ≤ n.B −i.B = j.B, and m < a so G(m) ≥ j.B, hence L(m) ≤ i.B (in fact, L(m) = i.B, G(m) = j.B), so m is an i-th n-ile split of P. If there are no values between b and a, then b = a, so consider m = a = b as candidate; as for the median, L(m) is the limit, as x tends to m from below, of L(x), each of which is less than i.B, hence L(m) ≤ i.B; and G(m) is the limit, as x tends to m from above, of G(x), each of which is less than j.B, hence G(m) ≤ j.B and, sure enough, m is an i-th n-ile split of P.


Written by Eddy.