
Distribution

It's been widely commented that wealth and income are distributed rather unfairly in the modern world. To appreciate just how unfairly, you need to understand what distributions look like when their dynamics aren't being distorted by injustice. So I'm going to take an example data set and show you how it can be represented, in various ways, in particular as a distribution.

Raw Data

Let's start by looking at a data-set I can easily grab (the durations of the contiguous times my computer has been up and running, as reported by uptimed) and that – I can reasonably hope – no-one is going to get too emotional about:

[Figure: uptimes, in historical order; vertical axis 0 to 200 days]

The horizontal axis, here, is just the sequence number of the times I've booted my computer; this is somewhat analogous to displaying how wealthy a sample of people are, organising the people in the sample by their date of birth, or some other property largely unrelated to how wealthy they are. In the present case, an actual time-series, it may actually be instructive to look at the data in this form. All the analyses below, however, discard the order to look at how the data are spread out by their values.

Value-ordered

Now let's see what we get when we display the same data but sort the entries by the values we're actually depicting:

[Figure: uptimes, in increasing order]

The horizontal axis here gives equal space to each member of the population; so we can easily identify things that relate to what fractions of the population are on various sides of lines: for example, here are the deciles:

[Figure: uptimes, showing deciles]
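For concreteness, here's a minimal sketch (in Python) of computing such cut points; the function name n_iles and the nearest-rank choice of cut value are mine, and the short list of durations is a made-up stand-in for the real data-set, which I'm not reproducing here:

    def n_iles(data, n=10):
        """Return the n-1 values that cut the sorted data into n equal-count parts."""
        ordered = sorted(data)
        count = len(ordered)
        # Nearest-rank convention: the i-th cut is the value i/n of the way
        # through the sorted list.
        return [ordered[(i * count) // n] for i in range(1, n)]

    # Illustrative stand-in for the real uptimes, in days:
    uptimes = [0.5, 1, 2, 3, 3, 5, 8, 13, 21, 34, 55, 89, 144, 183]
    print(n_iles(uptimes, 10))   # deciles
    print(n_iles(uptimes, 5))    # pentiles, which show up later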

One thing to notice is that this distribution – which has a long tail of low values and a narrow peak of high values – has most of the population bunched together at rather low scores. Well, so much for giving equal space to equal-sized portions of the population.

Another way to look at the data is to ask what fraction of the population falls within various ranges of values. You can see that in the above by looking at the width of the portion where the heights of the bars fall in particular ranges. For example, choosing to cut at 5, 15 and 50:

[Figure: uptimes, partitioned at 5, 15 and 50 days]

We have 15 of our 70 items (21.4%) between 1 and 5 (the unit is days, by the way), 20 (28.6%) between 5 and 15, 24 (34.3%) between 15 and 50 and the remaining 11 (15.7%) between 50 and 183 (or, indeed, between 50 and 200, or 2 million).
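The counting behind those percentages is simple enough to sketch; as before, the list here is an illustrative stand-in for the real data, assumed to be available as durations in days:

    def band_counts(data, cuts):
        """Count how many items fall below the first cut, between successive
        cuts, and at or above the last cut."""
        edges = sorted(cuts)
        counts = [0] * (len(edges) + 1)
        for x in data:
            counts[sum(1 for c in edges if x >= c)] += 1
        return counts

    uptimes = [0.5, 1, 2, 3, 3, 5, 8, 13, 21, 34, 55, 89, 144, 183]  # illustrative
    for count in band_counts(uptimes, [5, 15, 50]):
        print(count, "runs,", round(100 * count / len(uptimes), 1), "%")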

We can represent that in various ways. For example, we could move my four boxes, above, all to the left, so that they line up along the axis; the width of each box is then the number of people in the interval. Then again, if I'd chosen the deciles as the boundaries between intervals, we'd get ten boxes of different heights but all of the same width, by construction; the taller boxes would look bigger when, in fact, each box represents the same proportion of the population. We could put a horizontal bar, of fixed height, whose length is the fraction of the population, at the mid-point of each interval; the areas of the bars shall then be in the right proportions – but it won't be clear where the boundaries of the intervals are.
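As a sketch of the first of those options, assuming matplotlib is available (and again using a stand-in list): draw the value-sorted bars, then for each band add a rectangle shunted against the vertical axis whose width is the number of runs in that band.

    import matplotlib.pyplot as plt
    from matplotlib.patches import Rectangle

    uptimes = [0.5, 1, 2, 3, 3, 5, 8, 13, 21, 34, 55, 89, 144, 183]  # illustrative
    ordered = sorted(uptimes)
    edges = [0, 5, 15, 50, max(ordered)]

    fig, ax = plt.subplots()
    # Value-sorted bars, one unit of width per run.
    ax.bar(range(len(ordered)), ordered, width=1.0, align="edge", color="green")
    # Each band's box, lined up along the vertical axis: width = runs in the band.
    for lo, hi in zip(edges, edges[1:]):
        runs = sum(1 for x in ordered if lo < x <= hi)
        ax.add_patch(Rectangle((0, lo), runs, hi - lo, fill=False, edgecolor="blue"))
    ax.set_xlabel("runs")
    ax.set_ylabel("uptime, days")
    plt.show()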

Before we go on to explore how to represent that, though, let's look at the decision as to where to split the distribution; I somewhat arbitrarily used 5, 15 and 50 above, but it would clearly make sense to let properties of the distribution guide our choice of where to split. We could use the deciles (or the n-iles for some other n), but these tend to split the low values more than the high values; there are purposes for which that's good, but it tends to describe the high end of the distribution rather coarsely.

Pareto Partition

One way to partition the data is at its Pareto parameter. This splits the population where the proportion on one side of the split is equal to the proportion of the total held by the rest. Summing our data, to show the sum of the shortest n uptimes as a function of n, we get:

[Figure: cumulative uptime against number of runs, marking the complementary proportion of the total and the Pareto parameter]

In our case, the 19 biggest uptimes (19/70 is 27.1%) account for 71.9% of the total uptime (the two percentages summing to 99%) while the 20 biggest (28.6%) account for 73.4% of the total uptime (summing to 102%). The 20th-longest and 19th-longest were 31.7 and 34.5 days, with an average close to 33. So (roughly speaking) the top 27% of the population, those over 33 days, accounts for 73% of the total uptime, while the bottom 73% accounts for 27% of the total.

[Figure: uptimes, Pareto-partitioned at 33 days: the bottom 73% of runs hold 27% of the sum, the rest hold 73% of the sum]
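Here's a rough sketch of finding that split: walk the sorted data, tracking the fraction of runs above each candidate split and the fraction of the total uptime below it, and keep the candidate where the two are closest. The function name is mine and the list is again a stand-in for the real data:

    def pareto_split(data):
        """Return (split value, fraction of items above, fraction of sum below)
        at the point where those last two fractions are closest to equal."""
        ordered = sorted(data)
        count, total = len(ordered), sum(ordered)
        running, best = 0.0, None
        for i, x in enumerate(ordered, start=1):
            running += x
            above = (count - i) / count    # fraction of runs above the split
            below = running / total        # fraction of the sum below it
            gap = abs(above - below)
            if best is None or gap < best[0]:
                best = (gap, i, above, below)
        _, i, above, below = best
        # Split value: midway between the items either side of the cut.
        split = (ordered[i - 1] + ordered[i]) / 2 if i < count else ordered[-1]
        return split, above, below

    uptimes = [0.5, 1, 2, 3, 3, 5, 8, 13, 21, 34, 55, 89, 144, 183]  # illustrative
    split, above, below = pareto_split(uptimes)
    print(f"top {100 * above:.0f}% of runs (over {split:g} days) "
          f"hold {100 * (1 - below):.0f}% of the total uptime")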

We can, of course, repeat this partition on the two parts into which we've split the population. Among the shorter 51 of the uptimes, the shortest 33 runtimes (64.7%, under 12 days) account for 35.2% of that portion's total; of the full population, these are 47% accounting for 9.9% of the total. Their complement, 18 of the shorter 51 (35.2%, 12 to 33 days), account for 64.8% of that portion's total; overall, these are 26% accounting for 18.2% of the total. Splitting the top 19, the lower 12 (63%, 33 to 67 days) account for 38% of that portion's total; overall, these are 17% accounting for 27.3%. The other 7 (36.8%, over 67 days) account for 62.0% of the upper portion's total; overall, these are 10% accounting for 44.5% of the total.

[Figure: uptimes, Pareto-partitioned twice, at 12, 33 and 67 days]
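In code, that repetition is a short recursion on top of the pareto_split sketch above; this is a sketch only, but with the real data-set it should land near the 12, 33 and 67 used in the figure:

    def pareto_cuts(data, depth):
        """Cut values from recursively Pareto-splitting the data `depth` times;
        depth 2 gives four bands, depth 3 gives eight.  Uses pareto_split above."""
        if depth <= 0 or len(data) < 2:
            return []
        split, _, _ = pareto_split(data)
        lower = [x for x in data if x <= split]
        upper = [x for x in data if x > split]
        return sorted(pareto_cuts(lower, depth - 1) + [split]
                      + pareto_cuts(upper, depth - 1))

    # e.g. pareto_cuts(uptimes, 2) for the four bands shown in the figure above.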

Density

So, now that we have a nice way to partition our data, let's return to how to represent how much of our data fall in the resulting parts. For reference, I'll include the boxes from the previous example, but shunt them up against the axis. Let's mark each interval with a box, spanning the interval, whose area is proportional to the fraction of the population it represents:

[Figure: uptimes and densities, cutting at 12, 33 and 67 days]

I chose the constant of proportionality here so that the bottom box from the previous depiction is exactly filled by its area-correct box. Each box above it spans more days of variation in uptime, i.e. it's taller, so spreads its portion of the population thinner; each block's width is a smaller fraction of that of the box from the previous depiction. We can now flip that on its side to get the quantity we were measuring (uptime duration) as the horizontal axis.

So now the uptime, in days, is the horizontal axis, where it was the vertical axis before. The earlier horizontal axis counted the number of runs; but I labelled it with percentages, allowing us to read it as a proportion of runs, which is dimensionless. The green boxes have time in one direction (vertical above, horizontal hereafter) and number of runs in the other. The blue blocks share the same time direction, but the proportion of runs is now represented by their area; dividing that by the time measured by one side, we see their other side is proportion/time which, as proportion is dimensionless, is just 1/time. So, from here onwards, the vertical axis (blue block height) has units 1/day; but it's not a frequency, it's a density. We shall, in due course, interpret our blue blocks as a piecewise constant approximation to a density function whose integral, from one uptime x to another y, is the proportion of runs that ended in a reboot between x and y days into the run. Our initial blue blocks give right answers to that if x and y are both cut points (1, 12, 33, 67 or 183 days).
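In code, the blue blocks amount to just this (with the usual proviso that the list stands in for the real data): each band's height is its share of the runs divided by its width in days, so its area is that share and its height has units of 1/day. Summing height times width between any two cut points then recovers the proportion of runs between them.

    def piecewise_density(data, cuts):
        """(lo, hi, height) per band: height = (fraction of runs) / (width in days)."""
        edges = [min(data)] + sorted(cuts) + [max(data)]
        total = len(data)
        blocks = []
        for j, (lo, hi) in enumerate(zip(edges, edges[1:])):
            last = j == len(edges) - 2
            runs = sum(1 for x in data if lo <= x < hi or (last and x == hi))
            blocks.append((lo, hi, (runs / total) / (hi - lo)))
        return blocks

    uptimes = [0.5, 1, 2, 3, 3, 5, 8, 13, 21, 34, 55, 89, 144, 183]  # illustrative
    for lo, hi, height in piecewise_density(uptimes, [12, 33, 67]):
        print(f"{lo:6.1f} to {hi:6.1f} days: {height:.4f} per day")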

Now, we got this via my Pareto-based choice of the points at which to split the range of values: you can fairly wonder how much it depends on that. So let's look at what I'd have got with a few other choices, starting with equal width intervals of the range of values:

[Figure: density averaged over intervals of fixed widths 50, 10 and 5 days, and over pentiles]

We get a bit more detail, but the total areas in various intervals stay about the same; and the data get a bit fragmentary. We can address that by splitting the range into intervals containing equal fractions of the population, such as the pentiles, which works reasonably well; but when we go to deciles the results get quite spiky, as the first few cut points are very close together. The finest such sub-division places each sample datum in the middle of an interval; where a few samples are close together, those in between fall in very narrow intervals, so yield extremely high density, inversely proportional to the interval width. One remedy for this is to use, in each interval, a density obtained by averaging over it and a few intervals to each side. This is apt to change the total area under the blocks but provides a better-smoothed density to describe the observed data.
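A sketch of that blurring, applied to the (lo, hi, height) blocks from the density sketch above; it takes a plain average of the neighbouring heights (the original may well weight by interval width), and shrinks its reach near the ends so the average stays symmetric:

    def blur_density(blocks, blur):
        """Replace each block's height by the average over it and up to `blur`
        neighbours on each side, using fewer near the ends to stay symmetric."""
        heights = [h for (_, _, h) in blocks]
        out = []
        for i, (lo, hi, _) in enumerate(blocks):
            reach = min(blur, i, len(blocks) - 1 - i)
            window = heights[i - reach: i + reach + 1]
            out.append((lo, hi, sum(window) / len(window)))
        return out

    # e.g. blur_density(piecewise_density(uptimes, [12, 33, 67]), 1)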

[Figure: density with decile cuts at blur 1 and blur 2, and with individual data at blur 5]

I've reduced the blur near the boundary, so that each interval is blurred with no more neighbours on either side than it actually has on its scarcer side; this leaves me with a high spike at the start, but that looks like it's actually a real feature of this data-set. (Soon after a reboot, I'm less likely to have started anything that it'd be tiresome to shut down for another reboot; so I'm more likely to respond to a new software upgrade by rebooting to let it take effect, which makes short up-times more common. And if something wrong with the machine caused the previous reboot, either that thing stays wrong (making another reboot likely) or I fix the problem, in which case I'm likely to reboot to make the fix take effect; so this, too, increases the incidence of short uptimes; but usually reboots are due to power lapses or the desire to activate some new software updates.)

[Figure: density from a moving interval 10 samples wide]

This last approach is effectively using a moving interval instead of strictly cutting the distribution into intervals. We can be more systematic about this, dispensing with the cutting; about each point (even the ones between sample data), look at the number of samples within some interval about that point and divide that number by the interval's width to get the density at that point. A fixed-width window will have the same problems as above for fixed-width intervals, turning each isolated block into a triangular rise to a peak and fall back down to zero; so let's try using an interval symmetric about each value but containing some fixed proportion of the population.
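A sketch of that, for the ten-sample window of the figure above: at each evaluation point, take as half-width the tenth-smallest distance from the point to a sample, so the window always holds ten samples; the density is then that fixed fraction of the population divided by the window's width. The names and the grid are mine, and the list is the usual stand-in:

    def moving_density(data, x, k=10):
        """Density at x from the smallest symmetric window about x holding k samples."""
        # Half-width needed: the k-th smallest distance from x to any sample.
        half = sorted(abs(v - x) for v in data)[k - 1]
        width = 2 * half
        return (k / len(data)) / width if width > 0 else float("inf")

    uptimes = [0.5, 1, 2, 3, 3, 5, 8, 13, 21, 34, 55, 89, 144, 183]  # illustrative
    # Evaluate on a grid running a little past the data's range; the resulting
    # curve's tails reach beyond the data, as noted below.
    grid = [2 * i for i in range(0, 101)]            # 0, 2, ..., 200 days
    curve = [moving_density(uptimes, x) for x in grid]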

One thing to notice here is the tails; on the right, this reaches almost to 370, well beyond the actual data (and clipped from the display). It's not as obvious on the left, but it also reaches beyond the range of the actual data.

Scaling density by data

Thus far, we've been looking at how many members of the population have how high a score; my data-set represents how long my machine stayed up between successive reboots. If we'd been looking at how much money each citizen of some nation earned, the analogous plots would have shown what proportion of the population falls into each of various income brackets. In that case, another perspective on the data is to look at what proportion of the total national income goes to people in each income band; roughly, this would scale up the height of each bar by the typical income of the folk with incomes in that bar's range; so the bars would grow taller to the right and shorter to the left. The analogous plot for my data-set would give each bar an area equal to the amount of time my machine stayed up during the assorted runs; you can then think of the relative area of each bar as the probability that a randomly selected day, during the study period described by the data set, fell into a run (between boots) whose duration fell in the range described by the given bar. For the analogous plot for the wealth distribution of a nation, the area of each bar could be characterised (slightly inaccurately but faithfully in spirit) as the probability that a randomly selected unit of currency belonged to a person with wealth in that band.
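In code, this scaled view differs from the density sketch above in just one line: each band's area is its share of the summed uptime (rather than its share of the number of runs), again divided by the band's width to get a height. The list is, as before, an illustrative stand-in:

    def scaled_blocks(data, cuts):
        """(lo, hi, height) per band: height = (band's share of summed data) / (width in days)."""
        edges = [min(data)] + sorted(cuts) + [max(data)]
        grand_total = sum(data)
        out = []
        for j, (lo, hi) in enumerate(zip(edges, edges[1:])):
            last = j == len(edges) - 2
            band_total = sum(x for x in data if lo <= x < hi or (last and x == hi))
            out.append((lo, hi, (band_total / grand_total) / (hi - lo)))
        return out

    uptimes = [0.5, 1, 2, 3, 3, 5, 8, 13, 21, 34, 55, 89, 144, 183]  # illustrative
    for lo, hi, height in scaled_blocks(uptimes, [12, 33, 67]):
        print(f"{lo:6.1f} to {hi:6.1f} days: {height:.4f} per day")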

[Figure: density vs scaled, on pentiles]

First, let's look at the pentiles, which we saw before, and contrast the density (blue blocks) with the proportion of the total in each interval. Again, the area of the rectangle is what's significant; it indicates the total number of days spent in runs that ended after an amount of time in the given uptime interval. So the heights of the boxes aren't particularly relevant; but at least splitting an interval should yield bars whose average height is that of the original. So the heights may be sensitive to where we cut, but only mildly so.

We can see that the long low tail of our density accounts for a disproportionate amount of the total, as should not be entirely surprising. Although most times I reboot my computer are only a short while after the previous time, the times that it runs for many months actually account for a respectable proportion of the time it's running.

Of course, in this situation, we should probably be splitting the data on some different intervals, since the pentiles are cuts based on how many of the population fall where. So let's give the evenly-spaced cuts over time a go, again contrasted with their densities:

[Figure: density vs scaled, on fixed-width intervals of 50, 10 and 5 days]

Let's also see how the Pareto partitioning looks, since that's also concerned with the sum accounted for by the different portions of the population. I'll further sub-divide the earlier four bands, each at its Pareto split (the seven threshold values I've used here are 7, 12, 20, 33, 42, 67 and 144; these only approximate the Pareto splits), to also get a case with eight bands:

As you can see, this rather brings out what the Pareto subdivision is designed to explore: the way the upper portions disproportionately influence the total.


Written by Eddy.