5.2. Percentiles#

Recap: Population Parameters, Sample Statistics#

Recall that the population is the universe of cases we want to describe. There is some parameter in that population that, from our perspective, is fixed and unknown. What we can do is take large random samples from the population and use the statistics from those samples to estimate the population parameter.

We use the term “estimate” because we will never know the value of the population parameter for sure. But it is reasonable to try to say how uncertain we are about our estimate. More crudely, we want to know how much “error” there is around our estimate; intuitively, if we rely on the statistic from the sample we drew, how wrong are we likely to be about the “true” value the parameter takes in the population?

This all means that we need to answer the following question in as systematic a way as we can:

how different could this estimate have been, if we had drawn a different random sample?

To get the intuition here, remember that, as we saw in the previous chapter, different random samples produce different sample statistics (in that case, median wait times). But for a typical problem, we will draw one random sample, one time, and thus have only one statistic. So our problem is to use that one statistic from that one sample (and some tools below) to understand how large or small the population parameter might plausibly be. Before we show how that can be done, in this section we define a new measure that we can calculate from a sample and that will be useful in general.

Percentiles#

We define the \(p^{th}\) percentile as the

value of the data such that \(p\)% of the observations fall below that value, and \((100-p)\)% fall above it.

We will give some examples momentarily, but note that some particular percentiles have special names:

the 50th percentile of the data is the median, which we have already met. It is the “middle value”: half the observations are above it (in value) and half are below it.
the 25th percentile of the data is the first or lower quartile. 75% of the observations lie above this value.
the 75th percentile of the data is the third or upper quartile. 25% of the observations lie above this value.

Just for completeness, note that the 100th percentile of the data is the largest value—there are no observations larger. The particular way that these percentiles are defined and implemented in software varies a little, but this should not make a big difference for large samples.

Calculating Percentiles#

One standard way to calculate percentiles is as follows.

Start with a sample of values. Let’s say we have \(n=8\) and they are:

\[ 1,3,9,7,5,3,11,3 \]

Rank, or sort, the observations from smallest to largest:

\[ 1,3,3,3,5,7,9,11 \]

Find \(p\)% of \(n\), which is \(\frac{p}{100}\times n\). Call that number \(k\). Now:

if \(k\) is an integer (a whole number), take the \(k\)th element of the sorted sample. That’s the \(p^{th}\) percentile.
if \(k\) is not an integer, round it up to the next integer, and take that element of the sorted sample. That’s the \(p^{th}\) percentile.

For our sample, consider calculating the 25th percentile. The relevant value of \(k\) is \(0.25\times 8=2\). That’s an integer, so we take the 2nd element of the ordered sample, which is 3. For the 75th percentile, the relevant value of \(k\) is \(0.75\times 8=6\), so we take the 6th element of the ordered sample, which is 7.

What about, say, the 83rd percentile? Now the relevant value of \(k\) is \(0.83\times 8= 6.64\). We round that to 7 (the next integer), and our 83rd percentile is thus 9.
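The steps above can be sketched in a few lines of Python. This is only a minimal illustration of the round-up rule described in this section, not the method any particular software package uses; the function name `percentile` and its argument names are our own choices for this sketch.

```python
import math

def percentile(values, p):
    """Return the p-th percentile of `values` using the round-up rule above.

    Illustrative helper (not a library function); assumes 0 < p <= 100.
    """
    if not values:
        raise ValueError("values must not be empty")
    if not 0 < p <= 100:
        raise ValueError("p must satisfy 0 < p <= 100")
    ordered = sorted(values)        # rank the observations, smallest to largest
    n = len(ordered)
    k = math.ceil(p * n / 100)      # p% of n, rounded up if it is not an integer
    return ordered[k - 1]           # k counts from 1; Python lists count from 0

sample = [1, 3, 9, 7, 5, 3, 11, 3]
print(percentile(sample, 25))   # 3
print(percentile(sample, 75))   # 7
print(percentile(sample, 83))   # 9
```

The three printed values match the worked examples: the 2nd, 6th, and 7th elements of the sorted sample.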

Percentiles are useful when we want to get a sense of where a given observation is in the distribution of all observations. So, for example, weight percentiles of babies might tell us how healthy a given child is, or how well it is developing, relative to all other children. They do this in a way that the weight itself, e.g. 8 lbs 3 oz, does not automatically convey. Similarly, standardized test performance is often reported as the percentile the student achieves. Thus, we may not care that a student got 59 responses correct out of 122 questions, but our assessment might change quite a bit depending on whether that performance puts the student at the 10th or 99th percentile.
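To see roughly where a single observation sits in a sample, one simple option is to compute the share of observations that are at or below it. The sketch below does this for the small sample from earlier in this section. The helper name `percent_at_or_below` is hypothetical, and the "at or below" convention is an assumption made for this example; as noted above, definitions vary a little, so other conventions will give slightly different answers.

```python
def percent_at_or_below(values, x):
    """Percentage of observations in `values` that are less than or equal to x.

    Illustrative helper; uses an "at or below" convention chosen for this sketch.
    """
    if not values:
        raise ValueError("values must not be empty")
    count = sum(1 for v in values if v <= x)   # observations no larger than x
    return 100 * count / len(values)

sample = [1, 3, 9, 7, 5, 3, 11, 3]
print(percent_at_or_below(sample, 7))    # 75.0
print(percent_at_or_below(sample, 11))   # 100.0
```

Here the value 7 is at or above six of the eight observations, which places it at 75%, while the largest value, 11, is at 100%.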

Interquartile Range#

While percentiles tell us where a given unit is in the distribution, the Interquartile Range (IQR) tells us how the distribution looks as a whole. Specifically, the IQR is the

difference between the third and first quartiles in the sample (that is, the third quartile minus the first).

For our sample above, this is 7-3 = 4. Intuitively, this is telling us about the spread of the ‘middle part’ of the distribution. When it is larger, the distribution is more spread out around its median; when it is smaller, the distribution is less spread out.
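As a rough sketch, the IQR can be computed by reusing the round-up percentile rule from earlier in this section. The function names `percentile` and `interquartile_range` are our own, and the second sample, `spread_out`, is a made-up illustration (each value in the original sample doubled) included only to show the IQR growing as the data spread out.

```python
import math

def percentile(values, p):
    """p-th percentile via the round-up rule (illustrative; assumes 0 < p <= 100)."""
    ordered = sorted(values)
    k = math.ceil(p * len(ordered) / 100)   # position in the sorted sample, from 1
    return ordered[k - 1]

def interquartile_range(values):
    """Third quartile (75th percentile) minus first quartile (25th percentile)."""
    return percentile(values, 75) - percentile(values, 25)

sample = [1, 3, 9, 7, 5, 3, 11, 3]
spread_out = [2 * x for x in sample]    # hypothetical, more spread-out sample

print(interquartile_range(sample))       # 4
print(interquartile_range(spread_out))   # 8
```

Doubling every value doubles the distance between the quartiles, so the IQR goes from 4 to 8.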
