Part 2: Data and Sampling Distributions for Data Science
The work in this 3-part explainer series is heavily inspired by the book Naked Statistics: Stripping the Dread from the Data by Charles Wheelan. I have also referred extensively to the textbook Practical Statistics for Data Scientists (O’Reilly Media). I would like to sincerely thank the authors for their efforts and explanations.
This article contains a detailed summary of data and sampling distributions. It is your ticket to skipping the tedious trek through endless textbook pages and countless videos. Welcome to the “I Studied Stats So You Don’t Have To” series.
In a world where enormous amounts of data are generated daily, it’s easy to fall into the belief that traditional statistical methods, such as sampling, have become obsolete.
This argument is generally referred to as ‘Big Data hubris’: the belief that big data analytics is a substitute for traditional analytics rather than a supplement to it. The implicit assumption is that more data automatically yields accurate results. The cautionary tale of the Literary Digest poll is a prime example of how a ‘big data’ project, at least by the standards of its time, can lead to false conclusions.
Despite the abundance of data, the importance of sampling and of understanding distributions remains paramount for efficient data handling and bias reduction. Even in big data projects, predictive models are often developed and piloted using samples (Bruce et al., 2020).
The article that follows introduces the following concepts:
- Random Sampling and Sampling bias
- Selection bias — Data snooping and the vast search effect
- Sampling Distributions and CLT
- Standard Error and Confidence intervals
- The Bootstrap method
The primary objective of this article is to explain the intricacies of sampling distributions and their relevance in the modern landscape of big data.
Introduction to Samples and Populations
In statistics, a sample refers to a subset of a larger dataset that accurately represents a specific population of interest. Given the inherent challenges in collecting data from an entire population, statistical analysis is often performed on a sample. The primary objective is to draw meaningful conclusions about the entire population based on insights derived from the sample. Even though samples contain information from a relatively small fraction of the larger population, statistical methods enable us to extend the findings of the sample to the broader population of interest.
Here are instances where the art of sampling comes alive:
- Medical Studies: Research frequently involves selecting a sample of patients to assess the effectiveness of a new treatment.
- Political Polling: Polls are conducted using a sample of the population to gauge public opinion on various issues and elections.
- Environmental Studies: Sampling soil or water from a specific area assists in monitoring pollution levels and environmental health.
In statistics, a ‘population’ has a unique meaning compared to its use in biology. Here, it refers to a carefully defined, and imagined, larger collection of data. This might not be something we can physically touch, but rather an idea we create. We use the term ‘imagined’ to highlight that, sometimes, even statisticians lack full knowledge about this dataset — we might not know everything about its characteristics or size. For instance, think about a researcher trying to find out what products people prefer. They might not have all the information about how big the group they’re studying is.
The population (its size denoted by ’N’) has an underlying theoretical distribution (refer to Part 1 to learn more about distributions and histograms), whereas a sample (its size denoted by ’n’) has an empirical distribution. Much of traditional inferential statistics focused on the theoretical population distribution, whereas modern statistics focuses more on samples and their empirical distributions.
Sampling is the procedure used to move from the population to the sample.
The Soup Example to understand sampling:
Don’t worry if you can’t grasp the concepts of samples and populations yet; here’s a simplified example to understand them conceptually. Imagine preparing a pot of soup that requires seasoning. Due to the pot’s size, tasting the entire contents isn’t feasible. Instead, you rely on a single spoonful. This act of acquiring the spoonful mirrors the concept of sampling, with the spoon’s contents representing the sample. The entire pot of soup symbolizes the population. Using the information gathered from the spoonful (sample), you extrapolate that the pot of soup (population) shares a similar taste profile.
Sampling and Bias
As suggested, sampling involves selecting a smaller subset of data from a large population. Various methods exist for sampling, with some being more efficient at generating high-quality data than others. Random sampling is at the core of these methods. Random sampling ensures that each member of the population has an equal chance of being chosen. In essence, random sampling stands as the bedrock of crafting a representative dataset.
For a random sampling recipe, you need these ingredients:
- A well-defined population: For example, if you are trying to sample your consumers, how would you define a “consumer”? Do you take all the records from your database where the purchase amount is greater than 0? Do you include people who returned their products?
- An accessible population: For example, think about randomly selecting 100 customers from that pool. But it doesn’t end there. How and when the selection happens matters too, especially with real-time data.
Moreover, random sampling techniques vary depending on the characteristics of the population under study. One such technique, stratified sampling, is used when the population contains distinct subgroups that exist in different proportions. Imagine probing preferences across racial groups such as Hispanics and African Americans. Dividing this diverse population into strata and drawing random samples from each stratum ensures an accurate representation.
It should be noted that data quality in data science involves the completeness, format, and accuracy of each data point. In statistics, however, data quality adds the notion of representativeness. A statistician would ask the question: “Does the extracted sample effectively represent the population it was drawn from?”
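If you work in Python, here is a minimal sketch of simple random versus proportionate stratified sampling with pandas. The `customers` DataFrame, its `group` and `spend` columns, and the sampling fractions are all hypothetical.

```python
# A minimal sketch of simple random vs. stratified sampling with pandas.
# The `customers` DataFrame and its columns are hypothetical.
import numpy as np
import pandas as pd

rng = np.random.default_rng(7)
customers = pd.DataFrame({
    "group": rng.choice(["A", "B", "C"], size=10_000, p=[0.6, 0.3, 0.1]),
    "spend": rng.gamma(shape=2.0, scale=50.0, size=10_000),
})

# Simple random sample: every row has an equal chance of being chosen.
simple = customers.sample(n=1_000, random_state=7)

# Stratified sample: draw 10% from each stratum separately, so each group
# is represented in proportion to its size by construction.
stratified = customers.groupby("group").sample(frac=0.1, random_state=7)

print(simple["group"].value_counts(normalize=True).round(2))
print(stratified["group"].value_counts(normalize=True).round(2))
```

Both approaches roughly preserve the group proportions here, but the stratified draw guarantees it, whereas a simple random sample might, by chance, under-represent a small stratum.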
Sampling Bias
Let’s return to our soup analogy: Imagine you’re tasting the soup to check its seasoning. The spoonful you take (which is like sampling) helps you decide whether the whole pot (the entire population) is seasoned well. But here’s the catch — the spoonful you took might not give a fair picture. If you didn’t stir the soup after seasoning, the sample could be unbalanced, leading to wrong inferences about the pot. This is sampling bias.
Sampling bias shows up when the chosen sample doesn’t fairly represent the whole population. This happens because of non-random differences between the two. Such bias can result in wrong or misleading outcomes. For example, back in 1936, the Literary Digest ran a poll for the US presidential election. It polled mainly its own subscribers, who were wealthier than the general population. As a result, the sample (a spoonful of soup) differed from the population (the potful of soup) in a non-random way.
To dodge sampling bias, researchers must select their samples thoughtfully and ensure that any differences between the sample and the population are random and insignificant. It’s important to note that there’s a distinction between errors caused by bias and errors stemming from random chance. Statistics can handle errors stemming from random chance, but bias will lead to wrong conclusions.
Selection Bias
In the world of statistics, a popular adage resonates: “If you torture the data long enough, it will confess.” This cautionary saying points to a critical issue known as selection bias, where data is cherry-picked either consciously or unconsciously, leading to erroneous and misguided conclusions.
If you formulate a well-defined problem and hypothesis and design a suitable experiment to test it, you can be fairly confident that your results sidestep selection bias. This is often not the case in practice: when working with data, especially large datasets, there is a risk of data snooping, the practice of combing through the data and testing multiple, unspecified hypotheses until something interesting appears.
For example, a stock trader working with price data may notice from historical data that earnings per share is a good predictor of the stock price. But this may not hold for new, unseen data. The data will indeed give you something interesting if you look hard enough, but is it reliable? There is a real risk of consciously or unconsciously cherry-picking the data.
To reduce selection bias, you can do several things (a short code sketch follows this list):
- Keep a holdout set — A portion of the original dataset is kept aside for testing the claim you’re making
- Target shuffling — randomly rearranging the target variable values while keeping the input features unchanged to assess model performance
- Work with a pre-defined hypothesis
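As a rough illustration (not a prescription), here is a minimal Python sketch of the first two ideas, assuming scikit-learn is available. The DataFrame `df`, its feature columns, and the target are all hypothetical noise data.

```python
# A minimal sketch of a holdout set plus target shuffling, assuming scikit-learn.
# The DataFrame `df`, its features, and its target are hypothetical noise data.
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
features = ["f1", "f2", "f3", "f4"]

# Hypothetical data: 500 rows of noise features and an unrelated target.
df = pd.DataFrame(rng.normal(size=(500, 4)), columns=features)
df["target"] = rng.normal(size=500)

# 1) Holdout set: keep a portion of the data aside to test the claim.
train, holdout = train_test_split(df, test_size=0.25, random_state=0)
model = LinearRegression().fit(train[features], train["target"])
real_score = model.score(holdout[features], holdout["target"])

# 2) Target shuffling: permute the target, redo the whole fit-and-evaluate
#    process, and see how often chance alone produces a score this good.
shuffled_scores = []
for _ in range(200):
    shuffled = df.assign(target=rng.permutation(df["target"].to_numpy()))
    tr, ho = train_test_split(shuffled, test_size=0.25, random_state=0)
    m = LinearRegression().fit(tr[features], tr["target"])
    shuffled_scores.append(m.score(ho[features], ho["target"]))

print(f"R^2 on holdout: {real_score:.3f}")
print("Fraction of shuffled runs scoring at least as well:",
      np.mean(np.array(shuffled_scores) >= real_score).round(2))
```

Because the features here are pure noise, the real model should not score noticeably better than the target-shuffled runs; when a genuine signal exists, its holdout score should clearly beat the shuffled distribution.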
Sampling Distributions and CLT
We usually take a sample to measure something, like an average (a sample statistic), or to model something for predictions. Because our insights rely on samples, a degree of error is natural; a different sample could yield slightly different outcomes. Therefore, we are interested in how different the sample statistic could be from one sample to the next and from the population value, which is what we call ‘sampling variability’.
In an ideal scenario, we could repeatedly draw samples and directly observe the distribution of sample statistics. However, due to practical constraints, it’s not always feasible to work with numerous samples. Nonetheless, we can tackle this limitation by distinguishing between two types of distributions: the ‘data distribution,’ which portrays individual data points, and the ‘sampling distribution,’ which characterizes the distribution of sample statistics.
Back to soup:
You have a very, very large pot of soup, and you want to check whether the seasoning is evenly spread. To do this, you can’t rely on just one spoonful of soup; you need to taste multiple spoonfuls to get a better idea. Each time you taste a spoonful, you’re essentially measuring the ‘mean seasoning’ — a sample statistic. As you continue tasting and measuring, you’ll observe slight variations in the ‘mean seasoning’ from spoonful to spoonful. These variations in the ‘mean seasoning’ indicate sampling variability.
Okay, we are ditching the soup example now; it's becoming ridiculous.
Now, if you were to plot all the ‘mean seasoning’ values you measured on a graph, you’d see a distribution. This distribution is called the ‘sampling distribution.’ It shows you how the ‘mean seasoning’ varies across all the different samples you could take from the same large pot of soup. Sampling distributions are usually quite regular and bell-shaped, which makes it easier to draw conclusions from them; here’s an illustration that demonstrates this phenomenon.
For those who couldn’t follow, here’s the process flow for creating the above sampling distribution (second chart):
1) Randomly sample 5 income values -> 2) Calculate the mean -> 3) Repeat steps 1 and 2 1,000 times -> 4) Obtain 1,000 mean values
To create the sampling distribution, you took samples of 5 values and averaged them, repeating this process 1,000 times to obtain 1,000 mean values.
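Here is a minimal sketch of that process flow in Python; the income values are simulated from a lognormal distribution purely as a stand-in for the real dataset behind the charts.

```python
# A minimal sketch of the process flow above, with simulated (hypothetical)
# income data in place of the real dataset used for the charts.
import numpy as np

rng = np.random.default_rng(0)
incomes = rng.lognormal(mean=10.5, sigma=0.6, size=50_000)  # skewed "income" data

sample_means = []
for _ in range(1_000):                      # 3) repeat 1,000 times
    sample = rng.choice(incomes, size=5)    # 1) randomly sample 5 income values
    sample_means.append(sample.mean())      # 2) calculate the mean

sample_means = np.array(sample_means)       # 4) 1,000 mean values

print(f"Spread of the data distribution (std): {incomes.std():,.0f}")
print(f"Spread of the sampling distribution (std): {sample_means.std():,.0f}")
# The means cluster far more tightly and look roughly bell-shaped,
# even though the underlying income data are heavily right-skewed.
```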
The infamous Central Limit Theorem (CLT) and Standard Error
As you can see, the data distribution (top chart) is spread out broadly, as expected with income data. However, the histograms of the means of 5 and of 20 values are compact and bell-shaped. The phenomenon we see is described by the central limit theorem: it states that means drawn from multiple samples will resemble the bell-shaped normal curve (even if the source population is not normally distributed). The CLT permits us to apply familiar statistical techniques even when the underlying population distribution is not normal; essentially, it helps us make reasonable assumptions and predictions about our population. This theorem forms the backbone of various statistical analyses, such as hypothesis testing and confidence intervals.
The CLT underpins much of traditional statistics, and data scientists should be aware of its role. However, with modern computing power, a resampling method called the bootstrap (covered below) provides a robust alternative to CLT-based approaches.
Standard Error and Confidence Intervals
So far, we’ve explored the concept of sampling distributions and how they shed light on the behavior of sample statistics. Now, let’s delve into other crucial aspects: the standard error and confidence intervals.
In the realm of statistics, the standard error plays a pivotal role when it comes to sampling distributions. While the sampling distribution helps us visualize how sample statistics vary, the standard error quantifies this variability. The standard error is used to create confidence intervals, which provide a range of likely values for a population parameter based on our sample statistic. Don’t worry; we’ll work through an example to understand it.
Out of soup and into chocolates:
Imagine you’re a quality control analyst at a chocolate factory, responsible for ensuring the uniformity of chocolate bar weights. To estimate the true average weight of all bars produced, you adopt a strategy: sample 50 bars daily over a month, yielding a collection of 30 sample means. You quickly realize that the sample means vary by some amount; that is, there is some ‘wiggle’ room in the average. The standard error is the quantification of that ‘wiggle’ in your means.
Let's assume that the average of the 30 sample means is 50 grams, accompanied by a standard error of 0.75 grams. The average, our sample statistic, tends to “wiggle” or fluctuate by about 0.75 grams. This wiggling reflects the natural variation that exists across our sample means. To make this more concrete, consider two scenarios: one day’s sample might have an average of 49.25 grams, while another’s could be 50.75 grams. This spread of about 0.75 grams captures the typical range within which the sample averages dance around the true population mean. In essence, the standard error provides us with a numerical measure of this “wiggle,” helping us understand how much our calculated averages might vary as we explore different samples.
Armed with this understanding, you decide to create a 95% confidence interval (confidence levels are conventionally stated as high percentages such as 90%, 95%, or 99%) around your sample mean. Using the standard error, you calculate the interval as roughly 50 ± 1.96 × 0.75, giving a range of about 48.5 to 51.5 grams. The interpretation is this: you can be 95% confident that the true average weight of all the bars produced lies between about 48.5 and 51.5 grams. Note that this is a statement about the population mean, not about the weight of any individual chocolate bar.
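A minimal sketch of this calculation, assuming NumPy and 30 simulated (entirely hypothetical) daily sample means:

```python
# A minimal sketch of the chocolate example; the 30 daily sample means are
# simulated here and purely hypothetical.
import numpy as np

rng = np.random.default_rng(1)
daily_means = rng.normal(loc=50, scale=0.75, size=30)  # 30 daily means of 50 bars each

standard_error = daily_means.std(ddof=1)   # the "wiggle" across the sample means
todays_mean = daily_means[-1]              # the sample mean from a single day

# Approximate 95% confidence interval for the true average bar weight,
# built around one sample mean using the empirically estimated standard error.
ci_low = todays_mean - 1.96 * standard_error
ci_high = todays_mean + 1.96 * standard_error
print(f"Standard error: {standard_error:.2f} g")
print(f"95% CI: ({ci_low:.2f} g, {ci_high:.2f} g)")
```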
Now imagine having to weigh every chocolate on that production line instead. This is the power of statistics: with little effort, you can make informed statements about your population.
In essence, the standard error gives you an idea of how much your sample mean might ‘wiggle,’ and the confidence interval provides a range where you believe the actual population parameter resides (the confidence interval is calculated using the standard error). These tools empower you to make informed decisions about the quality of the chocolate bars and the precision of your estimates.
In practice, this approach of collecting new samples to estimate the standard error is typically not feasible (and statistically very wasteful). Fortunately, it turns out that it is not necessary to draw brand-new samples; instead, you can use bootstrap resamples.
The Bootstrap
As we have seen, you create a sampling distribution by repeatedly drawing samples from the population. One easy and effective way to approximate the sampling distribution is to draw additional samples, with replacement, from the sample itself. We simply put each observation back after drawing it, effectively creating an infinite population. The procedure is as follows:
- Draw a sample value, record it, and then replace it.
- Repeat n times.
- Record the mean of the n resampled values.
- Repeat steps 1–3 R times.
R is the number of bootstrap iterations, usually set somewhat arbitrarily. The more iterations you run, the more precise the resulting estimates become.
The recorded means are then used to: a) calculate their standard deviation, which estimates the standard error; b) produce a histogram of the sampling distribution; and c) find a confidence interval.
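Here is a minimal sketch of the bootstrap loop in NumPy; the original sample of 40 lognormal ‘income’ values and the choice of R = 5,000 are hypothetical.

```python
# A minimal sketch of the bootstrap algorithm above, using NumPy.
# The original sample of 40 "income" values is hypothetical.
import numpy as np

rng = np.random.default_rng(3)
original_sample = rng.lognormal(mean=10.5, sigma=0.6, size=40)

R = 5_000                         # number of bootstrap iterations
n = len(original_sample)
boot_means = np.empty(R)
for r in range(R):
    # Steps 1-2: draw n values WITH replacement from the sample itself.
    resample = rng.choice(original_sample, size=n, replace=True)
    # Step 3: record the mean of the resampled values.
    boot_means[r] = resample.mean()

# a) The standard deviation of the bootstrap means estimates the standard error.
print(f"Bootstrap standard error of the mean: {boot_means.std(ddof=1):,.0f}")
```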
The bootstrap does not compensate for a small sample size; it does not create new data, nor does it fill in holes in an existing data set. It merely informs us about how lots of additional samples would behave when drawn from a population like our original sample.
Calculating a confidence interval is simplified using the bootstrap technique. After generating the sampling distribution through the algorithm mentioned earlier, you can establish an x% confidence interval by trimming (100 - x)/2 % from each end of the distribution.
Let’s say you’re aiming for a 90% confidence interval. In this case, you would trim the ends of the distribution by (100 - 90)/2 = 5% on each side. What you’re left with is the 90% confidence interval: the range within which you’re quite confident the true population parameter lies. This powerful technique harnesses the information in your sample to provide a reliable estimate of uncertainty and precision.
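A minimal sketch of the trimming step, using np.percentile on bootstrap means built the same way as in the previous sketch (again with hypothetical data):

```python
# A minimal sketch of the percentile (trimming) approach for a 90% confidence
# interval, with bootstrap means built as in the previous sketch.
import numpy as np

rng = np.random.default_rng(3)
original_sample = rng.lognormal(mean=10.5, sigma=0.6, size=40)
boot_means = np.array([
    rng.choice(original_sample, size=len(original_sample), replace=True).mean()
    for _ in range(5_000)
])

# Trim (100 - 90) / 2 = 5% from each end of the bootstrap distribution.
ci_low, ci_high = np.percentile(boot_means, [5, 95])
print(f"90% bootstrap CI for the mean: ({ci_low:,.0f}, {ci_high:,.0f})")
```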
The article provides a condensed overview, anticipating some loss of detail due to its summarization of extensive research spanning hundreds of pages and numerous videos. Despite the limitations of brevity, I’ve made every effort to encompass essential concepts. If you find any inaccuracies or areas that require further clarity, please don’t hesitate to point them out.
Happy learning!
Citations
Wheelan, C. (2014). Naked Statistics: Stripping the Dread from the Data (1st ed.). W. W. Norton & Company.
Bruce, P., Bruce, A., & Gedeck, P. (2020). Practical Statistics for Data Scientists: 50+ Essential Concepts Using R and Python (2nd ed.). O’Reilly Media.