Part 1: Understanding Descriptive Statistics and EDA
The work in this three-part explainer series is largely inspired by the book Naked Statistics: Stripping the Dread from the Data by Charles Wheelan. I have also referred heavily to the textbook Practical Statistics for Data Scientists, published by O’Reilly Media. I would like to sincerely thank the authors for their efforts and explanations.
This article is a detailed summary of EDA and descriptive statistics: your ticket to skipping the tedious trek through endless textbook pages and countless videos. Welcome to the “I Studied Stats So You Don’t Have To” series.
Exploratory Data Analysis (or EDA) is the first step in any analytical process. As the name suggests, the goal of EDA is to explore the data and surface important information about it.
Let me paint a picture for you to develop an intuition about the concept. After a long day of work, you are trying to find a movie to watch on Netflix. Because you are picky (at least I am), you go through a myriad of cast lists, genres, trailers, and ratings (such as IMDb or Google Reviews). You decide on a movie after a lot of ‘Exploration’. When you engage in these activities you are exploring the dataset (list of movies available on Netflix).
Historically speaking, EDA was not a thing before the birth of computer science. In fact, the term data analysis (and with it EDA) was coined in 1962 by John Tukey. Before that, statistics focused mostly on inference and was largely developed in the late 1800s and early 1900s. Tukey forged the links between statistics and computer science.
But what is descriptive statistics, and how does it relate to EDA, you ask? To understand that, we go back to the movie example. When you looked at the IMDb and Google ratings to determine the better choice, those ratings were descriptive statistics. When you told yourself, “There are an awful lot of action movies nowadays,” you were secretly using descriptive statistics (not the right way to use frequency distributions, though). As we can see, the two concepts are very closely related; in fact, without descriptive statistics, EDA does not exist.
As the name suggests, descriptive statistics describes and summarizes the dataset at hand; it gives you context about the data. Descriptive statistics is one of the two main branches of statistics (inferential statistics is the other, focused on making predictions and drawing conclusions). In the modern day, descriptive statistics are calculated with the help of computer software and statistical packages: R, Python, SQL, and Microsoft Excel all have functions that can help you explore and describe a dataset.
Deep Dive: Types of Descriptive Statistics
There are three groups of measures in descriptive statistics:
- Measures of Central Tendency: Mean (Average), Median, Mode
- Measures of Dispersion: Range, Standard Deviation, Variance
- Measures of Distribution: Kurtosis, Skewness, Frequency, Percentiles
It is very important to educate yourself in this regard because the results of your analysis depend directly on the measures you choose. Statisticians always say “garbage in, garbage out”: if the foundations of your analysis are weak, the result will be inaccurate, or even misleading.
Measures of Central Tendency
Also called measures of location, these measures give you a single-value estimate of where most of the data in a dataset is located.
The most commonly used central value is the arithmetic mean, or average. To calculate it, you divide the sum of the observations by the count of observations. Even though it is very intuitive to understand, the mean can be misleading if you are not careful about outliers (the most extreme values in a dataset).
To understand this, let us take a fairly common example:
Suppose you’re sitting at a bar with 8 other people. The average yearly income of all individuals (observations) is 40,000 USD. In the next minute, Bill Gates walks into the Bar, and suddenly, the average income of 10 people increases to roughly 100 million USD (assuming Bill Gates earns 1 billion dollars a year). Now, none of the original 9 people are that rich. Bill Gates is pulling the average income, he is the outlier here (Wheelan, 2014).
To avoid this sort of analytical outcome you can do two things: 1) remove the outlier from the equation (a trimmed mean), or 2) calculate the median of the dataset.
You calculate the median by 1) sorting the observations from smallest to largest and 2) picking the middle value from the sorted list. If we were to pick the median of the bar table before and after Bill Gates walked in, it would barely change.
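To make this concrete, here is a minimal sketch in Python (NumPy), using hypothetical incomes chosen so that the nine-person average is 40,000 USD; the exact numbers are made up for illustration.

```python
import numpy as np

# Hypothetical yearly incomes (USD) of the nine people at the bar
incomes = np.array([30_000, 32_000, 35_000, 38_000, 40_000,
                    42_000, 45_000, 48_000, 50_000])

print(np.mean(incomes))    # 40,000
print(np.median(incomes))  # 40,000

# Bill Gates walks in (assumed 1 billion USD a year, as in the example)
incomes_with_gates = np.append(incomes, 1_000_000_000)

print(np.mean(incomes_with_gates))    # roughly 100 million, dragged up by the outlier
print(np.median(incomes_with_gates))  # about 41,000, barely moved
```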
Not choosing the right metric for your exploration will lead you to wrong inferences, and inevitably to wrong decisions. Sometimes, propagandists display misleading metrics on purpose to sway the general population.
“Most people use statistics like a drunk man uses a lamppost; more for support than illumination” — Andrew Lang
There are other ways to reduce the impact of outliers: many analysts use weighted means and trimmed means to make their estimates robust (insensitive to extreme values).
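As a sketch of both ideas, SciPy’s trim_mean and NumPy’s weighted average can blunt the effect of the outlier; the cut proportion and the weights below are arbitrary choices made purely for illustration.

```python
import numpy as np
from scipy import stats

incomes = np.array([30_000, 32_000, 35_000, 38_000, 40_000,
                    42_000, 45_000, 48_000, 50_000, 1_000_000_000])

# Trimmed mean: drop the top and bottom 10% of observations before averaging
print(stats.trim_mean(incomes, proportiontocut=0.1))  # back to about 40,000

# Weighted mean: heavily down-weight the observation we consider an outlier
weights = np.array([1.0] * 9 + [0.001])
print(np.average(incomes, weights=weights))
```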
Measures of Variability
Measures of variability, or dispersion, measure whether the values in a dataset are tightly clustered or spread out. There is a lot of emphasis on variability in statistics: statisticians work hard to identify and reduce it.
Just as there are formulas to measure central tendency, there are ways to measure variability. The most reported and widespread measure is the standard deviation.
You can measure the standard deviation of a dataset by following these steps (a code sketch follows the list):
- Calculate the mean of the dataset (for the standard deviation, the mean is the measure of central tendency used)
- Calculate the deviations by taking the difference between each observation and the mean. This gives you n deviations, where n is the count of observations (sample size)
- Sum the squared deviations from the previous step (if the deviations themselves were summed, the result would be zero, because negative deviations would offset the positive ones)
- Divide the sum of squared deviations by the number of observations (n). This value is the variance. (For a sample, you usually divide by n-1 instead)
- Take the square root of the variance to get the standard deviation. This step brings the value back to the original units of the data
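Here is the sketch promised above: the same steps written out in Python with made-up numbers, checked against NumPy’s built-in function at the end.

```python
import numpy as np

observations = np.array([30_000, 32_000, 35_000, 38_000, 40_000,
                         42_000, 45_000, 48_000, 50_000], dtype=float)
n = len(observations)

mean = observations.sum() / n          # step 1: the mean
deviations = observations - mean       # step 2: n deviations
sum_sq = (deviations ** 2).sum()       # step 3: sum of squared deviations
variance = sum_sq / (n - 1)            # step 4: sample variance (divide by n - 1)
std_dev = variance ** 0.5              # step 5: back to the original units

# NumPy's ddof=1 gives the same sample standard deviation
print(std_dev, np.std(observations, ddof=1))
```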
For those who don’t want to get into manual calculations (because in most cases, the computer will do it for you); If the end standard deviation is low, it means the data is closely gathered around the mean. On the contrary, a high standard deviation means that the data is scattered around the mean.
As with the measures of central tendency, the standard deviation is not a robust measure. It is sensitive to outliers, and squaring the deviations amplifies the effect of extreme values. Depending on the dataset, you may have to use another measure of variability, the median absolute deviation. Alternatively, you can compute a trimmed standard deviation by excluding the outliers from the sample first.
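A minimal comparison, using the same hypothetical incomes as above: the standard deviation blows up because of the single extreme value, while SciPy’s median absolute deviation stays on the scale of the typical observations.

```python
import numpy as np
from scipy import stats

data = np.array([30_000, 32_000, 35_000, 38_000, 40_000,
                 42_000, 45_000, 48_000, 50_000, 1_000_000_000], dtype=float)

print(np.std(data, ddof=1))              # dominated by the outlier
print(stats.median_abs_deviation(data))  # a few thousand, insensitive to it
```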
In summary, measures of variability give you a range that tells you what is normal for the dataset and what is out of the ordinary. They can point you to potential outliers, such as Bill Gates at the bar.
Exploring the Data Distribution
Measures of distribution show you the shape of the dataset. Statistics such as kurtosis and skewness can be computed, but they are out of scope for this article. Instead, I’ll talk about important terms and visualizations that can help you make more sense of the dataset you’re dealing with.
Percentiles and Box plots: Percentiles are a measure of relative standing. The Pth percentile of a dataset is a value such that at least P percent of the values are less than or equal to it, and at least (100 - P) percent are greater than or equal to it (Bruce et al., 2020). For example, if you score in the 10th percentile on the SAT, roughly 90% of test takers scored more than you and 10% scored less.
The Median is the 50th percentile of a dataset.
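A quick illustration with NumPy and made-up scores; note that the 50th percentile comes out equal to the median.

```python
import numpy as np

# Made-up exam scores, purely for illustration
scores = np.array([12, 25, 31, 44, 48, 52, 60, 71, 85, 93])

print(np.percentile(scores, [25, 50, 75]))  # quartiles of the scores
print(np.median(scores))                    # same value as the 50th percentile
```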
A common way to visualize percentiles is the boxplot. Boxplots can give you a lot of information about a dataset: in a single visualization, you can see the median, the IQR (the values between the 25th and 75th percentiles), the extreme percentiles, and potential outliers.
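As a sketch, here is a boxplot drawn with seaborn on simulated incomes (one extreme value added deliberately); the box spans the IQR, the line inside it is the median, and points beyond the whiskers are flagged as potential outliers.

```python
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)

# Simulated yearly incomes plus one extreme value standing in for real data
incomes = np.append(rng.normal(40_000, 5_000, 200), 1_000_000)

sns.boxplot(x=incomes)  # box = IQR, middle line = median, lone points = outliers
plt.show()
```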
Frequency Tables, Histograms, and Density Plots: A frequency table divides the values of a variable into equally spaced intervals (referred to as bins) and tells us how many observations fall into each bin. Frequency tables are usually visualized as histograms, with bins on the x-axis and frequency on the y-axis. For example, you can divide an Age attribute in a dataset into ten-year bins (e.g. 0–9, 10–19, etc.) and then count the number of observations in each bin (age group). Putting the bins and their counts in a table gives a frequency table; plotting that table with bars gives a histogram.
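A minimal sketch of building such a frequency table in pandas, with made-up ages; the same information drawn with bars is the histogram.

```python
import pandas as pd

# Hypothetical ages, purely for illustration
ages = pd.Series([4, 15, 18, 23, 27, 29, 34, 41, 45, 52, 67, 71])

# Equal-width bins of ten years: 0-9, 10-19, ..., 70-79
freq_table = pd.cut(ages, bins=range(0, 81, 10), right=False).value_counts().sort_index()
print(freq_table)

# Plotting the same data with bars gives a histogram
ages.plot.hist(bins=8)
```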
Box plots and Histograms are related in a sense:
Both frequency tables and percentiles summarize the data by creating bins. In general, quartiles (percentiles in multiples of 25, i.e. 25, 50, 75) and deciles (percentiles in multiples of 10) will have the same count in each bin (equal-count bins), but the bin sizes will differ. A frequency table, by contrast, has different counts in its bins (equal-size bins). (Bruce et al., 2020)
Density plots are essentially the same as histograms, but a density plot uses proportions (%) instead of counts. Histograms use bars, whereas density plots generally draw a continuous line.
To show an example, I have plotted a density plot (left) and a histogram (right) side by side in Python’s seaborn. The data shows the age distribution of Olympic athletes and suggests that most athletes are between the ages of 20 and 30.
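Here is a sketch of how such a plot can be produced with seaborn; since I cannot reproduce the original Olympic dataset here, the ages below are simulated to stand in for it.

```python
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

rng = np.random.default_rng(1)

# Simulated ages standing in for the Olympic athletes dataset
athletes = pd.DataFrame({"Age": rng.normal(25, 5, 1_000).round()})

fig, axes = plt.subplots(1, 2, figsize=(10, 4))
sns.kdeplot(data=athletes, x="Age", ax=axes[0])   # density plot: continuous line
sns.histplot(data=athletes, x="Age", ax=axes[1])  # histogram: bars and counts
plt.show()
```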
Bar and Pie charts: Bar charts and pie charts are used with categorical data. They show simple calculations on proportions. The pie chart is not held in high esteem in the data community because it conveys less information than other methods.
Bar charts are very popular visualizations, which usually show a categorical variable on the x-axis and the proportions associated with it on the y-axis. The key difference between bar charts and histograms is that bar charts show the different categories of a factor variable, whereas histograms represent a single numeric variable divided into bins.
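A minimal sketch of a bar chart over a categorical variable, with hypothetical movie genres as the data.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical genres of movies in a catalogue (a categorical variable)
genres = pd.Series(["Action", "Drama", "Comedy", "Action", "Drama",
                    "Action", "Comedy", "Action"])

genres.value_counts().plot.bar()  # one bar per category, counts on the y-axis
plt.ylabel("Count")
plt.show()
```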
Two Categorical Variables — Contingency Tables, Bar Charts, Box Plots
Sometimes you might have to compare values across multiple categories. For example, you might want to compare the prices of different hotel chains, or the salaries of different departments in a company. To do this, you compare numerical values grouped according to a categorical (factor) variable.
Contingency tables, commonly known as pivot tables (in MS Excel), can show you different measures such as counts, sums, and percentages for different categories.
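A sketch of the same idea in pandas, using a hypothetical bookings table; pivot_table here reports the count of bookings and the mean price for each chain.

```python
import pandas as pd

# Hypothetical hotel bookings: chain (categorical) vs. nightly price (numeric)
bookings = pd.DataFrame({
    "chain": ["Hilton", "Hilton", "Marriott", "Marriott", "Ibis", "Ibis"],
    "price": [210, 240, 190, 230, 90, 110],
})

print(pd.pivot_table(bookings, index="chain", values="price",
                     aggfunc=["count", "mean"]))
```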
Similarly, you can compare values across categories with different visualizations (a seaborn sketch follows this list):
Bar Charts
Box Plots
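Here is the seaborn sketch mentioned above, using hypothetical salaries grouped by department; the bar chart shows the mean per department, while the box plot shows the full distribution within each one.

```python
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Hypothetical salaries grouped by department
salaries = pd.DataFrame({
    "department": ["Sales"] * 3 + ["IT"] * 3 + ["HR"] * 3,
    "salary": [45_000, 52_000, 48_000, 70_000, 85_000, 78_000,
               40_000, 43_000, 41_000],
})

sns.barplot(data=salaries, x="department", y="salary")  # mean salary per department
plt.show()

sns.boxplot(data=salaries, x="department", y="salary")  # distribution per department
plt.show()
```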
This article provides a condensed overview, so some detail is inevitably lost in summarizing material that spans hundreds of pages and numerous videos. Despite the limits of brevity, I have made every effort to cover the essential concepts. If you find any inaccuracies or areas that need further clarity, please don’t hesitate to point them out.
Happy learning!
Citations
Wheelan, C. (2014, January 13). Naked Statistics: Stripping the Dread from the Data (1st ed.). W. W. Norton & Company.
Bruce, P., Bruce, A., & Gedeck, P. (2020, June 2). Practical Statistics for Data Scientists: 50+ Essential Concepts Using R and Python (2nd ed.). O’Reilly Media.