Every year, thousands of students around the globe, some willingly and some unwillingly, enroll in introductory statistics, data analytics, or consumer analytics courses. And it’s not difficult to find students who would rank these courses among their scariest academic experiences.
In this article, I try to take the perspective of the average student enrolled in a typical introductory statistics course and explain five reasons why we struggle to grasp the course’s core content.
1. Randomness is unintuitive
The Cambridge Dictionary defines “random” as happening, done, or chosen by chance rather than according to a plan.
As a human being, I do not possess an intuitive understanding of random events. I love to pretend things happen for a reason. Otherwise, how do I make my existence meaningful?
Think about sports. If the captain of a sports team (especially in cricket) loses multiple coin tosses in a row, supporters blame the captain as if it were his fault. Similarly, if several 50-50 decisions go against the team you support, you may start smelling a conspiracy against your team and blame the referee/umpire. In both circumstances, the outcomes are driven by randomness (meaning that no purpose or reason drives these events), yet humans often don’t perceive them as random events.
As statistics deals with randomness, much of the subject matter is often unintuitive.
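To make the coin-toss intuition concrete, here is a minimal simulation sketch. The numbers are my own assumptions for illustration: each hypothetical captain calls 50 fair tosses, and I count how many captains suffer at least five losses in a row. The function names are made up for this example.

```python
import random

def longest_losing_streak(n_tosses, rng):
    """Return the longest run of consecutive lost tosses in n_tosses fair coin flips."""
    longest = current = 0
    for _ in range(n_tosses):
        if rng.random() < 0.5:  # the captain loses this toss
            current += 1
            longest = max(longest, current)
        else:
            current = 0
    return longest

def share_with_streak(n_tosses=50, streak=5, n_captains=10_000, seed=1):
    """Estimate the share of captains who lose at least `streak` tosses in a row."""
    rng = random.Random(seed)
    hits = sum(
        longest_losing_streak(n_tosses, rng) >= streak for _ in range(n_captains)
    )
    return hits / n_captains

if __name__ == "__main__":
    # Chance of losing five particular tosses in a row: (1/2)^5
    print(f"P(lose 5 specific tosses in a row) = {0.5 ** 5:.3f}")
    print(f"Share of captains with a 5-loss streak in 50 tosses: {share_with_streak():.2f}")
```

Losing five specific tosses in a row is rare, but over a long enough career a streak like that is far from unusual, and no one is at fault for it.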
Before I took my first statistics course, had you asked me, “How would you divide people into two groups which, on average, should be identical?” I would have said, “I would gather their information, start looking for pairs of people who are identical to each other, assign them to the two groups, and keep repeating the process.”
And then I took statistics and learned that the most effective way would be to rely on random assignment (e.g., coin flips)! 🤦🏾‍♂️
Indeed, the power of randomness to create seemingly organized outcomes is hard to fathom. Instructors should spend more time building a solid understanding of this crucial concept.
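A small simulation can show what random assignment does. The sketch below is only illustrative: the people, their ages and incomes, and the function names are all invented for this example. Each person is assigned to a group by a fair coin flip, and then the group averages are compared.

```python
import random
import statistics

def coin_flip_groups(people, rng):
    """Assign each person to one of two groups by a fair coin flip."""
    heads, tails = [], []
    for person in people:
        (heads if rng.random() < 0.5 else tails).append(person)
    return heads, tails

def group_means(n_people=10_000, seed=42):
    """Return {trait: (mean in heads group, mean in tails group)}."""
    rng = random.Random(seed)
    # Hypothetical population: age and income drawn from normal distributions
    people = [
        {"age": rng.gauss(40, 12), "income": rng.gauss(50_000, 15_000)}
        for _ in range(n_people)
    ]
    heads, tails = coin_flip_groups(people, rng)
    return {
        trait: (
            statistics.fmean(p[trait] for p in heads),
            statistics.fmean(p[trait] for p in tails),
        )
        for trait in ("age", "income")
    }

if __name__ == "__main__":
    for trait, (mean_heads, mean_tails) in group_means().items():
        print(f"{trait}: heads group = {mean_heads:,.1f}, tails group = {mean_tails:,.1f}")
```

The coin never looks at anyone’s age or income, yet the two groups end up with very similar averages on both. The same holds for traits nobody measured, which is what manual pairing cannot guarantee.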
2. Common sense doesn’t work
Let’s say your instructor is teaching you how to interpret the results of a t-test.
In the example, they are trying to investigate whether there is a difference in annual income between two groups of people. After running the test, they conclude:
“The difference in the sample average annual income between the two groups is $100. However, this difference is not significantly different from 0 at the 5% significance level.”
Statements like this one can be challenging for a beginner.
At first, the instructor says the difference in average income is $100, and then they say this $100 difference isn’t different from 0. “How can the value of something simultaneously be $100 and not different from 0?” wonders the beginner. 🤔
Unfortunately, there is no easy way to explain statistical jargon correctly. I realized this while struggling to grasp it as a student, and even more so while trying to explain it in my previous articles:
Why You Should Prefer Confidence Interval over p-value
Communicating Results of Your Statistical Analysis
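One way to see how both parts of the instructor’s statement can hold at once is to build a toy dataset. In the sketch below, everything is an assumption made for illustration: two groups of 200 people, incomes with a standard deviation of $15,000, and a shift that forces the sample difference to be exactly $100. The 95% interval uses the normal approximation (1.96) rather than an exact t quantile.

```python
import math
import random
import statistics

def make_groups(n=200, sd=15_000, target_diff=100, seed=7):
    """Create two hypothetical income samples whose means differ by target_diff."""
    rng = random.Random(seed)
    group_a = [rng.gauss(50_000, sd) for _ in range(n)]
    group_b = [rng.gauss(50_000, sd) for _ in range(n)]
    # Shift group B so the *sample* difference is exactly target_diff
    shift = target_diff - (statistics.fmean(group_b) - statistics.fmean(group_a))
    return group_a, [income + shift for income in group_b]

def difference_with_ci(group_a, group_b, z=1.96):
    """Return the sample difference, its standard error, and an approximate 95% CI."""
    diff = statistics.fmean(group_b) - statistics.fmean(group_a)
    se = math.sqrt(
        statistics.variance(group_a) / len(group_a)
        + statistics.variance(group_b) / len(group_b)
    )
    return diff, se, (diff - z * se, diff + z * se)

if __name__ == "__main__":
    diff, se, (low, high) = difference_with_ci(*make_groups())
    print(f"Sample difference: ${diff:,.0f}")
    print(f"Standard error:    ${se:,.0f}")
    print(f"Approx. 95% CI:    (${low:,.0f}, ${high:,.0f})")
```

The $100 describes these two samples. The interval describes how far the sample difference could plausibly be from the population difference, and here it comfortably includes 0. So "the difference is $100" is a statement about the sample, while "not significantly different from 0" is a statement about what the sample lets us say regarding the population.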
3. Statistical tests are “acausal”
The key purpose of statistical inference is to learn about the value of an unknown population parameter from a sample estimate. Causal inference, in contrast, requires counterfactual reasoning. I explain the fundamentals of causal inference in the following article:
How to Explore the Effect of Doing Something? (Part 1)
Applied Causal Inference 101: Counterfactual Worlds and The Experimental Ideal
The problem is that statistical tests (such as the t-test, chi-squared test, and F-test) are designed for statistical inference, not causal inference. Whether you can make valid causal claims based on the findings of your statistical tests depends on the data-generating process and the model you estimate.
Many instructors do not sufficiently explain this key issue. Consequently, students start making causal interpretations based on a finding’s statistical significance (i.e., whether the p-value is < 0.05). For example, beginners often struggle to detect the difference between the following statements:
- A 1-unit change in X is positively associated with a 2.3-unit change in Y
- A 1-unit increase in X causes/leads to/generates a 2.3-unit increase in Y
Some instructors do mention that “correlation doesn’t imply causation” but do not explain the concept well enough. Fortunately, these concepts can be fairly easily explained using some simple directed acyclic graphs. For regression analysis, I describe some of these concepts in the following articles:
Confounding Variable and Spurious Correlation: Key Challenge in Making Causal Inference
Regression and Causal Inference: Which Variables Should Be Added to The Model?
Struggle and (Potential) Remedy
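The gap between "associated with" and "causes" can be illustrated with a simulated confounder. The sketch below is a toy setup of my own, not taken from the articles above: a variable Z drives both X and Y, and X has no effect on Y at all (in graph terms, Z → X and Z → Y, with no arrow from X to Y). All coefficients and function names are assumptions for illustration.

```python
import random

def ols_slope(x, y):
    """Slope from a simple least-squares regression of y on x."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov_xy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    var_x = sum((xi - mean_x) ** 2 for xi in x)
    return cov_xy / var_x

def simulate_slopes(n=5_000, seed=3):
    """Return (slope with confounded X, slope with randomly assigned X)."""
    rng = random.Random(seed)
    z = [rng.gauss(0, 1) for _ in range(n)]          # confounder
    y = [2 * zi + rng.gauss(0, 1) for zi in z]       # Y depends on Z only
    x_observed = [zi + rng.gauss(0, 1) for zi in z]  # X also depends on Z
    x_randomized = [rng.gauss(0, 1) for _ in range(n)]  # X set independently of Z
    return ols_slope(x_observed, y), ols_slope(x_randomized, y)

if __name__ == "__main__":
    confounded, randomized = simulate_slopes()
    print(f"Slope of Y on observed X:   {confounded:.2f}")
    print(f"Slope of Y on randomized X: {randomized:.2f}")
```

With the observed X, the slope should land near 1 and would be highly "significant", even though changing X does nothing to Y. When X is assigned at random, the slope should fall close to 0. The test is equally valid in both cases; only the data-generating process decides whether a causal reading is justified.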
4. Lost in formulae and software functions
As a student, I attended multiple introductory graduate courses in which many of my classmates got lost in mathematical formulae and in the mechanics of running statistical tests in R. Also, as a teaching assistant in an undergraduate course, I observed many students getting lost in Excel functions. Consequently, many students cannot focus on internalizing statistical reasoning (which should be the number one learning outcome in any introductory statistics course).
Of course, students need certain mathematical and software skills for a deeper understanding of statistics. However, not all students in an introductory course arrive with the required preparation, which complicates the teaching process.
Nevertheless, the core ideas of statistical reasoning can be explained using toy examples. Instructors should ensure that students spend most of their time building a solid understanding of the fundamental concepts rather than worrying about something like how to manually calculate the standard error of a coefficient under time pressure during the mid-term/final exam.
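As an example of the kind of toy example I have in mind, the sketch below shows what a standard error is without asking anyone to memorize a formula. It uses the simpler case of a sample mean rather than a regression coefficient, and the population values (mean 100, standard deviation 20, samples of 25) are arbitrary assumptions.

```python
import math
import random
import statistics

def standard_error_demo(n=25, n_samples=5_000, pop_mean=100.0, pop_sd=20.0, seed=11):
    """Compare the spread of simulated sample means with the textbook formula."""
    rng = random.Random(seed)
    # Draw many samples of size n and record each sample's mean
    sample_means = [
        statistics.fmean(rng.gauss(pop_mean, pop_sd) for _ in range(n))
        for _ in range(n_samples)
    ]
    simulated_se = statistics.stdev(sample_means)  # spread of the sample means
    formula_se = pop_sd / math.sqrt(n)             # sigma / sqrt(n)
    return simulated_se, formula_se

if __name__ == "__main__":
    simulated, formula = standard_error_demo()
    print(f"Standard deviation of simulated sample means: {simulated:.2f}")
    print(f"Formula sigma / sqrt(n):                      {formula:.2f}")
```

The two numbers should come out close to each other. The point for a beginner is that the standard error is simply how much an estimate would bounce around if the study were repeated many times; the formula is a shortcut for that idea.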
5. Fake datasets don’t inspire
In many courses (especially online tutorials), instructors generate random numbers from statistical distributions to create variables and use (only) these fake variables to explain how to conduct statistical tests.
I understand the value of such a practice. First, generating fake datasets can help create toy examples. Second, the process is convenient as the instructor doesn’t have to forage for a publicly available dataset (which can be messy due to missing data). Third, the instructor — along with teaching a statistical method — can teach Monte Carlo simulation. Fourth, the steps of statistical tests (e.g., a two-sample t-test) don’t change depending on the data-generating process. So, one may argue that there should be no differential effect on student learning depending on using fake datasets vs. real-world datasets.
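For readers who haven’t seen this practice, here is a minimal sketch of a fake-data Monte Carlo exercise. Both groups are drawn from the same normal distribution, so any "significant" result is a false alarm. The sample sizes and function names are assumptions for illustration, and the cutoff of 1.96 is a normal approximation to the t critical value.

```python
import math
import random
import statistics

def welch_t(group_a, group_b):
    """Two-sample t statistic that does not assume equal variances."""
    se = math.sqrt(
        statistics.variance(group_a) / len(group_a)
        + statistics.variance(group_b) / len(group_b)
    )
    return (statistics.fmean(group_a) - statistics.fmean(group_b)) / se

def false_positive_rate(n_sims=2_000, n=100, cutoff=1.96, seed=5):
    """Share of simulated studies that reject although there is no true difference."""
    rng = random.Random(seed)
    rejections = 0
    for _ in range(n_sims):
        # Fake data: both groups come from the same distribution
        group_a = [rng.gauss(0, 1) for _ in range(n)]
        group_b = [rng.gauss(0, 1) for _ in range(n)]
        if abs(welch_t(group_a, group_b)) > cutoff:
            rejections += 1
    return rejections / n_sims

if __name__ == "__main__":
    print(f"Share of false alarms: {false_positive_rate():.3f}")
```

The share of false alarms should hover around 5%, which is what a 5% significance level means. This kind of exercise is only possible with fake data, because it requires knowing the truth in advance.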
However, in the context of introductory statistics courses, I argue that using real-world datasets (along with fake datasets) facilitates the average student’s learning.
I feel more empowered when I learn how to conduct a statistical test using a real-world dataset. Why? Because as a student of applied social sciences, I enrolled in a statistics course not for the love of statistical distributions and tests but to learn methods that help me investigate real-world phenomena. Working with a real-world dataset makes me feel like I am progressing in the right direction.
Regardless of whether we become data analysts, given that all of us are consumers of data, a solid understanding of statistical reasoning can help us become critical thinkers and better decision-makers.
For many students, the introductory statistics course is the first and last time they receive training on describing and analyzing data. Consequently, instructors of these courses should prioritize the following:
- Appreciate that concepts that seem obvious to them (e.g., the consequences of random sampling/assignment, the central limit theorem) may not seem obvious to a beginner
- Help students build a solid understanding of statistical reasoning using toy examples
- Save students from getting lost in formulae and software functions
- Use graphs to explain under what circumstances statistical association between two variables can be interpreted as a causal effect
- Empower and inspire students by teaching statistical analyses on real-world datasets