Five Reasons Why Statistics Is So Hard to Learn
Every year, thousands of students around the globe, some willingly and some unwillingly, enroll in introductory statistics, data analytics, consumer analytics, and many other related courses. It is not difficult to find students who would put these courses at the top of their list of scariest academic experiences.
In this article, I try to take the perspective of the typical student enrolled in an introductory statistics course and explain five reasons why we struggle to grasp the course’s core content.
1. Randomness is unintuitive
The Cambridge Dictionary defines random as “happening, done, or chosen by chance rather than according to a plan.”
As a human being, I do not possess an intuitive understanding of random events. I love to pretend things happen for a reason. Otherwise, how do I make my existence meaningful?
Think about sports. Suppose the captain of a sports team, especially in cricket, loses several coin tosses in a row. Many supporters then start blaming the captain, as if losing the tosses were the captain's fault. Similarly, if several fifty-fifty decisions go against the team you support, you may begin to suspect a conspiracy and blame the referee or umpire. In both situations, the outcomes are driven by randomness and not by any deliberate purpose, yet people often do not perceive them as random events.
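To see why a losing streak is weaker evidence of a jinx than it feels, here is a minimal sketch that computes the exact chance of a streak of lost tosses with a fair coin. The function name `prob_losing_streak` and the numbers (a streak of 5, over 5 and over 50 tosses) are my own illustrative assumptions.

```python
def prob_losing_streak(n_tosses: int, streak: int) -> float:
    """Exact probability of at least one run of `streak` consecutive
    losses somewhere in `n_tosses` fair coin tosses."""
    # state[r] = probability that no full streak has happened yet and the
    # current run of consecutive losses has length r (0 <= r < streak).
    state = [1.0] + [0.0] * (streak - 1)
    for _ in range(n_tosses):
        new_state = [0.0] * streak
        for run_length, p in enumerate(state):
            new_state[0] += 0.5 * p                   # a win resets the run
            if run_length + 1 < streak:
                new_state[run_length + 1] += 0.5 * p  # a loss extends the run
            # otherwise the streak is completed, and that probability
            # leaves `state` for good
        state = new_state
    return 1.0 - sum(state)


# Hypothetical numbers: a streak of 5 lost tosses, over 5 and over 50 tosses.
p_exactly_five = prob_losing_streak(5, 5)    # all five tosses lost
p_within_fifty = prob_losing_streak(50, 5)   # a streak somewhere in 50 tosses

print(f"5 losses in 5 tosses:           {p_exactly_five:.3f}")
print(f"5-loss streak within 50 tosses: {p_within_fifty:.3f}")
```

Losing five specific tosses in a row is rare (1 in 32), but over a long run of 50 tosses a five-loss streak somewhere along the way is more likely than not. No conspiracy is needed.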
As statistics deals with randomness, much of the subject matter is often unintuitive.
Before I took my first statistics course, had you asked me, “How would you divide people into two groups which, on average, should be identical?” I would have said, “I would gather their information, start looking for pairs of people who are identical to each other, assign them to the two groups, and keep repeating the process.”
And then I took statistics and learned that the most effective way would be to rely on random assignment (e.g., coin flips)! 🤦🏾‍♂️
Indeed, the power of randomness to create seemingly organized outcomes is hard to fathom. Instructors should spend more time building a solid understanding of this crucial concept.
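A small simulation can make this less mysterious. The sketch below invents a population (the ages, incomes, and distributions are assumptions I made up for illustration), assigns each person to a group with a coin flip, and compares the group averages.

```python
import random
from statistics import mean

rng = random.Random(42)  # fixed seed so the sketch is reproducible

# Hypothetical population: 10,000 people with an age and an annual income.
people = [
    {"age": rng.uniform(18, 80), "income": rng.lognormvariate(10.5, 0.6)}
    for _ in range(10_000)
]

# Random assignment: one fair coin flip per person, ignoring their traits.
group_a, group_b = [], []
for person in people:
    (group_a if rng.random() < 0.5 else group_b).append(person)

print(f"group sizes: A = {len(group_a)}, B = {len(group_b)}")
for trait in ("age", "income"):
    mean_a = mean(p[trait] for p in group_a)
    mean_b = mean(p[trait] for p in group_b)
    print(f"{trait:>6}: group A = {mean_a:,.1f}, group B = {mean_b:,.1f}")
```

The coin never looks at anyone's age or income, yet the two groups should end up with very similar averages on both. The same holds for traits we did not measure, which is what the matching-by-hand approach cannot offer.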
2. Common sense does not work
Let’s say your instructor is teaching you how to interpret the results of a t-test.
In the example, they are trying to investigate whether there is a difference in annual income between two groups of people. After running the test, they conclude:
“The difference in the sample average annual income between the two groups is $100. However, this difference is not significantly different from 0 at the 5% level.”
Statements such as the one above can be challenging for a beginner.
First, the instructor says the difference in average income is $100, and then they say this $100 difference is not different from 0. The beginner wonders, how can the value of something simultaneously be $100 and not different from zero? 🤔
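One way to untangle the statement is to work through the arithmetic. The sketch below uses made-up summary statistics (only the $100 difference comes from the example; the standard deviations and sample sizes are my assumptions), and it uses a normal approximation for the p-value rather than the exact t distribution, which is a simplification.

```python
from math import sqrt
from statistics import NormalDist

# Hypothetical summary statistics; only the $100 difference is from the text.
mean_diff = 100.0          # difference in sample average annual income ($)
sd_a = sd_b = 5_000.0      # assumed sample standard deviations ($)
n_a = n_b = 50             # assumed sample sizes

# Standard error of the difference between two independent sample means.
std_error = sqrt(sd_a**2 / n_a + sd_b**2 / n_b)

# Test statistic for the null hypothesis that the population difference is 0.
t_stat = mean_diff / std_error

# Two-sided p-value, using the normal approximation to the t distribution.
p_value = 2 * (1 - NormalDist().cdf(abs(t_stat)))

# Approximate 95% confidence interval for the population difference.
z_crit = NormalDist().inv_cdf(0.975)
ci_low = mean_diff - z_crit * std_error
ci_high = mean_diff + z_crit * std_error

print(f"standard error = {std_error:,.0f}")
print(f"t = {t_stat:.2f}, p = {p_value:.2f}")
print(f"95% CI = ({ci_low:,.0f}, {ci_high:,.0f})")
```

With these assumed numbers, the $100 is the difference in the sample, while the standard error of that difference is $1,000. The interval of plausible values for the population difference therefore runs from well below zero to well above it. "Not significantly different from 0" is a statement about the population, not a denial that the sample difference is $100.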
Unfortunately, there is no easy way to correctly explain statistical jargon. I realized this while struggling to grasp such terms as a student, and even more while trying to explain them in my previous articles.
3. Statistical tests are “acausal”
The key purpose of statistical inference is to learn about the value of an unknown population parameter based on a sample estimate.
In contrast, causal inference requires counterfactual reasoning: asking what would have happened to the same units had they received a different treatment.
The problem is that statistical tests, such as the t-test, chi-squared test, and F-test, are designed for statistical inference and not for causal inference. Whether you can make valid causal claims based on the findings of your statistical tests depends on the data-generating process and the model you estimate.
Many instructors do not sufficiently explain this key issue. Consequently, students start making causal interpretations based on a finding’s statistical significance (i.e., whether the p-value is < 0.05).
For example, beginners often struggle to detect the difference between the following statements:
- A one-unit increase in X is associated with a 2.3-unit increase in Y
- A one-unit increase in X causes/leads to/generates a 2.3-unit increase in Y
Some instructors do mention that “correlation does not imply causation” but do not explain the concept well enough. Fortunately, these concepts can be fairly easily explained using some simple directed acyclic graphs.
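As a minimal sketch of what such a graph implies, the simulation below assumes a third variable Z that drives both X and Y, with no arrow from X to Y. The coefficients, the sample size, and the helper names `ols_slope` and `residuals` are illustrative choices of mine, picked so that the naive slope comes out near 2.3.

```python
import random
from statistics import mean


def ols_slope(x, y):
    """Slope from a simple least-squares regression of y on x."""
    mx, my = mean(x), mean(y)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sxx = sum((xi - mx) ** 2 for xi in x)
    return sxy / sxx


def residuals(v, z):
    """Part of v that a simple regression on z leaves unexplained."""
    b = ols_slope(z, v)
    a = mean(v) - b * mean(z)
    return [vi - (a + b * zi) for vi, zi in zip(v, z)]


rng = random.Random(0)
n = 20_000

# Assumed graph: Z -> X and Z -> Y, with no arrow from X to Y.
z = [rng.gauss(0, 1) for _ in range(n)]
x = [zi + rng.gauss(0, 1) for zi in z]
y = [4.6 * zi + rng.gauss(0, 1) for zi in z]

naive_slope = ols_slope(x, y)                                 # ignores Z
adjusted_slope = ols_slope(residuals(x, z), residuals(y, z))  # holds Z fixed

print(f"slope of Y on X, ignoring Z:      {naive_slope:.2f}")
print(f"slope of Y on X, adjusting for Z: {adjusted_slope:.2f}")
```

The first slope should land close to 2.3 and would easily pass a significance test, even though changing X does nothing to Y in this simulated world. Once Z is held fixed, the slope should shrink to roughly zero. The association statement is true here; the causal statement is false.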
4. Lost in formulae and software functions
As a student, I attended multiple graduate-level statistics courses in which many of my classmates were lost in mathematical formulae and in the mechanics of running statistical tests in R. Also, as a teaching assistant in an undergraduate-level consumer analytics course, I observed many students getting lost in Excel functions. Consequently, many students cannot focus on internalizing statistical reasoning, which should be the primary learning outcome in any introductory statistics course.
Of course, students need certain mathematical and software skills for a deeper understanding of statistics. Nevertheless, the core ideas of statistical reasoning can be explained using only toy examples.
Instructors should ensure that students spend most of their time building a solid understanding of fundamental concepts, rather than worrying about how to manually calculate the standard error of a coefficient under time pressure during an exam.
5. Fake datasets do not inspire
In many courses (especially online ones), instructors generate random numbers from statistical distributions to create variables and use these fake variables to explain how to conduct statistical tests.
I understand the value of this approach. First, generating fake datasets can help create simple examples. Second, the process is convenient because the instructor does not have to search for a publicly available dataset, which can be messy due to missing data. Third, the instructor, along with teaching a statistical method, can teach Monte Carlo simulation. Fourth, the steps of statistical tests do not change depending on the data-generating process. So, one may argue that student learning should not depend on whether the datasets are fake or real.
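To illustrate the third point, here is a minimal Monte Carlo sketch of the kind an instructor might build from fake data. Everything in it is assumed for illustration: the income distribution, the sample sizes, the number of simulations, and the use of a large-sample normal cutoff instead of the exact t distribution.

```python
import random
from math import sqrt
from statistics import NormalDist, mean, variance

rng = random.Random(1)
z_crit = NormalDist().inv_cdf(0.975)  # two-sided 5% cutoff, normal approximation


def fake_sample(n):
    """n fake incomes drawn from one assumed normal distribution."""
    return [rng.gauss(50_000, 5_000) for _ in range(n)]


def rejects_null(a, b):
    """True if a large-sample two-sample test rejects 'no difference' at 5%."""
    std_error = sqrt(variance(a) / len(a) + variance(b) / len(b))
    return abs(mean(a) - mean(b)) / std_error > z_crit


# Both groups come from the same distribution, so every rejection is a false alarm.
n_sims = 2_000
false_alarms = sum(
    rejects_null(fake_sample(100), fake_sample(100)) for _ in range(n_sims)
)
false_alarm_rate = false_alarms / n_sims

print(f"share of simulations that reject: {false_alarm_rate:.3f}")
```

Because the instructor controls the data-generating process, the right answer is known in advance: there is no true difference, so the share of rejections should hover around 0.05. That is what "the 5% level" means in practice.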
However, in the context of introductory statistics courses, I argue that using real-world datasets, alongside fake datasets for toy examples, can enhance the typical student’s learning.
Personally, I feel more empowered when I learn how to conduct a statistical test using a real-world dataset. Why? Because as a student of applied social sciences, I enrolled in a statistics course not out of love for statistical distributions and tests, but to learn methods that help me investigate real-world phenomena. Working with a real-world dataset makes me feel like I am progressing in the right direction.
Wrapping Up
All of us are consumers of data products, whether or not we become data analysts, so a solid understanding of statistical reasoning can help us become critical thinkers and better decision-makers.
For many students, the introductory statistics course is both the first and the last time they receive training in describing and analyzing data. Consequently, instructors of these courses should prioritize the following:
- Appreciate that concepts that seem obvious to them (e.g., the consequences of random sampling and assignment, or the central limit theorem) may not seem obvious to a beginner.
- Help students build a solid understanding of statistical reasoning using toy examples.
- Save students from getting lost in formulae and software functions.
- Use graphs to explain under what circumstances statistical association between two variables can be interpreted as a causal effect.
- Empower and inspire students by teaching statistical analyses using real-world datasets.
