Understanding Regression Output: A Lesson for Absolute Beginners (Part 1)
Linear regression is one of the most powerful weapons in the arsenal of a data analyst. Although the core concept of linear regression seems intuitive to anyone who understands the equation of a straight line (y = b + mx), the inferential parts of regression can seem quite complicated to anyone who has not yet taken an inferential statistics course.
Thanks to the wide availability of statistical software packages, estimating an empirical model using regression is no longer a challenge; however, correctly interpreting the results remains crucial, as it is easy to get them entirely (or partially) wrong.
In this two-part article, I will explain how to interpret different parts of the regression output. In this first part, I discuss:
1. How to interpret the coefficients of the predictors of a model estimated using regression
2. How to identify whether the coefficients are statistically significant
Data and Background
Using the Wine Quality Dataset provided by Cortez et al. (2009), I estimate a model to predict the quality of red wine based on its alcohol content, pH level, and density.
Model and Basic Ideas
We estimate the following model:
quality = b0 + b1*alcohol + b2*pH + b3*density + u
Here, b0 is the intercept term. It is the expected value (i.e., the value we would find on average) of quality when all three predictors in the model equal 0. In this example, the intercept is not very useful: no observation in the dataset has any of the three predictors equal to 0, and if alcohol = 0, the liquid cannot be red wine, right? 😊 Regardless, the intercept can be useful in other contexts.
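The intercept-as-prediction-at-zero idea is easy to verify yourself. Here is a minimal sketch with simulated data (not the wine dataset): predicting the outcome with all predictors set to 0 returns exactly the intercept.

```r
# Simulate a small dataset and fit a two-predictor linear model
set.seed(7)
d <- data.frame(a = runif(50), b = runif(50))
d$y <- 5 + 2 * d$a - d$b + rnorm(50)
fit <- lm(y ~ a + b, data = d)

# The prediction at a = 0, b = 0 equals the estimated intercept
pred0 <- predict(fit, newdata = data.frame(a = 0, b = 0))
all.equal(unname(pred0), unname(coef(fit)["(Intercept)"]))  # TRUE
```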
Next, b1, b2, and b3 are the estimated coefficients of alcohol, pH, and density, respectively. They indicate the strength of the association between each of the predictors and the outcome.
Finally, u is the error term. Theoretically, u captures all other determinants of quality that are not included in the model (i.e., everything other than alcohol, pH, and density).
In R, I write the following code:
winequality.red <- read.csv("winequality-red.csv", sep = ";")  # the UCI file is semicolon-separated
m1 <- lm(quality ~ alcohol + pH + density, data = winequality.red)
summary(m1)
Estimating a model using regression takes just a couple of lines of code! Isn’t it easy? 😀
The Output
First, please notice that the intercept and each predictor's coefficient come with an Estimate, a Standard Error, a t-value, and a p-value (Pr(>|t|)).
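You can pull these four columns out of any fitted model yourself. A minimal sketch with simulated data (so it runs even without the wine CSV on disk): the coefficient table lives in `summary(fit)$coefficients`.

```r
# Fit a simple model on simulated data and inspect its coefficient table
set.seed(1)
df <- data.frame(x = rnorm(100))
df$y <- 1 + 2 * df$x + rnorm(100)
fit <- lm(y ~ x, data = df)

coefs <- summary(fit)$coefficients
colnames(coefs)  # columns: Estimate, Std. Error, t value, Pr(>|t|)
```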
Generally, we are interested in the “statistical significance” of the estimated coefficients of the predictors. For that purpose, we can use either the t-value or the p-value. I will use the p-value approach in this article as I find it more intuitive.
Question #1: What does the coefficient of alcohol = 0.39704 mean?
Answer: It means that holding the values of the other predictors (i.e., pH and density) constant, a one-unit increase in alcohol is, on average, associated with a 0.39704-unit increase in the quality of red wine. More concisely, we can say that alcohol level has a positive association with the quality of red wine.
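This "holding the other predictors constant" interpretation can be demonstrated directly. A minimal sketch with simulated data (not the wine dataset): compare predictions for two hypothetical observations that differ by exactly one unit in one predictor; the difference in predictions equals that predictor's coefficient.

```r
# Simulate data with two predictors and fit a linear model
set.seed(42)
df <- data.frame(x1 = rnorm(200), x2 = rnorm(200))
df$y <- 3 + 0.4 * df$x1 - 0.2 * df$x2 + rnorm(200)
fit <- lm(y ~ x1 + x2, data = df)

# Two rows that differ by one unit of x1, with x2 held constant
newdata <- data.frame(x1 = c(10, 11), x2 = c(3.3, 3.3))
preds <- predict(fit, newdata = newdata)

# The change in the prediction equals the coefficient of x1
all.equal(unname(diff(preds)), unname(coef(fit)["x1"]))  # TRUE (up to numerical tolerance)
```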
Question #2: Looking at the regression output, do you think the coefficient of alcohol is “statistically significant”?
Answer: Statistical significance is tied to a hypothesis test. A hypothesis test helps us infer the value of a population parameter from its sample estimate.
Population parameter = true coefficient of a predictor in the population
Sample estimate = estimate of the population parameter based on sample data
To decide on the statistical significance, we need two things:
1) Hypothesized coefficient value: We need to have a hypothesized value of the coefficient of the predictor in the population. In the R output, the hypothesized value of each coefficient is 0. For each of these coefficients, we are testing the null hypothesis (the default hypothesis) that there is no association (coefficient = 0) between the predictor and the outcome in the population.
2) Significance Level: Typically, we select a 5% significance level (but we could select 0.1%, 1%, 10%, etc.). The choice is conventional rather than mathematically dictated.
The p-value expresses the probability of finding a test statistic (t-statistic) at least as extreme as the one obtained (e.g., 20.996 for alcohol), given that the null hypothesis is true (e.g., the true coefficient of alcohol in the population = 0).
If the p-value < 0.05, we reject the null hypothesis that the true population coefficient is 0 (and conclude: the coefficient is not 0 → there is some association between the predictor and the outcome → the coefficient is statistically significant)
On the contrary, if the p-value ≥ 0.05, we fail to reject the null hypothesis that the true population coefficient is 0 (and conclude: we cannot rule out that the coefficient is 0 → we have no evidence of an association between the predictor and the outcome → the coefficient is statistically insignificant). Note that failing to reject the null is not the same as proving the coefficient is 0.
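The p-value that R reports can be reproduced from the t-statistic with base R's `pt()` function. A minimal sketch, assuming the red wine data has 1599 observations, so the model (an intercept plus three predictors) has 1599 − 4 residual degrees of freedom:

```r
# Two-sided p-value for t = 20.996 under the null that the coefficient is 0
t_stat <- 20.996
df_resid <- 1599 - 4  # assumed n = 1599 observations, 4 estimated coefficients

p_value <- 2 * pt(-abs(t_stat), df = df_resid)
p_value < 0.05  # TRUE -> reject the null at the 5% level
```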
In the output, we see that for the hypothesis test that the true population coefficient of alcohol = 0, the p-value is < 2e-16, which is definitely smaller than 0.05.
This result indicates that if the true population coefficient of alcohol were 0, the probability of finding a t-statistic at least as extreme as 20.996 would be less than 0.0000000000000002. Such a small p-value provides very strong evidence against the null hypothesis.
Consequently, we reject the null hypothesis that the population coefficient = 0 (no association) and conclude that alcohol level has a significant positive association with the quality of red wine.
Please note that statistical significance does not necessarily imply real-world significance. I explain it in detail in the following article:
Based on our above discussion, can you interpret the coefficients of the other two predictors (i.e., pH and density)? Are they statistically significant? Hopefully, you can answer both questions! 😊
Before we finish, you may wonder: “Does the positive association between alcohol level and the quality of wine imply that alcohol level has a positive effect on the wine quality?”
Great question! Although the positive association may be evidence of a positive effect, our current model is (possibly) insufficient to make any causal conclusion. For inferring causality, we need to worry about confounding variables (which requires subject matter knowledge of the causal relationships). For a quick introduction to confounding variables, you can read my previous article:
In this article’s second and final part, I will discuss how to interpret some other important things shown in the R regression output. Here is the link to the second part:
Data Courtesy: P. Cortez, A. Cerdeira, F. Almeida, T. Matos and J. Reis. Modeling wine preferences by data mining from physicochemical properties. Decision Support Systems, Elsevier, 47(4):547–553, 2009.