Understanding Regression Output: A Lesson for Absolute Beginners (Part 2)

Vivekananda Das
3 min read · Dec 25, 2020


Photo by lilartsy on Unsplash

In this second part, we are going to discuss how to interpret the following three values shown at the bottom of the regression output in R:

(1) Multiple R-squared,

(2) Adjusted R-squared, and

(3) F-statistic

If you missed the first part of this article (which discusses how to interpret the coefficients of predictors and their statistical significance), here is the link:

Here is a quick recap! We are trying to predict the quality of red wine depending on the alcohol level, pH level, and density. We estimate the following model:

quality = b0 + b1*alcohol + b2*pH + b3*density + u

The Output

Multiple R-squared

The multiple R-squared value of 0.2528 means that our model with three predictors (alcohol, pH, and density) explains 25.28% of the total variation in wine quality.

The R-squared value ranges from 0 to 1, where 0 indicates a model that explains none of the total variation in the outcome (the worst predictive model), and 1 indicates a model that explains all of it (the best predictive model).
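To make the idea concrete, here is a minimal sketch of the standard R-squared formula, 1 minus the ratio of unexplained variation to total variation. The numbers below are made up for illustration; they are not the wine data.

```python
# Sketch of the R-squared computation:
# R2 = 1 - SS_residual / SS_total
# (share of total variation in the outcome that the model explains)

def r_squared(actual, predicted):
    mean_y = sum(actual) / len(actual)
    # total variation in the outcome around its mean
    ss_tot = sum((y - mean_y) ** 2 for y in actual)
    # variation left unexplained by the model's predictions
    ss_res = sum((y - p) ** 2 for y, p in zip(actual, predicted))
    return 1 - ss_res / ss_tot

# toy example (hypothetical values, not the wine dataset)
actual = [5, 6, 5, 7, 6]
predicted = [5.2, 5.8, 5.4, 6.6, 6.0]
print(round(r_squared(actual, predicted), 4))  # → 0.8571
```

A model whose predictions equal the outcome's mean for every observation would get an R-squared of 0; perfect predictions would get 1.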

The key issue with the R-squared value is that as we keep adding predictors, it never decreases; it typically rises even when the new predictors add nothing to the predictive ability of the model.

This results in an overfitted model: a model that fits the idiosyncrasies of the sample rather than the true underlying relationship. Such a model may appear to predict the outcome quite accurately on the sample data but performs poorly on out-of-sample data.

Adjusted R-squared

The adjusted R-squared value increases with the addition of a new predictor only if that predictor improves the predictive ability of the model beyond the improvement expected by chance. By penalizing the inclusion of new predictors, the adjusted R-squared tries to fix the key issue with the R-squared value.

The interpretation of the adjusted R-squared value is similar to that of the R-squared value. In our example, the adjusted R-squared value of 0.2513 means that our model with three predictors (alcohol, pH, and density) explains 25.13% of the total variation in wine quality.

Notice that in our example the adjusted R-squared value is slightly smaller than the R-squared value. This is no accident: the adjusted R-squared is never larger than the R-squared for a model with at least one predictor.
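The adjustment itself is a simple formula involving the sample size n and the number of predictors k. As a sketch, here it is applied to our example's R-squared. Note that n = 1599 is an assumption on my part (the size of the red-wine dataset); the article does not state it.

```python
# Adjusted R-squared:
# adj_R2 = 1 - (1 - R2) * (n - 1) / (n - k - 1)
# n = number of observations, k = number of predictors.

def adjusted_r_squared(r2, n, k):
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)

# Assumption: n = 1599 (red-wine dataset size), k = 3 predictors
print(round(adjusted_r_squared(0.2528, 1599, 3), 4))  # → 0.2514
```

The result agrees with the output's 0.2513 up to rounding (we fed in the R-squared already rounded to four decimals).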

F-statistic

The F-statistic, right at the bottom of the R output, refers to a hypothesis test called the "overall F-test." In this test, the null hypothesis says: all of the slope coefficients (that is, every coefficient except the intercept) are equal to zero.

In the context of our example, the null hypothesis would be: none of the three predictors (alcohol level, pH level, or density) has any association with the quality of wine.

If the p-value < 0.05, we reject the null hypothesis

If the p-value >= 0.05, we fail to reject the null hypothesis

We reject the null hypothesis because the output reports a p-value smaller than 2.2e-16 (the smallest value R prints), which is obviously far below 0.05.

By rejecting the null hypothesis, we can claim that at least one of the three predictors has a statistically significant association with the outcome. However, this test does not tell us which of the predictors has a significant association with the outcome.
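For the curious, the overall F-statistic can be recovered from the R-squared value alone, given the sample size and the number of predictors. Here is a sketch, again assuming n = 1599 observations (not stated in the article) and k = 3 predictors.

```python
# Overall F-test statistic from R-squared:
# F = (R2 / k) / ((1 - R2) / (n - k - 1))
# Large F (and a tiny p-value) means at least one slope is nonzero.

def overall_f(r2, n, k):
    return (r2 / k) / ((1 - r2) / (n - k - 1))

# Assumption: n = 1599, k = 3, R2 = 0.2528 as in the output
print(round(overall_f(0.2528, 1599, 3), 1))  # → 179.9
```

Under the null hypothesis this statistic follows an F distribution with k and n - k - 1 degrees of freedom, which is where the reported p-value comes from.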

If we are interested in the statistical significance of each individual predictor, we can look at the coefficients part of the R output and take the p-value approach described in Part 1 of this article.

Data Courtesy: P. Cortez, A. Cerdeira, F. Almeida, T. Matos and J. Reis.
Modeling wine preferences by data mining from physicochemical properties. In Decision Support Systems, Elsevier, 47(4):547–553, 2009.


Vivekananda Das

Sharing synthesized ideas on Data Analysis in R, Data Literacy, Causal Inference, and Well-being | Assistant Prof @ UUtah | More: https://vivekanandadas.com