10. Visualizing Linear Regression
Visualizing linear regression and its assumptions is a crucial step in understanding the model’s performance and ensuring that the assumptions behind linear regression are met. One of the most common visualizations is the scatter plot of the data along with the regression line. This helps to observe the relationship between the independent variable (X) and the dependent variable (Y). The regression line represents the predicted values of Y for given values of X, while the scatter plot shows the actual data points. Ideally, the points should cluster around the line, indicating a linear relationship between the variables.
To check the assumptions of linear regression, various diagnostic plots are used:
Linearity: This assumption requires that the relationship between the independent and dependent variables is linear. You can verify it using a residuals vs. fitted values plot. If the residuals (differences between observed and predicted values) are randomly scattered around zero, this suggests that the linearity assumption holds.
Homoscedasticity: This refers to the constant variance of residuals across all levels of the independent variable. Again, a residuals vs. fitted values plot helps—residuals should appear randomly distributed without forming any discernible pattern, such as a funnel shape.
Independence: Residuals should be independent of each other, meaning that the error terms are not correlated. A plot of residuals against time (if the data has a time structure) can help check this assumption.
Normality of Residuals: This assumption implies that the residuals follow a normal distribution. A normal Q-Q plot of the residuals is typically used, where the points should roughly follow a straight diagonal line if the residuals are normally distributed.
These visual checks are essential in diagnosing any violations of the linear regression assumptions, which, if unaddressed, could lead to inaccurate predictions or misleading interpretations of the model.
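As a numeric companion to these visual checks, the sketch below computes residuals as observed minus predicted values and then the Durbin-Watson statistic, which also appears in the regression summary later in this recipe. The data, the seed and all variable names are illustrative assumptions, and `np.polyfit` stands in for the OLS fit used later.

```python
import numpy as np

# Synthetic data: a straight line plus independent normal noise (illustrative values).
rng = np.random.default_rng(0)
x = np.linspace(0, 15, 400)
y = 2 * x + rng.normal(scale=2, size=x.size)

# Least-squares line; np.polyfit returns the slope first, then the intercept.
slope, intercept = np.polyfit(x, y, 1)
fitted = intercept + slope * x
residuals = y - fitted  # observed minus predicted

# Durbin-Watson statistic: values near 2 suggest no first-order autocorrelation.
durbin_watson = np.sum(np.diff(residuals) ** 2) / np.sum(residuals ** 2)
print(f"mean residual: {residuals.mean():.2e}, Durbin-Watson: {durbin_watson:.2f}")
```

A value well below 2 would point to positive autocorrelation between neighbouring residuals, and a value well above 2 to negative autocorrelation.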
In addition to the basic diagnostic plots, Scale-Location and Residuals vs. Leverage plots offer deeper insights into potential issues with linear regression models.
The Scale-Location plot helps to check the assumption of homoscedasticity (constant variance of residuals). This plot shows the square root of the absolute residuals (on the y-axis) versus the fitted values (on the x-axis). The goal is to check if the residuals are spread equally across all levels of the fitted values. Ideally, the points should be randomly scattered around a horizontal line without any clear pattern. If you see a funnel shape (i.e., residuals become more spread out as the fitted values increase), it indicates heteroscedasticity, meaning that the variance of the residuals changes with the fitted values, violating the homoscedasticity assumption. Corrective measures, such as transforming the dependent variable (e.g., log or square-root transformations), may be necessary to fix this.
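To make the funnel pattern concrete, here is a minimal sketch that builds a deliberately heteroscedastic data set and computes the quantity shown on the Scale-Location y-axis. The noise model (standard deviation growing with x) and all names are assumptions chosen for illustration. The standardized residuals are computed by hand for a simple regression.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 400
x = np.linspace(0, 15, n)
# Noise whose standard deviation grows with x: a deliberately heteroscedastic example.
y = 2 * x + rng.normal(size=n) * (0.2 + 0.5 * x)

slope, intercept = np.polyfit(x, y, 1)
fitted = intercept + slope * x
residuals = y - fitted

# Leverage for simple regression, then internally studentized (standardized) residuals.
leverage = 1 / n + (x - x.mean()) ** 2 / np.sum((x - x.mean()) ** 2)
s = np.sqrt(np.sum(residuals ** 2) / (n - 2))  # residual standard error
standardized = residuals / (s * np.sqrt(1 - leverage))

# The Scale-Location y-axis, summarized for the lower and upper half of the fitted values.
scale = np.sqrt(np.abs(standardized))
lower_half = scale[fitted <= np.median(fitted)].mean()
upper_half = scale[fitted > np.median(fitted)].mean()
print(f"lower half: {lower_half:.2f}, upper half: {upper_half:.2f}")
```

With constant-variance noise the two averages would be close to each other. Here the upper half is clearly larger, which is the numeric counterpart of the funnel shape.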
The Residuals vs. Leverage plot is used to identify influential data points that can disproportionately affect the regression model. Leverage measures how far an observation's predictor values (x-values) lie from those of the other observations, and therefore how much potential it has to pull the fitted line toward itself. In this plot, the standardized residuals (y-axis) are plotted against the leverage values (x-axis). High leverage alone is not a problem. A point becomes influential when it combines high leverage with a large residual, because it can then shift the estimated coefficients noticeably. Such points lie far from the bulk of the data in this plot and are often highlighted using Cook’s distance contours, which quantify their influence. If any points fall beyond these contours, investigate them closely, as they may be outliers or points that have a large influence on the model’s coefficients.
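The relationship between leverage, standardized residuals and Cook's distance can be sketched with plain numpy. In the example below one hypothetical observation is appended far from the other x-values and off the trend. The data and names are assumptions for illustration, and the formulas are the textbook ones for OLS rather than a call into any particular library.

```python
import numpy as np

rng = np.random.default_rng(2)
x = np.linspace(0, 15, 50)
y = 2 * x + rng.normal(scale=2, size=x.size)
# Append one hypothetical observation far from the other x-values and off the trend.
x = np.append(x, 40.0)
y = np.append(y, 20.0)

X = np.column_stack([np.ones_like(x), x])  # design matrix with an intercept column
n, p = X.shape
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
residuals = y - X @ beta

# Leverage: diagonal of the hat matrix H = X (X'X)^-1 X'.
leverage = np.einsum("ij,jk,ik->i", X, np.linalg.inv(X.T @ X), X)
s2 = residuals @ residuals / (n - p)
standardized = residuals / np.sqrt(s2 * (1 - leverage))

# Cook's distance combines the standardized residual with the leverage.
cooks_d = standardized ** 2 / p * leverage / (1 - leverage)
most_influential = int(np.argmax(cooks_d))
print(most_influential, round(float(leverage[most_influential]), 3))
```

The leverages always sum to the number of estimated parameters, so a single value far above the average `p / n` is a quick sign of an unusual x-position.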
Both of these plots offer critical information about model diagnostics and help to identify where assumptions of linear regression may be violated, allowing for potential corrective measures.
Getting ready
In addition to plotly, numpy and pandas, make sure the scipy Python library is available in your Python environment. The OLS trend lines and the QQ plot used in this recipe also rely on statsmodels.
You can install scipy using the command:
pip install scipy
For this recipe we will create a data set.
Import the Python modules numpy and pandas. Import the norm object from scipy.stats. This object will allow us to generate random samples from a normal distribution, which we use to create the data set for this recipe.
import numpy as np
import pandas as pd
from scipy.stats import norm
Create the data set to be used in this recipe
n = 400
x = np.linspace(0, 15, n)
epsilon = norm().rvs(n)
sigma = 2
y = 2*x + sigma*epsilon
data = pd.DataFrame({'x':x, 'y':y})
# Alternative data set (left commented out): a cubic relationship with much noisier errors
# n = 200
# x = np.linspace(0, 15, n)
# epsilon = norm(loc=20, scale=100).rvs(n)
# y = 0.5*x**3 + epsilon -10
# data2 = pd.DataFrame({'x':x, 'y':y})
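The data set above is drawn without a fixed seed, so every run produces slightly different numbers and plots. If reproducibility matters, one option is a seeded numpy generator, as in this minimal sketch. The seed value and the use of `np.random.default_rng` instead of `scipy.stats.norm` are assumptions of the sketch, not part of the recipe.

```python
import numpy as np

# Seeded generator so the noise, and therefore every plot, is the same on each run.
rng = np.random.default_rng(42)  # the seed value 42 is arbitrary

n = 400
sigma = 2
x = np.linspace(0, 15, n)
y = 2 * x + sigma * rng.standard_normal(n)

# Sanity check: the fitted slope should be close to the true slope of 2.
slope, intercept = np.polyfit(x, y, 1)
print(f"slope: {slope:.3f}, intercept: {intercept:.3f}")
```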
How to do it
Import the plotly.express module as px
import plotly.express as px
df = data
Diagnose the linearity assumption with a scatter plot between the two variables
fig = px.scatter(df, x='x', y='y',
                 trendline_color_override="red",
                 trendline="ols",
                 height=600, width=800,
                 title='Scatter with OLS trend line')
fig.show()
Retrieve the linear regression results from the Figure object
results_table = px.get_trendline_results(fig)
results = results_table['px_fit_results'][0]
results.summary()
| Dep. Variable: | y | R-squared: | 0.953 |
|---|---|---|---|
| Model: | OLS | Adj. R-squared: | 0.953 |
| Method: | Least Squares | F-statistic: | 8059. |
| Date: | Sun, 16 Feb 2025 | Prob (F-statistic): | 2.95e-266 |
| Time: | 21:32:30 | Log-Likelihood: | -831.67 |
| No. Observations: | 400 | AIC: | 1667. |
| Df Residuals: | 398 | BIC: | 1675. |
| Df Model: | 1 | | |
| Covariance Type: | nonrobust | | |

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| const | 0.0079 | 0.194 | 0.041 | 0.967 | -0.373 | 0.389 |
| x1 | 2.0061 | 0.022 | 89.773 | 0.000 | 1.962 | 2.050 |

| Omnibus: | 1.008 | Durbin-Watson: | 1.918 |
|---|---|---|---|
| Prob(Omnibus): | 0.604 | Jarque-Bera (JB): | 1.112 |
| Skew: | 0.103 | Prob(JB): | 0.574 |
| Kurtosis: | 2.844 | Cond. No. | 17.5 |
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
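To see where the main entries of the summary come from, the following sketch recomputes the coefficients, their standard errors, the t-values and R-squared with plain numpy. It generates its own seeded data with the same structure as the recipe's data set, so the numbers will be close to, but not identical to, those in the table above. All names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 400
x = np.linspace(0, 15, n)
y = 2 * x + 2 * rng.standard_normal(n)

X = np.column_stack([np.ones(n), x])          # columns: const, x1
beta, *_ = np.linalg.lstsq(X, y, rcond=None)  # the "coef" column
residuals = y - X @ beta
df_resid = n - X.shape[1]                     # "Df Residuals"

# "std err" and "t": from the residual variance and (X'X)^-1.
sigma2 = residuals @ residuals / df_resid
std_err = np.sqrt(sigma2 * np.diag(np.linalg.inv(X.T @ X)))
t_values = beta / std_err

# "R-squared": share of the variance of y explained by the line.
r_squared = 1 - residuals @ residuals / np.sum((y - y.mean()) ** 2)
print(beta.round(4), std_err.round(3), t_values.round(2), round(float(r_squared), 3))
```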
Obtain the fitted values and the residuals from the results
fitted = results.fittedvalues
residuals = results.resid
To diagnose the homoscedasticity assumption, make a scatter plot of the residuals against the fitted values
fig = px.scatter(x=fitted, y=residuals,
                 trendline_color_override="red",
                 trendline="ols",
                 height=600, width=800,
                 title='Residuals vs Fitted Plot')
fig.show()
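One property is worth keeping in mind when reading the red line in this plot: for a model fitted by OLS with an intercept, the residuals are uncorrelated with the fitted values, so an OLS trend line through them is flat at zero by construction. The sketch below illustrates this with numpy on seeded data of the same form as the recipe's. The names and the seed are assumptions.

```python
import numpy as np

rng = np.random.default_rng(4)
x = np.linspace(0, 15, 400)
y = 2 * x + 2 * rng.standard_normal(x.size)

slope, intercept = np.polyfit(x, y, 1)
fitted = intercept + slope * x
residuals = y - fitted

# Regress the residuals on the fitted values, as an OLS trend line would.
trend_slope, trend_intercept = np.polyfit(fitted, residuals, 1)
print(f"trend slope: {trend_slope:.1e}, trend intercept: {trend_intercept:.1e}")
```

Because of this, a nonparametric smoother can reveal curvature that a straight trend line cannot. plotly.express accepts `trendline="lowess"` for that purpose, which may be worth trying as an alternative here.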
Calculate the square root of the absolute standardized residuals
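# Influence diagnostics for the fitted model
influence = results.get_influence()
# Internally studentized residuals, i.e. the standardized residuals
residual_norm = influence.resid_studentized_internal
residual_norm_abs_sqrt = np.sqrt(np.abs(residual_norm))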
To diagnose homoscedasticity, make a Scale-Location plot using a scatter between the fitted values and the square root of the absolute standardized residuals
fig = px.scatter(x=fitted, y=residual_norm_abs_sqrt,
                 trendline_color_override="red",
                 trendline="ols",
                 height=600, width=800,
                 title='Scale-Location Plot')
# Unicode label: a bare '\sqrt{...}' string without '$...$' delimiters would be shown literally
fig.update_layout(xaxis_title="Fitted values", yaxis_title="√|Standardized Residuals|")
fig.show()
Obtain the leverage from the influence object by using hat_matrix_diag
leverage = influence.hat_matrix_diag
# cooks_distance = influence.cooks_distance[0]
# nparams = len(results.params)
# nresids = len(residual_norm)
Make a Residuals vs Leverage plot by making a scatter between the leverage and the standardized residuals
fig = px.scatter(x=leverage, y=residual_norm,
                 height=600, width=800,
                 trendline_color_override="red",
                 trendline="ols",
                 title='Residual vs Leverage Plot')
fig.update_layout(xaxis_title="Leverage", yaxis_title="Standardized Residuals")
fig.show()
To diagnose the normality assumption, create a QQ plot comparing the theoretical quantiles against the sample quantiles
from statsmodels.graphics.gofplots import ProbPlot
QQ = ProbPlot(residual_norm)
theoretical_quantiles = QQ.theoretical_quantiles
sample_quantiles = QQ.sample_quantiles
fig = px.scatter(x=theoretical_quantiles, y=sample_quantiles,
                 height=600, width=800,
                 title='Normal QQ Plot')
# Reference line y = x: standardized residuals from a normal distribution should fall close to it
fig.add_traces(px.line(x=theoretical_quantiles, y=theoretical_quantiles,
                       color_discrete_sequence=["red"]).data)
fig.update_layout(xaxis_title="Theoretical Quantiles", yaxis_title="Standardized Residuals")
fig.show()
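For readers curious about what the two sets of quantiles in the QQ plot are, the sketch below builds them by hand with numpy and the standard library. The plotting positions `(i - 0.5) / n` are one common convention and an assumption of this sketch, so they may differ slightly from what `ProbPlot` uses. The sample is a seeded stand-in for the standardized residuals.

```python
import numpy as np
from statistics import NormalDist

rng = np.random.default_rng(5)
sample = rng.standard_normal(400)  # stand-in for the standardized residuals

n = sample.size
# Plotting positions (i - 0.5) / n: one common convention, assumed here.
probabilities = (np.arange(1, n + 1) - 0.5) / n
standard_normal = NormalDist()
theoretical_quantiles = np.array([standard_normal.inv_cdf(float(p)) for p in probabilities])
sample_quantiles = np.sort(sample)

# For normal data the two sets of quantiles are almost perfectly correlated.
qq_correlation = np.corrcoef(theoretical_quantiles, sample_quantiles)[0, 1]
print(f"QQ correlation: {qq_correlation:.4f}")
```

Plotting `sample_quantiles` against `theoretical_quantiles` gives the same kind of picture as the figure above. Heavy tails or skewness show up as systematic departures from the diagonal at the ends.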