10. Visualizing Linear Regression

Visualizing linear regression and its assumptions is a crucial step in understanding the model’s performance and ensuring that the assumptions behind linear regression are met. One of the most common visualizations is the scatter plot of the data along with the regression line. This helps to observe the relationship between the independent variable (X) and the dependent variable (Y). The regression line represents the predicted values of Y for given values of X, while the scatter plot shows the actual data points. Ideally, the points should cluster around the line, indicating a linear relationship between the variables.
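As a minimal sketch of what the regression line and the scatter represent numerically, the snippet below fits a straight line with NumPy and computes the predicted values and residuals. The toy data, the seed, and the variable names are assumptions made for illustration; they are not the data set built later in this recipe.

```python
import numpy as np

# Hypothetical toy data: a linear trend plus Gaussian noise.
rng = np.random.default_rng(0)
x = np.linspace(0, 15, 50)
y = 2 * x + rng.normal(scale=2, size=x.size)

# Least-squares fit of a degree-1 polynomial; polyfit returns [slope, intercept].
slope, intercept = np.polyfit(x, y, deg=1)

# Points on the regression line (predictions) and the vertical
# distances from each observation to that line (residuals).
y_hat = intercept + slope * x
residuals = y - y_hat

print(f"slope={slope:.3f}, intercept={intercept:.3f}")
```

The scatter plot shows the pairs (x, y), while the regression line passes through the pairs (x, y_hat).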

To check the assumptions of linear regression, various diagnostic plots are used:

Linearity: This assumption requires that the relationship between the independent and dependent variables is linear. You can verify it using a residuals vs. fitted values plot. If the residuals (differences between observed and predicted values) are randomly scattered around zero, this suggests that the linearity assumption holds.
Homoscedasticity: This refers to the constant variance of residuals across all levels of the independent variable. Again, a residuals vs. fitted values plot helps—residuals should appear randomly distributed without forming any discernible pattern, such as a funnel shape.
Independence: Residuals should be independent of each other, meaning that the error terms are not correlated. A plot of residuals against time (if the data has a time structure) can help check this assumption.
Normality of Residuals: This assumption implies that the residuals follow a normal distribution. A normal Q-Q plot of the residuals is typically used, where the points should roughly follow a straight diagonal line if the residuals are normally distributed.

These visual checks are essential in diagnosing any violations of the linear regression assumptions, which, if unaddressed, could lead to inaccurate predictions or misleading interpretations of the model.
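The independence check can also be summarized with a single number. The sketch below computes the Durbin-Watson statistic (the same quantity that appears in an OLS summary table) directly from its definition. The helper name durbin_watson_stat and the simulated error series are assumptions for illustration only.

```python
import numpy as np

def durbin_watson_stat(residuals):
    """Sum of squared successive differences divided by the sum of squares."""
    e = np.asarray(residuals, dtype=float)
    return np.sum(np.diff(e) ** 2) / np.sum(e ** 2)

rng = np.random.default_rng(1)
independent = rng.normal(size=400)  # uncorrelated errors

# Hypothetical AR(1) errors: each value carries over 80% of the previous one.
correlated = np.empty(400)
correlated[0] = independent[0]
for t in range(1, 400):
    correlated[t] = 0.8 * correlated[t - 1] + independent[t]

dw_ind = durbin_watson_stat(independent)
dw_cor = durbin_watson_stat(correlated)
print(f"independent: {dw_ind:.2f}, correlated: {dw_cor:.2f}")
```

Values near 2 are consistent with uncorrelated residuals, while values well below 2 point to positive autocorrelation.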

In addition to the basic diagnostic plots, Scale-Location and Residuals vs. Leverage plots offer deeper insights into potential issues with linear regression models.

The Scale-Location plot helps to check the assumption of homoscedasticity (constant variance of residuals). This plot shows the square root of the absolute residuals (on the y-axis) versus the fitted values (on the x-axis). The goal is to check if the residuals are spread equally across all levels of the fitted values. Ideally, the points should be randomly scattered around a horizontal line without any clear pattern. If you see a funnel shape (i.e., residuals become more spread out as the fitted values increase), it indicates heteroscedasticity, meaning that the variance of the residuals changes with the fitted values, violating the homoscedasticity assumption. Corrective measures, such as transforming the dependent variable (e.g., log or square-root transformations), may be necessary to fix this.
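To make the funnel shape and the effect of a transformation concrete, here is a small sketch under stated assumptions: the data are simulated with multiplicative noise, and spread_ratio is a hypothetical helper that compares the residual spread in the upper and lower halves of the fitted values. It is a rough numeric stand-in for what the Scale-Location plot shows visually, not a formal test.

```python
import numpy as np

def spread_ratio(x, y):
    """Fit y ~ x by least squares and return the residual standard deviation
    in the upper half of the fitted values divided by that in the lower half.
    A value near 1 is what constant variance would produce."""
    slope, intercept = np.polyfit(x, y, deg=1)
    fitted = intercept + slope * x
    resid = y - fitted
    upper = fitted > np.median(fitted)
    return resid[upper].std() / resid[~upper].std()

# Hypothetical data: noise multiplies the signal, so the spread grows with x.
rng = np.random.default_rng(2)
x = np.linspace(1, 10, 400)
y = 2 * x * np.exp(0.3 * rng.normal(size=x.size))

ratio_raw = spread_ratio(x, y)                  # funnel: spread grows with fitted values
ratio_log = spread_ratio(np.log(x), np.log(y))  # after log-transforming both variables
print(f"raw: {ratio_raw:.2f}, log-log: {ratio_log:.2f}")
```

In this simulated case the log transformation turns multiplicative noise into additive noise, which is why the ratio moves back toward 1.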

The Residuals vs. Leverage plot is used to identify influential data points that can disproportionately affect the regression model. Leverage measures how much influence an observation has based on its position in the predictor space (x-values). In this plot, the residuals (y-axis) are plotted against the leverage values (x-axis). A good model should not have high residuals for points with high leverage, as these points could unduly influence the regression model. Points that lie far from the bulk of the data in this plot, especially those with high leverage and large residuals, are called influential points. They are often highlighted using Cook’s distance contours, which quantify the influence of these points. If any points fall beyond these contours, you should investigate them closely, as they may be outliers or points that have a large influence on the model’s coefficients.
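The quantities behind this plot can be computed by hand. The sketch below assumes a simple model with an intercept and one predictor, and uses the textbook definitions: leverage is the diagonal of the hat matrix X(X'X)^-1 X', and Cook's distance is D_i = (r_i^2 / p) * h_i / (1 - h_i), where r_i is the standardized residual and p the number of parameters. The toy data, including the deliberately planted unusual point, are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(3)
x = np.append(np.linspace(0, 10, 30), 25.0)  # last point lies far out in x
y = 2 * x + rng.normal(size=x.size)
y[-1] -= 15                                  # and far below the trend

X = np.column_stack([np.ones_like(x), x])    # design matrix with intercept
n, p = X.shape

# Leverage: diagonal of the hat matrix.
H = X @ np.linalg.inv(X.T @ X) @ X.T
leverage = np.diag(H)

# Ordinary least-squares fit and residuals.
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ beta

# Standardized (internally studentized) residuals.
s2 = resid @ resid / (n - p)
r_std = resid / np.sqrt(s2 * (1 - leverage))

# Cook's distance combines residual size and leverage.
cooks_d = r_std**2 / p * leverage / (1 - leverage)

print("most influential index:", int(np.argmax(cooks_d)))
```

The leverages always sum to the number of parameters, so a single value far above p/n marks an observation with an unusual position in the predictor space.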

Both of these plots offer critical information about model diagnostics and help to identify where assumptions of linear regression may be violated, allowing for potential corrective measures.

Getting ready

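In addition to plotly, numpy, and pandas, make sure the scipy Python library is available in your Python environment. The recipe also relies on statsmodels, which plotly uses to compute OLS trend lines. You can install both using the command:

pip install scipy statsmodels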

For this recipe we will create a data set.

Import the Python modules numpy and pandas. Import the norm object from scipy.stats. This object allows us to generate random samples from a normal distribution, which we use to create the data set for this recipe.

import numpy as np
import pandas as pd
from scipy.stats import norm

Create the data set to be used in this recipe

n = 400
x = np.linspace(0, 15, n)
epsilon = norm().rvs(n)
sigma = 2
y = 2*x + sigma*epsilon
data = pd.DataFrame({'x':x, 'y':y})

# n = 200
# x = np.linspace(0, 15, n)
# epsilon = norm(loc=20, scale=100).rvs(n)
# y = 0.5*x**3 + epsilon -10
# data2 = pd.DataFrame({'x':x, 'y':y})
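Because the noise is drawn at random, each run of the code above produces slightly different numbers. One possible way to make the data reproducible is to fix a seed, sketched here with NumPy's random generator instead of scipy.stats.norm. The helper make_data and the seed value are assumptions for illustration.

```python
import numpy as np

def make_data(seed, n=400, sigma=2):
    """Hypothetical helper: same model as the recipe, y = 2*x + sigma*epsilon,
    with the noise drawn from a seeded generator."""
    rng = np.random.default_rng(seed)
    x = np.linspace(0, 15, n)
    epsilon = rng.standard_normal(n)
    return x, 2 * x + sigma * epsilon

# The same seed gives the same noise, and therefore the same data.
x_a, y_a = make_data(seed=42)
x_b, y_b = make_data(seed=42)
print(np.array_equal(y_a, y_b))
```

With scipy.stats, passing the random_state argument to rvs serves the same purpose.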

How to do it

Import the plotly.express module as px

import plotly.express as px

df = data

Diagnose the linearity assumption with a scatter plot between the two variables

fig = px.scatter(df, x='x', y ='y', 
                 trendline_color_override="red",
                 trendline="ols", 
                 height=600, width=800,
                 title='Scatter with OLS trend line')
fig.show()

Retrieve the linear regression results from the Figure object

results_table = px.get_trendline_results(fig)
results = results_table['px_fit_results'][0]
results.summary()

OLS Regression Results
Dep. Variable:	y	R-squared:	0.953
Model:	OLS	Adj. R-squared:	0.953
Method:	Least Squares	F-statistic:	8059.
Date:	Sun, 16 Feb 2025	Prob (F-statistic):	2.95e-266
Time:	21:32:30	Log-Likelihood:	-831.67
No. Observations:	400	AIC:	1667.
Df Residuals:	398	BIC:	1675.
Df Model:	1
Covariance Type:	nonrobust

	coef	std err	t	P>|t|	[0.025	0.975]
const	0.0079	0.194	0.041	0.967	-0.373	0.389
x1	2.0061	0.022	89.773	0.000	1.962	2.050

Omnibus:	1.008	Durbin-Watson:	1.918
Prob(Omnibus):	0.604	Jarque-Bera (JB):	1.112
Skew:	0.103	Prob(JB):	0.574
Kurtosis:	2.844	Cond. No.	17.5

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

Obtain the fitted values and the residuals from the results

fitted = results.fittedvalues
residuals = results.resid

To diagnose the homoscedasticity assumption, make a scatter plot of the residuals against the fitted values

fig = px.scatter(x =fitted, y=residuals, 
                 trendline_color_override="red",
                 trendline="ols", 
                 height=600, width=800,
                 title='Residuals vs Fitted Plot')
fig.show()
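One caveat about the red OLS trend line in this plot: OLS residuals are uncorrelated with the fitted values by construction, so a straight line fitted to them is always flat at zero, even when the model is badly misspecified. The sketch below illustrates this with deliberately nonlinear toy data (an assumption, not the recipe's data set) and uses band-wise residual means as a rough stand-in for a smoother.

```python
import numpy as np

rng = np.random.default_rng(5)
x = np.linspace(0, 15, 200)
y = 0.5 * x**2 + rng.normal(scale=2, size=x.size)  # deliberately nonlinear in x

# Fit a straight line even though the true relationship is curved.
slope, intercept = np.polyfit(x, y, 1)
fitted = intercept + slope * x
resid = y - fitted

# A straight line through (fitted, resid) has slope zero up to rounding.
trend_slope, trend_intercept = np.polyfit(fitted, resid, 1)

# Mean residual in three bands of fitted values reveals the curvature.
bands = np.array_split(np.argsort(fitted), 3)
band_means = [resid[idx].mean() for idx in bands]

print(f"trend slope: {trend_slope:.2e}")
print("band means:", np.round(band_means, 2))
```

For this reason a locally weighted smoother can be more informative on residual plots; plotly express accepts trendline="lowess" for that purpose.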

Obtain the standardized (internally studentized) residuals from the influence object, then calculate the square root of their absolute values

influence = results.get_influence()
residual_norm = influence.resid_studentized_internal
residual_norm_abs_sqrt = np.sqrt(np.abs(residual_norm))

To diagnose homoscedasticity, make a Scale-Location plot: a scatter plot of the square root of the absolute standardized residuals against the fitted values

fig = px.scatter(x =fitted, y=residual_norm_abs_sqrt, 
                 trendline_color_override="red",
                 trendline="ols", 
                 height=600, width=800,
                 title='Scale-Location Plot')
fig.update_layout(xaxis_title="Fitted values", yaxis_title="√|Standardized Residuals|")
fig.show()

Obtain the leverage of each observation from the influence object by using hat_matrix_diag

leverage = influence.hat_matrix_diag
# cooks_distance = influence.cooks_distance[0]
# nparams = len(results.params)
# nresids = len(residual_norm)

Make a Residuals vs. Leverage plot: a scatter plot of the standardized residuals against the leverage

fig = px.scatter(x =leverage, y=residual_norm, 
                 height=600, width=800,
                 trendline_color_override="red",
                 trendline="ols", 
                 title='Residual vs Leverage Plot')

fig.update_layout(xaxis_title="Leverage", yaxis_title="Standardized Residuals")
fig.show()
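The plot above does not draw Cook's distance contours. As a sketch of how they could be computed, the snippet below solves the Cook's distance formula D = (r^2 / p) * h / (1 - h) for the standardized residual r at a fixed level D. The helper cooks_contour, the leverage grid, and the level 0.5 (a commonly used threshold, alongside 1) are assumptions for illustration.

```python
import numpy as np

def cooks_contour(leverage, level, n_params):
    """Standardized residual at which Cook's distance equals `level`
    for the given leverage values (positive branch)."""
    h = np.asarray(leverage, dtype=float)
    return np.sqrt(level * n_params * (1 - h) / h)

n_params = 2                            # intercept and slope, as in the recipe's model
h_grid = np.linspace(0.001, 0.05, 50)   # hypothetical leverage range for the x-axis

upper_05 = cooks_contour(h_grid, 0.5, n_params)
lower_05 = -upper_05                    # contours are symmetric around zero

# Check: plugging the contour back into the formula recovers the level.
recovered = upper_05**2 / n_params * h_grid / (1 - h_grid)
print(np.allclose(recovered, 0.5))
```

The two curves can then be overlaid on the figure as line traces, for example with fig.add_scatter(x=h_grid, y=upper_05, mode="lines"). Points outside them have a Cook's distance above the chosen level.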

To diagnose the Normality assumption create a QQ-plot comparing the theoretical against the sample quantiles

from statsmodels.graphics.gofplots import ProbPlot
QQ = ProbPlot(residual_norm)
theoretical_quantiles = QQ.theoretical_quantiles
sample_quantiles = QQ.sample_quantiles

fig = px.scatter(x =theoretical_quantiles, y=sample_quantiles, 
                 height=600, width=800,
                 title='Normal QQ Plot')
# Reference line y = x: standardized residuals from a normal model should lie close to it
fig.add_traces(px.line(x=theoretical_quantiles, y=theoretical_quantiles, color_discrete_sequence=["red"]).data)
fig.update_layout(xaxis_title="Theoretical Quantiles", yaxis_title="Standardized Residuals")
fig.show()
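For readers who want to see what goes into a Q-Q plot, here is a minimal sketch that builds the two sets of quantiles by hand using only NumPy and the standard library. The helper normal_qq is hypothetical, and the plotting positions i / (n + 1) are one common convention that may differ from the default used by ProbPlot.

```python
import numpy as np
from statistics import NormalDist

def normal_qq(sample):
    """Return (theoretical, sample) quantiles for a normal Q-Q plot."""
    z = np.sort(np.asarray(sample, dtype=float))  # sample quantiles
    n = z.size
    probs = np.arange(1, n + 1) / (n + 1)         # plotting positions in (0, 1)
    theo = np.array([NormalDist().inv_cdf(p) for p in probs])
    return theo, z

# Hypothetical input: draws from a standard normal, standing in for
# the standardized residuals of a well-behaved model.
rng = np.random.default_rng(6)
theo, samp = normal_qq(rng.normal(size=400))

# For normal data the points lie close to a straight line,
# so the two sets of quantiles are almost perfectly correlated.
qq_corr = np.corrcoef(theo, samp)[0, 1]
print(f"correlation: {qq_corr:.4f}")
```

Heavy tails or skewness in the residuals show up as systematic departures from the line at the ends of the plot.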
