# Making a scatter with a trend line

Combining a scatter plot with a trend line creates a powerful visual tool for analyzing relationships between variables while also illustrating overall trends in the data. 

In a scatter plot, individual data points are plotted based on their values for two variables, providing insight into the correlation or distribution of those variables. When a trend line is added, it helps to summarize the overall direction of the data, showing whether there is a positive, negative, or no correlation between the variables. Trend lines can also help **identify outliers or anomalies** in the data that might not be immediately noticeable from just the scatter plot.

🚀 When to use them:

This combination is particularly useful when you want to see both the individual data points and the general pattern they form. For example, in regression analysis, a scatter plot with a trend line can visually depict how well a model fits the data, making it valuable in fields like economics, science, or marketing when examining relationships between variables (e.g., advertising spend vs. sales). 

⚠️ Be aware:

These charts are less useful when there is little to no relationship between the variables, since the trend line may then be misleading or uninformative. Additionally, with a large amount of data or heavily clustered points, the scatter plot can become crowded, making it difficult to interpret the results or to spot individual data points clearly.
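One quick way to judge whether a trend line is likely to be informative is to look at the correlation coefficient before plotting. The following is a minimal sketch using made-up data (the series `y_related` and `y_unrelated` and the seed are assumptions for illustration, not part of this recipe's data sets):

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed, chosen only for reproducibility
n = 200
x = np.linspace(0, 15, n)

# Hypothetical data: one series depends on x, the other is pure noise
y_related = 2 * x + 2 * rng.standard_normal(n)
y_unrelated = rng.standard_normal(n)

# Pearson correlation coefficient between x and each series
r_related = np.corrcoef(x, y_related)[0, 1]
r_unrelated = np.corrcoef(x, y_unrelated)[0, 1]

print(f"related:   r = {r_related:.3f}")
print(f"unrelated: r = {r_unrelated:.3f}")
```

A coefficient close to zero suggests that a straight trend line would say little about the data, although it does not rule out a non-linear relationship.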

## Getting ready


In addition to `plotly`, `numpy` and `pandas`, make sure the following Python libraries are available in your Python environment:

-  `statsmodels` 
-  `scipy`

You can install them using the command:

```
pip install statsmodels scipy
```

For this recipe, we will create two data sets:

1. Import the Python modules `numpy` and `pandas`. Import the [`norm`](https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.norm.html) object from `scipy.stats`. This object allows us to generate random samples from a normal distribution, which we will use to build the data sets for this recipe.

In [2]:
import numpy as np
import pandas as pd
from scipy.stats import norm

2. Create two data sets to be used in this recipe:

- `data1` : which contains two variables, `x` and `y`, with a linear relationship
- `data2` : which contains two variables, `x` and `y`, with a non-linear relationship

In [3]:
n = 200
x = np.linspace(0, 15, n)
epsilon = norm().rvs(n)
sigma = 2
y = 2*x + sigma*epsilon
data1 = pd.DataFrame({'x':x, 'y':y})

In [4]:
n = 200
x = np.linspace(0, 15, n)
epsilon = norm(loc=20, scale=100).rvs(n)
y = 0.5*x**3 + epsilon -10
data2 = pd.DataFrame({'x':x, 'y':y})
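Because the noise is random, the data sets change on every run. If reproducible figures and regression tables are wanted, a seed can be passed through the `random_state` argument of `rvs`. The sketch below wraps the `data1` construction in a helper; the function name `make_linear_data` and the seed value are assumptions for illustration only:

```python
import numpy as np
import pandas as pd
from scipy.stats import norm


def make_linear_data(n=200, sigma=2, seed=None):
    """Hypothetical helper: rebuilds data1, optionally with a fixed seed."""
    x = np.linspace(0, 15, n)
    # random_state makes the normal samples repeatable when a seed is given
    epsilon = norm().rvs(size=n, random_state=seed)
    y = 2 * x + sigma * epsilon
    return pd.DataFrame({'x': x, 'y': y})


a = make_linear_data(seed=42)
b = make_linear_data(seed=42)
print(a.equals(b))  # same seed, so both frames hold identical values
```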

## How to do it

1. Import the `plotly.express` module as `px`

In [5]:
import plotly.express as px

2. Make a simple scatter plot to illustrate the points in the `data1` data set using the function `scatter`

In [7]:
df = data1
fig = px.scatter(df, x='x', y ='y', 
                 height=600, width=800,
                 title='Just a simple scatter')
fig.show()

We can observe that there is a linear relationship between the variables! 

### Linear Trend

3. Add a line that captures the linear relationship in the data. To do this, add the argument `trendline` and pass the string `"ols"`. This draws the line determined by the Ordinary Least Squares (OLS) regression method.

In [8]:
fig = px.scatter(df, x='x', y ='y', 
                 trendline="ols",
                 height=600, width=800,
                 title='Scatter with OLS trend line')
fig.show()
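To make the OLS line less of a black box, here is a minimal sketch that computes the slope and intercept of a simple linear regression from the closed-form least-squares formulas and cross-checks them against `np.polyfit`. The data are regenerated here with an arbitrary seed (an assumption), so the numbers will not match the figure above exactly:

```python
import numpy as np

rng = np.random.default_rng(1)  # arbitrary seed for repeatability
x = np.linspace(0, 15, 200)
y = 2 * x + 2 * rng.standard_normal(x.size)

# Closed-form least-squares estimates for y = intercept + slope * x
x_mean, y_mean = x.mean(), y.mean()
slope = np.sum((x - x_mean) * (y - y_mean)) / np.sum((x - x_mean) ** 2)
intercept = y_mean - slope * x_mean

# Cross-check with NumPy's degree-1 polynomial fit
slope_np, intercept_np = np.polyfit(x, y, deg=1)

print(f"slope = {slope:.4f}, intercept = {intercept:.4f}")
```

The estimated slope should land close to the true value of 2 used to generate the data.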

4. Change the color of the trend line by using `trendline_color_override`

In [9]:
fig = px.scatter(df, x='x', y ='y', 
                 trendline_color_override="red",
                 trendline="ols", 
                 height=600, width=800,
                 title='Scatter with OLS trend line')
fig.show()

5. Retrieve the results of the OLS fit by using the `plotly.express` function `get_trendline_results` and passing your figure object.

In [10]:
results_table = px.get_trendline_results(fig)
results_table

| | px_fit_results |
|---|---|
| 0 | `<statsmodels.regression.linear_model.Regressio...` |


Let's check what type of object this returns:

In [11]:
type(results_table)

pandas.core.frame.DataFrame

It is a pandas `DataFrame`

6. Extract the object containing the results from the `DataFrame`. This is a `statsmodels.regression.linear_model.RegressionResultsWrapper` object

In [12]:
results = results_table['px_fit_results'][0]
results

<statsmodels.regression.linear_model.RegressionResultsWrapper at 0x116d7ea10>

In [13]:
type(results)

statsmodels.regression.linear_model.RegressionResultsWrapper

7. Get the full details of the regression by using the method `summary` of the `results` object. This method returns a statsmodels `Summary` object, which a notebook renders as a set of tables.

In [14]:
results.summary()

| | | | |
|---|---|---|---|
| Dep. Variable: | y | R-squared: | 0.958 |
| Model: | OLS | Adj. R-squared: | 0.958 |
| Method: | Least Squares | F-statistic: | 4545.0 |
| Date: | Sat, 07 Sep 2024 | Prob (F-statistic): | 1.6e-138 |
| Time: | 22:21:35 | Log-Likelihood: | -407.0 |
| No. Observations: | 200 | AIC: | 818.0 |
| Df Residuals: | 198 | BIC: | 824.6 |
| Df Model: | 1 | | |
| Covariance Type: | nonrobust | | |

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| const | -0.4156 | 0.262 | -1.585 | 0.115 | -0.933 | 0.101 |
| x1 | 2.0385 | 0.030 | 67.416 | 0.000 | 1.979 | 2.098 |

| | | | |
|---|---|---|---|
| Omnibus: | 0.358 | Durbin-Watson: | 2.195 |
| Prob(Omnibus): | 0.836 | Jarque-Bera (JB): | 0.334 |
| Skew: | 0.098 | Prob(JB): | 0.846 |
| Kurtosis: | 2.962 | Cond. No. | 17.4 |


Note that there is a similar method named `summary2`, which also returns a summary object. However, it is an **experimental** version and should be used with caution.

### Non-Linear Trend

1. Make a scatter plot to illustrate the points in the `data2` data set. Include the OLS regression line to contrast it against the data. It is clear that the data do not follow a linear relationship.

In [17]:
df = data2
fig = px.scatter(df, x='x', y ='y', 
                 trendline="ols", 
                 trendline_color_override="red",
                 height=600, width=800,
                 title='Scatter with OLS trend line')
fig.show()

2. Import `statsmodels.formula.api` as `smf`. This will help us specify a non-linear model for the data in `data2`

In [18]:
import statsmodels.formula.api as smf

3. Fit an OLS model with a non-linear term by calling `smf.ols` and passing the arguments below. The model is still linear in its coefficients, which is why OLS applies.

- `formula` : a string which specifies the curve that we want to fit. In this case we fit a cubic term plus an intercept, `y ~ I(x**3)`
- `data` : the `DataFrame` with the data set to be fitted

In [19]:
model = smf.ols(formula='y ~ I(x**3)', data = df).fit()

In [25]:
# Evaluate the fitted curve at the observed x values
predicted = model.predict(df.x)
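The fitted model can also be evaluated at values of `x` that were not in the original data, for example to draw a smoother curve or to extend it. Here is a minimal sketch, assuming a made-up cubic data set and a hypothetical grid `x_new`; the column name in the new `DataFrame` must match the variable name used in the formula:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(5)  # arbitrary seed
x = np.linspace(0, 15, 200)
df_cubic = pd.DataFrame({'x': x,
                         'y': 0.5 * x**3 + 100 * rng.standard_normal(x.size)})

cubic_model = smf.ols(formula='y ~ I(x**3)', data=df_cubic).fit()

# Hypothetical grid of new x values, passed as a DataFrame with column 'x'
x_new = pd.DataFrame({'x': np.linspace(0, 15, 50)})
y_new = cubic_model.predict(x_new)

print(y_new.head())
```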

4. Plot the scatter together with the curve given by the fitted polynomial evaluated at the `x` values

In [26]:
fig = px.scatter(df, x='x', y ='y',
                 height=600, width=800,
                 title='Scatter + Fitted Polynomial')
fig.add_scatter(x=df.x, y =predicted, name="Fitted Polynomial")
fig.show()

5. Get the full details of the model by using the method `summary`

In [27]:
model.summary()

| | | | |
|---|---|---|---|
| Dep. Variable: | y | R-squared: | 0.962 |
| Model: | OLS | Adj. R-squared: | 0.962 |
| Method: | Least Squares | F-statistic: | 4988.0 |
| Date: | Sat, 07 Sep 2024 | Prob (F-statistic): | 2.32e-142 |
| Time: | 22:26:21 | Log-Likelihood: | -1193.8 |
| No. Observations: | 200 | AIC: | 2392.0 |
| Df Residuals: | 198 | BIC: | 2398.0 |
| Df Model: | 1 | | |
| Covariance Type: | nonrobust | | |

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| Intercept | 17.8759 | 8.959 | 1.995 | 0.047 | 0.209 | 35.543 |
| I(x ** 3) | 0.4929 | 0.007 | 70.623 | 0.000 | 0.479 | 0.507 |

| | | | |
|---|---|---|---|
| Omnibus: | 0.679 | Durbin-Watson: | 2.032 |
| Prob(Omnibus): | 0.712 | Jarque-Bera (JB): | 0.808 |
| Skew: | 0.113 | Prob(JB): | 0.668 |
| Kurtosis: | 2.785 | Cond. No. | 1710.0 |
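To put numbers on the visual impression that the straight line fits this kind of data poorly, the two models can be compared through their `rsquared` and `aic` attributes. This is a minimal sketch on a made-up cubic data set (the seed and the noise level are assumptions), not a full model-selection procedure:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(6)  # arbitrary seed
x = np.linspace(0, 15, 200)
df_compare = pd.DataFrame({'x': x,
                           'y': 0.5 * x**3 + 100 * rng.standard_normal(x.size)})

# Same data, two candidate formulas
linear_fit = smf.ols('y ~ x', data=df_compare).fit()
cubic_fit = smf.ols('y ~ I(x**3)', data=df_compare).fit()

# Higher R-squared and lower AIC both point to the better-fitting model
print(f"linear: R^2 = {linear_fit.rsquared:.3f}, AIC = {linear_fit.aic:.1f}")
print(f"cubic:  R^2 = {cubic_fit.rsquared:.3f}, AIC = {cubic_fit.aic:.1f}")
```

Both models have the same number of parameters here, so the comparison is not affected by model size.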
