1. Making a scatter plot#
A scatter plot or scatter chart is a type of graph that displays data points based on two variables along the x and y axes. Each point represents a pair of values, providing insight into the relationship between those variables.
🚀 When to use them:
Scatter charts are particularly effective for visualizing patterns of correlation or distribution, such as whether an increase in one variable is accompanied by an increase (positive correlation) or a decrease (negative correlation) in the other. They are widely used in fields like statistics, data science, and economics to examine relationships between variables, identify outliers, and spot clusters.
Scatter charts are most useful when you need to identify, analyze, and interpret the relationships or correlations between two continuous variables. For instance, they are commonly used in scientific research to investigate variables such as height vs. weight or sales revenue vs. advertising spend. Scatter plots make it easy to spot trends or anomalies in the data.
⚠️ Be aware:
Scatter charts may not be the best choice when dealing with large datasets that could result in overlapping points (also known as “overplotting”), making it difficult to distinguish individual data points. They also tend to be less effective for categorical or non-numeric data and might not clearly display relationships when the data has no significant correlation or is highly scattered without any discernible pattern.
Getting ready#
For this recipe, we will load the Gapminder data set and filter the data for the year 2007.
import plotly.express as px

# Load the Gapminder data set bundled with Plotly Express
df = px.data.gapminder()
# Keep only the rows for the year 2007
df = df[df.year == 2007]
df.head()
| | country | continent | year | lifeExp | pop | gdpPercap | iso_alpha | iso_num |
|---|---|---|---|---|---|---|---|---|
| 11 | Afghanistan | Asia | 2007 | 43.828 | 31889923 | 974.580338 | AFG | 4 |
| 23 | Albania | Europe | 2007 | 76.423 | 3600523 | 5937.029526 | ALB | 8 |
| 35 | Algeria | Africa | 2007 | 72.301 | 33333216 | 6223.367465 | DZA | 12 |
| 47 | Angola | Africa | 2007 | 42.731 | 12420476 | 4797.231267 | AGO | 24 |
| 59 | Argentina | Americas | 2007 | 75.320 | 40301927 | 12779.379640 | ARG | 32 |
How to do it#
Visualising two variables#
Make a simple scatter plot using px.scatter, passing the data frame and the names of the two columns that will be plotted as x and y, respectively. Then, use the show method to display it.
fig = px.scatter(df, x='gdpPercap', y='lifeExp')
fig.show()
Add a title to your chart by passing a string to the title argument.
fig = px.scatter(df, x='gdpPercap', y='lifeExp',
                 title='Gap Minder Data: GDP per Capita vs Life Expectancy')
fig.show()
Customise the size of the figure with the height and width arguments. Both must be integers and give the size of the figure in pixels.
fig = px.scatter(df, x='gdpPercap', y='lifeExp',
                 height=600, width=800,
                 title='Gap Minder Data: GDP per Capita vs Life Expectancy')
fig.show()
Introducing a third variable#
Use the color argument to colour the dots according to a third, categorical variable. In this case, we pass continent, which lets us observe whether the relationship between GDP per capita and life expectancy differs by continent.
fig = px.scatter(df, x='gdpPercap', y='lifeExp',
                 color='continent',
                 height=600, width=800,
                 title='Gap Minder Data: GDP per Capita vs Life Expectancy')
fig.show()
An alternative way to introduce a third variable is the symbol argument, which gives the markers a different shape for each value of the specified variable. Let’s take a look at the result when passing continent.
fig = px.scatter(df, x='gdpPercap', y='lifeExp',
                 symbol='continent',
                 height=600, width=800,
                 title='Gap Minder Data: GDP per Capita vs Life Expectancy')
fig.show()
You can also use both arguments together, as follows:
fig = px.scatter(df, x='gdpPercap', y='lifeExp',
                 symbol='continent', color='continent',
                 height=600, width=800,
                 title='Gap Minder Data: GDP per Capita vs Life Expectancy')
fig.show()
There is More#
Further Customisation#
Customise the colors used in the scatter plot with the color_discrete_sequence argument, which takes a list of colors such as one of the built-in qualitative palettes.
fig = px.scatter(df, x='gdpPercap', y='lifeExp',
                 color='continent',
                 height=600, width=800,
                 color_discrete_sequence=px.colors.qualitative.Bold,
                 title='Gap Minder Data: GDP per Capita vs Life Expectancy')
fig.show()
Use a pre-defined template by passing its name to the template argument. The Plotly library comes pre-loaded with several themes that you can start using right away.
fig = px.scatter(df, x='gdpPercap', y='lifeExp',
                 color='continent',
                 height=600, width=800,
                 color_discrete_sequence=px.colors.qualitative.Bold,
                 template="plotly_dark",
                 title='Gap Minder Data: GDP per Capita vs Life Expectancy')
fig.show()