9. Understanding ECDF charts
Empirical Cumulative Distribution Function (ECDF) charts are a visualization tool for representing the cumulative distribution of a dataset. An ECDF chart displays the proportion of data points that fall at or below a particular value, allowing viewers to see how data accumulates across the range of values in the dataset. The x-axis represents the data values, while the y-axis indicates the cumulative proportion of observations. As you move from left to right along the x-axis, the ECDF never decreases: it steps up at each observed value until it reaches 1. This visualization communicates both the shape of the distribution and the probability of observing a value at or below a certain threshold.
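To make the definition concrete, here is a minimal sketch that computes an ECDF by hand with numpy. The helper name `ecdf_values` and the toy data are illustrative assumptions and are not part of plotly.

```python
import numpy as np

def ecdf_values(values):
    """Return the sorted values and the cumulative proportion at each one."""
    x = np.sort(np.asarray(values, dtype=float))
    # Position i (1-based) in the sorted data corresponds to a proportion of i / n
    y = np.arange(1, x.size + 1) / x.size
    return x, y

data = [2.0, 5.0, 3.0, 8.0]
x, y = ecdf_values(data)
print(x.tolist())  # [2.0, 3.0, 5.0, 8.0]
print(y.tolist())  # [0.25, 0.5, 0.75, 1.0]

# Proportion of observations at or below a threshold of 4
share_below = float(np.mean(np.asarray(data) <= 4))
print(share_below)  # 0.5
```

Plotting `y` against `x` as a step line gives the same kind of curve that the recipe below produces with plotly.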
ECDF charts are particularly useful in several scenarios. They provide a clear picture of the distribution of continuous data, making it easier to identify features such as skewness, multimodality, and the presence of outliers. Unlike histograms, which are sensitive to bin width and may obscure finer details of the data distribution, ECDFs involve no binning: the chart is a step function that places every observation exactly, so no information about individual data points is lost. Additionally, ECDFs are well suited to comparing multiple groups, as several ECDF curves can be plotted on the same chart, allowing for quick visual comparisons of their distributions. This makes them valuable in fields such as statistics, finance, and environmental science, where understanding the distribution of data is crucial for analysis and decision-making.
However, ECDFs may not be appropriate in all situations. For smaller datasets, an ECDF consists of only a few coarse steps, and readers who are not used to cumulative plots may find a histogram or box plot easier to interpret. Moreover, ECDFs can become cluttered and harder to read when dealing with a large number of groups or categories, particularly if the curves overlap significantly. In these cases, it may be more effective to focus on a smaller number of groups or use alternative visualization techniques that convey the desired information more clearly. Finally, ECDFs are not suitable for unordered categorical data, as they rely on a meaningful ordering of values and are primarily intended for numeric variables.
Getting ready
In addition to plotly, numpy and pandas, make sure the scipy Python library is available in your Python environment. You can install it using the command:
pip install scipy
For this recipe we will create two data sets.
Import the Python modules numpy and pandas. Import the norm object from scipy.stats. This object allows us to generate random samples from a normal distribution, which we will use to create the data sets for this recipe.
import numpy as np
import pandas as pd
from scipy.stats import norm
Create two data sets to be used in this recipe
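n = 400
# Standard normal sample (mean 0, standard deviation 1)
sample1 = norm().rvs(n)
# Normal sample with mean 3 and standard deviation 0.5
sample2 = norm(loc=3, scale=0.5).rvs(n)
# data1: a single column holding the first sample
data1 = pd.DataFrame({'Normal': sample1})
# data2: both samples in long format, with a label identifying each one
samples = np.concatenate((sample1, sample2))
labels = ['Sample 1'] * n + ['Sample 2'] * n
data2 = pd.DataFrame({'Data': samples, 'Label': labels})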
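The samples above are random, so the charts will differ slightly on every run. If repeatable output is wanted, one option is to pass a seeded generator through the `random_state` argument of `rvs`. The seed value and the variable names `seeded1` and `seeded2` below are arbitrary choices for this sketch.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(42)  # the seed 42 is an arbitrary assumption
n = 400
# Passing the same seeded generator makes both draws repeatable across runs
seeded1 = norm().rvs(n, random_state=rng)
seeded2 = norm(loc=3, scale=0.5).rvs(n, random_state=rng)
```

These arrays can be used in place of `sample1` and `sample2` when building the data frames.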
How to do it
Import the plotly.express module as px
import plotly.express as px
df = data1
Create a simple ECDF chart by using the function ecdf
fig = px.ecdf(df, x="Normal")
fig.show()
Add a title to your chart by passing a string as the argument title to the function ecdf, and customise the size of the figure by using the arguments height and width. Both have to be integers and correspond to the size of the figure in pixels.
fig = px.ecdf(df, x="Normal",
              height=600, width=800,
              title="Empirical Cumulative Distribution Function")
fig.show()
Add markers to the ECDF by setting the argument markers to True. This is particularly useful in the case of discrete distributions.
fig = px.ecdf(df, x="Normal",
              markers=True,
              height=600, width=800,
              title="Empirical Cumulative Distribution Function")
fig.show()
Customise the color of the trace by using the argument color_discrete_sequence
fig = px.ecdf(df, x="Normal",
              color_discrete_sequence=['teal'],
              height=600, width=800,
              title="Empirical Cumulative Distribution Function")
fig.show()
Choose the normalization of the y-axis by setting the argument ecdfnorm
If 'probability', values will be probabilities normalized from 0 to 1
If 'percent', values will be percentages normalized from 0 to 100
If None, values will be raw counts or sums
fig = px.ecdf(df, x="Normal",
              ecdfnorm="percent",
              color_discrete_sequence=['orange'],
              height=600, width=800,
              title="Empirical Cumulative Distribution Function")
fig.show()
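The three normalizations differ only in how the y-values are scaled. The sketch below reproduces them by hand with numpy for four toy observations. It illustrates the idea and is not plotly's own code.

```python
import numpy as np

values = np.sort(np.array([1.0, 2.0, 2.5, 4.0]))  # toy data, no ties

counts = np.arange(1, values.size + 1)  # corresponds to ecdfnorm=None (raw counts)
probability = counts / values.size      # corresponds to ecdfnorm="probability"
percent = 100 * probability             # corresponds to ecdfnorm="percent"

print(counts.tolist())       # [1, 2, 3, 4]
print(probability.tolist())  # [0.25, 0.5, 0.75, 1.0]
print(percent.tolist())      # [25.0, 50.0, 75.0, 100.0]
```

The shape of the curve is identical in all three cases. Only the scale of the y-axis changes.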
Customise the plot further by setting the argument ecdfmode to one of 'standard', 'complementary' or 'reversed'
If 'standard', the ECDF is plotted such that values represent data at or below the point
If 'complementary', the CCDF is plotted such that values represent data above the point
If 'reversed', a variant of the CCDF is plotted such that values represent data at or above the point
fig = px.ecdf(df, x="Normal",
              ecdfmode="complementary",
              color_discrete_sequence=['purple'],
              height=600, width=800,
              title="Empirical Cumulative Distribution Function")
fig.show()
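The difference between the three modes comes down to which comparison is used at each point. The sketch below evaluates all three at a single point for a small toy sample. It is a hand computation meant to illustrate the definitions above, and the variable names are illustrative only.

```python
import numpy as np

values = np.array([1.0, 2.0, 3.0, 4.0])  # toy data
point = 2.0

standard = float(np.mean(values <= point))      # share of data at or below the point
complementary = float(np.mean(values > point))  # share of data above the point
reversed_ = float(np.mean(values >= point))     # share of data at or above the point

print(standard, complementary, reversed_)  # 0.5 0.5 0.75
```

The standard and complementary values always sum to 1, while the reversed value differs from the complementary one only by the share of observations exactly equal to the point.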
Add marginal charts via the argument marginal to enrich the ECDF chart
fig = px.ecdf(df, x="Normal",
              marginal="histogram",
              color_discrete_sequence=['teal'],
              height=600, width=800,
              title="Empirical Cumulative Distribution Function")
fig.show()
fig = px.ecdf(df, x="Normal",
              marginal="rug",
              color_discrete_sequence=['orchid'],
              height=600, width=800,
              title="Empirical Cumulative Distribution Function")
fig.show()