9. Understanding ECDF charts#

Empirical Cumulative Distribution Function (ECDF) charts are a powerful visualization tool for representing the cumulative distribution of a dataset. An ECDF chart displays the proportion of data points that fall at or below a particular value, allowing viewers to see how data accumulates across the range of values in the dataset. The x-axis represents the data values, while the y-axis indicates the cumulative proportion of observations. As you move from left to right along the x-axis, the ECDF never decreases: it steps up at each observed value until it reaches 1. This visualization communicates both the shape of the distribution and the estimated probability of observing a value at or below a given threshold.
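To make the definition concrete, here is a minimal sketch that computes an ECDF by hand with numpy. The helper name ecdf and the four-value toy dataset are illustrative assumptions, not part of the recipe, and the sketch is not meant to reproduce how plotly computes its curves internally.

```python
import numpy as np


def ecdf(values):
    """Return the distinct values and the proportion of observations at or below each.

    Hypothetical helper, written only to illustrate the definition of an ECDF.
    """
    # np.unique sorts the data and groups tied observations together
    x, counts = np.unique(np.asarray(values, dtype=float), return_counts=True)
    # Cumulative counts divided by the sample size give cumulative proportions
    y = np.cumsum(counts) / counts.sum()
    return x, y


x, y = ecdf([3.0, 1.0, 2.0, 2.0])
for xi, yi in zip(x, y):
    print(f"proportion of data <= {xi:.1f}: {yi:.2f}")
```

With this toy input, one of the four observations is at or below 1, three are at or below 2, and all four are at or below 3, so the proportions are 0.25, 0.75 and 1.00.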

ECDF charts are particularly useful in several scenarios. They provide a clear understanding of the distribution of continuous data, making it easier to identify features such as skewness, multimodality, and the presence of outliers. Unlike histograms, which are sensitive to bin width and may obscure finer details of the data distribution, ECDFs require no binning and depict the distribution without losing information about individual data points: every observation contributes its own step to the curve. Additionally, ECDFs are well suited to comparing several groups, as multiple ECDF curves can be plotted on the same chart, allowing for quick visual comparisons of their distributions. This makes them valuable in fields such as statistics, finance, and environmental science, where understanding the distribution of data is crucial for analysis and decision-making.

However, ECDFs may not be appropriate in all situations. For smaller datasets, the ECDF is a coarse staircase and might be less informative than other visualizations, such as histograms or box plots. Moreover, ECDFs can become cluttered and harder to interpret when dealing with a large number of groups or categories, particularly if the curves overlap significantly. In these cases, it may be more effective to focus on a smaller number of groups or use alternative visualization techniques that convey the desired information more clearly. Additionally, ECDFs are not suitable for unordered categorical data, as they rely on a meaningful ordering of values.

Getting ready#

In addition to plotly, numpy and pandas, make sure the scipy Python library is available in your Python environment. You can install it using the command:

pip install scipy 

For this recipe we will create two data sets:

  1. Import the Python modules numpy and pandas, and import the norm object from scipy.stats. This object lets us generate random samples from a normal distribution, which we will use to create the data sets for this recipe.

import numpy as np
import pandas as pd
from scipy.stats import norm
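  2. Create two data sets to be used in this recipe. data1 holds a single standard normal sample. data2 stacks that sample with a second one, drawn from a normal distribution with mean 3 and standard deviation 0.5, and labels each row with the sample it came from.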

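n = 400  # number of observations per sample
sample1 = norm().rvs(n)  # standard normal: mean 0, standard deviation 1
sample2 = norm(loc=3, scale=0.5).rvs(n)  # mean 3, standard deviation 0.5
data1 = pd.DataFrame({'Normal': sample1})  # one sample, one column
samples = np.concatenate((sample1, sample2))
labels = ['Sample 1'] * n + ['Sample 2'] * n
data2 = pd.DataFrame({'Data': samples, 'Label': labels})  # both samples, long format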
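
Because the samples are random, the charts in this recipe will look slightly different on every run. If you want repeatable figures, one option is to pass a seed through the random_state argument of rvs. The sketch below assumes an arbitrary seed of 42, which is not part of the original recipe.

```python
import numpy as np
from scipy.stats import norm

seed = 42  # arbitrary value chosen for illustration; any integer works
n = 400

# Drawing twice with the same seed yields identical samples
first = norm().rvs(n, random_state=seed)
second = norm().rvs(n, random_state=seed)

print(np.array_equal(first, second))
```

Leaving random_state out, as the recipe does, is equally valid when reproducibility is not a concern.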

How to do it#

  1. Import the plotly.express module as px and select data1 as the working DataFrame df.

import plotly.express as px
df = data1
  2. Create a simple ECDF chart by using the function ecdf.

fig = px.ecdf(df, x="Normal")
fig.show()
  3. Add a title to your chart by passing a string to the title argument of ecdf, and customise the size of the figure with the height and width arguments. Both must be integers and give the size of the figure in pixels.

fig = px.ecdf(df, x="Normal",
              height=600, width=800,
              title="Empirical Cumulative Distribution Function")
fig.show()
  4. Add markers to the ECDF by setting the markers argument to True. This is particularly important in the case of discrete distributions.

fig = px.ecdf(df, x="Normal",
              markers=True,
              height=600, width=800,
              title="Empirical Cumulative Distribution Function")
fig.show()
  5. Customise the color of the trace by using the color_discrete_sequence argument.

fig = px.ecdf(df, x="Normal",
              color_discrete_sequence=['teal'],
              height=600, width=800,
              title="Empirical Cumulative Distribution Function")
fig.show()
  6. Choose how the y-axis is normalized by setting the argument ecdfnorm:

  • If 'probability', values will be probabilities normalized from 0 to 1

  • If 'percent', values will be percentages normalized from 0 to 100

  • If None, values will be raw counts or sums

fig = px.ecdf(df, x="Normal",
              ecdfnorm="percent",
              color_discrete_sequence=['orange'],
              height=600, width=800,
              title="Empirical Cumulative Distribution Function")
fig.show()
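
The three normalizations differ only by a scale factor. The numpy sketch below shows the relationship on a four-value toy dataset. It is an illustration of the definitions, not plotly's internal code, and it assumes the counting case, in which no y column is supplied.

```python
import numpy as np

values = np.array([1.0, 2.0, 2.0, 3.0])  # toy data, assumed for illustration

# Distinct values and how often each occurs
x, counts = np.unique(values, return_counts=True)

cumulative_counts = np.cumsum(counts)           # like ecdfnorm=None: raw counts
probability = cumulative_counts / values.size   # like ecdfnorm='probability': 0 to 1
percent = 100 * probability                     # like ecdfnorm='percent': 0 to 100

for xi, c, p, pct in zip(x, cumulative_counts, probability, percent):
    print(f"x={xi:.1f}  count={c}  probability={p:.2f}  percent={pct:.0f}")
```

Raw counts are handy when the sample size itself matters, while probabilities or percentages make groups of different sizes comparable.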
  7. Customise the plot further by setting the argument ecdfmode to one of 'standard', 'complementary' or 'reversed':

  • If 'standard', the ECDF is plotted such that values represent data at or below the point

  • If 'complementary', the CCDF is plotted such that values represent data above the point

  • If 'reversed', a variant of the CCDF is plotted such that values represent data at or above the point

fig = px.ecdf(df, x="Normal",
              ecdfmode="complementary",
              color_discrete_sequence=['purple'],
              height=600, width=800,
              title="Empirical Cumulative Distribution Function")
fig.show()
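
The difference between the three modes comes down to which comparison is applied at each point. The following numpy sketch evaluates all three on a toy dataset with a tied value. The data and variable names are assumptions made for illustration, and the code mirrors the definitions in the list above rather than plotly's implementation.

```python
import numpy as np

values = np.array([1.0, 2.0, 2.0, 3.0])  # toy data with a tie at 2.0
points = np.unique(values)               # evaluate at each distinct value

# Proportion of data at or below, above, and at or above each point
standard = np.array([(values <= p).mean() for p in points])
complementary = np.array([(values > p).mean() for p in points])
reversed_ = np.array([(values >= p).mean() for p in points])

for p, s, c, r in zip(points, standard, complementary, reversed_):
    print(f"x={p:.1f}  standard={s:.2f}  complementary={c:.2f}  reversed={r:.2f}")
```

At every point the standard and complementary values add up to 1, while the reversed value exceeds the complementary one by the share of observations equal to that point.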
  8. Add marginal charts via the argument marginal to enrich the ECDF chart. The two figures for this step use a histogram and a rug plot, respectively.

# Marginal histogram drawn above the ECDF
fig = px.ecdf(df, x="Normal",
              marginal="histogram",
              color_discrete_sequence=['teal'],
              height=600, width=800,
              title="Empirical Cumulative Distribution Function")
fig.show()

# Marginal rug plot: one tick per observation
fig = px.ecdf(df, x="Normal",
              marginal="rug",
              color_discrete_sequence=['orchid'],
              height=600, width=800,
              title="Empirical Cumulative Distribution Function")
fig.show()