5. Using Violin Plots as an Alternative to Box Plots

A violin plot is a data visualization that combines a box plot with a kernel density plot to give a richer picture of how a dataset is distributed. The center of the violin can hold a miniature box plot showing the median, interquartile range (IQR), and potential outliers, while the surrounding shape is a mirrored density curve that shows how frequently data points occur at each value. The width of the violin at any given point reflects the estimated density there, so wider sections indicate a higher concentration of data points. This dual representation lets viewers read both the summary statistics (as in a box plot) and the full shape of the distribution, including features such as multimodality or skewness that simpler plots can miss.
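
To make the link between density and violin width concrete, here is a minimal sketch that estimates a kernel density with scipy.stats.gaussian_kde. Plotly computes its own density estimate internally, so the numbers are only illustrative; the sample, the seed, and the evaluation points are assumptions chosen for this example.

```python
import numpy as np
from scipy.stats import gaussian_kde

# Illustrative sample (assumption): 500 draws from a standard normal
rng = np.random.default_rng(42)
sample = rng.normal(loc=0.0, scale=1.0, size=500)

# Estimate the density; a violin's width at value v is proportional to kde(v)
kde = gaussian_kde(sample)
grid = np.array([-3.0, 0.0, 3.0])
density = kde(grid)

# The violin is widest where the estimated density is highest (near 0 here)
for value, d in zip(grid, density):
    print(f"value={value:+.1f}  estimated density={d:.3f}")
```

The estimated density near the center is much larger than in the tails, which is why a violin drawn from this sample would bulge around 0 and taper toward both ends.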

Violin plots are particularly useful when you’re comparing the distributions of multiple groups, as they offer a more detailed view of the data’s spread and shape than a box plot alone. They are commonly used in exploratory data analysis, especially when it is important to understand the distribution shape, not just the summary statistics. For example, in a dataset with multiple peaks (bimodal or multimodal data), a violin plot would clearly show these multiple modes, while a box plot would not. This makes them valuable in fields like biology, finance, or any area involving complex datasets where understanding subtle variations in distribution is critical.
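
The bimodal case can be sketched numerically. The example below builds a two-cluster sample, prints the quartiles a box plot would draw, and then locates the peaks of a kernel density estimate, which is roughly what the outline of a violin shows. The cluster centers, the seed, and the prominence threshold passed to scipy.signal.find_peaks are assumptions made for illustration.

```python
import numpy as np
from scipy.signal import find_peaks
from scipy.stats import gaussian_kde

# Illustrative bimodal sample (assumption): two normal clusters at -3 and +3
rng = np.random.default_rng(0)
sample = np.concatenate([rng.normal(-3, 1, 500), rng.normal(3, 1, 500)])

# What a box plot reports: three quartiles, with no hint of two clusters
q1, median, q3 = np.percentile(sample, [25, 50, 75])
print(f"Q1={q1:.2f}, median={median:.2f}, Q3={q3:.2f}")

# What a violin plot draws: a density curve whose local maxima are the modes
grid = np.linspace(-8, 8, 401)
density = gaussian_kde(sample)(grid)
peaks, _ = find_peaks(density, prominence=0.01)
print("Estimated modes near:", np.round(grid[peaks], 1))
```

The quartiles alone describe a wide but otherwise unremarkable spread, while the density curve has two separate peaks, one on each side of zero.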

However, violin plots may not be appropriate in every situation. One potential drawback is that the inclusion of density estimates can make the plot harder to interpret, especially for non-expert audiences unfamiliar with the concept of kernel density estimation. In such cases, the visual complexity of a violin plot might obscure rather than clarify the data. Additionally, if the dataset is small, the density estimation may not be accurate, potentially misleading viewers about the true distribution of the data. Violin plots also don’t always provide a clear sense of individual data points, which may be critical in smaller samples. For small datasets or when you need to emphasize individual data values and outliers, a traditional box plot or a box plot combined with jitter might be more appropriate choices.

Getting ready

In addition to plotly, numpy, and pandas, make sure the scipy Python library is available in your Python environment. You can install it using the command:

pip install scipy 
  1. Import the Python modules numpy and pandas, along with the norm and t objects from scipy.stats. These objects let us draw random samples from a normal distribution and a Student's t-distribution, which we use to build the data set for this recipe.

import numpy as np
import pandas as pd
from scipy.stats import norm, t
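  1. Create the data set used in this recipe: 200 draws from a normal distribution with mean 2 and 200 draws from a Student's t-distribution with 3 degrees of freedom, combined into one long-format DataFrame.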

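n = 200  # number of draws per sample
# Normal distribution with mean 2 (standard deviation defaults to 1)
sample1 = norm(loc=2).rvs(n)
# Student's t-distribution with 3 degrees of freedom (heavier tails)
sample2 = t(df=3).rvs(n)
# Stack both samples into one column and label each row by its source
samples = np.concatenate((sample1, sample2))
labels = ['Normal'] * n + ['t-Student'] * n
data = pd.DataFrame({'Data': samples, 'Label': labels})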
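
Because the samples are random, the figures will look slightly different on every run. If you want repeatable output, one option is to pass a seed through the random_state argument of rvs, as in this sketch. The seed value 123 is an arbitrary assumption, and the describe call is only a quick sanity check on the result.

```python
import numpy as np
import pandas as pd
from scipy.stats import norm, t

n = 200
seed = 123  # arbitrary seed (assumption); any fixed integer works

# random_state makes each draw repeatable across runs
sample1 = norm(loc=2).rvs(n, random_state=seed)
sample2 = t(df=3).rvs(n, random_state=seed + 1)

data = pd.DataFrame({
    'Data': np.concatenate((sample1, sample2)),
    'Label': ['Normal'] * n + ['t-Student'] * n,
})

# Per-group summary: count, mean, std, min, quartiles, max
summary = data.groupby('Label')['Data'].describe()
print(summary)
```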

How to do it

  1. Import the plotly.express module as px and alias the data set as df.

import plotly.express as px
df = data
  1. Create a simple violin plot comparing the two samples with the violin function. Pass the color argument so that each sample gets its own violin.

fig = px.violin(df, y="Data", 
                color="Label",
                height = 500, width = 800,
                title='Violin Plots')
fig.show()
  1. Customize the colors with color_discrete_sequence. You can pass either an explicit list of colors or one of the built-in palettes available in Plotly Express.

# Option 1: pass an explicit list of colors
fig = px.violin(df, y="Data", 
                color="Label",
                color_discrete_sequence=['teal', 'purple'],
                height = 500, width = 800,
                title='Violin Plots')
fig.show()

# Option 2: pass one of the built-in qualitative palettes
fig = px.violin(df, y="Data", 
                color="Label",
                color_discrete_sequence=px.colors.qualitative.Prism,
                height = 500, width = 800,
                title='Violin Plots')
fig.show()
  1. Add a box plot inside each violin by setting the box argument to True.

fig = px.violin(df, y="Data", 
                color="Label",
                box= True,
                color_discrete_sequence=['teal', 'purple'],
                height = 500, width = 800,
                title='Violin Plots')
fig.show()
  1. Show the original data points by setting the points argument to 'all'.

fig = px.violin(df, y="Data", 
                color="Label",
                points="all",
                box=True,
                color_discrete_sequence=['teal', 'purple'],
                height = 500, width = 800,
                title='Violin Plots')
fig.show()
  1. Overlay the two samples by setting the violinmode argument to 'overlay'.

fig = px.violin(df, y="Data", 
                color="Label",
                points="all",
                violinmode="overlay",
                box=True,
                color_discrete_sequence=['teal', 'purple'],
                height = 500, width = 800,
                title='Violin Plots')
fig.show()