# Making Histograms

A **histogram** is a type of bar chart used to represent the distribution of a data set. It displays data by grouping numerical values into bins or intervals along the x-axis and representing the frequency or count of data points within each bin on the y-axis. This graphical tool helps in visualizing the shape of the data distribution, identifying patterns like skewness, and recognizing whether the data follows a particular distribution (e.g., normal distribution). Histograms are particularly useful when you want to understand the underlying frequency distribution of continuous variables, such as age, income, or temperature.

However, histograms are not the best tool for every type of data. They are inappropriate for categorical data, which has no numerical range to group into intervals, and they are often a poor fit for discrete data that takes only a few distinct values. For categorical data, bar charts or pie charts are better suited because they show individual categories rather than aggregated ranges.

Additionally, histograms can sometimes be misleading if the bin size is not chosen carefully. Too few bins may oversimplify the data, while too many bins can make the data appear overly complex, obscuring the underlying pattern. Thus, histograms should be used when dealing with continuous data and when there is an interest in understanding the overall shape of the data distribution.
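One way to make the choice of bins less arbitrary is to start from a standard rule of thumb. The sketch below is only an illustration: it uses made-up data (400 draws from a standard normal, with an arbitrary seed) and NumPy's `histogram_bin_edges` to compare how many bins a few common rules would suggest.

```python
import numpy as np

# Illustrative data only: 400 draws from a standard normal (arbitrary seed).
rng = np.random.default_rng(0)
data = rng.normal(size=400)

# Count the bins suggested by a few rules that NumPy implements.
bin_counts = {}
for rule in ("sturges", "scott", "fd", "auto"):
    edges = np.histogram_bin_edges(data, bins=rule)
    bin_counts[rule] = len(edges) - 1  # n edges delimit n - 1 bins
    print(f"{rule:>8}: {bin_counts[rule]} bins")
```

These rules give a starting point rather than a final answer, so it is still worth trying a few values around the suggestion and checking how the shape changes.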


**Density Scale**

In addition to representing data with frequencies, histograms can also be displayed on a **density scale**. In a frequency histogram, the height of each bar corresponds to the number of data points that fall within each bin. However, in a density histogram, the y-axis represents the **density**, that is, the relative frequency of data points per unit of the variable on the x-axis. To construct a density histogram, the frequency for each bin is divided by the total number of data points and by the bin width. This scaling ensures that the area under the histogram adds up to 1, which allows it to approximate a probability distribution, making it easier to compare data sets with different sample sizes or different bin widths.
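The construction described above can be checked numerically. The following is a minimal sketch on made-up data (the seed and the choice of 20 bins are arbitrary assumptions): it divides the counts by the sample size and the bin widths, then confirms that the bar areas sum to 1.

```python
import numpy as np

# Illustrative data only: 400 draws from a standard normal (arbitrary seed).
rng = np.random.default_rng(0)
data = rng.normal(size=400)

# Frequency histogram: counts per bin and the bin edges.
counts, edges = np.histogram(data, bins=20)
widths = np.diff(edges)

# Density = count / (total number of points * bin width).
density = counts / (counts.sum() * widths)

# The area of each bar is height * width; the total area should be 1.
area = np.sum(density * widths)
print(area)

# NumPy applies the same scaling when density=True is requested.
density_np, _ = np.histogram(data, bins=20, density=True)
```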

Density histograms are particularly useful when you want to focus on the **proportion** of data in each bin rather than the absolute count. They are especially relevant in statistical contexts where data is being compared to theoretical probability distributions, such as the normal distribution. However, like frequency histograms, they are not appropriate for categorical data, and they may also become less interpretable if the bin widths are poorly chosen. When deciding between a frequency or density scale, consider whether you are interested in the raw counts (frequency) or in understanding the shape of the data distribution in relation to probability (density).


**Cumulative Form**

Histograms can also be displayed in a **cumulative form**, which shows the cumulative frequency or cumulative density of data points up to each bin. Instead of displaying the number (or density) of data points that fall within each individual bin, a cumulative histogram adds up the frequencies or densities from all previous bins, so each bar represents the total count or proportion of data points up to that bin. This provides a visual representation of the cumulative distribution of the data, allowing you to see how data accumulates over its range.

Cumulative histograms are particularly useful when you're interested in understanding the proportion of data points below a certain threshold, such as determining what percentage of data falls below a given value. They can also help visualize patterns in how data accumulates, such as identifying points where the rate of accumulation changes, which might indicate outliers, thresholds, or distribution shifts. 

When using cumulative histograms, it's important to decide whether you want the cumulative version of a **frequency histogram** (where you display cumulative counts) or a **density histogram** (where you display cumulative proportions or probabilities). In both cases, the cumulative plot will always be non-decreasing, as each bin adds the current frequency or density to the total from previous bins.
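As a rough illustration of both cumulative versions, the sketch below accumulates bin counts with `np.cumsum` on made-up data (arbitrary seed, 20 equal-width bins chosen for the example). It also reads off the proportion of points below a threshold that sits exactly on a bin edge.

```python
import numpy as np

# Illustrative data only: 400 draws from a standard normal (arbitrary seed).
rng = np.random.default_rng(0)
data = rng.normal(size=400)

# 20 equal-width bins spanning the full range of the data.
edges = np.linspace(data.min(), data.max(), 21)
counts, _ = np.histogram(data, bins=edges)

# Cumulative counts and cumulative proportions.
cum_counts = np.cumsum(counts)
cum_share = cum_counts / counts.sum()

# Both sequences never decrease; they end at the sample size and at 1.
print(cum_counts[-1], cum_share[-1])

# Proportion of points below a threshold placed on a bin edge:
# the first 8 bins cover everything below edges[8].
threshold = edges[8]
share_below = cum_share[7]
direct = np.mean(data < threshold)  # same quantity computed directly
print(share_below, direct)
```

When the threshold falls inside a bin rather than on an edge, the cumulative histogram only brackets the proportion between the two neighbouring bars.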

Cumulative histograms are less effective for understanding the distribution shape at a glance compared to regular histograms, as they obscure the exact frequencies within each bin. They are also not suitable for categorical data but work well for continuous data when cumulative trends or thresholds are of interest.

## Getting ready


In addition to `plotly`, `numpy` and `pandas`, make sure the [`scipy`](https://scipy.org) Python library is available in your Python environment.
You can install it using the command:

```
pip install scipy 
```

For this recipe we will create two data sets.

1. Import the Python modules `numpy` and `pandas`, and import the [`norm`](https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.norm.html) object from `scipy.stats`. This object lets us generate random samples from a normal distribution, which we use to create the data sets for this recipe.

In [2]:
import numpy as np
import pandas as pd
from scipy.stats import norm

2. Create two data sets to be used in this recipe.

In [3]:
n = 400
sample1 = norm().rvs(n)
sample2 = norm(loc=3, scale=0.5).rvs(n)

In [4]:
data1 = pd.DataFrame({'Normal': sample1})

In [5]:
samples = np.concatenate((sample1, sample2))
labels = ['Sample 1'] * n + ['Sample 2'] * n
data2 = pd.DataFrame({'Data': samples, 'Label': labels})
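The samples above change every time the cells are run. If you want the same charts on every run, one option is to pass a seeded generator through the `random_state` argument of `rvs`. This is a sketch of that variation, not part of the original recipe; the seed `42` is an arbitrary choice.

```python
import numpy as np
import pandas as pd
from scipy.stats import norm

# A seeded generator makes the draws repeatable (42 is arbitrary).
rng = np.random.default_rng(42)

n = 400
sample1 = norm().rvs(size=n, random_state=rng)
sample2 = norm(loc=3, scale=0.5).rvs(size=n, random_state=rng)

# Same layout as data1 and data2 in the recipe.
data1 = pd.DataFrame({'Normal': sample1})
data2 = pd.DataFrame({
    'Data': np.concatenate((sample1, sample2)),
    'Label': ['Sample 1'] * n + ['Sample 2'] * n,
})
print(data2.groupby('Label')['Data'].agg(['mean', 'std']))
```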

## How to do it

1. Import the `plotly.express` module as `px`

In [6]:
import plotly.express as px

### One data set

1. Make a simple histogram to illustrate the distribution of the data in `data1` using the function `histogram`

In [7]:
df = data1
fig = px.histogram(df, x="Normal")
fig.show()

2. Add a title to your chart by passing a string as the input `title` into the function `histogram`
3. Customise the size of the figure by using the inputs `height` and `width`. Both must be integers and give the size of the figure in pixels.

In [8]:
fig = px.histogram(df, x="Normal", 
                   height = 500, width = 800,
                   title='Sample from a Normal Distribution')
fig.show()

4. Change the histogram to **density scale** by setting the input `histnorm` as `'probability density'`

In [9]:
fig = px.histogram(df, x="Normal", 
                   histnorm='probability density',
                   height = 500, width = 800,
                   title='Sample from a Normal Distribution')
fig.show()

5. Customize the number of bins using the input `nbins`. 
Choosing the number of bins in a histogram is crucial because it directly affects how the data distribution is visualized. If the number of bins is too small, the histogram may oversimplify the data, potentially hiding important details or trends. Conversely, if too many bins are used, the histogram can become overly detailed, making the distribution appear noisy and fragmented, which may obscure the overall pattern. Striking the right balance is essential to accurately represent the underlying distribution of the data.

In [10]:
fig = px.histogram(df, x="Normal", 
                   nbins=25,
                   histnorm='probability density',
                   height = 500, width = 800,
                   title='Sample from a Normal Distribution')
fig.show()

6. Customise the color of the bars using the input `color_discrete_sequence` as follows. Note that we have to pass a list of strings, where each string corresponds to a color.  In this case, we pass the color `teal`

In [11]:
fig = px.histogram(df, x="Normal", 
                   color_discrete_sequence=['teal'],
                   nbins=25,
                   histnorm='probability density',
                   height = 500, width = 800,
                   title='Sample from a Normal Distribution')
fig.show()

7. Customise the transparency of the histogram using the input `opacity`. 

In [12]:
fig = px.histogram(df, x="Normal", 
                   opacity=0.75,
                   color_discrete_sequence=['teal'],
                   nbins=25,
                   histnorm='probability density',
                   height = 500, width = 800,
                   title='Sample from a Normal Distribution')
fig.show()

8. Show the histogram in **cumulative form** by setting the input `cumulative` as `True`

In [13]:
fig = px.histogram(df, x="Normal", 
                   cumulative=True,
                   opacity=0.75,
                   color_discrete_sequence=['teal'],
                   nbins=25,
                   histnorm='probability density',
                   height = 500, width = 800,
                   title='Sample from a Normal Distribution')
fig.show()

### Multiple data sets

Now we are going to use the data set `data2`

In [14]:
df = data2

1. To make a histogram illustrating multiple distributions, we call the function `histogram` as before, but now we also pass the input `color` to differentiate the distributions


In [15]:
fig = px.histogram(df, x='Data',
                   color='Label', 
                   opacity=0.75,
                   nbins=25,
                   histnorm='probability density',
                   height = 500, width = 800,
                   title='Sample from a Normal Distribution')
fig.show()

2. Set the input `barmode` as `'overlay'`. This is particularly important when several distributions overlap on the same chart, so that the bars are drawn over one another rather than stacked

In [16]:
fig = px.histogram(df, x='Data',
                   color='Label', 
                   barmode="overlay",
                   opacity=0.75,
                   nbins=25,
                   histnorm='probability density',
                   height = 500, width = 800,
                   title='Sample from a Normal Distribution')
fig.show()
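When several groups share a chart on the density scale, each group is scaled on its own, so every histogram has area 1 regardless of how many points the group contains. The sketch below illustrates this idea with NumPy and pandas on made-up data (arbitrary seed), using one common set of bin edges for both groups. It is meant to mirror the idea, not Plotly's internal computation.

```python
import numpy as np
import pandas as pd

# Illustrative data with the same layout as data2 (arbitrary seed).
rng = np.random.default_rng(0)
n = 400
df = pd.DataFrame({
    "Data": np.concatenate([rng.normal(size=n),
                            rng.normal(loc=3, scale=0.5, size=n)]),
    "Label": ["Sample 1"] * n + ["Sample 2"] * n,
})

# Shared bin edges so the two histograms are directly comparable.
edges = np.histogram_bin_edges(df["Data"], bins=25)
widths = np.diff(edges)

# Density histogram per group; each one should have total area 1.
areas = {}
for label, group in df.groupby("Label"):
    density, _ = np.histogram(group["Data"], bins=edges, density=True)
    areas[label] = float(np.sum(density * widths))
print(areas)
```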

3. Customise the color of the histograms using the input `color_discrete_sequence`

In [17]:
fig = px.histogram(df, x='Data',
                   color='Label', 
                   opacity=0.75,
                   barmode="overlay",
                   color_discrete_sequence=['teal', 'pink'],
                   nbins=25,
                   histnorm='probability density',
                   height = 500, width = 800,
                   title='Sample from a Normal Distribution')
fig.show()

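4. Alternatively, pass one of Plotly's built-in color sequences, such as `px.colors.qualitative.Prism`, and lower the `opacity` to `0.5`

In [18]: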
fig = px.histogram(df, x='Data',
                   color='Label', 
                   barmode="overlay",
                   opacity=0.5,
                   color_discrete_sequence=px.colors.qualitative.Prism,
                   nbins=25,
                   histnorm='probability density',
                   height = 500, width = 800,
                   title='Sample from a Normal Distribution')
fig.show()