{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Visualizing Correlation Matrices" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Visualizing [correlation matrices](https://en.wikipedia.org/wiki/Correlation) is a useful practice in data analysis, as it allows practitioners to quickly assess the relationships between multiple variables in a dataset. A correlation matrix summarizes the Pearson (or Spearman) correlation coefficient for each pair of variables. Its values range from -1 to 1. A value of 1 indicates a perfect positive correlation, -1 indicates a perfect negative correlation, and 0 indicates no linear (or, for Spearman, monotonic) relationship.\n", "\n", "\n", "Visual representations, such as heatmaps, facilitate the interpretation of these relationships by providing an intuitive overview. In a heatmap, the variables are represented along the axes, and the correlation coefficients are represented as colors, making it easy to spot strong correlations (either positive or negative) and identify patterns in the data at a glance.\n", "\n", "One of the main advantages of visualizing correlation matrices is the ability to identify multicollinearity, which can be problematic in regression analysis and predictive modeling. By highlighting highly correlated variables, analysts can make informed decisions about feature selection, potentially reducing dimensionality and improving model performance. Furthermore, correlation matrix visualizations are useful in exploratory data analysis, helping to reveal underlying structures in the data, such as clusters of related variables. However, it's important to be cautious when interpreting correlation matrices, as correlation does not imply causation. A high correlation between two variables may be coincidental or driven by a third variable, so further investigation is often needed to understand the nature of these relationships. 
Overall, correlation matrix visualizations help analysts understand complex datasets by exposing trends and relationships among variables." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Getting ready\n", "\n", "\n", "In addition to `plotly`, `numpy` and `pandas`, make sure the `yfinance` and `scipy` Python libraries are available in your Python environment.\n", "You can install them using the command:" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "```\n", "pip install scipy yfinance\n", "```" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "For this recipe we will create two data sets." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "1. Import the Python modules `numpy` and `pandas`. Import the [`norm`](https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.norm.html) object from `scipy.stats`. This object allows us to generate random samples from a normal distribution, which we will use to create the data sets for this recipe." ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "import numpy as np\n", "import pandas as pd\n", "from scipy.stats import norm" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "2. Create a correlation matrix to be used in this recipe." ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "# 5 observations (rows) of 5 uniformly distributed random variables (columns)\n", "data = np.random.rand(5, 5)\n", "df = pd.DataFrame(data, columns=[f'Variable {i+1}' for i in range(5)])\n", "# Pairwise Pearson correlation between the columns\n", "correlation_matrix = df.corr()" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n",
"|            | Variable 1 | Variable 2 | Variable 3 | Variable 4 | Variable 5 |\n",
"|---|---|---|---|---|---|\n",
"| Variable 1 | 1.000000 | -0.273381 | -0.417104 | -0.142325 | 0.015200 |\n",
"| Variable 2 | -0.273381 | 1.000000 | -0.532757 | -0.342872 | -0.083532 |\n",
"| Variable 3 | -0.417104 | -0.532757 | 1.000000 | 0.693070 | 0.413611 |\n",
"| Variable 4 | -0.142325 | -0.342872 | 0.693070 | 1.000000 | 0.931133 |\n",
"| Variable 5 | 0.015200 | -0.083532 | 0.413611 | 0.931133 | 1.000000 |\n",
"