Suppose we have some multi-dimensional data at the country level and we want to see how similar two countries are. One way to do this is to calculate the Mahalanobis distance between them, which measures the distance between two points while accounting for the variances and correlations of the variables. Below is Python code that does just that.
In this code, I use the SciPy library to take advantage of the built-in function mahalanobis.
import pandas as pd
from scipy.linalg import inv
from scipy.spatial.distance import mahalanobis

datadict = {
    'country': ['Argentina', 'Bolivia', 'Brazil', 'Chile', 'Ecuador', 'Colombia', 'Paraguay', 'Peru', 'Venezuela'],
    'd1': [0.34, -0.19, 0.37, 1.17, -0.31, -0.3, -0.48, -0.15, -0.61],
    'd2': [-0.57, -0.69, -0.28, 0.68, -2.19, -0.83, -0.53, -1, -1.39],
    'd3': [-0.02, -0.55, 0.07, 1.2, -0.14, -0.85, -0.9, -0.47, -1.02],
    'd4': [-0.69, -0.18, 0.05, 1.43, -0.02, -0.7, -0.72, 0.23, -1.08],
    'd5': [-0.83, -0.69, -0.39, 1.31, -0.7, -0.75, -1.04, -0.52, -1.22],
    'd6': [-0.45, -0.77, 0.05, 1.37, -0.1, -0.67, -1.4, -0.35, -0.89]}

pairsdict = {
    'country1': ['Argentina', 'Chile', 'Ecuador', 'Peru'],
    'country2': ['Bolivia', 'Venezuela', 'Colombia', 'Peru']}

# DataFrame that contains the data for each country
df = pd.DataFrame(datadict)

# DataFrame that contains the pairs for which we calculate the Mahalanobis distance
pairs = pd.DataFrame(pairsdict)

# Add data to the country pairs (pandas appends the suffixes _x and _y to the duplicated columns)
pairs = pairs.merge(df, how='left', left_on=['country1'], right_on=['country'])
pairs = pairs.merge(df, how='left', left_on=['country2'], right_on=['country'])

# Convert data columns to list in a single cell
pairs['vector1'] = pairs[['d1_x','d2_x','d3_x','d4_x','d5_x','d6_x']].values.tolist()
pairs['vector2'] = pairs[['d1_y','d2_y','d3_y','d4_y','d5_y','d6_y']].values.tolist()

mahala = pairs[['country1', 'country2', 'vector1', 'vector2']].copy()

# Calculate covariance matrix of d1-d6 (the non-numeric country column is dropped) and its inverse
covmx = df.drop(columns='country').cov()
invcovmx = inv(covmx)

# Calculate Mahalanobis distance
mahala['mahala_dist'] = mahala.apply(lambda x: (mahalanobis(x['vector1'], x['vector2'], invcovmx)), axis=1)

mahala = mahala[['country1', 'country2', 'mahala_dist']]
The df dataframe contains 6 variables for each country. The pairs dataframe contains pairs of countries that we want to compare.
In lines 25-26, we add the 6 variables (d1–d6) to each country of the dyad. In lines 29-30, we convert the 6 columns into a single column containing a list with the 6 values of variables d1–d6. In lines 35-36, we calculate the covariance matrix and its inverse, which is required to calculate the Mahalanobis distance. Finally, in line 39, we apply the mahalanobis function from SciPy to each pair of countries and store the result in a new column called mahala_dist.
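To see why the inverse covariance matrix is needed: the Mahalanobis distance between two vectors u and v is sqrt((u - v)^T S^-1 (u - v)), where S is the covariance matrix of the variables. The following is a minimal sketch, on small made-up data rather than the country data, that computes this expression by hand with NumPy and compares it with SciPy's mahalanobis. The array X and the helper mahalanobis_by_hand are illustrative names, not part of the script.

```python
import numpy as np
from scipy.spatial.distance import mahalanobis

# Made-up sample: 5 observations of 2 variables (illustration only)
X = np.array([[1.0, 2.0],
              [2.0, 1.0],
              [3.0, 5.0],
              [4.0, 3.0],
              [5.0, 6.0]])

# Sample covariance matrix (variables in columns) and its inverse
S = np.cov(X, rowvar=False)
S_inv = np.linalg.inv(S)

def mahalanobis_by_hand(u, v, VI):
    """Hypothetical helper: sqrt((u - v)^T VI (u - v))."""
    diff = np.asarray(u) - np.asarray(v)
    return float(np.sqrt(diff @ VI @ diff))

d_manual = mahalanobis_by_hand(X[0], X[1], S_inv)
d_scipy = mahalanobis(X[0], X[1], S_inv)

# The two values should agree up to floating-point error
print(d_manual, d_scipy)
```

The script delegates exactly this computation to SciPy, once per country pair.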
As a result, we get the following table:
country1, country2, mahala_dist
Argentina, Bolivia, 3.003186
Chile, Venezuela, 3.829020
Ecuador, Colombia, 3.915868
Peru, Peru, 0.000000
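If distances between all countries are needed rather than a handful of pairs, one possible alternative is SciPy's cdist with metric='mahalanobis', which takes the inverse covariance matrix through its VI argument. This is a sketch of that variant, not the approach used in the script above; the names cols, X, VI, dist and dist_df are assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from scipy.spatial.distance import cdist

# Same country data as in the script above
df = pd.DataFrame({
    'country': ['Argentina', 'Bolivia', 'Brazil', 'Chile', 'Ecuador',
                'Colombia', 'Paraguay', 'Peru', 'Venezuela'],
    'd1': [0.34, -0.19, 0.37, 1.17, -0.31, -0.3, -0.48, -0.15, -0.61],
    'd2': [-0.57, -0.69, -0.28, 0.68, -2.19, -0.83, -0.53, -1, -1.39],
    'd3': [-0.02, -0.55, 0.07, 1.2, -0.14, -0.85, -0.9, -0.47, -1.02],
    'd4': [-0.69, -0.18, 0.05, 1.43, -0.02, -0.7, -0.72, 0.23, -1.08],
    'd5': [-0.83, -0.69, -0.39, 1.31, -0.7, -0.75, -1.04, -0.52, -1.22],
    'd6': [-0.45, -0.77, 0.05, 1.37, -0.1, -0.67, -1.4, -0.35, -0.89]})

cols = ['d1', 'd2', 'd3', 'd4', 'd5', 'd6']
X = df[cols].to_numpy(dtype=float)

# Inverse of the sample covariance matrix of d1-d6
VI = np.linalg.inv(np.cov(X, rowvar=False))

# 9 x 9 matrix of Mahalanobis distances between every pair of countries
dist = cdist(X, X, metric='mahalanobis', VI=VI)
dist_df = pd.DataFrame(dist, index=df['country'], columns=df['country'])

# Look up a single pair by country name
print(dist_df.loc['Argentina', 'Bolivia'])
```

The result is a symmetric matrix with zeros on the diagonal, so any pair can be looked up by name without building a separate pairs DataFrame.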
Hi, thank you for your post! I wonder how you would apply the Mahalanobis distance if you have both continuous and discrete variables. Do you have an example in Python?