How To / Python: Calculate Mahalanobis Distance

Suppose we have some multi-dimensional data at the country level and we want to see how similar two countries are. One way to do this is to calculate the Mahalanobis distance between the countries, which measures the distance between two vectors while accounting for the variances and correlations of the variables. Here you can find Python code to do just that.

In this code, I use the SciPy library to take advantage of the built-in function mahalanobis.

 

import pandas as pd
import scipy as sp
from scipy.spatial.distance import mahalanobis

datadict = {
'country': ['Argentina', 'Bolivia', 'Brazil', 'Chile', 'Ecuador', 'Colombia', 'Paraguay', 'Peru', 'Venezuela'],
'd1': [0.34, -0.19, 0.37, 1.17, -0.31, -0.3, -0.48, -0.15, -0.61],
'd2': [-0.57, -0.69, -0.28, 0.68, -2.19, -0.83, -0.53, -1, -1.39],
'd3': [-0.02, -0.55, 0.07, 1.2, -0.14, -0.85, -0.9, -0.47, -1.02],
'd4': [-0.69, -0.18, 0.05, 1.43, -0.02, -0.7, -0.72, 0.23, -1.08],
'd5': [-0.83, -0.69, -0.39, 1.31, -0.7, -0.75, -1.04, -0.52, -1.22],
'd6': [-0.45, -0.77, 0.05, 1.37, -0.1, -0.67, -1.4, -0.35, -0.89]}

pairsdict = {
'country1': ['Argentina', 'Chile', 'Ecuador', 'Peru'],
'country2': ['Bolivia', 'Venezuela', 'Colombia', 'Peru']}

#DataFrame that contains the data for each country
df = pd.DataFrame(datadict)

#DataFrame that contains the pairs for which we calculate the Mahalanobis distance
pairs = pd.DataFrame(pairsdict)

#Add data to the country pairs
pairs = pairs.merge(df, how='left', left_on=['country1'], right_on=['country'])
pairs = pairs.merge(df, how='left', left_on=['country2'], right_on=['country'])

#Convert data columns to list in a single cell
pairs['vector1'] = pairs[['d1_x','d2_x','d3_x','d4_x','d5_x','d6_x']].values.tolist()
pairs['vector2'] = pairs[['d1_y','d2_y','d3_y','d4_y','d5_y','d6_y']].values.tolist()

mahala = pairs[['country1', 'country2', 'vector1', 'vector2']].copy()

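#Calculate covariance matrix of d1-d6 (the text column 'country' is dropped first; recent pandas versions raise an error otherwise)
covmx = df.drop(columns='country').cov()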
invcovmx = sp.linalg.inv(covmx)

#Calculate Mahalanobis distance
mahala['mahala_dist'] = mahala.apply(lambda x: (mahalanobis(x['vector1'], x['vector2'], invcovmx)), axis=1)

mahala = mahala[['country1', 'country2', 'mahala_dist']]

The df dataframe contains 6 variables for each country. The pairs dataframe contains pairs of countries that we want to compare.

In lines 25-26 of the code (the two merge calls), we add the 6 variables (d1 to d6) to each country of the dyad. Because the second merge brings in columns whose names already exist, pandas appends the suffix _x to the columns for country1 and _y to those for country2. In lines 29-30, we convert the 6 columns into a single column containing a list with the 6 values of variables d1 to d6. In lines 35-36, we calculate the covariance matrix and its inverse; the inverse is required to calculate the Mahalanobis distance. Finally, in line 39, we apply the mahalanobis function from SciPy to each pair of countries and store the result in the new column called mahala_dist.
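To see why the inverse covariance matrix is needed, it may help to compute one distance by hand. The Mahalanobis distance between two vectors u and v is the square root of (u - v)ᵀ S⁻¹ (u - v), where S⁻¹ is the inverse covariance matrix. The following is a minimal sketch, not part of the code above: it re-enters the same data as a NumPy array (the names X, countries, inv_cov, manual_dist and scipy_dist are my own) and assumes that np.cov with its default N-1 normalization is an acceptable stand-in for the pandas covariance.

```python
import numpy as np
from scipy.spatial.distance import mahalanobis

# Same values as datadict above: one row per country, columns d1..d6
countries = ['Argentina', 'Bolivia', 'Brazil', 'Chile', 'Ecuador',
             'Colombia', 'Paraguay', 'Peru', 'Venezuela']
X = np.array([
    [0.34, -0.57, -0.02, -0.69, -0.83, -0.45],
    [-0.19, -0.69, -0.55, -0.18, -0.69, -0.77],
    [0.37, -0.28, 0.07, 0.05, -0.39, 0.05],
    [1.17, 0.68, 1.20, 1.43, 1.31, 1.37],
    [-0.31, -2.19, -0.14, -0.02, -0.70, -0.10],
    [-0.30, -0.83, -0.85, -0.70, -0.75, -0.67],
    [-0.48, -0.53, -0.90, -0.72, -1.04, -1.40],
    [-0.15, -1.00, -0.47, 0.23, -0.52, -0.35],
    [-0.61, -1.39, -1.02, -1.08, -1.22, -0.89],
])

# Covariance across countries (columns are variables), then its inverse
inv_cov = np.linalg.inv(np.cov(X, rowvar=False))

u = X[countries.index('Argentina')]
v = X[countries.index('Bolivia')]
diff = u - v

# Distance written out from the formula
manual_dist = float(np.sqrt(diff @ inv_cov @ diff))

# Same distance via SciPy, for comparison
scipy_dist = float(mahalanobis(u, v, inv_cov))

print(manual_dist, scipy_dist)
```

The two printed values should agree up to floating-point error, which shows that the SciPy function is doing nothing more than evaluating this formula with the inverse covariance matrix we pass in.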

As a result, we get the following table:

country1, country2, mahala_dist
Argentina, Bolivia, 3.003186
Chile, Venezuela, 3.829020
Ecuador, Colombia, 3.915868
Peru, Peru, 0.000000
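If you need the distance between every pair of countries rather than a hand-picked list, one possible alternative is SciPy's cdist function with metric='mahalanobis', which avoids the merges altogether. This is only a sketch of that swapped-in approach, not the method used above; the names dist and dist_table are my own.

```python
import numpy as np
import pandas as pd
from scipy.spatial.distance import cdist

# Same data as above, indexed by country so only d1..d6 remain as columns
df = pd.DataFrame({
    'country': ['Argentina', 'Bolivia', 'Brazil', 'Chile', 'Ecuador',
                'Colombia', 'Paraguay', 'Peru', 'Venezuela'],
    'd1': [0.34, -0.19, 0.37, 1.17, -0.31, -0.3, -0.48, -0.15, -0.61],
    'd2': [-0.57, -0.69, -0.28, 0.68, -2.19, -0.83, -0.53, -1, -1.39],
    'd3': [-0.02, -0.55, 0.07, 1.2, -0.14, -0.85, -0.9, -0.47, -1.02],
    'd4': [-0.69, -0.18, 0.05, 1.43, -0.02, -0.7, -0.72, 0.23, -1.08],
    'd5': [-0.83, -0.69, -0.39, 1.31, -0.7, -0.75, -1.04, -0.52, -1.22],
    'd6': [-0.45, -0.77, 0.05, 1.37, -0.1, -0.67, -1.4, -0.35, -0.89],
}).set_index('country')

# Inverse of the covariance matrix of d1..d6
inv_cov = np.linalg.inv(df.cov().to_numpy())

# 9 x 9 matrix of Mahalanobis distances between all countries
values = df.to_numpy()
dist = cdist(values, values, metric='mahalanobis', VI=inv_cov)

# Label rows and columns with country names for easy lookup
dist_table = pd.DataFrame(dist, index=df.index, columns=df.index)

print(dist_table.loc['Argentina', 'Bolivia'])
print(dist_table.loc['Peru', 'Peru'])
```

The diagonal of dist_table is zero, just like the Peru, Peru row in the table above, and the matrix is symmetric, so each pair only needs to be read once.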

 
