Author Archives: Francisco Morales

How To / Python: Get geographic coordinates using Google (Geocode)

Thanks to pygeocoder, obtaining geographic coordinates from Google (i.e. geocoding) is very simple. The Python script below does exactly that.

from pygeocoder import Geocoder
import pandas as pd

cities = {
'City': ['Coral Gables', 'Hawarden', 'Chapeco', 'Merida', 'Assisi'],
'Nation': ['United States', 'United Kingdom', 'Brazil', 'Spain', 'Italy']}

df = pd.DataFrame(cities)
rows = []

for index, row in df.iterrows():

    city = str(row['City'])
    nation = str(row['Nation'])
    location = str(row['City'] + ', ' + row['Nation'])

    # Query Google's geocoder and take the first match
    result = Geocoder.geocode(location)
    coords = str(result[0].coordinates)

    # str.translate(None, ...) is Python 2 only; strip the parentheses instead
    coords = coords.strip('()')

    lat = float(coords.split(',')[0])
    long = float(coords.split(',')[1])

    line = pd.Series([city, nation, lat, long])

    rows.append(line)

# DataFrame.append was removed in pandas 2.0; build the DataFrame from the list instead
info = pd.DataFrame(rows)

Inside the loop, I first build the ‘address’ I want to geocode, in the format ‘City, Country’. I then pass it to pygeocoder to obtain the coordinates and, finally, I extract the latitude and longitude by stripping the parentheses and splitting the result on the comma.

The result is a list of cities with their respective latitude and longitude.

Coral Gables, United States, 25.721490, -80.268384
Hawarden, United Kingdom, 53.185690, -3.029807
Chapeco, Brazil, -27.100934, -52.615699
Merida, Spain, 38.939427, -6.365157
Assisi, Italy, 43.070702, 12.619597

If you are dealing with a large number of cities, you may want to add a couple of lines to handle exceptions (e.g. when pygeocoder cannot find the city). You could use try and except, replacing the lat and long variables with a specific number very different from any valid geographic coordinate (not pretty, but effective!), as in the sketch below.
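A minimal sketch of that idea (assuming pygeocoder raises its GeocoderError exception on failed lookups; the -999.0 sentinel is arbitrary):

from pygeocoder import Geocoder, GeocoderError

try:
    result = Geocoder.geocode(location)
    coords = str(result[0].coordinates).strip('()')
    lat = float(coords.split(',')[0])
    long = float(coords.split(',')[1])
except GeocoderError:
    # Sentinel values far outside the valid ranges
    # (latitude: -90 to 90, longitude: -180 to 180)
    lat, long = -999.0, -999.0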

You can also check Chris Albon’s website; he goes into more detail, and I probably borrowed from his code to write this script.

How To / Python: Calculate Cosine Distance II/II

This is the second part of the previous post.

Suppose now that we have incomplete information for each of the countries, or that some elements are equal to zero and we omit them instead of listing them. In either case, we no longer have vectors of the same length (i.e. vectors indexed in the exact same way).

For example, we want to calculate the cosine distance between Argentina and Chile and the vectors are:

country, var, value
Chile, d1, 1.17
Chile, d2, 0.68
Chile, d4, 1.43
Chile, d6, 1.37
Argentina, d3, -0.02
Argentina, d4, -0.69
Argentina, d5, -0.83
Argentina, d6, -0.45

Note that the data is now in long format, whereas the previous post used data in wide format. In the full script below I reshape the data with pd.melt.
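If you have not used pd.melt before, here is a minimal sketch of the wide-to-long reshape on a toy example (two countries, two variables):

import pandas as pd

wide = pd.DataFrame({'country': ['Argentina', 'Chile'],
                     'd1': [0.34, 1.17],
                     'd2': [-0.57, 0.68]})

#One row per (country, variable) pair instead of one column per variable
long = pd.melt(wide, id_vars=['country'], var_name='var', value_name='value')
print(long)
#     country var  value
#0  Argentina  d1   0.34
#1      Chile  d1   1.17
#2  Argentina  d2  -0.57
#3      Chile  d2   0.68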

In the table above, Chile has no rows for variables d3 and d5, and Argentina has no rows for d1 and d2. This makes it tricky to use the cosine function from SciPy directly, so in the code below I define two functions to get around this and calculate the cosine distance manually.

The function mynorm calculates the norm of a country’s vector. The function mydotprod calculates the dot product between two vectors using pd.merge: the inner merge keeps only the elements that both countries share, which gets around the fact that Argentina and Chile do not have exactly the same vectors.
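Together, the two functions implement the standard cosine distance, with the dot product running only over the elements that both countries share:

d_{\cos}(a, b) = 1 - \frac{a \cdot b}{\lVert a \rVert \, \lVert b \rVert}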

import pandas as pd
from math import sqrt

#Function to calculate the norm of a country's vector
def mynorm(table):
    vector_elements = [value**2 for value in table['value']]
    norm = sqrt(sum(vector_elements))
    return norm

#Function to calculate the dot product of two vectors using pd.merge
#(it reads the long-format data from the global df2 defined below)
def mydotprod(a,b):
    dfa = df2[(df2.country == a)][['var','value']]
    dfb = df2[(df2.country == b)][['var','value']]
    mergeddf = dfa.merge(dfb, how = 'inner', on = 'var')
    mergeddf['prod'] = mergeddf['value_x']*mergeddf['value_y']
    dotprod = float(mergeddf['prod'].sum())
    return dotprod

datadict = {'country': ['Argentina', 'Bolivia', 'Brazil', 'Chile', 'Ecuador', 'Colombia', 'Paraguay', 'Peru', 'Venezuela'],
            'd1': [0.34, -0.19, 0.37, 1.17, -0.31, -0.3, -0.48, -0.15, -0.61],
            'd2': [-0.57, -0.69, -0.28, 0.68, -2.19, -0.83, -0.53, -1, -1.39],
            'd3': [-0.02, -0.55, 0.07, 1.2, -0.14, -0.85, -0.9, -0.47, -1.02],
            'd4': [-0.69, -0.18, 0.05, 1.43, -0.02, -0.7, -0.72, 0.23, -1.08],
            'd5': [-0.83, -0.69, -0.39, 1.31, -0.7, -0.75, -1.04, -0.52, -1.22],
            'd6': [-0.45, -0.77, 0.05, 1.37, -0.1, -0.67, -1.4, -0.35, -0.89]}

pairsdict = {'country1': ['Argentina', 'Ecuador'],
             'country2': ['Chile', 'Colombia']}

df = pd.DataFrame(datadict)
pairs = pd.DataFrame(pairsdict)

print(df)
print(pairs)

#Reshape from wide to long, then drop rows to mimic the incomplete data above
df1 = pd.melt(df, id_vars=['country'], var_name='var', value_name='value')
df2 = df1[(df1['country'] == 'Chile') & (df1['var'] != 'd3') & (df1['var'] != 'd5')]
#DataFrame.append was removed in pandas 2.0, so use pd.concat instead
df2 = pd.concat([df2, df1[(df1['country'] == 'Argentina') & (df1['var'] != 'd1') & (df1['var'] != 'd2')]])
df2 = pd.concat([df2, df1[(df1['country'] == 'Ecuador') | (df1['country'] == 'Colombia')]])

#Group variable by country in order to calculate the norm of the country's vector
df3 = df2.groupby(['country'])
dfnorm = pd.DataFrame(df3.apply(mynorm)).reset_index()
dfnorm.rename(columns={0: 'norm'}, inplace = True)

#Add the norm values to the DataFrame containing the pairs of countries
df4 = pairs.merge(dfnorm, how = 'left', left_on = 'country1' , right_on = 'country')
df4 = df4[['country1', 'country2', 'norm']]
df4 = df4.merge(dfnorm, how = 'left', left_on = 'country2' , right_on = 'country')
df4 = df4[['country1', 'country2', 'norm_x', 'norm_y']]

#Calculate denominator and then apply the mydotprod function to obtain the dot product
df4['denom'] = df4['norm_x'] * df4['norm_y']
df4['dotprod'] = df4.apply(lambda row: round(mydotprod(row['country1'], row['country2']),2), axis=1)
df4['dist'] = 1 - (df4['dotprod'] / df4['denom'])

After setting up the data, I reshape it from wide to long and drop some rows so that it matches the incomplete data shown at the beginning of this post (i.e. incomplete vectors for Argentina and Chile).

Next, I calculate the norm of each country’s vector: I group the data by country and apply the mynorm function.

I then merge the norms onto the DataFrame containing the pairs of countries I want to compare.

Finally, I calculate the denominator of the formula (the product of both norms), apply the mydotprod function to obtain the dot product, and divide the dot product by the product of the norms. Subtracting this ratio from 1 gives the cosine distance (ranging from 0 to 2).

As a result, we get the following table:

country1, country2, norm_x, norm_y, denom, dotprod, dist
Argentina, Chile, 1.169573, 2.398562, 2.805292, -1.60, 1.570351
Ecuador, Colombia, 2.326414, 1.732859, 4.031346, 2.64, 0.345132

Here you can see that the distance between Ecuador and Colombia is the same as the one we obtained in the previous post (0.35).
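If you want to double-check that number, note that Ecuador and Colombia have complete vectors, so SciPy’s built-in cosine function should agree with the manual calculation. A minimal sketch, reusing the wide-format df from the script above:

from scipy.spatial.distance import cosine

#Pull the two complete six-element vectors from the wide-format data
ecu = df.loc[df.country == 'Ecuador', 'd1':'d6'].values[0]
col = df.loc[df.country == 'Colombia', 'd1':'d6'].values[0]

print(round(cosine(ecu, col), 2))  # 0.35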

How To / Python: Calculate Cosine Distance I/II

Suppose we have some multi-dimensional data at the country level and we want to see the extent to which two countries are similar. One way to do this is by calculating the cosine distance between the countries. Here you can find Python code to do just that.

In this code, I use the SciPy library to take advantage of the built-in function cosine, which returns 1 minus the cosine similarity. Since cosine similarity ranges from -1 to 1, the results of this function range from 0 to 2.
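A quick illustration of that range (a minimal sketch):

from scipy.spatial.distance import cosine

print(cosine([1, 0], [1, 0]))   # 0.0: same direction
print(cosine([1, 0], [0, 1]))   # 1.0: orthogonal vectors
print(cosine([1, 0], [-1, 0]))  # 2.0: opposite directions

With that in mind, here is the full script: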

import pandas as pd
from scipy.spatial.distance import cosine

datadict = {
'country': ['Argentina', 'Bolivia', 'Brazil', 'Chile', 'Ecuador', 'Colombia', 'Paraguay', 'Peru', 'Venezuela'],
'd1': [0.34, -0.19, 0.37, 1.17, -0.31, -0.3, -0.48, -0.15, -0.61],
'd2': [-0.57, -0.69, -0.28, 0.68, -2.19, -0.83, -0.53, -1, -1.39],
'd3': [-0.02, -0.55, 0.07, 1.2, -0.14, -0.85, -0.9, -0.47, -1.02],
'd4': [-0.69, -0.18, 0.05, 1.43, -0.02, -0.7, -0.72, 0.23, -1.08],
'd5': [-0.83, -0.69, -0.39, 1.31, -0.7, -0.75, -1.04, -0.52, -1.22],
'd6': [-0.45, -0.77, 0.05, 1.37, -0.1, -0.67, -1.4, -0.35, -0.89]}

pairsdict = {
'country1': ['Argentina', 'Venezuela', 'Ecuador', 'Peru'],
'country2': ['Bolivia', 'Chile', 'Colombia', 'Peru']}

df = pd.DataFrame(datadict)
pairs = pd.DataFrame(pairsdict) 

#Add data to the country pairs
pairs = pairs.merge(df, how='left', left_on=['country1'], right_on=['country'])
pairs = pairs.merge(df, how='left', left_on=['country2'], right_on=['country'])

#Convert data columns to list in a single cell
pairs['vector1'] = pairs[['d1_x','d2_x','d3_x','d4_x','d5_x','d6_x']].values.tolist()
pairs['vector2'] = pairs[['d1_y','d2_y','d3_y','d4_y','d5_y','d6_y']].values.tolist()

#Copy the slice to avoid pandas' SettingWithCopyWarning when adding a column
cosinedf = pairs[['country1', 'country2', 'vector1', 'vector2']].copy()

#Calculate Cosine distance
cosinedf['cosine_dist'] = cosinedf.apply(lambda x: round(cosine(x['vector1'], x['vector2']),2), axis=1)

cosinedf = cosinedf[['country1', 'country2', 'cosine_dist']]
print(cosinedf)

The df dataframe contains 6 variables for each country. The pairs dataframe contains pairs of countries that we want to compare.

After the two merges, each row contains the six variables (d1-d6) for both countries of the dyad. We then convert the six columns into a single column holding a list with the six values. Finally, we apply the cosine function from SciPy to each pair of countries and store the result in the new column cosine_dist.

As a result, we get the following table:

country1, country2, cosine_dist
Argentina, Bolivia, 0.26
Venezuela, Chile, 1.93
Ecuador, Colombia, 0.35
Peru, Peru, 0.00

This piece of code works well when you already have vectors of the same length, indexed in the same way. However, when you have vectors with many elements equal to zero, you might have the data in a compressed (long) format. For those cases, we need to take a longer route to calculate the cosine distance. I explain this case in the next post.

How To / Python: Combine multiple CSV files into one

If you have multiple CSV files with the same structure, you can append or combine them using a short Python script. Suppose you have several files whose names start with datayear, for instance datayear1980.csv, datayear1981.csv, and datayear1982.csv.

import pandas as pd
import glob, os

os.chdir("C:/Folder")
frames = []

#Read each matching file and collect it in a list
for file in glob.glob("datayear*"):
    namedf = pd.read_csv(file, usecols=[1, 2, 3])
    frames.append(namedf)

#DataFrame.append was removed in pandas 2.0; pd.concat stacks the list in one go
results = pd.concat(frames, ignore_index=True)

results.to_csv('C:/combinedfile.csv')

In the glob.glob call you specify the pattern of the file names, and with the usecols argument of pd.read_csv you select which columns you want to keep.
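As a side note, the loop can be compressed into a single pd.concat call with a generator expression; a sketch under the same assumptions about the folder and the file-name pattern:

import pandas as pd
import glob, os

os.chdir("C:/Folder")

#Read and stack every matching file in one expression
results = pd.concat((pd.read_csv(f, usecols=[1, 2, 3]) for f in glob.glob("datayear*")),
                    ignore_index=True)
results.to_csv('C:/combinedfile.csv')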

How To / Python: Calculate Mahalanobis Distance

Suppose we have some multi-dimensional data at the country level and we want to see the extent to which two countries are similar. One way to do this is by calculating the Mahalanobis distance between the countries. Here you can find Python code to do just that.

In this code, I use the SciPy library to take advantage of the built-in function mahalanobis.

import pandas as pd
import numpy as np
from scipy.spatial.distance import mahalanobis

datadict = {
'country': ['Argentina', 'Bolivia', 'Brazil', 'Chile', 'Ecuador', 'Colombia', 'Paraguay', 'Peru', 'Venezuela'],
'd1': [0.34, -0.19, 0.37, 1.17, -0.31, -0.3, -0.48, -0.15, -0.61],
'd2': [-0.57, -0.69, -0.28, 0.68, -2.19, -0.83, -0.53, -1, -1.39],
'd3': [-0.02, -0.55, 0.07, 1.2, -0.14, -0.85, -0.9, -0.47, -1.02],
'd4': [-0.69, -0.18, 0.05, 1.43, -0.02, -0.7, -0.72, 0.23, -1.08],
'd5': [-0.83, -0.69, -0.39, 1.31, -0.7, -0.75, -1.04, -0.52, -1.22],
'd6': [-0.45, -0.77, 0.05, 1.37, -0.1, -0.67, -1.4, -0.35, -0.89]}

pairsdict = {
'country1': ['Argentina', 'Chile', 'Ecuador', 'Peru'],
'country2': ['Bolivia', 'Venezuela', 'Colombia', 'Peru']}

#DataFrame that contains the data for each country
df = pd.DataFrame(datadict)

#DataFrame that contains the pairs for which we calculate the Mahalanobis distance
pairs = pd.DataFrame(pairsdict)

#Add data to the country pairs
pairs = pairs.merge(df, how='left', left_on=['country1'], right_on=['country'])
pairs = pairs.merge(df, how='left', left_on=['country2'], right_on=['country'])

#Convert data columns to list in a single cell
pairs['vector1'] = pairs[['d1_x','d2_x','d3_x','d4_x','d5_x','d6_x']].values.tolist()
pairs['vector2'] = pairs[['d1_y','d2_y','d3_y','d4_y','d5_y','d6_y']].values.tolist()

#Copy the slice to avoid pandas' SettingWithCopyWarning when adding a column
mahala = pairs[['country1', 'country2', 'vector1', 'vector2']].copy()

#Calculate the covariance matrix of the six numeric variables and invert it
covmx = df[['d1', 'd2', 'd3', 'd4', 'd5', 'd6']].cov()
invcovmx = np.linalg.inv(covmx)

#Calculate Mahalanobis distance
mahala['mahala_dist'] = mahala.apply(lambda x: (mahalanobis(x['vector1'], x['vector2'], invcovmx)), axis=1)

mahala = mahala[['country1', 'country2', 'mahala_dist']]

The df dataframe contains 6 variables for each country. The pairs dataframe contains pairs of countries that we want to compare.

After the two merges, each row contains the six variables (d1-d6) for both countries of the dyad. We then convert the six columns into a single column holding a list with the six values. Next, we calculate the inverse of the covariance matrix, which is required to calculate the Mahalanobis distance. Finally, we apply the mahalanobis function from SciPy to each pair of countries and store the result in the new column mahala_dist.

As a result, we get the following table:

country1, country2, mahala_dist
Argentina, Bolivia, 3.003186
Chile, Venezuela, 3.829020
Ecuador, Colombia, 3.915868
Peru, Peru, 0.000000
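Under the hood, the mahalanobis function computes sqrt((u - v)' V^-1 (u - v)). A minimal sketch of the same calculation in plain NumPy, reusing df and invcovmx from the script above, reproduces the Argentina-Bolivia row:

import numpy as np

#Difference between the two country vectors
u = df.loc[df.country == 'Argentina', 'd1':'d6'].values[0]
v = df.loc[df.country == 'Bolivia', 'd1':'d6'].values[0]
diff = u - v

#sqrt of (diff transposed) x (inverse covariance matrix) x (diff)
print(np.sqrt(diff @ invcovmx @ diff))  # 3.003186...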


What Can We Learn about Writing from Flaubert’s Letters?

In short? That even talented writers suffer. A lot.

When we are trying to find wisdom about how to write well, we usually reach for Stephen King’s On Writing and a few similar books. A less straightforward way is to look at the notes or letters of writers. The Peruvian Nobel laureate Mario Vargas Llosa is a loyal follower of Gustave Flaubert, the French writer known as the precursor of the modern novel. Vargas Llosa has said that after reading Flaubert’s letters he realized that with work, insistence, and perseverance one can make up for a lack of talent.

Below I share some of the most interesting passages that I found in Flaubert’s correspondence (Steegmuller, 1980) from the time he was writing his masterpiece, Madame Bovary. Most of these passages come from letters addressed to his friend and lover Louise Colet, with whom Flaubert had a very strange relationship (but that is ‘harina de otro costal’, another story altogether). Those letters show the process behind some of the best sentences one could write. We usually assume that sentences flow out of talented writers effortlessly. However, reading Flaubert’s letters we can see how one of the best writers struggled to produce written perfection.


Genetic Diversity and Economic Development

If we think that diversity can have an impact on organizational outcomes, it makes sense to also look at higher-level outcomes. Ashraf and Galor (2013) do just that: they study the effect of genetic diversity on economic development at the country level. They find that an intermediate level of genetic diversity is associated with the highest level of economic development. That is, there is an inverted-U relationship between genetic diversity and economic development, reflecting a trade-off between the positive and negative effects of diversity on productivity.

Based on their results, the authors indicate that (Ashraf & Galor, 2013: 38)

(i) increasing the diversity of the most homogenous country in the sample (Bolivia) by 1 percentage point would raise its income per capita in the year 2000 CE by 39%, (ii) decreasing the diversity of the most diverse country in the sample (Ethiopia) by 1 percentage point would raise its income per capita by 21%…

The following image extracted from the paper shows this curvilinear relationship:

[Figure from Ashraf and Galor (2013) showing the inverted-U relationship]

They explore some mechanisms through which genetic diversity has an impact on economic development. In particular, they look at interpersonal trust and scientific knowledge creation. They find that (Ashraf & Galor, 2013: 8)

…the analysis indicates that genetic diversity is negatively associated with the extent of cooperative behavior, as measured by the prevalence of interpersonal trust, and positively associated with innovative activity, as measured by the intensity of scientific knowledge creation.

This study has received some attention from news sites. You can check one of the authors’ websites here.

References

Ashraf, Q., & Galor, O. (2013). The ‘Out of Africa’ hypothesis, human genetic diversity, and comparative economic development. American Economic Review, 103(1), 1–46.