Soccer: Is a partial 2-0 in favor a dangerous result?

Journalists and commentators often suggest that a partial 2-0 lead is a dangerous result. The argument is that the leading team will inadvertently slow down, giving the other team the chance to mount a comeback. Is this ‘soccer myth’ true?

I collected data on 14,518 matches covering every season from 2009/2010 to 2016/2017 in the following leagues: Bundesliga (Germany), English Premier League (England), La Liga (Spain), Ligue 1 (France), and Serie A (Italy). The data show that the ‘myth’ of the partial 2-0 as a dangerous result is simply not accurate.

[Table: outcomes of matches in which a team led 2-0]

Of all the matches, 5,720 had a team leading 2-0 at some point during the match. In only 407 of those matches did the leading team fail to win (the team either lost or tied). That is, in 92.9% of the 5,720 matches (5,313 of them) the ‘2-0 partial result is a dangerous result’ statement was not true. And what if the team is leading 2-0 at half time? In that case, the leading team wins the match in 89.3% of the cases.

It could be that in some leagues a 2-0 lead is easier or harder to maintain. Let’s break down the 92.9% by league.

[Table: percentage of matches won after leading 2-0, by league]

Although there is some variation, the minimum is still very high. In the Bundesliga, a team leading by a partial 2-0 goes on to win the match 91.7% of the time. At the other end, in La Liga, a team leading 2-0 wins 94.2% of its matches.

Therefore, it is hard to say that a partial 2-0 is a dangerous result in soccer.

What about a partial 4-0? Out of the 782 matches in which a team was leading 4-0, there is only one in which the leading team did not win. Who gets the honors? Of course, it had to be Arsenal (sorry, Arsène!). In a very entertaining match, Arsenal FC squandered a 4-0 lead against Newcastle United FC during the 2010/2011 season.


How To / Python: Get geographic coordinates using Google (Geocode)

Thanks to pygeocoder, obtaining geographic coordinates from Google (geocoding) is very simple. The Python script below does exactly that.

from pygeocoder import Geocoder
import pandas as pd

cities = {
    'City': ['Coral Gables', 'Hawarden', 'Chapeco', 'Merida', 'Assisi'],
    'Nation': ['United States', 'United Kingdom', 'Brazil', 'Spain', 'Italy']}

df = pd.DataFrame(cities)
rows = []

for index, row in df.iterrows():

    #Build the 'address' to geocode, in 'City, Country' format
    location = row['City'] + ', ' + row['Nation']

    #Query Google's geocoder; coordinates is a (latitude, longitude) tuple
    result = Geocoder.geocode(location)
    lat, long = result[0].coordinates

    rows.append([row['City'], row['Nation'], lat, long])

info = pd.DataFrame(rows, columns=['City', 'Nation', 'Latitude', 'Longitude'])

Inside the loop, I first create the ‘address’ I want to geocode, in the ‘City, Country’ format, and then pass it to pygeocoder to obtain the coordinates. Since result[0].coordinates returns a (latitude, longitude) pair, the two values can be unpacked directly, with no string splitting needed.

The result is a list of cities with their respective latitude and longitude.

Coral Gables, United States, 25.721490, -80.268384
Hawarden, United Kingdom, 53.185690, -3.029807
Chapeco, Brazil, -27.100934, -52.615699
Merida, Spain, 38.939427, -6.365157
Assisi, Italy, 43.070702, 12.619597

If you are dealing with a large number of cities, you may want to add a couple of lines to handle exceptions (e.g. when pygeocoder cannot find the city). You could use try and except, replacing the lat and long variables with a specific number very different from the values of a geographic coordinate (not pretty, but effective!).
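For example, a minimal sketch of that guard could look like this (the 999.0 sentinel is an arbitrary choice of mine; anything outside the valid coordinate ranges works):

try:
    result = Geocoder.geocode(location)
    lat, long = result[0].coordinates
except Exception:
    #City not found or the API call failed: flag the row with an impossible coordinate
    lat, long = 999.0, 999.0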

You can check Chris Albon’s website; he goes into more detail, and I probably borrowed from his code to write this script.

How To / Python: Calculate Cosine Distance II/II

This is the second part of this post.

Suppose now that we have incomplete information for each of the countries, or that some elements are equal to zero and, instead of listing them, we omit them. Either way, we no longer have vectors of the same length, indexed in the exact same way.

For example, we want to calculate the cosine distance between Argentina and Chile and the vectors are:

country, var, value
Chile, d1, 1.17
Chile, d2, 0.68
Chile, d4, 1.43
Chile, d6, 1.37
Argentina, d3, -0.02
Argentina, d4, -0.69
Argentina, d5, -0.83
Argentina, d6, -0.45

Note that the data is now in a long format, whereas the previous post used data in a wide format. I reshape the data with pd.melt in the code below.

Here you can see that Chile does not have rows for variables d3 and d5, and Argentina does not have rows for d1 and d2. This makes it tricky to use the cosine function from SciPy directly. In the code below I define two functions to get around this and calculate the cosine distance manually.

The function mynorm calculates the norm of a vector. The function mydotprod calculates the dot product between two vectors using pd.merge: an inner merge on the variable name keeps only the set of elements that both Argentina and Chile share, which gets around the fact that their vectors are not indexed the same way.
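For reference, the quantity being assembled manually below is the standard cosine distance

\[ \mathrm{dist}(a, b) = 1 - \frac{a \cdot b}{\lVert a \rVert \, \lVert b \rVert} \]

and, since the only elements Argentina and Chile share are d4 and d6, their dot product reduces to \( (-0.69)(1.43) + (-0.45)(1.37) \approx -1.60 \).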

import pandas as pd
from math import sqrt

#Function to calculate the norm of a country's vector
def mynorm(table):
    squared = [value**2 for value in table['value']]
    norm = sqrt(sum(squared))
    return norm

#Function to calculate the dot product of two vectors using pd.merge
#(it reads df2 from the enclosing scope)
def mydotprod(a, b):
    dfa = df2[df2.country == a][['var', 'value']]
    dfb = df2[df2.country == b][['var', 'value']]
    #The inner merge on 'var' keeps only the elements both countries share
    mergeddf = dfa.merge(dfb, how='inner', on='var')
    mergeddf['prod'] = mergeddf['value_x'] * mergeddf['value_y']
    dotprod = float(mergeddf['prod'].sum())
    return dotprod

datadict = {'country': ['Argentina', 'Bolivia', 'Brazil', 'Chile', 'Ecuador', 'Colombia', 'Paraguay', 'Peru', 'Venezuela'],
            'd1': [0.34, -0.19, 0.37, 1.17, -0.31, -0.3, -0.48, -0.15, -0.61],
            'd2': [-0.57, -0.69, -0.28, 0.68, -2.19, -0.83, -0.53, -1, -1.39],
            'd3': [-0.02, -0.55, 0.07, 1.2, -0.14, -0.85, -0.9, -0.47, -1.02],
            'd4': [-0.69, -0.18, 0.05, 1.43, -0.02, -0.7, -0.72, 0.23, -1.08],
            'd5': [-0.83, -0.69, -0.39, 1.31, -0.7, -0.75, -1.04, -0.52, -1.22],
            'd6': [-0.45, -0.77, 0.05, 1.37, -0.1, -0.67, -1.4, -0.35, -0.89]}

pairsdict = {'country1': ['Argentina', 'Ecuador'],
             'country2': ['Chile', 'Colombia']}

df = pd.DataFrame(datadict)
pairs = pd.DataFrame(pairsdict)

print(df)
print(pairs)

#Reshape from wide to long format, then drop the 'missing' elements
#for Chile (d3, d5) and Argentina (d1, d2)
df1 = pd.melt(df, id_vars=['country'], var_name='var', value_name='value')
df2 = pd.concat([
    df1[(df1['country'] == 'Chile') & (~df1['var'].isin(['d3', 'd5']))],
    df1[(df1['country'] == 'Argentina') & (~df1['var'].isin(['d1', 'd2']))],
    df1[df1['country'].isin(['Ecuador', 'Colombia'])]])

#Group by country in order to calculate the norm of each country's vector
df3 = df2.groupby(['country'])
dfnorm = pd.DataFrame(df3.apply(mynorm)).reset_index()
dfnorm.rename(columns={0: 'norm'}, inplace=True)

#Add the norm values to the DataFrame containing the pairs of countries
df4 = pairs.merge(dfnorm, how='left', left_on='country1', right_on='country')
df4 = df4[['country1', 'country2', 'norm']]
df4 = df4.merge(dfnorm, how='left', left_on='country2', right_on='country')
df4 = df4[['country1', 'country2', 'norm_x', 'norm_y']]

#Calculate the denominator, then apply mydotprod to obtain the dot product
df4['denom'] = df4['norm_x'] * df4['norm_y']
df4['dotprod'] = df4.apply(lambda row: round(mydotprod(row['country1'], row['country2']), 2), axis=1)
df4['dist'] = 1 - (df4['dotprod'] / df4['denom'])


After reshaping the data to long format with pd.melt, I drop some elements from the data of the previous post so that I have the data shown at the beginning of this post (i.e. incomplete information for Argentina and Chile).

Next, I calculate the norm of each country’s vector: I group by country and then apply the mynorm function.

Then, I merge the norms onto the pairs of countries I want to compare.

Finally, I calculate the denominator of the formula (the product of both norms), apply the mydotprod function to obtain the dot product, divide the dot product by the product of the norms, and subtract this value from 1 to obtain the cosine distance (ranging from 0 to 2).

As a result, we get the following table:

country1, country2, norm_x, norm_y, denom, dotprod, dist
Argentina, Chile, 1.169573, 2.398562, 2.805292, -1.60, 1.570351
Ecuador, Colombia, 2.326414, 1.732859, 4.031346, 2.64, 0.345132

Here you can see that the distance between Ecuador and Colombia is the same as the one we got in the previous post (0.35).
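As a quick sanity check (not part of the original script), you can zero-fill the missing elements and feed full-length vectors to SciPy’s cosine function: zeros add nothing to the dot product or to the norms, so the result should match the table above, up to the rounding of the dot product.

from scipy.spatial.distance import cosine

#Zero-filled vectors: Argentina is missing d1 and d2, Chile is missing d3 and d5
argentina = [0, 0, -0.02, -0.69, -0.83, -0.45]
chile = [1.17, 0.68, 0, 1.43, 0, 1.37]

print(cosine(argentina, chile))  #~1.57, as in the table above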

How To / Python: Calculate Cosine Distance I/II

Suppose we have some multi-dimensional data at the country level and we want to see the extent to which two countries are similar. One way to do this is by calculating the Cosine distance between the countries. Here you can find a Python code to do just that.

In this code, I use the SciPy library to take advantage of the built-in function cosine. This function returns 1 minus the cosine similarity, which means its results range from 0 to 2, while the cosine similarity itself ranges from -1 to 1.
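In other words, for two vectors u and v the function computes

\[ \mathrm{cosine}(u, v) = 1 - \frac{u \cdot v}{\lVert u \rVert \, \lVert v \rVert} \]

so identical vectors get a distance of 0 and opposite vectors a distance of 2.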

import pandas as pd
from scipy.spatial.distance import cosine

datadict = {
'country': ['Argentina', 'Bolivia', 'Brazil', 'Chile', 'Ecuador', 'Colombia', 'Paraguay', 'Peru', 'Venezuela'],
'd1': [0.34, -0.19, 0.37, 1.17, -0.31, -0.3, -0.48, -0.15, -0.61],
'd2': [-0.57, -0.69, -0.28, 0.68, -2.19, -0.83, -0.53, -1, -1.39],
'd3': [-0.02, -0.55, 0.07, 1.2, -0.14, -0.85, -0.9, -0.47, -1.02],
'd4': [-0.69, -0.18, 0.05, 1.43, -0.02, -0.7, -0.72, 0.23, -1.08],
'd5': [-0.83, -0.69, -0.39, 1.31, -0.7, -0.75, -1.04, -0.52, -1.22],
'd6': [-0.45, -0.77, 0.05, 1.37, -0.1, -0.67, -1.4, -0.35, -0.89]}

pairsdict = {
'country1': ['Argentina', 'Venezuela', 'Ecuador', 'Peru'],
'country2': ['Bolivia', 'Chile', 'Colombia', 'Peru']}

df = pd.DataFrame(datadict)
pairs = pd.DataFrame(pairsdict) 

#Add data to the country pairs
pairs = pairs.merge(df, how='left', left_on=['country1'], right_on=['country'])
pairs = pairs.merge(df, how='left', left_on=['country2'], right_on=['country'])

#Convert data columns to list in a single cell
pairs['vector1'] = pairs[['d1_x','d2_x','d3_x','d4_x','d5_x','d6_x']].values.tolist()
pairs['vector2'] = pairs[['d1_y','d2_y','d3_y','d4_y','d5_y','d6_y']].values.tolist()

cosinedf = pairs[['country1', 'country2', 'vector1', 'vector2']].copy()

#Calculate Cosine distance
cosinedf['cosine_dist'] = cosinedf.apply(lambda x: round(cosine(x['vector1'], x['vector2']),2), axis=1)

cosinedf = cosinedf[['country1', 'country2', 'cosine_dist']]
print(cosinedf)

The df dataframe contains 6 variables for each country. The pairs dataframe contains pairs of countries that we want to compare.

We first add the 6 variables (d1 to d6) to each country of the dyad. Then we convert the 6 columns into a single column containing a list with the 6 values. Finally, we apply the cosine function from SciPy to each pair of countries and store the result in a new column called cosine_dist.

As a result, we get the following table:

country1, country2, cosine_dist
Argentina, Bolivia, 0.26
Venezuela, Chile, 1.93
Ecuador, Colombia, 0.35
Peru, Peru, 0.00

This piece of code works well when you already have vectors of the same length, indexed in the same way. However, when you have vectors with many elements equal to zero, you might have the data in a compressed format. In those cases, we need to take a longer route to calculate the cosine distance. I explain that case in the next post.

How To / Python: Combine multiple CSV files into one

If you have multiple CSV files with the same structure, you can append or combine them using a short Python script. Suppose you have several files whose names start with datayear, for instance datayear1980.csv, datayear1981.csv, and datayear1982.csv.

import pandas as pd
import glob, os

os.chdir("C:/Folder")
frames = []

#The glob pattern specifies the structure of the file names
for file in glob.glob("datayear*"):
    #usecols selects which columns to keep from each file
    namedf = pd.read_csv(file, skiprows=0, usecols=[1, 2, 3])
    frames.append(namedf)

results = pd.concat(frames)
results.to_csv('C:/combinedfile.csv')

The glob pattern specifies the structure of the file names, and the usecols argument of read_csv selects which columns you want to combine.
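If you also need to keep track of which year each row came from, a hypothetical extension of the loop could parse the year out of the file name before appending each chunk (the replace calls assume the exact datayearYYYY.csv naming scheme):

for file in glob.glob("datayear*"):
    namedf = pd.read_csv(file, skiprows=0, usecols=[1, 2, 3])
    #Parse the year from a name like 'datayear1980.csv' and tag each row with it
    namedf['year'] = int(file.replace('datayear', '').replace('.csv', ''))
    frames.append(namedf)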

How To / Python: Calculate Mahalanobis Distance

Suppose we have some multi-dimensional data at the country level and we want to see the extent to which two countries are similar. One way to do this is by calculating the Mahalanobis distance between the countries. Here you can find a Python code to do just that.

In this code, I use the SciPy library to take advantage of the built-in function mahalanobis.
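For reference, the Mahalanobis distance between two vectors u and v, given a covariance matrix S, is

\[ d(u, v) = \sqrt{(u - v)^{\mathsf{T}} S^{-1} (u - v)} \]

which is why the code below first inverts the covariance matrix.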


import pandas as pd
from scipy.linalg import inv
from scipy.spatial.distance import mahalanobis

datadict = {
'country': ['Argentina', 'Bolivia', 'Brazil', 'Chile', 'Ecuador', 'Colombia', 'Paraguay', 'Peru', 'Venezuela'],
'd1': [0.34, -0.19, 0.37, 1.17, -0.31, -0.3, -0.48, -0.15, -0.61],
'd2': [-0.57, -0.69, -0.28, 0.68, -2.19, -0.83, -0.53, -1, -1.39],
'd3': [-0.02, -0.55, 0.07, 1.2, -0.14, -0.85, -0.9, -0.47, -1.02],
'd4': [-0.69, -0.18, 0.05, 1.43, -0.02, -0.7, -0.72, 0.23, -1.08],
'd5': [-0.83, -0.69, -0.39, 1.31, -0.7, -0.75, -1.04, -0.52, -1.22],
'd6': [-0.45, -0.77, 0.05, 1.37, -0.1, -0.67, -1.4, -0.35, -0.89]}

pairsdict = {
'country1': ['Argentina', 'Chile', 'Ecuador', 'Peru'],
'country2': ['Bolivia', 'Venezuela', 'Colombia', 'Peru']}

#DataFrame that contains the data for each country
df = pd.DataFrame(datadict)

#DataFrame that contains the pairs for which we calculate the Mahalanobis distance
pairs = pd.DataFrame(pairsdict)

#Add data to the country pairs
pairs = pairs.merge(df, how='left', left_on=['country1'], right_on=['country'])
pairs = pairs.merge(df, how='left', left_on=['country2'], right_on=['country'])

#Convert the data columns to a list in a single cell
pairs['vector1'] = pairs[['d1_x','d2_x','d3_x','d4_x','d5_x','d6_x']].values.tolist()
pairs['vector2'] = pairs[['d1_y','d2_y','d3_y','d4_y','d5_y','d6_y']].values.tolist()

mahala = pairs[['country1', 'country2', 'vector1', 'vector2']].copy()

#Calculate the covariance matrix of the numeric columns and invert it
covmx = df.drop(columns=['country']).cov()
invcovmx = inv(covmx)

#Calculate the Mahalanobis distance for each pair
mahala['mahala_dist'] = mahala.apply(lambda x: mahalanobis(x['vector1'], x['vector2'], invcovmx), axis=1)

mahala = mahala[['country1', 'country2', 'mahala_dist']]

The df dataframe contains 6 variables for each country. The pairs dataframe contains pairs of countries that we want to compare.

We first add the 6 variables (d1 to d6) to each country of the dyad. Then we convert the 6 columns into a single column containing a list with the 6 values. We also calculate the inverse of the covariance matrix, which is required to calculate the Mahalanobis distance. Finally, we apply the mahalanobis function from SciPy to each pair of countries and store the result in a new column called mahala_dist.

As a result, we get the following table:

country1, country2, mahala_dist
Argentina, Bolivia, 3.003186
Chile, Venezuela, 3.829020
Ecuador, Colombia, 3.915868
Peru, Peru, 0.000000


What Can We Learn about Writing from Flaubert’s Letters?

In short? That even talented writers suffer. A lot.

When we try to learn to write well, we usually reach for Stephen King’s On Writing and a few other books of that kind. A less straightforward approach is to look at the notes or letters of great writers. The Peruvian Nobel laureate Mario Vargas Llosa is a loyal follower of Gustave Flaubert, the French writer known as a precursor of the modern novel. Vargas Llosa has said that after reading Flaubert’s letters he realized that with work, insistence, and perseverance one can make up for a lack of talent.

Below I share some of the most interesting passages that I found in Flaubert’s correspondence (Steegmuller, 1980) from the time he was writing his masterpiece, Madame Bovary. Most of these passages come from letters addressed to his friend and lover Louise Colet, with whom Flaubert had a very strange relationship (but that is another story, ‘harina de otro costal’). We usually assume that sentences flow out of talented writers effortlessly. Reading Flaubert’s letters, however, we can see how one of the best writers struggled when trying to produce perfection on the page.
