Search code examples
pandasconcatenation

What will be the better way of writing the same conditions for multiple datasets to concatenate?


Below is the code and im just showing for 2015 to 2016 as an example. Im actually writing the same code for 2015-2019 dataset. This works for me but as you can see, its long winded as i have to keep repeating the codes(till 2019 dataset).

Question: 1. What will be the better and efficient way of writing this?

df2015 = pd.read_csv('EPL_20152016.csv', parse_dates=['Date'], dayfirst=True,
         usecols=['Date', 'HomeTeam', 'AwayTeam', 'FTHG', 'FTAG', 'FTR', 'BbAv>2.5','BbAv<2.5'])
df2015.rename(columns={'BbAv>2.5': 'Avg>2.5', 'BbAv<2.5': 'Avg<2.5'},inplace=True)
df2015['FTTG'] = df2015['FTHG'] + df2015['FTAG']
df2015['%Avg>2.5'] = 100* (1 / df2015['Avg>2.5'])
df2015['%Avg<2.5'] = 100* (1 / df2015['Avg<2.5'])
df2015['%TotalAvg><2.5'] = df2015['%Avg>2.5'] + df2015['%Avg<2.5']
df2015['%Vig><2.5'] = df2015['%TotalAvg><2.5'] - 100
#df2015 = df2015[['Date', 'HomeTeam', 'AwayTeam', 'FTHG', 'FTAG', 'FTTG', 'FTR','Avg>2.5','Avg<2.5',
df2015 = df2015.reindex(columns=['Date', 'HomeTeam', 'AwayTeam', 'FTHG', 'FTAG', 'FTTG', 'FTR','Avg>2.5','Avg<2.5','%Avg>2.5', '%Avg<2.5', '%TotalAvg><2.5', '%Vig><2.5'])


df2016 = pd.read_csv('EPL_20162017.csv', parse_dates=['Date'], dayfirst=True,
         usecols=['Date', 'HomeTeam', 'AwayTeam', 'FTHG', 'FTAG', 'FTR', 'BbAv>2.5','BbAv<2.5'])
df2016.rename(columns={'BbAv>2.5': 'Avg>2.5', 'BbAv<2.5': 'Avg<2.5'},inplace=True)
df2016['FTTG'] = df2016['FTHG'] + df2016['FTAG']
df2016['%Avg>2.5'] = 100* (1 / df2016['Avg>2.5'])
df2016['%Avg<2.5'] = 100* (1 / df2016['Avg<2.5'])
df2016['%TotalAvg><2.5'] = df2016['%Avg>2.5'] + df2016['%Avg<2.5']
df2016['%Vig><2.5'] = df2016['%TotalAvg><2.5'] - 100
df2016 = df2016.reindex(columns=['Date', 'HomeTeam', 'AwayTeam', 'FTHG', 'FTAG', 'FTTG', 'FTR','Avg>2.5','Avg<2.5','%Avg>2.5', '%Avg<2.5', '%TotalAvg><2.5', '%Vig><2.5'])

Same code done for 2017,2018 and 2019 datasets. And then i use pd.concat and .set_index

df20152019 = pd.concat([df2015, df2016, df2017, df2018, df2019]).set_index('Date')

Solution

  • After reading your files with pd.read_csv() into different dataframes, you can try adding them in a list. Then have a for loop that loops over each dataframe in that list to perform the desired operations. For example, you could try something like this :

    df_list = [df2015, df2016, df2017, df2018, df2019]
    
    for df in df_list:
        df.rename(columns={'BbAv>2.5': 'Avg>2.5', 'BbAv<2.5': 'Avg<2.5'},inplace=True)
        df['FTTG'] = df['FTHG'] + df['FTAG']
        df['%Avg>2.5'] = 100* (1 / df['Avg>2.5'])
        df['%Avg<2.5'] = 100* (1 / df['Avg<2.5'])
        df['%TotalAvg><2.5'] = df['%Avg>2.5'] + df['%Avg<2.5']
        df['%Vig><2.5'] = df['%TotalAvg><2.5'] - 100
        df = df.reindex(columns=['Date', 'HomeTeam', 'AwayTeam', 'FTHG', 'FTAG','FTTG', 'FTR','Avg>2.5','Avg<2.5','%Avg>2.5', '%Avg<2.5', '%TotalAvg><2.5', '%Vig><2.5'])