Python: Joblib for multiprocessing

So I have these given functions:

def make_event_df(match_id, path):
    '''
    Function for making event dataframe.
    
    Argument:
        match_id -- int, the required match id for which event data will be constructed.
        path -- str, path to .json file containing event data.
    
    Returns:
        df -- pandas dataframe, the event dataframe for the particular match.
    '''
    ## read in the json file (the with-block closes the file handle afterwards)
    with open(path, encoding='utf-8') as f:
        event_json = json.load(f)
    
    ## normalize the json data
    df = json_normalize(event_json, sep='_')
    
    return df

def full_season_events(comp_name, match_df, match_ids, path):
    '''
    Function to make event dataframe for a full season.
    
    Arguments:
        comp_name -- str, competition name + season name
        match_df -- pandas dataframe, containing match-data
        match_ids -- list, list of match ids.
        path -- str, path to directory where .json file is listed.
                e.g. '../input/Statsbomb/data/events'
    
    Returns:
        event_df -- pandas dataframe, containing event data for the whole season.
    '''
    ## init an empty dataframe
    event_df = pd.DataFrame()

    for match_id in tqdm(match_ids, desc=f'Making Event Data For {comp_name}'):
        ## .json file
        temp_path = path + f'/{match_id}.json'

        temp_df = make_event_df(match_id, temp_path)
        event_df = pd.concat([event_df, temp_df], sort=True)
        
    return event_df

Now I am running this piece of code to get the dataframe:

comp_id = 11
season_id = 1
path = f'../input/Statsbomb/data/matches/{comp_id}/{season_id}.json'

match_df = get_matches(comp_id, season_id, path)

comp_name = match_df['competition_name'].unique()[0] + '-' + match_df['season_name'].unique()[0]
match_ids = list(match_df['match_id'].unique())
path = f'../input/Statsbomb/data/events'

event_df = full_season_events(comp_name, match_df, match_ids, path)

The above code snippet is giving me this output:

Making Event Data For La Liga-2017/2018: 100%|██████████| 36/36 [00:29<00:00,  1.20it/s]

How can I use multiprocessing to make this faster, i.e. how can I process the match_ids in full_season_events() in parallel so that the data is read from the JSON files more quickly? I am very new to joblib and to multiprocessing in general. Can someone tell me what changes I have to make to these functions to get the required result?

Solution

You don't need joblib here, just plain multiprocessing will do.

- I'm using imap_unordered since it's generally faster than imap or map, at the cost of not retaining order: results are yielded as workers finish, not in the order the jobs were submitted. That doesn't seem to matter here because the per-match dataframes are simply concatenated. Note, though, that sort=True in pd.concat only sorts the columns, not the rows, so sort the final dataframe yourself if you need a particular row order.
- imap_unordered passes a single argument to the worker function, and there's no istarmap_unordered that would unpack parameters for us, so each job is packed into a tuple and unpacked by hand inside make_event_df.
- If you have many match_ids, things can be sped up by passing e.g. chunksize=10 to imap_unordered; each worker process is then fed 10 jobs at a time and returns 10 results at a time. It's faster since less time is spent on process synchronization and serialization, but on the other hand the TQDM progress bar will update less often.
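
The job-tuple unpacking and chunksize points above can be tried in isolation with a minimal, stdlib-only sketch. The names build_jobs and count_events and the temporary JSON files are made up for illustration; the worker just counts the events per file instead of building a dataframe.

```python
import json
import multiprocessing
import os
import tempfile


def build_jobs(match_ids, directory):
    """Pair each match id with the path of its JSON file (hypothetical helper)."""
    return [(match_id, os.path.join(directory, f"{match_id}.json")) for match_id in match_ids]


def count_events(job):
    """Worker: unpack the job tuple by hand and return (match_id, number of events)."""
    match_id, path = job
    with open(path, encoding="utf-8") as f:
        return match_id, len(json.load(f))


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as directory:
        # Fake data: the file for match N holds N events
        match_ids = list(range(1, 41))
        for match_id in match_ids:
            with open(os.path.join(directory, f"{match_id}.json"), "w", encoding="utf-8") as f:
                json.dump([{"id": i} for i in range(match_id)], f)

        jobs = build_jobs(match_ids, directory)
        with multiprocessing.Pool(processes=4) as pool:
            # chunksize=10: jobs are handed to the workers in batches of 10
            results = dict(pool.imap_unordered(count_events, jobs, chunksize=10))

        # Completion order is arbitrary, so look results up by match id
        print(results[3], len(results))  # prints: 3 40
```

Keying the results by match id is one way to stay independent of the order in which the workers finish.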

As usual, the code below is dry-coded and might not work OOTB.

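import json
import multiprocessing

import pandas as pd
# pandas >= 1.0; on older versions use: from pandas.io.json import json_normalize
from pandas import json_normalize
from tqdm import tqdm


def make_event_df(job):
    # Unpack parameters from job tuple
    match_id, path = job
    # Keep the explicit encoding used in the original function
    with open(path, encoding="utf-8") as f:
        event_json = json.load(f)
    # Return the match id (if required) and the result.
    return (match_id, json_normalize(event_json, sep="_"))


def full_season_events(comp_name, match_df, match_ids, path):
    # Generate job tuples
    jobs = [(match_id, path + f"/{match_id}.json") for match_id in match_ids]
    frames = []

    # With the "spawn" start method (Windows, macOS), call this function
    # from under an `if __name__ == "__main__":` guard.
    with multiprocessing.Pool() as p:
        # Run & get results from multiprocessing generator
        for match_id, temp_df in tqdm(
            p.imap_unordered(make_event_df, jobs),
            total=len(jobs),
            desc=f"Making Event Data For {comp_name}",
        ):
            frames.append(temp_df)

    # Concatenate once at the end; concatenating inside the loop copies
    # the growing dataframe on every iteration.
    return pd.concat(frames, sort=True) if frames else pd.DataFrame()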
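
For comparison, since the question asked about joblib: a rough equivalent with joblib's Parallel and delayed might look like the sketch below. It assumes joblib is installed. load_events, load_season and the temporary files are hypothetical names for illustration, and the worker returns the parsed JSON rather than a dataframe to keep the example small.

```python
import json
import os
import tempfile

from joblib import Parallel, delayed


def load_events(match_id, path):
    """Worker: parse one match file and return (match_id, list of event dicts)."""
    with open(path, encoding="utf-8") as f:
        return match_id, json.load(f)


def load_season(match_ids, directory, n_jobs=-1):
    """Parse every match file in parallel; the list comes back in input order."""
    return Parallel(n_jobs=n_jobs)(
        delayed(load_events)(match_id, os.path.join(directory, f"{match_id}.json"))
        for match_id in match_ids
    )


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as directory:
        # Fake data: the file for match N holds N events
        for match_id in (1, 2, 3):
            with open(os.path.join(directory, f"{match_id}.json"), "w", encoding="utf-8") as f:
                json.dump([{"id": i} for i in range(match_id)], f)

        results = load_season([1, 2, 3], directory, n_jobs=2)
        print([(match_id, len(events)) for match_id, events in results])
        # prints: [(1, 1), (2, 2), (3, 3)]
```

Here delayed(...) takes the arguments separately, so no job tuple is needed, and Parallel returns the results in submission order by default. In the real function the worker would call json_normalize on the parsed data and the resulting dataframes would go through a single pd.concat.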