Get the pandas dataframe in chunks without repetition?


I've looked at a related StackOverflow answer and several others, but I can't figure out exactly where my mistake is. I want to split a pandas dataframe into chunks and send it piece by piece, as JSON, to an API endpoint, without sending the same row more than once. My question is in Step 4 of the process below.

Reproducible example

Step 1: Dataframe Creation

# Dataframe Creation

import numpy as np
import pandas as pd

filenames = ["file_"+str(x) for x in np.arange(1, 11)]
languages = ['en', 'en', 'fr', 'en', 'en', 'en', 'es', 'en', 'fr', 'en']

test_df = pd.DataFrame({'file': filenames, 'lang': languages})

Step 1 Output

      file lang
0   file_1   en
1   file_2   en
2   file_3   fr
3   file_4   en
4   file_5   en
5   file_6   en
6   file_7   es
7   file_8   en
8   file_9   fr
9  file_10   en

Step 2 - two functions

def get_chunk_df(large_df, splits):
    """splits df into chunks"""
    for chunk_df in np.array_split(large_df, splits):
        yield chunk_df


def get_json_chunks(df, splits):
    """converts each chunk to a dict which is basically going to be a JSON load"""
    documents = {"documents": []}
    df_chunks = get_chunk_df(df, splits)
    for chunk_df in df_chunks:
        for idx, row in chunk_df.iterrows():
            documents["documents"].append({
                "id": str(idx + 1),
                "text": row["lang"]
            })
        yield documents

Step 3 - Testing the output of get_chunk_df function - which is OK

chunk_gen = get_chunk_df(test_df, 3)
counter = 0
for chk in chunk_gen:
    counter = counter + 1
    print(f"***********PRINTING {counter} CHUNK...")
    print(chk)

Step 3 Output

***********PRINTING 1 CHUNK...
     file lang
0  file_1   en
1  file_2   en
2  file_3   fr
3  file_4   en
***********PRINTING 2 CHUNK...
     file lang
4  file_5   en
5  file_6   en
6  file_7   es
***********PRINTING 3 CHUNK...
      file lang
7   file_8   en
8   file_9   fr
9  file_10   en

Step 4 - My problem is here

json_chunks = get_json_chunks(test_df, 3)

for json_chk in json_chunks:
    print(f"First row: {json_chk['documents'][0]}")
    print(f"Last row: {json_chk['documents'][-1]}")

Step 4 Output

First row: {'id': '1', 'text': 'en'}
Last row: {'id': '4', 'text': 'en'}
First row: {'id': '1', 'text': 'en'}
Last row: {'id': '7', 'text': 'es'}
First row: {'id': '1', 'text': 'en'}
Last row: {'id': '10', 'text': 'en'}

But I want the Expected Output to be:

First row: {'id': '1', 'text': 'en'}
Last row: {'id': '4', 'text': 'en'}
First row: {'id': '5', 'text': 'en'}
Last row: {'id': '7', 'text': 'es'}
First row: {'id': '8', 'text': 'en'}
Last row: {'id': '10', 'text': 'en'}

Thanks!


Solution

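  • You create documents = {"documents": []} once, before the for-loop, so every chunk's rows are appended to the same dict and each yield hands back that one growing object. That is why every "First row" is id 1. Create a new documents dict inside the for-loop, once per chunk, instead: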

    def get_json_chunks(df, splits):
        """converts each chunk to a dict which is basically going to be a JSON load"""
        
        #documents = {"documents": []}  # <-- wrong place
        df_chunks = get_chunk_df(df, splits)
        
        for chunk_df in df_chunks:
    
            documents = {"documents": []}  # <-- good place
    
            for idx, row in chunk_df.iterrows():
                documents["documents"].append({
                    "id": str(idx + 1),
                    "text": row["lang"]
                })
    
            yield documents
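
For a quick check that no row is sent twice, here is a minimal self-contained sketch of the same fix. The names iter_chunks and iter_json_chunks are hypothetical, not from the question. As an assumption on my part, it splits the row positions with np.array_split and selects rows with iloc rather than splitting the dataframe directly, and it builds each payload with a list comprehension so a fresh dict is created for every chunk.

```python
import numpy as np
import pandas as pd


def iter_chunks(df, splits):
    """Yield consecutive, non-overlapping row slices of df (hypothetical helper)."""
    # Split the row positions (a plain ndarray), then select those rows.
    for positions in np.array_split(np.arange(len(df)), splits):
        yield df.iloc[positions]


def iter_json_chunks(df, splits):
    """Yield one freshly built payload dict per chunk (hypothetical helper)."""
    for chunk_df in iter_chunks(df, splits):
        # A new dict is built on every iteration, so chunks never accumulate.
        yield {
            "documents": [
                {"id": str(idx + 1), "text": lang}
                for idx, lang in chunk_df["lang"].items()
            ]
        }


# Same data as Step 1 of the question.
filenames = ["file_" + str(x) for x in np.arange(1, 11)]
languages = ['en', 'en', 'fr', 'en', 'en', 'en', 'es', 'en', 'fr', 'en']
test_df = pd.DataFrame({'file': filenames, 'lang': languages})

payloads = list(iter_json_chunks(test_df, 3))

# Flatten every id that would be sent, to check for repeats.
sent_ids = [doc["id"] for p in payloads for doc in p["documents"]]

for p in payloads:
    print(f"First row: {p['documents'][0]}")
    print(f"Last row: {p['documents'][-1]}")
```

With the question's data this prints the first and last rows listed under the expected output, and sent_ids contains each id from '1' to '10' exactly once.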