I've gone through the following StackOverflow answer along with several others, and I may just be too tired to see where I am making a mistake. I want to split a pandas dataframe into chunks and send it piece by piece as JSON to an API endpoint, without sending the same row more than once. My question is about Step 4 in the process below.
Reproducible example
Step 1: Dataframe Creation
# Dataframe Creation
import numpy as np
import pandas as pd
filenames = ["file_"+str(x) for x in np.arange(1, 11)]
languages = ['en', 'en', 'fr', 'en', 'en', 'en', 'es', 'en', 'fr', 'en']
test_df = pd.DataFrame({'file': filenames, 'lang': languages})
Step 1 Output
file lang
0 file_1 en
1 file_2 en
2 file_3 fr
3 file_4 en
4 file_5 en
5 file_6 en
6 file_7 es
7 file_8 en
8 file_9 fr
9 file_10 en
Step 2 - two functions
def get_chunk_df(large_df, splits):
    """splits df into chunks"""
    for chunk_df in np.array_split(large_df, splits):
        yield chunk_df

def get_json_chunks(df, splits):
    """converts each chunk to a dict which is basically going to be a JSON load"""
    documents = {"documents": []}
    df_chunks = get_chunk_df(df, splits)
    for chunk_df in df_chunks:
        for idx, row in chunk_df.iterrows():
            documents["documents"].append({
                "id": str(idx + 1),
                "text": row["lang"]
            })
        yield documents
Step 3 - Testing the output of get_chunk_df function - which is OK
chunk_gen = get_chunk_df(test_df, 3)
counter = 0
for chk in chunk_gen:
    counter = counter + 1
    print(f"***********PRINTING {counter} CHUNK...")
    print(chk)
Step 3 Output
***********PRINTING 1 CHUNK...
file lang
0 file_1 en
1 file_2 en
2 file_3 fr
3 file_4 en
***********PRINTING 2 CHUNK...
file lang
4 file_5 en
5 file_6 en
6 file_7 es
***********PRINTING 3 CHUNK...
file lang
7 file_8 en
8 file_9 fr
9 file_10 en
Step 4 - My problem is here
json_chunks = get_json_chunks(test_df, 3)
for json_chk in json_chunks:
    print(f"First row: {json_chk['documents'][0]}")
    print(f"Last row: {json_chk['documents'][-1]}")
Step 4 Output
First row: {'id': '1', 'text': 'en'}
Last row: {'id': '4', 'text': 'en'}
First row: {'id': '1', 'text': 'en'}
Last row: {'id': '7', 'text': 'es'}
First row: {'id': '1', 'text': 'en'}
Last row: {'id': '10', 'text': 'en'}
But I want the Expected Output to be:
First row: {'id': '1', 'text': 'en'}
Last row: {'id': '4', 'text': 'en'}
First row: {'id': '5', 'text': 'en'}
Last row: {'id': '7', 'text': 'es'}
First row: {'id': '8', 'text': 'en'}
Last row: {'id': '10', 'text': 'en'}
Thanks!
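You create documents = {"documents": []} once, before the for-loop, and then keep appending to that same documents dict, so every yield returns the same object with all earlier rows still in it. You have to create a new documents dict inside the for-loop, once per chunk.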
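To make the cause easier to see, here is a minimal sketch of the same pattern without pandas. The names shared_payload, fresh_payload and the chunks list are made up for illustration; plain lists of integers stand in for the dataframe chunks.

```python
def shared_payload(chunks):
    """Buggy pattern: one dict is created once and reused for every chunk."""
    payload = {"documents": []}
    for chunk in chunks:
        for item in chunk:
            payload["documents"].append(item)
        yield payload  # always the same object, still growing


def fresh_payload(chunks):
    """Fixed pattern: a new dict is created for each chunk."""
    for chunk in chunks:
        payload = {"documents": []}
        for item in chunk:
            payload["documents"].append(item)
        yield payload


# Hypothetical stand-in for the three dataframe chunks (row ids 1..10).
chunks = [[1, 2, 3, 4], [5, 6, 7], [8, 9, 10]]

shared = list(shared_payload(chunks))
fresh = list(fresh_payload(chunks))

print(shared[0] is shared[1])               # True: every yield returned the same dict
print(len(shared[0]["documents"]))          # 10: that single dict ends up holding all rows
print([p["documents"][0] for p in fresh])   # [1, 5, 8]
```

The first element never changes in the shared version because nothing is ever removed from the list, which matches the repeated 'id': '1' in your Step 4 output. The corrected function that follows applies the same change to your code.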
def get_json_chunks(df, splits):
    """converts each chunk to a dict which is basically going to be a JSON load"""
    #documents = {"documents": []}  # <-- wrong place
    df_chunks = get_chunk_df(df, splits)
    for chunk_df in df_chunks:
        documents = {"documents": []}  # <-- good place
        for idx, row in chunk_df.iterrows():
            documents["documents"].append({
                "id": str(idx + 1),
                "text": row["lang"]
            })
        yield documents
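If you want to check that no row is sent twice, something like the sketch below may help. It is a variant, not your exact code: iter_json_chunks is a name I made up, it splits row positions with np.array_split and slices with iloc instead of splitting the dataframe directly, and it builds each payload with a comprehension so that a new dict is created on every iteration. It assumes the default integer index from your Step 1 dataframe.

```python
import json

import numpy as np
import pandas as pd


def iter_json_chunks(df, splits):
    """Yield one independent payload dict per chunk of df (hypothetical helper)."""
    # Split row positions rather than the DataFrame itself, then slice with iloc.
    for positions in np.array_split(np.arange(len(df)), splits):
        chunk_df = df.iloc[positions]
        # A new dict is built on every iteration, so no rows leak between chunks.
        yield {
            "documents": [
                {"id": str(idx + 1), "text": lang}
                for idx, lang in chunk_df["lang"].items()
            ]
        }


# Same data as Step 1 of the question.
filenames = ["file_" + str(x) for x in range(1, 11)]
languages = ['en', 'en', 'fr', 'en', 'en', 'en', 'es', 'en', 'fr', 'en']
test_df = pd.DataFrame({'file': filenames, 'lang': languages})

payloads = list(iter_json_chunks(test_df, 3))
for payload in payloads:
    print(f"First row: {payload['documents'][0]}")
    print(f"Last row: {payload['documents'][-1]}")

# Every id should appear exactly once across all payloads.
all_ids = [doc["id"] for payload in payloads for doc in payload["documents"]]
print(len(all_ids) == len(set(all_ids)) == len(test_df))

# Each payload can be serialized on its own before being sent.
bodies = [json.dumps(payload) for payload in payloads]
```

Because each payload is a separate dict, collecting them in a list (as done here) is safe, which would not be the case with the original version that kept yielding one shared dict.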