My dataset looks like this:
dt = [{author: ...., text: ....},...,{author: ...., text: ....}]
and I want to split the texts into chunks and then produce a dataframe with the following form:
df = chunk1 of text1    author of chunk1
     ...............    ................
     etc.
I can produce the chunks with this function:
textwrap.wrap(text, width=200, break_long_words=False)
and then convert the data to a dataframe with:
# Convert the list of dictionaries to dataframe
df = pd.DataFrame.from_dict(dt)
but I don't know how to match each chunk with its author. I would be grateful if you could help me!
I think you can use a nested list comprehension: the outer loop iterates over the rows in the original dt,
and the inner loop iterates over the list of chunks that textwrap.wrap produces for that row's text,
creating a new dictionary for each chunk together with its author. Does the code below give you the expected output?
import pandas as pd
import textwrap
# Sample data
dt = [{'author': 'A', 'text': 'Hello world!'}, {'author': 'B', 'text': '!dlrow olleH'}]
# Sample width
width = 6
# One dictionary per chunk, each carrying the author of the text it came from
new_dt = [
    {'chunk': chunk, 'author': row['author']}
    for row in dt
    for chunk in textwrap.wrap(row['text'], width=width, break_long_words=False)
]
df = pd.DataFrame.from_dict(new_dt)
print(df)
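If you would rather do the work inside pandas, a possible alternative is to store the list of chunks in a column and then call DataFrame.explode, which produces one row per list element and repeats the author for each. This is only a sketch on the same sample data; the name chunks_df is made up for illustration, and ignore_index=True assumes pandas 1.1 or newer.

```python
import textwrap

import pandas as pd

# Same sample data and width as above
dt = [{'author': 'A', 'text': 'Hello world!'}, {'author': 'B', 'text': '!dlrow olleH'}]
width = 6

df = pd.DataFrame(dt)

# Each cell of 'chunk' holds the list of chunks for that row's text
df['chunk'] = df['text'].apply(
    lambda t: textwrap.wrap(t, width=width, break_long_words=False)
)

# explode() turns every list element into its own row and repeats 'author'
# (chunks_df is a hypothetical name; ignore_index needs pandas >= 1.1)
chunks_df = df.explode('chunk', ignore_index=True)[['chunk', 'author']]
print(chunks_df)
```

One difference to keep in mind: textwrap.wrap returns an empty list for an empty text, so the list comprehension produces no row for it, while explode keeps that row with NaN in the chunk column.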