
NLP: split dictionary and then transform it into dataframe


My dataset looks like this

dt = [{'author': ..., 'text': ...}, ..., {'author': ..., 'text': ...}]

and I want to split the texts into chunks and then produce a dataframe with the following form:

df = chunk1 of text1    author of text1
     chunk2 of text1    author of text1
     ...............    ...............

I can produce chunks by using this function

textwrap.wrap(text, width = 200, break_long_words=False)
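To make the chunking step concrete, here is a small sketch of what `textwrap.wrap` returns (the sample sentence and the narrower `width=30` are just illustrative choices, not values from the question):

```python
import textwrap

# Hypothetical sample text, wrapped at 30 characters per chunk
text = "Natural language processing splits long documents into smaller pieces."
chunks = textwrap.wrap(text, width=30, break_long_words=False)
print(chunks)
# ['Natural language processing', 'splits long documents into', 'smaller pieces.']
```

Note that `wrap` returns a plain list of strings, one per chunk, which is why the author information has to be re-attached separately.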

and then transform it as a dataframe by using

# Convert the list of dictionaries to dataframe
df = pd.DataFrame.from_dict(dt)

but I don't know how to match each chunk with its author. I would be grateful if you could help me!


Solution

  • I think you can use a nested list comprehension: iterate through the rows of the original dt, and for each row iterate through the list of chunks returned by textwrap.wrap, creating a new dictionary for each chunk with its associated author. Does the code below give you the expected output?

    import pandas as pd
    import textwrap
    
    # Sample data
    dt = [{'author': 'A', 'text': 'Hello world!'}, {'author': 'B', 'text': '!dlrow olleH'}]
    
    # Sample width
    width=6
    
    new_dt = [
        {'chunk': chunk, 'author': row['author']}
        for row in dt
        for chunk in textwrap.wrap(row['text'], width=width, break_long_words=False)
    ]
    
    df = pd.DataFrame.from_dict(new_dt)
    print(df)
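An equivalent pandas-native sketch (not part of the original answer) wraps each text into a list of chunks and then uses `DataFrame.explode` to produce one row per chunk, which keeps the author alignment automatic:

```python
import pandas as pd
import textwrap

# Same sample data and width as in the answer above
dt = [{'author': 'A', 'text': 'Hello world!'}, {'author': 'B', 'text': '!dlrow olleH'}]
width = 6

df = pd.DataFrame(dt)
# Each row's text becomes a list of chunks...
df['chunk'] = df['text'].apply(
    lambda t: textwrap.wrap(t, width=width, break_long_words=False)
)
# ...and explode turns each list element into its own row, repeating the author
df = df.explode('chunk', ignore_index=True)[['chunk', 'author']]
print(df)
```

Both approaches produce the same rows; the list comprehension is arguably clearer for small inputs, while `explode` scales naturally if you later add more per-row columns to carry along.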