Search code examples
pythonpandastext-miningsentence-synthesis

How do I split text with multiple sentences in a column into multiple rows in Python pandas?


I am trying to split Comments column into multiple rows containing each sentence. I used the following StackOverflow thread for my reference as it tends to give similar result. Reference Link: pandas: How do I split text in a column into multiple rows? Sample data of dataframe is as below.

Id Team Food_Text 1 X Food is good. It is very well cooked. Delicious! 2 X I hate the squid. Food is not cooked well. At all indeed. 3 X Please do not good anytime over here 4 Y I love the fish. Awesome delicacy. 5 Y Good for desserts. Meat tastes bad

Each record for 'Food_Text' can be of multiple sentences delimited by full-stop or period. I have used the following code

import numpy as np
import pandas as pd

survey_data = pd.read_csv("Food_Dummy.csv")
survey_text = survey_data[['Id','Team','Food_Text']]

# Getting s as pandas series which has split on full stop and new sentence a new line         
s = survey_text["Food_Text"].str.split('.').apply(pd.Series,1).stack()
s.index = s.index.droplevel(-1) # to line up with df's index
s.name = 'Food_Text' # needs a name to join

# There are blank or emplty cell values after above process. Removing them
s.replace('', np.nan, inplace=True)
s.dropna(inplace=True)
x=s.to_frame(name='Food_Text1')
x.head(10)

# Joining should ideally get me proper output. But I am getting original dataframe instead of split one.
survey_text.join(x)
survey_text.head(10)

I am not sure why the join is not giving me a proper dataframe with more number of rows. Repetition of other columns based on index of split. So Id=1 has 3 sentences so we should have 3 records with all other data same and Food_Text column with a new sentence from a comment by ID=1. Similarly for other records.

Thank You in advance for your help! Regards, Sohil Shah


Solution

  • In the example that you put in your code, The result of the join was printed, so if you want to change the value of your survey_text, the code should be:

    survey_text = survey_text.join(x)

    or if you wanted to simplify your code, this code below is just fine:

    import numpy as np
    import pandas as pd
    
    survey_data = pd.read_csv("Food_Dummy.csv")
    survey_text = survey_data[['Id','Team','Food_Text']]
    
    # Getting s as pandas series which has split on full stop and new sentence a new line
    s = survey_text["Food_Text"].str.split('.').apply(pd.Series,1).stack()
    s.index = s.index.droplevel(-1) # to line up with df's index
    s.name = 'Food_Text' # needs a name to join
    
    # There are blank or emplty cell values after above process. Removing them
    s.replace('', np.nan, inplace=True)
    s.dropna(inplace=True)
    
    # Joining should ideally get me proper output. But I am getting original dataframe instead of split one.
    del survey_text['Food_Text']
    survey_text = survey_text.join(s)
    survey_text.head(10)
    

    This way you will not have multiple "Food_Text" columns in yout DataFrame.