python-3.x pandas data-cleaning

Pandas: Reorganization of a DataFrame


I'm looking for a way to clean the following data:

[image: input table]

I would like to output something like this:

[image: desired output table]

with the tokenized words in the first column and their associated labels in the other.

Is there a particular strategy with Pandas and NLTK to obtain this type of output in one go?

Thank you in advance for your help or advice.


Solution

  • Given the first table, it's simply a matter of splitting the first column on whitespace and repeating the second column's label for each resulting token:

    import pandas as pd
    
    data = [['foo bar', 'O'], ['George B', 'PERSON'], ['President', 'TITLE']]
    df1 = pd.DataFrame(data, columns=['col1', 'col2'])
    
    print(df1)
    
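    # For each row, build a Series whose index holds the tokens of col1
    # and whose values all repeat that row's col2 label, then stack them.
    df2 = pd.concat([pd.Series(row['col2'], index=row['col1'].split(' '))
                     for _, row in df1.iterrows()]).reset_index()
    # reset_index() names the token column 'index' and the label column 0.
    df2 = df2.rename(columns={'index': 'col1', 0: 'col2'})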
    print(df2)
    

    The output:

            col1    col2
    0    foo bar       O
    1   George B  PERSON
    2  President   TITLE
    
            col1    col2
    0        foo       O
    1        bar       O
    2     George  PERSON
    3          B  PERSON
    4  President   TITLE
    

    As for splitting the first column, look at the Series.str.split method, which supports regular expressions and should let you handle the delimiters of various languages: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.str.split.html

    If the first table is not given, there is no way to do this in one go with pandas alone, since pandas has no built-in NLP capabilities.
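
For the splitting step, here is a minimal sketch of a vectorized alternative to the `iterrows()` loop, assuming the first table is available as `df1` with the same illustrative column names `col1` and `col2` used above. It combines `Series.str.split` with a regular expression and `DataFrame.explode`; the whitespace pattern `\s+` is only an assumption and would need adjusting for other delimiters.

```python
import pandas as pd

# Same illustrative data as above.
data = [['foo bar', 'O'], ['George B', 'PERSON'], ['President', 'TITLE']]
df1 = pd.DataFrame(data, columns=['col1', 'col2'])

# Split on runs of whitespace; a multi-character pattern is treated as a
# regular expression, and each cell becomes a list of tokens.
tokens = df1['col1'].str.split(r'\s+')

# explode() gives every token its own row and repeats col2 alongside it.
df2 = df1.assign(col1=tokens).explode('col1').reset_index(drop=True)
print(df2)
```

This should produce the same token/label pairs as the `pd.concat` version, without building one Series per row.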