Pandas: Reorganization of a DataFrame

I'm looking for a way to clean the following data:

I would like to output something like this:

with the tokenized words in the first column and their associated labels in the second.

Is there a particular strategy with Pandas and NLTK to obtain this type of output in one go?

Thank you in advance for your help or advice

Solution

Given the first table, it's simply a matter of splitting the first column into tokens and repeating the second column's value for each token:

import pandas as pd

data = [['foo bar', 'O'], ['George B', 'PERSON'], ['President', 'TITLE']]
df1 = pd.DataFrame(data, columns=['col1', 'col2'])

print(df1)

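# Build one Series per row: the tokens of col1 become the index and the
# col2 label is repeated as the value for every token.
df2 = pd.concat([pd.Series(row['col2'], index=row['col1'].split(' '))
                 for _, row in df1.iterrows()]).reset_index()
# reset_index() names the new columns 'index' and 0; restore the original names.
df2 = df2.rename(columns={'index': 'col1', 0: 'col2'})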
print(df2)

The output:

        col1    col2
0    foo bar       O
1   George B  PERSON
2  President   TITLE

        col1    col2
0        foo       O
1        bar       O
2     George  PERSON
3          B  PERSON
4  President   TITLE
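
For larger frames, the same split-and-repeat idea can probably be written without iterating over rows. The following is only a sketch of one alternative, assuming a pandas version that provides DataFrame.explode (0.25 or later) and the same col1/col2 column names as above:

```python
import pandas as pd

df1 = pd.DataFrame(
    [['foo bar', 'O'], ['George B', 'PERSON'], ['President', 'TITLE']],
    columns=['col1', 'col2'],
)

# Turn each col1 cell into a list of whitespace-separated tokens, then give
# every token its own row. explode() repeats the col2 value for each token.
df2 = (
    df1.assign(col1=df1['col1'].str.split())
       .explode('col1')
       .reset_index(drop=True)
)
print(df2)
```

This should give the same five rows as the loop-based version, with a fresh 0..4 index.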

As for splitting the first column, look at the Series.str.split method, which accepts a regular expression as the separator; that should let you handle the delimiters used in different languages: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.str.split.html
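
As a rough illustration of a regular-expression separator, here is a minimal sketch. The sample strings and the chosen delimiter set (whitespace, commas and hyphens) are assumptions made up for this example, not taken from the question's data:

```python
import pandas as pd

# Hypothetical input with mixed delimiters.
s = pd.Series(['foo-bar baz', 'George,B', 'President'])

# A multi-character pattern is treated as a regular expression by default:
# split on any run of whitespace, commas or hyphens.
tokens = s.str.split(r'[\s,\-]+')
print(tokens.tolist())
```

Each element of the result is a list of tokens, which can then be expanded to one row per token. Note that a string starting or ending with a delimiter would yield empty tokens, so some filtering may be needed on real data.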

If the first table is not given, there is no way to do this in one go with pandas alone, since pandas has no built-in NLP capabilities.