pandas machine-learning nlp data-science nltk

Pandas: group rows that share a timestamp and merge their strings into one string of unique tokens


I'm working on this dataset.

[Image: dataset]

How do I group this dataset by timestamp and merge the strings in each group into a single string of unique tokens? For example, I would like to end up with:

date string
2011-02-01 15:00:00 Richmond Service Index S&P/CS HPI Composite - 20 s.a. n.s.a Texas Services Sector Outlook TIC Net Long-Term Transactions including Swaps

I don't know which method to use for this. Does anyone know how to solve it?


Solution

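  • Could this help you? It cleans up the leftover `&amp;` entities, joins the strings that share a timestamp, and then drops repeated tokens while keeping their original order.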

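    import pandas as pd
    
    # Decode the HTML entity left in the text ("S&amp;P" becomes "S&P")
    df['event'] = df['event'].str.replace('&amp;', '&', regex=False)
    # Concatenate all event strings that share the same timestamp
    df = df.groupby('date')['event'].apply(lambda x: ' '.join(x)).reset_index()
    # Drop repeated tokens, keeping first-seen order (dicts preserve insertion order in Python 3.7+)
    df['event'] = df['event'].str.split().apply(lambda x: ' '.join(dict.fromkeys(x)))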
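
To try the approach end to end, here is a minimal self-contained sketch. The sample rows are made up for illustration (the real dataset is only shown as an image), the column names `date` and `event` are taken from the answer's code, and `merge_unique_tokens` is a hypothetical helper name, not something from pandas.

```python
import pandas as pd

# Hypothetical sample rows; the real dataset is not reproduced in the question.
df = pd.DataFrame({
    "date": [
        "2011-02-01 15:00:00",
        "2011-02-01 15:00:00",
        "2011-02-01 16:00:00",
    ],
    "event": [
        "Richmond Services Index",
        "Texas Services Sector Outlook",
        "S&amp;P/CS HPI Composite",
    ],
})


def merge_unique_tokens(frame, key="date", text="event"):
    """Join each group's strings and keep the first occurrence of every token."""
    # Decode the HTML entity so "S&amp;P" becomes "S&P".
    cleaned = frame[text].str.replace("&amp;", "&", regex=False)

    def unique_join(strings):
        # Split on whitespace, then deduplicate while preserving order.
        tokens = " ".join(strings).split()
        return " ".join(dict.fromkeys(tokens))

    # Group the cleaned strings by the key column and merge each group.
    return cleaned.groupby(frame[key]).apply(unique_join).reset_index()


result = merge_unique_tokens(df)
print(result)
```

Note that the deduplication works per token, not per event: a word that appears in two events with the same timestamp (such as "Services" in the sample rows) is kept only once, at its first position. If whole event names should stay intact, deduplicating the event strings before joining them may be a better fit.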