Search code examples
pythonexceldataframedata-manipulation

How can I remove the lines that after the 'events' first equals 1 for each id?


This is my data: enter image description here

The yellow part is the lines before the 'events' first equals 1 for each id,and the green part is the lines after the 'events' first equals 1 for each id. Note:the line which the 'events' equals 1 for each id belongs to yellow part I want the yellow part and the green part will be removed.

The final data maybe like this: enter image description here How can I use Excel or Python to solve this?


Solution

  • Let's start by building your dataframe, df:

    import pandas as pd
    id = [1,1,1,1,1,1,1,2,2,2,2,2]
    date = ['2018-05-15','2018-07-13','2018-08-09','2018-09-01','2018-10-03','2018-11-04','2018-12-01','2014-04-29','2014-05-22','2018-06-24','2014-07-24','2014-09-01']
    events = [0,0,0,1,1,0,1,0,1,0,0,1]
    
    df = pd.DataFrame({'id':id,
                       'date':date,
                       'events': events})
    df['date'] = pd.to_datetime(df['date'])
    print(df)
    

    enter image description here

    Now here is what you need:

    def remove_rows(df):
        df = df.reset_index(drop=True)
        if df[df['events'] == 1].index.empty:
            return df
        return df.loc[:df[df['events'] == 1].index[0]]
    
    result = df.groupby('id').apply(lambda x: remove_rows(x)).reset_index(drop=True)
    print(result)
    

    Which will give you:

    enter image description here