Search code examples
pythoncsvpandasextractisnullorempty

How to delete entire row of data set given a condition on a column in csv file?


Here is a snippet of the following data-set in csv format:

quantity revenue     time_x    transaction_id   user_id
    1       0       57:57.0 0        0             0
    1       0       18:59.0 0        1

I want to delete the entire row when the user_id is empty. How do I do this in python? So far, here's my code:

activity = pd.read_csv("activity(delimited).csv", delimiter=';', error_bad_lines=False, dtype=object)
impression = pd.read_csv("impression(delimited).csv", delimiter=';', error_bad_lines=False, dtype=object)
click = pd.read_csv("click(delimited).csv", delimiter=';', error_bad_lines=False, dtype=object)

pre_merge = activity.merge(impression, on="user_id", how="outer")
merged = pre_merge.merge(click, on="user_id", how="outer")
merged.to_csv("merged.csv", index=False)

open_merged = pd.read_csv("merged.csv", delimiter=',', error_bad_lines= False, dtype=object)

filtered_merged = open_merged.dropna(axis='columns', how='all')

Also, how can I write the code in an efficient manner?


Solution

  • With Pandas:

    import pandas as pd
    
    df = pd.read_csv("path/to/csv/data.csv", delimiter=';', error_bad_lines=False)
    df = df[pd.notnull(df.user_id)] # boolean indexing
    
    # Shift user_id to first column
    df = df.set_index("user_id")
    df = df.reset_index()
    
    df.to_csv("path/to/csv/data.csv", index=False)
    

    The bracket notation allows you provide an iterable of boolean values. This is called boolean indexing. Similar concepts and syntax are used in numpy, matlab and R