Search code examples
pythonpandascsvbigdatadata-warehouse

How to use with open to filter datafiles in python and create new file?


i have huge csv and i tried to filter data using with open.

I know i can use FINDSTR on command line but i would like to use python to create a new file filtered or i would like to create a pandas dataframe as output.

here is my code:

outfile = open('my_file2.csv', 'a')
with open('my_file1.csv', 'r') as f:
 for lines in f:
         if '31/10/2018' in lines:
            print(lines)  
         outfile.write(lines)

The problem is that the output file generated is = input file and there is no filter(and the size of file is the same)

Thanks to all


Solution

  • The problem with your code is the indentation of the last line. It should be within the if-statement, so only lines that contain '31/10/2018' get written.

    outfile = open('my_file2.csv', 'a')
    with open('my_file1.csv', 'r') as f:
     for lines in f:
             if '31/10/2018' in lines:
                print(lines)  
                outfile.write(lines)
    

    To filter using Pandas and creating a DataFrame, do something along the lines of:

    import pandas as pd
    import datetime
    
    # I assume here that the date is in a seperate column, named 'Date'
    df = pd.read_csv('my_file1.csv', parse_dates=['Date']) 
    
    # Filter on October 31st 2018
    df_filter = df[df['Date'].dt.date == datetime.date(2018, 10, 31)]
    
    # Output to csv
    df_filter.to_csv('my_file2.csv', index=False)
    

    (For very large csv's, look at the pd.read_csv() argument 'chunksize')

    To use with open(....) as f:, you could do something like:

    import pandas as pd
    
    filtered_list = []
    with open('my_file1.csv', 'r') as f:
        for lines in f:
            if '31/10/2018' in lines:
                print(lines)
                # Split line by comma into list
                line_data = lines.split(',')
                filtered_list.append(line_data)
    
    # Convert to dataframe and export as csv
    df = pd.DataFrame(filtered_list)
    df_filter.to_csv('my_file2.csv', index=False)