i have huge csv and i tried to filter data using with open.
I know i can use FINDSTR on command line but i would like to use python to create a new file filtered or i would like to create a pandas dataframe as output.
here is my code:
outfile = open('my_file2.csv', 'a')
with open('my_file1.csv', 'r') as f:
for lines in f:
if '31/10/2018' in lines:
print(lines)
outfile.write(lines)
The problem is that the output file generated is = input file and there is no filter(and the size of file is the same)
Thanks to all
The problem with your code is the indentation of the last line. It should be within the if-statement, so only lines that contain '31/10/2018'
get written.
outfile = open('my_file2.csv', 'a')
with open('my_file1.csv', 'r') as f:
for lines in f:
if '31/10/2018' in lines:
print(lines)
outfile.write(lines)
To filter using Pandas and creating a DataFrame, do something along the lines of:
import pandas as pd
import datetime
# I assume here that the date is in a seperate column, named 'Date'
df = pd.read_csv('my_file1.csv', parse_dates=['Date'])
# Filter on October 31st 2018
df_filter = df[df['Date'].dt.date == datetime.date(2018, 10, 31)]
# Output to csv
df_filter.to_csv('my_file2.csv', index=False)
(For very large csv's, look at the pd.read_csv()
argument 'chunksize')
To use with open(....) as f:
, you could do something like:
import pandas as pd
filtered_list = []
with open('my_file1.csv', 'r') as f:
for lines in f:
if '31/10/2018' in lines:
print(lines)
# Split line by comma into list
line_data = lines.split(',')
filtered_list.append(line_data)
# Convert to dataframe and export as csv
df = pd.DataFrame(filtered_list)
df_filter.to_csv('my_file2.csv', index=False)