I have a large-ish CSV file with a substantial amount of dirty data in it. I'd like to clean it up a bit by eliminating all the columns that are not absolutely necessary.
Here is the file I'm talking about.
It has these columns:

Website,Title,Start Date,Employer,Location,lat,lon,Country,Skills11,Jobs
but I would like to obliterate all but:

Employer,Location,Country,Jobs
Is there a particular tool that is uniquely suited for this task?
Or maybe someone has a handy Python script that can get the job done?
You can do this easily in Python by writing to a temporary file and then replacing the original:
import csv
from operator import itemgetter
from tempfile import NamedTemporaryFile
from shutil import move

with open("edsa_data.csv", newline="") as f, NamedTemporaryFile("w", dir=".", delete=False, newline="") as tmp:
    # Keep Employer, Location, Country, Jobs -> columns 3, 4, 7, 9.
    # On Python 2, use itertools.imap instead of map.
    csv.writer(tmp).writerows(map(itemgetter(3, 4, 7, 9), csv.reader(f)))
move(tmp.name, "edsa_data.csv")
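
itemgetter simply picks the given positions out of each row, so you can sanity-check the indices against your header:

from operator import itemgetter

header = ["Website", "Title", "Start Date", "Employer", "Location",
          "lat", "lon", "Country", "Skills11", "Jobs"]
print(itemgetter(3, 4, 7, 9)(header))
# -> ('Employer', 'Location', 'Country', 'Jobs')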
For a more generic approach:
import csv
from operator import itemgetter
from tempfile import NamedTemporaryFile
from shutil import move

def keep_columns(csv_f, keep_cols, **kwargs):
    # Rewrite csv_f in place, keeping only the columns at the given indices.
    # Note: keep_cols needs at least two indices, since itemgetter with a
    # single argument returns a scalar rather than a tuple.
    with open(csv_f, newline="") as f, NamedTemporaryFile("w", dir=".", delete=False, newline="") as tmp:
        csv.writer(tmp, **kwargs).writerows(itemgetter(*keep_cols)(row)
                                            for row in csv.reader(f, **kwargs))
    move(tmp.name, csv_f)

keep_columns("edsa_data.csv", (3, 4, 7, 9))
For the kwargs you can pass csv dialect options like delimiter="," or skipinitialspace=True.
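
If you'd rather name the columns to keep than count indices, here's a minimal sketch along the same lines (keep_columns_by_name is just an illustrative name; it resolves the indices from the header row):

import csv
from operator import itemgetter
from tempfile import NamedTemporaryFile
from shutil import move

def keep_columns_by_name(csv_f, keep_names, **kwargs):
    # Resolve column indices from the header row, then keep only those columns.
    with open(csv_f, newline="") as f, NamedTemporaryFile("w", dir=".", delete=False, newline="") as tmp:
        reader = csv.reader(f, **kwargs)
        header = next(reader)
        getter = itemgetter(*(header.index(name) for name in keep_names))
        writer = csv.writer(tmp, **kwargs)
        writer.writerow(getter(header))  # keep the header row itself too
        writer.writerows(getter(row) for row in reader)
    move(tmp.name, csv_f)

keep_columns_by_name("edsa_data.csv", ("Employer", "Location", "Country", "Jobs"))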