python, csv, string-parsing

Python: Parse file names, split, and strip multiple characters


I have a folder containing images (.jpg), and I need to extract the file names to CSV, split them using '_' into multiple columns (with headers), and strip out multiple characters.

I have partially completed this using the following:

import os, csv

# Write every file name found under 'dirpath' to a one-column CSV.
with open('filepath.csv', 'w') as f:
    writer = csv.writer(f)
    for path, dirs, files in os.walk('dirpath'):
        for item in files:
            writer.writerow([item])

# Re-read that CSV and split each name on '_' into comma-separated columns.
with open('filepath.csv', 'r') as inf:
    with open('outfile.csv', 'w') as outf:
        for line in inf:
            outf.write(','.join(line.split('_')))

Example file name: firstname_lastname_uniqueid_date_latUKN_longUKN_club.jpg. My code above returns firstname, lastname, uniqueid, date, latUKN, longUKN, and club.jpg.

This is the schema I'm looking for, but I'd also like to strip the 'lat' and 'long' prefixes from latUKN and longUKN, and remove the .jpg at the end of the string. I need to remove 'lat' and 'long' because some file names contain an actual latitude/longitude, and the prefixes are carried along in the parsing (e.g. lat12.34, long54.67).

How can I strip out these extra characters and add headers? If there is no latitude or longitude, how can I leave the field empty instead of populating it with 'latUKN' or 'longUKN'? Is it possible to run this over a whole directory and output a single CSV?

Sample Data

John_Doe_2259153_20171102_latUKN_longUKN_club1.jpg
John_Doe_2259153_20171031_lat123.00_long456.00_club1.jpg
Jane_Doe_5964264_20171101_latUKN_longUKN_club2.jpg
Jane_Doe_5964264_20171029_lat789.00_long012.00_club2.jpg
Joe_Smith_1234564_20171001_lat345.00_long678.00_club3.jpg

How data looks with current code:

John|Doe|2259153|20171102|latUKN|longUKN|club1.jpg
John|Doe|2259153|20171031|lat123.00|long456.00|club1.jpg
Jane|Doe|5964264|20171101|latUKN|longUKN|club2.jpg
Jane|Doe|5964264|20171029|lat789.00|long012.00|club2.jpg
Joe|Smith|1234564|20171001|lat345.00|long678.00|club3.jpg

How I want the data to look:

John|Doe|2259153|20171102|UKN|UKN|club1
John|Doe|2259153|20171031|123.00|456.00|club1
Jane|Doe|5964264|20171101|UKN|UKN|club2
Jane|Doe|5964264|20171029|789.00|012.00|club2
Joe|Smith|1234564|20171001|345.00|678.00|club3

Solution

  • Since both answers revolved around using find/replace and did not fully resolve the problem, I used the following to complete the task:

    import csv
    
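    # 'path' is a placeholder; use different paths for input and output.
    infile = open('path', 'r')
    outfile = open('path', 'w')  # the output file must be opened for writing
    
    # Substrings to delete from the text.
    findlist = ['lat', 'long', '.jpg']
    replacelist = ["", "", ""]
    
    s = infile.read()
    for item, replacement in zip(findlist, replacelist):
        s = s.replace(item, replacement)
    outfile.write(s)
    
    infile.close()
    outfile.close()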
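
A blanket find/replace also deletes 'lat' or 'long' wherever else they occur (for instance inside a last name such as "Platt"), and it does not add headers. As an alternative, here is a minimal sketch that parses each file name field by field and writes a single CSV for a whole directory. The header names, the `parse_name` and `write_csv` helpers, and the assumption that every name has exactly seven '_'-separated parts are mine, not from the original post.

```python
import csv
import os

# Column names assumed from the example file name in the question.
HEADERS = ["firstname", "lastname", "uniqueid", "date", "lat", "long", "club"]


def parse_name(filename):
    """Split one image file name into its seven fields."""
    stem, _ext = os.path.splitext(filename)  # drops the trailing ".jpg"
    fields = stem.split("_")
    if len(fields) != len(HEADERS):
        raise ValueError("unexpected file name: %r" % filename)
    # Strip each prefix only from the field it belongs to, and turn the
    # "UKN" placeholder into an empty cell.
    for index, prefix in ((4, "lat"), (5, "long")):
        value = fields[index]
        if value.startswith(prefix):
            value = value[len(prefix):]
        fields[index] = "" if value == "UKN" else value
    return fields


def write_csv(dirpath, outpath):
    """Walk dirpath and write a header plus one row per .jpg file."""
    with open(outpath, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(HEADERS)
        for _path, _dirs, files in os.walk(dirpath):
            for name in sorted(files):
                if name.lower().endswith(".jpg"):
                    writer.writerow(parse_name(name))
```

Calling `write_csv('dirpath', 'outfile.csv')` would then produce one file with headers. To keep 'UKN' in the output, as in the desired-output sample in the question, assign `value` directly instead of replacing it with an empty string.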