Search code examples
pythoncsvpandasglob

Taking out bits of CSVs


I am new to programming and I have a bunch of CSV files that are about 50 to 60 rows each. After an unspecified amount of rows, there is a string called 'NAME' in the second column. I want to take everything in the second column after 'NAME' and print it to a text file. I initially thought to use pandas and skiprows to do this however the problem is that each csv I run through will have 'NAME' in a different row. Also, if it helps, there is a blank line three rows before 'NAME' in every CSV.

header, header   
header, header                           
NUMBER,ITEM
N1,Shoe
N2,Heel
N3,Tee, random stuff
N4,Polo, random stuff
N5,Sneaker
N10,Heel
N11,Tee
...
...
...

How         
Count   17      
SORT,NAME       
H1,Thing, random stuff   
H2,WTANK        
H3,TEE2  
H4,TEE  

I would also like to not have repeats in the text file because I will be running through around 1000 CSVs in a directory. Here is a bit of code that I started out with and is where i got stuck.

import pandas as pd
import csv
import glob

fns = glob.glob('*.csv') #goes through every CSV file in director
for csv in fns:
    prod_df = pd.read_csv(csv, skiprows=???)
    with open (os.path.join('out', fn), 'wb') as f:
        w = csv.writer(f)
        test_alias = prod_df['NAME'].unique()
        w.writerow(row) 

I know it doesn't work, and is probably not a very good bit of code. Any help would be greatly appreciated. Thank You!


Solution

  • Assuming that NAME only occurs one time you could use get_loc

    import pandas as pd
    import glob
    
    fn = 'out.csv'
    fns = glob.glob('*.csv') #goes through every CSV file in director
    colname = 'ITEM'
    rowname = 'NAME'
    for csv in fns:
        s = pd.read_csv(csv, usecols=[colname], squeeze=True, skiprows=7, header=0)
        row_idx = pd.Index(s).get_loc(rowname)
        sub_items = s[row_idx + 1:]
        pd.Series(sub_items.unique()).to_csv(fn, mode='a')
    

    EDIT: Explanation

    usecols only reads in the column you need, i.e., ITEM

    squeeze=True returns a Series

    get_loc returns the index of its argument

    You said you wanted everything after 'NAME', so the index is row_idx + 1

    mode='a' appends to a single csv file (called out.csv), if you want to write to individual csv files then you should adjust your code accordingly