Search code examples
pythonpandascsvtxt

How can I separate a text file into columns, but only for specific lines?


I have a .psa file that I need manipulated, but to my knowledge the file needs to be in a .txt or .csv to manipulate using pandas. To solve this issue, I am reading the original file and writing the contents into another .txt file to which my code will apply to.

The original .psa file has text data in it, all separated by commas. I am trying to organize this data into columns and only pulling the data I need. Each line has 30+ values separated by commas, but I only need the 3rd value to put into a column.
I will have a zip folder which needs this code to run through and do the same thing to each file within the folder. Each file will have a different store number in the title.

Example:

filename: 1 Area 2 - store 15 group.psa

prod,123,456,abc,def, etc...
pla,124,uhj,jop,etc. 
prod,321,789,ghi,jkl, etc...
...

Expectation: I want to only take out the third item in the lines that start with prod and put it into a .csv file. I want to also keep the title of the original file in another column (would be very nice if it is only the store number included on there, but not a necessity). Ex.

nums store #
456 15
789 15

Here is the code I have so far:

with open('1 Area 2 - store 15 group.psa','r') as firstfile, open('test.txt','a') as secondfile: 
    # read content from first file 
    for line in firstfile: 
         # append content to second file 
         secondfile.write(line)

file = pd.read_csv("test.txt", sep=',', usecols=[0,1,2], header=0, names=['col 1','col 2','col 3'])
file.to_csv("output.csv", index=False) 

This code is able to give me the columns as output, but the rows end up including lines that don't start with prod and I have 3 columns instead of just the num column (get an error when I only do [usecols=2], so the data is still messy and I have no idea how to get the title of the original file in the second column.


Solution

  • You can do everything you want inside pandas without writing to a temporary file; just read in the first and third columns, filter on the first column value and then drop it; finally add the store number (extracted from the filename) as the second column:

    import re
    
    fname = '1 Area 2 - store 15 group.psa'
    df = pd.read_csv(fname, usecols=[0, 2], header=None, names=['type', 'num'])
    store = re.search(r'store\s+(\d+)', fname).group(1)
    df = df[df['type'] == 'prod'].drop(columns='type').assign(store=store)
    df.to_csv("output.csv", index=False) 
    

    Output (for your sample data):

    num,store
    456,15
    789,15