python pandas csv zip txt

Is it possible to apply the same code to all txt files within a zip file in Python?


I have a piece of code that manipulates data of a txt file and writes a new csv file with the manipulated data. The original file does not have headers and column 1 includes unwanted data.

The code does 3 things:

  1. Removes two of the 4 columns
  2. Adds column headers
  3. Changes the content of one of the remaining columns to remove characters around the desired numbers (basically takes out prefix and suffix around the numbers).

    import pandas as pd

    # The source file has no header row, so header=None keeps the first data row
    file = pd.read_csv("example.txt", usecols=[0, 1], header=None)  # only the first 2 columns

    headerList = ['store', 'sku']  # column headers

    file.to_csv("test.csv", header=headerList, index=False)  # write a new csv file with headers

    file = pd.read_csv("test.csv")  # re-read the new file, now with headers

    file['store'] = file['store'].str.split('R ').str[-1]  # remove chars before the store number
    file['store'] = file['store'].str.split(' -').str[0]  # remove chars after the store number

    file.to_csv("test.csv", index=False)  # overwrite the csv with the cleaned data

This is easy to do one file at a time, but I would like to apply this code to every file in a zip file. The files are all formatted the same way but have different names and data. Is there a way to loop over each file in the zip, run this code on it, and create a new zip file with the modified data?


Solution

  • From the read_csv docs, you can pass in a filename or a buffer (that is, a file-like object), and zipfile.ZipFile.open opens a member of a zip archive as a file-like object. Put those together and you can iterate over the archive's members, processing each one. You can also apply your own header to the data as you read it, so there is no need for an intermediate file.

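    import pandas as pd
    import zipfile

    with zipfile.ZipFile("example.zip") as zippy:
        for info in zippy.infolist():  # one ZipInfo entry per archive member
            with zippy.open(info) as member:
                # header=None: the files have no header row, so keep the first
                # line as data and label the two columns with names=
                df = pd.read_csv(member, usecols=[0, 1],
                                 header=None, names=['store', 'sku'])
            print(df)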
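
The question also asks for a new zip file containing the modified data. One possible way to get there is sketched below, reusing the same split logic on the store column and writing each result into a second archive with ZipFile.writestr. The function names clean_store and convert_zip are hypothetical, as is the choice to skip members that do not end in .txt and to rename each output to .csv. Reading with dtype=str is also an assumption, made so the .str methods always apply and values such as leading zeros in sku are kept as written.

```python
import os
import zipfile

import pandas as pd


def clean_store(series):
    """Drop everything up to 'R ' and everything from ' -' onward."""
    return series.str.split("R ").str[-1].str.split(" -").str[0]


def convert_zip(src_path, dst_path):
    """Clean every .txt member of src_path and store it as .csv in dst_path."""
    with zipfile.ZipFile(src_path) as src, \
            zipfile.ZipFile(dst_path, "w", zipfile.ZIP_DEFLATED) as dst:
        for info in src.infolist():
            # Skip directory entries and anything that is not a .txt file
            if info.is_dir() or not info.filename.lower().endswith(".txt"):
                continue
            with src.open(info) as member:
                # No header row in the source files: keep the first line as data
                df = pd.read_csv(member, usecols=[0, 1], header=None,
                                 names=["store", "sku"], dtype=str)
            df["store"] = clean_store(df["store"])
            # Same member name, but with a .csv extension
            csv_name = os.path.splitext(info.filename)[0] + ".csv"
            # to_csv() without a path returns the CSV text, so no temp file is needed
            dst.writestr(csv_name, df.to_csv(index=False))
```

Calling convert_zip("example.zip", "example_clean.zip") would then produce the second archive; the output name here is only a placeholder.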