python pandas numpy readfile

How to save file names in a dataframe and then extract information from them


I have about 1,000,000 files, possibly more, in a directory. My final goal is to extract some information from the file names alone. So far I have saved the file names in a list.

What information is in the file names?

The file names follow a format like this:

09066271_142468576_1_Haha_-Haha-haha_2016-10-07_haha-false_haha2427.txt

Every haha stands for some other text that does not matter.

I want to extract 09066271 and 2016-10-07 from each name and save them in a dataframe. The first number is always 8 characters long.

So far, I have saved all the text file names in a list:

import os

path = 'path to the saved txt files/fldr'
file_list = os.listdir(path)

At first I wanted to put all the file names into a dataframe and then do these operations on them. It seems I first have to read the list into a numpy array and then reshape it so that pandas can read it. However, I do not know beforehand what the reshape dimensions will be.

df = pd.DataFrame(np.array(file_list).reshape(,))

I would appreciate your ideas on an efficient way of doing this :)


Solution

  • You can use os to list all of the files. Then construct a DataFrame directly from that list (no numpy reshape is needed, since a flat list becomes a single column) and use the string methods to get the parts of the file names you need.

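    import os
    
    import pandas as pd
    
    path = 'path to the saved txt files/fldr'
    file_list = os.listdir(path)
    
    # A flat list of names becomes a single-column DataFrame directly
    df = pd.DataFrame(file_list, columns=['file_name'])
    # The leading number is always the first 8 characters
    df['data'] = df['file_name'].str[0:8]
    # Raw string so \d is not read as a string escape; expand=False returns a Series
    df['date'] = df['file_name'].str.extract(r'(\d{4}-\d{2}-\d{2})', expand=False)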
    

                                               file_name      data        date
    0  09066271_142468576_1_Haha_-Haha-haha_2016-10-0...  09066271  2016-10-07
    1  09014271_142468576_1_Haha_-Haha-haha_2013-02-1...  09014271  2013-02-18
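
If you would rather pull both pieces out in one pass, a single pattern with named groups may also work. This is only a sketch, not part of the answer above: the two sample names are hypothetical, and it assumes every name starts with the 8-digit number followed by an underscore and that the first YYYY-MM-DD after it is the date you want.

```python
import pandas as pd

# Hypothetical sample names following the pattern described in the question
file_list = [
    "09066271_142468576_1_Haha_-Haha-haha_2016-10-07_haha-false_haha2427.txt",
    "09014271_142468576_1_Haha_-Haha-haha_2013-02-18_haha-false_haha2427.txt",
]

names = pd.Series(file_list, name="file_name")

# Named groups become column names: 8 leading digits, then the first date-like token
pattern = r"^(?P<data>\d{8})_.*?(?P<date>\d{4}-\d{2}-\d{2})"
parts = names.str.extract(pattern)

# Put the original names next to the extracted columns
df = pd.concat([names, parts], axis=1)

# Optional: convert the date strings to real datetimes for sorting and filtering
df["date"] = pd.to_datetime(df["date"], format="%Y-%m-%d")
```

Names that do not match the pattern end up with missing values in both columns, so `df[df["data"].isna()]` is a quick way to spot files that break the assumed format.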