I have close to 1,000,000 files, possibly more, in a single directory.
My final goal is to extract some information from the file names alone.
So far I have saved the file names in a list.
What information do the file names contain?
The file names follow a format like this:
09066271_142468576_1_Haha_-Haha-haha_2016-10-07_haha-false_haha2427.txt
Each haha stands for some other text that does not matter.
I want to extract 09066271 and 2016-10-07 from each name and save them in a DataFrame. The first number is always 8 characters long.
So far, I have saved all the text file names in a list:
path = 'path to the saved txt files/fldr'
file_list = os.listdir(path)
At first I wanted to put all the file names into a DataFrame and then do these operations on them. It seems I first have to read the list into a NumPy array and then reshape it so that pandas can read it. However, I do not know in advance what the reshape dimensions should be.
df = pd.DataFrame(np.array(file_list).reshape(,))
I would appreciate your ideas on an efficient way of doing this :)
You can use os.listdir to list all of the files. Then just construct a DataFrame directly from that list (pandas accepts a plain list, so no NumPy reshape is needed) and use the string methods to get the parts of the file names you need.
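import pandas as pd
import os

path = 'path to the saved txt files/fldr'
file_list = os.listdir(path)
# A plain list of names becomes a single-column DataFrame.
df = pd.DataFrame(file_list, columns=['file_name'])
# The leading number is always the first 8 characters of the name.
df['data'] = df.file_name.str[0:8]
# Raw string so the \d escapes reach the regex engine unchanged.
df['date'] = df.file_name.str.extract(r'(\d{4}-\d{2}-\d{2})', expand=True)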
                                           file_name      data        date
0  09066271_142468576_1_Haha_-Haha-haha_2016-10-0...  09066271  2016-10-07
1  09014271_142468576_1_Haha_-Haha-haha_2013-02-1...  09014271  2013-02-18
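With around a million files, it may also be worth parsing the names while the directory is being walked, so that names that do not fit the format are skipped early. The following is only a sketch using the standard library. The helper names parse_file_name and iter_records are hypothetical, and the pattern assumes the 8-character prefix is all digits and is followed by an underscore, which matches the example name but is not confirmed for every file.

```python
import os
import re

# Assumed format: 8 digits at the start, then "_", then (somewhere later)
# the first YYYY-MM-DD date in the name.
NAME_PATTERN = re.compile(r'^(\d{8})_.*?(\d{4}-\d{2}-\d{2})')


def parse_file_name(name):
    """Return (number, date) as strings, or None if the name does not match."""
    match = NAME_PATTERN.match(name)
    return match.groups() if match else None


def iter_records(path):
    """Yield (file_name, number, date) for matching files directly in path."""
    # os.scandir yields entries lazily instead of building one big list first.
    with os.scandir(path) as entries:
        for entry in entries:
            parsed = parse_file_name(entry.name)
            if parsed is not None and entry.is_file():
                yield (entry.name, *parsed)
```

The records can then be turned into a DataFrame with something like pd.DataFrame(list(iter_records(path)), columns=['file_name', 'data', 'date']). Whether this is actually faster than the vectorized string methods would need to be measured on the real directory.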