I have about 1000 txt files in the format below. The identifier is separated from the data by ':' . However the data it self can contains ':' I am working on a script that can allow me to transpose and import the data into an excel, csv, microsoft acess or mssql format. in a structured way. Given that the identifiers will be the same for all the txt files.
Author: sharkespare
Content-Length: 15200098
Content-Type: application/pdf
Creation-Date: 2015-02-05T07:27:34Z
Last-Modified: 2015-02-05T07:28:38Z
Last-Save-Date: 2015-02-05T07:28:38Z
created: Thu Feb 05 08:27:34 CET 2015
creator: Williams spear
date: 2015-02-05T07:27:34Z
dc:creator: Library of congress
What about:
Open txt file
and split by ": "
(with space after the colon), put it into a list.
text=[]
with open('read_file.txt','r') as f:
for line in f:
for word in line.split(": "):
text.append(word)
Now every even element is an identifier and the next odd element ist the corresponding data. Rebuild the list putting every identifier and its data together:
sorted = [text[i:i+2] for i in range(0, len(text), 2)]
Finally write everything into a csv file
in colums using zip
:
import csv
with open('save_file.csv', 'w') as f:
writer = csv.writer(f)
writer.writerows(zip(*sorted))
If you don't want to transpose it, just replace zip(*sorted)
with sorted
.
After saving the text from the first file, loop over the other ones and add only the data
to the csv file
. You can achieve that by opening with the 'a' parameter, allowing you to append stuff at the end of the file, and using [zip(*sorted)[1]]
:
files = ["file1.txt", "file2.txt", "file3.txt", etc]
for each in files:
text=[]
with open(str(each),'r') as f:
for line in f:
for word in line.split(": "):
text.append(word)
sorted = [lst[i:i+2] for i in range(0, len(lst), 2)]
with open('save_file.csv', 'a') as f:
writer = csv.writer(f)
writer.writerows([zip(*sorted)[1]])