I'm working with a large amount of files (~4gb worth) which all contain anywhere between 1 and 100 entries with the following format (between two *** is one entry):
***
Type:status
Origin: @z_rose yes
Text: yes
URL:
ID: 95482459084427264
Time: Mon Jul 25 08:16:06 CDT 2011
RetCount: 0
Favorite: false
MentionedEntities: 20776334
Hashtags:
***
***
Type:status
Origin: @aaronesilvers text
Text: text
URL:
ID: 95481610861953024
Time: Mon Jul 25 08:12:44 CDT 2011
RetCount: 0
Favorite: false
MentionedEntities: 2226621
Hashtags:
***
***
Type:status
Origin: @z_rose text
Text: text and stuff
URL:
ID: 95480980026040320
Time: Mon Jul 25 08:10:14 CDT 2011
RetCount: 0
Favorite: false
MentionedEntities: 20776334
Hashtags:
***
Now I want to somehow import these into Pandas for mass analysis, but obviously I'd have to convert this into a format Pandas can handle. So I want to write a script that converts the above into a .csv looking something like this (User is the file title):
User Type Origin Text URL ID Time RetCount Favorite MentionedEntities Hashtags
4012987 status @z_rose yes yes Null 95482459084427264 Mon Jul 25 08:16:06 CDT 2011 0 false 20776334 Null
4012987 status @aaronsilvers text text Null 95481610861953024 Mon Jul 25 08:12:44 CDT 2011 0 false 2226621 Null
(Formatting isn't perfect but hopefully you get the idea)
I've had some code working that worked on the basis of it regularly being information in segments of 12, but sadly some of the files contain several whitelines in some fields. What I'm basically looking to do is:
fields[] =['User', 'Type', 'Origin', 'Text', 'URL', 'ID', 'Time', 'RetCount', 'Favorite', 'MentionedEntities', 'Hashtags']
starPair = 0;
User = filename;
read(file)
#Determine if the current entry has ended
if(stringRead=="***"){
if(starPair == 0)
starPair++;
if(starPair == 1){
row=row++;
starPair = 0;
}
}
#if string read matches column field
if(stringRead == fields[])
while(strRead != fields[]) #until next field has been found
#extract all characters into correct column field
However the issue arises that some fields can contain the words in fields[].. I can check for a \n char first, which would greatly reduce the amount of faulty entries, but wouldn't eliminate them.
Can anyone point me in the right direction?
Thanks in advance!
Your code/pseudo-code doesn't look like python but because you have the python tag here is how I would do it. First, read the file into a string, then go through each field and make a regular expression to find the value after it, push the result into a 2d list, and then output that 2d list into a CSV. Also, your CSV looks more like a TSV (tab separated instead of comma separated).
import re
import csv
filename='4012987'
User=filename
# read your file into a string
with open(filename, 'r') as myfile:
data=myfile.read()
fields =['Type', 'Origin', 'Text', 'URL', 'ID', 'Time', 'RetCount', 'Favorite', 'MentionedEntities', 'Hashtags']
csvTemplate = [['User','Type', 'Origin', 'Text', 'URL', 'ID', 'Time', 'RetCount', 'Favorite', 'MentionedEntities', 'Hashtags']]
# for each field use regex to get the entry
for n,field in enumerate(fields):
matches = re.findall(field+':\s?([^\n]*)\n+', data)
# this should run only the first time to fill your 2d list with the right amount of lists
while len(csvTemplate)<=len(matches):
csvTemplate.append([None]*(len(fields)+1)) # Null isn't a python reserved word
for e,m in enumerate(matches):
if m != '':
csvTemplate[e+1][n+1]=m.strip()
# set the User column
for i in range(1,len(csvTemplate)):
csvTemplate[i][0] = User
# output to csv....if you want tsv look at https://stackoverflow.com/a/29896136/3462319
with open("output.csv", "wb") as f:
writer = csv.writer(f)
writer.writerows(csvTemplate)