python, regex, data-analysis

Extracting information from a text file through regex and/or Python


I'm working with a large number of files (~4 GB worth), each of which contains anywhere between 1 and 100 entries in the following format (everything between two *** lines is one entry):

***
Type:status
Origin: @z_rose yes
Text:  yes
URL: 
ID: 95482459084427264
Time: Mon Jul 25 08:16:06 CDT 2011
RetCount: 0
Favorite: false
MentionedEntities: 20776334 
Hashtags: 
***
***
Type:status
Origin: @aaronesilvers text
Text:  text
URL: 
ID: 95481610861953024
Time: Mon Jul 25 08:12:44 CDT 2011
RetCount: 0
Favorite: false
MentionedEntities: 2226621 
Hashtags: 
***
***
Type:status
Origin: @z_rose text
Text:  text and stuff
URL: 
ID: 95480980026040320
Time: Mon Jul 25 08:10:14 CDT 2011
RetCount: 0
Favorite: false
MentionedEntities: 20776334 
Hashtags: 
***

Now I want to import these into Pandas for mass analysis, but obviously I'd have to convert this into a format Pandas can handle first. So I want to write a script that converts the above into a .csv that looks something like this (User is the file title):

User     Type    Origin               Text  URL   ID                 Time                          RetCount  Favorite  MentionedEntities  Hashtags
4012987  status  @z_rose yes          yes   Null  95482459084427264  Mon Jul 25 08:16:06 CDT 2011  0         false     20776334           Null
4012987  status  @aaronesilvers text  text  Null  95481610861953024  Mon Jul 25 08:12:44 CDT 2011  0         false     2226621            Null

(Formatting isn't perfect but hopefully you get the idea)
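For context, once the data is in a .csv/.tsv like the above, I'd expect to load it for analysis with something like this (just a sketch; the filename and the assumption of tab separation are placeholders on my part):

import pandas as pd

# load the converted file; sep='\t' assumes the columns end up tab-separated
# '4012987.csv' is a placeholder name, one file per user
df = pd.read_csv('4012987.csv', sep='\t')
print(df[['User', 'ID', 'Time']].head())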

I had some code that worked on the assumption that the information regularly comes in segments of 12 lines, but sadly some of the files contain several blank lines within some fields. What I'm basically looking to do is:

fields[] =['User', 'Type', 'Origin', 'Text', 'URL', 'ID', 'Time', 'RetCount', 'Favorite', 'MentionedEntities', 'Hashtags']
starPair = 0;
User = filename;
read(file)
#Determine if the current entry has ended
if(stringRead == "***"){
    if(starPair == 0)
        starPair++;
    else if(starPair == 1){
        row++;
        starPair = 0;
    }
}
#if the string read matches a column field
if(stringRead == fields[])
    while(stringRead != fields[]) #until the next field has been found
        #extract all characters into the correct column field

However, the issue is that some fields can themselves contain the words in fields[]. I can check for a \n character first, which would greatly reduce the number of faulty entries, but it wouldn't eliminate them.

Can anyone point me in the right direction?

Thanks in advance!


Solution

  • Your code/pseudo-code doesn't look like Python, but since you have the python tag, here is how I would do it. First, read the file into a string. Then go through each field, build a regular expression to find the value that follows it, and push the results into a 2D list. Finally, write that 2D list out as a CSV. Also, your desired output looks more like a TSV (tab-separated rather than comma-separated).

    import re
    import csv

    filename = '4012987'
    User = filename

    # read your file into a string
    with open(filename, 'r') as myfile:
        data = myfile.read()

    fields = ['Type', 'Origin', 'Text', 'URL', 'ID', 'Time', 'RetCount', 'Favorite', 'MentionedEntities', 'Hashtags']
    csvTemplate = [['User', 'Type', 'Origin', 'Text', 'URL', 'ID', 'Time', 'RetCount', 'Favorite', 'MentionedEntities', 'Hashtags']]

    # for each field, use a regex to collect every value that follows it
    for n, field in enumerate(fields):
        matches = re.findall(field + r':\s?([^\n]*)\n+', data)
        # this should run only the first time, to fill the 2d list with the right number of rows
        while len(csvTemplate) <= len(matches):
            csvTemplate.append([None] * (len(fields) + 1))  # Null isn't a Python reserved word
        for e, m in enumerate(matches):
            if m != '':
                csvTemplate[e + 1][n + 1] = m.strip()
    # set the User column
    for i in range(1, len(csvTemplate)):
        csvTemplate[i][0] = User
    # output to csv....if you want tsv look at https://stackoverflow.com/a/29896136/3462319
    with open("output.csv", "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerows(csvTemplate)
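
    A possible refinement, not part of the code above: since the question notes that field names may also occur inside the values themselves, you could anchor each pattern to the start of a line with re.MULTILINE so that only lines actually beginning with the label are matched. A minimal sketch of the difference, using a made-up two-line snippet in the question's format:

        import re

        # toy example where 'Type:' also appears inside the Text value
        data = "Origin: @z_rose yes\nText:  yes, Type: is mentioned here\n"

        # unanchored: also picks up 'Type:' inside the Text value
        print(re.findall(r'Type:\s?([^\n]*)', data))                   # ['is mentioned here']
        # anchored at the start of a line: no false match
        print(re.findall(r'^Type:\s?([^\n]*)', data, re.MULTILINE))    # []

    And if you want the tab-separated layout shown in the question rather than commas, passing delimiter='\t' to csv.writer is the only change needed (see the linked answer for details).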