Search code examples
pythonpandasseparator

Python 2.7 - pandas.read_table - how to import quadruple-pipe-separated fields from flat file


I am a decent SAS programmer, but I am quite new in Python. Now, I have been given Twitter feeds, each saved into very large flat files, with headers in row #1 and a data structure like the below:

CREATED_AT||||ID||||TEXT||||IN_REPLY_TO_USER_ID||||NAME||||SCREEN_NAME||||DESCRIPTION||||FOLLOWERS_COUNT||||TIME_ZONE||||QUOTE_COUNT||||REPLY_COUNT||||RETWEET_COUNT||||FAVORITE_COUNT

Tue Nov 14 12:33:00 +0000 2017||||930413253766791168||||ICYMI: Football clubs join the craft beer revolution! A good read|||| ||||BAB||||BABBrewers||||Monthly homebrew meet-up at 1000 Trades, Jewellery Quarter. First Tuesday of the month. All welcome, even if you've never brewed before.||||95|||| ||||0||||0||||0||||0
Tue Nov 14 12:34:00 +0000 2017||||930413253766821456||||I'm up for it|||| ||||Misty||||MistyGrl||||You CAN DO it!||||45|||| ||||0||||0||||0||||0

I guess it's like that because any sort of characters can be found in a Twitter feed, but a quadruple pipe is unlikely enough.

I know some people use JSON for that, but I've got these files as such: lots of them. I could use SAS to easily transform these files, but I prefer to "go pythonic", this time.

Now, I cannot seem to find a way to make Python (2.7) understand that the quadruple pipe is the actual separator. The output from the code below:

import pandas as pd

with open('C:/Users/myname.mysurname/Desktop/my_twitter_flow_1.txt') as theInFile:
    inTbl = pd.read_table(theInFile, engine='python', sep='||||', header=1)
    print inTbl.head()

seem to suggest that Python does not see the distinct fields as distinct but, simply, brings in each of the first 5 rows, up to the line feed character, ignoring the |||| separator.

Basically, I am getting an output like the one I wrote above to show you the data structure.

Any hints?


Solution

  • Using just the data in your question:

    >>> df = pd.read_csv('rio.txt', sep='\|{4}', skip_blank_lines=True, engine='python')
    >>> df
                           CREATED_AT                  ID  \
    0  Tue Nov 14 12:33:00 +0000 2017  930413253766791168   
    1  Tue Nov 14 12:34:00 +0000 2017  930413253766821456   
    
                                                    TEXT IN_REPLY_TO_USER_ID  \
    0  ICYMI: Football clubs join the craft beer revo...                       
    1                                      I'm up for it                       
    
        NAME SCREEN_NAME                                        DESCRIPTION  \
    0    BAB  BABBrewers  Monthly homebrew meet-up at 1000 Trades, Jewel...   
    1  Misty    MistyGrl                                     You CAN DO it!   
    
       FOLLOWERS_COUNT TIME_ZONE  QUOTE_COUNT  REPLY_COUNT  RETWEET_COUNT  \
    0               95                      0            0              0   
    1               45                      0            0              0   
    
       FAVORITE_COUNT  
    0               0  
    1               0  
    

    Notice the sep parameter. When it's more than one character long and not equal to '\s+' it's interpreted as a regular expression. But the '|' character has special meaning in a regex, hence it must be escaped, using the '\' character. I could simply have written sep='\|\|\|\|'; however, I've used an abbreviation.