I am a decent SAS programmer, but I am quite new in Python. Now, I have been given Twitter feeds, each saved into very large flat files, with headers in row #1 and a data structure like the below:
CREATED_AT||||ID||||TEXT||||IN_REPLY_TO_USER_ID||||NAME||||SCREEN_NAME||||DESCRIPTION||||FOLLOWERS_COUNT||||TIME_ZONE||||QUOTE_COUNT||||REPLY_COUNT||||RETWEET_COUNT||||FAVORITE_COUNT Tue Nov 14 12:33:00 +0000 2017||||930413253766791168||||ICYMI: Football clubs join the craft beer revolution! A good read|||| ||||BAB||||BABBrewers||||Monthly homebrew meet-up at 1000 Trades, Jewellery Quarter. First Tuesday of the month. All welcome, even if you've never brewed before.||||95|||| ||||0||||0||||0||||0 Tue Nov 14 12:34:00 +0000 2017||||930413253766821456||||I'm up for it|||| ||||Misty||||MistyGrl||||You CAN DO it!||||45|||| ||||0||||0||||0||||0
I guess it's like that because any sort of characters can be found in a Twitter feed, but a quadruple pipe is unlikely enough.
I know some people use JSON for that, but I've got these files as such: lots of them. I could use SAS to easily transform these files, but I prefer to "go pythonic", this time.
Now, I cannot seem to find a way to make Python (2.7) understand that the quadruple pipe is the actual separator. The output from the code below:
import pandas as pd
with open('C:/Users/myname.mysurname/Desktop/my_twitter_flow_1.txt') as theInFile:
inTbl = pd.read_table(theInFile, engine='python', sep='||||', header=1)
print inTbl.head()
seem to suggest that Python does not see the distinct fields as distinct but, simply, brings in each of the first 5 rows, up to the line feed character, ignoring the |||| separator.
Basically, I am getting an output like the one I wrote above to show you the data structure.
Any hints?
Using just the data in your question:
>>> df = pd.read_csv('rio.txt', sep='\|{4}', skip_blank_lines=True, engine='python')
>>> df
CREATED_AT ID \
0 Tue Nov 14 12:33:00 +0000 2017 930413253766791168
1 Tue Nov 14 12:34:00 +0000 2017 930413253766821456
TEXT IN_REPLY_TO_USER_ID \
0 ICYMI: Football clubs join the craft beer revo...
1 I'm up for it
NAME SCREEN_NAME DESCRIPTION \
0 BAB BABBrewers Monthly homebrew meet-up at 1000 Trades, Jewel...
1 Misty MistyGrl You CAN DO it!
FOLLOWERS_COUNT TIME_ZONE QUOTE_COUNT REPLY_COUNT RETWEET_COUNT \
0 95 0 0 0
1 45 0 0 0
FAVORITE_COUNT
0 0
1 0
Notice the sep
parameter. When it's more than one character long and not equal to '\s+' it's interpreted as a regular expression. But the '|' character has special meaning in a regex, hence it must be escaped, using the '\' character. I could simply have written sep='\|\|\|\|'
; however, I've used an abbreviation.