I am downloading a csv file from gmail attachment using below code. The output file generated looks like ^I delimited and I am not sure how to change it to some other delimiter so that I can load it into a pandas dataframe.
message = service.users().messages().get(userId=user_id, id=msg_id).execute()
for part in message['payload'].get('parts', ''):
if part['filename']:
file_name = part['filename']
att_id = part['body']['attachmentId']
att = service.users().messages().attachments().get(userId=user_id, messageId=msg_id,
id=att_id).execute()
data = att['data']
file_data = base64.urlsafe_b64decode(data.encode('UTF-8'))
with open(path, 'wb') as f:
f.write(file_data)
Output file content is separated like below - Location ID^IProcessed Currency Code^IExternal MID^IDBA Name
Edit 1: Now I am using tab as separator but get a different error while reading file into dataframe. Perhaps, it has got something to do with urlsafe_b64decode or data.encode('UTF-8') but I am not able to understand. The actual attachment file in gmail is a tab separated file.
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xff in position 0: invalid start byte
There are two issues here:
'^I'
The delimiter
'^I'
is another way of representing the tab character. Some applications may choose to represent tab characters in this way rather than using the '\t'
escape sequence. For example vim displays tabs as '^I'
when instructed to display non-printing characters.
Encoding
'0xff'
is part of a byte-order-mark or BOM - an invisible character sequence used to tell applications about the encoding used to encode text. The UTF-16 encoding uses the sequence '0xfe0xff'
to denote big-endian UTF-16 (Python codec 'utf-16-be'). '0xff0xfe'
denotes little-endian UTF-16 (Python codec 'utf-16-le'). UTf-16 uses two bytes to encode characters, so the order of the bytes is important.
So, to successfully open the attachment, it's necessary to specify that the column delimiter is a tab, and the encoding is UTF-16 (Python can use the BOM to work out which version of UTF-16):
pd.read_csv(path, header=0, sep='\t', encoding='utf-16')