Search code examples
pythonpandasbase64gmail-apibase64url

Convert ^I delimited file to pandas dataframe


I am downloading a csv file from gmail attachment using below code. The output file generated looks like ^I delimited and I am not sure how to change it to some other delimiter so that I can load it into a pandas dataframe.

message = service.users().messages().get(userId=user_id, id=msg_id).execute()
for part in message['payload'].get('parts', ''):
    if part['filename']:
        file_name = part['filename']
        att_id = part['body']['attachmentId']
        att = service.users().messages().attachments().get(userId=user_id, messageId=msg_id,
                                                               id=att_id).execute()
        data = att['data']
file_data = base64.urlsafe_b64decode(data.encode('UTF-8'))
with open(path, 'wb') as f:
    f.write(file_data)

Output file content is separated like below - Location ID^IProcessed Currency Code^IExternal MID^IDBA Name

Edit 1: Now I am using tab as separator but get a different error while reading file into dataframe. Perhaps, it has got something to do with urlsafe_b64decode or data.encode('UTF-8') but I am not able to understand. The actual attachment file in gmail is a tab separated file.

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xff in position 0: invalid start byte


Solution

  • There are two issues here:

    • the column delimiter '^I'
    • the encoding of the data

    The delimiter

    '^I' is another way of representing the tab character. Some applications may choose to represent tab characters in this way rather than using the '\t' escape sequence. For example vim displays tabs as '^I' when instructed to display non-printing characters.

    Encoding

    '0xff' is part of a byte-order-mark or BOM - an invisible character sequence used to tell applications about the encoding used to encode text. The UTF-16 encoding uses the sequence '0xfe0xff' to denote big-endian UTF-16 (Python codec 'utf-16-be'). '0xff0xfe' denotes little-endian UTF-16 (Python codec 'utf-16-le'). UTf-16 uses two bytes to encode characters, so the order of the bytes is important.

    So, to successfully open the attachment, it's necessary to specify that the column delimiter is a tab, and the encoding is UTF-16 (Python can use the BOM to work out which version of UTF-16):

    pd.read_csv(path, header=0, sep='\t', encoding='utf-16')