Tags: python, unicode, utf-8, ftp

FTP tab-delimited text file unable to save as UTF-8


First - no, I cannot change the FTP settings. The server is entirely locked down because it is embedded in a device and extremely old. It doesn't support PASV, TLS, or anything a more modern server would. The software is WS_FTP 4.8 (as far as I can tell); I can't even find a release date for that version, but it is probably somewhere around 20 years old. I have contacted the company that owns these devices to see if they will do the right thing and ship a firmware update with a better FTP server, but I have not received a response.

Issuing PASV gives a 502, so I know for sure the server doesn't support it; I am not even sure that has anything to do with this issue. I think it comes down to whatever the underlying OS on this device is.
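
The client-side workaround for the missing PASV support is just to force active mode, which is what the connection helper further down does. A minimal sketch, using the variable names from the script below:

        ftp = ftplib.FTP()
        ftp.connect(ftp_ip)
        ftp.login(ftp_username, ftp_password)
        ftp.set_pasv(False)  # no PASV support, so use active (PORT) transfers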

FTP login messages:

Connection established, waiting for welcome message...
Status: Insecure server, it does not support FTP over TLS.
Status: Server does not support non-ASCII characters.

(That last status line, as far as I can tell, is about filenames on the control connection - the server not advertising the UTF8 feature - not about the contents of the files.) Here are the different things I have tried to fix this:

        with open(local_temp_file, 'w', encoding='UTF-8', errors='replace') as local_file:
            conn.retrbinary('RETR ' + filename_convention
                            + yesterday + '.txt', local_file.write)

FTP Logs:

*resp* '200 Type set to I.'
*resp* '200 PORT command successful.'
*cmd* 'RETR Data Log Trend_Ops_Data_Log_230804.txt'
*resp* '150 Opening BINARY mode data connection for Data Log Trend_Ops_Data_Log_230804.txt.'

Traceback:

{'TypeError'}
Traceback (most recent call last):
  File "c:\users\justin\onedrive\documents\epic_cleantec_work\ftp log retriever\batch_data_get.py", line 123, in get_log_data
    conn.retrbinary('RETR ' + filename_convention
  File "D:\Anaconda\Lib\ftplib.py", line 441, in retrbinary
    callback(data)
TypeError: write() argument must be str, not bytes

Ok cool - str, not bytes: retrbinary hands raw bytes to the callback, but the file is open in text mode. So, switch to retrlines:

        with open(local_temp_file, 'w', encoding='UTF-8', errors='replace') as local_file:
            conn.retrlines('RETR ' + filename_convention
                           + yesterday + '.txt', local_file.write)

FTP logs:

*resp* '200 Type set to A.'
*resp* '200 PORT command successful.'
*cmd* 'RETR Data Log Trend_Ops_Data_Log_230804.txt'
*resp* '150 Opening ASCII mode data connection for Data Log Trend_Ops_Data_Log_230804.txt.'

Traceback:

{'UnicodeDecodeError'}
Traceback (most recent call last):
  File "c:\users\justin\onedrive\documents\epic_cleantec_work\ftp log retriever\batch_data_get.py", line 123, in get_log_data
    conn.retrlines('RETR ' + filename_convention
  File "D:\Anaconda\Lib\ftplib.py", line 465, in retrlines
    line = fp.readline(self.maxline + 1)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "<frozen codecs>", line 322, in decode
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xff in position 0: invalid start byte
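
Worth pausing on that 0xff in position 0: a UTF-16 LE file starts with the byte order mark FF FE, so once the file has been pulled down in plain binary mode, peeking at the first two bytes points straight at the answer. A minimal sketch, assuming local_temp_file holds the raw download:

        with open(local_temp_file, 'rb') as f:
            head = f.read(2)
        if head == b'\xff\xfe':
            print('UTF-16 LE BOM - this is not a UTF-8 file')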

Now, I have an FTP client, I have downloaded these files successfully with it, and my code that sends to S3 works. The problem is that I don't know what the encoding actually is.

Opening the file with OpenOffice Calc just says "Unicode".

I tried

import chardet

def detect_encoding(file):
    # Feed chardet line by line until it is confident of the encoding
    detector = chardet.universaldetector.UniversalDetector()
    with open(file, "rb") as f:
        for line in f:
            detector.feed(line)
            if detector.done:
                break
        detector.close()
    return detector.result

and then f = open(local_temp_file, 'wb')

        conn.retrbinary('RETR ' + filename_convention
                        + yesterday + '.txt', f.write)
        f.close()
        f.encode('utf-8')
        print(detect_encoding(f))

Traceback:

{'AttributeError'}
Traceback (most recent call last):
  File "c:\users\justin\onedrive\documents\epic_cleantec_work\ftp log retriever\batch_data_get.py", line 139, in get_log_data
    f.encode('utf-8')
    ^^^^^^^^
AttributeError: '_io.BufferedWriter' object has no attribute 'encode'

Right - .encode is a str method; a BufferedWriter just writes bytes, so there is nothing to encode there. I also tried the above function with

        f = open(local_temp_file, 'wb')

        conn.retrbinary('RETR ' + filename_convention
                        + yesterday + '.txt', f.write)
        f.close()

Traceback:

{'TypeError'}
Traceback (most recent call last):
  File "c:\users\justin\onedrive\documents\epic_cleantec_work\ftp log retriever\batch_data_get.py", line 140, in get_log_data
    print(detect_encoding(f))
          ^^^^^^^^^^^^^^^^^^
  File "c:\users\justin\onedrive\documents\epic_cleantec_work\ftp log retriever\batch_data_get.py", line 69, in detect_encoding
    with open(file, "rb") as f:
         ^^^^^^^^^^^^^^^^
TypeError: expected str, bytes or os.PathLike object, not BufferedWriter
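
The fix for this last one is simply to pass the path rather than the file handle, since detect_encoding opens the file itself:

        print(detect_encoding(local_temp_file))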

Ultimately, here is why this matters: these tab-delimited text files get dropped into S3 and crawled by a Glue crawler, and currently the crawler cannot read the columns properly. It sees only a single column, because the start of the file shows ?? in diamonds (the Unicode replacement character), which tells me the file is encoded in some way that the crawler doesn't recognize as a CSV with a '\t' delimiter (yes, I set a classifier for this - see the sketch below). As far as I can tell, Glue's built-in CSV classification expects UTF-8, which would explain why a file in another encoding never splits into columns. I also need to append two fields (columns) to the tab-delimited file to give the site name and a timestamp for each row of data, which I can't do while the content is an undecoded blob.
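
For anyone reproducing the classifier piece, a tab-delimiter CSV classifier can be declared through boto3 roughly as follows - the classifier name is hypothetical and this is only a sketch, not necessarily how mine is configured:

        import boto3

        glue = boto3.client('glue', region_name='us-west-1')
        glue.create_classifier(CsvClassifier={
            'Name': 'tab-delimited-logs',  # hypothetical classifier name
            'Delimiter': '\t',             # tab, to match these files
            'QuoteSymbol': '"',
            'ContainsHeader': 'PRESENT',
        })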

I am likely missing something simple, but I have scoured Google and a lot of SO posts, and I cannot seem to find a solution.


Solution

  • Large thanks to Martin, TDelaney, and Ed_ for making me think about a few things in a different way. First, I downloaded the file and opened it in Notepad on Windows, when I remembered it shows the encoding in the lower right corner. TDelaney wins: UTF-16. Ed_ for pointing out that I wasn't actually sending the file, just the string of the file path. And Martin for making sure I did not stay focused on the FTP as the issue. Cheers!

    Here is the outcome of their push:

    def convert_to_utf8(check_file, output_file):

        # Let chardet inspect the raw bytes (it spots the UTF-16 BOM here)
        with open(check_file, 'rb') as of:
            chardet_data = chardet.detect(of.read())
            fileencoding = chardet_data['encoding']
            print('fileencoding', fileencoding)

        # Already UTF-8 (or ASCII, a strict subset of it) - nothing to do
        if fileencoding in ['utf-8', 'ascii']:
            return {'re-encoded': False, 'encoding': fileencoding}

        # Otherwise decode with the detected encoding and rewrite as UTF-8
        else:
            with open(check_file, 'r', encoding=fileencoding) as of, \
                    open(output_file, 'w', encoding='utf-8') as cf:
                cf.write(of.read())

            return {'re-encoded': True, 'encoding': fileencoding}
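
    For what it's worth, chardet keys off the byte order mark here, so even a tiny UTF-16 sample is detected with full confidence. A quick standalone sanity check:

        import chardet

        sample = 'col1\tcol2\n'.encode('utf-16')  # includes the FF FE BOM
        print(chardet.detect(sample))
        # prints roughly {'encoding': 'UTF-16', 'confidence': 1.0, 'language': ''}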
    

    .....

            # Create the temporary file paths for later use
            temp_dir = tempfile.gettempdir()
    
            if not os.path.exists(temp_dir):
                # Fall back to a tmp dir next to this script; os.makedirs
                # actually creates it (mkdtemp's first positional argument
                # is a suffix, not a path, so the old call was wrong)
                temp_dir = os.path.join(
                    os.path.dirname(os.path.abspath(__file__)), 'tmp')
                os.makedirs(temp_dir, exist_ok=True)
    
            inp_file = os.path.join(
                temp_dir, filename_convention + yesterday + '.txt')
    
            temp_file = os.path.join(temp_dir, 'temp_file.tsv')
    
            op_file = os.path.join(
                temp_dir, filename_convention + yesterday + '.tsv')
    
            # Step 4: Connect to the current FTP, Download the File,
            # Write to s3 and clean up
            # Connect to the FTP site
            conn = ftp_connection(ftp_ip, ftp_username, ftp_password)
    
            # Change the directory
            conn.cwd(file_directory)
            conn.sendcmd('SITE CHMOD 777 ' + filename_convention
                         + yesterday + '.txt')
    
            # Download the file to the local temporary directory
            with open(inp_file, 'wb') as f:
                conn.retrbinary('RETR ' + filename_convention
                                + yesterday + '.txt', f.write)
    
            s3_file = convert_to_utf8(inp_file,
                                      temp_file)
    
            print(s3_file['re-encoded'], s3_file['encoding'])
    
            # If no re-encode happened, temp_file was never written, so
            # read whichever file actually holds the UTF-8 data
            src_file = temp_file if s3_file['re-encoded'] else inp_file

            with open(src_file, 'r', encoding='utf-8') \
                as tsv_inp, open(op_file, 'w',
                                 newline='\n') as tsv_op:
    
                csv_reader = csv.reader(tsv_inp, delimiter='\t')
                csv_writer = csv.writer(tsv_op, delimiter='\t')
    
                # Read the header row (None if the file is empty)
                headers = next(csv_reader, None)
                csv_writer.writerow((headers or [])
                                    + ['SITE_NAME', 'SNAPSHOT'])
    
                for row in csv_reader:
                    row.append('{}'.format(ftp_site_name))
                    row.append('{}'.format(timestamp))
                    csv_writer.writerow(row)
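
    One way to confirm the result is to point chardet at the finished file:

        with open(op_file, 'rb') as f:
            print(chardet.detect(f.read()))  # expect utf-8, or ascii if no high characters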
    

    I have confirmed that the output is now UTF-8. Yes, I know the CSV append is not working yet.

    You could probably make the converter more flexible if you wanted: call it convert_encoding and pass in the desired encoding. Something like

    def convert_encoding(check_file, output_file, desiredencoding):

        with open(check_file, 'rb') as of:
            chardet_data = chardet.detect(of.read())
            fileencoding = chardet_data['encoding']
            print('fileencoding', fileencoding)

        # Compare normalized names (needs import codecs): chardet reports
        # e.g. 'UTF-16' while the caller might pass 'utf-16'
        if codecs.lookup(fileencoding).name \
                == codecs.lookup(desiredencoding).name:
            return {'re-encoded': False, 'encoding': fileencoding}

        else:
            with open(check_file, 'r', encoding=fileencoding) as of, \
                    open(output_file, 'w', encoding=desiredencoding) as cf:
                cf.write(of.read())

            return {'re-encoded': True, 'encoding': fileencoding}
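
    Called the same way, just with the target encoding named explicitly, e.g. convert_encoding(inp_file, temp_file, 'utf-8'). Note it needs import codecs for the name normalization.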
    

    Now I'm off to add file cleanup and logging instead of printing. If anyone is interested, here is the code with some anonymization (yes, it needs more work, but I am self-taught and still trying to understand main and classes):

    from datetime import datetime as dt
    from datetime import timedelta
    import base64
    import boto3
    import botocore
    import chardet
    import csv
    import datetime
    import ftplib
    import os
    import tempfile
    import traceback
    
    
    def assume_role_with_iam_user(access_key, secret_key, role_arn,
                                  session_name='AssumedSession'):
    """
    Assumes an IAM role using IAM User's access and secret keys.
    
    Parameters:
        access_key (str): IAM User's access key.
        secret_key (str): IAM User's secret key.
        role_arn (str): ARN of the IAM role you want to assume.
        session_name (str): Name of the assumed role session (optional).
    
    Returns:
        boto3.Session: A session with the assumed role credentials.
    """
    sts_client = boto3.client('sts',
                              aws_access_key_id=base64.b64decode(
                                  access_key).decode('utf-8'),
                              aws_secret_access_key=base64.b64decode(
                                  secret_key).decode('utf-8'),
                              region_name='us-west-1')
    
    # Assume the role
    assumed_role = sts_client.assume_role(
        RoleArn=role_arn,
        RoleSessionName=session_name
    )
    
    # Create a new session with the assumed role credentials
    assumed_credentials = assumed_role['Credentials']
    session = boto3.Session(
        aws_access_key_id=assumed_credentials['AccessKeyId'],
        aws_secret_access_key=assumed_credentials['SecretAccessKey'],
        aws_session_token=assumed_credentials['SessionToken']
    )
    
    return session
    
    
    def ftp_connection(HOST, USER, PASS):
    
        try:
            ftp = ftplib.FTP(source_address=())
            ftp.connect(HOST)
            ftp.login(USER, PASS)
            ftp.set_pasv(False)
            ftp.set_debuglevel(4)

        except ftplib.all_errors as ex:
            print(str(ex))
            raise

        return ftp
    
    
    def convert_to_utf8(check_file, output_file):
    
        with open(check_file, 'rb') as of:
            chardet_data = chardet.detect(of.read())
            fileencoding = chardet_data['encoding']
            print('fileencoding', fileencoding)

        if fileencoding in ['utf-8', 'ascii']:
            return {'re-encoded': False, 'encoding': fileencoding}

        else:
            with open(check_file, 'r', encoding=fileencoding) as of, \
                    open(output_file, 'w', encoding='utf-8') as cf:
                cf.write(of.read())

            return {'re-encoded': True, 'encoding': fileencoding}
    
    
    def get_log_data():
    
        # Define variables for later use:
        yesterday = dt.strftime(dt.today() - timedelta(days=1), '%y%m%d')
        # Create snapshot time
        timestamp = dt.utcnow().isoformat()
        # Where we pick up the ftp config file and its name
        config_s3_bucket_name = 'ftp-config-file'
        config_file_key = 'ftp_config.csv'
        # Where we want the data to go
        data_s3_bucket_name = 'company-data'
        # This is the name of the crawler we need to run
        crawler_name = 'HMILogs'

        # These are our AWS keys, Base64 encoded; need to improve security later
        ak = 'NoKeyForYou'
        sk = 'StillNoKeyForYou'
        role = 'arn:aws:iam::123456789:role/service-role/DataRetrieve-role-lh22tofx'

        # Step 1: Assume role to get creds
        aws_session = assume_role_with_iam_user(ak, sk, role)
    
        try:
            # Step 2: Connect to S3 and download the config file
            s3 = aws_session.client('s3',
                                    config=boto3.session
                                    .Config(signature_version='s3v4'))
            config_file_obj = s3.get_object(Bucket=config_s3_bucket_name,
                                            Key=config_file_key)
            config_file_data = config_file_obj['Body'] \
                .read().decode('utf-8').splitlines()
            config_reader = csv.DictReader(config_file_data)

            # Step 3: Loop through each row in the config file
            for row in config_reader:
                ftp_site_name = row['ftp_site_name']
                ftp_ip = row['ftp_ip_address']
                ftp_username = row['ftp_username']
                ftp_password = row['ftp_password']
                file_directory = row['ftp_log_directory']
                filename_convention = row['filename_convention']

                # Create the temporary file paths for later use
                temp_dir = tempfile.gettempdir()

                if not os.path.exists(temp_dir):
                    # Fall back to a tmp dir next to this script; os.makedirs
                    # actually creates it (mkdtemp's first positional argument
                    # is a suffix, not a path, so the old call was wrong)
                    temp_dir = os.path.join(
                        os.path.dirname(os.path.abspath(__file__)), 'tmp')
                    os.makedirs(temp_dir, exist_ok=True)

                inp_file = os.path.join(
                    temp_dir, filename_convention + yesterday + '.txt')

                temp_file = os.path.join(temp_dir, 'temp_file.tsv')

                op_file = os.path.join(
                    temp_dir, filename_convention + yesterday + '.tsv')

                # Step 4: Connect to the current FTP, Download the File,
                # Write to s3 and clean up
                # Connect to the FTP site
                conn = ftp_connection(ftp_ip, ftp_username, ftp_password)

                # Change the directory
                conn.cwd(file_directory)
                conn.sendcmd('SITE CHMOD 777 ' + filename_convention
                             + yesterday + '.txt')

                # Download the file to the local temporary directory
                with open(inp_file, 'wb') as f:
                    conn.retrbinary('RETR ' + filename_convention
                                    + yesterday + '.txt', f.write)

                s3_file = convert_to_utf8(inp_file,
                                          temp_file)

                print(s3_file['re-encoded'], s3_file['encoding'])

                # If no re-encode happened, temp_file was never written,
                # so read whichever file actually holds the UTF-8 data
                src_file = temp_file if s3_file['re-encoded'] else inp_file

                with open(src_file, 'r', encoding='utf-8') \
                    as tsv_inp, open(op_file, 'w',
                                     newline='\n') as tsv_op:

                    csv_reader = csv.reader(tsv_inp, delimiter='\t')
                    csv_writer = csv.writer(tsv_op, delimiter='\t')

                    # Read the header row (None if the file is empty)
                    headers = next(csv_reader, None)
                    csv_writer.writerow((headers or [])
                                        + ['SITE_NAME', 'SNAPSHOT'])

                    for row in csv_reader:
                        row.append('{}'.format(ftp_site_name))
                        row.append('{}'.format(timestamp))
                        csv_writer.writerow(row)
    
                # Upload the file from the local temporary directory to S3
                s3_key = '{}/dt={}/{}.csv'.format(ftp_site_name, yesterday,
                                                  filename_convention + '.tsv')
                # s3.upload_file(local_temp_file_op, Bucket=data_s3_bucket_name,
                # Key=s3_key)

                try:
                    s3.head_object(Bucket=data_s3_bucket_name,
                                   Key=s3_key)

                except botocore.exceptions.ClientError as error:
                    # botocore returns the status code as a string
                    if error.response['Error']['Code'] == '404':
                        print("Object does not exist!")

                print('ENABLE DELETE COMMAND IN THE FUTURE')
                # conn.sendcmd('DELE ' + filename_convention
                # + yesterday + '.txt')

                # Close connection, close local file, and remove
                conn.close()
                # os.remove(local_temp_directory + '\\' + local_file)
    
            # Step 5: Crawl the new data into the Table
            glue = aws_session.client('glue', region_name='us-west-1')

            if not glue.get_crawler(Name=crawler_name)['Crawler']['State'] \
                    == 'RUNNING':
                glue.start_crawler(Name=crawler_name)
                print('Crawler Started')

        except Exception as ex:
            # Print the exception type by name (the old str({...}) wrapped
            # it in a set, which is where the {'TypeError'} output came from)
            print(type(ex).__name__)
            traceback.print_exception(ex)

        return 'Function executed successfully.'
    
    
    get_log_data()