Tags: python, unicode, utf-8, ftp

FTP tab-delimited text file unable to save as UTF-8


First - no, I cannot change the FTP settings. The server is entirely locked down because it is embedded in a device and extremely old. It doesn't support PASV, TLS, or anything a more modern server would. The software is WS_FTP 4.8 (as far as I can tell); I can't even find a release date for that version, but it is probably somewhere around 20 years old. I have contacted the company that owns these devices to see if they will do the right thing and ship a firmware update with a better FTP server, but I have not received a response.

Issuing PASV gives a 502, so I know for sure the server doesn't support it; I am not even sure that has anything to do with this issue. I think it comes down to whatever the underlying OS on this device is.
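
The client-side workaround for the missing PASV support is just to force active mode, which is what the connection helper further down does. A minimal sketch, using the variable names from the script below:

        ftp = ftplib.FTP()
        ftp.connect(ftp_ip)
        ftp.login(ftp_username, ftp_password)
        ftp.set_pasv(False)  # no PASV support, so use active (PORT) transfers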

FTP login messages:

Connection established, waiting for welcome message...
Status: Insecure server, it does not support FTP over TLS.
Status: Server does not support non-ASCII characters.

(That last status line, as far as I can tell, is about filenames on the control connection - the server not advertising the UTF8 feature - not about the contents of the files.) Here are the different things I have tried to fix this:

        with open(local_temp_file, 'w', encoding='UTF-8', errors='replace') as local_file:
            conn.retrbinary('RETR ' + filename_convention
                            + yesterday + '.txt', local_file.write)

FTP Logs:

*resp* '200 Type set to I.'
*resp* '200 PORT command successful.'
*cmd* 'RETR Data Log Trend_Ops_Data_Log_230804.txt'
*resp* '150 Opening BINARY mode data connection for Data Log Trend_Ops_Data_Log_230804.txt.'

Traceback:

{'TypeError'}
Traceback (most recent call last):
  File "c:\users\justin\onedrive\documents\epic_cleantec_work\ftp log retriever\batch_data_get.py", line 123, in get_log_data
    conn.retrbinary('RETR ' + filename_convention
  File "D:\Anaconda\Lib\ftplib.py", line 441, in retrbinary
    callback(data)
TypeError: write() argument must be str, not bytes

Ok cool - str, not bytes: retrbinary hands raw bytes to the callback, but the file is open in text mode. So, switch to retrlines:

        with open(local_temp_file, 'w', encoding='UTF-8', errors='replace') as local_file:
            conn.retrlines('RETR ' + filename_convention
                           + yesterday + '.txt', local_file.write)

FTP logs:

*resp* '200 Type set to A.'
*resp* '200 PORT command successful.'
*cmd* 'RETR Data Log Trend_Ops_Data_Log_230804.txt'
*resp* '150 Opening ASCII mode data connection for Data Log Trend_Ops_Data_Log_230804.txt.'

Traceback:

{'UnicodeDecodeError'}
Traceback (most recent call last):
  File "c:\users\justin\onedrive\documents\epic_cleantec_work\ftp log retriever\batch_data_get.py", line 123, in get_log_data
    conn.retrlines('RETR ' + filename_convention
  File "D:\Anaconda\Lib\ftplib.py", line 465, in retrlines
    line = fp.readline(self.maxline + 1)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "<frozen codecs>", line 322, in decode
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xff in position 0: invalid start byte
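
Worth pausing on that 0xff in position 0: a UTF-16 LE file starts with the byte order mark FF FE, so once the file has been pulled down in plain binary mode, peeking at the first two bytes points straight at the answer. A minimal sketch, assuming local_temp_file holds the raw download:

        with open(local_temp_file, 'rb') as f:
            head = f.read(2)
        if head == b'\xff\xfe':
            print('UTF-16 LE BOM - this is not a UTF-8 file')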

Now, I have an FTP client, I have downloaded these files successfully with it, and my code that sends to S3 works. The problem is that I don't know what the encoding actually is.

Opening the file with OpenOffice Calc just says "Unicode".

I tried

import chardet

def detect_encoding(file):
    # Feed chardet line by line until it is confident of the encoding
    detector = chardet.universaldetector.UniversalDetector()
    with open(file, "rb") as f:
        for line in f:
            detector.feed(line)
            if detector.done:
                break
        detector.close()
    return detector.result

and then f = open(local_temp_file, 'wb')

        conn.retrbinary('RETR ' + filename_convention
                        + yesterday + '.txt', f.write)
        f.close()
        f.encode('utf-8')
        print(detect_encoding(f))

Traceback:

{'AttributeError'}
Traceback (most recent call last):
  File "c:\users\justin\onedrive\documents\epic_cleantec_work\ftp log retriever\batch_data_get.py", line 139, in get_log_data
    f.encode('utf-8')
    ^^^^^^^^
AttributeError: '_io.BufferedWriter' object has no attribute 'encode'

Right - .encode is a str method; a BufferedWriter just writes bytes, so there is nothing to encode there. I also tried the above function with

        f = open(local_temp_file, 'wb')

        conn.retrbinary('RETR ' + filename_convention
                        + yesterday + '.txt', f.write)
        f.close()

Traceback:

{'TypeError'}
Traceback (most recent call last):
  File "c:\users\justin\onedrive\documents\epic_cleantec_work\ftp log retriever\batch_data_get.py", line 140, in get_log_data
    print(detect_encoding(f))
          ^^^^^^^^^^^^^^^^^^
  File "c:\users\justin\onedrive\documents\epic_cleantec_work\ftp log retriever\batch_data_get.py", line 69, in detect_encoding
    with open(file, "rb") as f:
         ^^^^^^^^^^^^^^^^
TypeError: expected str, bytes or os.PathLike object, not BufferedWriter
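
The fix for this last one is simply to pass the path rather than the file handle, since detect_encoding opens the file itself:

        print(detect_encoding(local_temp_file))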

Ultimately, here is why this matters: these tab-delimited text files get dropped into S3 and crawled by a Glue crawler, and currently the crawler cannot read the columns properly. It sees only a single column, because the start of the file shows ?? in diamonds (the Unicode replacement character), which tells me the file is encoded in some way that the crawler doesn't recognize as a CSV with a '\t' delimiter (yes, I set a classifier for this - see the sketch below). As far as I can tell, Glue's built-in CSV classification expects UTF-8, which would explain why a file in another encoding never splits into columns. I also need to append two fields (columns) to the tab-delimited file to give the site name and a timestamp for each row of data, which I can't do while the content is an undecoded blob.
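
For anyone reproducing the classifier piece, a tab-delimiter CSV classifier can be declared through boto3 roughly as follows - the classifier name is hypothetical and this is only a sketch, not necessarily how mine is configured:

        import boto3

        glue = boto3.client('glue', region_name='us-west-1')
        glue.create_classifier(CsvClassifier={
            'Name': 'tab-delimited-logs',  # hypothetical classifier name
            'Delimiter': '\t',             # tab, to match these files
            'QuoteSymbol': '"',
            'ContainsHeader': 'PRESENT',
        })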

I am likely missing something simple, but I have scoured Google and a lot of SO posts, and I cannot seem to find a solution.


Solution

  • Large thanks to Martin, TDelaney, and Ed_ for making me think about a few things in a different way. First, I downloaded the file and opened it in Notepad on Windows, when I remembered it shows the encoding in the lower right corner. TDelaney wins: UTF-16. Ed_ for pointing out that I wasn't actually sending the file, just the string of the file path. And Martin for making sure I did not stay focused on the FTP as the issue. Cheers!

    Here is the outcome of their push:

    def convert_to_utf8(check_file, output_file):

        # Let chardet inspect the raw bytes (it spots the UTF-16 BOM here)
        with open(check_file, 'rb') as of:
            chardet_data = chardet.detect(of.read())
            fileencoding = chardet_data['encoding']
            print('fileencoding', fileencoding)

        # Already UTF-8 (or ASCII, a strict subset of it) - nothing to do
        if fileencoding in ['utf-8', 'ascii']:
            return {'re-encoded': False, 'encoding': fileencoding}

        # Otherwise decode with the detected encoding and rewrite as UTF-8
        else:
            with open(check_file, 'r', encoding=fileencoding) as of, \
                    open(output_file, 'w', encoding='utf-8') as cf:
                cf.write(of.read())

            return {'re-encoded': True, 'encoding': fileencoding}
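
    For what it's worth, chardet keys off the byte order mark here, so even a tiny UTF-16 sample is detected with full confidence. A quick standalone sanity check:

        import chardet

        sample = 'col1\tcol2\n'.encode('utf-16')  # includes the FF FE BOM
        print(chardet.detect(sample))
        # prints roughly {'encoding': 'UTF-16', 'confidence': 1.0, 'language': ''}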
    

    .....

            # Create the temporary file paths for later use
            temp_dir = tempfile.gettempdir()
    
            if not os.path.exists(temp_dir):
                # Fall back to a tmp dir next to this script; os.makedirs
                # actually creates it (mkdtemp's first positional argument
                # is a suffix, not a path, so the old call was wrong)
                temp_dir = os.path.join(
                    os.path.dirname(os.path.abspath(__file__)), 'tmp')
                os.makedirs(temp_dir, exist_ok=True)
    
            inp_file = os.path.join(
                temp_dir, filename_convention + yesterday + '.txt')
    
            temp_file = os.path.join(temp_dir, 'temp_file.tsv')
    
            op_file = os.path.join(
                temp_dir, filename_convention + yesterday + '.tsv')
    
            # Step 4: Connect to the current FTP, Download the File,
            # Write to s3 and clean up
            # Connect to the FTP site
            conn = ftp_connection(ftp_ip, ftp_username, ftp_password)
    
            # Change the directory
            conn.cwd(file_directory)
            conn.sendcmd('SITE CHMOD 777 ' + filename_convention
                         + yesterday + '.txt')
    
            # Download the file to the local temporary directory
            with open(inp_file, 'wb') as f:
                conn.retrbinary('RETR ' + filename_convention
                                + yesterday + '.txt', f.write)
    
            s3_file = convert_to_utf8(inp_file,
                                      temp_file)
    
            print(s3_file['re-encoded'], s3_file['encoding'])
    
            # If no re-encode happened, temp_file was never written, so
            # read whichever file actually holds the UTF-8 data
            src_file = temp_file if s3_file['re-encoded'] else inp_file

            with open(src_file, 'r', encoding='utf-8') \
                as tsv_inp, open(op_file, 'w',
                                 newline='\n') as tsv_op:
    
                csv_reader = csv.reader(tsv_inp, delimiter='\t')
                csv_writer = csv.writer(tsv_op, delimiter='\t')
    
                # Read the header row (None if the file is empty)
                headers = next(csv_reader, None)
                csv_writer.writerow((headers or [])
                                    + ['SITE_NAME', 'SNAPSHOT'])
    
                for row in csv_reader:
                    row.append('{}'.format(ftp_site_name))
                    row.append('{}'.format(timestamp))
                    csv_writer.writerow(row)
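
    One way to confirm the result is to point chardet at the finished file:

        with open(op_file, 'rb') as f:
            print(chardet.detect(f.read()))  # expect utf-8, or ascii if no high characters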
    

    I have confirmed that the output is now UTF-8. Yes, I know the CSV append is not working yet.

    You could probably make the converter more flexible if you wanted: call it convert_encoding and pass in the desired encoding. Something like

    def convert_encoding(check_file, output_file, desiredencoding):

        with open(check_file, 'rb') as of:
            chardet_data = chardet.detect(of.read())
            fileencoding = chardet_data['encoding']
            print('fileencoding', fileencoding)

        # Compare normalized names (needs import codecs): chardet reports
        # e.g. 'UTF-16' while the caller might pass 'utf-16'
        if codecs.lookup(fileencoding).name \
                == codecs.lookup(desiredencoding).name:
            return {'re-encoded': False, 'encoding': fileencoding}

        else:
            with open(check_file, 'r', encoding=fileencoding) as of, \
                    open(output_file, 'w', encoding=desiredencoding) as cf:
                cf.write(of.read())

            return {'re-encoded': True, 'encoding': fileencoding}
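
    Called the same way, just with the target encoding named explicitly, e.g. convert_encoding(inp_file, temp_file, 'utf-8'). Note it needs import codecs for the name normalization.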
    

    Now I'm off to add file cleanup and logging instead of printing. If anyone is interested, here is the code with some anonymization (yes, it needs more work, but I am self-taught and still trying to understand main and classes):

    from datetime import datetime as dt
    from datetime import timedelta
    import base64
    import boto3
    import botocore
    import chardet
    import csv
    import datetime
    import ftplib
    import os
    import tempfile
    import traceback
    
    
    def assume_role_with_iam_user(access_key, secret_key, role_arn,
                                  session_name='AssumedSession'):
    """
    Assumes an IAM role using IAM User's access and secret keys.
    
    Parameters:
        access_key (str): IAM User's access key.
        secret_key (str): IAM User's secret key.
        role_arn (str): ARN of the IAM role you want to assume.
        session_name (str): Name of the assumed role session (optional).
    
    Returns:
        boto3.Session: A session with the assumed role credentials.
    """
    sts_client = boto3.client('sts',
                              aws_access_key_id=base64.b64decode(
                                  access_key).decode('utf-8'),
                              aws_secret_access_key=base64.b64decode(
                                  secret_key).decode('utf-8'),
                              region_name='us-west-1')
    
    # Assume the role
    assumed_role = sts_client.assume_role(
        RoleArn=role_arn,
        RoleSessionName=session_name
    )
    
    # Create a new session with the assumed role credentials
    assumed_credentials = assumed_role['Credentials']
    session = boto3.Session(
        aws_access_key_id=assumed_credentials['AccessKeyId'],
        aws_secret_access_key=assumed_credentials['SecretAccessKey'],
        aws_session_token=assumed_credentials['SessionToken']
    )
    
    return session
    
    
    def ftp_connection(HOST, USER, PASS):
    
        try:
            ftp = ftplib.FTP(source_address=())
            ftp.connect(HOST)
            ftp.login(USER, PASS)
            ftp.set_pasv(False)
            ftp.set_debuglevel(4)

        except ftplib.all_errors as ex:
            print(str(ex))
            raise

        return ftp
    
    
    def convert_to_utf8(check_file, output_file):
    
        with open(check_file, 'rb') as of:
            chardet_data = chardet.detect(of.read())
            fileencoding = chardet_data['encoding']
            print('fileencoding', fileencoding)

        if fileencoding in ['utf-8', 'ascii']:
            return {'re-encoded': False, 'encoding': fileencoding}

        else:
            with open(check_file, 'r', encoding=fileencoding) as of, \
                    open(output_file, 'w', encoding='utf-8') as cf:
                cf.write(of.read())

            return {'re-encoded': True, 'encoding': fileencoding}
    
    
    def get_log_data():
    
        # Define variables for later use:
        yesterday = dt.strftime(dt.today() - timedelta(days=1), '%y%m%d')
        # Create snapshot time
        timestamp = dt.utcnow().isoformat()
        # Where we pick up the ftp config file and its name
        config_s3_bucket_name = 'ftp-config-file'
        config_file_key = 'ftp_config.csv'
        # Where we want the data to go
        data_s3_bucket_name = 'company-data'
        # This is the name of the crawler we need to run
        crawler_name = 'HMILogs'

        # These are our AWS keys, Base64 encoded; need to improve security later
        ak = 'NoKeyForYou'
        sk = 'StillNoKeyForYou'
        role = 'arn:aws:iam::123456789:role/service-role/DataRetrieve-role-lh22tofx'

        # Step 1: Assume role to get creds
        aws_session = assume_role_with_iam_user(ak, sk, role)
    
        try:
            # Step 2: Connect to S3 and download the config file
            s3 = aws_session.client('s3',
                                    config=boto3.session
                                    .Config(signature_version='s3v4'))
            config_file_obj = s3.get_object(Bucket=config_s3_bucket_name,
                                            Key=config_file_key)
            config_file_data = config_file_obj['Body'] \
                .read().decode('utf-8').splitlines()
            config_reader = csv.DictReader(config_file_data)

            # Step 3: Loop through each row in the config file
            for row in config_reader:
                ftp_site_name = row['ftp_site_name']
                ftp_ip = row['ftp_ip_address']
                ftp_username = row['ftp_username']
                ftp_password = row['ftp_password']
                file_directory = row['ftp_log_directory']
                filename_convention = row['filename_convention']

                # Create the temporary file paths for later use
                temp_dir = tempfile.gettempdir()

                if not os.path.exists(temp_dir):
                    # Fall back to a tmp dir next to this script; os.makedirs
                    # actually creates it (mkdtemp's first positional argument
                    # is a suffix, not a path, so the old call was wrong)
                    temp_dir = os.path.join(
                        os.path.dirname(os.path.abspath(__file__)), 'tmp')
                    os.makedirs(temp_dir, exist_ok=True)

                inp_file = os.path.join(
                    temp_dir, filename_convention + yesterday + '.txt')

                temp_file = os.path.join(temp_dir, 'temp_file.tsv')

                op_file = os.path.join(
                    temp_dir, filename_convention + yesterday + '.tsv')

                # Step 4: Connect to the current FTP, Download the File,
                # Write to s3 and clean up
                # Connect to the FTP site
                conn = ftp_connection(ftp_ip, ftp_username, ftp_password)

                # Change the directory
                conn.cwd(file_directory)
                conn.sendcmd('SITE CHMOD 777 ' + filename_convention
                             + yesterday + '.txt')

                # Download the file to the local temporary directory
                with open(inp_file, 'wb') as f:
                    conn.retrbinary('RETR ' + filename_convention
                                    + yesterday + '.txt', f.write)

                s3_file = convert_to_utf8(inp_file,
                                          temp_file)

                print(s3_file['re-encoded'], s3_file['encoding'])

                # If no re-encode happened, temp_file was never written,
                # so read whichever file actually holds the UTF-8 data
                src_file = temp_file if s3_file['re-encoded'] else inp_file

                with open(src_file, 'r', encoding='utf-8') \
                    as tsv_inp, open(op_file, 'w',
                                     newline='\n') as tsv_op:

                    csv_reader = csv.reader(tsv_inp, delimiter='\t')
                    csv_writer = csv.writer(tsv_op, delimiter='\t')

                    # Read the header row (None if the file is empty)
                    headers = next(csv_reader, None)
                    csv_writer.writerow((headers or [])
                                        + ['SITE_NAME', 'SNAPSHOT'])

                    for row in csv_reader:
                        row.append('{}'.format(ftp_site_name))
                        row.append('{}'.format(timestamp))
                        csv_writer.writerow(row)
    
                # Upload the file from the local temporary directory to S3
                s3_key = '{}/dt={}/{}.csv'.format(ftp_site_name, yesterday,
                                                  filename_convention + '.tsv')
                # s3.upload_file(local_temp_file_op, Bucket=data_s3_bucket_name,
                # Key=s3_key)

                try:
                    s3.head_object(Bucket=data_s3_bucket_name,
                                   Key=s3_key)

                except botocore.exceptions.ClientError as error:
                    # botocore returns the status code as a string
                    if error.response['Error']['Code'] == '404':
                        print("Object does not exist!")

                print('ENABLE DELETE COMMAND IN THE FUTURE')
                # conn.sendcmd('DELE ' + filename_convention
                # + yesterday + '.txt')

                # Close connection, close local file, and remove
                conn.close()
                # os.remove(local_temp_directory + '\\' + local_file)
    
            # Step 5: Crawl the new data into the Table
            glue = aws_session.client('glue', region_name='us-west-1')

            if not glue.get_crawler(Name=crawler_name)['Crawler']['State'] \
                    == 'RUNNING':
                glue.start_crawler(Name=crawler_name)
                print('Crawler Started')

        except Exception as ex:
            # Print the exception type by name (the old str({...}) wrapped
            # it in a set, which is where the {'TypeError'} output came from)
            print(type(ex).__name__)
            traceback.print_exception(ex)

        return 'Function executed successfully.'
    
    
    get_log_data()