I have a list of temp files which are created in UTF-16 LE encoding. I need to merge those temp files, and the resultant file should be in UTF-16.
What I have done:
for fd in source_fds_list:
    with open(destination_url, 'ab') as destn_fd:
        shutil.copyfileobj(fd, destn_fd)
    fd.close()
This results in more than one BOM appearing in the destination file.
What if the temporary files were written in different encodings?
Is there a better solution than manually checking for a BOM with a file read?
shutil.copyfileobj() copies all data, regardless. A BOM is just data in the file; shutil is not and will not be aware of such file-format-specific details.
You can easily skip the BOM yourself but still leave the bulk of the copying to shutil.copyfileobj():
import codecs
import shutil

for fd in source_fds_list:
    with open(destination_url, 'ab') as destn_fd:
        with fd:
            # Skip the 2-byte UTF-16 LE BOM if the file starts with one
            start = fd.read(2)
            if start != codecs.BOM_UTF16_LE:
                destn_fd.write(start)
            shutil.copyfileobj(fd, destn_fd)
Because the initial 2 bytes have already been read from the source file, shutil.copyfileobj() continues from that point and copies everything else in the file, skipping the BOM. All shutil.copyfileobj() does under the hood is call data = source.read(buffersize) and destination.write(data) in a loop, anyway.
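To illustrate, the core of shutil.copyfileobj() is roughly the following read/write loop (a simplified sketch; the real implementation also accepts a buffer-size argument and a few other details). It also shows why pre-reading the BOM works: bytes consumed before the copy starts are simply never seen by the loop.

```python
import io

def copyfileobj_sketch(source, destination, buffersize=16 * 1024):
    # Roughly what shutil.copyfileobj() does: read fixed-size chunks
    # until the source is exhausted, writing each chunk out.
    while True:
        data = source.read(buffersize)
        if not data:
            break
        destination.write(data)

src = io.BytesIO(b"some bytes already partially read")
src.read(4)                   # bytes consumed before the copy...
dst = io.BytesIO()
copyfileobj_sketch(src, dst)  # ...are not copied
print(dst.getvalue())         # b" bytes already partially read"
```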
If you don't know the codecs used for the input files, you are stuck with heuristics. You can test against the various codecs BOM_* constants, but then the possibility of false positives arises: a file encoded with a codec other than UTF-* whose initial bytes just happen to look like a BOM:
import codecs
import shutil

for fd in source_fds_list:
    with open(destination_url, 'ab') as destn_fd:
        with fd:
            start = fd.read(4)
            if start not in (codecs.BOM_UTF32_LE, codecs.BOM_UTF32_BE):
                if start[:3] == codecs.BOM_UTF8:
                    # UTF-8 BOM, skip 3 bytes
                    start = start[3:]
                elif start[:2] in (codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE):
                    # UTF-16 BOM, skip 2 bytes
                    start = start[2:]
                # Not a UTF-32 BOM; write the read bytes (minus any skipped BOM)
                destn_fd.write(start)
            shutil.copyfileobj(fd, destn_fd)
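A quick sanity check of the BOM-skipping approach, using in-memory BytesIO objects as stand-ins for the temp files (the sample strings and the explicit BOM written to the merged buffer are my own additions; "utf-16-le" encoding does not emit a BOM, so one is prepended to each source to mimic the files in the question):

```python
import codecs
import io
import shutil

# Two fake UTF-16 LE "temp files", each starting with its own BOM
source_fds_list = [
    io.BytesIO(codecs.BOM_UTF16_LE + "hello ".encode("utf-16-le")),
    io.BytesIO(codecs.BOM_UTF16_LE + "world".encode("utf-16-le")),
]

merged = io.BytesIO()
merged.write(codecs.BOM_UTF16_LE)  # one BOM for the merged result
for fd in source_fds_list:
    with fd:
        # Skip each source's BOM, then let copyfileobj do the rest
        start = fd.read(2)
        if start != codecs.BOM_UTF16_LE:
            merged.write(start)
        shutil.copyfileobj(fd, merged)

data = merged.getvalue()
assert data.count(codecs.BOM_UTF16_LE) == 1   # no stray BOMs mid-file
print(data[2:].decode("utf-16-le"))           # hello world
```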