I have a list of temp files which are created in UTF-16 LE encoding. I need to merge those temp files, and the resultant file should be in UTF-16.
What I have done:
for fd in source_fds_list:
    with open(destination_url, 'ab') as destn_fd:
        shutil.copyfileobj(fd, destn_fd)
    fd.close()
This results in more than one BOM appearing in the destination file.
What if the temporary files were written in different encodings?
Is there a better solution than manually checking for a BOM with a file read?
shutil.copyfileobj() copies all data, regardless. A BOM is just data in the file; shutil is not and will not be aware of such file-format-specific details.
You can easily skip the BOM yourself but still leave the bulk of the copying to shutil.copyfileobj():
import codecs
import shutil

for fd in source_fds_list:
    with open(destination_url, 'ab') as destn_fd:
        with fd:
            # Skip the 2-byte UTF-16 LE BOM if the file starts with one
            start = fd.read(2)
            if start != codecs.BOM_UTF16_LE:
                destn_fd.write(start)
            shutil.copyfileobj(fd, destn_fd)
Because the initial 2 bytes have already been read from the source file, shutil.copyfileobj() continues from that point and copies everything else in the file, skipping the BOM. All shutil.copyfileobj() does under the hood is call data = source.read(buffersize) and destination.write(data) in a loop, anyway.
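To illustrate, the core of shutil.copyfileobj() is roughly the following read/write loop (a simplified sketch; the real implementation also accepts a buffer-size argument and a few other details). It also shows why pre-reading the BOM works: bytes consumed before the copy starts are simply never seen by the loop.

```python
import io

def copyfileobj_sketch(source, destination, buffersize=16 * 1024):
    # Roughly what shutil.copyfileobj() does: read fixed-size chunks
    # until the source is exhausted, writing each chunk out.
    while True:
        data = source.read(buffersize)
        if not data:
            break
        destination.write(data)

src = io.BytesIO(b"some bytes already partially read")
src.read(4)                   # bytes consumed before the copy...
dst = io.BytesIO()
copyfileobj_sketch(src, dst)  # ...are not copied
print(dst.getvalue())         # b" bytes already partially read"
```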
If you don't know the codecs used for the input files, you are stuck with heuristics. You can test against the various codecs BOM_* constants, but then the possibility of false positives arises: a file encoded with a codec other than UTF-* whose initial bytes just happen to look like a BOM:
import codecs
import shutil

for fd in source_fds_list:
    with open(destination_url, 'ab') as destn_fd:
        with fd:
            start = fd.read(4)
            if start not in (codecs.BOM_UTF32_LE, codecs.BOM_UTF32_BE):
                if start[:3] == codecs.BOM_UTF8:
                    # UTF-8 BOM, skip 3 bytes
                    start = start[3:]
                elif start[:2] in (codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE):
                    # UTF-16 BOM, skip 2 bytes
                    start = start[2:]
                # Not a UTF-32 BOM; write the read bytes (minus any skipped BOM)
                destn_fd.write(start)
            shutil.copyfileobj(fd, destn_fd)
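A quick sanity check of the BOM-skipping approach, using in-memory BytesIO objects as stand-ins for the temp files (the sample strings and the explicit BOM written to the merged buffer are my own additions; "utf-16-le" encoding does not emit a BOM, so one is prepended to each source to mimic the files in the question):

```python
import codecs
import io
import shutil

# Two fake UTF-16 LE "temp files", each starting with its own BOM
source_fds_list = [
    io.BytesIO(codecs.BOM_UTF16_LE + "hello ".encode("utf-16-le")),
    io.BytesIO(codecs.BOM_UTF16_LE + "world".encode("utf-16-le")),
]

merged = io.BytesIO()
merged.write(codecs.BOM_UTF16_LE)  # one BOM for the merged result
for fd in source_fds_list:
    with fd:
        # Skip each source's BOM, then let copyfileobj do the rest
        start = fd.read(2)
        if start != codecs.BOM_UTF16_LE:
            merged.write(start)
        shutil.copyfileobj(fd, merged)

data = merged.getvalue()
assert data.count(codecs.BOM_UTF16_LE) == 1   # no stray BOMs mid-file
print(data[2:].decode("utf-16-le"))           # hello world
```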