
Merging multiple text files into one and related problems


I'm using Windows 7 and Python 3.4.

I have several multi-line text files (all in Persian) and I want to merge them into one under one condition: each line of the output file must contain the whole text of one input file. That means if there are nine input files, the output file must have exactly nine lines, each line containing the text of a single file. I wrote this:

import os
os.chdir(r'C:\Dir')  # raw string so the backslash is not treated as an escape
with open('test.txt', 'w', encoding='utf8') as OutFile:
    with open('news01.txt', 'r', encoding='utf8') as InFile:
        while True:
            _Line = InFile.readline()
            if len(_Line) == 0:
                break
            else:
                OutFile.write(_Line)

It worked for that one file, but in the output that file's text still spans more than one line, and the output also contains disturbing character sequences like `&amp;` and `&nbsp;`, even though the source files don't contain any of them. I also have the other files: news02.txt, news03.txt, news04.txt ... news09.txt.

Considering all these:

  1. How can I correct my code so that it reads all the files one after another, putting each file's text on a single line of the output?
  2. How can I clean up these unfamiliar and strange characters, or prevent them from appearing in my final text?

Solution

  • Here is an example that will do the merging portion of your question:

    def merge_file(infile, outfile, separator=""):
        # Join every line of one input file into a single line of the output file.
        print(separator.join(line.strip("\n") for line in infile), file=outfile)


    def merge_files(paths, outpath, separator=""):
        # Open everything as UTF-8, since the source files are Persian text.
        with open(outpath, 'w', encoding='utf8') as outfile:
            for path in paths:
                with open(path, encoding='utf8') as infile:
                    merge_file(infile, outfile, separator)
    

    Example use:

    merge_files([r"C:\file1.txt", r"C:\file2.txt"], r"C:\output.txt")
    

    Note this makes the rather large assumption that the contents of 'infile' can fit into memory. That is reasonable for most text files, but possibly quite unreasonable otherwise. If your text files will be very large, you can use this alternate merge_file implementation:

    def merge_file(infile, outfile, separator=""):
        # Stream line by line instead of building the whole joined string,
        # writing the separator only between lines to match the join version.
        first = True
        for line in infile:
            if not first:
                outfile.write(separator)
            outfile.write(line.strip("\n"))
            first = False
        outfile.write("\n")
    

    It's slower, but shouldn't run into memory problems.
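
  • As for your second question: sequences like `&amp;` and `&nbsp;` are HTML character entities, which suggests the news files were saved from web pages. Assuming that is the case, a minimal sketch that decodes them while merging, using the standard library's `html.unescape` (available since Python 3.4):

    ```python
    import html

    def merge_file_clean(infile, outfile, separator=""):
        # Decode HTML entities (e.g. "&amp;" -> "&", "&nbsp;" -> a non-breaking
        # space) in each line, then join everything onto one output line.
        cleaned = (html.unescape(line.strip("\n")) for line in infile)
        print(separator.join(cleaned), file=outfile)
    ```

    If the entities turn out to be something else (for example, a wrong encoding rather than literal HTML escapes), this won't help; in that case check what the raw bytes of the source files actually contain.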