Handle unicode in luigi


I have several UTF-8 encoded text files. I am building a data flow with luigi, and I want to read the files one by one into unicode strings, clean them, and finally write them to new UTF-8 files. The problem is that in the run method of the CleanText class I can't get unicode to work with luigi.LocalTarget. Any ideas would be appreciated!

Just as a side note, I need to use unicode in order to handle accented characters in a standardized manner. Here is my code:

import inspect  # needed by InputText.output below
import luigi
import os
import re

class InputText(luigi.ExternalTask):
    """
    Checks which inputs exist
    """
    filename = luigi.Parameter()

    def output(self):
        """
        Outputs a single LocalTarget
        """

        # The directory containing this file
        root = os.path.dirname(os.path.abspath(inspect.getfile(inspect.currentframe()))) + "/"
        return luigi.LocalTarget(root + self.filename)

class CleanText(luigi.Task):
    """docstring for CleanText"""
    input_dir = luigi.Parameter()
    clean_dir = luigi.Parameter()

    def requires(self):
        return [ InputText(self.input_dir + '/' + filename)
                for filename in os.listdir(self.input_dir) ]    

    def run(self):
        for inp, outp in zip(self.input(), self.output()):
            fi = inp.open('r')
            fo = outp.open('w')
            txt = fi.read().lower()#.decode('utf-8') ### <-- This doesn't work
            #txt = unicode(txt, 'utf-8') ### <-- This doesn't work either
            txt = self.clean_text(txt)
            print txt.decode('utf-8')[:100]
            print txt[:100]
            fo.write(txt.encode('utf-8'))
            fi.close()
            fo.close()

    def output(self):
        # return luigi.LocalTarget(self.clean_dir + '/' + 'prueba.txt')
        return [ luigi.LocalTarget(self.clean_dir + '/' + filename)
                for filename in os.listdir(self.input_dir) ]

    def clean_text(self, d):
        '''d must be a unicode string'''
        d = re.sub(u'[^a-z0-9áéíóúñäëïöü]', ' ', d)
        d = re.sub(' +', ' ', d)
        d = re.sub(' ([^ ]{1,3} )+', ' ', d)
        d = re.sub(' [^ ]*(.)\\1{2,}[^ ]* ', ' ', d)
        return d

Solution

  • I had a similar problem writing and then reading a unicode file with luigi.

    I found this GitHub issue, https://github.com/spotify/luigi/issues/790, about the MixedUnicodeBytesFormat in the luigi.format module. Reading the source, I also found a UTF8 format. You can pass a format parameter to a Target instance.

    import luigi
    from luigi.format import UTF8
    
    luigi.LocalTarget('/path/to/data.csv', format=UTF8)
    

    This can go in a task's output(self) method, since that method returns a Target. I think you can also use luigi.file.LocalFileSystem with the specific format.

    Hope that helps.