
Effectively turning strings into unicode for python 2.7


I'm following a tutorial on LDA and have run into a problem: the tutorial was written in Python 3 and I'm working in 2.7 (the tutorial claims to work in both). As far as I understand, in Python 2.x I need to convert strings to unicode before I can call token.isnumeric(). I lack the experience to see how to do this cleanly in the following script. Does anyone have a solution?

import os
from nltk.tokenize import RegexpTokenizer

data_dir = 'nipstxt/'
yrs = ['00', '01', '02', '03', '04', '05', '06', '07', '08', '09', '10', '11', '12']
dirs = ['nips' + yr for yr in yrs]
docs = []
for yr_dir in dirs:
    files = os.listdir(data_dir + yr_dir)
    for filen in files:
        # Note: ignoring characters that cause encoding errors.
        with open(data_dir + yr_dir + '/' + filen) as fid:
            txt = fid.read()
        docs.append(txt)

tokenizer = RegexpTokenizer(r'\w+')
for idx in range(len(docs)):
    docs[idx] = docs[idx].lower()  # Convert to lowercase.
    docs[idx] = tokenizer.tokenize(docs[idx])  # Split into words.

docs = [[token for token in doc if not token.isnumeric()] for doc in docs]

docs = [[token for token in doc if len(token) > 1] for doc in docs]

Solution

  • The generic way to convert a byte string to a Unicode string is with decode. If you know the string contains only ASCII characters (as the digits of an ordinary number do), you can omit the encoding argument; in Python 2 it defaults to ascii.

    docs = [[token for token in doc if not token.decode().isnumeric()] for doc in docs]
    

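    If there's any chance that the string contains non-ASCII characters, a plain decode() will raise UnicodeDecodeError. Passing errors='replace' instead turns each undecodable byte into the Unicode replacement character (U+FFFD), which does not count as numeric.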

    docs = [[token for token in doc if not token.decode(errors='replace').isnumeric()] for doc in docs]
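
If the same script has to run on both Python 2.7 and Python 3, the decode step can be wrapped in a small helper so that tokens that are already Unicode pass through untouched. This is only a sketch: the names to_text and drop_numeric are made up for illustration, and the utf-8 default is an assumption about how the files are encoded, not something stated in the question.

```python
# -*- coding: utf-8 -*-

def to_text(token, encoding='utf-8'):
    """Return token as a Unicode string (hypothetical helper).

    On Python 2.7, bytes is an alias for str, so plain string tokens are
    decoded. On Python 3, str tokens are already Unicode and are returned
    unchanged. The utf-8 default is an assumption about the input files.
    """
    if isinstance(token, bytes):
        # 'replace' maps undecodable bytes to U+FFFD instead of raising.
        return token.decode(encoding, 'replace')
    return token


def drop_numeric(docs):
    """Remove purely numeric tokens from each tokenized document."""
    return [[token for token in doc if not to_text(token).isnumeric()]
            for doc in docs]


# Small made-up sample: one document of byte strings, one of Unicode strings.
sample_docs = [[b'lda', b'2016', b'topic'], [u'42', u'model']]
filtered = drop_numeric(sample_docs)
```

The helper only decodes for the isnumeric() check; the tokens kept in the result retain their original type, so the rest of the script sees the same kind of strings as before.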