Tags: python, encoding, replace, data-processing, digraphs

Python: how to get rid of non-ASCII characters read from a file


I am using Python to process a long list of data that looks like this:

[screenshot of the data]

The digraphs are probably due to encoding problems. (I am not sure whether these characters will be preserved on this site.)

29/07/2016 04:00:12 0.125143    

Now, when I read such a file into a script using something like open and readlines, I get the following error:

SyntaxError: EOL while scanning string literal

I know (or can look up) how to use replace and the regex functions, but I cannot apply them in my script. The biggest problem is that wherever I include or read such a strange character, an error occurs, pointing at the very line where it is read. So I cannot do anything with them.


Solution

  • I found that re.findall works. (I am sorry that I did not have time to test the other methods; the job has lost its significance, and I had even forgotten about this question.)

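    import re

    def extract_numbers(str_i):
        # Capture the digits of the date, the time and the value;
        # \D* skips whatever non-digit characters sit between the fields.
        pat = r"(\d+)/(\d+)/(\d+)\D*(\d+):(\d+):(\d+)\D*(\d+)\.(\d+)"
        match_h = re.findall(pat, str_i)
        # Raises IndexError if the line does not match the pattern
        return match_h[0]

    # ....
    # `f` is the handle of the file in question
    lines = f.readlines()
    for l in lines:
        ls_f = extract_numbers(l)
        # process them....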
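
For anyone who wants to actually remove the strange characters rather than only skip over them, here is a minimal sketch of one way it could be done. It is not what I tested at the time. It assumes the stray characters are non-ASCII bytes (the sample line uses a UTF-8 non-breaking space, which is only a guess at what was in the original file), and the names strip_non_ascii, parse_line and parse_file are made up for illustration. The file is read in binary mode so that no decoding happens while reading, and each run of non-ASCII bytes is replaced by a single space before the regex is applied.

```python
import re

# Same pattern as in the answer, compiled once.
PATTERN = re.compile(r"(\d+)/(\d+)/(\d+)\D*(\d+):(\d+):(\d+)\D*(\d+)\.(\d+)")

# Any run of bytes outside the 7-bit ASCII range.
NON_ASCII = re.compile(rb"[^\x00-\x7f]+")


def strip_non_ascii(raw: bytes) -> str:
    """Replace each run of non-ASCII bytes with one space, then decode as ASCII."""
    return NON_ASCII.sub(b" ", raw).decode("ascii")


def parse_line(raw: bytes):
    """Return the eight digit groups of a line, or None if it does not match."""
    match = PATTERN.search(strip_non_ascii(raw))
    return match.groups() if match else None


def parse_file(path):
    """Yield the parsed fields of every matching line in the file at `path`."""
    with open(path, "rb") as f:  # binary mode: nothing is decoded while reading
        for raw in f:
            fields = parse_line(raw)
            if fields is not None:
                yield fields


# Hypothetical sample: fields separated by UTF-8 non-breaking spaces (0xC2 0xA0).
sample = b"29/07/2016\xc2\xa004:00:12\xc2\xa00.125143\r\n"
print(parse_line(sample))
# ('29', '07', '2016', '04', '00', '12', '0', '125143')
```

The bytes are replaced by a space instead of being deleted on purpose: if the separator between 2016 and 04 were simply dropped, the digits would run together and the pattern would split them in the wrong place.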