Search code examples
pythonregexpython-2.7character-encoding

How to find word after match containing Swedish characters in python 2.7


For a few days i have been struggling to get a function to work. I want to search for the word "Ort : " ( ort = city in English) and get the word after that. Works great with words without Swedish åäö. It doesn't matter if i read a line from a file written windows machine or i create a file with vim. It the "Ort" contains åäö the search comes back empty. I have tried many types of encoding and sometimes i get error but not the results i want. A text can look like this "Brand i byggnad Ort : Örebro alla ute" and the funtion looks like this

#!/usr/bin/env python
# -*- coding: utf-8 -*-
__author__ = 'stefan'
import re
import codecs
import findloc
from findloc import findloc
# tried different ways open file
#testfil = open('extra.flt.test', 'r')
#testfil =  codecs.open('/medianas/html/extra.flt.hist', 'r', '1250')
testfil =  codecs.open('extra.flt.klar', 'r', 'latin1')
#testfil =  codecs.open('/medianas/html/pocsaglog.flt', 'r', '1250')
keyword = 'Ort :'
for line in testfil:
    line = line.decode('utf8')
# Find word after Ort :
    ort = re.search(r'\Ort : (\w+)', line)
# Find word after Adr :
    adr = re.search(r'\Adr : (\w+)', line)
    if adr:
        print adr.group(1)
        adress = adr.group(1)
        cord = findloc(adress)
        lat = (cord[0])
        lng = (cord[1])
    if ort:
        print ort.group(1)
        stad = ort.group(1)
        cord = findloc(stad)
        lat = (cord[0])
        lng = (cord[1])
testfil.close()

I hope someone can help me or point me in the right direction.


Solution

  • I have taken the liberty to cut down your example according to https://stackoverflow.com/help/mcve

    For your problem, try this:

    #!/usr/bin/env python
    # -*- coding: utf-8 -*-
    
    import re
    
    line="Brand i byggnad Ort : Örebro alla ute"
    
    # Find word after Ort :
    ort = re.search('Ort : (.*)', line)
    print ort.groups()[0].split(' ')[0]