
Extract GPS coordinates from .docx file with python


I have a tedious task that I need Python's help with. Please see this Word document.


I need to extract the text and GPS coordinates from each row. There are currently over 100 coordinates across 10 .docx files. My "hefty" Python knowledge got me this far:

from docx import Document
import re

main_file = Document("D:/DOCUMENTS/Google_Link/1  Category I/1  Category I.docx")
table = main_file.tables[1]  # this is the same for every document

data = []
keys = None

for i, row in enumerate(table.rows):
    text = (cell.text for cell in row.cells)

    # The first row holds the column headers
    if i == 0:
        keys = tuple(text)
        continue

    row_data = tuple(text)
    data.append(row_data)

regexReference = re.compile(r"(C.-)\w+")
colReference = [item[1] for item in data]

listReference = filter(regexReference.match, colReference)

for i in listReference:
    print i.encode('UTF-8')

I can print the 16 reference IDs from column 2. Please show me how to print something like this for each row:

C1-20701-17-1

some site, some region

The existing CMC Office at Bariyodhala (22°40'34.3"N; 91°38'28.2"E) requires 
some repair/maintenance works including electrical wiring and electrical 
lights and appliances like ceiling fans supplies. Detail specification of 
the works are attached

x = 91°38'28.2"E
y = 22°40'34.3"N

These XY locations and descriptions will be used afterwards to create KML files that are attached to each document. I'd prefer a separate variable for each part of the above section (ref id, location, description, x and y) so that I can automate that step as well.
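For context on the KML step: KML stores coordinates as decimal degrees in longitude,latitude order, so the degrees-minutes-seconds strings would need converting first. The following is only a rough sketch of that step, written for Python 3; the names `DMS_RE`, `dms_to_decimal` and `placemark` are hypothetical, and the pattern assumes every coordinate looks exactly like the `22°40'34.3"N` example above.

```python
import re
from xml.sax.saxutils import escape

# Assumed format: degrees ° minutes ' seconds " hemisphere, e.g. 22°40'34.3"N
DMS_RE = re.compile(r"""(\d+)°(\d+)'(\d+(?:\.\d+)?)"([NSEW])""")


def dms_to_decimal(dms):
    """Convert a DMS string to signed decimal degrees (S and W are negative)."""
    match = DMS_RE.fullmatch(dms.strip())
    if match is None:
        raise ValueError("unrecognised coordinate: %r" % dms)
    degrees, minutes, seconds, hemisphere = match.groups()
    value = int(degrees) + int(minutes) / 60 + float(seconds) / 3600
    return -value if hemisphere in "SW" else value


def placemark(name, description, x, y):
    """Build one KML Placemark element; x is the easting, y the northing."""
    lon = dms_to_decimal(x)
    lat = dms_to_decimal(y)
    return (
        "<Placemark>"
        "<name>{}</name>"
        "<description>{}</description>"
        "<Point><coordinates>{:.6f},{:.6f},0</coordinates></Point>"
        "</Placemark>"
    ).format(escape(name), escape(description), lon, lat)


kml = placemark("C1-20701-17-1", "some site, some region",
                x='91°38\'28.2"E', y='22°40\'34.3"N')
print(kml)
```

A complete file would still need the surrounding `<kml>` and `<Document>` elements; this only shows how one row might become one placemark.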

demo docx


Solution

  • I'm not sure whether this works for files that follow a different pattern (note: I'm using Python 2.7.11):

    # -*- coding: utf-8 -*-
    from docx import Document
    import sys
    import os
    import re
    
    reload(sys)
    sys.setdefaultencoding('utf8')
    
    for root, dirs, files in os.walk("."):
        for name in files:
            doc_file = os.path.join(root, name)
            if doc_file.endswith('docx'):
                main_file = Document(doc_file)
                table = main_file.tables[1]  # this is same for every document
    
                data = []
                keys = None
    
                for i, row in enumerate(table.rows):
                    text = (cell.text for cell in row.cells)
    
                    if i == 0:
                        keys = tuple(text)
                        continue
    
                    row_data = tuple(text)
                    data.append(row_data)
    
                regexReference = re.compile("(C.-[0-9-]+)")
                regexCoordinate = re.compile(r'(N-(.{,12})([0-9]|\')|[0-9].{,12}N)[;, ]+(E-(.{,12})([0-9]|\')|[0-9].{,12}E)')
    
                result = []
                for item in data:
                    tmp = dict()
                    matchReference = regexReference.search(item[1])
                    matchCoordinate = regexCoordinate.search(unicode(item[2]))
                    if matchReference:
                        tmp['reference'] = matchReference.group()
                    if matchCoordinate:
                        # group(1) is the northing (latitude), group(4) the easting (longitude)
                        tmp['x'] = matchCoordinate.group(4)
                        tmp['y'] = matchCoordinate.group(1)
                    tmp['description'] = unicode(item[2])
                    tmp['location'] = unicode(item[3])
                    result.append(tmp)
    
                for rs in result:
                    if 'reference' in rs:
                        for k, v in rs.iteritems():
                            print('{} = {}'.format(k, v))
                        print
    
    # Output:
    # --------------------------------
    # y = 22°40'34.3"N
    # x = 91°38'28.2"E
    # description = The existing CMC Office at Bariyodhala (22°40'34.3"N; 91°38'28.2"E) requires some repair/maintenance works including electrical wiring and electrical lights and appliances like ceiling fans supplies. Detail specification of the works are attached.
    # reference = C1-20701-17-1
    # location = xxxxx Site, c Region
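If the coordinates always follow the degrees-minutes-seconds form shown in the question, a stricter pattern with named groups might be easier to reason about than the loose one above. This is a minimal sketch for Python 3, not a drop-in replacement for the Python 2 script: `parse_row`, `COORD_RE` and `REF_RE` are hypothetical names, and the cell order (reference, description, location) is assumed from the indexing `item[1]`, `item[2]`, `item[3]` used above.

```python
import re

# One degrees-minutes-seconds value; \x27 is the apostrophe, \x22 the double quote.
DMS = r"\d{1,3}°\d{1,2}\x27\d{1,2}(?:\.\d+)?\x22"

# Northing first, then easting, separated by ';' or ',' as in the sample row.
COORD_RE = re.compile(r"(?P<y>%s[NS])\s*[;,]\s*(?P<x>%s[EW])" % (DMS, DMS))
REF_RE = re.compile(r"C.-[0-9-]+")


def parse_row(reference_cell, description, location):
    """Return a dict with one key per field, or None if the row has no reference id."""
    ref = REF_RE.search(reference_cell)
    if ref is None:
        return None
    record = {
        "reference": ref.group(),
        "location": location.strip(),
        "description": description.strip(),
        "x": None,  # easting, stays None when no coordinate is found
        "y": None,  # northing
    }
    coord = COORD_RE.search(description)
    if coord:
        record["x"] = coord.group("x")
        record["y"] = coord.group("y")
    return record


row = parse_row(
    "C1-20701-17-1",
    'The existing CMC Office at Bariyodhala (22°40\'34.3"N; 91°38\'28.2"E) requires '
    "some repair/maintenance works",
    "some site, some region",
)
for key in ("reference", "location", "description", "x", "y"):
    print("{} = {}".format(key, row[key]))
```

Rows whose coordinates are written differently (for example with an `N-` prefix, which the original pattern also allows for) would come back with `x` and `y` set to `None`, so they can be listed and handled by hand.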