Tags: python, date-parsing, dateparser

Parsing dates from OCRed files using dateparser library


I want to extract dates from OCR images using the dateparser lib.

import dateparser
data = []
listOfPages = glob.glob(r"C:/Users/name/folder/test/*.tif")
for entry in listOfPages:
    text1 = pytesseract.image_to_string(
            Image.open(entry), lang="deu"
        )
    text = re.sub(r'\n',' ', text1)     
    date1 = re.compile(r'(Dresden(\.|,|\s+)?)(.*)', flags = re.DOTALL | re.MULTILINE)
    date = date1.search(text)
    if date:
        dates = dateparser.parse(date.group(3), date_formats=['%d %m %Y'], languages=['de'], settings={'STRICT_PARSING': True})
        
    else:
        dates = None
        if dates == None:
            dates = dateparser.parse(date.group(3), date_formats=['%d %B %Y'], locale = 'de', settings={'STRICT_PARSING': True})
        else:
            dates = None

    data.append([text, dates])
    
df0 = pd.DataFrame(data, columns =['raw_text', 'dates'])
print(df0)

Why am I getting the error NameError: name 'dates' is not defined?

Update: the error is now TypeError: Input type must be str

updated sample tif


Solution

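  • The problem is that your date is a match object, not a string. dateparser.parse accepts only a str (hence TypeError: Input type must be str), and the matched text is available through date.group(...). Also, I am not sure dateparser.parse does what you need: as far as I know, it is meant to parse a string that is itself a date, not to find dates inside a longer text. I'd recommend the datefinder package to extract dates from text.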
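To make the match object versus string distinction concrete, here is a minimal sketch using only the standard library. The sample text and the simplified pattern are made up for illustration and are not taken from the question's image.

```python
import re

# Hypothetical OCR output, used only for illustration.
text = "Sächsischer Landtag\nDresden, 17. Oktober 1995"

match = re.search(r"Dresden,\s*(.*)", text)

# re.search returns a match object (or None when nothing matches), not a string.
print(isinstance(match, str))  # False

# The captured text is a plain string; this is what a date parser expects.
captured = match.group(1) if match else None
print(repr(captured))          # '17. Oktober 1995'
```

Passing captured (and not match) to the parser is what avoids the "Input type must be str" error.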

    This is the regex I'd use:

    \bDresden(?:[.,]|\s+)?(.*)
    

    The pattern matches Dresden at the start of a word (\b is a word boundary). (?:[.,]|\s+)? is an optional non-capturing group matching a comma, a period, or one or more whitespace characters. (.*) then captures any zero or more characters into Group 1 (with re.DOTALL, . matches line breaks, too).
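The behaviour of this pattern can be checked on a small string. The sketch below uses a made-up sample (not the real OCR output) to show what Group 1 holds with and without re.DOTALL, and what \b does.

```python
import re

# Hypothetical OCR text; the real pages will differ.
sample = "Landtag\nDresden, 17. Oktober 1995\nAntrag der Fraktion"

pattern = r"\bDresden(?:[.,]|\s+)?(.*)"

# Without re.DOTALL, . stops at the first line break.
one_line = re.search(pattern, sample)
print(repr(one_line.group(1)))

# With re.DOTALL, . also matches "\n", so the rest of the text is captured.
rest = re.search(pattern, sample, re.DOTALL)
print(repr(rest.group(1)))

# \b requires a word boundary before "Dresden", so there is no match
# when "Dresden" appears in the middle of a word.
inside_word = re.search(pattern, "NeuDresden 1995")
print(inside_word)
```

Note that the optional group consumes only the comma here, so the captured text starts with a space; strip it if the parser you use is sensitive to that.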

    Here is a Python snippet that seems to yield the expected matches:

    import glob
    import re

    import datefinder
    import pandas as pd
    import pytesseract
    from PIL import Image

    imgpath = r'1.tif'
    data = []
    listOfPages = glob.glob(r"C:/Users/name/folder/test/*.tif")
    # Test with the sample image only; remove this line to process the whole folder
    listOfPages = [imgpath]
    for entry in listOfPages:
        # OCR the page as German text
        text = pytesseract.image_to_string(Image.open(entry), lang="deu")

        dates = []
        # Capture everything after "Dresden"; re.DOTALL lets . match line breaks
        date = re.search(r'\bDresden(?:[.,]|\s+)?(.*)', text, re.DOTALL)
        if date:
            # date.group(1) is a str; find_dates yields datetime objects
            dates = [t.strftime("%d %B %Y") for t in datefinder.find_dates(date.group(1))]

        data.append([text, dates])

    df0 = pd.DataFrame(data, columns=['raw_text', 'dates'])
    print(df0)
    

    With your sample image, I get:

                                                raw_text                               dates
    0  Sächsischer Landtag DRUCKSACHE , 1972\n2. Wahl...  [17 October 1995, 18 October 1995]
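
If you would rather keep the two-format fallback from the question, note that there the second attempt sits in the else branch, where date is None, so date.group(3) cannot work. One way to express "try each format in turn" is sketched below. It uses datetime.strptime from the standard library as a stand-in for dateparser.parse; the helper name parse_first_match and the second format string are hypothetical choices for illustration.

```python
from datetime import datetime

def parse_first_match(candidate, formats):
    """Return the first successful parse of candidate, or None."""
    for fmt in formats:
        try:
            return datetime.strptime(candidate.strip(), fmt)
        except ValueError:
            continue  # this format did not fit; try the next one
    return None

# Assumed formats: the first is from the question, the second is illustrative.
formats = ["%d %m %Y", "%d.%m.%Y"]

print(parse_first_match(" 17.10.1995", formats))   # 1995-10-17 00:00:00
print(parse_first_match("no date here", formats))  # None
```

The loop runs only when there is a string to parse, so the "no match" case and the "no format fits" case both end in None instead of an exception.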