Java/Grails - PrettyTime NLP: is it possible to split off the non-date part?


I am using PrettyTime NLP to find dates in a list of strings.

Example

ABC High School March 5, 2016
XYZ High School 08/20/2016 Gym

When I parse using PrettyTimeNLP, it gives me a list of dates in this format.

Sat Aug 20 10:05:27 EDT 2016

My question is whether it is possible to parse the string and then split it before and after the date, so that I end up with

string1 = 'XYZ High School'
string2 = '08/20/2016'
string3 = 'Gym'

I know I can use a regex to do the job, but the example here is a simple one. My document will be 1-10 pages long and will contain dates in various formats.

Any examples of how to manipulate PrettyTime will be appreciated.


Solution

  • Each DateGroup returned by PrettyTimeParser.parseSyntax() contains some of the information needed to answer your question: the text that was recognised as a date and its position in the input. The rest of the information can be determined from the original text.

    @GrabResolver(name='sonatype-snapshots', root='https://oss.sonatype.org/content/repositories/snapshots/')
    @Grab('org.ocpsoft.prettytime:prettytime-nlp:4.0.1.Final')
    
    import org.ocpsoft.prettytime.nlp.PrettyTimeParser
    
    def list = [
        'ABC High School March 5, 2016',
        'XYZ High School 08/20/2016 Gym'
    ]
    
    def parser = new PrettyTimeParser()
    
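    // Pair each line with the first DateGroup PrettyTime finds in it.
    // Note: head() throws NoSuchElementException if a line contains no date.
    list.collect {
        [rawText: it, dateGroup: parser.parseSyntax(it).head()]
    }.collect {
        // Index range covering the text before the matched date
        def before = 0..<it.dateGroup.position
        // Index range covering the text after the matched date
        def after = (it.dateGroup.position + it.dateGroup.text.size())..<it.rawText.size()
    
        [
            before: it.rawText[before].trim(),
            date: it.dateGroup.dates.head(),   // the parsed java.util.Date
            dateString: it.dateGroup.text,     // the date exactly as written in the input
            after: it.rawText[after].trim()
        ]
    }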
    

    NOTE: Don't use the @GrabResolver/@Grab annotations in Grails; you should already have the dependencies set up in your build.

    How it works

    The example above takes the entire original text, the position at which PrettyTime found the date, and the text that was parsed into a date, and uses them to create two ranges: one for the text before the date and another for the text after it. These two ranges are then applied to the original text to extract the three components. OK... four, since I added the parsed Date as well. The output looks like this:

    [
        [
            before:ABC High School, 
            date:Sat Mar 05 11:45:56 EST 2016, 
            dateString:March 5, 2016, 
            after:
        ], 
        [
            before:XYZ High School, 
            date:Sat Aug 20 11:45:56 EDT 2016, 
            dateString:08/20/2016, 
            after:Gym
        ]
    ]
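
For anyone working in plain Java rather than Groovy, the range arithmetic described under "How it works" can be sketched without any PrettyTime dependency. This is only an illustration under stated assumptions: DateSplitter and Parts are hypothetical names made up for this example, and the position argument is assumed to be a zero-based character offset, which is how the Groovy answer above treats DateGroup.getPosition(). In real use the position and matched text would come from DateGroup.getPosition() and DateGroup.getText(); here indexOf stands in for the parser.

```java
import java.util.Objects;

/** Hypothetical helper (not part of PrettyTime): splits a line around a matched date. */
final class DateSplitter {

    /** Simple holder for the three text parts. */
    static final class Parts {
        final String before;
        final String dateString;
        final String after;

        Parts(String before, String dateString, String after) {
            this.before = before;
            this.dateString = dateString;
            this.after = after;
        }
    }

    /**
     * @param rawText  the full original line
     * @param position assumed zero-based offset in rawText where the date text starts
     * @param dateText the exact substring that was recognised as a date
     */
    static Parts split(String rawText, int position, String dateText) {
        Objects.requireNonNull(rawText, "rawText");
        Objects.requireNonNull(dateText, "dateText");
        int end = position + dateText.length();
        if (position < 0 || end > rawText.length()) {
            throw new IllegalArgumentException("match lies outside the text");
        }
        // Everything before the match, and everything after it, minus padding.
        String before = rawText.substring(0, position).trim();
        String after = rawText.substring(end).trim();
        return new Parts(before, dateText, after);
    }

    public static void main(String[] args) {
        String line = "XYZ High School 08/20/2016 Gym";
        String dateText = "08/20/2016";
        // indexOf is a stand-in for the position a date parser would report.
        Parts parts = split(line, line.indexOf(dateText), dateText);
        System.out.println(parts.before);      // XYZ High School
        System.out.println(parts.dateString);  // 08/20/2016
        System.out.println(parts.after);       // Gym
    }
}
```

For a multi-page document, the same split could be applied once per DateGroup, since parseSyntax() returns a list of groups rather than a single match. Lines with no date would need to be skipped before splitting.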