
Parsing String to Dates - Java


This is the problem:

I have some .csv files with trip information, and the dates appear as free-text strings (one line per trip):

  • "All Mondays from January-May and October-December. All days from June To September"
  • "All Fridays from February to June"
  • "Monday, Friday and Saturday and Sunday from 10 January to 30 April"
  • "from 01 of November to 30 April. All days except fridays from 2 to 24 of november and sunday from 2 to 30 of december"
  • "All sundays from 02 december to 28 april"
  • "5, 12, 20 of march, 11, 18 of april, 2, 16, 30 of may, 6, 13, 27 june"
  • "All saturdays from February to June, and from September to December"
  • "1 to 17 of december, 1 to 31 of january"
  • "All mondays from February to november"

I must parse the strings into dates and store them in an array for each trip.

The problem is that I don't know how to do it. Even my university teachers told me that they don't know how to do it :S. I can't find or create a pattern that works with SimpleDateFormat: http://docs.oracle.com/javase/6/docs/api/java/text/SimpleDateFormat.html

After parsing them, I have to search for all trips between two dates.

But how? How can I parse them? Is it possible?


Solution

  • This requires Natural Language Processing (NLP); see Wikipedia for an overview: http://en.wikipedia.org/wiki/Natural_language_processing

    Your problem as stated is very hard. There are many ways of representing a single date, and your examples include ranges of dates and formulae for generating dates. It sounds as if you have a limited subset of language - frequent use of "all", "from", etc.

    If you are in control of the language (i.e. these are being generated by humans who comply with your documentation) then you have a chance of formalising it (although it will take a lot of work - months). If you are not in charge of it, then every time a new phrase appears you will have to add it to the specs.

    I suggest you go through the file and look for stock phrases such as "All [weekdayname]s [from | between | until | before]" or "in [January | February ...]". Then substitute a formal pattern for each of these in the phrases. If you find this covers all the cases, you may be able to extract the particular phrases. But if you have relative expressions like "next Tuesday", which depend on when they were written, it will be much harder.
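
    As a rough illustration of the stock-phrase idea, here is a minimal sketch that recognises only one pattern, "All [weekdayname]s from [month] to [month]", and expands it into concrete dates. The class name StockPhraseParser, the method expand, and the year parameter are all hypothetical: the strings in the question carry no year, so the caller has to supply one. It also assumes Java 8 or later (java.time), full English weekday and month names, and a range that stays within one calendar year. Anything else returns an empty list, so you can see which lines still need a new rule.

```java
import java.time.DayOfWeek;
import java.time.LocalDate;
import java.time.Month;
import java.time.YearMonth;
import java.time.temporal.TemporalAdjusters;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: handles a single stock phrase, nothing more.
public class StockPhraseParser {

    // Matches e.g. "All Fridays from February to June", ignoring case.
    private static final Pattern ALL_WEEKDAYS_FROM_TO = Pattern.compile(
            "all\\s+(\\w+?)s\\s+from\\s+(\\w+)\\s+to\\s+(\\w+)",
            Pattern.CASE_INSENSITIVE);

    // Returns every matching weekday in the month range for the given year,
    // or an empty list if the phrase is not of the supported form.
    public static List<LocalDate> expand(String phrase, int year) {
        List<LocalDate> dates = new ArrayList<>();
        Matcher m = ALL_WEEKDAYS_FROM_TO.matcher(phrase.trim());
        if (!m.matches()) {
            return dates;
        }

        DayOfWeek day;
        Month first;
        Month last;
        try {
            day = DayOfWeek.valueOf(m.group(1).toUpperCase(Locale.ROOT));
            first = Month.valueOf(m.group(2).toUpperCase(Locale.ROOT));
            last = Month.valueOf(m.group(3).toUpperCase(Locale.ROOT));
        } catch (IllegalArgumentException e) {
            // Unknown weekday or month name, e.g. "All days from June to September".
            return dates;
        }

        // Assumption: ranges that wrap into the next year are not handled here.
        if (last.getValue() < first.getValue()) {
            return dates;
        }

        LocalDate start = LocalDate.of(year, first, 1)
                .with(TemporalAdjusters.firstInMonth(day));
        LocalDate end = YearMonth.of(year, last).atEndOfMonth();
        for (LocalDate d = start; !d.isAfter(end); d = d.plusWeeks(1)) {
            dates.add(d);
        }
        return dates;
    }
}
```

    For example, StockPhraseParser.expand("All Fridays from February to June", 2013) yields the Fridays from 2013-02-01 through 2013-06-28. Once each trip has such a list, searching between two dates is a matter of checking whether any date in the list falls inside the interval. Each further phrase shape in the file (explicit day lists, "except" clauses, ranges with day numbers) would need its own rule, which is where the effort goes.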