Tags: python, regex, python-2.7, replace, data-cleaning

Splitting string before name of month with regex


I have a bunch of lines with random text, and at the end of each line, a timestamp. I am trying to split these lines right before the timestamp.

Current output:

Yes, I'd say so. Nov 08, 2014 UTC
Hell yes! Oct 01, 2014 UTC 
Anbefalt som bare det, løp og kjøp. Sep 16, 2014 UTC
Etc.

Desired output (by "tab" I mean the actual whitespace):

Yes, I'd say so. <tab> Nov 08, 2014 UTC
Hell yes! <tab> Oct 01, 2014 UTC
Anbefalt som bare det, løp og kjøp. <tab> Sep 16, 2014 UTC
Etc.

So far I have used "replace" to place a tab character right before the month. Like this:

my_string.replace("May ", "\tMay ").replace("Apr ", "\tApr ").replace("Mar ", "\tMar ").replace("Feb ", "\tFeb ") etc. (incomplete code)

This works fairly well, except when the random text involves the name of a month, e.g. "I bought it last may, great stuff". As the date is formatted in such a specific way I'd like to improve on this with regex and wildcards, if possible. Is there a way to place a tab before these dates? As you can see above, the dates are formatted as follows:

[Three-letter abbreviation of the month] [two-digit day] [,] [four-digit year] [UTC]

E.g.

Oct 31, 2014 UTC

Pardon the amateurish code and approach, I am an absolute regex n00b. I have looked around for answers here on SO, but I've come up short. I hope someone can help!


Solution

  • You should be able to do this with a single regex for all months:

    import re
    
    lines = [
        "Yes, I'd say so. Nov 08, 2014 UTC",
        "Hell yes! Oct 01, 2014 UTC"
    ]
    
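    for ln in lines:
        # Insert a tab before the trailing "Mon DD, YYYY UTC" timestamp.
        # print(...) with a single argument works in both Python 2.7 and 3.
        print(re.sub(r'(\w+\s\d{2}, \d{4} UTC)$', r'\t\1', ln))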
    

    This prints:

    Yes, I'd say so.    Nov 08, 2014 UTC
    Hell yes!   Oct 01, 2014 UTC
    

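    Here is how it works. The parentheses in the first argument form a capture group: whatever the pattern matches inside them (a word, a space, a two-digit day, a comma, a four-digit year and the literal "UTC", anchored to the end of the line by $) is available in the replacement as \1. The second argument, r'\t\1', is the text that replaces the match.

    In your case the match is replaced with itself (\1), preceded by a tab character (\t). Because the pattern only matches at the end of the line, a month name elsewhere in the text is left alone.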
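
    If you want something stricter, here is a minimal sketch rather than a definitive implementation. It assumes the timestamps always use English three-letter month abbreviations, and it also swallows the whitespace around the timestamp so that only the tab separates the text from the date. The names MONTHS, TIMESTAMP_RE, tab_before_timestamp and split_timestamp are made up for this example.

```python
import re

# Assumption: timestamps always use English three-letter month abbreviations.
MONTHS = "Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec"

# Optional whitespace, then "Mon DD, YYYY UTC" at the very end of the line.
# Trailing whitespace (including a newline) is tolerated and dropped.
TIMESTAMP_RE = re.compile(r"\s*\b((?:%s) \d{2}, \d{4} UTC)\s*$" % MONTHS)


def tab_before_timestamp(line):
    """Return the line with a single tab between the text and its trailing timestamp."""
    return TIMESTAMP_RE.sub(r"\t\1", line)


def split_timestamp(line):
    """Return (text, timestamp); timestamp is None if the line has no trailing timestamp."""
    match = TIMESTAMP_RE.search(line)
    if match is None:
        return line, None
    return line[:match.start()], match.group(1)


if __name__ == "__main__":
    for ln in ["Yes, I'd say so. Nov 08, 2014 UTC",
               "I bought it last may, great stuff. May 03, 2014 UTC"]:
        print(tab_before_timestamp(ln))
```

    Spelling out the months keeps the pattern from matching some other word that happens to be followed by a date-like string, and split_timestamp may be handier than a tab if you want the two parts as separate values.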