I'm supposed to normalize time statements in input, converting them to a standard format. The input statements contain an hour, possibly minutes, and a part of day (morning or evening). The part of day can be expressed multiple ways. The hour might be based on a 12-hour clock.
The output must use a 24-hour clock and "am" or "pm" for the time of day. Extra characters (such as spaces) in the time statement should be kept. Minutes shouldn't be added; if the original statement doesn't include minutes, they shouldn't appear in the result.
Sample data
#input examples:
inputs = [
"6 de la manana hdhd", #example 1
"hdhhd 06: de la manana hdhd", #example 2
"hd:00 06 : de la manana hdhd", #example 3
"hdhhd 6 de la manana hdhd", #example 4
"hdhhd 06:00 de la manana hdhd", #example 5
"hdhhd 06 : 18 de la manana hdhd", #example 6
"hdhhd 18 de la manana hdhd", #example 7
"hdhhd 18:18 de la manana hdhd", #example 8
"hdhhd 18 : 00 de la manana hdhd", #example 9
"hdhhd 19 : 19 de la noche hdhd", #example 10
"hdhhd 6 de la noche hdhd", #example 11
]
There are two cases where the hour might need to be changed.
This is my code so far, where I have managed to put together the structure of the replacements but I have not yet been able to extract the data that I will need in the process. I have put pseudocode in those parts that are not finished:
import re #library for using regular expressions
am_list = ["manana", "mañana", "mediodia", "medio dia","madrugada","amanecer"]
pm_list = ["atardecer", "tarde", "ocaso", "noche", "anochecer"]
def fix_time(input_text):
    is_am_time, is_pm_time = False, False
    hour_number_fixed, civil_time_fixed = "", ""
    re_pattern_for_am = r"\d{1,2})[\s|:]*(\d{0,2})\s*(?:de la |de el)" + am_list
    if (identification condition for am):
        #extract with re.group()
        hour_number = int() # <--- \d{1,2}
        am_or_pm = str() # <--- am_list
    re_pattern_for_pm = r"\d{1,2})[\s|:]*(\d{0,2})\s*(?:de la |de el)" + pm_list
    if (identification condition for pm):
        #extract with re.group()
        hour_number = int() # <--- \d{1,2}
        am_or_pm = str() # <--- pm_list
    if (am_or_pm == one element in am_list):
        is_am_time = True
    elif (am_or_pm == one element in pm_list):
        is_pm_time = True
    if (is_am_time == True):
        if (hour_number >= 12):
            civil_time_fixed = "pm"
        else:
            civil_time_fixed = "am"
        hour_number_fixed = str(hour_number)
    elif (is_pm_time == True):
        if (hour_number < 12):
            hour_number_fixed = str(hour_number + 12 )
        civil_time_fixed = "pm"
    #replacement process
    input_text = input_text.replace(hour_number, hour_number_fixed, 1)
    input_text = input_text.replace(am_or_pm, civil_time_fixed, 1)
    return input_text
I need the program to identify and correct the times, using the data (hour_number and am_or_pm) that it must extract from the input string with re.group(). This is what is giving me the most trouble. How can I get the regexes to capture the hour and part of day?
The correct output in each case:
"6 am hdhd" #for the example 1
"hdhhd 06: am hdhd" #for the example 2
"hd:00 06 : am hdhd" #for the example 3
"hdhhd 6 am hdhd" #for the example 4
"hdhhd 06:00 am hdhd" #for the example 5
"hdhhd 06 : 18 am hdhd" #for the example 6
"hdhhd 18 pm hdhd" #for the example 7
"hdhhd 18:18 pm hdhd" #for the example 8
"hdhhd 18 : 00 pm hdhd" #for the example 9
"hdhhd 19 : 19 pm hdhd" #for the example 10
"hdhhd 18 pm hdhd" #for the example 11
How do I do those data extractions with re.group() (or similar method) in this code?
First, note that normalizing the hour is beyond the capabilities of regular expressions, so that will need to be performed in Python. Fortunately, re.sub accepts a function to create the replacement string.
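As a minimal sketch of that feature (the add_twelve helper and the sample sentence are made up for illustration, not part of the question's code), re.sub calls the function once per match, passes it the match object, and uses the returned string as the replacement:

```python
import re

def add_twelve(match):
    # re.sub passes a re.Match object; the returned string replaces the matched text.
    hour = int(match.group(1))
    # Only hours below 12 are shifted; larger values are left alone.
    return str(hour + 12 if hour < 12 else hour)

result = re.sub(r"\b(\d{1,2})\b", add_twelve, "at 6 or at 19")
print(result)  # at 18 or at 19
```

Any arithmetic or lookup that a plain replacement string can't express can go inside such a function.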
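The sample regex has a few large issues: the first capturing group is missing its opening parenthesis, so the pattern won't compile, and the pattern string is concatenated directly with a list (am_list or pm_list), which raises a TypeError; the words need to be joined into an alternation first.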
There's also a minor issue: the pattern will fail for strings with 'de el' because there's no space between 'el' and the AM or PM word.
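The problems with the sample regex can be reproduced in isolation. The following is only a sketch: the shortened am_list and the variable names (concat_error, paren_error, fixed) are assumptions made for the demonstration.

```python
import re

am_list = ["manana", "amanecer"]  # shortened list, for illustration only

# A str and a list can't be concatenated, so building the pattern fails.
try:
    r"(\d{1,2})\s*(?:de la |de el)" + am_list
    concat_error = None
except TypeError as exc:
    concat_error = exc

# The first group has a closing parenthesis but no opening one.
try:
    re.compile(r"\d{1,2})[\s|:]*(\d{0,2})")
    paren_error = None
except re.error as exc:
    paren_error = exc

# With both problems fixed, 'de la ' matches, but 'de el' still fails
# because nothing consumes the space after 'el'.
fixed = r"(\d{1,2})[\s|:]*(\d{0,2})\s*(?:de la |de el)(?:" + "|".join(am_list) + ")"
print(bool(re.search(fixed, "6 de la manana")))    # True
print(bool(re.search(fixed, "6 de el amanecer")))  # False
```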
Note you can combine the two regexes into one, and then check whether the AM or PM subpattern was matched. An easy way to do this is to use two named groups, one for AM and one for PM, with the words and phrases for each period in the corresponding group.
The sub-expressions to match the hour and minute can also be named, for clarity of access. The time expression could also be named.
This gives the following Python to create the pattern:
am_pattern = '|'.join(am_list)
pm_pattern = '|'.join(pm_list)
time_pattern = r"(?P<time>(?P<hour>\d{1,2})(?P<minute>[\s:|]*\d{0,2}))"
pattern = rf'{time_pattern}\s*(?:de la|de el)\s(?:(?P<am>{am_pattern})|(?P<pm>{pm_pattern}))'
Evaluated (in free-spacing mode, for clarity), the regex is:
(?P<time>
    (?P<hour>\d{1,2})
    (?P<minute>[\s:|]*\d{0,2})
)
\s*(?:de la|de el)\s
(?:
    (?P<am>manana|mañana|mediodia|medio dia|madrugada|amanecer)
  |
    (?P<pm>atardecer|tarde|ocaso|noche|anochecer)
)
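To see what the named groups hold after a match, here is a small sketch that builds this first version of the pattern and inspects one string. The test sentence is made up; the lists are the ones from the question.

```python
import re

am_list = ["manana", "mañana", "mediodia", "medio dia", "madrugada", "amanecer"]
pm_list = ["atardecer", "tarde", "ocaso", "noche", "anochecer"]

am_pattern = "|".join(am_list)
pm_pattern = "|".join(pm_list)
time_pattern = r"(?P<time>(?P<hour>\d{1,2})(?P<minute>[\s:|]*\d{0,2}))"
pattern = rf"{time_pattern}\s*(?:de la|de el)\s(?:(?P<am>{am_pattern})|(?P<pm>{pm_pattern}))"

match = re.search(pattern, "hdhhd 06 : 18 de la noche hdhd")
groups = match.groupdict()

print(groups["hour"])    # 06
print(groups["minute"])  # ' : 18' (separators are kept with the minutes)
print(groups["am"])      # None, because the AM alternative did not take part
print(groups["pm"])      # noche
```

Only one of the 'am' and 'pm' groups is ever set, so checking which one is not None tells you the part of day.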
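There are a few minor improvements that can be made, such as factoring the common 'de' out of the 'de la|de el' alternation, making the minute sub-expression optional (and requiring at least one separator character when it is present), and anchoring the hour at a word boundary (otherwise \d{1,2} will match the '23' of '123').
With these changes, the regex construction becomes: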
am_pattern = '|'.join(am_list)
pm_pattern = '|'.join(pm_list)
time_pattern = r"(?P<time>(?P<hour>\b\d{1,2})(?P<minute>[\s:|]+\d{0,2})?)"
pattern = rf'{time_pattern}\s*de (?:la|el)\s(?:(?P<am>{am_pattern})|(?P<pm>{pm_pattern}))'
Evaluated:
(?P<time>
    (?P<hour>\b\d{1,2})
    (?P<minute>[\s:|]+\d{0,2})?
)
\s*de (?:la|el)\s
(?:
    (?P<am>manana|mañana|mediodia|medio dia|madrugada|amanecer)
  |
    (?P<pm>atardecer|tarde|ocaso|noche|anochecer)
)
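The effect of the word boundary can be checked with a cut-down pattern. This is a sketch only: the names loose and strict, the shortened word list and the sample text are assumptions for the demonstration.

```python
import re

tail = r"\s*de (?:la|el)\s(?:manana|noche)"
loose = re.compile(r"(?P<hour>\d{1,2})" + tail)     # no word boundary
strict = re.compile(r"(?P<hour>\b\d{1,2})" + tail)  # hour must start a word

text = "sala 123 de la manana"
loose_match = loose.search(text)
strict_match = strict.search(text)

print(loose_match.group("hour"))  # 23, taken from the middle of '123'
print(strict_match)               # None, since '123' is not a one- or two-digit hour
```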
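With the above regex to extract the necessary information, the replacement function has a few tasks: work out which part-of-day group matched, normalize the hour to a 24-hour clock (switching to 'pm' where the hour requires it), and rebuild the time text followed by 'am' or 'pm':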
def matched_group(match, groups, default='', throw=False):
    """
    Return the name of the first named group from 'groups' that had a match.
    """
    for group in groups:
        if match.group(group):
            return group
    if throw:
        raise KeyError(f'no group found from ({groups})')
    return default # could also throw
def repl_time(match):
    meridiem = matched_group(match, ['am', 'pm'])
    time, hour, minute = match.group('time', 'hour', 'minute')
    minute = minute or ''  # the optional minute group is None when it didn't match
    hour = int(hour)
    if hour >= 12:
        # already a 24-hour value (or noon), so it can only be 'pm'
        meridiem = 'pm'
    elif 'pm' == meridiem: # hour < 12
        hour += 12
        time = str(hour) + minute
    return time.rstrip() + ' ' + meridiem
reTime = re.compile(pattern)
reTime.sub(repl_time, input_text)
Applying the above to the samples produces the desired results:
samples = [
"6 de la manana hdhd",
"hdhhd 06: de la manana hdhd",
"hd:00 06 : de la manana hdhd",
"hdhhd 6 de la manana hdhd",
"hdhhd 06:00 de la manana hdhd",
"hdhhd 06 : 18 de la manana hdhd",
"hdhhd 18 de la manana hdhd",
"hdhhd 18:18 de la manana hdhd",
"hdhhd 18 : 00 de la manana hdhd",
"hdhhd 19 : 19 de la noche hdhd",
"hdhhd 6 de la noche hdhd",
]
[reTime.sub(repl_time, sample) for sample in samples]
Results:
[
'6 am hdhd',
'hdhhd 06: am hdhd',
'hd:00 06 : am hdhd',
'hdhhd 6 am hdhd',
'hdhhd 06:00 am hdhd',
'hdhhd 06 : 18 am hdhd',
'hdhhd 18 pm hdhd',
'hdhhd 18:18 pm hdhd',
'hdhhd 18 : 00 pm hdhd',
'hdhhd 19 : 19 pm hdhd',
'hdhhd 18 pm hdhd'
]
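For convenience, the pieces can be put together into one script. Treat it as a sketch under a few assumptions: the name re_time and the inline am/pm check (used here instead of matched_group) are choices made for brevity, and an hour of exactly 12 is treated as 'pm', following the rule in the question's code.

```python
import re

am_list = ["manana", "mañana", "mediodia", "medio dia", "madrugada", "amanecer"]
pm_list = ["atardecer", "tarde", "ocaso", "noche", "anochecer"]

am_pattern = "|".join(am_list)
pm_pattern = "|".join(pm_list)
time_pattern = r"(?P<time>(?P<hour>\b\d{1,2})(?P<minute>[\s:|]+\d{0,2})?)"
re_time = re.compile(
    rf"{time_pattern}\s*de (?:la|el)\s(?:(?P<am>{am_pattern})|(?P<pm>{pm_pattern}))"
)

def repl_time(match):
    time = match.group("time")
    minute = match.group("minute") or ""  # None when the optional group is absent
    hour = int(match.group("hour"))
    meridiem = "am" if match.group("am") else "pm"
    if hour >= 12:
        # Already a 24-hour value (or noon): force 'pm', keep the text as written.
        meridiem = "pm"
    elif meridiem == "pm":
        # 12-hour evening time: shift the hour, keep separators and minutes.
        hour += 12
        time = str(hour) + minute
    return time.rstrip() + " " + meridiem

def fix_time(input_text):
    return re_time.sub(repl_time, input_text)

samples = [
    "6 de la manana hdhd",
    "hdhhd 06: de la manana hdhd",
    "hd:00 06 : de la manana hdhd",
    "hdhhd 6 de la manana hdhd",
    "hdhhd 06:00 de la manana hdhd",
    "hdhhd 06 : 18 de la manana hdhd",
    "hdhhd 18 de la manana hdhd",
    "hdhhd 18:18 de la manana hdhd",
    "hdhhd 18 : 00 de la manana hdhd",
    "hdhhd 19 : 19 de la noche hdhd",
    "hdhhd 6 de la noche hdhd",
]
expected = [
    "6 am hdhd",
    "hdhhd 06: am hdhd",
    "hd:00 06 : am hdhd",
    "hdhhd 6 am hdhd",
    "hdhhd 06:00 am hdhd",
    "hdhhd 06 : 18 am hdhd",
    "hdhhd 18 pm hdhd",
    "hdhhd 18:18 pm hdhd",
    "hdhhd 18 : 00 pm hdhd",
    "hdhhd 19 : 19 pm hdhd",
    "hdhhd 18 pm hdhd",
]
results = [fix_time(sample) for sample in samples]
print(results == expected)  # True
```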