
Why is my Regex search yielding more than expected and why isn't my loop removing certain duplicates?


I'm working on a program that parses weather information. In this part of my code, I am trying to re-organise the results in order of time before continuing to append more items later on.

The time in these lines is usually the first 4 digits of any line (first 2 digits are the day and the others are the hour). The exception to this is the line that starts with 11010KT, this line is always assumed to be the first line in any weather report, and those numbers are a wind vector and NOT a time.
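As a rough illustration of the rule just described, the sketch below pulls the day and hour out of the first four digits of a line and skips the wind-vector line. The names `WIND_VECTOR`, `FIRST_TIME` and `leading_time` are made up for this example, and the wind-vector pattern assumes the plain five-digit form such as 11010KT (no gusts).

```python
import re

# Hypothetical helpers, not part of the original code.
WIND_VECTOR = re.compile(r"\s*\d{5}KT\b")  # e.g. 11010KT at the start of a line
FIRST_TIME = re.compile(r"\d{4}")          # first run of four digits: DDHH


def leading_time(line):
    """Return (day, hour) from the first four digits, or None for a wind-vector line."""
    if WIND_VECTOR.match(line):
        return None  # those digits are a wind vector, not a time
    m = FIRST_TIME.search(line)
    if m is None:
        return None
    ddhh = m.group()
    return int(ddhh[:2]), int(ddhh[2:])


print(leading_time('\nFM050200 12012KT 9999 LIGHT DRIZZLE BKN008'))    # (5, 2)
print(leading_time('\n11010KT 5000 MODERATE DRIZZLE BKN004'))          # None
print(leading_time('\nTEMPO 0501/0502 2000 MODERATE DRIZZLE BKN002'))  # (5, 1)
```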

You will see that at the start of this example I remove any line containing TEMPO, INTER or PROB, because I want those lines added to the end of the restructured list. They can be thought of as a separate list that I want organised by time in the same way as the other items.

I am trying to use regex to pull the times from the lines that remain after removing the TEMPO, INTER and PROB lines, and then sort those times. Once they are sorted, I use regex again to find each full line and build a restructured list. When that list is complete, I sort the TEMPO, INTER and PROB list the same way and append it to the new list.

I have also tried a for loop that will remove any duplicate lines added, but this seems to only remove one duplicate of the TEMPO line???
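For reference, an order-preserving de-duplication can also be written with `dict.fromkeys`, as in the sketch below (the sample list is made up from the lines in this question). Like the for loop, it only drops items that are exactly equal as strings, so a repeat that sits inside a longer string, or differs by a space or newline, survives.

```python
# Sample data assumed for illustration: one exact duplicate of the TEMPO line.
final = [
    '\nTEMPO 0501/0502 2000 MODERATE DRIZZLE BKN002',
    '\nTEMPO 0501/0502 2000 MODERATE DRIZZLE BKN002',
    '\nINTER 0502/0506 4000 SHOWERS OF MODERATE RAIN BKN008',
]

# dict keys are unique and keep insertion order (Python 3.7+),
# so this keeps the first occurrence of each string.
unique = list(dict.fromkeys(final))

print(len(unique))  # 2
```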

Can someone please help me figure this out? I am kind of new to this, thank you...

This ideally should come back looking like this:

ETA IS 0230 which is 1430 local

11010KT 5000 MODERATE DRIZZLE BKN004
FM050200 12012KT 9999 LIGHT DRIZZLE BKN008 
TEMPO 0501/0502 2000 MODERATE DRIZZLE BKN002
INTER 0502/0506 4000 SHOWERS OF MODERATE RAIN BKN008

Instead of this, I am getting repeats of the line that starts with FM050200 and then repeats of the line starting with TEMPO. It doesn't find the line starting with INTER either...

I have made a minimal reproducible example for anyone to try and help me. I will include that here:

import re

total_print = ['\nFM050200 12012KT 9999 LIGHT DRIZZLE BKN008', '\n11010KT 5000 MODERATE DRIZZLE BKN004', '\nINTER 0502/0506 4000 SHOWERS OF MODERATE RAIN BKN008', '\nTEMPO 0501/0502 2000 MODERATE DRIZZLE BKN002']

removed_lines = []
for a in total_print:  # finding and removing lines with reference to TEMPO INTER PROB
    if 'TEMPO' in a:
        total_print.remove(a)
        removed_lines.append(a)
for b in total_print:
    if 'INTER' in b:
        total_print.remove(b)
        removed_lines.append(b)
for f in total_print:
    if 'PROB' in f:
        total_print.remove(f)
        removed_lines.append(f)

list_time_in_line = []
for line in total_print: # finding the times in the remaining lines
    time_in_line = re.search(r'\d\d\d\d', line)
    list_time_in_line.append(time_in_line.group())
sorted_time_list = sorted(list_time_in_line)

removed_time_in_line = []
for g in removed_lines:  # finding the times in the lines that were originally removed
    removed_times = re.search(r'\d\d\d\d', g)
    removed_time_in_line.append(removed_times.group())
sorted_removed_time_list = sorted(removed_time_in_line)


final = []
final.append('ETA IS 0230 which is 1430 local\n')  # appending the time display
search_for_first_line = re.search(r'[\n]\d\d\d\d\dKT', ' '.join(total_print))  # searching for line that has wind vector instead of time
search_for_first_line = search_for_first_line.group()

if search_for_first_line:  # adding wind vector line so that it's the first line listed in the group
    search_for_first_line = re.search(r'%s.*' % search_for_first_line, ' '.join(total_print)).group()
    final.append('\n' + search_for_first_line)

print(sorted_time_list)  # the list of possible times found (the second item in list is the wind vector and not a time)
d = 0
for c in sorted_time_list:  # finding the whole line for the corresponding time
    print(sorted_time_list[d])
    search_for_whole_line = re.search(r'.*\w+\s*%s.*' % sorted_time_list[d], ' '.join(total_print))
    print(search_for_whole_line.group())  # it is doubling up on the 0502 time???????
    d += 1
    final.append('\n' + str(search_for_whole_line.group()))

h = 0
for i in sorted_removed_time_list:  # finding the whole line for the corresponding times from the previously removed items
    whole_line_in_removed_srch = re.search(r'.*%s.*' % sorted_removed_time_list[h], ' '.join(removed_lines))
    h += 1
    final.append('\n' + str(whole_line_in_removed_srch.group()))  # appending them

l_new = []
for item in final:  # this doesn't seem to properly remove duplicates ?????
    if item not in l_new:
        l_new.append(item)
total_print = l_new

print(' '.join(total_print))

EDIT:

I had asked this recently and got an excellent answer to my problem from @diggusbickus. I have now hit a new problem with the sorting in the answer.

Because my original question had only one type of weather line (beginning with the letters 'FM') in my data['other'], the lambda with the split() was only looking at the first item of the line [0] for the time.

data['other'] = sorted(data['other'], key=lambda x: x.split()[0])

Which is where the time is located (in previous question it was FM050200 where 05 is the day and 0200 is the time). That works very well for when there are lines beginning with FM, but I have realised that occasionally lines like this exist:

'\nBECMG 0519/0520 27007KT 9999 SHOWERS OF LIGHT RAIN SCT020 BKN030'

The time in this style of line is the FIRST 4 digits, located at index [1], in a 4-digit format instead of the 6-digit format used in FM050200. In this new line, 05 is the day and 19 is the hour (so 1900).

I need this style of line to be grouped with the FM lines, the problem is that they don't sort. I am trying to find a way to be able to sort the lines by time regardless of whether the time is on the [0] index and in 6 digit format or on the [1] index and in 4 digit format.
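One possible way to get that ordering, sketched here under the assumption that the first run of four digits in every such line is always the DDHH time, is a sort key that searches for those digits instead of relying on a fixed split() index. The function name `time_key` is invented for this example; the sample lines are the FM and BECMG lines used in this edit.

```python
import re


def time_key(line):
    """Sort key: the first four consecutive digits (DDHH), wherever they appear."""
    m = re.search(r"\d{4}", line)
    return m.group() if m else ""  # lines without digits sort first


other = [
    '\nBECMG 1315/1317 27007KT 9999 SHOWERS OF LIGHT RAIN SCT020 BKN030',
    '\nFM131200 20010KT 9999 SHOWERS OF LIGHT RAIN SCT006 BKN010',
    '\nFM131400 20010KT 9999 SHOWERS OF LIGHT RAIN SCT006 BKN010',
]

for line in sorted(other, key=time_key):
    print(line.strip())
# FM131200 20010KT 9999 SHOWERS OF LIGHT RAIN SCT006 BKN010
# FM131400 20010KT 9999 SHOWERS OF LIGHT RAIN SCT006 BKN010
# BECMG 1315/1317 27007KT 9999 SHOWERS OF LIGHT RAIN SCT020 BKN030
```

This compares the DDHH strings as text, so it ignores the minutes in FM groups and does not handle a report that runs past the end of a month. Lines with the same DDHH keep their original relative order because sorted() is stable.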

I will include a new example with a couple of small changes to the originally answered question. This new example uses different data in the total_print variable. This is a working example.

I essentially need the lines to be sorted by the FIRST 4 digits of any line, and the results should look like this:

ETA IS 0230 which is 1430 local

FM131200 20010KT 9999 SHOWERS OF LIGHT RAIN SCT006 BKN010 
FM131400 20010KT 9999 SHOWERS OF LIGHT RAIN SCT006 BKN010 
BECMG 1315/1317 27007KT 9999 SHOWERS OF LIGHT RAIN SCT020 BKN030 
TEMPO 1312/1320 4000 SHOWERS OF MODERATE RAIN BKN007

NB. The TEMPO line is supposed to stay at the end, so don't worry about that one.

Here is the example, thank you so much to anyone who helps.

import re

total_print = ['\nBECMG 1315/1317 27007KT 9999 SHOWERS OF LIGHT RAIN SCT020 BKN030', '\nFM131200 20010KT 9999 SHOWERS OF LIGHT RAIN SCT006 BKN010', '\nFM131400 20010KT 9999 SHOWERS OF LIGHT RAIN SCT006 BKN010','\nTEMPO 1312/1320 4000 SHOWERS OF MODERATE RAIN BKN007']
data = {
    'windvector': [], # if it is the first line of the TAF
    'other': [], # anything with FM or BECMG
    'tip': []  # tempo/inter/prob
}

wind_vector = re.compile(r'^\s\d{5}KT')  # raw string so \s and \d reach the regex engine unchanged
for line in total_print:
    if 'TEMPO' in line \
            or 'INTER' in line \
            or 'PROB' in line:
        key = 'tip'
    elif re.match(wind_vector, line):
        key = 'windvector'
    else:
        key = 'other'
    data[key].append(line)

final = []
data['other'] = sorted(data['other'], key=lambda x: x.split()[0])
data['tip'] = sorted(data['tip'], key=lambda x: x.split()[1])


final.append('ETA IS 0230 which is 1430 local\n')

for lst in data.values():
    for line in lst:
        final.append('\n' + line[1:])  # get rid of newline

print(' '.join(final))

Solution

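  • Just sort your data into a dict. You keep creating lists and removing items from them, which makes the code hard to follow.

    The repeats come from the whole-line searches on the joined string. In r'.*\w+\s*%s.*' the \s* can match the newline between two joined lines, so the search for 1101 returns the FM050200 line together with the wind-vector line, and a whole-string comparison never sees that as a duplicate. Likewise, r'.*0502.*' finds the TEMPO line (0501/0502) before it reaches the INTER line, so TEMPO is added twice and INTER never. Classifying each list item on its own avoids both problems. The ^ in the wind-vector pattern anchors it to the start of the item, so 12012KT in the middle of the FM line is not mistaken for a wind-vector line.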

    import re
    
    total_print = ['\nFM050200 12012KT 9999 LIGHT DRIZZLE BKN008', '\n11010KT 5000 MODERATE DRIZZLE BKN004', '\nINTER 0502/0506 4000 SHOWERS OF MODERATE RAIN BKN008', '\nTEMPO 0501/0502 2000 MODERATE DRIZZLE BKN002']
    
    data = {
        'windvector': [],
        'other': [],
        'tip': [] #tempo/inter/prob
    }
    
    wind_vector = re.compile(r'^\s\d{5}KT')  # raw string so \s and \d reach the regex engine unchanged
    for line in total_print:
        if 'TEMPO' in line \
                or 'INTER' in line \
                or 'PROB' in line:
            key='tip'
        elif re.match(wind_vector, line):
            key='windvector'
        else:
            key='other'
        data[key].append(line)
            
    data['tip']=sorted(data['tip'], key=lambda x: x.split()[1])
    print('ETA IS 0230 which is 1430 local')
    print()
    for lst in data.values():
        for line in lst:
            print(line[1:]) #get rid of newline
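To see the doubled-up lines from the question in isolation, the sketch below runs the two whole-line searches from the original code against the same joined strings. The variable names (`joined`, `removed`, `m`, `m2`) are chosen for this example only.

```python
import re

# The two lines left in total_print after the TEMPO/INTER/PROB lines were removed.
joined = ' '.join([
    '\nFM050200 12012KT 9999 LIGHT DRIZZLE BKN008',
    '\n11010KT 5000 MODERATE DRIZZLE BKN004',
])

# Search for the "time" 1101 (really the start of the wind vector).
# \s* is allowed to match the space and newline between the two lines,
# so the match begins in the FM line and runs on through the next one.
m = re.search(r'.*\w+\s*1101.*', joined)
print(repr(m.group()))

# The lines that were moved to removed_lines.
removed = ' '.join([
    '\nTEMPO 0501/0502 2000 MODERATE DRIZZLE BKN002',
    '\nINTER 0502/0506 4000 SHOWERS OF MODERATE RAIN BKN008',
])

# 0502 also appears in the TEMPO line (0501/0502), and re.search returns
# the first match, so the INTER line is never reached.
m2 = re.search(r'.*0502.*', removed)
print(m2.group().split()[0])  # TEMPO
```

Matching each list item separately, as in the dict-based version above, sidesteps both effects because no pattern can reach across into a neighbouring line.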