Ok I admit it, I'm stuck. Hope someone can help me figure this out! I'll try to explain to the best of my abilities. I have two df's. One of them has the string column and municipalities and the other df has municipalities and streets. I want to create a street list per row (for that specific municipality) so it only extract streets in the string column for that specific municipality. The code I have now kinda works but it keeps iterating over all of the municipalities, therefore extracting streets in other municipalities and adding streets to the wrong rows. I hope the code examples below make my question a little more clear.
Create dataframes:
import pandas as pd
import re
# Sample dataframe with the municipality and string column
data1 = {'municipality': ['Urk','Utrecht','Almere','Utrecht','Huizen'],
'text': ["I'm going to Plantage, Pollux and Oostvaardersdiep","Tomorrow I'm going to Hoog Catharijne",
"I'm not going to the Balijelaan","I'm not going to Socrateshof today",
"Next week I'll be going to Socrateshof"]}
df = pd.DataFrame(data1, columns = ['municipality','text'])
print(df)
Output:
municipality text
0 Urk I'm going to Plantage, Pollux and Oostvaarders...
1 Utrecht Tomorrow I'm going to Hoog Catharijne
2 Almere I'm not going to the Balijelaan
3 Utrecht I'm not going to Socrateshof today
4 Huizen Next week I'll be going to Socrateshof
# Sample dataframe with the municipality and street
data2 = {'municipality': ['Urk','Urk','Utrecht','Almere','Almere','Huizen'],
'street_name': ['Plantage','Pollux','Balijelaan','Oostvaardersdiep','Catharijne','Socrateshof']}
df2 = pd.DataFrame(data2, columns = ['municipality','street_name'])
print(df2)
Output:
municipality street_name
0 Urk Plantage
1 Urk Pollux
2 Utrecht Balijelaan
3 Almere Oostvaardersdiep
4 Almere Catharijne
5 Huizen Socrateshof
Run the function below:
# Function
street = []
def extract_street(txt):
mun_list_filter = df['municipality'] # I want the streets for this municipality
df_bag_filter_mun = df2[df2['municipality'].isin(mun_list_filter)] # Filter second df on the wanted municipality
street_list_mun = list(df_bag_filter_mun['street_name'].unique()) # Select all unique streets for the specific municipality
st = re.findall(r"\b|".join(street_list_mun), txt) # Find all the streets in the string column 'tekst'
street.append(st) # Append to empty street list
return street # As you can see it keeps iterating over all municipalities
# Call function by iterating over rows in string column
for txt in df['text']:
extract_street(txt)
# Add street list to df
df = df.assign(**{'street_match': street})
df['street_match'] = [', '.join(map(str, l)) for l in df['street_match']]
df
Output:
municipality text street_match
0 Urk I'm going to Plantage, Pollux and Oostvaardersdiep Plantage, Pollux, Oostvaardersdiep
1 Utrecht Tomorrow I'm going to Hoog Catharijne Catharijne
2 Almere I'm not going to the Balijelaan Balijelaan
3 Utrecht I'm not going to Socrateshof today Socrateshof
4 Huizen Next week I'll be going to Socrateshof Socrateshof
As you can see in the first row for municipality 'Urk' the function added the street 'Oostvaardersdiep' even though this should've only been matched if the municipality for the first row is 'Almere'. Only the last row is correct since 'Socrateshof' is in fact in the municipality 'Huizen'.
Desired result:
municipality text street_match
0 Urk I'm going to Plantage, Pollux and Oostvaardersdiep Plantage, Pollux
1 Utrecht Tomorrow I'm going to Hoog Catharijne
2 Almere I'm not going to the Balijelaan
3 Utrecht I'm not going to Socrateshof today
4 Huizen Next week I'll be going to Socrateshof Socrateshof
I know what the problem is I just don't know how to fix it. I've tried with apply/lambda but no luck either. Thanks!
Adding another answer to show a shorter/simpler way to do what you wanted. (The first one was just to fix what was not working in your code.)
Using .apply()
, you can call a modified verison of your function per row of df
and then do the checking with the street names in df2
.
def extract_street(row):
street_list_mun = df2.loc[df2['municipality'] == row['municipality'], 'street_name'].unique()
streets_regex = r'\b(' + '|'.join(street_list_mun) + r')\b'
streets_found = set(re.findall(streets_regex, row['text']))
return ', '.join(streets_found)
## or if you want this to return a list of streets
# return list(streets_found)
df['street_match'] = df.apply(extract_street, axis=1)
df
Output:
municipality text street_match
0 Urk I'm going to Plantage, Pollux and Oostvaardersdiep Plantage, Pollux
1 Utrecht Tomorrow I'm going to Hoog Catharijne
2 Almere I'm not going to the Balijelaan
3 Utrecht I'm not going to Socrateshof today
4 Huizen Next week I'll be going to Socrateshof Socrateshof
Note:
There's an issue with your regex - the join
part of the expression generates strings like Plantage\b|Pollux
. Which will give a match if (a) the last street name is at the beginning of another word or (b) if the any-except-the-last street names is at the end of another word: "I'm going to NotPlantage, Polluxsss and Oostvaardersdiep" will match for both streets, but it shouldn't. Instead, the word boundary \b
should be at ends of the list of options and with parentheses to separate them. It should generate strings like: \b(Plantage|Pollux)\b
. This won't match with "Polluxsss" or "NotPlantage". I've made that change in the code above.
I'm using set
to get a unique list of street matches. If the line was "I'm going to Pollux, Pollux, Pollux" it would haven given the result 3 times instead of just once.