Keyword categorization from strings in a new column in pandas

This is not the best approach but this what I did so far:

I have this example df:

df = pd.DataFrame({
                  'City': ['I lived Los Angeles', 'I visited London and Toronto','the best one is Toronto', 'business hub is in New York',' Mexico city is stunning']
                   })

df

gives:

    City
0   I lived Los Angeles
1   I visited London and Toronto
2   the best one is Toronto
3   business hub is in New York
4   Mexico city is stunning

I am trying to match (case insensitive) city names from a nested dic and create a new column with the country name with int values for statistical purposes.

So, here is my nested dic as a reference for countries and cities:

country = { 'US': ['New York','Los Angeles','San Diego'],
         'CA': ['Montreal','Toronto','Manitoba'],
         'UK': ['London','Liverpool','Manchester']      
       }

and I created a function that should look for the city from the df and match it with the dic, then create a column with the country name:

def get_country(x):
    count = 0
    for k,v in country.items():
        for y in v:
            if y.lower() in x:
                df[k] = count + 1
            else:
                return None

then applied it to df:

df.City.apply(lambda x: get_country(x.lower()))

I got the following output:

    City                              US
0   I lived Los Angeles               1
1   I visited London and Toronto      1
2   the best one is Toronto           1
3   business hub is in New York       1
4   Mexico city is stunning           1

Expected output:

    City                              US    CA   UK
0   I lived Los Angeles               1      0    0
1   I visited London and Toronto      0      1    1
2   the best one is Toronto           0      1    0
3   business hub is in New York       1      0    0
4   Mexico city is stunning           0      0    0

Solution

Here is a solution based on your function. I changed the name of the variables to be more readable and easy to follow.

df = pd.DataFrame({
    'City': ['I lived Los Angeles', 
             'I visited London and Toronto',
             'the best one is Toronto', 
             'business hub is in New York',
             ' Mexico city is stunning']
})


country_cities = { 
    'US': ['New York','Los Angeles','San Diego'],
    'CA': ['Montreal','Toronto','Manitoba'],
    'UK': ['London','Liverpool','Manchester']      
}

def get_country(text):
    text = text.lower()
    count = 0
    country_counts = dict.fromkeys(country_cities, 0)
    
    for country, cities in country_cities.items():
        for city in cities:
            if city.lower() in text:
                country_counts[country] += 1 
                
    return pd.Series(country_counts)

df = df.join(df.City.apply(get_country))

Output:

                           City  US  CA  UK
0           I lived Los Angeles   1   0   0
1  I visited London and Toronto   0   1   1
2       the best one is Toronto   0   1   0
3   business hub is in New York   1   0   0
4       Mexico city is stunning   0   0   0

Solution based on Series.str.count

A simpler solution is using Series.str.count to count the occurences of the following regex pattern city1|city2|etc for each country (the pattern matches city1 or city2 or etc). Using the same setup as above:

country_patterns = {country: '|'.join(cities) for country, cities in country_cities.items()}

for country, pat in country_patterns.items():
    df[country] = df['City'].str.count(pat)

Why your solution doesn't work?

if y.lower() in x:
               df[k] = count + 1
           else:
               return None

The reason your function doesn't produce the right output is that you are returning None if a city is not found in the text: the remaining countries and cities are not checked, because the return statement automatically exits the function.

What is happening is that only US cities are checked, and the line df[k] = 1 (in this case k = 'US') creates an entire column named k filled with the value 1. It's not creating a single value for that row, it creates or modifies the full column. When using apply you want to change a single row or value (the input of function), so don't change directly the main DataFrame inside the function.