Search code examples
pythonpandasstringpython-re

Keyword categorization from strings in a new column in pandas


This is not the best approach but this what I did so far:

I have this example df:

df = pd.DataFrame({
                  'City': ['I lived Los Angeles', 'I visited London and Toronto','the best one is Toronto', 'business hub is in New York',' Mexico city is stunning']
                   })

df

gives:

    City
0   I lived Los Angeles
1   I visited London and Toronto
2   the best one is Toronto
3   business hub is in New York
4   Mexico city is stunning

I am trying to match (case insensitive) city names from a nested dic and create a new column with the country name with int values for statistical purposes.

So, here is my nested dic as a reference for countries and cities:

country = { 'US': ['New York','Los Angeles','San Diego'],
         'CA': ['Montreal','Toronto','Manitoba'],
         'UK': ['London','Liverpool','Manchester']      
       }

and I created a function that should look for the city from the df and match it with the dic, then create a column with the country name:

def get_country(x):
    count = 0
    for k,v in country.items():
        for y in v:
            if y.lower() in x:
                df[k] = count + 1
            else:
                return None
            

then applied it to df:

df.City.apply(lambda x: get_country(x.lower()))

I got the following output:

    City                              US
0   I lived Los Angeles               1
1   I visited London and Toronto      1
2   the best one is Toronto           1
3   business hub is in New York       1
4   Mexico city is stunning           1

Expected output:

    City                              US    CA   UK
0   I lived Los Angeles               1      0    0
1   I visited London and Toronto      0      1    1
2   the best one is Toronto           0      1    0
3   business hub is in New York       1      0    0
4   Mexico city is stunning           0      0    0

Solution

  • Here is a solution based on your function. I changed the name of the variables to be more readable and easy to follow.

    df = pd.DataFrame({
        'City': ['I lived Los Angeles', 
                 'I visited London and Toronto',
                 'the best one is Toronto', 
                 'business hub is in New York',
                 ' Mexico city is stunning']
    })
    
    
    country_cities = { 
        'US': ['New York','Los Angeles','San Diego'],
        'CA': ['Montreal','Toronto','Manitoba'],
        'UK': ['London','Liverpool','Manchester']      
    }
    
    def get_country(text):
        text = text.lower()
        count = 0
        country_counts = dict.fromkeys(country_cities, 0)
        
        for country, cities in country_cities.items():
            for city in cities:
                if city.lower() in text:
                    country_counts[country] += 1 
                    
        return pd.Series(country_counts)
    
    df = df.join(df.City.apply(get_country))
    

    Output:

                               City  US  CA  UK
    0           I lived Los Angeles   1   0   0
    1  I visited London and Toronto   0   1   1
    2       the best one is Toronto   0   1   0
    3   business hub is in New York   1   0   0
    4       Mexico city is stunning   0   0   0
    

    Solution based on Series.str.count

    A simpler solution is using Series.str.count to count the occurences of the following regex pattern city1|city2|etc for each country (the pattern matches city1 or city2 or etc). Using the same setup as above:

    country_patterns = {country: '|'.join(cities) for country, cities in country_cities.items()}
    
    for country, pat in country_patterns.items():
        df[country] = df['City'].str.count(pat)
    

    Why your solution doesn't work?

    if y.lower() in x:
                   df[k] = count + 1
               else:
                   return None
    

    The reason your function doesn't produce the right output is that you are returning None if a city is not found in the text: the remaining countries and cities are not checked, because the return statement automatically exits the function.

    What is happening is that only US cities are checked, and the line df[k] = 1 (in this case k = 'US') creates an entire column named k filled with the value 1. It's not creating a single value for that row, it creates or modifies the full column. When using apply you want to change a single row or value (the input of function), so don't change directly the main DataFrame inside the function.