Search code examples
pythonpython-2.7pandaslookupstring-search

Python Pandas: Lookup table by searching for substring


I have a dataframe with a column for app user-agents. What I need to do is to identify the particular app from this column. For example,

NewWordsWithFriendsFree/2.3 CFNetwork/672.1.15 Darwin/14.0.0 will be categorized in Words With Friends.

iPhone3,1; iPhone OS 7.1.2; com.fingerarts.sudoku2; 143441-1,24 will be Sudoku by FingerArts etc.

I will have another dataframe with the strings I need to match. For example,

Keyword                 Game 
NewWordsWithFriends     Words With Friends
com.fingerarts.sudoku   Sudoku by FingerArts

How do I do the lookup like this for a pandas dataframe? The dataframe for example is like

user    date                 user-agent
 A      2015-09-02 13:45:56  NewWordsWithFriendsFree/2.3 CFNetwork/672.1.15 Darwin/14.0.0
 B      2015-08-31 23:04:21  iPhone3,1; iPhone OS 7.1.2; com.fingerarts.sudoku2; 143441-1,24

I want a new column GameName after the lookup.


Solution

  • One possible way to achieve this would be:

    import pandas as pd                                                              
    
    # some example data
    qry = pd.DataFrame.from_dict({"Keyword": ["NewWordsWithFriends",                 
                                              "com.fingerarts.sudoku"],              
                                  "Game": ["Words With Friends",                     
                                           "Sudoku by FingerArts"]})                 
    
    df = pd.DataFrame.from_dict({"user-agent" : ["NewWordsWithFriendsFree/2.3 CFNetwork/672.1.15 Darwin/14.0.0",     
                                                 "iPhone3,1; iPhone OS 7.1.2; com.fingerarts.sudoku2; 143441-1,24"]})
    
    keywords = qry.Keyword.tolist()                                                  
    games = qry.Game.tolist()                                                        
    
    def select(x):                                                                   
        for key, game in zip(keywords, games):                                       
            if key in x:                                                             
                return game                                                          
    
    df["GameName"] = df["user-agent"].apply(select)  
    

    This will give:

    In [41]: df
    Out[41]: 
                                              user-agent              GameName
    0  NewWordsWithFriendsFree/2.3 CFNetwork/672.1.15...    Words With Friends
    1  iPhone3,1; iPhone OS 7.1.2; com.fingerarts.sud...  Sudoku by FingerArts
    

    If you need to do that for large dataset you need to test the performance of this solution and see if it is fast enough for your purpose.

    If not, perhaps optimize for example the way the strings are tested:

    Having an outer loop for all possible games and then use .apply to return the results per game per column could speed things up as it will avoid the loop over all games in each call to select() etc.

    To identify the bottlenecks you could use line_profiler (see How can I profile python code line-by-line?).