I have a dataframe with a column for app user-agents. What I need to do is to identify the particular app from this column. For example,
NewWordsWithFriendsFree/2.3 CFNetwork/672.1.15 Darwin/14.0.0
will be categorized in Words With Friends
.
iPhone3,1; iPhone OS 7.1.2; com.fingerarts.sudoku2; 143441-1,24 will be Sudoku by FingerArts etc.
I will have another dataframe with the strings I need to match. For example,
Keyword Game
NewWordsWithFriends Words With Friends
com.fingerarts.sudoku Sudoku by FingerArts
How do I do the lookup like this for a pandas dataframe? The dataframe for example is like
user date user-agent
A 2015-09-02 13:45:56 NewWordsWithFriendsFree/2.3 CFNetwork/672.1.15 Darwin/14.0.0
B 2015-08-31 23:04:21 iPhone3,1; iPhone OS 7.1.2; com.fingerarts.sudoku2; 143441-1,24
I want a new column GameName
after the lookup.
One possible way to achieve this would be:
import pandas as pd
# some example data
qry = pd.DataFrame.from_dict({"Keyword": ["NewWordsWithFriends",
"com.fingerarts.sudoku"],
"Game": ["Words With Friends",
"Sudoku by FingerArts"]})
df = pd.DataFrame.from_dict({"user-agent" : ["NewWordsWithFriendsFree/2.3 CFNetwork/672.1.15 Darwin/14.0.0",
"iPhone3,1; iPhone OS 7.1.2; com.fingerarts.sudoku2; 143441-1,24"]})
keywords = qry.Keyword.tolist()
games = qry.Game.tolist()
def select(x):
for key, game in zip(keywords, games):
if key in x:
return game
df["GameName"] = df["user-agent"].apply(select)
This will give:
In [41]: df
Out[41]:
user-agent GameName
0 NewWordsWithFriendsFree/2.3 CFNetwork/672.1.15... Words With Friends
1 iPhone3,1; iPhone OS 7.1.2; com.fingerarts.sud... Sudoku by FingerArts
If you need to do that for large dataset you need to test the performance of this solution and see if it is fast enough for your purpose.
If not, perhaps optimize for example the way the strings are tested:
Having an outer loop for all possible games and then use .apply
to return the results per game per column could speed things up as it will avoid the loop over all games in each call to select()
etc.
To identify the bottlenecks you could use line_profiler
(see How can I profile python code line-by-line?).