We have an apache_log parser. We tried several ways to check each bot against a list of bots (bot_list), but without success. We also tried comparing two lists, but the bot value is not a list.
What we want to achieve is that each bot is first checked against bot_list, so that only the bots that are not in bot_list come through.
log = apache_log(lines)
for r in log:
    bot = r['bot']
bot_list = [
    "Googlebot/2.1",
    "AhrefsBot/5.0",
    "bingbot/2.0",
    "DotBot/1.1",
    "MJ12bot/v1.4.5",
    "SearchmetricsBot",
    "YandexBot/3.0",
]
It works for a single bot this way:
bot = r['bot'].strip()
if not bot.startswith("Googlebot/2.1"):
This bot.startswith check is, so to speak, our filter. But how can we make the bot go through the whole bot_list first?
Hope someone can point us in the right direction.
If I understand your problem correctly, you want to check whether the bot is not in bot_list. I would suggest extracting the bot name from the log line:
bot_name = r["bot"].split(" ")[22]
if bot_name not in bot_list:
Here 22 is assumed to be the position of the User-Agent in your log line; it will differ if you have customized the log format.
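A minimal sketch of this exact-match check, assuming the field splits cleanly on spaces. The sample record, the helper name bot_token, and the index 2 are made up for illustration; the real position depends on your log format.

```python
bot_list = ["Googlebot/2.1", "AhrefsBot/5.0", "bingbot/2.0"]

# Hypothetical record; in the real parser r comes from apache_log(lines).
r = {"bot": "Mozilla/5.0 (compatible; SomeCrawler/1.0)"}

def bot_token(field, position):
    """Return the space-separated token at `position`, or "" if out of range."""
    parts = field.split(" ")
    return parts[position] if position < len(parts) else ""

# Strip trailing ";" or ")" so the token can match a list entry exactly.
bot_name = bot_token(r["bot"], 2).rstrip(";)")
is_unlisted = bot_name not in bot_list
```

Exact matching only works if the extracted token is identical to a list entry, which is why the trailing punctuation is stripped here.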
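If the position is not clear, you can instead test whether any listed name occurs anywhere in the field (in Python 3, filter returns an iterator, so it has to be wrapped in list before calling len):
if not len(list(filter(lambda x: x in r["bot"], bot_list))):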
This is the same as the following function body:
return_list = []
for i in bot_list:
    if i in r["bot"]:
        return_list.append(i)
return len(return_list)
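Putting it together, here is a sketch of the whole filter, assuming each record's "bot" field is a string that may contain one of the listed names. The helper is_listed and the sample records are invented and stand in for the output of apache_log(lines). any() stops at the first match, so no intermediate list is built.

```python
bot_list = [
    "Googlebot/2.1", "AhrefsBot/5.0", "bingbot/2.0", "DotBot/1.1",
    "MJ12bot/v1.4.5", "SearchmetricsBot", "YandexBot/3.0",
]

def is_listed(bot_field, names=bot_list):
    """Return True if any name from the list occurs in the bot field."""
    return any(name in bot_field for name in names)

# Made-up records standing in for the parsed log.
log = [
    {"bot": "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"},
    {"bot": "SomeCrawler/1.0"},
    {"bot": " bingbot/2.0 "},
]

# Keep only the bots that are not in bot_list.
unlisted = [r["bot"].strip() for r in log if not is_listed(r["bot"])]
```

If the field always begins with the bot name, as in the question, bot.startswith(tuple(bot_list)) should do the same job, since str.startswith accepts a tuple of prefixes.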