I'm trying to find the frequency of strings from the field "Select Investors" on this website https://www.cbinsights.com/research-unicorn-companies
Is there a way to pull out the frequency of each of the comma separated strings?
For example, how frequently does the term "Sequoia Capital China" show up?
The solution provided by @Mazhar checks whether a certain term is a substring of the comma-delimited string. As a consequence, the number of occurrences of 'Sequoia Capital' returned by that approach is the sum of the occurrences of all the strings that contain 'Sequoia Capital', namely 'Sequoia Capital', 'Sequoia Capital China', 'Sequoia Capital India', 'Sequoia Capital Israel', and 'and Sequoia Capital China'. The following code avoids that issue:
import pandas as pd
from collections import defaultdict

url = "https://www.cbinsights.com/research-unicorn-companies"
df = pd.read_html(url)[0]

freqs = defaultdict(int)
for group in df['Select Investors']:
    # Skip non-string cells (e.g. NaN, which is a float and has no .lower())
    if hasattr(group, 'lower'):
        for raw_investor in group.lower().split(','):
            investor = raw_investor.strip()
            # Ignore empty strings produced by wrong data like this:
            # 'B Capital Group,, GE Ventures, McKesson Ventures'
            if investor:
                freqs[investor] += 1
In [57]: freqs['sequoia capital']
Out[57]: 41
In [58]: freqs['sequoia capital china']
Out[58]: 46
In [59]: freqs['sequoia capital india']
Out[59]: 25
In [60]: freqs['sequoia capital israel']
Out[60]: 2
In [61]: freqs['and sequoia capital china']
Out[61]: 1
The sum of these occurrences is 41 + 46 + 25 + 2 + 1 = 115, which coincides with the frequency returned for 'sequoia capital' by the currently accepted solution.
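The same exact-match counting can also be expressed with pandas string methods instead of an explicit loop. A minimal sketch, using a small in-memory sample in place of the scraped table (with the real data you would build the Series from `df['Select Investors']` as above); the sample rows here are made up for illustration:

```python
import pandas as pd

# Hypothetical sample standing in for df['Select Investors']
investors = pd.Series([
    "Sequoia Capital, Thoma Bravo, Softbank",
    "Sequoia Capital China, ZhenFund",
    "Sequoia Capital,, Sequoia Capital China",  # note the stray double comma
])

freqs = (
    investors.str.lower()
    .str.split(',')            # each cell becomes a list of investor names
    .explode()                 # one row per individual investor
    .str.strip()               # drop surrounding whitespace
    .loc[lambda s: s != '']    # discard empties from stray commas
    .value_counts()            # exact-match frequency of each name
)
print(freqs['sequoia capital'])        # counts 2 in this sample, not 3:
print(freqs['sequoia capital china'])  # 'sequoia capital china' is separate
```

Because `value_counts` compares whole strings, 'sequoia capital china' is never lumped into 'sequoia capital', just as in the loop-based version.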