Search code examples
pythonpandasdataframefrequencyword-cloud

frequency of string (comma separated) in Python


I'm trying to find the frequency of strings from the field "Select Investors" on this website https://www.cbinsights.com/research-unicorn-companies

Is there a way to pull out the frequency of each of the comma separated strings?

For example, how frequent does the term "Sequoia Capital China" show up?


Solution

  • The solution provided by @Mazhar checks whether a certain term is a substring of a string delimited by commas. As a consequence, the number of occurrences of 'Sequoia Capital' returned by this approach is the sum of the occurrences of all the strings that contain 'Sequoia Capital', namely 'Sequoia Capital', 'Sequoia Capital China', 'Sequoia Capital India', 'Sequoia Capital Israel' and 'and Sequoia Capital China'. The following code avoids that issue:

    import pandas as pd
    from collections import defaultdict
    
    url = "https://www.cbinsights.com/research-unicorn-companies"
    df = pd.read_html(url)[0]
    
    freqs = defaultdict(int)
    for group in df['Select Investors']:
        if hasattr(group, 'lower'):
            for raw_investor in group.lower().split(','):
                investor = raw_investor.strip()
                # Ignore empty strings produced by wrong data like this:
                # 'B Capital Group,, GE Ventures, McKesson Ventures'
                if investor:
                    freqs[investor] += 1
    

    Demo

    In [57]: freqs['sequoia capital']
    Out[57]: 41
    
    In [58]: freqs['sequoia capital china']
    Out[58]: 46
    
    In [59]: freqs['sequoia capital india']
    Out[59]: 25
    
    In [60]: freqs['sequoia capital israel']
    Out[60]: 2
    
    In [61]: freqs['and sequoia capital china']
    Out[61]: 1
    

    The sum of occurrences is 115, which coincides with the frequency returned for 'sequoia capital' by the currently accepted solution.