Search code examples
pythonpandasdataframefor-loopfrequency

convert for-loop output into dataframe python


I am trying to convert the output of this code into a dataframe, but do not know how. What is a good way to turn the output columns (string and frequency) into a dataframe?

# Extract data
url = "https://www.cbinsights.com/research-unicorn-companies"
df = pd.read_html(url)
first_df = df[0]


column = "Select Investors"
all_investor = []
for i in first_df[column]:
    all_investor += str(i).lower().split(',')

# Calculate frequency
for string in all_investor:
    string = string.strip()
    frequency = first_df[column].apply(
        lambda x: string in str(x).lower()).sum()
    print(string, frequency)

Output:
andreessen horowitz 41
new enterprise associates 21
battery ventures 14
index ventures 30
dst global 19
ribbit capital 8
forerunner ventures 4
crosslink capital 4
homebrew 2
sequoia capital 115
thoma bravo 3
softbank 50
tencent holdings 28
lightspeed india partners 4
sequoia capital india 25
ggv capital 14

....


Solution

  • Use Series.str.split, reshape by DataFrame.stack, convert to lowercase by Series.str.lower and last count by Series.value_counts:

    url = "https://www.cbinsights.com/research-unicorn-companies"
    df = pd.read_html(url)[0]
    
    s = df['Select Investors'].str.split(', ', expand=True).stack().str.lower().value_counts()
    print (s)
    accel                              54
    tiger global management            48
    sequoia capital china              46
    andreessen horowitz                42
    sequoia capital                    41
                                       ..
    almaz capital partners              1
    commerzventures                     1
    sunley house capital management     1
    lockheed martin ventures            1
    14w                                 1
    Length: 1187, dtype: int64
    

    For DataFrame use:

    df = s.rename_axis('values').reset_index(name='count')
    print (df)
                                   values  count
    0                               accel     54
    1             tiger global management     48
    2               sequoia capital china     46
    3                 andreessen horowitz     42
    4                     sequoia capital     41
                                  ...    ...
    1182           almaz capital partners      1
    1183                  commerzventures      1
    1184  sunley house capital management      1
    1185         lockheed martin ventures      1
    1186                              14w      1
    
    [1187 rows x 2 columns]
    

    If want modify your solution:

    from collections import Counter
    
    url = "https://www.cbinsights.com/research-unicorn-companies"
    df = pd.read_html(url)[0]
    
    
    column = "Select Investors"
    all_investor = [j.strip() for i in df[column]  for j in str(i).lower().split(',')]
    
    df1 = (pd.DataFrame(Counter(all_investor).items(), columns=['vals','count'])
             .sort_values(by='count',ascending=False, ignore_index=True))
    print (df1)
                             vals  count
    0                       accel     54
    1     tiger global management     48
    2       sequoia capital china     46
    3         andreessen horowitz     42
    4             sequoia capital     41
                          ...    ...
    1180          futurex capital      1
    1181            quiet capital      1
    1182       white star capital      1
    1183   almaz capital partners      1
    1184          endiya partners      1
    
    [1185 rows x 2 columns]