python pandas dataframe for-loop frequency

convert for-loop output into dataframe python

I am trying to convert the output of this code into a dataframe, but do not know how. What is a good way to turn the output columns (string and frequency) into a dataframe?

# Extract data
url = "https://www.cbinsights.com/research-unicorn-companies"
df = pd.read_html(url)
first_df = df[0]


column = "Select Investors"
all_investor = []
for i in first_df[column]:
    all_investor += str(i).lower().split(',')

# Calculate frequency
for string in all_investor:
    string = string.strip()
    frequency = first_df[column].apply(
        lambda x: string in str(x).lower()).sum()
    print(string, frequency)

Output:
andreessen horowitz 41
new enterprise associates 21
battery ventures 14
index ventures 30
dst global 19
ribbit capital 8
forerunner ventures 4
crosslink capital 4
homebrew 2
sequoia capital 115
thoma bravo 3
softbank 50
tencent holdings 28
lightspeed india partners 4
sequoia capital india 25
ggv capital 14

....

Solution

Use Series.str.split, reshape by DataFrame.stack, convert to lowercase by Series.str.lower and last count by Series.value_counts:

url = "https://www.cbinsights.com/research-unicorn-companies"
df = pd.read_html(url)[0]

s = df['Select Investors'].str.split(', ', expand=True).stack().str.lower().value_counts()
print (s)
accel                              54
tiger global management            48
sequoia capital china              46
andreessen horowitz                42
sequoia capital                    41
                                   ..
almaz capital partners              1
commerzventures                     1
sunley house capital management     1
lockheed martin ventures            1
14w                                 1
Length: 1187, dtype: int64

For DataFrame use:

df = s.rename_axis('values').reset_index(name='count')
print (df)
                               values  count
0                               accel     54
1             tiger global management     48
2               sequoia capital china     46
3                 andreessen horowitz     42
4                     sequoia capital     41
                              ...    ...
1182           almaz capital partners      1
1183                  commerzventures      1
1184  sunley house capital management      1
1185         lockheed martin ventures      1
1186                              14w      1

[1187 rows x 2 columns]

If want modify your solution:

from collections import Counter

url = "https://www.cbinsights.com/research-unicorn-companies"
df = pd.read_html(url)[0]


column = "Select Investors"
all_investor = [j.strip() for i in df[column]  for j in str(i).lower().split(',')]

df1 = (pd.DataFrame(Counter(all_investor).items(), columns=['vals','count'])
         .sort_values(by='count',ascending=False, ignore_index=True))
print (df1)
                         vals  count
0                       accel     54
1     tiger global management     48
2       sequoia capital china     46
3         andreessen horowitz     42
4             sequoia capital     41
                      ...    ...
1180          futurex capital      1
1181            quiet capital      1
1182       white star capital      1
1183   almaz capital partners      1
1184          endiya partners      1

[1185 rows x 2 columns]