I am trying to convert the output of this code into a dataframe, but do not know how. What is a good way to turn the output columns (string and frequency) into a dataframe?
# Extract data
url = "https://www.cbinsights.com/research-unicorn-companies"
df = pd.read_html(url)
first_df = df[0]
column = "Select Investors"
all_investor = []
for i in first_df[column]:
all_investor += str(i).lower().split(',')
# Calculate frequency
for string in all_investor:
string = string.strip()
frequency = first_df[column].apply(
lambda x: string in str(x).lower()).sum()
print(string, frequency)
Output:
andreessen horowitz 41
new enterprise associates 21
battery ventures 14
index ventures 30
dst global 19
ribbit capital 8
forerunner ventures 4
crosslink capital 4
homebrew 2
sequoia capital 115
thoma bravo 3
softbank 50
tencent holdings 28
lightspeed india partners 4
sequoia capital india 25
ggv capital 14
....
Use Series.str.split
, reshape by DataFrame.stack
, convert to lowercase by Series.str.lower
and last count by Series.value_counts
:
url = "https://www.cbinsights.com/research-unicorn-companies"
df = pd.read_html(url)[0]
s = df['Select Investors'].str.split(', ', expand=True).stack().str.lower().value_counts()
print (s)
accel 54
tiger global management 48
sequoia capital china 46
andreessen horowitz 42
sequoia capital 41
..
almaz capital partners 1
commerzventures 1
sunley house capital management 1
lockheed martin ventures 1
14w 1
Length: 1187, dtype: int64
For DataFrame use:
df = s.rename_axis('values').reset_index(name='count')
print (df)
values count
0 accel 54
1 tiger global management 48
2 sequoia capital china 46
3 andreessen horowitz 42
4 sequoia capital 41
... ...
1182 almaz capital partners 1
1183 commerzventures 1
1184 sunley house capital management 1
1185 lockheed martin ventures 1
1186 14w 1
[1187 rows x 2 columns]
If want modify your solution:
from collections import Counter
url = "https://www.cbinsights.com/research-unicorn-companies"
df = pd.read_html(url)[0]
column = "Select Investors"
all_investor = [j.strip() for i in df[column] for j in str(i).lower().split(',')]
df1 = (pd.DataFrame(Counter(all_investor).items(), columns=['vals','count'])
.sort_values(by='count',ascending=False, ignore_index=True))
print (df1)
vals count
0 accel 54
1 tiger global management 48
2 sequoia capital china 46
3 andreessen horowitz 42
4 sequoia capital 41
... ...
1180 futurex capital 1
1181 quiet capital 1
1182 white star capital 1
1183 almaz capital partners 1
1184 endiya partners 1
[1185 rows x 2 columns]