I am not sure how best to describe this situation. Suppose I have the following table in a pandas DataFrame:
            0     1      2      3      4      5  ...  2949  2950  2951  2952  2953  2954
0.txt    html  head   meta   meta   meta   meta  ...
107.txt  html  head  title   meta   meta   meta  ...
125.txt  html  head  title  style   body    div  ...
190.txt  html  head   meta  title  style   body  ...
202.txt  html  head   meta  title   link  style
I want to reshape this table so that each column represents a unique HTML tag and each value is the number of times that tag occurs in the given row:
         html  head  meta  style  link  body  ...
0.txt       1     1     4      2     1     2  ...
107.txt     1     2     3      0     0     1  ...
Something like the above. I counted 88 distinct HTML tags in the table, so the result should have 88 columns. If this works, I will apply pandas' describe() and value_counts() functions to find out more about the tags' statistics. However, I am stuck on this step. Please give me some ideas for tackling it. Thank you.
IIUC, you can first stack the frame into a single long Series, then use groupby.value_counts to count each tag per original row, then unstack to get the expected result. With the data provided, for the first 3 rows and 6 columns, you get:
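res = (
    df.stack()                        # one tag per (file, position) entry
    .groupby(level=0).value_counts()  # count each tag within each file
    .unstack(fill_value=0)            # tags become columns; absent tags get 0
)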
print(res)
#          body  div  head  html  meta  style  title
# 0.txt       0    0     1     1     4      0      0
# 107.txt     0    0     1     1     3      0      1
# 125.txt     1    1     1     1     0      1      1
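For anyone who wants to run this end to end, here is a minimal, self-contained sketch. The small df is only a hand-typed reconstruction of the first three rows and six columns shown in the question, and the names stacked, counts, summary and totals are introduced here for illustration. It splits the chain into its three steps and then applies describe(), which the question mentions as the next step.

```python
import pandas as pd

# Assumption: hand-typed copy of the first 3 rows and 6 columns
# of the table in the question (the real frame is much wider).
df = pd.DataFrame(
    [
        ["html", "head", "meta", "meta", "meta", "meta"],
        ["html", "head", "title", "meta", "meta", "meta"],
        ["html", "head", "title", "style", "body", "div"],
    ],
    index=["0.txt", "107.txt", "125.txt"],
)

# Step 1: long format, one tag per (file, position) entry.
stacked = df.stack()

# Step 2: count how often each tag occurs within each file.
counts = stacked.groupby(level=0).value_counts()

# Step 3: move the tag level into the columns; absent tags become 0.
res = counts.unstack(fill_value=0)

# Follow-up from the question: per-tag summary statistics across files,
# plus the total number of occurrences of each tag.
summary = res.describe()
totals = res.sum().sort_values(ascending=False)

print(res)
print(summary.loc[["mean", "max"]])
print(totals)
```

On the full table, shorter rows are presumably padded with NaN. value_counts skips missing values by default, so the padding should not show up as an extra column.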