
How to count columns by row in python pandas?


I am not sure how to describe this situation. Suppose I have the following table as a pandas DataFrame:

            0     1      2      3      4      5  ... 2949 2950 2951 2952 2953 2954
0.txt    html  head   meta   meta   meta   meta  ...                              
107.txt  html  head  title   meta   meta   meta  ...                              
125.txt  html  head  title  style   body    div  ...                              
190.txt  html  head   meta  title  style   body  ...                              
202.txt  html  head   meta  title   link  style

I want to reshape this table so that each column represents a unique HTML tag and each value is the number of times that tag occurs in the given row:

         html  head   meta  style   link   body  ... 
0.txt       1     1      4      2      1      2  ...                              
107.txt     1     2      3      0      0      1  ...                              

Something like the above. I counted 88 distinct HTML tags in the table, so the result should have 88 columns. If this works, I will then apply pandas' describe() and value_counts() to learn more about these tags' statistics. However, I am stuck on this step. Please give me some ideas for tackling it. Thank you.


Solution

  • IIUC, you can first stack the frame into one long Series, then use groupby.value_counts to count each tag per original row, then unstack to get the expected wide result. With the data provided, for the first 3 rows and 6 columns, you get:

    res = (
        df.stack()
          .groupby(level=0).value_counts()
          .unstack(fill_value=0)
    )
    print(res)
    #         body  div  head  html  meta  style  title
    # 0.txt       0    0     1     1     4      0      0
    # 107.txt     0    0     1     1     3      0      1
    # 125.txt     1    1     1     1     0      1      1
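
For anyone who wants to try this end to end, here is a minimal, self-contained sketch of the same stack / groupby.value_counts / unstack idea. It rebuilds the first six columns of the five rows shown in the question. The seventh padding column and the helper name `count_tags_per_row` are made up for illustration, and the filter assumes short rows are padded with either NaN/None or empty strings, which the question does not confirm.

```python
import pandas as pd

# Sample data: first six columns of the rows shown in the question.
# The 7th column is hypothetical padding (a mix of "" and None) to mimic
# the blank cells that short rows have in the full table.
df = pd.DataFrame(
    [
        ["html", "head", "meta", "meta", "meta", "meta", ""],
        ["html", "head", "title", "meta", "meta", "meta", None],
        ["html", "head", "title", "style", "body", "div", ""],
        ["html", "head", "meta", "title", "style", "body", None],
        ["html", "head", "meta", "title", "link", "style", ""],
    ],
    index=["0.txt", "107.txt", "125.txt", "190.txt", "202.txt"],
)


def count_tags_per_row(frame: pd.DataFrame) -> pd.DataFrame:
    """Return one row per original row and one column per distinct tag."""
    # Long format: one entry per (row label, column position) cell.
    tags = frame.stack()
    # Drop padding explicitly so it is never counted as a tag
    # (assumes padding is NaN/None or an empty string).
    tags = tags[tags.notna() & (tags != "")]
    return (
        tags.groupby(level=0).value_counts()  # count of each tag per row label
            .unstack(fill_value=0)            # tags become columns, missing -> 0
    )


res = count_tags_per_row(df)
print(res)

# Follow-up statistics mentioned in the question:
print(res.describe())                    # per-tag summary across files
print(res.sum().sort_values(ascending=False))  # total occurrences of each tag
```

Filtering the padding before counting keeps an empty-string column out of the result. A file whose row is entirely padding would not appear in the output at all, so reindex against `df.index` afterwards if such rows need to be kept.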