Search code examples
pythonpandascountrow

Perform value counts in Python/Pandas on one column, but return values in multiple columns


In Python, Im trying to perform value_counts on a Pandas column. I can get that to work, but I can't figure out how to get some of the other associated columns.
The code:

import pandas as pd

myframe = pd.DataFrame({"Server":["Server_1","Server_1","Server_1","Server_1","Server_1","Server_2","Server_2","Server_2","Server_2","Server_3","Server_3","Server_3","Server_3","Server_3"], 
"CVE_ID":["CVE-2017-1111","CVE-2017-1112","CVE-2017-1113","CVE-2017-1114","CVE-2017-1115","CVE-2017-1111","CVE-2017-1112","CVE-2017-1113","CVE-2017-1114","CVE-2017-1113","CVE-2017-1114","CVE-2017-1115","CVE-2017-1116","CVE-2017-1117"],
"VulnName":["Java Update 1","Java Update 2","Java Update 3","Adobe 1","Chrome 1","Java Update 1","Java Update 2","Java Update 3","Adobe 1","Java Update 3","Adobe 1","Chrome 1","Chrome 2","Chrome 3"],
"ServerOwner":["Alice","Alice","Alice","Alice","Alice","Bob","Bob","Bob","Bob","Carol","Carol","Carol","Carol","Carol"]})

print "The dataframe: \n", myframe
print "Top 10 offending CVEs, Vulnerability and Count: \n"
print myframe['CVE_ID'].value_counts()

The last line prints out 2 columns: one of CVEs, and one of how many times it occurred. But I want to print out something like this, where it maintains a link between the CVE and the Vulnerability Name (seen in the middle column):

Top 10 offending CVEs, Vulnerability and Count:
CVE-2017-1113   Java Update 1     3
CVE-2017-1114   Java Update 2     3
...etc...

How do I do that? Everything I do keeps throwing errors.


Solution

  • Edit: Changed so that in output have column name access

    (Note the addition of as_index=False and .reset_index in [1] See sources 5 and 6

    [1] First groupby on CVE_ID column and use size:

    counts = myframe.groupby(['CVE_ID','VulnName','ServerOwner'], as_index=False).size().unstack(fill_value=0).reset_index()
    
    
    ServerOwner         CVE_ID       VulnName  Alice  Bob  Carol
    0            CVE-2017-1111  Java Update 1      1    1      0
    1            CVE-2017-1112  Java Update 2      1    1      0
    2            CVE-2017-1113  Java Update 3      1    1      1
    3            CVE-2017-1114        Adobe 1      1    1      1
    4            CVE-2017-1115       Chrome 1      1    0      1
    5            CVE-2017-1116       Chrome 2      0    0      1
    6            CVE-2017-1117       Chrome 3      0    0      1
    

    [2] Then sum over the Alice, Bob and Carol column to get:

    counts['Count'] = counts[['Alice','Bob','Carol']].sum(axis=1)
    
    ServerOwner         CVE_ID       VulnName  Alice  Bob  Carol  Count
    0            CVE-2017-1111  Java Update 1      1    1      0      2
    1            CVE-2017-1112  Java Update 2      1    1      0      2
    2            CVE-2017-1113  Java Update 3      1    1      1      3
    3            CVE-2017-1114        Adobe 1      1    1      1      3
    4            CVE-2017-1115       Chrome 1      1    0      1      2
    5            CVE-2017-1116       Chrome 2      0    0      1      1
    6            CVE-2017-1117       Chrome 3      0    0      1      1
    

    [3] Then delete name columns using df.drop on names:

    counts.drop(['Carol','Bob','Alice'],inplace=True,axis=1)
    
    ServerOwner         CVE_ID       VulnName  Count
    0            CVE-2017-1111  Java Update 1      2
    1            CVE-2017-1112  Java Update 2      2
    2            CVE-2017-1113  Java Update 3      3
    3            CVE-2017-1114        Adobe 1      3
    4            CVE-2017-1115       Chrome 1      2
    5            CVE-2017-1116       Chrome 2      1
    6            CVE-2017-1117       Chrome 3      1
    

    [4] Then you use sort_values on the sum column:

    counts.sort_values(by='Count', ascending=False, inplace=True)
    
    ServerOwner         CVE_ID       VulnName  Count
    2            CVE-2017-1113  Java Update 3      3
    3            CVE-2017-1114        Adobe 1      3
    0            CVE-2017-1111  Java Update 1      2
    1            CVE-2017-1112  Java Update 2      2
    4            CVE-2017-1115       Chrome 1      2
    5            CVE-2017-1116       Chrome 2      1
    6            CVE-2017-1117       Chrome 3      1
    

    Combined:

    counts = myframe.groupby(['CVE_ID','VulnName','ServerOwner'], as_index=False).size().unstack(fill_value=0).reset_index()
    counts['Count'] = counts[['Alice','Bob','Carol']].sum(axis=1)
    counts.drop(['Carol','Bob','Alice'],inplace=True,axis=1)
    counts.sort_values(by='Count', ascending=False, inplace=True)
    
    print "The dataframe: \n", myframe
    print "Top 10 offending CVEs, Vulnerability and Count: \n"
    print counts
    
    Top 10 offending CVEs, Vulnerability and Count: 
    
    ServerOwner         CVE_ID       VulnName  Count
    2            CVE-2017-1113  Java Update 3      3
    3            CVE-2017-1114        Adobe 1      3
    0            CVE-2017-1111  Java Update 1      2
    1            CVE-2017-1112  Java Update 2      2
    4            CVE-2017-1115       Chrome 1      2
    5            CVE-2017-1116       Chrome 2      1
    6            CVE-2017-1117       Chrome 3      1
    

    If needed, Can use reset_index() to reset index at this point.

    Edit: In response to comment about serverOwner index, you can reset index, drop old index, and rename new index:

    counts.reset_index(drop=True, inplace = True)
    counts.index.names = ['index']
    

    Gives:

    ServerOwner         CVE_ID       VulnName  Count
    index                                           
    0            CVE-2017-1113  Java Update 3      3
    1            CVE-2017-1114        Adobe 1      3
    2            CVE-2017-1111  Java Update 1      2
    3            CVE-2017-1112  Java Update 2      2
    4            CVE-2017-1115       Chrome 1      2
    5            CVE-2017-1116       Chrome 2      1
    6            CVE-2017-1117       Chrome 3      1
    

    (The ServerOwner name remains as a remnant of the original groupby command to detail which column was used.)

    Sources for this answer:

    [1] Groupby value counts on the dataframe pandas dataframe-pandas

    [2] Pandas: sum DataFrame rows for given columns

    [3] Delete column from pandas DataFrame

    [4] python, sort descending dataframe with pandas

    [5] Converting a Pandas GroupBy object to DataFrame

    [6] How to GroupBy a Dataframe in Pandas and keep Columns