Search code examples
pythonpandascsvsortingline-count

Compare data between two csv files and count how many rows have the same data


Let's say I have list of all OUs (AllOU.csv):

NEWS
STORE
SPRINKLES
ICECREAM

I want to look through a csv file (samplefile.csv) on the third column called 'column3', and search through each row if it matches what is in the samplefile.csv. Then I want to sort them and count how many rows each one has.

This is how the column looks:

column3
CN=Clark Kent,OU=news,dc=company,dc=com
CN=Mary Poppins,OU=ice cream, dc=company,dc=com
CN=Mary Jane,OU=news,OU=tv,dc=company,dc=com
CN=Pepper Jack,OU=store,OU=tv,dc=company,dc=com
CN=Monty Python,OU=store,dc=company,dc=com
CN=Anne Potts,OU=sprinkles,dc=company,dc=com

I want to sort them out like this (or a list):

CN=Clark Kent,OU=news,dc=company,dc=com
CN=Mary Jane,OU=news,OU=tv,dc=company,dc=com

CN=Pepper Jack,OU=tv,OU=store,dc=company,dc=com
CN=Monty Python,OU=store,dc=company,dc=com

CN=Mary Poppins,OU=ice cream, dc=company,dc=com

CN=Anne Potts,OU=sprinkles,dc=company,dc=com

This is what the final output should be:

2, news
2, store,
1, icecream
1, sprinkles

Maybe a list would be a good way of sorting them? Like this?

holdingList =['CN=Clark Kent,OU=news,dc=company,dc=com','CN=Mary Jane,OU=news,OU=tv,dc=company,dc=com'],
['CN=Pepper Jack,OU=tv,OU=store,dc=company,dc=com','CN=Monty Python,OU=store,dc=company,dc=com'],
['CN=Mary Poppins,OU=ice cream, dc=company,dc=com'],
['CN=Anne Potts,OU=sprinkles,dc=company,dc=com']

I had something like this so far:

file = open('samplefile.csv')
df = pd.read_csv(file, usecols=['column3'])

#file of all OUs
file2 = open('ALLOU.csv')
OUList = pd.read_csv(file2, header=None)

for OU in OUList[0]:
        df_dept = df[df['column3'].str.contains(f'OU={OU }')].count()
        print({OU}, df_dept)

Solution

  • Read your file first and create a list of objects. [{CN:’Clark Kent’,OU:’news’,dc:’company’,dc:’com’},…{…}]

    Once you have created the list you can convert it to data frame and then apply all the grouping, sorting and other abilities of pandas.

    Now to achieve this, first read your file into a variable lets call var filedata=yourFileContents. Next split filedata. var lines = filedata.split(‘\n’) Now loop over each lines

    dataList = [] 
    for line in lines:
        item = dict()
        elements = line.split(‘,’)
        for element in elements:
            key_value = element.split(‘=‘)
            item[key_value[0]] = key_value[1]
            dataList.append(item)
    print(dataList)
    

    Now you may load this onto a panda dataframe and apply sorting and grouping. Once you have structured the data frame, you can simply search the key from the other file in this dataframe and get your numbers