Search code examples
pythoncsvmachine-learningk-meansidentify

Identify most common combinations of values in 2 or more columns in a CSV file


How to find the most common combinations of values in 2 or more columns for rows in a CSV file. example:

event,rack,role,dc
network,north,mobile,africa
network,east,mobile,asia
oom,south,desktop,europe
cpu,east,web,northamerica
oom,north,mobile,europe
cpu,south,web,northamerica
cpu,west,web,northamerica

I have tried to create lists for some of the possible combinations I'm looking at, and then use the most_common() method in Collections.Counter to find the common patterns. but I need an algorithm to find common records for any possible combination of 2 or more columns.

My code so far:

import csv
from collections import Counter

class Alert:
    def __init__(self, event, rack, role, dc):
        self.event = event
        self.rack = rack
        self.role = role
        self.dc = dc

    def __str__(self):
        return(",".join([self.event, self.rack, self.role, self.dc]))


alerts = []
with open('data.csv', mode='r') as csv_file:
    csv_reader = csv.DictReader(csv_file)

    for row in csv_reader:
        alert = Alert(row['event'], row['rack'], row['role'], row['dc'])
        alerts.append(alert)
dcevent= []
dceventrole = []
dcrole = []
dcrolerack = []

for alert in alerts:
    dcevent.append(alert.dc + '-' + alert.event)
    dceventrole.append(alert.dc+'-'+alert.event+'-'+alert.role)
    dcrole.append(alert.dc+'-'+alert.role)
    dcrolerack.append(alert.dc+'-'+alert.role+'-'+alert.rack)


masterlist = Counter(dcevent).most_common() + Counter(dceventrole).most_common() + Counter(dcrole).most_common() + Counter(dcrolerack).most_common()

for item in sorted(masterlist, key=lambda x: x[1], reverse=True):
    print(item)

This is the output for the records mentioned above:

('northamerica-web-cpu', 3) # there are 3 rows matching the values northamerica,web and cpu
('northamerica-web', 3) # there are 3 rows matching just the values northamerica and web
('northamerica-cpu', 3) # there are 3 rows matching northamerica and cpu
('europe-oom', 2) # there are 2 rows matching europe and oom
('africa-mobile-network', 1)
('asia-mobile-network', 1)
('europe-desktop-oom', 1)
('europe-mobile-oom', 1)
('africa-mobile-north', 1)
('asia-mobile-east', 1)
('europe-desktop-south', 1)
('northamerica-web-east', 1)
('europe-mobile-north', 1)
('northamerica-web-south', 1)
('northamerica-web-west', 1)
('africa-mobile', 1)
('asia-mobile', 1)
('europe-desktop', 1)
('europe-mobile', 1)
('africa-network', 1)
('asia-network', 1)


Solution

  • Let me start by defining the data structures in-situ, since csv-reading is orthogonal to the real issue:

    lines = [line.split(',') for line in """\
    event,rack,role,dc
    network,north,mobile,africa
    network,east,mobile,asia
    oom,south,desktop,europe
    cpu,east,web,northamerica
    oom,north,mobile,europe
    cpu,south,web,northamerica
    cpu,west,web,northamerica
    """.splitlines()]
    
    for line in lines:
        print line
    

    which prints:

    ['event', 'rack', 'role', 'dc']
    ['network', 'north', 'mobile', 'africa']
    ['network', 'east', 'mobile', 'asia']
    ['oom', 'south', 'desktop', 'europe']
    ['cpu', 'east', 'web', 'northamerica']
    ['oom', 'north', 'mobile', 'europe']
    ['cpu', 'south', 'web', 'northamerica']
    ['cpu', 'west', 'web', 'northamerica']
    

    now, lets create all possible combinations of 2 or more words from each line. There are 11 ways to choose 2, 3, or 4 from 4 (4C2 + 4C3 + 4C4 == 6 + 4 + 1 == 11).

    The algorithm I'm using to find the combinations looks at the binary numbers with 4 digits (i.e. 0000, 0001, 0010, 0011, 0100, etc.) and for each such number creates the combination of words depending on if the respective binary digit is 1. E.g. for 0101 the second and fourth word are chosen:

    def find_combinations(line):
        combinations = []
        for i in range(2**len(line)):
            bits = bin(i)[2:].zfill(len(line))
            if bits.count('1') < 2:  # skip numbers with less than two 1-bits
                continue
            combination = set()
            for bit, word in zip(bits, line):
                if bit == '1':
                    combination.add(word)
            combinations.append('-'.join(sorted(combination)))
        return combinations
    

    now we can loop through all combinations and count their frequency:

    from collections import defaultdict
    counter = defaultdict(int)
    for line in lines:
        for c in find_combinations(line):
            counter[c] += 1
    

    and finally we can sort (descending) on frequency

    for combination_freq in sorted(counter.items(), key=lambda item: item[1], reverse=True):
        print combination_freq
    

    to get:

    ('cpu-northamerica', 3)
    ('northamerica-web', 3)
    ('cpu-northamerica-web', 3)
    ('cpu-web', 3)
    ('mobile-north', 2)
    ('mobile-network', 2)
    ('europe-oom', 2)
    ('east-network', 1)
    ('asia-east-mobile', 1)
    ('asia-east-network', 1)
    ('cpu-south-web', 1)
    ('east-northamerica-web', 1)
    ('europe-north', 1)
    ('cpu-east', 1)
    ...etc.