Comparing many files in Bash

I'm trying to automate a task at work that I normally do by hand: taking database output of several users' permissions and comparing the lists to see what they have in common. My current script uses comm and paste, but it doesn't give me all the output I'd like.

Part of the problem is that comm only handles two files at a time, and I need to compare at least three to find a trend. I also need to see when two out of three users have something in common that the third lacks (so comparing the output of two comm commands doesn't work). The result needs to be comma-separated values so it can be imported into Excel: each user gets a column, and the last column lists everything they have in common. comm would work perfectly if it could compare more than two files (and show two-out-of-three comparisons).

Apart from the code that cleans the extra cruft off the raw CSV file, here's what I have so far for comparing four users. It's highly inefficient, but it's what I know.

cat foo1 | sort > foo5
cat foo2 | sort > foo6
cat foo3 | sort > foo7
cat foo4 | sort > foo8

comm foo5 foo6 > foomp
comm foo7 foo8 > foomp2

paste foomp foomp2 > output2
sed 's/[\t]/,/g' output2 > output4.csv
cat output4.csv

Right now this outputs two users, their similarities and differences, then does the same for another two users and pastes it together. This works better than doing it by hand, but I know I could be doing more.

An example input file would be something like:

User1

Active Directory
Internet
S: Drive
Sales Records

User2

Active Directory
Internet
Pricing Lookup
S: Drive

User3

Active Directory
Internet
Novell
Sales Records

Here all three users have Active Directory and Internet in common, two out of three have Sales Records, two out of three have S: Drive, and only one user each has Novell and Pricing Lookup.

Can someone point out what I'm missing?

Solution

Using GNU AWK (gawk) you can print a table that shows how multiple users' permissions correlate. You could also do the same thing in any language that supports associative arrays (hashes), such as Bash 4, Python, Perl, etc.

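#!/usr/bin/gawk -f
# Requires gawk: asort() is a GNU extension. Adjust the path if gawk lives elsewhere.
{
    # Record that this user (file) holds this permission (line).
    array[FILENAME, $0] = 1
    perms[$0] = $0
    # Track the longest permission name to size the first column.
    if (length($0) > maxplen) {
        maxplen = length($0)
    }
    users[FILENAME] = FILENAME
}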
END {
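    # asort() sorts each array by value and reindexes it from 1 to n.
    pcount = asort(perms)
    ucount = asort(users)
    maxplen += 2    # padding after the permission column
    colwidth = 8    # width of each user column
    # Header row: an empty permission column, then one column per user.
    printf("%*s", maxplen, "")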
    for (u = 1; u <= ucount; u++) {
        printf("%-*s", colwidth, users[u])
    }
    printf("\n")

    for (p = 1; p <= pcount; p++) {
        printf("%-*s", maxplen, perms[p])
        for (u = 1; u <= ucount; u++) {
            if ((users[u], perms[p]) in array) {
                printf("%-*s", colwidth, "  X")
            } else {
                printf("%-*s", colwidth, "")
            }
        }
        printf("\n")
    }
}

Save this file, perhaps calling it "correlate", then set it to be executable:

$ chmod u+x correlate

Then, assuming that the filenames correspond to the usernames or are otherwise meaningful (your examples are "user1" through "user3" so that works well), you can run it like this:

$ ./correlate user*

and you would get the following output based on your sample input:

                  user1   user2   user3
Active Directory    X       X       X
Internet            X       X       X
Novell                              X
Pricing Lookup              X
S: Drive            X       X
Sales Records       X               X
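
Since you mentioned needing CSV for Excel and a way to spot two-out-of-three overlaps, here is a minimal sketch of the same idea that emits CSV with a trailing count column. It is an untested-on-your-data variant rather than part of the script above: the function name perm_csv, the helper csvq, the "X" marker and the "count" column are all names I made up for illustration. It assumes one permission per line, sticks to POSIX awk, and leaves the row ordering to sort.

```shell
# Hypothetical helper: print a CSV matrix of permissions (rows) by files (columns).
# Assumes each file holds one permission per line; blank lines are ignored.
perm_csv() (
    [ "$#" -gt 0 ] || exit 1
    LC_ALL=C; export LC_ALL    # byte-wise sort order, so results are repeatable

    # Header row: one column per file, in command-line order.
    printf 'permission'
    for f in "$@"; do printf ',%s' "$f"; done
    printf ',count\n'

    awk '
        # Wrap a field in double quotes, doubling any embedded quotes.
        function csvq(s) { gsub(/"/, "\"\"", s); return "\"" s "\"" }

        NF == 0 { next }    # ignore blank lines
        { seen[FILENAME, $0] = 1; perms[$0] = 1 }

        END {
            for (p in perms) {
                row = csvq(p)
                n = 0
                # Walk the files in command-line order so columns match the header.
                for (i = 1; i < ARGC; i++) {
                    if ((ARGV[i], p) in seen) {
                        row = row ",X"
                        n++
                    } else {
                        row = row ","
                    }
                }
                print row "," n
            }
        }
    ' "$@" | sort
)
```

Run as `perm_csv user1 user2 user3 > perms.csv`. A count equal to the number of files means everyone has that permission, and a count of 2 out of 3 picks out the partial overlaps.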

Edit:

This version doesn't use asort(), so it should work on non-GNU versions of AWK. The disadvantage is that the order of rows and columns is unpredictable, because the order in which for (x in array) visits the keys is unspecified.

#!/usr/bin/awk -f
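{
    # Record that this user (file) holds this permission (line).
    array[FILENAME, $0] = 1
    perms[$0] = $0
    # Track the longest permission name to size the first column.
    if (length($0) > maxplen) {
        maxplen = length($0)
    }
    users[FILENAME] = FILENAME
}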
END {
    maxplen += 2
    colwidth = 8
    printf("%*s", maxplen, "")
    for (u in users) {
        printf("%-*s", colwidth, u)
    }
    printf("\n")

    for (p in perms) {
        printf("%-*s", maxplen, p)
        for (u in users) {
            if ((u, p) in array) {
                printf("%-*s", colwidth, "  X")
            } else {
                printf("%-*s", colwidth, "")
            }
        }
        printf("\n")
    }
}
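
If all you need is how many users hold each permission, rather than the full table, a plain pipeline may be enough. This is only a sketch under the same one-permission-per-line assumption. The function name perm_counts is hypothetical, and the tab-separated "count, name" output format is my own choice.

```shell
# Hypothetical helper: count how many of the given files contain each permission.
# Output is "count<TAB>permission", highest count first, then by name.
perm_counts() (
    [ "$#" -gt 0 ] || exit 1
    LC_ALL=C; export LC_ALL    # byte-wise sort order, so results are repeatable
    tab=$(printf '\t')

    # sort -u de-duplicates within a file, so a repeated line counts once per user.
    # The awk step rewrites the padded "   3 Name" from uniq -c as "3<TAB>Name".
    for f in "$@"; do
        grep -v '^[[:space:]]*$' "$f" | sort -u
    done |
        sort |
        uniq -c |
        awk '{ n = $1; sub(/^[ \t]*[0-9]+[ \t]/, ""); print n "\t" $0 }' |
        sort -t "$tab" -k1,1nr -k2
)
```

With the three sample files, `perm_counts user1 user2 user3` should list Active Directory and Internet with a count of 3, S: Drive and Sales Records with 2, and Novell and Pricing Lookup with 1.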