Search code examples
pythonmatrixsocial-networkingstata

Turning project data into a relationship matrix


My data set a list of people either working together or alone.

I have have a row for each project and columns with names of all the people who worked on that project. If column 2 is the first empty column given a row it was a solo job, if column 4 is the first empty column given a row then there were 3 people working together.

My goal is to find which people have worked together, and how many times, so I want all pairs in the data set, treating A working with B the same as B working with A.

From this a square N x N would be created with every actor labeling the column and row and in cell (A,B) and (B,A) would have how many times that pair worked together, and this would be done for every pair.

I know of a 'pretty' quick way to do it in Excel but I want it automated, hopefully in Stata or Python, just in case projects are added or removed I can just 1-click the re-run and not have to re-do it every time.

An example of the data, in a comma delimited fashion:

A
A,B
B,C,E
B,F
D,F
A,B,C
D,B
E,C,B
X,D,A

Hope that helps!

Brice. F,D B F F,X,C C,F,D


Solution

  • Maybe something like this would get you started?

    import csv
    import collections
    import itertools
    
    grid = collections.Counter()
    
    with open("connect.csv", "r", newline="") as fp:
        reader = csv.reader(fp)
        for line in reader:
            # clean empty names
            line = [name.strip() for name in line if name.strip()]
            # count single works
            if len(line) == 1:
                grid[line[0], line[0]] += 1
            # do pairwise counts
            for pair in itertools.combinations(line, 2):
                grid[pair] += 1
                grid[pair[::-1]] += 1
    
    actors = sorted(set(pair[0] for pair in grid))
    
    with open("connection_grid.csv", "w", newline="") as fp:
        writer = csv.writer(fp)
        writer.writerow([''] + actors)
        for actor in actors:
            line = [actor,] + [grid[actor, other] for other in actors]
            writer.writerow(line)
    

    [edit: modified to work under Python 3.2]

    The key modules are (1)csv, which makes reading and writing csv files much simpler; (2) collections, which provides an object called a Counter -- like a defaultdict(int), which you could use if your Python doesn't have Counter, it's a dictionary which automatically generates default values so you don't have to, and here the default count is 0; and (3) itertools, which has a combinations function to get all the pairs.

    which produces

    ,A,B,C,D,E,F,X
    A,1,2,1,1,0,0,1
    B,2,1,3,1,2,1,0
    C,1,3,0,1,2,2,1
    D,1,1,1,0,0,3,1
    E,0,2,2,0,0,0,0
    F,0,1,2,3,0,1,1
    X,1,0,1,1,0,1,0
    

    You could use itertools.product to make building the array a little more compact, but since it's only a line or two I figured it was as simple to do it manually.