Search code examples
pythonpython-3.xsocial-networking

Extracting Groups


Using Python 3.2 I was hoping to solve the below issue. My data consist of hundreds of rows (signifying a project) and 21 columns. The first of which is a unique project ID and the other 20 columns is the group of people, or person, that led the project. person_1 is always filled and if there is a name in person_3 that means 3 people are working together. If there is a name in person_18 that means 18 people are working together.

I have an excel spreadsheet that is setup the following way:

 unique ID person_1      person _2      person_3     person_4   ...  person_20
    12       Tom           Sally          Mike
    16       Joe           Mike
    5        Joe           Sally
    1       Sally          Mike           Tom
    6       Sally          Tom            Mike
    2       Jared          Joe            Mike        John      ...     Carl

I want to do a few things:

1) Make a column that will give me a unique 'Group Name' which will be, using unique ID 1 as my example, Sally/Mike/Tom. So it will be the names separated by '/'.

2) How can I treat, from my example, Sally/Mike/Tom the same as Sally/Tom/Mike. Meaning, I would like another column that makes the group name in alphabetical order (no matter the actual permutation), still separated by '/'.

3) This question is similar to (2). However, I want the person listed in person_1 to matter. Meaning Joe/Tom/Mike is different from Tom/Joe/Mike but not different than Joe/Mike/Tom. So there will be another column that keeps person_1 at the start of the group name but alphabetizes person_2 through person_20 if applicable (i.e., if the project has more than 1 person on it).

Thanks for the help and suggestions


Solution

  • The previous answer gave a clear statement of method, but perhaps you are stuck on either the string processing or the csv processing. Both are demonstrated in the following code. The relevant string methods are sorted and join. '/'.join tells join to use / as separator between joined items. The + operator between lists in tname and writerow statements concatenates the lists. A csv.reader is an iterator that delivers one list per row, and a csv.writer converts a list to a row and writes it out. You will want to add error testing to the file opens, etc. The data file used to test this code is shown after the code.

    import csv
    fi = open('xgroup.csv')
    fo = open('xgroup3.csv', 'w')
    w = csv.writer(fo)
    r = csv.reader(fi)
    li = 0
    print "Opened reader and writer"
    for row in r:
        gname = '/'.join(row[1:])
        sname = '/'.join(sorted(row[1:]))
        tname = '/'.join([row[1]]+sorted(row[2:]))
        w.writerow([row[0], gname, sname, tname]+row[1:])
        li += 1
    fi.close()
    fo.close()
    print "Closed reader and writer after",li,"lines"
    

    File xgroup.csv is shown next.

    unique-ID,person_1,person,_2,person_3,person_4,...,person_20
    12,Tom,Sally,Mike
    16,Joe,Mike
    5,Joe,Sally
    1,Sally,Mike,Tom
    6,Sally,Tom,Mike
    2,Jared,Joe,Mike,John,...,Carl
    

    Upon reading data as above, the program prints Opened reader and writer and Closed reader and writer after 7 lines and produces output in file xgroup3.csv as shown next.

    unique-ID,person_1/person/_2/person_3/person_4/.../person_20,.../_2/person/person_1/person_20/person_3/person_4,person_1/.../_2/person/person_20/person_3/person_4,person_1,person,_2,person_3,person_4,...,person_20
    12,Tom/Sally/Mike,Mike/Sally/Tom,Tom/Mike/Sally,Tom,Sally,Mike
    16,Joe/Mike,Joe/Mike,Joe/Mike,Joe,Mike
    5,Joe/Sally,Joe/Sally,Joe/Sally,Joe,Sally
    1,Sally/Mike/Tom,Mike/Sally/Tom,Sally/Mike/Tom,Sally,Mike,Tom
    6,Sally/Tom/Mike,Mike/Sally/Tom,Sally/Mike/Tom,Sally,Tom,Mike
    2,Jared/Joe/Mike/John/.../Carl,.../Carl/Jared/Joe/John/Mike,Jared/.../Carl/Joe/John/Mike,Jared,Joe,Mike,John,...,Carl
    

    Note, given a data line like

    5,Joe,Sally,,,,,
    

    instead of

    5,Joe,Sally
    

    the program as above produces

    5,Joe/Sally/////,/////Joe/Sally,Joe//////Sally,Joe,Sally,,,,,
    

    instead of

    5,Joe/Sally,Joe/Sally,Joe/Sally,Joe,Sally
    

    If that's a problem, filter out empty entries. For example, if
    row=['5', 'Joe', 'Sally', '', '', '', '', ''], then '/'.join(row[1:]) produces
    'Joe/Sally/////', while
    '/'.join(filter(lambda x: x, row[1:])) and
    '/'.join(x for x in row[1:] if x) and
    '/'.join(filter(len, row[1:])) produce
    'Joe/Sally' .