Search code examples
pythonregexstringpandasfuzzy-comparison

In python: How to find match of string in same row, compare part of (fuzzy) matched string to list?


I have a matching problem that I've tried to solve, but have not found a way to do so. I'm new to python, so there might well be simple methods for doing this. I've searched the questions, but haven't found anything that quite gets what I need.

PURPOSE

I need to identify a "type" of person (to be defined below), and if they belong to one of a list of groups, note the group. The result should be a single row in a dataframe that contains an observation of the person's name, their type, and their affiliation. The data all exist as strings, and there are multiple parts of the strings I will have to parse out.

DATA

My data currently look like this:

lastname       person1                person2                person3
smith          matt smith, company a  tom jones, company b   karen cook, company c
jones          matt smith, company a  tom jones, company b   karen cook, company c
cook           matt smith, company a  tom jones, company b   karen cook, company c

Also, I have a list of companies

CompanyName
company a
company b
company d

The steps are, roughly (in pseudocod-ish), for every row in the dataframe

If lastname is in person1:
   if person1 company is in list of CompanyNames:
       df['persontype'] = 1 #where 1 indicates being in a listed company
       df['affiliation'] = company name #the matched item from CompanyName list
   else if lastname is in person2: 
       if person2 company is in list of CompanyNames: 
            df['personatype'] = 1
            df['affiliation'] = company name
       else if lastname is in person3: 
            if person3 company name is in list of CompanyNames: 
                df['persontype'] = 1
                df['affiliation'] = company name
            else: 
                df['persontype'] = 0 #indicating a person in an unlisted company
                df['affiliation'] = 'Unlisted'

I know that's terrible to read, but I'm just starting out. (All comments welcome.)

The issues are:

  1. The lastname is short, so the match to the list of persons can be a simple is 'a' in 'b', but I expect the company names within the person1...N strings to have typos or idiosyncrasies (think "C-SPAN" and "CSPAN" as possible versions of one name).
  2. The number of columns for persons is static. That is, it is fixed (though many entries will be empty, so that while the number of columns is fixed at 100, some rows will only have 1 or 2 entries).
  3. The formatting of the person1...N strings is not consistent. That is, there may be multiple commas, periods, hyphens, etc. Thus, splitting on a comma a getting everything after it won't always give the company name.

UPDATE

Thanks to great questions, I realize I've not been detailed enough on the format of the data. The entries under the person1..N columns are highly variable. Examples of actual data from those columns below:

person1
Hill, Daniel O., Assistant Administrator for Technology, U.S. Small Business Administration
person2
R&D Credit Coalition, R. Randall Capps

This is why I didn't try using .split() methods right away. The order of the tokens are not uniform, some have a position/job as well as the name of the organization, and the names are not always in "phone book" order.

DESIRED RESULT

What I would like to end with is a dataframe with this (in addition to all the other columns)

lastname  ...   persontype     affiliation 
smith           1              company a
jones           1              company b
cook            0              None

Thanks a TON for any pointers, guidance, etc. on this. I've been working with the fuzzywuzzy module, but haven't had any success.

NOTE: I'm not providing sample code here because the failure isn't a single line or function. Anything I provide would have too many ancillary problems to be worth your time to look at.


Solution

  • Data format

    The data format provided isn't clear about association or delimiters, so I am going to restate what I think you meant. (in a python syntactic structure)

    Your data:

    lastname       person1                person2                person3
    smith          matt smith, company a  tom jones, company b   karen cook, company c
    jones          matt smith, company a  tom jones, company b   karen cook, company c
    cook           matt smith, company a  tom jones, company b   karen cook, company c
    

    Also, I have a list of companies

    CompanyName
    company a
    company b
    company d
    

    What I think it means:

    lastname = ['smith','jones','cook']
    companies = ['company a','company b']
    affiliations = {'matt smith':'company a','tom jones':'company b','karen cook':'company c'}
    

    How I think it should be formatted

    lastname = ['smith','jones','cook']
    companies = ['company a','company b']
    names = {'smith':'matt smith','jones':'tom jones','cook':'karen cook'}
    affiliations = {'matt smith':'company a','tom jones':'company b','karen cook':'company c'}
    

    Desired result

    The idea is to provide the following output:

    lastname  ...   persontype     affiliation 
    smith           1              company a
    jones           1              company b
    cook            0              None
    

    How to get the desired result

    Note: This is using the format that I think would be best.

    print('lastname\tpersontype\taffiliation\n')
    for last in lastname:
        if last in names:
            full = names[last]
            if full in affiliations:
                comp = affiliations[full]
                if comp in companies:
                    print(full+'\t1\t'+comp)
                else:
                    print(full+'\t0\tNone')
            else:
                print(full+' is not listed in affiliations.')
        else:
            print(last+' has no matches in names.')
        print('\n')
    

    I don't know if you'll be able to create the data format that I think you should use. Though if your data is in a plain text format with the shape you provide, it should be fairly straightforward to read the file and process the lines.

    If you want to print it back out to a file, that would be doable as well, study the open() and write() built in functions.

    I hope this is helpful. Enjoy!