I have a matching problem that I've tried to solve, but have not found a way to do so. I'm new to python, so there might well be simple methods for doing this. I've searched the questions, but haven't found anything that quite gets what I need.
PURPOSE
I need to identify a "type" of person (to be defined below), and if they belong to one of a list of groups, note the group. The result should be a single row in a dataframe that contains an observation of the person's name, their type, and their affiliation. The data all exist as strings, and there are multiple parts of the strings I will have to parse out.
DATA
My data currently look like this:
lastname person1 person2 person3
smith matt smith, company a tom jones, company b karen cook, company c
jones matt smith, company a tom jones, company b karen cook, company c
cook matt smith, company a tom jones, company b karen cook, company c
Also, I have a list of companies
CompanyName
company a
company b
company d
The steps are, roughly (in pseudocod-ish), for every row in the dataframe
If lastname is in person1:
if person1 company is in list of CompanyNames:
df['persontype'] = 1 #where 1 indicates being in a listed company
df['affiliation'] = company name #the matched item from CompanyName list
else if lastname is in person2:
if person2 company is in list of CompanyNames:
df['personatype'] = 1
df['affiliation'] = company name
else if lastname is in person3:
if person3 company name is in list of CompanyNames:
df['persontype'] = 1
df['affiliation'] = company name
else:
df['persontype'] = 0 #indicating a person in an unlisted company
df['affiliation'] = 'Unlisted'
I know that's terrible to read, but I'm just starting out. (All comments welcome.)
The issues are:
is 'a' in 'b'
, but I expect the company names within the person1...N strings to have typos or idiosyncrasies (think "C-SPAN" and "CSPAN" as possible versions of one name).UPDATE
Thanks to great questions, I realize I've not been detailed enough on the format of the data. The entries under the person1..N
columns are highly variable. Examples of actual data from those columns below:
person1
Hill, Daniel O., Assistant Administrator for Technology, U.S. Small Business Administration
person2
R&D Credit Coalition, R. Randall Capps
This is why I didn't try using .split()
methods right away. The order of the tokens are not uniform, some have a position/job as well as the name of the organization, and the names are not always in "phone book" order.
DESIRED RESULT
What I would like to end with is a dataframe with this (in addition to all the other columns)
lastname ... persontype affiliation
smith 1 company a
jones 1 company b
cook 0 None
Thanks a TON for any pointers, guidance, etc. on this. I've been working with the fuzzywuzzy module, but haven't had any success.
NOTE: I'm not providing sample code here because the failure isn't a single line or function. Anything I provide would have too many ancillary problems to be worth your time to look at.
Data format
The data format provided isn't clear about association or delimiters, so I am going to restate what I think you meant. (in a python syntactic structure)
Your data:
lastname person1 person2 person3
smith matt smith, company a tom jones, company b karen cook, company c
jones matt smith, company a tom jones, company b karen cook, company c
cook matt smith, company a tom jones, company b karen cook, company c
Also, I have a list of companies
CompanyName
company a
company b
company d
What I think it means:
lastname = ['smith','jones','cook']
companies = ['company a','company b']
affiliations = {'matt smith':'company a','tom jones':'company b','karen cook':'company c'}
How I think it should be formatted
lastname = ['smith','jones','cook']
companies = ['company a','company b']
names = {'smith':'matt smith','jones':'tom jones','cook':'karen cook'}
affiliations = {'matt smith':'company a','tom jones':'company b','karen cook':'company c'}
Desired result
The idea is to provide the following output:
lastname ... persontype affiliation
smith 1 company a
jones 1 company b
cook 0 None
How to get the desired result
Note: This is using the format that I think would be best.
print('lastname\tpersontype\taffiliation\n')
for last in lastname:
if last in names:
full = names[last]
if full in affiliations:
comp = affiliations[full]
if comp in companies:
print(full+'\t1\t'+comp)
else:
print(full+'\t0\tNone')
else:
print(full+' is not listed in affiliations.')
else:
print(last+' has no matches in names.')
print('\n')
I don't know if you'll be able to create the data format that I think you should use. Though if your data is in a plain text format with the shape you provide, it should be fairly straightforward to read the file and process the lines.
If you want to print it back out to a file, that would be doable as well, study the open() and write() built in functions.
I hope this is helpful. Enjoy!