I have a messy function I am working to refactor to be more effecient and readable. My python skill are beginner to intermediate at best and I imagine there is a much cleaner way to accomplish this task.
The function below takes in a string that has various business contact related information in it. The information is separated by colons. The business name is always the first field so it can be extracted easy but the rest of the "columns (data between the colons) may or may not be included and is not always in the same order.
The function takes two parameters, 1) rowdata (string containing the examples below) and 2) the data element that I am looking to get returned.
# Business Contact Information
def parseBusinessContactInformation(self,rowdata,element):
## Process Business Contact Information
## example rowdata = "Business Name, LLC : Business DBA : Email- person@email.com : Phone- 1234567890 : Website- www.site.com"
## example rowdata = "Business Name, LLC : Email- person@email.com : Phone- 1234567890 : Website- www.site.com"
## example rowdata = "Business Name, LLC : Business DBA : Phone- 1234567890 : Website- www.site.com"
## example rowdata = "Business Name, LLC : Phone- 1234567890"
businessName = None
businessDba = None
businessPhone = None
businessEmail = None
businessWebsite = None
# Split rowdata on :
contactData = rowdata.split(':')
## [0] - business name should always be present
businessName = contactData[0].strip()
## [1] - doing_business_as or another field if not present
if 1 < len(contactData) and re.search('email',contactData[1].lower()):
contactTemp = contactData[1].split('-')
businessEmail = contactTemp[1].strip()
businessDba = contactData[0].strip()
elif 1 < len(contactData) and re.search('phone',contactData[1].lower()):
contactTemp = contactData[1].split('-')
businessPhone = contactTemp[1].strip()
businessDba = contactData[0].strip()
elif 1 < len(contactData) and re.search('website',contactData[1].lower()):
contactTemp = contactData[1].split('-')
businessWebsite = contactTemp[1].strip()
businessDba = contactData[0].strip()
elif 1 < len(contactData) and not re.search(r'(phone|email|website)',contactData[1].lower()):
businessDba = contactData[1].strip()
else:
businessDba = self.dataNotAvailableMessage
## [2] - phone or email or website
if 2 < len(contactData) and re.search('email',contactData[2].lower()):
contactTemp = contactData[2].split('-')
businessEmail = contactTemp[1].strip()
elif 2 < len(contactData) and re.search('phone',contactData[2].lower()):
contactTemp = contactData[2].split('-')
businessPhone = contactTemp[1].strip()
elif 2 < len(contactData) and re.search('website',contactData[2].lower()):
contactTemp = contactData[2].split('-')
businessWebsite = contactTemp[1].strip()
## [3] - phone or email or website
if 3 < len(contactData) and re.search('email',contactData[3].lower()):
contactTemp = contactData[3].split('-')
businessEmail = contactTemp[1].strip()
elif 3 < len(contactData) and re.search('phone',contactData[3].lower()):
contactTemp = contactData[3].split('-')
businessPhone = contactTemp[1].strip()
elif 3 < len(contactData) and re.search('website',contactData[3].lower()):
contactTemp = contactData[3].split('-')
businessWebsite = contactTemp[1].strip()
if element == "businessName":
return businessName
elif element == "businessDba":
return businessDba
elif element == "businessPhone":
return businessPhone
elif element == "businessEmail":
return businessEmail
elif element == "businessWebsite":
return businessWebsite
else:
return self.dataNotAvailableMessage
I am trying to understand a better way to do this.
Refactoring is a process that is cumulative. You have a comprehensive description of the method in Refactoring by Martin Fowler and Kent Beck.
Its heart is a series of small behavior preserving transformations. (Martin Fowler, https://refactoring.com/)
The most important part is: "small" and "behavior preserving". The word "small" is self explanatory, but "behavior preserving" should be ensured by unit tests.
Preliminary remark: I suggest you stick with PEP 8 Style Guide.
Replace your comment by a docstring (https://www.python.org/dev/peps/pep-0008/#id33). This is very useful because you write some unit tests inside the docstring (a.k.a. doctests).
class MyParser:
dataNotAvailableMessage = "dataNotAvailableMessage"
# Business Contact Information
def parseBusinessContactInformation(self,rowdata,element):
"""Process Business Contact Information
Examples:
>>> p = MyParser()
>>> p.parseBusinessContactInformation("Business Name, LLC : Business DBA : Email- person@email.com : Phone- 1234567890 : Website- www.site.com", "businessPhone")
'1234567890'
>>> p.parseBusinessContactInformation("Business Name, LLC : Email- person@email.com : Phone- 1234567890 : Website- www.site.com", "businessName")
'Business Name, LLC'
>>> p.parseBusinessContactInformation("Business Name, LLC : Business DBA : Phone- 1234567890 : Website- www.site.com", "businessDba")
'Business DBA'
>>> p.parseBusinessContactInformation("Business Name, LLC : Phone- 1234567890", "businessEmail") is None
True
>>> p.parseBusinessContactInformation("Business Name, LLC : Phone- 1234567890", "?")
'dataNotAvailableMessage'
"""
...
import doctest
doctest.testmod()
You should write more unit tests (use https://docs.python.org/3/library/unittest.html to avoid flooding the docstring) to secure the current behavior, but that's a good start.
Now, a small transformation: look at those (el)if 1 < len(contactData) and ...
lines. You can test the length just once:
if 1 < len(contactData):
if re.search('email',contactData[1].lower()):
contactTemp = contactData[1].split('-')
businessEmail = contactTemp[1].strip()
businessDba = contactData[0].strip()
elif re.search('phone',contactData[1].lower()):
contactTemp = contactData[1].split('-')
businessPhone = contactTemp[1].strip()
businessDba = contactData[0].strip()
elif re.search('website',contactData[1].lower()):
contactTemp = contactData[1].split('-')
businessWebsite = contactTemp[1].strip()
businessDba = contactData[0].strip()
elif not re.search(r'(phone|email|website)',contactData[1].lower()):
businessDba = contactData[1].strip()
else:
businessDba = self.dataNotAvailableMessage
else:
businessDba = self.dataNotAvailableMessage
You notice that the penultimate else
is not reachable: either you have phone
, email
, website
or not:
if 1 < len(contactData):
if re.search('email',contactData[1].lower()):
contactTemp = contactData[1].split('-')
businessEmail = contactTemp[1].strip()
businessDba = contactData[0].strip()
elif re.search('phone',contactData[1].lower()):
contactTemp = contactData[1].split('-')
businessPhone = contactTemp[1].strip()
businessDba = contactData[0].strip()
elif re.search('website',contactData[1].lower()):
contactTemp = contactData[1].split('-')
businessWebsite = contactTemp[1].strip()
businessDba = contactData[0].strip()
else:
businessDba = contactData[1].strip()
else:
businessDba = self.dataNotAvailableMessage
Do the same for [2] and [3]:
if 2 < len(contactData):
if re.search('email',contactData[2].lower()):
contactTemp = contactData[2].split('-')
businessEmail = contactTemp[1].strip()
elif re.search('phone',contactData[2].lower()):
contactTemp = contactData[2].split('-')
businessPhone = contactTemp[1].strip()
elif re.search('website',contactData[2].lower()):
contactTemp = contactData[2].split('-')
businessWebsite = contactTemp[1].strip()
if 3 < len(contactData):
if re.search('email',contactData[3].lower()):
contactTemp = contactData[3].split('-')
businessEmail = contactTemp[1].strip()
elif re.search('phone',contactData[3].lower()):
contactTemp = contactData[3].split('-')
businessPhone = contactTemp[1].strip()
elif re.search('website',contactData[3].lower()):
contactTemp = contactData[3].split('-')
businessWebsite = contactTemp[1].strip()
Now you see a clear pattern. Except the first part assignements businessDba
, you have clearly three times the same process. First, we isolate the assignement of businessDba
in the first part:
if 1 < len(contactData):
if re.search('(email|phone|website)',contactData[1].lower()):
businessDba = contactData[0].strip()
else:
businessDba = contactData[1].strip()
else:
businessDba = self.dataNotAvailableMessage
And then:
if 1 < len(contactData):
if re.search('email',contactData[1].lower()):
contactTemp = contactData[1].split('-')
businessEmail = contactTemp[1].strip()
elif re.search('phone',contactData[1].lower()):
contactTemp = contactData[1].split('-')
businessPhone = contactTemp[1].strip()
elif re.search('website',contactData[1].lower()):
contactTemp = contactData[1].split('-')
businessWebsite = contactTemp[1].strip()
Before we go further, we can remove the line
businessName = None
businessDba = None
Since businessName
and businessDba
have always a value. And replace the new line:
businessDba = contactData[0].strip()
By:
businessDba = businessName
That makes explicit the fallback.
Now, we have three times the same process. A loop is a good idea:
for i in range(1, 3):
if i >= len(contactData):
break
if re.search('email',contactData[i].lower()):
contactTemp = contactData[i].split('-')
businessEmail = contactTemp[1].strip()
elif re.search('phone',contactData[i].lower()):
contactTemp = contactData[i].split('-')
businessPhone = contactTemp[1].strip()
elif re.search('website',contactData[i].lower()):
contactTemp = contactData[i].split('-')
businessWebsite = contactTemp[1].strip()
We can extract contactTemp =
, even if it will not always be useful:
for i in range(1, 3):
if i >= len(contactData):
break
contactTemp = contactData[i].split('-')
if re.search('email',contactData[i].lower()):
businessEmail = contactTemp[1].strip()
elif re.search('phone',contactData[i].lower()):
businessPhone = contactTemp[1].strip()
elif re.search('website',contactData[i].lower()):
businessWebsite = contactTemp[1].strip()
That's better, but I find the last part (if element == ...
) really cumbersome: you test the element
against all possibilities. One would like a dictionary here. For a small transformation, we can write:
d = {
"businessName": businessName,
"businessDba": businessDba,
"businessPhone": businessPhone,
"businessEmail": businessEmail,
"businessWebsite": businessWebsite
}
return d.get(element, self.dataNotAvailableMessage)
Now, instead of initializing the dict at the end, we can create it and update it on the fly:
d = {
"businessPhone": None,
"businessEmail": None,
"businessWebsite": None
}
# Split rowdata on :
contactData = rowdata.split(':')
## [0] - business name should always be present
d["businessName"] = contactData[0].strip()
if 1 < len(contactData):
if re.search('(email|phone|website)',contactData[1].lower()):
d["businessDba"] = d["businessName"]
else:
d["businessDba"] = contactData[1].strip()
else:
d["businessDba"] = self.dataNotAvailableMessage
for i in range(1, 4):
if i >= len(contactData):
break
contactTemp = contactData[i].split('-')
if re.search('email',contactData[i].lower()):
d["businessEmail"] = contactTemp[1].strip()
elif re.search('phone',contactData[i].lower()):
d["businessPhone" = contactTemp[1].strip()
elif re.search('website',contactData[i].lower()):
d["businessWebsite"] = contactTemp[1].strip()
return d.get(element, self.dataNotAvailableMessage)
I ran the tests on every modification and it still works, but it is not so easy to read. We can extract a function that creates the dict:
def parseBusinessContactInformation(self, rowdata, element):
d = self._parseBusinessContactInformation(rowdata)
return d.get(element, self.dataNotAvailableMessage)
def _parseBusinessContactInformation(self, rowdata):
...
That's not bad, but we can improve this with a small behavior change (I hope you will be okay with this new behavior!):
for i in range(1, 4):
if i >= len(contactData):
break
contactTemp = contactData[i].split('-')
if len(contactTemp) > 1:
d["business" + contactTemp[0].strip()] = contactTemp[1].strip()
What is the behavior change? Simply, we now accept something like
>>> p = MyParser()
>>> p.parseBusinessContactInformation("Business Name, LLC : Business DBA : Foo- Bar", "businessFoo")
'Bar'
Since we accept more element
s, we should change the loop range
:
for i in range(1, len(contactData)):
...
It is time to focus on a slight inconsistance: why can businessDba
have the value self.dataNotAvailableMessage
that was created for the case of a non existing element? We should use None
:
d = {
"businessDba": None,
...
}
and remove those two lines:
else:
d["businessDba"] = self.dataNotAvailableMessage
Then this can be simplified:
if 1 < len(contactData):
if "-" in contactData[1]:
d["businessDba"] = d["businessName"]
else:
d["businessDba"] = contactData[1].strip()
Here's the code:
def parseBusinessContactInformation(self,rowdata,element):
"""Process Business Contact Information
Examples:
>>> p = MyParser()
>>> p.parseBusinessContactInformation("Business Name, LLC : Business DBA : Email- person@email.com : Phone- 1234567890 : Website- www.site.com", "businessPhone")
'1234567890'
>>> p.parseBusinessContactInformation("Business Name, LLC : Email- person@email.com : Phone- 1234567890 : Website- www.site.com", "businessName")
'Business Name, LLC'
>>> p.parseBusinessContactInformation("Business Name, LLC : Business DBA : Phone- 1234567890 : Website- www.site.com", "businessDba")
'Business DBA'
>>> p.parseBusinessContactInformation("Business Name, LLC : Phone- 1234567890", "businessEmail") is None
True
>>> p.parseBusinessContactInformation("Business Name, LLC : Phone- 1234567890", "?")
'dataNotAvailableMessage'
>>> p.parseBusinessContactInformation("Business Name, LLC : Business DBA : Foo- Bar", "businessFoo")
'Bar'
"""
d = self._parseBusinessContactInformation(rowdata)
return d.get(element, self.dataNotAvailableMessage)
def _parseBusinessContactInformation(self,rowdata):
d = {
"businessDba": None,
"businessPhone": None,
"businessEmail": None,
"businessWebsite": None
}
# Split rowdata on :
contactData = rowdata.split(':')
## [0] - business name should always be present
d["businessName"] = contactData[0].strip()
if 1 < len(contactData):
if "-" in contactData[1]:
d["businessDba"] = d["businessName"]
else:
d["businessDba"] = contactData[1].strip()
for i in range(1, len(contactData)):
contactTemp = contactData[i].split('-')
if len(contactTemp) > 1:
d["business" + contactTemp[0].strip()] = contactTemp[1].strip()
return d
The final touch: switch to snake case, make a get
and a parse
function: parse
returns a dict while get
returns a value:
data_not_available_message = "dataNotAvailableMessage"
def get_business_contact_information(self, rowdata, element):
"""Process Business Contact Information
Examples:
>>> p = MyParser()
>>> p.get_business_contact_information("Business Name, LLC : Business DBA : Email- person@email.com : Phone- 1234567890 : Website- www.site.com", "businessPhone")
'1234567890'
>>> p.get_business_contact_information("Business Name, LLC : Email- person@email.com : Phone- 1234567890 : Website- www.site.com", "businessName")
'Business Name, LLC'
>>> p.get_business_contact_information("Business Name, LLC : Business DBA : Phone- 1234567890 : Website- www.site.com", "businessDba")
'Business DBA'
>>> p.get_business_contact_information("Business Name, LLC : Phone- 1234567890", "businessEmail") is None
True
>>> p.get_business_contact_information("Business Name, LLC : Phone- 1234567890", "?")
'dataNotAvailableMessage'
>>> p.get_business_contact_information("Business Name, LLC : Business DBA : Foo- Bar", "businessFoo")
'Bar'
:param rowdata: ...
:param element: ...
:return: ...
"""
d = self._parse_business_contact_information(rowdata)
return d.get(element, self.data_not_available_message)
With some cosmetic changes to make it more pythonic:
def parse_business_contact_information(self, rowdata):
"""Process Business Contact Information
Examples:
>>> p = MyParser()
>>> p.parse_business_contact_information("Business Name, LLC : Business DBA : Email- person@email.com : Phone- 1234567890 : Website- www.site.com") == {
... 'businessDba': 'Business DBA', 'businessPhone': '1234567890', 'businessEmail': 'person@email.com',
... 'businessWebsite': 'www.site.com', 'businessName': 'Business Name, LLC'}
True
>>> p.parse_business_contact_information("Business Name, LLC : Phone- 1234567890") == {
... 'businessDba': 'Business Name, LLC', 'businessPhone': '1234567890', 'businessEmail': None,
... 'businessWebsite': None, 'businessName': 'Business Name, LLC'}
True
:param rowdata: ...
:return: ...
"""
d = dict.fromkeys(("businessDba", "businessPhone",
"businessEmail", "businessWebsite"))
name, *others = rowdata.split(':') # destructuring assignment
d["businessName"] = name.strip()
if not others:
return d
if "-" in others[0]:
d["businessDba"] = d["businessName"]
else:
d["businessDba"] = others[0].strip()
others.pop(0) # consume others[0]
for data in others:
try:
key, value = data.split('-', 1) # a- b-c => a, b-c
except ValueError: # too many/not enough values to unpack
print("Element {} should have a dash".format(data))
else:
d["business" + key.strip()] = value.strip()
return d
The code is not perfect, but it is clearer than it was, at least to my eyes.
To summarize the method: