Search code examples
pythonrefactoringtext-parsing

Best approach to refactoring a python function


I have a messy function I am working to refactor to be more effecient and readable. My python skill are beginner to intermediate at best and I imagine there is a much cleaner way to accomplish this task.

The function below takes in a string that has various business contact related information in it. The information is separated by colons. The business name is always the first field so it can be extracted easy but the rest of the "columns (data between the colons) may or may not be included and is not always in the same order.

The function takes two parameters, 1) rowdata (string containing the examples below) and 2) the data element that I am looking to get returned.

# Business Contact Information
def parseBusinessContactInformation(self,rowdata,element):

    ## Process Business Contact Information
    ## example rowdata = "Business Name, LLC : Business DBA : Email- person@email.com : Phone- 1234567890 : Website- www.site.com"
    ## example rowdata = "Business Name, LLC : Email- person@email.com : Phone- 1234567890 : Website- www.site.com"
    ## example rowdata = "Business Name, LLC : Business DBA : Phone- 1234567890 : Website- www.site.com"
    ## example rowdata = "Business Name, LLC : Phone- 1234567890"
  
    businessName = None
    businessDba = None
    businessPhone = None
    businessEmail = None
    businessWebsite = None
    
    # Split rowdata on :
    contactData = rowdata.split(':')

    ## [0] - business name should always be present
    businessName = contactData[0].strip()
    
    ## [1] - doing_business_as or another field if not present
    if 1 < len(contactData) and re.search('email',contactData[1].lower()):
        contactTemp = contactData[1].split('-')
        businessEmail = contactTemp[1].strip()
        businessDba = contactData[0].strip()
    elif 1 < len(contactData) and re.search('phone',contactData[1].lower()):
        contactTemp = contactData[1].split('-')
        businessPhone = contactTemp[1].strip()
        businessDba = contactData[0].strip()
    elif 1 < len(contactData) and re.search('website',contactData[1].lower()):
        contactTemp = contactData[1].split('-')
        businessWebsite = contactTemp[1].strip()
        businessDba = contactData[0].strip()
    elif 1 < len(contactData) and not re.search(r'(phone|email|website)',contactData[1].lower()):
        businessDba = contactData[1].strip()
    else:
        businessDba = self.dataNotAvailableMessage
    
    ## [2] - phone or email or website
    if 2 < len(contactData) and re.search('email',contactData[2].lower()):
        contactTemp = contactData[2].split('-')
        businessEmail = contactTemp[1].strip()
    elif 2 < len(contactData) and re.search('phone',contactData[2].lower()):
        contactTemp = contactData[2].split('-')
        businessPhone = contactTemp[1].strip()
    elif 2 < len(contactData) and re.search('website',contactData[2].lower()):
        contactTemp = contactData[2].split('-')
        businessWebsite = contactTemp[1].strip()
    
    ## [3] - phone or email or website
    if 3 < len(contactData) and re.search('email',contactData[3].lower()):
        contactTemp = contactData[3].split('-')
        businessEmail = contactTemp[1].strip()
    elif 3 < len(contactData) and re.search('phone',contactData[3].lower()):
        contactTemp = contactData[3].split('-')
        businessPhone = contactTemp[1].strip()
    elif 3 < len(contactData) and re.search('website',contactData[3].lower()):
        contactTemp = contactData[3].split('-')
        businessWebsite = contactTemp[1].strip()
    
    if element == "businessName":
        return businessName
    elif element == "businessDba":
        return businessDba
    elif element == "businessPhone":
        return businessPhone
    elif element == "businessEmail":
        return businessEmail
    elif element == "businessWebsite":
        return businessWebsite
    else:
        return self.dataNotAvailableMessage

I am trying to understand a better way to do this.


Solution

  • Refactoring is a process that is cumulative. You have a comprehensive description of the method in Refactoring by Martin Fowler and Kent Beck.

    Its heart is a series of small behavior preserving transformations. (Martin Fowler, https://refactoring.com/)

    The most important part is: "small" and "behavior preserving". The word "small" is self explanatory, but "behavior preserving" should be ensured by unit tests.

    Preliminary remark: I suggest you stick with PEP 8 Style Guide.

    Behavior preserving

    Replace your comment by a docstring (https://www.python.org/dev/peps/pep-0008/#id33). This is very useful because you write some unit tests inside the docstring (a.k.a. doctests).

    class MyParser:
        dataNotAvailableMessage = "dataNotAvailableMessage"
    
        # Business Contact Information
        def parseBusinessContactInformation(self,rowdata,element):
            """Process Business Contact Information
            
            Examples:
            >>> p = MyParser()
            >>> p.parseBusinessContactInformation("Business Name, LLC : Business DBA : Email- person@email.com : Phone- 1234567890 : Website- www.site.com", "businessPhone")
            '1234567890'
            
            >>> p.parseBusinessContactInformation("Business Name, LLC : Email- person@email.com : Phone- 1234567890 : Website- www.site.com", "businessName")
            'Business Name, LLC'
            
            >>> p.parseBusinessContactInformation("Business Name, LLC : Business DBA : Phone- 1234567890 : Website- www.site.com", "businessDba")
            'Business DBA'
            
            >>> p.parseBusinessContactInformation("Business Name, LLC : Phone- 1234567890", "businessEmail") is None
            True
            
            >>> p.parseBusinessContactInformation("Business Name, LLC : Phone- 1234567890", "?") 
            'dataNotAvailableMessage'
            
            """
    
            ...
            
    import doctest
    doctest.testmod()            
      
    

    You should write more unit tests (use https://docs.python.org/3/library/unittest.html to avoid flooding the docstring) to secure the current behavior, but that's a good start.

    Now, a small transformation: look at those (el)if 1 < len(contactData) and ... lines. You can test the length just once:

    if 1 < len(contactData):
        if re.search('email',contactData[1].lower()):
            contactTemp = contactData[1].split('-')
            businessEmail = contactTemp[1].strip()
            businessDba = contactData[0].strip()
        elif re.search('phone',contactData[1].lower()):
            contactTemp = contactData[1].split('-')
            businessPhone = contactTemp[1].strip()
            businessDba = contactData[0].strip()
        elif re.search('website',contactData[1].lower()):
            contactTemp = contactData[1].split('-')
            businessWebsite = contactTemp[1].strip()
            businessDba = contactData[0].strip()
        elif not re.search(r'(phone|email|website)',contactData[1].lower()):
            businessDba = contactData[1].strip()
        else:
            businessDba = self.dataNotAvailableMessage
    else:
        businessDba = self.dataNotAvailableMessage
    

    You notice that the penultimate else is not reachable: either you have phone, email, website or not:

    if 1 < len(contactData):
        if re.search('email',contactData[1].lower()):
            contactTemp = contactData[1].split('-')
            businessEmail = contactTemp[1].strip()
            businessDba = contactData[0].strip()
        elif re.search('phone',contactData[1].lower()):
            contactTemp = contactData[1].split('-')
            businessPhone = contactTemp[1].strip()
            businessDba = contactData[0].strip()
        elif re.search('website',contactData[1].lower()):
            contactTemp = contactData[1].split('-')
            businessWebsite = contactTemp[1].strip()
            businessDba = contactData[0].strip()
        else:
            businessDba = contactData[1].strip()
    else:
        businessDba = self.dataNotAvailableMessage
    

    Do the same for [2] and [3]:

    if 2 < len(contactData):
        if re.search('email',contactData[2].lower()):
            contactTemp = contactData[2].split('-')
            businessEmail = contactTemp[1].strip()
        elif re.search('phone',contactData[2].lower()):
            contactTemp = contactData[2].split('-')
            businessPhone = contactTemp[1].strip()
        elif re.search('website',contactData[2].lower()):
            contactTemp = contactData[2].split('-')
            businessWebsite = contactTemp[1].strip()
        
    if 3 < len(contactData):
        if re.search('email',contactData[3].lower()):
            contactTemp = contactData[3].split('-')
            businessEmail = contactTemp[1].strip()
        elif re.search('phone',contactData[3].lower()):
            contactTemp = contactData[3].split('-')
            businessPhone = contactTemp[1].strip()
        elif re.search('website',contactData[3].lower()):
            contactTemp = contactData[3].split('-')
            businessWebsite = contactTemp[1].strip()
    

    Now you see a clear pattern. Except the first part assignements businessDba, you have clearly three times the same process. First, we isolate the assignement of businessDba in the first part:

    if 1 < len(contactData):
        if re.search('(email|phone|website)',contactData[1].lower()):
            businessDba = contactData[0].strip()
        else:
            businessDba = contactData[1].strip()
    else:
        businessDba = self.dataNotAvailableMessage
    

    And then:

    if 1 < len(contactData):
        if re.search('email',contactData[1].lower()):
            contactTemp = contactData[1].split('-')
            businessEmail = contactTemp[1].strip()
        elif re.search('phone',contactData[1].lower()):
            contactTemp = contactData[1].split('-')
            businessPhone = contactTemp[1].strip()
        elif re.search('website',contactData[1].lower()):
            contactTemp = contactData[1].split('-')
            businessWebsite = contactTemp[1].strip()
    

    Before we go further, we can remove the line

    businessName = None
    businessDba = None
    

    Since businessName and businessDba have always a value. And replace the new line:

    businessDba = contactData[0].strip()
    

    By:

    businessDba = businessName
    

    That makes explicit the fallback.

    Now, we have three times the same process. A loop is a good idea:

    for i in range(1, 3):
        if i >= len(contactData):
            break
            
        if re.search('email',contactData[i].lower()):
            contactTemp = contactData[i].split('-')
            businessEmail = contactTemp[1].strip()
        elif re.search('phone',contactData[i].lower()):
            contactTemp = contactData[i].split('-')
            businessPhone = contactTemp[1].strip()
        elif re.search('website',contactData[i].lower()):
            contactTemp = contactData[i].split('-')
            businessWebsite = contactTemp[1].strip()
    

    We can extract contactTemp = , even if it will not always be useful:

    for i in range(1, 3):
        if i >= len(contactData):
            break
        contactTemp = contactData[i].split('-')
            
        if re.search('email',contactData[i].lower()):
            businessEmail = contactTemp[1].strip()
        elif re.search('phone',contactData[i].lower()):
            businessPhone = contactTemp[1].strip()
        elif re.search('website',contactData[i].lower()):
            businessWebsite = contactTemp[1].strip()
    

    That's better, but I find the last part (if element == ...) really cumbersome: you test the element against all possibilities. One would like a dictionary here. For a small transformation, we can write:

    d = {
        "businessName": businessName,
        "businessDba": businessDba,
        "businessPhone": businessPhone,
        "businessEmail": businessEmail,
        "businessWebsite": businessWebsite
    }
    return d.get(element, self.dataNotAvailableMessage)
    

    Now, instead of initializing the dict at the end, we can create it and update it on the fly:

        d = {
            "businessPhone": None,
            "businessEmail": None,
            "businessWebsite": None
        }
        
        # Split rowdata on :
        contactData = rowdata.split(':')
    
        ## [0] - business name should always be present
        d["businessName"] = contactData[0].strip()
    
        if 1 < len(contactData):
            if re.search('(email|phone|website)',contactData[1].lower()):
                d["businessDba"] = d["businessName"]
            else:
                d["businessDba"] = contactData[1].strip()
        else:
            d["businessDba"] = self.dataNotAvailableMessage
    
        for i in range(1, 4):
            if i >= len(contactData):
                break
                
            contactTemp = contactData[i].split('-')
            if re.search('email',contactData[i].lower()):
                d["businessEmail"] = contactTemp[1].strip()
            elif re.search('phone',contactData[i].lower()):
                d["businessPhone" = contactTemp[1].strip()
            elif re.search('website',contactData[i].lower()):
                d["businessWebsite"] = contactTemp[1].strip()
        
        return d.get(element, self.dataNotAvailableMessage)
    

    I ran the tests on every modification and it still works, but it is not so easy to read. We can extract a function that creates the dict:

    def parseBusinessContactInformation(self, rowdata, element):
        d = self._parseBusinessContactInformation(rowdata)
        return d.get(element, self.dataNotAvailableMessage)
    
    def _parseBusinessContactInformation(self, rowdata):
        ...
    

    With a small behavior change

    That's not bad, but we can improve this with a small behavior change (I hope you will be okay with this new behavior!):

        for i in range(1, 4):
            if i >= len(contactData):
                break
                
            contactTemp = contactData[i].split('-')
            if len(contactTemp) > 1:
                d["business" + contactTemp[0].strip()] = contactTemp[1].strip()
            
    

    What is the behavior change? Simply, we now accept something like

    >>> p = MyParser()
    >>> p.parseBusinessContactInformation("Business Name, LLC : Business DBA : Foo- Bar", "businessFoo")
    'Bar'
    

    Since we accept more elements, we should change the loop range:

        for i in range(1, len(contactData)):
            ...
    

    It is time to focus on a slight inconsistance: why can businessDba have the value self.dataNotAvailableMessage that was created for the case of a non existing element? We should use None:

        d = {
            "businessDba": None,
            ...
        }
    

    and remove those two lines:

        else:
            d["businessDba"] = self.dataNotAvailableMessage
    

    Then this can be simplified:

        if 1 < len(contactData):
            if "-" in contactData[1]:
                d["businessDba"] = d["businessName"]
            else:
                d["businessDba"] = contactData[1].strip()
    

    Here's the code:

    def parseBusinessContactInformation(self,rowdata,element):
        """Process Business Contact Information
        
        Examples:
        >>> p = MyParser()
        >>> p.parseBusinessContactInformation("Business Name, LLC : Business DBA : Email- person@email.com : Phone- 1234567890 : Website- www.site.com", "businessPhone")
        '1234567890'
        
        >>> p.parseBusinessContactInformation("Business Name, LLC : Email- person@email.com : Phone- 1234567890 : Website- www.site.com", "businessName")
        'Business Name, LLC'
        
        >>> p.parseBusinessContactInformation("Business Name, LLC : Business DBA : Phone- 1234567890 : Website- www.site.com", "businessDba")
        'Business DBA'
        
        >>> p.parseBusinessContactInformation("Business Name, LLC : Phone- 1234567890", "businessEmail") is None
        True
        
        >>> p.parseBusinessContactInformation("Business Name, LLC : Phone- 1234567890", "?") 
        'dataNotAvailableMessage'
        
        >>> p.parseBusinessContactInformation("Business Name, LLC : Business DBA : Foo- Bar", "businessFoo")
        'Bar'
        
        """
        d = self._parseBusinessContactInformation(rowdata)
        return d.get(element, self.dataNotAvailableMessage)
      
    def _parseBusinessContactInformation(self,rowdata):
        d = {
            "businessDba": None,
            "businessPhone": None,
            "businessEmail": None,
            "businessWebsite": None
        }
        
        # Split rowdata on :
        contactData = rowdata.split(':')
    
        ## [0] - business name should always be present
        d["businessName"] = contactData[0].strip()
    
        if 1 < len(contactData):
            if "-" in contactData[1]:
                d["businessDba"] = d["businessName"]
            else:
                d["businessDba"] = contactData[1].strip()
    
        for i in range(1, len(contactData)):
            contactTemp = contactData[i].split('-')
            if len(contactTemp) > 1:
                d["business" + contactTemp[0].strip()] = contactTemp[1].strip()
    
        return d
    

    The final touch: switch to snake case, make a get and a parse function: parse returns a dict while get returns a value:

    data_not_available_message = "dataNotAvailableMessage"
    
    def get_business_contact_information(self, rowdata, element):
        """Process Business Contact Information
        
        Examples:
        >>> p = MyParser()
        >>> p.get_business_contact_information("Business Name, LLC : Business DBA : Email- person@email.com : Phone- 1234567890 : Website- www.site.com", "businessPhone")
        '1234567890'
        
        >>> p.get_business_contact_information("Business Name, LLC : Email- person@email.com : Phone- 1234567890 : Website- www.site.com", "businessName")
        'Business Name, LLC'
        
        >>> p.get_business_contact_information("Business Name, LLC : Business DBA : Phone- 1234567890 : Website- www.site.com", "businessDba")
        'Business DBA'
        
        >>> p.get_business_contact_information("Business Name, LLC : Phone- 1234567890", "businessEmail") is None
        True
        
        >>> p.get_business_contact_information("Business Name, LLC : Phone- 1234567890", "?") 
        'dataNotAvailableMessage'
        
        >>> p.get_business_contact_information("Business Name, LLC : Business DBA : Foo- Bar", "businessFoo")
        'Bar'
        
        :param rowdata: ...
        :param element: ...
        :return: ...
        """
        d = self._parse_business_contact_information(rowdata)
        return d.get(element, self.data_not_available_message)
    

    With some cosmetic changes to make it more pythonic:

    def parse_business_contact_information(self, rowdata):
        """Process Business Contact Information
        
        Examples:
        >>> p = MyParser()
        >>> p.parse_business_contact_information("Business Name, LLC : Business DBA : Email- person@email.com : Phone- 1234567890 : Website- www.site.com") == {
        ... 'businessDba': 'Business DBA', 'businessPhone': '1234567890', 'businessEmail': 'person@email.com', 
        ... 'businessWebsite': 'www.site.com', 'businessName': 'Business Name, LLC'}        
        True
    
        >>> p.parse_business_contact_information("Business Name, LLC : Phone- 1234567890") == {
        ... 'businessDba': 'Business Name, LLC', 'businessPhone': '1234567890', 'businessEmail': None, 
        ... 'businessWebsite': None, 'businessName': 'Business Name, LLC'}
        True
        
        :param rowdata: ...
        :return: ...
        """
        d = dict.fromkeys(("businessDba", "businessPhone", 
                           "businessEmail", "businessWebsite"))
        
        name, *others = rowdata.split(':') # destructuring assignment
    
        d["businessName"] = name.strip()
        if not others:
            return d
        
        if "-" in others[0]:
            d["businessDba"] = d["businessName"]
        else:
            d["businessDba"] = others[0].strip()
            others.pop(0) # consume others[0]
    
        for data in others:
            try:
                key, value = data.split('-', 1) # a- b-c => a, b-c
            except ValueError: # too many/not enough values to unpack
                print("Element {} should have a dash".format(data))
            else:
                d["business" + key.strip()] = value.strip()
    
        return d
    

    The code is not perfect, but it is clearer than it was, at least to my eyes.

    To summarize the method:

    1. write unit tests to secure the behavior;
    2. make small transformations that preserve the behavior and improve the readabilty. Factorize what you can and don't focus on performance here;
    3. continue until you have something clear / stop when you go around in circles and make unnecessary modifications;
    4. if necessary, improve performance.