nlp python-3.7

In text preprocessing, contraction expansion does not recognise typographic single and double quotes


I am preprocessing text that consists of articles. One of the steps in my preprocessing code expands contractions such as "I've" and "I'm". The expansion works when I type the sample text myself, but it doesn't work on my article text. I also know the reason: the quotation marks are different characters (it looks like a difference in the font). For example, the sample text from the articles:

“I’m here because a Cabinet minister is needed.”

And below is the same text, typed by me:

"I'm here because a Cabinet minister is needed."

If you look carefully you can see the difference in the quotation marks (both single and double).
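
To make the difference visible, the characters can be inspected by code point. This is only a small sketch using the standard `unicodedata` module; the variable names `typed` and `pasted` are made up for illustration.

```python
import unicodedata

typed = "'"        # apostrophe typed on a keyboard
pasted = "\u2019"  # apostrophe as it appears in the article text

# Print the code point and official Unicode name of each quote character
for ch in (typed, pasted, '"', "\u201c", "\u201d"):
    print(f"U+{ord(ch):04X}", unicodedata.name(ch))
```

The typed apostrophe is U+0027 (APOSTROPHE), while the one in the articles is U+2019 (RIGHT SINGLE QUOTATION MARK), so a mapping keyed on "I'm" never matches "I’m".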

How do I solve this issue?

Below is the code which I am using for contractions.

def expand_contractions(row, contraction_mapping=CONTRACTION_MAP):
    Japan_3 = row['Articles']
    Japan_3 = Japan_3.apply(lambda x: str(x).replace("’", "'"))
    contractions_pattern = re.compile('({})'.format('|'.join(contraction_mapping.keys())),
                                      flags=re.IGNORECASE | re.DOTALL)

    def expand_match(contraction):
        match = contraction.group(0)
        first_char = match[0]
        expanded_contraction = contraction_mapping.get(match) \
            if contraction_mapping.get(match) \
            else contraction_mapping.get(match.lower())
        expanded_contraction = first_char + expanded_contraction[1:]
        return expanded_contraction

    expanded_text = contractions_pattern.sub(expand_match, Japan_3)
    expanded_text = re.sub("'", "", expanded_text)
    return expanded_text


Japan['expanded_text'] = Japan.apply(expand_contractions, axis=1)

After altering the code, I get the error below:

AttributeError: ("'str' object has no attribute 'apply'", 'occurred at index 0')

I don't know how to explain it in a less confusing way.

Thanks in advance!


Solution

  • A solution could be to replace every typographic quotation mark with the plain one. In your case, this can be done by applying a replace function to the Articles column of the pandas DataFrame:

    Japan_3 = Japan_3.apply(lambda x:str(x).replace("’","'"))
    

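    I can't test your function because I don't have the contraction mapping you are passing as a parameter. My guess is that you can add that line after Japan_3 = row['Articles'] and then run the rest of your function as normal. Note that .apply exists on a pandas Series (a whole column), not on a single string. When the function is applied row by row with axis=1, row['Articles'] is a plain str, which is what raises the AttributeError in the question; in that case call str(Japan_3).replace("’", "'") directly. If the function is given the whole DataFrame instead, the .apply line works, but the later contractions_pattern.sub call still expects a single string, so the rest of the function would need adapting too. With that caveat, I'd call the function in this way: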

    expand_contractions(Japan, contraction_mapping=CONTRACTION_MAP)
    

    But, to be honest, I don't know exactly what you are trying to do in that code to expand the contractions. To expand contractions, I'd simply replace each one with its expanded form in the text. The following is what I'd do. I didn't test it, though, so it might not work as expected, but I guess it is similar.

    CONTRACTION_MAP = {"I'm": "I am"}  # example only; replace with your own contractions
    # Replace the typographic apostrophe with the plain ASCII one
    Japan["Articles"] = Japan["Articles"].apply(lambda x: str(x).replace("’", "'"))
    # Replace each contraction with its expanded form, one contraction at a time
    for contraction, expansion in CONTRACTION_MAP.items():
        Japan["Articles"] = Japan["Articles"].apply(lambda x: str(x).replace(contraction, expansion))
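
If you would rather keep the regex approach from the question, one option is to make the function operate on a single string, which sidesteps the AttributeError. The following is a minimal sketch, not a tested drop-in replacement: the three-entry CONTRACTION_MAP, the QUOTE_TABLE name, and the helper names are assumptions for illustration, and the word-boundary pattern assumes every key starts and ends with a letter.

```python
import re

# Hypothetical, deliberately tiny mapping; replace it with your own CONTRACTION_MAP.
CONTRACTION_MAP = {"I'm": "I am", "I've": "I have", "don't": "do not"}

# Translate typographic quotes to their plain ASCII equivalents.
QUOTE_TABLE = str.maketrans({"\u2018": "'", "\u2019": "'", "\u201c": '"', "\u201d": '"'})

# Escape the keys and try longer ones first so a longer contraction wins.
_KEYS = sorted(CONTRACTION_MAP, key=len, reverse=True)
_PATTERN = re.compile(
    r"\b(?:" + "|".join(re.escape(k) for k in _KEYS) + r")\b",
    flags=re.IGNORECASE,
)
# Lower-cased lookup so that "i'm" and "I'M" also find an expansion.
_LOWER_MAP = {k.lower(): v for k, v in CONTRACTION_MAP.items()}


def expand_contractions(text):
    """Normalise quotes in one string, then expand the contractions in it."""
    text = str(text).translate(QUOTE_TABLE)

    def expand_match(match):
        matched = match.group(0)
        expanded = _LOWER_MAP[matched.lower()]
        # Keep the capitalisation of the first character as written.
        return matched[0] + expanded[1:]

    return _PATTERN.sub(expand_match, text)


sample = "\u201cI\u2019m here because a Cabinet minister is needed.\u201d"
print(expand_contractions(sample))
```

Because the function takes one string, it can be applied to the column directly, for example Japan['expanded_text'] = Japan['Articles'].apply(expand_contractions), with no axis=1 needed.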