Search code examples
pythonpandasreadability

Calculate Gunning-Fog score on excel values


I have a spreadsheet with fields containing a body of text.

enter image description here

I want to calculate the Gunning-Fog score on each row and have the value output to that same excel file as a new column. To do that, I first need to calculate the score for each row. The code below works if I hard key the text into the df variable. However, it does not work when I define the field in the sheet (i.e., rfds) and pass that through to my r variable. I get the following error, but two fields I am testing contain 3,896 and 4,843 words respectively.

readability.exceptions.ReadabilityException: 100 words required.

Am I missing something obvious? Disclaimer, I am very new to python and coding in general! Any help is appreciated.

from readability import Readability
import pandas as pd

df = pd.read_excel(r"C:/Users/name/edgar/test/item1a_sandbox.xls")

rfd = df["Item 1A"]
rfds = rfd.to_string() # to fix "TypeError: expected string or buffer"

r = Readability(rfds)
fog = r.gunning_fog()
print(fog.score)

Solution

  • TL;DR: You need to pass the cell value and are currently passing a column of cells.

    This line rfd = df["Item 1A"] returns a reference to a column. rfd.to_string() then generates a string containing either length (number of rows in the column) or the column reference. This is why a TypeError was thrown - neither the length nor the reference are strings.

    Rather than taking a column and going down it, approach it from the other direction. Take the rows and then pull out the column:

    for index, row in df.iterrows():
        print(row.iloc[2])
    

    The [2] is the column index.

    Now a cell identifier exists, this can be passed to the Readability calculator:

        r = Readability(row.iloc[2])
        fog = r.gunning_fog()
        print(fog.score)
    

    Note that these can be combined together into one command:

        print(Readability(row.iloc[2]).gunning_fog())
    

    This shows you how commands can be chained together - which way you find it easier is up to you. The chaining is useful when you give it to something like apply or applymap.

    Putting the whole thing together (the step by step way):

    from readability import Readability
    import pandas as pd
    df = pd.read_excel(r"C:/Users/name/edgar/test/item1a_sandbox.xls")
    
    for index, row in df.iterrows():
        r = Readability(row.iloc[2])
        fog = r.gunning_fog()
        print(fog.score)
    

    Or the clever way:

    from readability import Readability
    import pandas as pd
    df = pd.read_excel(r"C:/Users/name/edgar/test/item1a_sandbox.xls")
    
    print(df["Item 1A"].apply(lambda x: Readability(x).gunning_fog()))