Tags: python, pandas, dataframe, word-wrap

Cleaning up a DataFrame where text wrapping has created multiple rows


I have a DataFrame whose rows were split by text wrapping. Each description has been split over multiple rows, and the key data, in a separate column "Values", only appears in the first row of each description. I'm hoping to do two things:

  1. Merge the description text from the additional wrapped rows into the first row.
  2. Remove the additional rows that have no "Values".


I've been trying to iterate through the rows, but with little success. Here is some sample code that builds an example of the problem DataFrame:

import pandas as pd

# blank strings in "Values" mark the wrapped continuation rows
a = ["This is", "the first", "line", "second", "line", "third", "line", "(no pattern)", "fourth", "fifth"]
b = [1, "", "", 3, "", 47, "", "", 1, 2]
df = pd.DataFrame()
df['Description'] = a
df['Values'] = b
df.head(10)
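
For clarity, the frame below is a hand-built illustration of the result I'm after for the sample data above (purely an illustration of the target, not code from the problem itself):

expected = pd.DataFrame({
    'Description': ["This is the first line", "second line",
                    "third line (no pattern)", "fourth", "fifth"],
    'Values': [1, 3, 47, 1, 2],
})
expected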

I've tried a few things. Right now I'm just trying to get the loop to identify whether "Values" has anything in it. Here is the current, short piece of code I'm working with:

for index, row in df.iterrows():
    if row["Values"] != '':
        a = row["Description"]    # this row starts a new record
    else:
        b = row["Description"]    # this row is wrapped continuation text
        c = a + " " + b
        print(c)
        #print(row["Values"])
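
For reference, a minimal sketch of where that loop-based approach is heading; the merged_rows list, clean_df, and the space used when concatenating are illustrative additions, not part of the original attempt:

merged_rows = []                        # (description, value) pairs, one per real record
current_desc, current_val = None, None

for index, row in df.iterrows():
    if row["Values"] != '':
        if current_desc is not None:    # store the previous record before starting a new one
            merged_rows.append((current_desc, current_val))
        current_desc, current_val = row["Description"], row["Values"]
    else:
        # continuation row: glue its text onto the current description
        current_desc = current_desc + " " + row["Description"]

if current_desc is not None:            # don't forget the last record
    merged_rows.append((current_desc, current_val))

clean_df = pd.DataFrame(merged_rows, columns=["Description", "Values"])
print(clean_df)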

Solution

  • After working on this all day I got an answer! It does seem long-winded, so if you have an alternative answer, please let me know. Here is my answer:

    import numpy as np

    a_array = np.array([])
    for index, row in df.iterrows():
        if row["Values"] != '':
            a = int(index)                  # index of the most recent row that has a value
        a_array = np.append(a_array, a)     # every row gets the index of its "parent" row
    df['a'] = a_array
    df['Description'] = df.groupby(['a'])['Description'].transform(lambda x: ' '.join(x))
    df.replace("", np.nan, inplace=True)    # replace blank values with NaN
    df = df.dropna(subset=['Values'])       # drop rows with NaN in "Values" (dropna is not in-place)
    

    This code works by iterating through the rows and recording the index of the most recent row that has something in "Values". I add that as a DataFrame column, 'a', and then group by 'a', joining each group's "Description" strings with a space ' '.

    Then the blanks are replaced with NaN (np.nan), and I drop all rows with NaN in the column "Values" (dropna returns a new DataFrame, so the result is assigned back to df).
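
  • For comparison, a shorter vectorized sketch of the same idea: group on a cumulative count of the non-blank "Values" entries. This is a standard pandas pattern applied to the sample df from the question, not part of the original answer:

    # run this on the original df from the sample above (before the loop-based clean-up)
    # each non-blank entry in "Values" starts a new group
    key = df['Values'].ne('').cumsum()

    out = (df.groupby(key)
             .agg({'Description': ' '.join, 'Values': 'first'})
             .reset_index(drop=True))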