Search code examples
pandasdata-processing

Pandas manipulation: matching data from other columns to one column, applied uniquely to all rows


I have a model that predicts 10 words for a particular course in order of likelihood, and I'd like the first 5 words of those words that appear in the course's description.

This is the format of the data:

course_name course_title    course_description  predicted_word_10   predicted_word_9    predicted_word_8    predicted_word_7    predicted_word_6    predicted_word_5    predicted_word_4    predicted_word_3    predicted_word_2    predicted_word_1
Xmath 32    Precalculus     Polynomial and rational functions, exponential...   directed    scholars    approach    build   african different   visual  cultures    placed  global
Xphilos 2   Morality        Introduction to ethical and political philosop...   make    presentation    weekly  european    ways    general range   questions   liberal speakers

My idea is for each row to start iterating from predicted_word_1 until I get the first 5 that are in the description. I'd like to save those words in the order they appear into additional columns description_word_1 ... description_word_5. (If there are <5 predicted words in the description I plan to return NAN in the corresponding columns).

To clarify with an example: if the course_description of a course is 'Polynomial and rational functions, exponential and logarithmic functions, trigonometry and trigonometric functions. Complex numbers, fundamental theorem of algebra, mathematical induction, binomial theorem, series, and sequences. ' and its first few predicted words are irrelevantword1, induction, exponential, logarithmic, irrelevantword2, polynomial, algebra...

I would want to return induction, exponential, logarithmic, polynomial, algebra for that in that order and do the same for the rest of the courses.

My attempt was to define an apply function that will take in a row and iterate from the first predicted word until it finds the first 5 that are in the description, but the part I am unable to figure out is how to create these additional columns that have the correct words for each course. This code will currently only keep the words for one course for all the rows.

def find_top_description_words(row):
    print(row['course_title'])
    description_words_index=1
    for i in range(num_words_per_course): 
        description = row.loc['course_description']
        word_i = row.loc['predicted_word_' + str(i+1)]
        if (word_i in description) & (description_words_index <=5) :
            print(description_words_index)
            row['description_word_' + str(description_words_index)] = word_i
            description_words_index += 1


df.apply(find_top_description_words,axis=1)

The end goal of this data manipulation is to keep the top 10 predicted words from the model and the top 5 predicted words in the description so the dataframe would look like:

course_name course_title  course_description top_description_word_1 ... top_description_word_5 predicted_word_1 ... predicted_word_10 

Any pointers would be appreciated. Thank you!


Solution

  • If I understand correctly:

    Create new DataFrame with just 100 predicted words:

    pred_words_lists = df.apply(lambda x: list(x[3:].dropna())[::-1], axis = 1)
    

    Please note that, there are lists in each row with predicted words. The order is nice, I mean the first, not empty, predicted word is on the first place, the second on the second place and so on.

    Now let's create a new DataFrame:

    pred_words_df = pd.DataFrame(pred_words_lists.tolist())
    pred_words_df.columns = df.columns[:2:-1]
    

    And The final DataFrame:

    final_df = df[['course_name', 'course_title', 'course_description']].join(pred_words_df.iloc[:,0:11])
    

    Hope this works.

    EDIT

    def common_elements(xx, yy):
        temp = pd.Series(range(0, len(xx)), index= xx)
        return list(df.reindex(yy).sort_values()[0:10].dropna().index)
    
    pred_words_lists = df.apply(lambda x: common_elements(x[2].replace(',','').split(), list(x[3:].dropna())), axis = 1)
    

    Does it satisfy your requirements?

    Adapted solution (OP):

    def get_sorted_descriptions_words(course_description, predicted_words, k):
        description_words = course_description.replace(',','').split()
        predicted_words_list = list(predicted_words)
        predicted_words = pd.Series(range(0, len(predicted_words_list)), index=predicted_words_list)
        predicted_words = predicted_words[~predicted_words.index.duplicated()]
        ordered_description = predicted_words.reindex(description_words).dropna().sort_values()
        ordered_description_list = pd.Series(ordered_description.index).unique()[:k]
    
        return ordered_description_list
    
    df.apply(lambda x: get_sorted_descriptions_words(x['course_description'], x.filter(regex=r'predicted_word_.*'), k), axis=1)