I have some dataframes whose columns contain string values (sentences). Each dataframe has a column whose name contains the word 'gold' combined with other words (e.g. 'gold_data', 'dataset_gold', etc.), a column whose name contains the word 'labeled' combined with other words (e.g. 'labeled_data', 'dataset_labeled', etc.), or both.
Here is an example of what a dataframe looks like when both columns exist:
import pandas as pd
df = pd.DataFrame({'gold_data': ['hello the weather nice', 'this is interesting', 'the weather is good'],
                   'data2': ['goodbye', 'the plant is green', 'the weather is sunny'],
                   'new_labeled_dataset': ['hello', 'there is no food in the fridge', 'this weather amazing']})
I am trying to process the strings in one of these columns, depending on which one exists, and return a dataframe containing only the rows of the original dataframe for which the condition is true, as follows:
result = []
for index, entry in df.iterrows():
    if not any(df.columns.str.contains(pat='labeled')):
        text = entry.filter(regex='gold').squeeze()
    else:
        text = entry.filter(regex='labeled').squeeze()
    if len(text.split()) > 2:
        # assigment? = 'new_info:' + text (this is where i do not know how to assign back to the column which was processed)
        result.append(entry)
print(pd.DataFrame(result))
So, if no column name contains 'labeled', I take the text from the column whose name contains 'gold'; otherwise I take the text from the 'labeled' column. But since I do not know the complete name of the column, I am not sure how to assign the processed text back to that column. The desired output should be:
   gold_data                     data2                 augmented_new
0  new_info:this is interesting  the plant is green    there is no food in the fridge
1  new_info:the weather is good  the weather is sunny  this weather amazing
I have tried to get the full name of the column and assign the text to it, but that is not correct either.
# df[col for col in df if 'gold' or 'labeled' in col] ='new_info:' + text
If I understood correctly, you want to apply a string transformation to certain elements of a column that is chosen by its name. If so, you do not need to iterate manually over each row: select the rows you need with the loc indexer of Pandas, then use the apply() method on the chosen column. Since you only want to keep strings of at least 3 words, the filter keeps the rows whose word count is greater than 2. You can do it with the following code:
# Choose which case applies
if not any(df.columns.str.contains(pat='labeled')):
    # Retrieve the 'gold' column name
    chosen_col = next(col for col in df.columns if 'gold' in col)
else:
    # Retrieve the 'labeled' column name
    chosen_col = next(col for col in df.columns if 'labeled' in col)
# Filter rows (copy so that the assignment below does not trigger SettingWithCopyWarning)
df = df.loc[df[chosen_col].str.split().map(len) > 2].copy()
# Transform all the strings in the retrieved column
df[chosen_col] = df[chosen_col].apply(lambda x: 'new_info:' + x)
print(df)
Since you have provided two different dataframes, the results obtained by this code are:
   gold_data            data2                 new_labeled_dataset
1  this is interesting  the plant is green    new_info:there is no food in the fridge
2  the weather is good  the weather is sunny  new_info:this weather amazing
and for the second one (which has no 'labeled' column, so the 'gold' column is chosen):
   gold_data                        data2                 augmented_new
0  new_info:hello the weather nice  goodbye               hello
1  new_info:this is interesting     the plant is green    there is no food in the fridge
2  new_info:the weather is good     the weather is sunny  this weather amazing
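If you need to run this on several dataframes, the same steps could be wrapped in a function. The following is only a sketch: the function name tag_long_texts and its parameters prefix and min_words are made up for illustration, and it assumes that the first matching column name is the one you want when several match. It also builds the second dataframe by renaming 'new_labeled_dataset' to 'augmented_new', which is a guess at how your second example was created.

```python
import pandas as pd


def tag_long_texts(df, prefix="new_info:", min_words=3):
    """Keep rows whose chosen column has at least `min_words` words
    and prepend `prefix` to the text in that column.

    A 'labeled' column is preferred; a 'gold' column is the fallback.
    (Hypothetical helper, not part of pandas.)
    """
    labeled_cols = [col for col in df.columns if "labeled" in col]
    gold_cols = [col for col in df.columns if "gold" in col]
    if labeled_cols:
        chosen_col = labeled_cols[0]
    elif gold_cols:
        chosen_col = gold_cols[0]
    else:
        raise ValueError("no column name contains 'labeled' or 'gold'")

    # Boolean mask: True where the text has enough words
    mask = df[chosen_col].str.split().str.len() >= min_words
    # Work on a copy so the caller's dataframe is left unchanged
    out = df.loc[mask].copy()
    out[chosen_col] = prefix + out[chosen_col]
    return out


df_both = pd.DataFrame({
    "gold_data": ["hello the weather nice", "this is interesting", "the weather is good"],
    "data2": ["goodbye", "the plant is green", "the weather is sunny"],
    "new_labeled_dataset": ["hello", "there is no food in the fridge", "this weather amazing"],
})
# Assumed second dataframe: same data, but without a 'labeled' column
df_gold_only = df_both.rename(columns={"new_labeled_dataset": "augmented_new"})

print(tag_long_texts(df_both))
print(tag_long_texts(df_gold_only))
```

Because the function works on a copy, df_both and df_gold_only keep their original values, so you can call it repeatedly without the prefix being added twice.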