I'm a Python/pandas newbie and have the following problem. I have a list called 'cat' containing 13 different strings that represent categories, and a dataframe called 'ku_drop' with 10 columns of HTML code (stored as strings) from which I want to extract information. I want to search the dataframe for each string in 'cat' and collect all cells containing that string in one column per category (e.g. all cells containing 'Arbeitsatmosphäre' should go into column X1, all cells containing 'Kommunikation' into column X2, and so on). How can I do this? I tried the following, but I only get an empty dataframe:
cat = ['Arbeitsatmosphäre', 'Kommunikation', 'Kollegenzusammenhalt', 'Work-Life-Balance', 'Vorgesetztenverhalten', 'Interessante Aufgaben', 'Gleichberechtigung', 'Umgang mit älteren Kollegen', 'Arbeitsbedingungen', 'Umwelt-/Sozialbewusstsein', 'Gehalt/Sozialleistungen', 'Image', 'Karriere/Weiterbildung']
cat_length = len(cat)
df_appender = []
for i in range(cat_length):
    x = "{}".format(category[i] for category in cat)
    df_cat = ku_drop[ku_drop.apply(lambda col: col.str.contains(x, case=False), axis=1)].stack().to_frame()
    df_cat.columns = ['X[i]']
    df_cat = df_cat.dropna(axis=0)
    df_appender.append(df_cat)
df_appender
I'm aware that my code may have a lot of flaws; please excuse this, as I'm not very familiar with pandas yet.
Try:
import pandas as pd

cat = ['Arbeitsatmosphäre', 'Kommunikation', 'Kollegenzusammenhalt', 'Work-Life-Balance', 'Vorgesetztenverhalten', 'Interessante Aufgaben', 'Gleichberechtigung', 'Umgang mit älteren Kollegen', 'Arbeitsbedingungen', 'Umwelt-/Sozialbewusstsein', 'Gehalt/Sozialleistungen', 'Image', 'Karriere/Weiterbildung']
# Small sample frame standing in for the real ku_drop
ku_drop = pd.DataFrame({'c1': ['Arbeitsatmosphäre abc', 'abc', 'Work-Life-Balance abc', 'Arbeitsatmosphäre abc'], 'c2': ['abc', 'abc Vorgesetztenverhalten abc', 'Kommunikation abc abc', 'abc abc Arbeitsatmosphäre']})
# One result column per category. df has len(ku_drop) rows, so this assumes
# no category matches more cells than ku_drop has rows.
df = pd.DataFrame(index=range(len(ku_drop)), columns=cat)
for c in cat:
    used = 0  # number of rows already filled in column c
    for c2 in ku_drop.columns:
        # Cells of column c2 that contain the category name as a literal substring
        temp = ku_drop[ku_drop[c2].str.contains(c, regex=False)][c2].values
        if len(temp) > 0:
            df.loc[used:used + len(temp) - 1, c] = temp
            used += len(temp)
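Output: 'df' has one column per category. Each column lists the cells that contain that category, filled in from the top row, and cells without a match stay NaN. For the sample data, the 'Arbeitsatmosphäre' column holds three entries, 'Kommunikation', 'Work-Life-Balance' and 'Vorgesetztenverhalten' hold one each, and the remaining columns are empty.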
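If a single category can match more cells than the dataframe has rows (quite possible with 10 columns), another option is to flatten the dataframe into one long Series first and filter that once per category. The following is only a sketch under a few assumptions: the helper name collect_by_category is made up for illustration, every cell is assumed to be a string or NaN, and matching is done on literal substrings, case-insensitively, as in the question's attempt.

```python
import pandas as pd


def collect_by_category(frame, categories):
    """Return one column per category with every cell of `frame` containing it.

    Hypothetical helper for illustration; assumes cells are strings or NaN.
    """
    # Flatten all cells into one Series, in row-by-row order.
    cells = frame.stack()
    columns = {}
    for category in categories:
        # Literal, case-insensitive substring match; NaN cells count as no match.
        mask = cells.str.contains(category, case=False, regex=False, na=False)
        # Renumber the matches 0..n-1 so the columns line up from the top.
        columns[category] = cells[mask].reset_index(drop=True)
    # The dict keys become column names; shorter columns are padded with NaN.
    return pd.concat(columns, axis=1)


if __name__ == "__main__":
    sample = pd.DataFrame({
        'c1': ['Arbeitsatmosphäre abc', 'abc'],
        'c2': ['abc Kommunikation', 'abc abc Arbeitsatmosphäre'],
    })
    print(collect_by_category(sample, ['Arbeitsatmosphäre', 'Kommunikation', 'Image']))
```

Because the result is built with pd.concat instead of being written into a preallocated frame, it grows to the length of the longest category column, so no row count has to be guessed up front.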