I have df with many columns and each column have repeated values because its survey data. As an example my data look like this:
df:
Q36r9: sales platforms - Before purchasing a new car Q36r32: Advertising letters - Before purchasing a new car
Not Selected Selected
So i want to strip the text from column names. For example from first column I want to get the text between ":" and "-". So it should be like this: "sales platform" and in second part i want to convert vales of column, "selected" should be changed with the name of column and "Not Selected" as NaN
so desired output would be like this:
sales platforms Advertising letters
NaN Advertising letters
Edited: Another Problem if i have column name like:
Q40r1c3: WeChat - Looking for a new car - And now if you think again - Which social media platforms or sources would you use in each situation?
If i just want to get something in between ":" and "-". It should extract "WeChat"
IIUC,
we can take advantage of some regex and greed matching using .*
which matches everything between a defined pattern
import re
df.columns = [re.search(':(.*)-',i).group(1) for i in df.columns.str.strip()]
print(df.columns)
sales platforms Advertising letters
0 Not Selected None
with greedy matching we can use +?
+? Quantifier — Matches between one and unlimited times, as few times as possible, expanding as needed (lazy)
Q36r9: sales platforms - Before purchasing a new car Q40r1c3: WeChat - Looking for a new car - And now if you think again - Which social media platforms or sources would you use in each situation?
0 1
import re
[re.search(':(.+?)-',i).group(1).strip() for i in df.columns]
['sales platforms', 'WeChat']