The dataset is attached. From the column named "transcription", I want to extract each uppercase word from the string in every row and make it a feature (column) of the dataframe, with the text that follows the uppercase word as that row's value for the feature.
The expected output is a new column in the dataframe for each uppercase word found in a string, with each data point holding its value under that feature. I have tried my best to explain.
Sample output (shown for the first 2 data points)
Try this:
import pandas as pd

def cust_func(data):
    ## split the transcription on the "," delimiter; the pieces are re-joined later
    words = data.split(",")
    ## get the indices of the pieces that are entirely uppercase and end with ":"
    column_idx = []
    for i in range(len(words)):
        if (words[i].endswith(":") or words[i].endswith(": ")) and words[i].isupper():
            column_idx.append(i)
    ## Find the sentence for each uppercase word by joining the pieces
    ## between two consecutive uppercase words.
    ## Save the uppercase word and its sentence in a dict.
    result = {}
    for i in range(len(column_idx)):
        if i != len(column_idx) - 1:
            result[words[column_idx[i]]] = ",".join(words[column_idx[i]+1:column_idx[i+1]])
        else:
            result[words[column_idx[i]]] = ",".join(words[column_idx[i]+1:])
    return pd.Series(result)  ## the dict keys become the new columns

df = pd.concat([df, df.transcription.apply(cust_func)], axis=1)
df
The output looks like this (I couldn't capture all the columns in one screenshot):
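Since a screenshot cannot show every column, here is a rough sketch of the same splitting logic applied to a single made-up string, so the result can be checked by eye. The name `extract_sections` and the `sample` text are hypothetical, not taken from the dataset. Unlike the function above, this variant also strips the surrounding spaces and the trailing ":" from each header, which is an assumption about what the column names should look like.

```python
def extract_sections(text):
    """Map each all-uppercase 'HEADER:' piece to the text that follows it."""
    pieces = text.split(",")
    # Positions of pieces that look like headers: uppercase and ending in ":"
    header_idx = [i for i, p in enumerate(pieces)
                  if p.strip().endswith(":") and p.isupper()]
    sections = {}
    # Pair each header position with the next one (None for the last header)
    for start, stop in zip(header_idx, header_idx[1:] + [None]):
        name = pieces[start].strip().rstrip(":")
        # Re-join the pieces between two headers to restore the original commas
        sections[name] = ",".join(pieces[start + 1:stop]).strip()
    return sections


# Made-up transcription string, only for illustration
sample = ("SUBJECTIVE:,  This 23-year-old female presents with allergies., "
          "MEDICATIONS: , Her only medication is Ortho Tri-Cyclen, and Allegra., "
          "ALLERGIES: , None.")

for header, body in extract_sections(sample).items():
    print(header, "->", body)
# SUBJECTIVE -> This 23-year-old female presents with allergies.
# MEDICATIONS -> Her only medication is Ortho Tri-Cyclen, and Allegra.
# ALLERGIES -> None.
```

Wrapping the returned dict in `pd.Series` inside `apply`, as in the function above, is what turns these keys into dataframe columns.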