python-3.x pandas dataframe data-processing

How to extract uppercase words from a string in every row of a pandas dataframe column?

The dataset is attached. In the column named "transcription", I want to extract each uppercase word from the string in every row and make it a feature (column) of the dataframe, with the text that follows the uppercase word as that row's value for the feature.

The expected output is a new column in the dataframe for each uppercase word found in the string, with each row holding the corresponding text under that column. I tried my best to explain.

Dataset

Sample output (shown for the first 2 data points)

Solution

Try this:

import pandas as pd

def cust_func(data):
    ## Rows without text (e.g. NaN) produce no new columns
    if not isinstance(data, str):
        return pd.Series(dtype="object")

    ## Split the transcription on the "," delimiter; the pieces are re-joined later
    words = data.split(",")

    ## Get the indices of pieces that are entirely uppercase and end with ":"
    ## (surrounding whitespace is ignored)
    column_idx = []
    for i, word in enumerate(words):
        stripped = word.strip()
        if stripped.endswith(":") and stripped.isupper():
            column_idx.append(i)

    ## Build the text for each uppercase heading by joining the pieces
    ## between it and the next heading (or the end of the string).
    ## Save the heading and its text in a dict.
    result = {}
    for pos, start in enumerate(column_idx):
        end = column_idx[pos + 1] if pos + 1 < len(column_idx) else len(words)
        result[words[start].strip()] = ",".join(words[start + 1:end])
    return pd.Series(result, dtype="object")  ## each key becomes a new column

df = pd.concat([df, df.transcription.apply(cust_func)], axis=1)
df

The output looks like this (I couldn't capture all the columns in one screenshot):
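If you want to try the idea without the real dataset, here is a minimal, self-contained sketch on a toy dataframe. It is a regex-based variant of the same approach, not the exact function above. The name `extract_sections`, the `HEADING` pattern, and the two sample rows are all made up for illustration, and the pattern assumes each heading is a run of uppercase letters and spaces followed by ":" and a comma, which may not hold for every row in your data.

```python
import re
import pandas as pd

# Assumed heading shape: uppercase letters/spaces, then ":", optional spaces, then ","
HEADING = re.compile(r"\b([A-Z][A-Z ]*[A-Z]):\s*,")

def extract_sections(text):
    """Return {heading: following text} for one transcription string."""
    # Non-string values (e.g. NaN) yield no sections
    if not isinstance(text, str):
        return {}
    # Splitting on a pattern with one capturing group gives
    # [prefix, heading1, body1, heading2, body2, ...]
    parts = HEADING.split(text)
    headings = parts[1::2]
    bodies = parts[2::2]
    # Trim the spaces and delimiter commas left around each body
    return {h: b.strip(" ,") for h, b in zip(headings, bodies)}

# Hypothetical sample rows, only loosely shaped like the question's data
toy = pd.DataFrame({"transcription": [
    "SUBJECTIVE:, Mild cough, no fever.,MEDICATIONS: , None.",
    "HISTORY:, Knee pain for two weeks.,PLAN:, Rest and ice.",
]})

# One dict per row; headings missing from a row become NaN in that row
sections = pd.DataFrame(
    [extract_sections(t) for t in toy["transcription"]],
    index=toy.index,
)
toy = pd.concat([toy, sections], axis=1)
print(toy)
```

Building the new columns from a list of dicts avoids returning a Series for every row, which tends to be slower on larger frames. Headings that appear in only some rows are simply left as NaN in the others.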