python-3.x, pandas, dataframe, data-processing

How to extract the uppercase words from a string in every row of a pandas DataFrame column?


The dataset is attached. In the column named "transcription", I want to extract each uppercase word from the string in every row, make it a feature (column) of the dataframe, and use the text that follows the uppercase word as that row's value for the feature.

The expected output is a new column in the dataframe for each uppercase word found in the strings, with each row holding its corresponding text under that column. I tried my best to explain.

Dataset

Sample output (shown for the first 2 data points)

Current situation

Expected output to look like this


Solution

  • Try this:

    import pandas as pd

    def cust_func(data):
        # Skip missing or non-string transcriptions
        if not isinstance(data, str):
            return pd.Series(dtype=object)

        # Split the transcription on "," (the pieces are joined again later)
        words = data.split(",")

        # Indices of the pieces that are entirely uppercase and end with ":"
        column_idx = [
            i for i, word in enumerate(words)
            if word.strip().endswith(":") and word.isupper()
        ]

        # The text for each heading is everything between it and the next
        # heading (or the end of the string for the last heading)
        result = {}
        for pos, start in enumerate(column_idx):
            end = column_idx[pos + 1] if pos + 1 < len(column_idx) else None
            # Strip whitespace and the trailing ":" so that " HEADING:" and
            # "HEADING:" do not end up as two different columns
            heading = words[start].strip().rstrip(":")
            result[heading] = ",".join(words[start + 1:end]).strip()
        return pd.Series(result, dtype=object)  # one entry per new column

    # Each key of the returned Series becomes a new column
    df = pd.concat([df, df["transcription"].apply(cust_func)], axis=1)
    df
    
    

    Output: the original answer showed two screenshots here (not all the columns fit in one screenshot).
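If the headings are not always cleanly separated by commas, a regex-based variant may be easier to adapt. The following is only a minimal sketch, not part of the original answer: the function name `split_sections`, the `HEADING` pattern, and the two sample rows are all hypothetical, and it assumes a heading is a run of capital letters (spaces, "/" and "&" allowed) followed by a colon.

```python
import re

import pandas as pd

# Assumption: a heading is a run of uppercase letters (spaces, "/" and "&"
# allowed) that ends with a colon, e.g. "CHIEF COMPLAINT:".
HEADING = re.compile(r"([A-Z][A-Z /&]*):")


def split_sections(text):
    """Return a Series mapping each uppercase heading to the text after it."""
    if not isinstance(text, str):
        # Missing values (NaN) produce no new columns for that row
        return pd.Series(dtype=object)
    # With one capture group, re.split yields:
    # [text before first heading, heading1, body1, heading2, body2, ...]
    parts = HEADING.split(text)
    result = {}
    for heading, body in zip(parts[1::2], parts[2::2]):
        # Trim the spaces and commas that surround each body
        result[heading.strip()] = body.strip(" ,")
    return pd.Series(result, dtype=object)


# Hypothetical sample rows, shaped like the "transcription" column
df = pd.DataFrame({
    "transcription": [
        "SUBJECTIVE:, This is a 23-year-old female., MEDICATIONS:, Allegra.",
        "HISTORY:, Patient has difficulty climbing stairs.",
    ]
})

# Each heading becomes a column; rows without that heading get NaN
sections = df["transcription"].apply(split_sections)
out = pd.concat([df, sections], axis=1)
print(out.loc[0, "MEDICATIONS"])  # prints "Allegra."
```

The pattern would also treat a single capital letter followed by a colon as a heading, so it may need tightening for real transcriptions.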