python pandas dataframe delimiter

Counting occurrences of text in a data frame column containing separators


I have a column in a data frame that contains the languages people have worked with. Each row is one individual, and the languages are separated by a delimiter (;).

Is there a way to count the occurrences of each language across the entire column, e.g., Python occurs n times, JavaScript occurs m times, etc.?

I tried the following, but I'm not sure how to get from it to a count of each individual language across the entire column:

    df['LanguageHaveWorkedWith'].value_counts()

I also tried using get_dummies to one-hot encode the column, but how would I then count the occurrences of each element?

    df['LanguageHaveWorkedWith'].str.get_dummies(sep=';')


Solution

  • Split the column by the separator, explode the resulting lists into rows, then count the occurrences of each unique item.

    import pandas as pd
    
    data = {'Languages': ['Python;JavaScript;Java', 'Python;C++;Python;JavaScript', 'JavaScript;C++']}
    df = pd.DataFrame(data)
    
    # Split the 'Languages' column on ';', then explode each list into its own rows
    df['Languages'] = df['Languages'].str.split(';')
    df = df.explode('Languages')
    
    # Count how many times each language occurs across all rows
    language_counts = df['Languages'].value_counts().reset_index()
    language_counts.columns = ['Language', 'Count']
    print(language_counts)
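
  • The get_dummies attempt from the question can also be taken one step further: summing each indicator column gives a per-language count. The following is a minimal sketch, assuming the column is named 'LanguageHaveWorkedWith' as in the question; the sample rows are made up for illustration.

```python
import pandas as pd

# Hypothetical sample data; only the column name comes from the question
df = pd.DataFrame({
    'LanguageHaveWorkedWith': [
        'Python;JavaScript;Java',
        'Python;C++',
        'JavaScript;C++',
    ]
})

# One indicator column per language: 1 if the row contains it, 0 otherwise
dummies = df['LanguageHaveWorkedWith'].str.get_dummies(sep=';')

# Summing each column gives the number of rows that mention each language
language_counts = dummies.sum().sort_values(ascending=False)
print(language_counts)
```

    Note that this counts a language at most once per row, so a language repeated within a single row is not counted twice, whereas split + explode + value_counts counts every repetition.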