python, pandas, dataframe, embedding

I need a co-occurrence dataframe of characters


import pandas as pd

corpus = pd.DataFrame([[1, 'A B C A D B A'], [2, 'B A B B C B A']], columns=['id',
                      'sequence'])
corpus

Expected Output

    A B C D
1   3 2 1 1
2   2 4 1 0

I have a dataframe like the one above. For each id, I need to count how many times each character occurs in its sequence, with one column per character.


Solution

  • Try str.split, then explode and str.get_dummies, and sum the indicator columns per id:

    out = corpus.set_index('id').sequence.str.split(' ').explode().str.get_dummies().groupby(level=0).sum()
    print(out)
    #     A  B  C  D
    # id
    # 1   3  2  1  1
    # 2   2  4  1  0
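
To make the one-liner easier to follow, here is a minimal sketch that splits the same chain into named steps. The intermediate names (tokens, exploded, dummies) are only illustrative and are not part of the original answer; the sketch assumes the tokens are separated by single spaces and contain no '|' character, which is the default separator of str.get_dummies.

```python
import pandas as pd

corpus = pd.DataFrame(
    [[1, 'A B C A D B A'], [2, 'B A B B C B A']],
    columns=['id', 'sequence'],
)

# Step 1: use id as the index and split each sequence into a list of tokens.
tokens = corpus.set_index('id')['sequence'].str.split(' ')

# Step 2: emit one row per token; each row keeps the id of its source sequence.
exploded = tokens.explode()

# Step 3: one-hot encode the tokens (one 0/1 column per distinct character).
dummies = exploded.str.get_dummies()

# Step 4: sum the indicator rows per id to get the per-character counts.
out = dummies.groupby(level=0).sum()
print(out)
```

A character that never appears for a given id still gets a column with a 0, because get_dummies builds its columns from all tokens in the whole series.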