Search code examples
pythonpython-3.xpandassequencing

How to count characters across strings in a pandas column


I have a dataframe with the following structure:

prod_sec     
A    
AA    
AAAAAAAAAAB    
AAAABCCCAA    
AACC   
ABCCCBAC

df = pd.DataFrame({'prod_sec': ['A','AA','AAAAAAAAAAB','AAAABCCCAA','AACC','ABCCCBAC']})

Each string is a sequence made up of letters (A to C in this example).
I would like to create a list for each letter that counts the occurrences in each position down the entire pandas column.

For example in the first string A is only in the first position/index and it's not in the other locations.
In the second string the A in the first two positions and it's not in the other locations
In the third string the A has all the positions until the last one. Etc...
I want a total count, for the column, by position.
Here is an example for A:

A            ->     [1,0,0,0,0,0,0,0,0,0,0]    
AA                  [1,1,0,0,0,0,0,0,0,0,0]
AAAAAAAAAAB   ->    [1,1,1,1,1,1,1,1,1,1,0] 
AAAABCCCAA          [1,1,1,1,0,0,0,0,0,0,1]
AACC                [1,1,0,0,0,0,0,0,0,0,0]
ABCCCBAC    ->      [1,0,0,0,0,0,1,0,0,0,0]

so for A, I would want an output similar to the following... A [6,4,2,2,1,1,2,1,1,1,0]
In the end, I'm trying to get a matrix with a row for each character.

                    [6,4,2,2,1,1,2,1,1,1,0]
                    [0,1,0,0,1,1,0,0,0,0,1]
                    [0,0,1,1,0,1,2,0,0,0,0]

Solution

  • The following should work. You can adjust the result, depending on your exact needs (numpy array, data frame, dictionary, etc). Tell me if you need more help with that.

    max_length=max([len(i) for i in df.prod_sec])
    
    d={'A':[0]*max_length, 'B':[0]*max_length, 'C':[0]*max_length}
    
    for i in df.prod_sec:
        for k in range(len(i)):
            d[i[k]][k]+=1
    
    result=pd.DataFrame.from_dict(d, orient='index')