I have a dataframe with the following structure:
prod_sec
A
AA
AAAAAAAAAAB
AAAABCCCAA
AACC
ABCCCBAC
df = pd.DataFrame({'prod_sec': ['A','AA','AAAAAAAAAAB','AAAABCCCAA','AACC','ABCCCBAC']})
Each string is a sequence made up of letters (A to C in this example).
I would like to create a list for each letter that counts the occurrences in each position down the entire pandas column.
For example in the first string A is only in the first position/index and it's not in the other locations.
In the second string the A in the first two positions and it's not in the other locations
In the third string the A has all the positions until the last one. Etc...
I want a total count, for the column, by position.
Here is an example for A:
A -> [1,0,0,0,0,0,0,0,0,0,0]
AA [1,1,0,0,0,0,0,0,0,0,0]
AAAAAAAAAAB -> [1,1,1,1,1,1,1,1,1,1,0]
AAAABCCCAA [1,1,1,1,0,0,0,0,0,0,1]
AACC [1,1,0,0,0,0,0,0,0,0,0]
ABCCCBAC -> [1,0,0,0,0,0,1,0,0,0,0]
so for A, I would want an output similar to the following... A [6,4,2,2,1,1,2,1,1,1,0]
In the end, I'm trying to get a matrix with a row for each character.
[6,4,2,2,1,1,2,1,1,1,0] [0,1,0,0,1,1,0,0,0,0,1] [0,0,1,1,0,1,2,0,0,0,0]
The following should work. You can adjust the result, depending on your exact needs (numpy array, data frame, dictionary, etc). Tell me if you need more help with that.
max_length=max([len(i) for i in df.prod_sec])
d={'A':[0]*max_length, 'B':[0]*max_length, 'C':[0]*max_length}
for i in df.prod_sec:
for k in range(len(i)):
d[i[k]][k]+=1
result=pd.DataFrame.from_dict(d, orient='index')