I'm not sure the title is well chosen, sorry about that. If this has already been covered, please point me to it, because I couldn't find it. For an analysis I am doing, I work in JupyterLab, mainly with scanpy. I want to count the cells that co-express certain genes within each cluster of a Leiden clustering. So far I have used the pandas crosstab function, which gives me the count for each cluster. However, I have two conditions, and I am struggling to separate the samples so that I get the cell counts for each one.
This is the code I use to get the total cell count, which works fine:
pd.crosstab(adata_proc.obs['leiden_r05'], adata_proc.obs['CoEx'])
This is the code where I am struggling to get the counts per sample. I know that aggfunc = ','.join is not the correct way, but it illustrates the problem:
pd.crosstab(adata_proc.obs['leiden_r05'], adata_proc.obs['CoEx'], adata_proc.obs['sample'], aggfunc = ','.join)
This puts the names of the conditions in the table, but that is not what I want. I want the cell counts for the 2 conditions. How can I do this? Maybe there is a way to do it with a separate function?
Edit:
Using crosstab, you'll need to add the 'CoEx' column to the index and use 'sample' as the columns:
pd.crosstab(index=[adata_proc.obs['leiden_r05'],adata_proc.obs['CoEx']], columns=[adata_proc.obs['sample']])
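To see what that call returns, here is a minimal sketch on a toy DataFrame standing in for adata_proc.obs. The values ("yes"/"no", "ctrl"/"treat") are made up for illustration; only the column names come from the question.

```python
import pandas as pd

# Toy stand-in for adata_proc.obs; the values are invented for illustration.
obs = pd.DataFrame({
    "leiden_r05": ["0", "0", "0", "1", "1", "1"],
    "CoEx": ["yes", "no", "yes", "yes", "no", "no"],
    "sample": ["ctrl", "ctrl", "treat", "treat", "treat", "ctrl"],
})

# Rows: every observed (cluster, CoEx) pair. Columns: one per sample.
# Without an aggfunc, crosstab simply counts rows, i.e. cells.
counts = pd.crosstab(
    index=[obs["leiden_r05"], obs["CoEx"]],
    columns=obs["sample"],
)
print(counts)
```

Each cell of the resulting table is the number of cells in that cluster and CoEx group for one sample, and combinations that never occur show up as 0.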
I suggest using the .groupby method:
adata_proc.obs.groupby(['leiden_r05','CoEx'])["sample"].value_counts()
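The groupby version returns a long Series with one row per (cluster, CoEx, sample) combination. If you prefer one column per sample, unstacking the last index level should do it. A small sketch on toy data (invented values, same column names as in the question):

```python
import pandas as pd

# Toy stand-in for adata_proc.obs; the values are invented for illustration.
obs = pd.DataFrame({
    "leiden_r05": ["0", "0", "0", "1", "1", "1"],
    "CoEx": ["yes", "no", "yes", "yes", "no", "no"],
    "sample": ["ctrl", "ctrl", "treat", "treat", "treat", "ctrl"],
})

# Long format: counts indexed by (leiden_r05, CoEx, sample).
long_counts = obs.groupby(["leiden_r05", "CoEx"])["sample"].value_counts()

# Wide format: move the innermost index level (the sample) into columns
# and fill combinations that never occur with 0.
wide_counts = long_counts.unstack(fill_value=0)
print(wide_counts)
```

Note that the toy columns are plain strings. In a real obs table these columns are often categorical, in which case unobserved categories may also appear in the output.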
Another option (a bit of an abuse) is the pivot_table interface. In your case it would be:
pd.pivot_table(adata_proc.obs, index=["leiden_r05"], columns=["CoEx","sample"],values='barcode', aggfunc=len, fill_value=0)
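*The 'values' argument is here only to reduce the number of columns, an artifact of using a method that isn't really meant for this. It assumes obs has a 'barcode' column; since aggfunc=len only counts rows, any column not used in index or columns should work.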
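A sketch of the pivot_table variant on toy data. It assumes the barcodes live in the index of obs (as is usual for AnnData) and copies them into a regular 'barcode' column first; the index name and all values here are made up for illustration.

```python
import pandas as pd

# Toy stand-in for adata_proc.obs with the barcodes as the index.
obs = pd.DataFrame(
    {
        "leiden_r05": ["0", "0", "0", "1", "1", "1"],
        "CoEx": ["yes", "no", "yes", "yes", "no", "no"],
        "sample": ["ctrl", "ctrl", "treat", "treat", "treat", "ctrl"],
    },
    index=pd.Index(["c1", "c2", "c3", "c4", "c5", "c6"], name="barcode"),
)

# pivot_table needs a regular column to aggregate, so expose the index.
obs_with_barcode = obs.reset_index()

# One row per cluster, one column per (CoEx, sample) pair; len counts cells.
pivoted = pd.pivot_table(
    obs_with_barcode,
    index=["leiden_r05"],
    columns=["CoEx", "sample"],
    values="barcode",
    aggfunc=len,
    fill_value=0,
)
print(pivoted)
```

Compared with the crosstab call, this puts CoEx and sample side by side in the column header, which can be easier to read when there are only a few clusters.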