Search code examples
rseurat

How to add multiple pdata identities of same type to one sample, and be able to distinguish by those identities when plotting


I am trying to essentially add phenotypic information (pdata) to a sample in this dataset.

library(BiocManager)
library(GEOquery) 
library(plyr)
library(dplyr) 
library(Matrix) 
library(Seurat) 

# Loading Raw Data into RStudio ---------------------------------- 
    
    filePaths = getGEOSuppFiles("GSE118389") 
    tarF <- list.files(path = "./GSE118389/", pattern = "*.tar", full.names = TRUE) 
    untar(tarF, exdir = "./GSE118389/") 
    gzipF <- list.files(path = "./GSE118389/", pattern = "*.gz", full.names = TRUE) 
    ldply(.data = gzipF, .fun = gunzip) 
    
    list.files(path = "./GSE118389/", full.names = TRUE)

I load in the sample information for say, patient 39 (PT039), in the following manner:

# Load in full matrix ----------------------------------------------------------

fullmat <- read.table(file = './GSE118389//GSE118389_counts_rsem.txt', 
                      sep = '\t', header = TRUE, stringsAsFactors = FALSE)


# Load in PT039 matrix -----------------------------------------------------------
PT039mat <- grep(pattern =c("^PT039") , x = colnames(fullmat), value = TRUE)
PT039mat = fullmat[,grepl(c("^PT039"),colnames(fullmat))]
PT039mat[1:5,1:5]

I then add the associated pdata in the following manner, essentially creating a separate dataframe aside from PT039mat which will later be added on as the pdata into the Seurat object:

PT039pdat <- data.frame("samples" = colnames(PT039mat), 
                        "lymphovascular_invasion" = "no", 
                        "nodal_involvement" = "no",
                        "BRCA_status" = "BRCA-")

This creates the following: Output of pdata

I may be thinking about this the wrong way, but I have the following information, with PT039 present in multiple batches. I want to essentially recreate this exact dataframe into PT039pdat:

Batch   Patient ID
B1        PT039 
B2        PT058
B3        PT039, PT081, PT089
B4        PT081
B5        PT084, PT089
B6        PT084, PT089
B7        PT126
B8        PT039, PT084
B9        PT039

I can do exactly this using the following code:

PT039_batch <- list(c("B1", "B3", "B8", "B9"))
PT039pdat$batch <- PT039_batch

Output of adding batches

But when I try to plot the different batches on a UMAP, it isn't able to detect the separate possibilities, if that makes sense.

DimPlot(sobj, reduction = "umap", split.by = "batch")

Error in `[<-.data.frame`(`*tmp*`, , split.by, value = c(PT039_P11_A01_S1 = "B1",  : 
  replacement has 3538 rows, data has 1325

I want to be able to create this, but with batches information split by B1, B2, B3, etc. What want, but split by batches

Sorry for the long question, but thank you for reading!


Solution

  • See https://github.com/satijalab/seurat/issues/4650 . It looks like I need the batch information for each cell if I want to plot it in the way that I am.