Unlist gives me a longer vector than expected


I am performing an RNA-seq analysis and need a logical vector. However, I am starting from a SimpleLogicalList called hk with 58037 elements, which I obtained from hk <- features.info$symbol %in% house_keeping_genes, where features.info is a data frame and house_keeping_genes is a character vector.

After running unlist(hk), 58731 elements are returned. I then realized that some elements of the list contain more than one value (instead of just FALSE, they contain FALSE FALSE FALSE), which increases the length of the result.

I then tried unlist(unique(hk)), and most of the unexpected elements were dropped. However, there are still 58041 elements instead of 58037, and I have no idea where they are coming from. I checked, and no NA values are being generated.

What could I do to find where those 4 extra elements are coming from?

> dput(hk[60:70])
new("SimpleLogicalList", elementType = "logical", elementMetadata = NULL, 
    metadata = list(), listData = list(ENSG00000004777 = FALSE, 
        ENSG00000004779 = FALSE, ENSG00000004799 = FALSE, ENSG00000004809 = FALSE, 
        ENSG00000004838 = FALSE, ENSG00000004846 = FALSE, ENSG00000004848 = FALSE, 
        ENSG00000004864 = FALSE, ENSG00000004866 = c(FALSE, FALSE, 
        FALSE), ENSG00000004897 = TRUE, ENSG00000004939 = FALSE))

> dput(features.info$symbol[1:5])
new("SimpleCharacterList", elementType = "character", elementMetadata = NULL, 
    metadata = list(), listData = list(ENSG00000000003 = "TSPAN6", 
        ENSG00000000005 = "TNMD", ENSG00000000419 = "DPM1", ENSG00000000457 = "SCYL3", 
        ENSG00000000460 = "C1orf112"))

> dput(house_keeping_genes[1:5])
c("DPM1", "SCYL3", "GCLC", "BAD", "LAP3")

Edit: I need the logical vector to pass it as an argument to the RUVg() function. If I pass hk directly, I get an error: Error in Ycenter[, cIdx] : invalid subscript type 'S4'.

Packages:

other attached packages:
 [1] NCmisc_1.1.6                RUVSeq_1.28.0               EDASeq_2.28.0               ShortRead_1.52.0           
 [5] GenomicAlignments_1.30.0    Rsamtools_2.10.0            Biostrings_2.62.0           XVector_0.34.0             
 [9] snpStats_1.44.0             Matrix_1.4-0                survival_3.2-13             sva_3.42.0                 
[13] BiocParallel_1.28.3         genefilter_1.76.0           mgcv_1.8-38                 nlme_3.1-153               
[17] pheatmap_1.0.12             ggfortify_0.4.14            ggplot2_3.3.5               edgeR_3.36.0               
[21] limma_3.50.0                dplyr_1.0.7                 SummarizedExperiment_1.24.0 GenomicRanges_1.46.1       
[25] GenomeInfoDb_1.30.0         IRanges_2.28.0              S4Vectors_0.32.3            MatrixGenerics_1.6.0       
[29] matrixStats_0.61.0          tweeDEseqCountData_1.32.0   Biobase_2.54.0              BiocGenerics_0.40.0        

Solution

  • The issue is that some elements of symbol (which is a SimpleCharacterList, not a plain character vector) contain more than one value, so %in% returns more than one logical value for those elements. We can loop over the list with sapply and wrap the test in any, which returns a single TRUE/FALSE depending on whether any of the values in that list element are present in house_keeping_genes. The \(x) syntax is a concise shorthand for a lambda function (function(x)), available in R 4.1.0 and later.

    hk1 <- sapply(features.info$symbol, \(x) 
             any(x %in% house_keeping_genes, na.rm = TRUE))
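
    To find where the extra elements come from, one option is to look at the per-element lengths. Below is a minimal sketch in base R that uses a plain list as a stand-in for the SimpleCharacterList. The gene IDs and the symbols FOO1 and BAR1 to BAR3 are made up for illustration. It assumes that unique() on the list de-duplicates within each element, which would explain why unlist(unique(hk)) still has 4 extra values: those would be genes whose symbols give both TRUE and FALSE.

```r
# Toy stand-ins for the objects in the question (hypothetical data).
symbol <- list(
  ENSG_A = "TSPAN6",
  ENSG_B = c("DPM1", "FOO1"),          # one housekeeping symbol, one not
  ENSG_C = c("BAR1", "BAR2", "BAR3"),  # three symbols, none housekeeping
  ENSG_D = "SCYL3"
)
house_keeping_genes <- c("DPM1", "SCYL3", "GCLC", "BAD", "LAP3")

# Element-wise membership test, mimicking %in% applied to a list of symbols.
hk <- lapply(symbol, function(x) x %in% house_keeping_genes)

# 1. Elements that do not hold exactly one value.
multi <- which(lengths(hk) != 1)
print(hk[multi])

# 2. Elements still longer than 1 after de-duplicating within each element,
#    i.e. elements that contain both TRUE and FALSE.
mixed <- which(lengths(lapply(hk, unique)) > 1)
print(names(mixed))

# 3. Collapse to one logical per gene; vapply guarantees a logical vector.
hk_vec <- vapply(hk, any, logical(1))
stopifnot(length(hk_vec) == length(symbol))
```

    On the real objects, lengths(hk) and hk[lengths(hk) != 1] should behave the same way, assuming the usual S4Vectors/IRanges methods for list-like objects are loaded. The resulting plain logical vector is the kind of index that can then be passed on instead of the S4 list.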