
Metadata .gz file won't load properly in R


The file here: https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE104323

called: GSE104323_metadata_barcodes_24185cells.txt.gz

Will not load properly in R. The age column, which is arguably the most important metadata, is missing, and half the columns are all NA.

The following code loads the data.

hochgerner24k_2018_meta <- read.table(paste0(testsetpath, "/Hochgerner2018/GSE104323_metadata_barcodes_24185cells.txt.gz"), header = TRUE, fill = TRUE)

Without fill = TRUE the following error occurs:

Error in scan(file = file, what = what, sep = sep, quote = quote, dec = dec, :
line 1 did not have 30 elements

How can I load this metadata into a dataframe without all this missing information?


Solution

  • That file doesn't have any metadata that I can see. It's a tab-separated file. How do I know that? Well, I suppose I could have looked at its documentation, which is probably around somewhere, but what I did instead was look at it in a text editor:

    [Screenshot: the file opened in a text editor]

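    You can observe the typical arrangement of text for a tab-separated file: all the columns are left-aligned, with some of them shifting over when the text bleeds into the next column. By default, read.table() splits on any whitespace (sep = ""), so values that contain spaces, such as "dentate gyrus" or "Mus musculus", get broken into several fields. Setting sep = "\t" avoids that: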

    hochgerner24k_2018_meta <- read.table("~/Downloads/GSE104323_metadata_barcodes_24185cells.txt.gz", header = TRUE, sep = "\t")
    > str(hochgerner24k_2018_meta)
    'data.frame':   24216 obs. of  11 variables:
     $ Sample.name..24185.single.cells.      : chr  "10X79_1_AAACTAGCTAGCCC-" "10X79_1_AAACTAGGATGTAT-" "10X79_1_AAACTCACGGCGTT-" "10X79_1_AAACTGTCGGCTCA-" ...
     $ source.name                           : chr  "dentate gyrus" "dentate gyrus" "dentate gyrus" "dentate gyrus" ...
     $ organism                              : chr  "Mus musculus" "Mus musculus" "Mus musculus" "Mus musculus" ...
     $ characteristics..strain               : chr  "hGFAP-GFP" "hGFAP-GFP" "hGFAP-GFP" "hGFAP-GFP" ...
     $ characteristics..age                  : chr  "P120" "P120" "P120" "P120" ...
     $ characteristics..sex.of.pooled.animals: chr  "2males+1female" "2males+1female" "2males+1female" "2males+1female" ...
     $ characteristics..cell.cluster         : chr  "Neuroblast" "OPC" "GC-adult" "MOL" ...
     $ molecule                              : chr  "total RNA" "total RNA" "total RNA" "total RNA" ...
     $ SRR.run.accession                     : chr  "SRR6089817" "SRR6089947" "SRR6089529" "SRR6089595" ...
     $ raw.file..original.file.name.         : chr  "10X79_1_AAACTAGCTAGCCC.fq.gz" "10X79_1_AAACTAGGATGTAT.fq.gz" "10X79_1_AAACTCACGGCGTT.fq.gz" "10X79_1_AAACTGTCGGCTCA.fq.gz" ...
     $ UMI_CellularBarcode                   : chr  "CGGCGATCCC_AAACTAGCTAGCCC" "AGTGGTAATG_AAACTAGGATGTAT" "GGGTGCGCTC_AAACTCACGGCGTT" "CCTTTCAACG_AAACTGTCGGCTCA" ...
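
    To see why the separator matters, here is a minimal sketch on a tiny synthetic file. The file contents and variable names are made up for illustration and are not taken from GSE104323. It counts the fields per line when splitting on whitespace versus on tabs, then reads the file with sep = "\t".

    ```r
    # Synthetic stand-in for the real file: two tab-separated columns whose
    # header and values contain spaces (hypothetical data).
    tmp <- tempfile(fileext = ".txt.gz")
    out <- gzfile(tmp, "w")
    writeLines(c("source name\torganism",
                 "dentate gyrus\tMus musculus",
                 "dentate gyrus\tMus musculus"), out)
    close(out)

    # Read the raw lines back through an explicit gzip connection.
    con <- gzfile(tmp, "rt")
    lines <- readLines(con)
    close(con)

    # Fields per line when splitting on any run of whitespace
    # (roughly what the default sep = "" does) versus on tabs only.
    n_whitespace <- lengths(strsplit(lines, "[[:space:]]+"))
    n_tab <- lengths(strsplit(lines, "\t", fixed = TRUE))
    print(n_whitespace)  # the header and the data rows disagree
    print(n_tab)         # every line has the same number of fields

    # With the right separator the file parses into the intended columns.
    meta <- read.table(tmp, header = TRUE, sep = "\t")
    str(meta)
    ```

    When the whitespace split gives different field counts for the header and the data rows, read.table() either stops with a "did not have N elements" error or, with fill = TRUE, pads the short rows with NA.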
    

    Note: a .gz extension tells you nothing about the layout of the contents. Gzip is only compression, and the delimiter has no special significance to the compression algorithm. The next .gz file you see could have any structure, or no structure at all.
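
Since the extension says nothing about the layout, one option is to peek at the first few lines before choosing a separator. The sketch below assumes a hypothetical helper, peek_delimiters(), and a synthetic comma-separated file. Neither comes from the original answer, and the set of candidate delimiters is an arbitrary choice.

```r
# Hypothetical helper: read the first n lines of a gzip-compressed text file
# and report, for each candidate delimiter, the smallest number of times it
# occurs on any of those lines.
peek_delimiters <- function(path, n = 5,
                            candidates = c(tab = "\t", comma = ",", semicolon = ";")) {
  con <- gzfile(path, "rt")
  on.exit(close(con))
  lines <- readLines(con, n = n)
  vapply(candidates, function(d) {
    # Count literal occurrences of the delimiter on each line.
    hits <- lengths(regmatches(lines, gregexpr(d, lines, fixed = TRUE)))
    min(hits)
  }, integer(1))
}

# Synthetic comma-separated example (made-up values).
tmp <- tempfile(fileext = ".csv.gz")
out <- gzfile(tmp, "w")
writeLines(c("cell,cluster,age", "c1,OPC,P120", "c2,MOL,P120"), out)
close(out)

counts <- peek_delimiters(tmp)
print(counts)

# A delimiter that appears on every line is a reasonable first guess for sep.
dat <- read.table(tmp, header = TRUE, sep = ",")
str(dat)
```

This is only a heuristic: a delimiter can also appear inside quoted values, so it is still worth looking at the printed lines or the file's documentation before trusting the result.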