
Struggling to use read_tsv() in place of read.csv()


ANSWERED: Thank you so much, Bob. ffs, the issue was not specifying comment='#'. Why this works when 'skip' should've skipped the offending lines remains a mystery. Also see Gray's comment re: Excel's 'Text to Columns' feature for a non-R solution.

Hey folks,

this has been a demon on my back for ages.

The data I work with is always a collection of tab-delimited .txt files, so my analysis always begins with gathering the file paths to each, feeding those into read.csv(), and binding the results into a single df.

dat <- list.files(
    path = 'data',
    pattern = '*.txt',
    full.names = TRUE,
    recursive = TRUE
    ) %>%
    map_df( ~read.csv( ., sep='\t', skip=16) )  # actual data begins at line 16

This does exactly what I want, but I've been transitioning to tidyverse over the last few years.

I don't mind using utils::read.csv(); my datasets are usually small, so the speed benefit of readr wouldn't be felt. But for consistency's sake I'd rather use readr.

When I do the same, but sub readr::read_tsv(), i.e.,

dat <- list.files(
    path = 'data',
    pattern = '*.txt',
    full.names = TRUE,
    recursive = TRUE
    ) %>%
    map_df( ~read_tsv( ., skip=16 ))

I always get an empty (0x0) tibble. But it seems to be 'reading' the data, because I get a 'Parsed with column specification: cols()' message printed for every file.

Clearly I'm misunderstanding something here, but I don't know what, which has made my search for answers challenging and fruitless.

So... what am I doing wrong here?

Thanks in advance!

edit: an example snippet of (one of) my data files was requested, hope this formats well!

# KLIBS INFO
#  > KLibs Commit: 11a7f8331ba14052bba91009694f06ae9e1cdd3d
#
# EXPERIMENT SETTINGS
#  > Trials Per Block: 72
#  > Blocks Per Experiment: 8
#
# SYSTEM INFO
#  > Operating System: macOS 10.13.4
#  > Python Version: 2.7.15
#
# DISPLAY INFO
#  > Screen Size: 21.5" diagonal
#  > Resolution: 1920x1080 @ 60Hz
#  > View Distance: 57 cm

PID search_type stimulus_type   present_absent  response    rt  error
3   time    COLOUR  present absent  5457.863881 TRUE
3   time    COLOUR  absent  absent  5357.009108 FALSE
3   time    COLOUR  present present 2870.76412  FALSE
3   time    COLOUR  absent  absent  5391.404728 FALSE
3   time    COLOUR  present present 2686.6131   FALSE
3   time    COLOUR  absent  absent  5306.652878 FALSE

edit: Using Jukob's suggestion

files <- list.files(
    path = 'data',
    pattern = '*.txt',
    full.names = TRUE,
    recursive = TRUE
    )

for (i in seq_along(files)) {  # seq_along() is safe even if no files were found
    print(read_tsv(files[i], skip=16))
}

prints:

Parsed with column specification:
cols()
# A tibble: 0 x 0

... for each file

If I print files, I do get the correct list of file paths. If I remove skip=16 I get:

Parsed with column specification:
cols(
  `# KLIBS INFO` = col_character()
)
Warning: 617 parsing failures.
row col  expected     actual                                     file
 15  -- 1 columns 21 columns 'data/raw/2019/colour/p1.2019-02-28.txt'
 16  -- 1 columns 21 columns 'data/raw/2019/colour/p1.2019-02-28.txt'
 17  -- 1 columns 21 columns 'data/raw/2019/colour/p1.2019-02-28.txt'
 18  -- 1 columns 21 columns 'data/raw/2019/colour/p1.2019-02-28.txt'
 19  -- 1 columns 21 columns 'data/raw/2019/colour/p1.2019-02-28.txt'
... ... ......... .......... ........................................
See problems(...) for more details.

... for each file
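
One way to check where the data actually starts is to look at the raw lines before any parsing happens. The following is only a minimal sketch: the temp file and its contents are made up for illustration and stand in for one real data file, and `first_data_line` is a hypothetical helper variable.

```r
# Hypothetical stand-in for one data file: '#' header lines, a blank line,
# then a tab-separated header row and one data row
path <- tempfile(fileext = ".txt")
writeLines(
    c(
        "# KLIBS INFO",
        "#  > Trials Per Block: 72",
        "",
        "PID\trt",
        "3\t5457.863881"
    ),
    path
)

# Read the first few lines as plain strings, without any parsing
raw <- readLines(path, n = 20)
print(raw)

# Index of the first line that is neither empty nor a '#' comment
first_data_line <- which(!grepl("^#", raw) & nzchar(raw))[1]
print(first_data_line)
```

Comparing that index against the value passed to skip shows whether the reader is starting on the column-name row or somewhere else.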

Solution

  • FWIW I was able to solve the problem using your snippet by doing something along the following lines:

    # Didn't work for me since when I copy and paste your snippet,
    # the tabs become spaces, but I think in your original file
    # the tabs are preserved so this should work for you
    read_tsv("dat.tsv", comment = "#") 
    
    # This works for my case
    read_table2("dat.tsv", comment = "#")
    

    Didn't even need to specify the skip argument!

    But also, no idea why using skip instead of comment fails... :(
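
Putting the answer together with the original pipeline, a minimal sketch might look like the following. It assumes the header block in every file consists only of lines starting with `#`, and it builds a small throwaway file in a temp directory (the directory name, file name, and contents are made up for illustration) so that it runs on its own. map_df() also needs dplyr to be installed.

```r
library(readr)
library(purrr)

# Hypothetical stand-in for the 'data' directory, holding one made-up file:
# '#' header lines, a blank line, then tab-separated data
tmp_dir <- file.path(tempdir(), "klibs_demo")
dir.create(tmp_dir, showWarnings = FALSE)
writeLines(
    c(
        "# KLIBS INFO",
        "#  > Trials Per Block: 72",
        "",
        "PID\tsearch_type\trt\terror",
        "3\ttime\t5457.863881\tTRUE",
        "3\ttime\t5357.009108\tFALSE"
    ),
    file.path(tmp_dir, "p1.txt")
)

files <- list.files(
    path = tmp_dir,
    pattern = "\\.txt$",  # list.files() takes a regular expression, not a glob
    full.names = TRUE,
    recursive = TRUE
)

# comment = "#" drops the header block regardless of how many lines it has,
# so no skip argument is needed
dat <- map_df(files, ~ read_tsv(.x, comment = "#"))
print(dat)
```

Relying on comment rather than a hard-coded skip count also keeps working if the number of header lines differs between files. For space-separated copies of the data, read_table2() plays the same role; as far as I know, newer readr releases deprecate it in favour of read_table().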