Search code examples
rtext-filesbinning

Binning values in R with multiple files


So I've got a slight problem with binning values contained in multiple text files into set ranges. I've had a look online for various packages and came across sm which can bin values and you can specify the bin range as well like so:-

xb <- binning(x, breaks=seq(-4,4,by=0.5))

but I have a few problems with implementing this. I can't specify an upper limit as I don't know what the upper limit is (the files have thousands of values obtained from an instrument), I have 25 files each of which contains thousands of values that I need to bin all at the same time (I need to then subsequently take the median of all those binned values) and I'm having troubles with just reading in the text files to begin with. I perform:-

read.table("foobar.txt", sep=",")

to read just one text file as all the values are separated out by a ',' but it doesn't seem to be able to handle this. Also I want to be able to bin the values with a given range of 0.0005 (so for example values are binned between 200.0000 - 200.0005, 200.0005 - 200.0010 and so on)

The text files I'm using are .txt with values that are separated out by commas and contain thousands of values. The idea is that I bin the values into groups within a set range and then take the median of those values that represents that particular bin. For example, if I had the values 1,1,2,3,3,4,5,5,6,7,7,9,10 and I set the range of the bin to 2, then the first bin will contain 1,1,2, second 3,3,4, third 5,5,6 etc and the median of all those values is taken to represent the bin, the first bin having a median value of 1, second 3, third 5 (I know in this example it may seem pointless to take the median value but with the data I have it makes sense)

Is there a way of reading in multiple text files of values and processing them all simultaneously in the manner in which I described? Is there a package for something like this that I can just look at the manual for? Any suggestions or tips would be grateful!


Solution

  • There are several ways to do it, I'll provide one method using base functions. (An alternative would be to use dplyr, also well suited for this. However, the base example should be simple enough.)

    Generate Data

    (This is here merely because we don't have any of your data.)

    n <- 10
    for (ii in 1:3) {
        dat <- runif(n)
        writeLines(paste(dat, collapse = ','),
                   con = sprintf('user2062207-file%s.txt', ii))
    }
    readLines('user2062207-file1.txt')
    ## [1] "0.929472318384796,0.921938128070906,0.707776406314224,0.236701443558559,0.271322417538613,0.388766387710348,0.422867075540125,0.324589917669073,0.92406965768896,0.171326051233336"
    

    Read the Data

    This is where you'll start, assuming you have a simple pattern for finding the files.

    fnames <- list.files(pattern = 'user2062207-file.*.txt')
    allData <- unlist(sapply(fnames, read.table, sep = ','))
    allRange <- range(allData)
    df <- data.frame(x = allData)
    head(df)
    ##           x
    ## 1 0.9294723
    ## 2 0.9219381
    ## 3 0.7077764
    ## 4 0.2367014
    ## 5 0.2713224
    ## 6 0.3887664
    dim(df)
    ## [1] 30  1
    

    Set the Bins

    The {floor,ceiling} +/- binSize below is because the bins include only one side of the range (default: right side), so the minimum value(s) will not be binned. It also ensures the bins are on rounded boundaries.

    binSize <- 0.05
    allBins <- seq(floor(allRange[1] / binSize) * binSize,
                   ceiling(allRange[2] / binSize) * binSize,
                   by = binSize)
    ## bin the data
    df$bin <- cut(df$x, breaks = allBins)
    head(df)
    ##           x        bin
    ## 1 0.9294723 (0.9,0.95]
    ## 2 0.9219381 (0.9,0.95]
    ## 3 0.7077764 (0.7,0.75]
    ## 4 0.2367014 (0.2,0.25]
    ## 5 0.2713224 (0.25,0.3]
    ## 6 0.3887664 (0.35,0.4]
    

    Statistics on Each Bin

    sapply(levels(df$bin), function(lvl) median(df$x[df$bin == lvl], na.rm = TRUE))
    ##   (0,0.05] (0.05,0.1] (0.1,0.15] (0.15,0.2] (0.2,0.25] (0.25,0.3] (0.3,0.35] 
    ## 0.03802277         NA 0.11528715 0.18195392 0.22918094 0.27132242 0.33626971 
    ## (0.35,0.4] (0.4,0.45] (0.45,0.5] (0.5,0.55] (0.55,0.6] (0.6,0.65] (0.65,0.7] 
    ## 0.38009637 0.42184059         NA 0.53826028 0.57820253 0.64165116 0.67825992 
    ## (0.7,0.75] (0.75,0.8] (0.8,0.85] (0.85,0.9] (0.9,0.95]   (0.95,1] 
    ## 0.74243926         NA 0.80759621 0.88974267 0.92406966 0.95691077 
    

    This is an area where numerous other options could be advantageous. For instance, the base function by can work, though dealing with its data structure is not always intuitive even if the function call itself is easy to read:

    head(by(df$x, df$bin, median, na.rm = TRUE))
    ## df$bin
    ##   (0,0.05] (0.05,0.1] (0.1,0.15] (0.15,0.2] (0.2,0.25] (0.25,0.3] 
    ## 0.03802277         NA 0.11528715 0.18195392 0.22918094 0.27132242 
    

    You could also use dplyr with ease. This example starts with the original allData and allBins:

    library(dplyr)
    data.frame(x = allData) %>%
        mutate(bin = cut(x, breaks = allBins)) %>%
        group_by(bin) %>%
        summarise(median(x))
    ## Source: local data frame [17 x 2]
    ##           bin  median(x)
    ## 1    (0,0.05] 0.03802277
    ## 2  (0.1,0.15] 0.11528715
    ## 3  (0.15,0.2] 0.18195392
    ## 4  (0.2,0.25] 0.22918094
    ## 5  (0.25,0.3] 0.27132242
    #### ..snip..
    

    The first example preserves empty bins whereas the other methods are not aware of empty bins. There are possibly other ways of using by and dplyr that would include these empty bins, but this seems to suffice.

    EDIT

    After a bit of chat, it was determined that the range of the data was too wide with a bin width of 0.0005. A better solution was devised. (No sample data to provide, sorry, not mine to give ...) I'll use random data to mimic the process:

    set.seed(42)
    x <- 5e7 * runif(5e5)
    
    library(dplyr)
    binSize <- 0.0005
    df <- data.frame(dat = sort(x))
    df$bin <- floor(df$dat / binSize) * binSize
    head(df)
    ##         dat       bin
    ## 1  410.9577  410.9575
    ## 2  456.6275  456.6270
    ## 3  552.3674  552.3670
    ## 4  875.4898  875.4895
    ## 5 1018.6806 1018.6805
    ## 6 1102.2436 1102.2435
    system.time(results <- df %>% group_by(bin) %>% summarize(med = median(dat)))
    ##    user  system elapsed 
    ##   12.08    0.00   12.11 
    head(results)
    ## Source: local data frame [6 x 2]
    ##         bin       med
    ## 1  410.9575  410.9577
    ## 2  456.6270  456.6275
    ## 3  552.3670  552.3674
    ## 4  875.4895  875.4898
    ## 5 1018.6805 1018.6806
    ## 6 1102.2435 1102.2436