Search code examples
rcsvmarkdownr-markdownrenjin

Convert CSV file contents to Markdown


Background

The objective is to read from a CSV file and write the contents in a Markdown table format.

The application uses the R engine Renjin, which does not support knitr, kable, or pandoc.

Problem

The write.table command has an eol option, but no corresponding sol option. Thus for the following:

f <- read.csv('planning.csv')
write.table(
   format(f, digits=2), "",
   sep="|", row.names=F, col.names=F, quote=F, eol="|\n")

The output appears as follows:

Geothermal|1250.0|Electricity|0.0|
Houses|  13.7|Shelter|4.2|
Compostor|   1.2|Recycling|0.2|

But each line should appear with a | prefix, as follows:

|Geothermal|1250.0|Electricity|0.0|
|Houses|  13.7|Shelter|4.2|
|Compostor|   1.2|Recycling|0.2|

It should be possible to do something like (note the extra eol pipe):

write.table(
       format(f, digits=2), "",
       sep="|", row.names=F, col.names=F, quote=F, eol="|\n|")

Then capture everything as a string, concatenate a leading pipe, and finally trim the extraneous ending pipe. That is, fix the problems with the output that'll resemble:

Geothermal|1250.0|Electricity|0.0|
|Houses|  13.7|Shelter|4.2|
|Compostor|   1.2|Recycling|0.2|
|Fire Station|  -9.6|Protection|0.5|
|Roads|   0.0|Transport|0.9|
|

Such string manipulation doesn't seem very R-like, though.

Question

What is the most efficient way to transmogrify a CSV file into a Markdown format without relying on third-party libraries?

The Markdown flavour in question looks like:

|Header|Header|Header|
|---|---|---|
|Data|Data|Data|
|Data|Data|Data|

Hints for how to write only the header data and the table header separator are also welcome.


Solution

  • Since you want to put it into markdown, I think it's safe to say that the table size is manageable, so performance is not a factor. (Edit #3: I had some small bugs relating to presence of row names, so to simplify things I'm going to remove them completely from the sample data.)

    mtcars$rowname <- rownames(mtcars)
    rownames(mtcars) <- NULL
    mtcars <- mtcars[,c(ncol(mtcars), 1:(ncol(mtcars)-1))]
    head(mtcars)
    #             rowname  mpg cyl disp  hp drat    wt  qsec vs am gear carb
    # 1         Mazda RX4 21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
    # 2     Mazda RX4 Wag 21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
    # 3        Datsun 710 22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
    # 4    Hornet 4 Drive 21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
    # 5 Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
    # 6           Valiant 18.1   6  225 105 2.76 3.460 20.22  1  0    3    1
    

    Now the work:

    dashes <- paste(rep("---", ncol(mtcars)), collapse = "|")
    txt <- capture.output(
      write.table(mtcars, stdout(), quote = FALSE, sep = "|", row.names = FALSE)
    )
    txt2 <- sprintf("|%s|", c(txt[1], dashes, txt[-1]))
    head(txt2)
    # [1] "|rowname|mpg|cyl|disp|hp|drat|wt|qsec|vs|am|gear|carb|"  
    # [2] "|---|---|---|---|---|---|---|---|---|---|---|---|"       
    # [3] "|Mazda RX4|21|6|160|110|3.9|2.62|16.46|0|1|4|4|"         
    # [4] "|Mazda RX4 Wag|21|6|160|110|3.9|2.875|17.02|0|1|4|4|"    
    # [5] "|Datsun 710|22.8|4|108|93|3.85|2.32|18.61|1|1|4|1|"      
    # [6] "|Hornet 4 Drive|21.4|6|258|110|3.08|3.215|19.44|1|0|3|1|"
    

    If you are concerned about alignment, you can check for characters (and perhaps others, over to you). This uses the alignment row of the markdown table format:

    (ischar <- vapply(mtcars, is.character, logical(1)))
    # rowname     mpg     cyl    disp      hp    drat      wt    qsec      vs      am    gear    carb 
    #    TRUE   FALSE   FALSE   FALSE   FALSE   FALSE   FALSE   FALSE   FALSE   FALSE   FALSE   FALSE 
    dashes <- paste(ifelse(ischar, ":--", "--:"), collapse = "|")
    txt <- capture.output(write.table(mtcars, stdout(), quote = FALSE, sep = "|", row.names = FALSE))
    txt2 <- sprintf("|%s|", c(txt[1], dashes, txt[-1]))
    head(txt2)
    # [1] "|rowname|mpg|cyl|disp|hp|drat|wt|qsec|vs|am|gear|carb|"  
    # [2] "|:--|--:|--:|--:|--:|--:|--:|--:|--:|--:|--:|--:|"       
    # [3] "|Mazda RX4|21|6|160|110|3.9|2.62|16.46|0|1|4|4|"         
    # [4] "|Mazda RX4 Wag|21|6|160|110|3.9|2.875|17.02|0|1|4|4|"    
    # [5] "|Datsun 710|22.8|4|108|93|3.85|2.32|18.61|1|1|4|1|"      
    # [6] "|Hornet 4 Drive|21.4|6|258|110|3.08|3.215|19.44|1|0|3|1|"
    

    And when you're finally ready to save, use cat(txt2, file = "sometable.md") (or writeLines).

    Edit #1: note that the other suggested answers (including mine above) do not address pipe symbols within the content:

    mtcars$mpg[1] <- "2|1.0"
    ischar <- vapply(mtcars, is.character, logical(1))
    dashes <- paste(ifelse(ischar, ":--", "--:"), collapse = "|")
    txt <- capture.output(write.table(mtcars, stdout(), quote = FALSE, sep = "|", row.names = FALSE))
    txt2 <- sprintf("|%s|", c(txt[1], dashes, txt[-1]))
    head(txt2, n = 3)
    # [1] "|rowname|mpg|cyl|disp|hp|drat|wt|qsec|vs|am|gear|carb|"
    # [2] "|:--|:--|--:|--:|--:|--:|--:|--:|--:|--:|--:|--:|"     
    # [3] "|Mazda RX4|2|1.0|6|160|110|3.9|2.62|16.46|0|1|4|4|"    
    ###                ^ this is the problem
    

    You can escape it manually on all character (or add in factors, too) columns:

    ischar <- vapply(mtcars, is.character, logical(1))
    mtcars[ischar] <- lapply(mtcars[ischar], function(x) gsub("\\|", "&#124;", x))
    dashes <- paste(ifelse(ischar, ":--", "--:"), collapse = "|")
    txt <- capture.output(write.table(mtcars, stdout(), quote = FALSE, sep = "|", row.names = FALSE))
    txt2 <- sprintf("|%s|", c(txt[1], dashes, txt[-1]))
    head(txt2, n = 3)
    # [1] "|rowname|mpg|cyl|disp|hp|drat|wt|qsec|vs|am|gear|carb|" 
    # [2] "|:--|:--|--:|--:|--:|--:|--:|--:|--:|--:|--:|--:|"      
    # [3] "|Mazda RX4|2&#124;1.0|6|160|110|3.9|2.62|16.46|0|1|4|4|"
    ###                ^^^^^^ this is the pipe, interpreted correctly in markdown
    

    This doesn't work well when the pipe is within a code block, though a workaround was suggested here: https://stackoverflow.com/a/17320389/3358272

    At this point, as @alistaire suggested, you're somewhat reimplementing knitr::kable. For that matter, just grab knitr/R/table.R) and use kable_markdown which does the pipe-escaping for you. It takes a character matrix, not a data.frame, so kable_markdown(as.matrix(mtcars)). You can't just grab the single function as it uses several helper functions also in that file. You can certainly prune some functions, including kable itself which requires functions in other files.

    Edit #2: since you said renjin doesn't support *apply functions (a comment suggests that is incorrect, but I'll continue for the sake of argument), here's a for-loop implementation that includes alignment and |-escaping:

    mtcars$mpg[1] <- "2|1.0" # just a reminder that it's here
    dashes <- rep("--:", length(mtcars))
    for (i in seq_along(mtcars)) {
      if (is.character(mtcars[[i]]) || is.factor(mtcars[[i]])) {
        mtcars[[i]] <- gsub("\\|", "&#124;", mtcars[[i]])
        dashes[i] <- ":--"
      }
    }
    txt <- capture.output(write.table(mtcars, stdout(), quote = FALSE, sep = "|", row.names = FALSE))
    txt2 <- sprintf("|%s|", c(txt[1], paste(dashes, collapse = "|"), txt[-1]))
    head(txt2, n = 3)
    # [1] "|rowname|mpg|cyl|disp|hp|drat|wt|qsec|vs|am|gear|carb|" 
    # [2] "|:--|:--|--:|--:|--:|--:|--:|--:|--:|--:|--:|--:|"      
    # [3] "|Mazda RX4|2&#124;1.0|6|160|110|3.9|2.62|16.46|0|1|4|4|"
    

    For the record, my *apply and for-loop implementations are effectively the same performance, while @alistaire's solution is over twice as fast (with mtcars):

    Unit: microseconds
                  expr      min        lq      mean    median        uq      max neval
         apply_noalign  917.881  947.9665 1031.9288  971.3060 1041.5050 1999.499   100
           apply_align  945.960  975.1350 1083.2856  995.7390 1063.7500 3523.101   100
     apply_align_pipes 1110.429 1148.5360 1255.5460 1176.9815 1275.2600 1905.778   100
               forloop 1188.104 1217.0950 1309.2549 1261.2205 1342.3600 2979.010   100
             alistaire  451.830  473.7105  511.5778  496.1370  518.5645  827.443   100
       alistaire_pipes  593.687  626.6900  718.6898  652.7645  700.5360 5460.970   100
    

    I used his original function for alistaire and added a simple gsub for alistaire_pipes. There may be a more efficient way to do it, but (a) simple/straight-forward is good, and (b) I think your tables will be small enough where true performance will not be a driving force.