
Retrieve text from predictable toml files and output as CSV


I have some predictable .toml files with content structure like:

key1 = "someID"
key2 = "someVersionNumber"
key3 = "someTag"
key4 = "someOtherTag"
key5 = [] #empty array, sometimes contains strings
key6 = "long text"
key7 = "more text"
key8 = """
- text
- more text
- so much text
"""

I want to transform it to CSV like this:

"key1","key2","key3","key4","key5","key6","key7","key8"
"someID","someVersionNumber","someTag","someOtherTag","","long text","more text","- text- more text- so much text"

Can I do this with a few lines of bash commands?

What about if I want to combine the CSV lines from all the files into one output, e.g.

"key1","key2","key3","key4","key5","key6","key7","key8"
"someID","someVersionNumber","someTag","someOtherTag","","long text","more text","- text- more text- so much text"
"someID","someVersionNumber","someTag","someOtherTag","","long text","more text","- text- more text- so much text"
"someID","someVersionNumber","someTag","someOtherTag","","long text","more text","- text- more text- so much text"

...i.e. the output would be one line of CSV per .toml file plus the header at the top (always the same CSV header and number of columns since the .toml files are predictable).

Am I looking at sed, awk, or something even simpler? I have looked at some related questions but feel I must be missing something, as the answers there offer far more functionality than I need:

Extract data between two points in a text file

Parsing json with awk/sed in bash to get key value pair


Solution

  • If there were only one input file, I'd go with a Perl one-liner. Unfortunately, it comes out rather complex:

    perl -pe 'if(/"""/&&s/"""/"/.../"""/&&s/"""/"\n/){s/[\n\r]//;};if(/ = \[([^]]*)]/){$r=$1eq""?"\"\"":$1=~s/"\s*,\s*"/ /gr;s/ = \[([^]]*)]/ = $r/};s/"\s*#[^"\n]*$/"/' one.toml | perl -ne 'if(/^([^"]+) = "(.*)"/){push@k,$1;push@v,"\"$2\""}END{print((join",",@k),"\n",join",",@v)}'
    

    Things only get worse if we need to operate on multiple (*) files at once:

    perl -ne 'if(/"""/&&s/"""/"/.../"""/&&s/"""/"\n/){s/[\n\r]//;};if(/ = \[([^]]*)]/){$r=$1eq""?"\"\"":$1=~s/"\s*,\s*"/ /gr;s/ = \[([^]]*)]/ = $r/};s/"\s*#[^"\n]*$/"/;print;print"-\n"if eof' *.toml | perl -ne 'if(/^-$/){push@o,join",",@k if scalar@o==0;push@o,join",",@v;@k=@v=()};if(/^([^"]+) = "(.*)"/){push@k,$1;push@v,"\"$2\""}END{print join"\n",@o}'
    

    The complexity of the one-liners and the need to handle multiple files both call for a structured script. Here it is in Perl, but the same can be done in Python or any language you're comfortable with:

    #!/usr/bin/env perl
    use strict; use warnings; my @output;
    
    foreach my $filename (@ARGV) {
        my ($content, @lines, $replace, @keys, @values);
        open my $fh, "<:encoding(utf8)", $filename or die "Could not open $filename: $!";
        {local $/; $content = <$fh>;}   # slurp the whole file at once
        # collapse each """multiline""" string into a one-line "string"
        $content =~ s/"""([^"]*)"""/'"' . $1=~s#[\r\n]##rg . '"'/ge;
        @lines = split (/[\r\n]/, $content);
        foreach my $line (@lines) {
            # flatten an array value: [] becomes "", ["a", "b"] becomes "a b"
            if ($line =~ m/ = \[([^]]*)]/) {
                $replace = $1 eq "" ? '""' : $1 =~ s/"\s*,\s*"/ /gr;
                $line =~ s/ = \[([^]]*)]/ = $replace/
            }
            $line =~ s/"\s*#[^"]*$/"/;        # strip an inline comment after a string value
            $line =~ m/^([^"]+) = "(.*)"/;    # capture the key and its string value
            push @keys, $1;
            push @values, '"' . $2 . '"'
        }
        push @output, join ",", @keys if scalar @output == 0;   # header row, first file only
        push @output, join ",", @values
    }
    print join "\n", @output
    

    Notes:

    Much of the complexity comes from having to deal with arrays (!), comments and multiline strings. Each needs some preprocessing, and that takes up most of the solution's length. Moreover, additional information would be needed about possible corner cases and how to deal with them (e.g. how to fit an array of strings in a CSV). All of this only underlines the importance of input data quality and consistency. The proposed solution is by no means complete or robust, as it makes several assumptions about the input data and the desired output format. Here's how I tackled the mentioned issues:

    • values are expected to be strings only, as they are in the posted sample file. The script does not handle numbers, dates and booleans.
    • arrays can be either empty [] or arrays of strings ["my", "array"]. In the absence of a clear specification by the OP, they translate to a single string that is the concatenation of all element strings. No line breaks are allowed within an array, nor can an array contain other arrays.
    • comments are handled only if they come inline after a string value. Comment-only lines are not supported.
    • indentation, blank lines and section headers are not handled.
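For readers who prefer Python, here is a minimal sketch of the three preprocessing rules listed above (multiline strings, arrays, inline comments). The function names are my own and purely illustrative, and the regular expressions are ported from the Perl script under the same assumptions: string values only, single-line arrays of strings, comments only after a string value.

```python
import re

def collapse_multiline(text):
    # Turn each triple-quoted block into a one-line "..." string
    # by deleting the line breaks inside it.
    def join_lines(match):
        return '"' + re.sub(r"[\r\n]", "", match.group(1)) + '"'
    return re.sub(r'"""([^"]*)"""', join_lines, text)

def flatten_array(line):
    # [] becomes "", and ["a", "b"] becomes "a b".
    # Assumes a single-line array that contains only strings.
    def replace(match):
        inner = match.group(1)
        value = '""' if inner == "" else re.sub(r'"\s*,\s*"', " ", inner)
        return " = " + value
    return re.sub(r" = \[([^\]]*)\]", replace, line, count=1)

def strip_inline_comment(line):
    # Drop a trailing "# ..." comment that follows a string value.
    return re.sub(r'"\s*#[^"]*$', '"', line)
```

Applied in that order, `key5 = [] #empty array, sometimes contains strings` first becomes `key5 = "" #empty array, sometimes contains strings` and then `key5 = ""`, which is the shape the key/value regex in the script expects.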

    Test run:

    $ perl toml-to-csv.pl *.toml
    "someID1","someVersionNumber1","someTag1","someOtherTag1","","long text1","more text1","- text- more text- so much text"
    "someID2","someVersionNumber2","someTag2","someOtherTag2","Array","long text2","more text2","- text- more text- so much text"
    "someID3","someVersionNumber3","someTag3","someOtherTag3","My array","long text3","more text3","- text- more text- so much text"
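The final step, one header line followed by one row per file, can also be sketched in Python with the standard `csv` module. This is only an illustration, not part of the Perl script: `records_to_csv` is a hypothetical helper that assumes each file has already been parsed into a dict of string values with the same keys in the same order. Unlike the Perl version, the header is quoted too (as in the question), and `csv.QUOTE_ALL` doubles any embedded double quotes.

```python
import csv
import io

def records_to_csv(records):
    # records: one dict per .toml file, all with the same keys in the same order
    out = io.StringIO()
    writer = csv.writer(out, quoting=csv.QUOTE_ALL, lineterminator="\n")
    for index, record in enumerate(records):
        if index == 0:
            writer.writerow(record.keys())   # header row, written once
        writer.writerow(record.values())     # one CSV row per file
    return out.getvalue()

if __name__ == "__main__":
    sample = [
        {"key1": "someID1", "key5": ""},
        {"key1": "someID2", "key5": "My array"},
    ]
    print(records_to_csv(sample), end="")
```

Letting the `csv` module do the quoting avoids hand-building `"..."` strings, which matters as soon as a value itself contains a double quote.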