Search code examples
text-processingtxr

TXR: removing trailing and leading commas in data respecting header line


I have a lot of data like the following:

There are many ways data could be missing.,,,,,,,,,
,,,,,,,,,,,
An entire interior column could be missing.,,,,,,,,,

[missing/data/inside],,,,,,,,,
a,b,c,,,,,,,
1,,3,,,,,,,
1,,4,,,,,,,
3,,2,,,,,,,
,,,,,,,,,
An indented data with 2 completely missing columns.,,,,,,,,,

,,,,,,,[missing/data/outside],,
,,,,,,,a,b,c
,,,,,,,,3,
,,,,,,,,4,,,,,,,,
,,,,,,,,2,,,,,,,,

I want to tidy it up a bit into:

There are many ways data could be missing.

An entire interior column could be missing.

[missing/data/inside]
a,b,c
1,,3
1,,4
3,,2

An indented data with 2 completely missing columns.

[missing/data/outside]
a,b,c
,3,
,4,
,2,

The challenges are:

  • keeping all non-table text annotations (cleanup up any leading or trailing commas)
  • keeping the appropriate number of commas in data tables based on their header

If I didn't have the second challenge, I would just pipe my output through sed:

... | output | sed 's/,*$//g' | sed 's/^,*//g'

I trust that the number of commas to the left of the data will be equal in the header and data lines. However, I can't trust the same for the lagging commas.

I've written the following TXR code:

@(define empty_line)@\
@  (cases)@\
@/,*/@(eol)@\
@  (or)@\
@/[ ]*/@(eol)@\
@  (or)@\
@(eol)@\
@  (end)@\
@(end)
@(define heading)@/[a-z]+(:[^,]*)?/@(end)
@(define header)@\
@  (cases)@\
@    (heading),@(header)@\
@  (or)@\
@    (heading)@\
@  (end)@\
@(end)
@(define content (hdr))@/.*/@(end)
@(define table (loc head data))
@/,*/[@loc]@(skip)
@{lead /,*/}@{head (header)}@(skip)
@  (collect)
@lead@{data (content head)}@(skip)
@  (until)
@(empty_line)
@  (end)
@(end)
@(collect)
@annotation
@(empty_line)
@(table loc head data)
@(end)
@(output)
@  (repeat)
@annotation

[@loc]
@head
@    (repeat)
@data
@    (end)

@  (end)
@(end)

How might I write the content function to extract out the appropriate number of columns from the input data? I thought maybe it might be as easy as using the coll or rep directives like:

@(define content (hdr))@\
@  (coll :gap 0 :times (length (split-str hdr ",")))@{x /[^,]/}@(end)@\
@(end)

This code doesn't reliably capture nor clean up annotations. Since annotation can exist anywhere that is not a table. How can I extract them and clean them up? I tried a few ways using @(maybe) and another nested @(collect) with no luck.

@  (maybe)
@    (collect)
@/,*/@annotation@/,*/
@    (until)
@(empty_line)
@/,*/[@loc]@(skip)
@    (end)
@  (end)

Update:

I tried to solve just the table data collection part independently, for which I wrote the following code:

@(define heading)@/[^,]+/@(end)
@(define header)@\
@  (cases)@\
@    (heading),@(header)@\
@  (or)@\
@    (heading)@\
@  (end)@\
@(end)
@(define content (hdr))@\
@  (coll :gap 1 :mintimes 1 :maxtimes (length (split-str hdr ",")))@\
@/[^,]*/@\
@  (end)@\
@(end)
@{lead /,*/}@{head (header)}@(skip)
@(collect :gap 0 :vars (data))
@lead@{data (content head)}@/,*/
@(end)
@(output)
@head
@  (repeat)
@data
@  (end)
@(end)

Here is my sample data:

,,alpha,foxtrot: m,tango: b,,
,,1,a,3,,
,,1,b,,,
,,whisky,c,foxtrot,,
,,,d,,,
,,1,,,,
,,,c,,,,,,

The code gives the correct result in all cases except for the penultimate line. It seems to me the trick to solving this problem is to write a regular expression for coll that correctly extracts blank data. Is there another approach that would make this possible? For example, appending the necessary remaining commas?


Solution

  • Below is code which seems to work:

    @(define empty_line)@\
    @  (cases)@\
    @/,*/@(eol)@\
    @  (or)@\
    @/[ ]*/@(eol)@\
    @  (or)@\
    @(eol)@\
    @  (end)@\
    @(end)
    @(define heading)@/[^,]+/@(end)
    @(define header)@\
    @  (cases)@\
    @    (heading),@(header)@\
    @  (or)@\
    @    (heading)@\
    @  (end)@\
    @(end)
    @(define content (hdr))@\
    @/[^,]*/@\
    @  (coll :gap 0 :times (- (length (split-str hdr ",")) 1))@\
    ,@/[^,]*/@\
    @  (end)@\
    @(end)
    @(define table (loc head data))
    @/,*/[@loc]@(skip)
    @{lead /,*/}@{head (header)}@(skip)
    @  (collect)
    @lead@{data (content head)}@(skip)
    @  (until)
    @(empty_line)
    @  (end)
    @(end)
    @(collect)
    @  (collect)
    @/,*/@{annotation /[A-Za-z0-9]+.*[^,]+/}@/,*/
    @  (until)
    @    (cases)
    @(empty_line)
    @/,*/[@loc]@(skip)
    @    (or)
    @(eof)
    @    (end)
    @  (end)
    @(empty_line)
    @(table loc head data)
    @(end)
    @(output)
    @  (repeat)
    @    (repeat)
    @annotation
    
    @    (end)
    [@loc]
    @head
    @    (repeat)
    @data
    @    (end)
    
    @  (end)
    @(end)