Search code examples
awktextfieldextract

extract structured data from text files (awk ?) : missing fields must get default value


(I'm working with macos).

I have 70k text files in subfolders I want to extract some data from, recursively, then – if possible – have output written in one tab delimited file for later spreadsheet processing. Files, coming from my wiki (I use PmWiki, which saves data in text files) are formatted this way when complete (unwanted data deleted for readability) :

version=
agent=
author=
charset=
csum=
ctime=1041379201
description=
host=
name=Name.12
rev=3
targets=Target.1,OtherTarget.23,Target.90
text=
time=
title=My title
author:
csum:
diff:
host:
author:
csum:
diff:

I would like to extract data delimited with = for fields named ctime name rev targets title (5 fields).

My main issue is how to get data (keys ctime= rev= targets= name= title=), plus having default values when some are missing ?

I believe one has to test if each of the targeted key exists ; create it with default value if missing ; then extract the wanted fields values and finally tabulate the data.

Expected output would be tab delimited ; missing data would be named something easy to catch later on. i.e., for full file given in example (tabs in place of spaces), output would give something like (ctime, rev, name, title, targets) :

1041379201 3 Name.12 my title Target.1,OtherTarget.23,Target.90

and, for not complete file (missing field, in line 1, is rev ; in line 2, rev and title) :

1041379201 XXX Name.12 my title Target.1,OtherTarget.23,Target.90
1041379201 XXX Name.12 XXX Target.1,OtherTarget.23,Target.90

Final project is to be able to extract data once a month, then have text file easy to use in spreadsheet, monthly updated.

My less worst attempt is something like that (yet doesn't work at all, lacking if/else condition) :

awk 'BEGIN { FS = "=" ;} /^ctime=/ {
                print $2
                next
                }
/^rev=/ {
                print $2
                next}
/^name=/ {
                print $2
                next}
/^title=/ {
                print $2
                next}
/^targets=/ {
                print $2
                next}'

Here it is an original PmWiki file (in that case I still would like to extract ctime name rev targets title (and have default values for missing fields, ctime and title) :

version=pmwiki-2.2.64 ordered=1 urlencoded=1
author=simon
charset=UTF-8
csum=add summary
name=Main.HomePage
rev=203
targets=PmWiki.DocumentationIndex,PmWiki.InitialSetupTasks,PmWiki.BasicEditing,Main.WikiSandbox
text=(:Summary:The default home page for the PmWiki distribution:)%0aWelcome to PmWiki!%0a%0aA local copy of PmWiki's%0adocumentation has been installed along with the software,%0aand is available via the [[PmWiki/documentation index]].  %0a%0aTo continue setting up PmWiki, see [[PmWiki/initial setup tasks]].%0a%0aThe [[PmWiki/basic editing]] page describes how to create pages%0ain PmWiki.  You can practice editing in the [[wiki sandbox]].%0a%0aMore information about PmWiki is available from [[http://www.pmwiki.org]].%0a
time=1400472661

Updating my question.

The way I posted my question may look like more complex than it is. From this, repeated in 70k text files :

word1=line1
word2=line2
word3=line3
...

I would like to get a file collecting every line1, line3, lineX (for command targetting word1, word2, wordX) AND have default value in case word1=line1 or word2=line2 or wordX=lineX doesn't exist at all.

At the end, I have found something pretty close to my need with Rick Smith's answer given to Retrieve default value with grep -e?


Solution

  • I just noticed you said you only wanted to print the values for specific tags which makes things easier. Using GNU awk for ENDFILE and gensub():

    $ cat tst.awk
    BEGIN {
        OFS="\t"
        numTags = split("ctime rev targets name title",tags)
    
        for (tagNr=1; tagNr<=numTags; tagNr++) {
            tag = tags[tagNr]
            printf "%s%s", tag, (tagNr<numTags ? OFS : ORS)
        }
    }
    
    match($0,/^([[:alnum:]_]+)[=:](.*)/,a) {
        tag = a[1]
        val = gensub(" ?" OFS " ?"," ","g",a[2])
        tag2val[tag] = val
    }
    
    ENDFILE {
        for (tagNr=1; tagNr<=numTags; tagNr++) {
            tag = tags[tagNr]
            val = ( tag in tag2val ? tag2val[tag] : "_ABSENT_" )
            val = ( val == "" ? "_NULL_" : val )
            printf "%s%s", val, (tagNr<numTags ? OFS : ORS)
        }
        delete tag2val
    }
    

    $ awk -f tst.awk file
    ctime   rev     targets name    title
    1041379201      3       Target.1,OtherTarget.23,Target.90       Name.12 My title
    

    $ awk -f tst.awk file | column -s$'\t' -t
    ctime       rev  targets                            name     title
    1041379201  3    Target.1,OtherTarget.23,Target.90  Name.12  My title
    

    Original answer:

    It sounds like this might be what you're trying to do if the tags in each input file are unique, requiring GNU awk for several extensions:

    $ cat tst.awk
    BEGIN { OFS="\t" }
    match($0,/^([[:alnum:]_]+)[=:](.*)/,a) {
        tag = a[1]
        val = gensub(" ?" OFS " ?"," ","g",a[2])
    
        if ( !seen[tag]++ ) {
            tags[++numTags] = tag
        }
    
        key2val[ARGIND,tag] = val
    }
    END {
        for (tagNr=1; tagNr<=numTags; tagNr++) {
            tag = tags[tagNr]
            printf "%s%s", tag, (tagNr<numTags ? OFS : ORS)
        }
    
        for ( fileNr=1; fileNr<=ARGIND; fileNr++) {
            for (tagNr=1; tagNr<=numTags; tagNr++) {
                tag = tags[tagNr]
                key = fileNr SUBSEP tag
                val = ( key in key2val ? key2val[key] : "_ABSENT_" )
                val = ( val == "" ? "_NULL_" : val )
                printf "%s%s", val, (tagNr<numTags ? OFS : ORS)
            }
        }
    }
    

    $ awk -f tst.awk file
    version agent   author  charset csum    ctime   description     host    name    rev     targets text    time    title   diff
    _NULL_  _NULL_  _NULL_  _NULL_  _NULL_  1041379201      _NULL_  _NULL_  Name.12 3       Target.1,OtherTarget.23,Target.90       _NULL_   _NULL_  My title        _NULL_
    

    To see the columns visually aligned:

    $ awk -f tst.awk file | column -s$'\t' -t
    version  agent   author  charset  csum    ctime       description  host    name     rev  targets                            text    time    title     diff
    _NULL_   _NULL_  _NULL_  _NULL_   _NULL_  1041379201  _NULL_       _NULL_  Name.12  3    Target.1,OtherTarget.23,Target.90  _NULL_  _NULL_  My title  _NULL_
    

    Just run it on all of your files at once as:

    awk -f tst.awk file1 file2 etc.
    

    and it'll figure out all of the tags across all of your files then print a TSV with the values for all of those tags from all of those files.