(I'm working on macOS.)
I have 70k text files in subfolders that I want to extract some data from, recursively, and then, if possible, have the output written to one tab-delimited file for later spreadsheet processing. The files come from my wiki (I use PmWiki, which stores pages as text files) and, when complete, are formatted this way (unwanted data removed for readability):
version=
agent=
author=
charset=
csum=
ctime=1041379201
description=
host=
name=Name.12
rev=3
targets=Target.1,OtherTarget.23,Target.90
text=
time=
title=My title
author:
csum:
diff:
host:
author:
csum:
diff:
I would like to extract the data delimited by = for the fields named ctime, name, rev, targets and title (5 fields).
My main issue is how to get the data (keys ctime=, rev=, targets=, name=, title=) while also supplying default values when some keys are missing.
I believe one has to test whether each of the targeted keys exists; create it with a default value if missing; then extract the wanted field values and finally tabulate the data.
Expected output would be tab-delimited; missing data would be named something easy to catch later on. I.e., for the full file given in the example (with tabs in place of the spaces shown), output would give something like (ctime, rev, name, title, targets):
1041379201 3 Name.12 My title Target.1,OtherTarget.23,Target.90
and, for incomplete files (the missing field in line 1 is rev; in line 2, rev and title):
1041379201 XXX Name.12 My title Target.1,OtherTarget.23,Target.90
1041379201 XXX Name.12 XXX Target.1,OtherTarget.23,Target.90
The final goal is to be able to extract the data once a month, ending up with a text file that is easy to use in a spreadsheet and updated monthly.
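The steps described above (test whether each targeted key exists, fall back to a default, then tabulate) can be sketched in portable awk, which also works with the BSD awk shipped with macOS. The XXX placeholder comes from the question's expected output; the sample file name page.txt is illustrative:

```shell
# Sample page with the "rev" field missing (file name is illustrative).
cat > page.txt <<'EOF'
ctime=1041379201
name=Name.12
targets=Target.1,OtherTarget.23,Target.90
title=My title
EOF

# Store every key's value, then print the five wanted fields
# tab-separated, substituting XXX for any key never seen.
awk -F'=' '
  /=/ { v[$1] = substr($0, index($0, "=") + 1) }   # keep any "=" inside values
  END {
    n = split("ctime rev name title targets", want, " ")
    for (i = 1; i <= n; i++)
      printf "%s%s", (want[i] in v ? v[want[i]] : "XXX"), (i < n ? "\t" : "\n")
  }
' page.txt
```

Because the defaults are emitted in an END block, this sketch must be run once per file; GNU awk's ENDFILE, used in the answer below, removes that restriction.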
My best attempt so far is something like this (it doesn't work at all yet, as it lacks the if/else logic):
awk 'BEGIN { FS = "=" }
/^ctime=/   { print $2; next }
/^rev=/     { print $2; next }
/^name=/    { print $2; next }
/^title=/   { print $2; next }
/^targets=/ { print $2; next }'
Here is an original PmWiki file (in this case I would still like to extract ctime, name, rev, targets and title, and have default values for the missing fields, ctime and title):
version=pmwiki-2.2.64 ordered=1 urlencoded=1
author=simon
charset=UTF-8
csum=add summary
name=Main.HomePage
rev=203
targets=PmWiki.DocumentationIndex,PmWiki.InitialSetupTasks,PmWiki.BasicEditing,Main.WikiSandbox
text=(:Summary:The default home page for the PmWiki distribution:)%0aWelcome to PmWiki!%0a%0aA local copy of PmWiki's%0adocumentation has been installed along with the software,%0aand is available via the [[PmWiki/documentation index]]. %0a%0aTo continue setting up PmWiki, see [[PmWiki/initial setup tasks]].%0a%0aThe [[PmWiki/basic editing]] page describes how to create pages%0ain PmWiki. You can practice editing in the [[wiki sandbox]].%0a%0aMore information about PmWiki is available from [[http://www.pmwiki.org]].%0a
time=1400472661
Updating my question.
The way I posted my question may make it look more complex than it is. From this, repeated in 70k text files:
word1=line1
word2=line2
word3=line3
...
I would like to get a file collecting every line1, line3, lineX (for a command targeting word1, word2, wordX) AND have a default value in case word1=line1, word2=line2 or wordX=lineX doesn't exist at all.
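With the simplified data above, the same test-or-default idea fits in a short portable awk program (the _ABSENT_ placeholder is just one possible marker):

```shell
printf 'word1=line1\nword3=line3\n' > simple.txt   # word2 is deliberately missing

# Collect every key=value pair, then print the wanted values in order,
# defaulting to _ABSENT_ for keys that never appeared.
awk -F'=' '
  { v[$1] = $2 }
  END {
    n = split("word1 word2 word3", k, " ")
    for (i = 1; i <= n; i++)
      printf "%s%s", (k[i] in v ? v[k[i]] : "_ABSENT_"), (i < n ? "\t" : "\n")
  }
' simple.txt
```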
In the end, I found something pretty close to my need in Rick Smith's answer to "Retrieve default value with grep -e?".
I just noticed you said you only want to print the values for specific tags, which makes things easier. Using GNU awk for ENDFILE and gensub():
$ cat tst.awk
BEGIN {
    OFS = "\t"
    numTags = split("ctime rev targets name title",tags)
    for (tagNr=1; tagNr<=numTags; tagNr++) {
        tag = tags[tagNr]
        printf "%s%s", tag, (tagNr<numTags ? OFS : ORS)
    }
}
match($0,/^([[:alnum:]_]+)[=:](.*)/,a) {
    tag = a[1]
    val = gensub(" ?" OFS " ?"," ","g",a[2])
    tag2val[tag] = val
}
ENDFILE {
    for (tagNr=1; tagNr<=numTags; tagNr++) {
        tag = tags[tagNr]
        val = ( tag in tag2val ? tag2val[tag] : "_ABSENT_" )
        val = ( val == "" ? "_NULL_" : val )
        printf "%s%s", val, (tagNr<numTags ? OFS : ORS)
    }
    delete tag2val
}
$ awk -f tst.awk file
ctime rev targets name title
1041379201 3 Target.1,OtherTarget.23,Target.90 Name.12 My title
$ awk -f tst.awk file | column -s$'\t' -t
ctime rev targets name title
1041379201 3 Target.1,OtherTarget.23,Target.90 Name.12 My title
Original answer:
It sounds like this might be what you're trying to do if the tags in each input file are unique, requiring GNU awk for several extensions:
$ cat tst.awk
BEGIN { OFS = "\t" }
match($0,/^([[:alnum:]_]+)[=:](.*)/,a) {
    tag = a[1]
    val = gensub(" ?" OFS " ?"," ","g",a[2])
    if ( !seen[tag]++ ) {
        tags[++numTags] = tag
    }
    key2val[ARGIND,tag] = val
}
END {
    for (tagNr=1; tagNr<=numTags; tagNr++) {
        tag = tags[tagNr]
        printf "%s%s", tag, (tagNr<numTags ? OFS : ORS)
    }
    for (fileNr=1; fileNr<=ARGIND; fileNr++) {
        for (tagNr=1; tagNr<=numTags; tagNr++) {
            tag = tags[tagNr]
            key = fileNr SUBSEP tag
            val = ( key in key2val ? key2val[key] : "_ABSENT_" )
            val = ( val == "" ? "_NULL_" : val )
            printf "%s%s", val, (tagNr<numTags ? OFS : ORS)
        }
    }
}
$ awk -f tst.awk file
version agent author charset csum ctime description host name rev targets text time title diff
_NULL_ _NULL_ _NULL_ _NULL_ _NULL_ 1041379201 _NULL_ _NULL_ Name.12 3 Target.1,OtherTarget.23,Target.90 _NULL_ _NULL_ My title _NULL_
To see the columns visually aligned:
$ awk -f tst.awk file | column -s$'\t' -t
version agent author charset csum ctime description host name rev targets text time title diff
_NULL_ _NULL_ _NULL_ _NULL_ _NULL_ 1041379201 _NULL_ _NULL_ Name.12 3 Target.1,OtherTarget.23,Target.90 _NULL_ _NULL_ My title _NULL_
Just run it on all of your files at once as:
awk -f tst.awk file1 file2 etc.
and it'll figure out all of the tags across all of your files then print a TSV with the values for all of those tags from all of those files.
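To apply this to 70k files in subfolders on macOS, note that the stock /usr/bin/awk is BSD awk, which has neither ENDFILE nor gensub(); GNU awk can be installed (e.g. via Homebrew: brew install gawk), or the per-file logic can be driven from find with a portable script. A sketch of the latter, assuming the pages live under PmWiki's default wiki.d/ directory (an assumption; adjust the path), with XXX as the missing-value marker from the question:

```shell
# Two sample page files standing in for the real wiki.d/ tree (assumed layout).
mkdir -p wiki.d/Group
printf 'ctime=1041379201\nname=Name.12\nrev=3\ntargets=Target.1\ntitle=My title\n' > wiki.d/Group/Page1
printf 'ctime=1041379201\nname=Name.13\ntargets=Target.2\n' > wiki.d/Group/Page2

# Portable per-file script: END stands in for GNU awk's ENDFILE because
# the script is invoked once per file, so BSD awk on macOS is enough.
cat > extract.awk <<'EOF'
BEGIN { FS = "=" }
/=/ { v[$1] = substr($0, index($0, "=") + 1) }
END {
  n = split("ctime rev name title targets", want, " ")
  for (i = 1; i <= n; i++)
    printf "%s%s", (want[i] in v ? v[want[i]] : "XXX"), (i < n ? "\t" : "\n")
}
EOF

printf 'ctime\trev\tname\ttitle\ttargets\n' > report.tsv   # write the header once
find wiki.d -type f | sort | while IFS= read -r f; do
  awk -f extract.awk "$f" >> report.tsv
done
```

One awk process per file is slower than a single gawk run over all files at once, but for a monthly job it is usually acceptable, and report.tsv can be opened directly in a spreadsheet.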