I have a corpus of text files that I need to copy while limiting each file to roughly the same word length and keeping sentences complete. Treating any of the characters {.?!} as a sentence boundary is acceptable. I could do this with Python, but I am trying to learn bash, so suggestions are welcome. The approach I have been considering is to overshoot my target word length by a few words and then trim the result to the last sentence boundary.
I am familiar with head and wc, but I can't come up with a way to combine the two. The man page for head does not indicate a way to use word counts, and the man page for wc does not indicate a way to split the file.
Context:
I am working on a text classification task with machine learning (using weka, for the record). I want to make sure that text length (which varies widely in my data) is not influencing the outcomes too much. To do this, I am trying to normalize my text lengths before I perform feature extraction.
Let's consider this test file:
$ cat file
Do I exist? I program. Therefore, I am!
Suppose that we want to truncate this file to complete sentences of 20 characters or fewer:
$ awk -v n=20 -v RS='[.?!]' '{if (length(s $0 RT)>n) exit; else s=s $0 RT;} END{print s;}' file
Do I exist?
If we want 30 characters or fewer:
$ awk -v n=30 -v RS='[.?!]' '{if (length(s $0 RT)>n) exit; else s=s $0 RT;} END{print s;}' file
Do I exist? I program.
-v n=20
This sets the awk variable n to the maximum length that we want (not counting the file's final newline character).
-v RS='[.?!]'
This sets the awk record separator, RS, to a regular expression that matches any of the three characters that you mentioned.
if (length(s $0 RT)>n) exit; else s=s $0 RT
For each record in the file (a record being a sentence), we test whether adding it to s would make the output too long. If it would, we exit. If not, we append it to s.
In awk, $0 represents the complete record and RT holds the separator text that awk actually matched at the end of that record.
END{print s;}
Before we exit, this prints the string s.
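To see why the two limits give different results, it can help to trace the running length that the test length(s $0 RT)>n compares against n. The following is only an illustrative sketch in bash: the three pieces are written out by hand to mirror the records of the test file (each with its terminating punctuation), and the variable names are made up for this example.

```shell
#!/usr/bin/env bash
# Trace the running length that the awk test compares against n.
# The three pieces are the records of the test file, each followed by
# the punctuation character that ended it (hand-copied, an assumption).
total=0
lengths=()
for piece in 'Do I exist?' ' I program.' ' Therefore, I am!'; do
    # ${#piece} expands to the number of characters in the string.
    total=$(( total + ${#piece} ))
    lengths+=("$total")
    printf '%2d  %s\n' "$total" "$piece"
done
```

The running totals are 11, 22 and 39, so only the first sentence fits with n=20, the first two fit with n=30, and the third never fits under either limit.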
Suppose instead that we want to truncate based on the number of words. If we want, for example, 6 words:
$ awk -v n=6 -v RS='[[:space:]]+' 'NR>n{exit;} {printf "%s%s",$0,RT;} END{print"";}' file
Do I exist? I program. Therefore,
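The difference is that we now use whitespace as the record separator. In this way, each record is a word, and we keep printing words until we reach the limit. That output, however, can end in the middle of a sentence. To keep only complete sentences while capping the total at 6 words, we go back to sentence records, add up the number of words in each one (NF), and stop before the total exceeds the limit: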
$ awk -v n=6 -v RS='[.?!]' '{c+=NF; if (c>n) exit; else s=s $0 RT;} END{print s;}' file
Do I exist? I program.
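Since the question mentions learning bash and the idea of taking a chunk of words and then trimming back to the last sentence boundary, here is a rough sketch of that approach without awk. It assumes bash (arrays and parameter expansion), and overshoot_trim is a hypothetical helper name, not an existing command.

```shell
#!/usr/bin/env bash
# Sketch: take the first `limit` words, then drop any trailing partial sentence.
# overshoot_trim is a hypothetical helper name chosen for this example.
overshoot_trim() {
    local limit=$1 file=$2
    local -a words
    # Split the whole file on whitespace into an array of words.
    # read returns non-zero at end of file, hence the "|| true".
    read -r -d '' -a words < "$file" || true
    # Join the first $limit words with single spaces.
    local chunk="${words[*]:0:$limit}"
    # Text after the last sentence boundary (the whole chunk if there is none).
    local tail="${chunk##*[.?!]}"
    # Remove that trailing fragment, leaving only complete sentences.
    printf '%s\n' "${chunk%"$tail"}"
}
```

With the test file above, overshoot_trim 6 file prints Do I exist? I program. To get the overshoot behaviour described in the question, pass the target plus a few extra words as the limit. Note that this sketch collapses all whitespace to single spaces and prints an empty line when the chunk contains no sentence boundary at all.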
The commands above set the record separator, RS, to a regular expression and read the matched separator back from RT. RT is a GNU awk (gawk) extension, and POSIX leaves the behavior of a multi-character RS unspecified, so this may require gawk. The OSX man page for awk does not say whether this feature is supported or not. @bebop, however, reports that the above code can be run successfully on OSX after installing gawk from macports.
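If gawk is not available, the sentence-based word limit can probably be approximated with POSIX awk features only (match, substr and split), at the cost of reading the whole file into memory. The sketch below is an assumption-laden variant, not the code from the answer above: truncate_sentences and trim_corpus are hypothetical helper names, and the loop assumes the corpus files end in .txt.

```shell
#!/usr/bin/env bash
# truncate_sentences N FILE: print the leading complete sentences of FILE
# that total at most N words. Hypothetical helper; avoids regex RS and RT.
truncate_sentences() {
    awk -v n="$1" '
        # Gather the whole file into one string, keeping line breaks.
        { text = text (NR > 1 ? "\n" : "") $0 }
        END {
            out = ""; count = 0
            # Repeatedly peel off the text up to the next sentence boundary.
            while (match(text, /[.?!]/)) {
                sentence = substr(text, 1, RSTART)
                # split() returns the number of whitespace-separated words.
                count += split(sentence, parts)
                if (count > n) break
                out = out sentence
                text = substr(text, RSTART + 1)
            }
            print out
        }
    ' "$2"
}

# trim_corpus N SRC DST: write a truncated copy of every SRC/*.txt into DST.
# The .txt suffix is an assumption about how the corpus is named.
trim_corpus() {
    local limit=$1 src=$2 dst=$3 f
    mkdir -p "$dst"
    for f in "$src"/*.txt; do
        [ -e "$f" ] || continue   # skip when the glob matches nothing
        truncate_sentences "$limit" "$f" > "$dst/$(basename "$f")"
    done
}
```

For example, truncate_sentences 6 file prints Do I exist? I program. for the test file, and trim_corpus 200 corpus trimmed would copy each file with a 200-word cap (the directory names and the cap are placeholders). Any text after the last sentence boundary is dropped.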