bash machine-learning nlp feature-selection normalize

limit text files to a certain word length, but keep complete sentences

I have a corpus of text files that I need to copy, but limiting each file to roughly the same word length, while maintaining complete sentences. Treating any punctuation within {.?!} as a sentence boundary is acceptable. I could do this with python, but I am trying to learn bash, so suggestions are welcome. The approach I have been considering is to overshoot my target word length by a few words and then trim the result to the last sentence boundary.

I am familiar with head and wc, but I can't come up with a way to combine the two. The man page for head does not indicate a way to use word counts, and the man page for wc does not indicate a way to split the file.

Context: I am working on a text classification task with machine-learning (using weka, for the record). I want to make sure that text length (which varies widely in my data) is not influencing the outcomes too much. To do this, I am trying to normalize my text lengths before I perform feature extraction.

Solution

Let's consider this test file:

$ cat file
Do I exist? I program. Therefore, I am!

Suppose that we want to truncate this file to complete sentences totaling 20 characters or fewer:

$ awk -v n=20 -v RS='[.?!]' '{if (length(s $0 RT)>n) exit; else s=s $0 RT;} END{print s;}' file
Do I exist?

If we want 30 characters or fewer:

$ awk -v n=30 -v RS='[.?!]' '{if (length(s $0 RT)>n) exit; else s=s $0 RT;} END{print s;}' file
Do I exist? I program.

How it works

-v n=20

This sets the awk variable n to the max length that we want (not counting the file's final newline character).

-v RS='[.?!]'

This sets the awk record separator, RS, to any of the three characters that you mentioned.

if (length(s $0 RT)>n) exit; else s=s $0 RT

For each record in the file (a record being a sentence), we test to see if adding it to s would make the output too long. If it makes the output too long, then we exit. If not, we add it to s.

In awk, $0 represents the complete record and RT is the record separator that awk found at the end of the record.

END{print s;}

Before we exit, this prints the string s.

Alternative 1: Truncating based on number of words

Suppose instead that we want to truncate based on the number of words. If we want, for example, 6 words:

$ awk -v n=6 -v RS='[[:space:]]+' 'NR>n{exit;} {printf "%s%s",$0,RT;} END{print"";}' file
Do I exist? I program. Therefore,

The difference is that we now use whitespace as the record separator. In this way, each record is a word, and we keep printing words until we reach the limit.
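
Because this variant counts words only, it can stop in the middle of a sentence, as the trailing "Therefore," shows. One way to finish the overshoot-then-trim idea from the question is to cut the partial sentence off the end. The following is a minimal sketch that uses standard tools instead of awk; the function name first_words is hypothetical, and it assumes that every ., ? or ! ends a sentence.

```shell
# first_words N FILE: keep the first N words of FILE, then drop any
# trailing partial sentence. (Hypothetical helper, not part of the answer above.)
first_words() {
    tr -s '[:space:]' '\n' < "$2" |   # put one word on each line
        grep -v '^$' |                # drop the empty line that leading whitespace produces
        head -n "$1" |                # keep the first N words
        paste -sd' ' - |              # rejoin the words with single spaces
        sed 's/[^.?!]*$//'            # delete everything after the last . ? or !
}
```

With the test file, first_words 6 file prints "Do I exist? I program." Note that whitespace is normalized to single spaces, and that the output is empty if the first N words contain no sentence boundary.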

Alternative 2: Whole sentences but limited number of words

$ awk -v n=6 -v RS='[.?!]' '{c+=NF; if (c>n) exit; else s=s $0 RT;} END{print s;}' file
Do I exist? I program.
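
Here c accumulates NF, the number of whitespace-separated words in each sentence, and the program exits before appending the sentence that would push the total past n. To apply this to the whole corpus from the question, a loop along the following lines could work. It is only a sketch: the function name trim_corpus, the directory arguments, and the *.txt pattern are assumptions, and gawk is called by name because RT is a gawk extension.

```shell
# trim_corpus SRC_DIR DEST_DIR MAX_WORDS
# Copy every *.txt file from SRC_DIR to DEST_DIR, keeping only whole
# sentences up to MAX_WORDS words. (All names here are illustrative.)
trim_corpus() {
    src=$1 dest=$2 max=$3
    mkdir -p "$dest"
    for f in "$src"/*.txt; do
        [ -e "$f" ] || continue       # skip the literal pattern when nothing matches
        gawk -v n="$max" -v RS='[.?!]' \
            '{c+=NF; if (c>n) exit; else s=s $0 RT;} END{print s;}' \
            "$f" > "$dest/$(basename "$f")"
    done
}
```

For example, trim_corpus corpus trimmed 500 would write a shortened copy of each file in corpus to trimmed, leaving the originals untouched.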

Mac OSX

The above sets the record separator, RS, to a regular expression and reads the matched separator back from RT. RT is a GNU awk (gawk) extension, so the commands need gawk; a regular-expression RS is also not guaranteed in every awk. The OSX man page for awk does not say whether these features are supported. @bebop, however, reports that the above code runs successfully on OSX after installing gawk from MacPorts.
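
Where installing gawk is not an option, the whole-sentence word limit of Alternative 2 can be approximated with functions that POSIX awk provides (match, substr, split), avoiding both RT and a regular-expression RS. This is a sketch rather than a drop-in replacement: the function name trim_words is hypothetical, and it reads the entire file into memory before trimming.

```shell
# trim_words MAX_WORDS FILE: print the leading whole sentences of FILE
# whose combined word count does not exceed MAX_WORDS.
# (Hypothetical helper; uses only POSIX awk features.)
trim_words() {
    awk -v n="$1" '
        { text = text $0 "\n" }                     # collect the whole file
        END {
            # Peel off one sentence at a time, up to the next . ? or !
            while (match(text, /[.?!]/)) {
                sentence = substr(text, 1, RSTART)  # includes the punctuation
                text = substr(text, RSTART + 1)     # remainder of the file
                words += split(sentence, parts)     # count words in the sentence
                if (words > n) break                # this sentence would overshoot
                out = out sentence
            }
            print out
        }' "$2"
}
```

With the test file, trim_words 6 file prints "Do I exist? I program.", matching the gawk output above. Any text after the last sentence boundary is dropped.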