Tags: bash, machine-learning, nlp, feature-selection, normalize

Limit text files to a certain word length, but keep complete sentences


I have a corpus of text files that I need to copy while limiting each file to roughly the same word length and keeping sentences complete. Treating any of the characters in {.?!} as a sentence boundary is acceptable. I could do this with Python, but I am trying to learn Bash, so suggestions are welcome. The approach I have been considering is to overshoot my target word length by a few words and then trim the result back to the last sentence boundary.

I am familiar with head and wc, but I can't come up with a way to combine the two. The man page for head does not indicate a way to limit output by word count, and the man page for wc does not indicate a way to split the file.

Context: I am working on a text classification task with machine learning (using Weka, for the record). I want to make sure that text length (which varies widely in my data) is not influencing the outcomes too much. To do this, I am trying to normalize my text lengths before I perform feature extraction.


Solution

  • Let's consider this test file:

    $ cat file
    Do I exist? I program. Therefore, I am!
    

    Suppose that we want to truncate this file to complete sentences totaling 20 characters or fewer:

    $ awk -v n=20 -v RS='[.?!]' '{if (length(s $0 RT)>n) exit; else s=s $0 RT;} END{print s;}' file
    Do I exist?
    

    If we want 30 characters or fewer:

    $ awk -v n=30 -v RS='[.?!]' '{if (length(s $0 RT)>n) exit; else s=s $0 RT;} END{print s;}' file
    Do I exist? I program.
    

    How it works

    • -v n=20

      This sets the awk variable n to the max length that we want (not counting the file's final newline character).

    • -v RS='[.?!]'

      This sets the awk record separator, RS, to a regular expression that matches any of the three characters that you mentioned, so each record is one sentence.

    • if (length(s $0 RT)>n) exit; else s=s $0 RT

      For each record in the file (a record being a sentence), we test to see if adding it to s would make the output too long. If it makes the output too long, then we exit. If not, we add it to s.

      In awk, $0 represents the complete record and RT is the record separator that awk found at the end of the record.

    • END{print s;}

      Whether awk reaches the end of the file or runs the exit statement, the END block still executes, and it prints the string s.
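
    The steps above can also be written out as a multi-line script, which is easier to read and modify than the one-liner. This is only a sketch: it assumes GNU awk is installed under the name gawk, and the function name truncate_chars and the file sample.txt are hypothetical names chosen for illustration.

    ```shell
    # Hypothetical helper: print the leading whole sentences of file $2
    # that fit within $1 characters. Assumes GNU awk (gawk) for regex RS and RT.
    truncate_chars() {
        gawk -v n="$1" -v RS='[.?!]' '
            {
                candidate = s $0 RT          # output so far, plus this sentence and its terminator
                if (length(candidate) > n)   # adding this sentence would exceed the limit
                    exit
                s = candidate                # otherwise keep it and move on
            }
            END { print s }                  # runs at end of input and also after exit
        ' "$2"
    }

    # Recreate the test file from above and try both limits.
    printf 'Do I exist? I program. Therefore, I am!\n' > sample.txt
    truncate_chars 20 sample.txt
    truncate_chars 30 sample.txt
    ```

    The two calls should reproduce the outputs shown earlier for n=20 and n=30.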

    Alternative 1: Truncating based on number of words

    Suppose instead that we want to truncate based on the number of words. If we want, for example, 6 words:

    $ awk -v n=6 -v RS='[[:space:]]+' 'NR>n{exit;} {printf "%s%s",$0,RT;} END{print"";}' file
    Do I exist? I program. Therefore, 
    

    The difference is that we now use whitespace as the record separator. In this way, each record is a word, and we keep printing words until we reach the limit.
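
    A word-level cut like this is also the first half of the approach described in the question: overshoot the target by a few words, then trim back to the last sentence boundary. The following is a rough sketch of that idea using tr, head, paste and sed rather than awk, so it is a different technique from the commands above. The function name trim_to_sentences and the file sample.txt are hypothetical.

    ```shell
    # Hypothetical helper: take the first $1 words of file $2,
    # then drop any trailing partial sentence.
    trim_to_sentences() {
        tr -s '[:space:]' '\n' < "$2" |   # one word per line
            head -n "$1" |                # keep the first $1 words (the overshoot)
            paste -sd' ' - |              # join the words back onto a single line
            sed 's/[^.?!]*$//'            # delete everything after the last . ? or !
    }

    printf 'Do I exist? I program. Therefore, I am!\n' > sample.txt
    trim_to_sentences 7 sample.txt
    ```

    With a target of 6 words and an overshoot of 1, the seven words kept by head end in the partial sentence " Therefore, I", which sed removes. Note that this sketch collapses all whitespace to single spaces, and it prints an empty line if the first $1 words contain no sentence boundary at all.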

    Alternative 2: Whole sentences but limited number of words

    To keep whole sentences while limiting the total to, for example, 6 words:

    $ awk -v n=6 -v RS='[.?!]' '{c+=NF; if (c>n) exit; else s=s $0 RT;} END{print s;}' file
    Do I exist? I program.
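
    This works like the character-based version, except that the running total c counts words: NF is the number of whitespace-separated fields in the current record, which here is the number of words in the sentence. To apply it to a whole corpus, as the question asks, the command can be wrapped in a loop. The sketch below assumes GNU awk is available as gawk; the directory names corpus and normalized, the .txt suffix, and the limit of 6 are all hypothetical.

    ```shell
    # Hypothetical layout: originals in corpus/, truncated copies in normalized/.
    mkdir -p corpus normalized
    printf 'Do I exist? I program. Therefore, I am!\n' > corpus/sample.txt

    limit=6   # maximum number of words to keep per file
    for f in corpus/*.txt; do
        gawk -v n="$limit" -v RS='[.?!]' '
            {
                c += NF              # NF = number of words in this sentence
                if (c > n) exit      # this sentence would push the total past the limit
                s = s $0 RT          # otherwise keep the sentence and its terminator
            }
            END { print s }
        ' "$f" > "normalized/$(basename "$f")"
    done

    cat normalized/sample.txt
    wc -w < normalized/sample.txt
    ```

    The originals are left untouched, and wc -w gives a quick check of the resulting length. If the first sentence of a file is already longer than the limit, the copy contains only an empty line.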
    

    Mac OSX

    The above sets the record separator, RS, to a regular expression and reads the matched separator back from RT. RT is a GNU awk (gawk) extension, so the commands as written should be run with gawk. The OS X man page for awk does not say whether a regular-expression RS is supported. @bebop, however, reports that the above code runs successfully on OS X after installing gawk from MacPorts.
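
    If installing gawk is not an option, the same result can be approximated without a regular-expression RS or RT. The following is a sketch under that assumption, not a tested drop-in replacement: it reads the whole file into one string and uses match() and substr(), which are standard awk functions, to peel off one sentence at a time. The function name truncate_words_portable and the file sample.txt are hypothetical.

    ```shell
    # Quick check of whether the local awk sets RT (gawk prints "RT supported").
    printf 'a.b' | awk -v RS='[.]' 'NR == 1 { print (RT == "." ? "RT supported" : "RT not supported") }'

    # Hypothetical portable variant: whole sentences from file $2, at most $1 words.
    truncate_words_portable() {
        awk -v n="$1" '
            { text = text $0 "\n" }                   # collect the whole file, keeping line breaks
            END {
                out = ""; count = 0
                while (match(text, /[.?!]/)) {        # RSTART = position of the next terminator
                    end = RSTART
                    sentence = substr(text, 1, end)   # sentence including its terminator
                    words = split(sentence, parts)    # words in this sentence
                    if (count + words > n) break      # stop before exceeding the limit
                    count += words
                    out = out sentence
                    text = substr(text, end + 1)      # continue after the terminator
                }
                print out
            }
        ' "$2"
    }

    printf 'Do I exist? I program. Therefore, I am!\n' > sample.txt
    truncate_words_portable 6 sample.txt
    ```

    Unlike the gawk commands, this variant drops any trailing text that does not end in one of the three terminators, and it holds the whole file in memory, which should be fine for ordinary text files.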