Tags: bash, macos, split, onenote

Splitting a text file using a separate file of line numbers


I have a text file containing multiple (~80) OneNote pages concatenated together, which I'm trying to split into one file per page. Since the pages are of variable length, I want to split at the line numbers of the page titles. I've been able to extract those line numbers into a separate file, but I haven't been able to figure out how to perform the split with them. E.g.

Log.txt:

Tuning             //Page Title
09 November 2016   //Date
23:19              //Time
 
Content text...    //Page Content
 
Week 46            //Another title, want to split here
14 November 2016
13:47
 
Text..
More text...       //Content can be over multiple lines

Week 47            //Another title, want to split here
22 November 2016
11:15

Text
etc...

Line numbers in a separate file, Lines.txt:

1
7
14

The expected output in this example would be three files, each running from a page title down to the last line before the next page title.

log1.txt, log2.txt, log3.txt

$ cat log1.txt
Tuning             
09 November 2016
23:19

Content text...

$

I found a lot of answers about splitting into fixed-size chunks (e.g. every 50 lines), which doesn't work here since the sections are of variable length. Most answers about splitting at specific line numbers dealt with just a few line numbers that could be hardcoded, e.g. using head or tail.

This answer came really close to what I'm looking for, but again the set of line numbers to split at is small and written directly into the command. I couldn't figure out how to use the file of line numbers in place of a literal string like "1 7 14".

I'm using bash on macOS. I'm quite new to this level of work at the command line and have no real experience with grep, sed, awk, etc., so it's hard for me to generalise other answers to this particular case.

P.S. I can include the code I used to get the line numbers if necessary, although I'm sure it's far from optimal. (It involves grepping for the line numbers of the timestamps with a regex, stripping away the matching text, and subtracting 2 from each line number to get the page titles.)
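
For reference, a minimal sketch of that extraction might look like this (the HH:MM regex is an assumption based on the sample above; adjust it to the real timestamp format):

grep -nE '^[0-9]{2}:[0-9]{2}' Log.txt |  # line numbers of the HH:MM time lines
    cut -d: -f1 |                        # keep only the leading line number
    awk '{ print $1 - 2 }' > Lines.txt   # step back two lines to the title line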


Solution

  • Bash and awk solution

    # Assumption: You have a bash array named arr with the indices you want,
    # like this
    arr=( 1 7 14 )
    
    counter=1
    
    for ((i=0; i<${#arr[@]}-1; i++)); do
        # Get current index
        index="${arr[$i]}"
        # Get next index
        next_index="${arr[$i+1]}"
    
        awk "NR>=$index && NR<$next_index" file_to_chop.txt > "log${counter}.txt"
    
        (( counter++ ))
    done
    
    # If the array is non-empty, we also need to write the last set of lines
    # to the last file (note: -gt 0, so a single-element array still produces
    # one output file)
    [ "${#arr[@]}" -gt 0 ] && {
        # Get last element in the array
        index="${arr[${#arr[@]}-1]}"
    
        awk "NR>=$index" file_to_chop.txt > "log${counter}.txt"
    }
    

    This script won't work in a strictly POSIX-compliant shell, since it relies on several "bashisms", including arrays and arithmetic within (( )).
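
    As a side note, the boundaries could be passed to awk via -v variables instead of being interpolated into the program with double quotes, which avoids quoting pitfalls if the values ever come from untrusted input. A sketch of the equivalent of the loop's awk line:

    awk -v start="$index" -v end="$next_index" 'NR>=start && NR<end' file_to_chop.txt > "log${counter}.txt"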

    This works primarily by using awk's built-in NR variable, which holds the current record number. The expression

    NR>=3
    

    for example, tells awk to act only on records (in our case, lines) whose record number is greater than or equal to 3; since no action is given, awk falls back to its default action, which is to print the record. More complex boolean expressions involving NR can be built with &&, for example:

    NR>=3 && NR<=7
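
    For a quick demonstration of such a range expression:

    $ seq 10 | awk 'NR>=3 && NR<=7'
    3
    4
    5
    6
    7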
    

    If you do not already have the indices in a bash array, you can generate the array from a file like this:

    arr=()
    while read -r line; do arr+=( "$line" ); done < /path/to/your/file/here
    

    Or if you want to generate the array from the output of a command:

    arr=()
    while read -r line; do arr+=( "$line" ); done < <(your_command_here)
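
    On bash 4 and later, mapfile can load the array in one line, but note that macOS ships bash 3.2 by default, which predates mapfile (hence the read loops above):

    mapfile -t arr < /path/to/your/file/here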
    

  • Python solution

    import sys


    def write_lines(filename, lines):
        # Join the lines with newlines and write them out as one file
        try:
            with open(filename, 'w') as f:
                f.write('\n'.join(lines))
        except OSError:
            print(f'Error: failed to write to "{filename}".', file=sys.stderr)
            sys.exit(1)


    if len(sys.argv) != 2:
        print('Must pass path to input file.', file=sys.stderr)
        sys.exit(1)

    input_file = sys.argv[1]
    # The line numbers to split at arrive on stdin, one per line
    line_indices = [line.rstrip() for line in sys.stdin]

    try:
        with open(input_file, 'r') as f:
            input_lines = [line.rstrip() for line in f]
    except OSError:
        print(f'Error: failed to read from "{input_file}".', file=sys.stderr)
        sys.exit(1)

    counter = 1

    # Each consecutive pair of indices delimits one output file
    # (1-based line numbers, half-open range)
    while len(line_indices) > 1:
        index = int(line_indices.pop(0))
        next_index = int(line_indices[0])

        write_lines(f'log{counter}.txt', input_lines[index-1:next_index-1])

        counter += 1

    # The final index marks a section that runs to the end of the input
    if line_indices:
        index = int(line_indices[0])

        write_lines(f'log{counter}.txt', input_lines[index-1:])
    

    This is the usage, assuming you want lines 1-6 written to log1.txt, lines 7-13 to log2.txt, and lines 14 onward to log3.txt:

    printf '1\n7\n14\n' | python chop_file_script.py /path/to/file/to/chop
    

    The script reads from stdin to learn how to chop the input file into separate files. This is by design, so the required line numbers can be fed to it from a parent shell script using a pipe (as in the usage example above).
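
    For the files in the question, the line numbers can equally be fed in by redirecting Lines.txt:

    python chop_file_script.py Log.txt < Lines.txt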

    This is not a fully robust script. For example, it does not handle:

    • Line numbers in stdin not being in ascending order
    • stdin containing non-numeric values
    • Numbers in stdin exceeding the length of the input file

    I think that's fine here: the script works correctly as long as it is used in the intended way.
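
    If you did want to guard against the first two of those cases, a couple of pre-flight checks in the calling shell would cover them; a minimal sketch:

    # sort -nc exits non-zero if Lines.txt is not in ascending numeric order
    sort -nc Lines.txt || { echo 'Lines.txt is not in ascending order' >&2; exit 1; }
    # grep -qv looks for any line that is not purely numeric
    grep -qvE '^[0-9]+$' Lines.txt && { echo 'Lines.txt contains a non-numeric line' >&2; exit 1; }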