Tags: bash, shell, command-line, gnu, gnu-coreutils

Wrong numeric suffix using GNU split


I have a large (1.8GB) file that I want to split into files of at most 100MB each. For this I am using GNU split with the -d option. The resulting numeric suffixes are weird: everything is fine up to 89, but then the numbering continues with 9000, 9001, and so on. Does anybody know why I am getting this behavior?


Solution

  • This behavior of split might be unexpected, but it is intentional.

    To create an arbitrary number of files while keeping them in correct lexical order, the suffix generator widens the suffix once the first digit reaches its highest value. With the default two-digit suffix, 89 is followed by 9000 rather than 90, so every later name still sorts after every earlier one.

    The lexical order is necessary to easily reverse the split using cat:

    split foo bar_
    cat bar_* > foo
    

    If lexical order were not maintained, the new foo would be jumbled up.
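
The widening and the round trip can be reproduced on a small scale. This is a minimal sketch, assuming GNU split and a POSIX shell; the scratch directory and the file names foo, bar_ and foo.rejoined are chosen here for illustration only.

```shell
# Work in a scratch directory so the generated pieces are easy to discard.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# 95 one-line pieces: enough to run past suffix 89.
seq 95 > foo
split -d -l 1 foo bar_

# Show the names around the point where the suffix widens.
printf '%s\n' bar_* | sed -n '89,92p'

# bar_* expands in sorted order, which matches the order of creation.
cat bar_* > foo.rejoined
cmp foo foo.rejoined && echo "round trip OK"
```

With GNU split the listed names should jump from bar_89 straight to bar_9000, and the comparison should still succeed because the glob sorts the wider names after the narrower ones.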

    To get consecutively numbered suffixes, add the -a <n> option, where <n> is the fixed number of digits in the suffix.

    The following command will produce files foo_000 through foo_199:

    seq 20000 | split -d -a 3 -l 100 - foo_
    

    However, it is then up to you to choose a number of digits large enough for all the suffixes you need. Otherwise split terminates prematurely with the error message:

    split: output file suffixes exhausted
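
One way to avoid running out of suffixes is to derive the width from the expected number of pieces before calling split. The following is only a sketch: suffix_width is a hypothetical helper written for this answer, not part of coreutils, and the file names are illustrative.

```shell
# Hypothetical helper: smallest -a width that can number every piece.
# $1 = total units of input (lines or bytes), $2 = units per piece.
suffix_width() {
    pieces=$(( ($1 + $2 - 1) / $2 ))   # ceiling division
    if [ "$pieces" -lt 1 ]; then pieces=1; fi
    last=$(( pieces - 1 ))             # suffixes count from 0
    echo "${#last}"                    # number of decimal digits in $last
}

# Scratch directory so the pieces are easy to discard.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Same input as the seq example: 20000 lines, 100 lines per piece.
seq 20000 > input
width=$(suffix_width "$(wc -l < input)" 100)
split -d -a "$width" -l 100 input foo_
```

For a split by size, the same helper can be fed a byte count from wc -c together with the piece size in bytes, for example 100 * 1024 * 1024 to match -b 100M.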
    

    This issue has since been added to the GNU coreutils gotchas page.