
Bash crashes in Valgrind when LC_CTYPE is set to C.UTF-8


Bash 5.2 crashes with an assertion failure in malloc, but only when run under Valgrind and only when LC_CTYPE is set to C.UTF-8. Here is some example output:

$ path/to/env - foo=bar LC_CTYPE=C.UTF-8 path/to/valgrind path/to/bash -c 'echo ${foo#spam}'

...

malloc: subst.c:5331: assertion botched
free: called with unallocated block argument
Aborting...==2753214== 
==2753214== Process terminating with default action of signal 6 (SIGABRT): dumping core
==2753214==    at 0x48DFA8C: __pthread_kill_implementation (in /nix/store/aw2fw9ag10wr9pf0qk4nk5sxi0q0bn56-glibc-2.37-8/lib/libc.so.6)
==2753214==    by 0x4890C85: raise (in /nix/store/aw2fw9ag10wr9pf0qk4nk5sxi0q0bn56-glibc-2.37-8/lib/libc.so.6)
==2753214==    by 0x487A8B9: abort (in /nix/store/aw2fw9ag10wr9pf0qk4nk5sxi0q0bn56-glibc-2.37-8/lib/libc.so.6)
==2753214==    by 0x443AF9: programming_error (in /nix/store/vqvj60h076bhqj6977caz0pfxs6543nb-bash-5.2-p15/bin/bash)
==2753214==    by 0x4ACAC4: internal_free.constprop.0 (in /nix/store/vqvj60h076bhqj6977caz0pfxs6543nb-bash-5.2-p15/bin/bash)
==2753214==    by 0x450A5E: remove_pattern (in /nix/store/vqvj60h076bhqj6977caz0pfxs6543nb-bash-5.2-p15/bin/bash)
==2753214==    by 0x465D2B: parameter_brace_remove_pattern (in /nix/store/vqvj60h076bhqj6977caz0pfxs6543nb-bash-5.2-p15/bin/bash)
==2753214==    by 0x46023A: param_expand (in /nix/store/vqvj60h076bhqj6977caz0pfxs6543nb-bash-5.2-p15/bin/bash)
==2753214==    by 0x460CD9: expand_word_internal (in /nix/store/vqvj60h076bhqj6977caz0pfxs6543nb-bash-5.2-p15/bin/bash)
==2753214==    by 0x466C0D: shell_expand_word_list.constprop.0 (in /nix/store/vqvj60h076bhqj6977caz0pfxs6543nb-bash-5.2-p15/bin/bash)
==2753214==    by 0x467479: expand_words (in /nix/store/vqvj60h076bhqj6977caz0pfxs6543nb-bash-5.2-p15/bin/bash)
==2753214==    by 0x4361CE: execute_command_internal (in /nix/store/vqvj60h076bhqj6977caz0pfxs6543nb-bash-5.2-p15/bin/bash)

...

==2753214== ERROR SUMMARY: 2 errors from 2 contexts (suppressed: 0 from 0)
/nix/store/a683qmhmrrzrwn8fmqh53yyylm7yn2hq-test.sh: line 2: 2753214 Aborted                 (core dumped) /nix/store/v45j2p2izb3pa2fxdw978bahhkb2ghza-toybox-0.8.10/bin/env - LC_CTYPE=C.UTF-8 /nix/store/14fg82n6grqhrd2algx31sv1kmgvz0gl-valgrind-3.21.0/bin/valgrind /nix/store/vqvj60h076bhqj6977caz0pfxs6543nb-bash-5.2-p15/bin/bash -c 'echo ${PATH#":"}'

(full output here)

${parameter#word} is a kind of parameter expansion described here.
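
For reference, here is a minimal sketch of what this expansion does when Bash runs normally, outside Valgrind. The variable names unmatched and matched are only for illustration.

```shell
foo=bar
unmatched=${foo#spam}    # "spam" is not a prefix of "bar": the value is left as-is

foo=spambar
matched=${foo#spam}      # the shortest prefix matching "spam" is removed

printf '%s\n' "$unmatched" "$matched"
```

Both lines print bar, so the first case (pattern not matched) is the one exercised by the failing command above.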

The indicated line of source code points here, but is the problematic assertion in free or malloc?

Experimenting with some variations:

  • Outside of Valgrind, Bash succeeds (does not crash at all).
  • Leaving foo unset or set to the empty string causes Bash to succeed (no crash), but any non-empty value of foo seems to cause a crash.
  • If we replace the pattern with one that does match in foo, Bash crashes at subst.c:5336 instead of subst.c:5331. So Bash crashes whether or not the parameter expansion matches the pattern, just in slightly different places.
  • When LC_CTYPE is not set, or is set to any other locale (including nonexistent locales), Bash does not crash (although there is a non-fatal invalid free()).

How should I go about debugging this problem?

A note on reproducibility:

  • Because I am not using any environment variables, binaries, or libraries from the default system, I think this should be fairly reproducible on any x86_64 machine. In case it matters, I am running Ubuntu 22.04.3 on Linux 5.15.0-83-generic.
  • I created a Nix flake here. If you download flake.nix and flake.lock to an empty directory, you should be able to type nix run and (hopefully) get a crash too.

Solution

  • I would create a special build of Bash in which Bash's malloc wrapping is disabled and try to reproduce the problem under Valgrind as before.

    You're running into the issue that Bash itself is self-diagnosing a malloc issue. It will not do as good a job as Valgrind itself.

    Bash's diagnostic is saying that free was called on an unallocated block. A similar diagnostic from Valgrind is more informative. If an allocated object existed at that address previously, Valgrind will show that, along with a backtrace where it was freed.

    Bash's debugging malloc (see the internal_free function in lib/malloc/malloc.c) relies on checking some header information in the freed block to conclude that it's a double free. That is not accurate. The code looks like:

      if (p->mh_alloc != ISALLOC)
        {
          if (p->mh_alloc == ISFREE)
            xbotch (mem, ERR_DUPFREE,
                    _("free: called with already freed block argument"), file, line);
          else
            xbotch (mem, ERR_UNALLOC,
                    _("free: called with unallocated block argument"), file, line);
        }
    

    If a magic byte is found in the block header which is not ISALLOC (that being 0xF7), it checks specifically for ISFREE. If it's not ISFREE either (0x57) then it emits the diagnostic you are seeing.

    This is wide open to a false positive, because it fires whenever the magic code in the header has been clobbered to a value that is neither of the two recognized values out of 256.
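
To put a number on that, here is a toy model of the check in shell. It is not Bash's C code: classify_magic and count_unalloc are hypothetical names, and the two magic values are simply the ones quoted in this answer, not verified against the source.

```shell
# Toy model only: mirrors the if/else shown above, one header byte at a time.
ISALLOC=$((0xF7))   # value as quoted in this answer
ISFREE=$((0x57))    # value as quoted in this answer

classify_magic() {
  # $1: header byte as a decimal number in 0..255
  case $1 in
    "$ISALLOC") echo ok ;;        # block looks allocated, free proceeds
    "$ISFREE")  echo dupfree ;;   # "already freed block" diagnostic
    *)          echo unalloc ;;   # "unallocated block" diagnostic
  esac
}

count_unalloc() {
  # Count how many of the 256 possible byte values land in the last branch.
  n=0
  b=0
  while [ "$b" -lt 256 ]; do
    if [ "$(classify_magic "$b")" = unalloc ]; then
      n=$((n + 1))
    fi
    b=$((b + 1))
  done
  echo "$n"
}

count_unalloc
```

This prints 254: any stray write over the header byte, other than one that happens to leave one of the two magic values, produces the "unallocated block" message.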

    We cannot reasonably believe this to be a double free problem. It is quite likely corruption, and Valgrind's allocator will do a much better job of diagnosing it, if it reproduces.
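
The special build suggested at the start of this answer could look roughly like the sketch below. It assumes an unpacked Bash source tree and valgrind on PATH. The function names and the CFLAGS choice are my own. --without-bash-malloc is the configure switch that makes Bash use the system allocator instead of its bundled one.

```shell
# Hypothetical helpers: nothing runs until you call them.

build_bash_system_malloc() {
  # $1: path to an unpacked Bash source tree (assumed to exist)
  (
    cd "$1" || exit 1
    # Use the system malloc so Valgrind can intercept allocations itself;
    # -g -O0 is an assumption, chosen for readable backtraces.
    ./configure --without-bash-malloc CFLAGS='-g -O0' && make
  )
}

rerun_under_valgrind() {
  # $1: path to the freshly built bash binary
  # Resolve valgrind before the environment is cleared, then repeat the
  # command from the question with the same minimal environment.
  env -i foo=bar LC_CTYPE=C.UTF-8 \
    "$(command -v valgrind)" "$1" -c 'echo ${foo#spam}'
}
```

Usage would be something like build_bash_system_malloc ./bash-5.2 followed by rerun_under_valgrind ./bash-5.2/bash, where the directory name is just an example. If the problem reproduces, Valgrind's own report on the bad free or the earlier invalid write should replace the "assertion botched" message.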