
How to make codespell not report false positives in base64 strings?


I'm trying to configure codespell for a code base that uses Jupyter notebooks a lot.

Codespell reports a lot of false positives on a notebook that contains images embedded as base64. It appears that / and + are treated as word boundaries, so a long enough base64 string ends up containing short fragments that trigger rules such as ue ==> use, due.

How can I make codespell ignore those base64 encoded strings altogether?

I've looked into ignore-regex, but as far as I can tell that option only operates on words that have already been split. I'd need it to skip entire sections of text.

Minimal reproducible example:

CkNvZGV+ue/+ue+zcGVsbCB0aHJvd3MgYSBsb3Qgb2YgZmFsc2UgcG9zaXRpdmVzIG9uIGEgbm90ZWJvb2sgdGhhdCBjb250YWlucyBpbWF

Running codespell on this gets me:

$ codespell
./test.py:1: ue ==> use, due
./test.py:1: ue ==> use, due

Solution

  • Use --ignore-regex to skip any sufficiently long run of characters that could be base64. This works for me:

    codespell --ignore-regex='[A-Za-z0-9+/]{100,}'
    

    To put it in .codespellrc or an equivalent config file, do not enclose the regex in single or double quotes (I made this mistake at first):

    ignore-regex = [A-Za-z0-9+/]{100,}
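
To sanity-check the pattern before adding it to a config, it can be tried against the sample line from the question using Python's re module. This is only a sketch of what the regex matches, not of how codespell applies it internally; the names BASE64_RUN, SAMPLE, and strip_ignored are hypothetical and exist only for this illustration.

```python
import re

# Same pattern as passed to --ignore-regex above.
BASE64_RUN = re.compile(r"[A-Za-z0-9+/]{100,}")

# The sample line from the question.
SAMPLE = (
    "CkNvZGV+ue/+ue+zcGVsbCB0aHJvd3MgYSBsb3Qgb2YgZmFsc2UgcG9zaXRpdmVzIG9uIGEg"
    "bm90ZWJvb2sgdGhhdCBjb250YWlucyBpbWF"
)


def strip_ignored(line: str) -> str:
    """Hypothetical helper: drop every span the pattern would cover."""
    return BASE64_RUN.sub("", line)


if __name__ == "__main__":
    # The whole sample is one run of 100+ base64 characters, so nothing is left.
    print(repr(strip_ignored(SAMPLE)))
    # A short fragment is below the 100-character threshold and is left alone.
    print(repr(strip_ignored("x = 'ue/ue'")))
```

The threshold of 100 is a trade-off: runs shorter than that are still spell-checked, while lowering it increases the chance of skipping ordinary long identifiers or paths. The padding character = is not in the character class, which should not matter here because it can only appear at the very end of encoded data.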