difference between detect-secrets and detect-secrets-hook results

I'm evaluating detect-secrets and I'm not sure why I get different results from detect-secrets and the hook.

Here is a log of a simplification:

$ cat docs/how-to-2.md
AZ_STORAGE_CS="DefaultEndpointsProtocol=https;AccountName=storageaccount1234;AccountKey=1OM7c6u5Ocp/zyUMWcRChowzd8czZmxPhzHZ8o45X7tAryr6JFF79+zerFFQS34KzVTK0yadoZGkvZh42A==;EndpointSuffix=core.windows.net"
$ detect-secrets scan --string $(cat docs/how-to-2.md)
AWSKeyDetector         : False
ArtifactoryDetector    : False
Base64HighEntropyString: True  (5.367)
BasicAuthDetector      : False
CloudantDetector       : False
HexHighEntropyString   : False
IbmCloudIamDetector    : False
IbmCosHmacDetector     : False
JwtTokenDetector       : False
KeywordDetector        : False
MailchimpDetector      : False
PrivateKeyDetector     : False
SlackDetector          : False
SoftlayerDetector      : False
StripeDetector         : False
TwilioKeyDetector      : False
$ detect-secrets-hook docs/how-to-2.md
$ detect-secrets-hook --baseline .secrets.baseline docs/how-to-2.md

I would have expected that detect-secrets-hook would tell me about that Azure storage account key that has high entropy.

more details about the baseline:

$ cat .secrets.baseline
{
  "custom_plugin_paths": [],
  "exclude": {
    "files": null,
    "lines": null
  },
  "generated_at": "2020-10-09T10:06:54Z",
  "plugins_used": [
    {
      "name": "AWSKeyDetector"
    },
    {
      "name": "ArtifactoryDetector"
    },
    {
      "base64_limit": 4.5,
      "name": "Base64HighEntropyString"
    },
    {
      "name": "BasicAuthDetector"
    },
    {
      "name": "CloudantDetector"
    },
    {
      "hex_limit": 3,
      "name": "HexHighEntropyString"
    },
    {
      "name": "IbmCloudIamDetector"
    },
    {
      "name": "IbmCosHmacDetector"
    },
    {
      "name": "JwtTokenDetector"
    },
    {
      "keyword_exclude": null,
      "name": "KeywordDetector"
    },
    {
      "name": "MailchimpDetector"
    },
    {
      "name": "PrivateKeyDetector"
    },
    {
      "name": "SlackDetector"
    },
    {
      "name": "SoftlayerDetector"
    },
    {
      "name": "StripeDetector"
    },
    {
      "name": "TwilioKeyDetector"
    }
  ],
  "results": {
    ".devcontainer/Dockerfile": [
      {
        ###obfuscated###
      }
    ],
    "deployment/export-sp.sh": [
      {
        ###obfuscated###
      }
    ],
    "docs/pip-install-from-artifacts-feeds.md": [
      {
        ###obfuscated###
      }
    ]
  },
  "version": "0.14.3",
  "word_list": {
    "file": null,
    "hash": null
  }
}

Solution

This is definitely peculiar behavior, but after some investigation, I realize that you've stumbled upon an edge case of the tool.

tl;dr

HighEntropyStringPlugin supports a limited set of characters (not including ;)
To reduce false positives, HighEntropyStringPlugin leverages the heuristic that strings are quoted in certain contexts.
To improve UI, inline string scanning does not require quoted strings.

Therefore, the functionality differs: when run through detect-secrets-hook, it does not parse the string accordingly due to the existence of ;. However, when run through detect-secrets scan --string, it does not require quotes, and breaks the string up.

Detailed Explanation

HighEntropyString tests are pretty noisy, if not aggressively pruned for false positives. One way it attempts to do this is via applying a rather strict regex (source), which requires it to be inside quotes. However, for certain contexts, this quoted requirement is removed (e.g. YAML files, and inline string scanning).

When this quoted requirement is removed, we get the following breakdown:

>>> line = 'AZ_STORAGE_CS="DefaultEndpointsProtocol=https;AccountName=storageaccount1234;AccountKey=1OM7c6u5Ocp/zyUMWcRChowzd8czZmxPhzHZ8o45X7tAryr6JFF79+zerFFQS34KzVTK0yadoZGkvZh42A==;EndpointSuffix=core.windows.net"'
>>> with self.non_quoted_string_regex(is_exact_match=False):
...    self.regex.findall(line)
['AZ_STORAGE_CS=', 'DefaultEndpointsProtocol=https', 'AccountName=storageaccount1234', 'AccountKey=1OM7c6u5Ocp/zyUMWcRChowzd8czZmxPhzHZ8o45X7tAryr6JFF79+zerFFQS34KzVTK0yadoZGkvZh42A==', 'EndpointSuffix=core', 'windows', 'net']

When we do so, we can see that the AccountKey=1OM7c6u5Ocp/zyUMWcRChowzd8czZmxPhzHZ8o45X7tAryr6JFF79+zerFFQS34KzVTK0yadoZGkvZh42A== would trigger the base64 plugin, as demonstrated below:

$ detect-secrets scan --string 'AccountKey=1OM7c6u5Ocp/zyUMWcRChowzd8czZmxPhzHZ8o45X7tAryr6JFF79+zerFFQS34KzVTK0yadoZGkvZh42A=='
AWSKeyDetector         : False
ArtifactoryDetector    : False
Base64HighEntropyString: True  (5.367)
BasicAuthDetector      : False
CloudantDetector       : False
HexHighEntropyString   : False
IbmCloudIamDetector    : False
IbmCosHmacDetector     : False
JwtTokenDetector       : False
KeywordDetector        : False
MailchimpDetector      : False
PrivateKeyDetector     : False
SlackDetector          : False
SoftlayerDetector      : False
StripeDetector         : False
TwilioKeyDetector      : False

However, when this quoted requirement is applied, then the entire string payload is scanned as one potential secret: DefaultEndpointsProtocol=https;AccountName=storageaccount1234;AccountKey=1OM7c6u5Ocp/zyUMWcRChowzd8czZmxPhzHZ8o45X7tAryr6JFF79+zerFFQS34KzVTK0yadoZGkvZh42A==;EndpointSuffix=core.windows.net

This doesn't get flagged because it fails the original base64 regex rules, which doesn't know how to handle ;.

>>> self.regex.findall(line)
[]

Therefore, this functionality differs, though, not immediately apparent through the invocation pattern described.

How Do I Fix This?

This is a much more challenging question, since allowing other characters would change the entropy calculation, and the probability of flagging strings. There has been some discussion around creating a plugin for all characters, but the team has not yet been able to decide on a default entropy limit for this.