Tags: static-analysis, cppcheck, clang-static-analyzer, flawfinder

How much can we trust the warnings generated by static analysis tools for vulnerability detection?


I am running flawfinder on a set of libraries written in C/C++, and it produces a large number of warnings. My question is: how much can I rely on these warnings? For example, consider the following function from the NumPy library (https://github.com/numpy/numpy/blob/4ada0641ed1a50a2473f8061f4808b4b0d68eff5/numpy/f2py/src/fortranobject.c):

static PyObject *
fortran_doc(FortranDataDef def)
{
    char *buf, *p;
    PyObject *s = NULL;
    Py_ssize_t n, origsize, size = 100;

    if (def.doc != NULL) {
        size += strlen(def.doc);
    }
    origsize = size;
    buf = p = (char *)PyMem_Malloc(size);
    if (buf == NULL) {
        return PyErr_NoMemory();
    }

    if (def.rank == -1) {
        if (def.doc) {
            n = strlen(def.doc);
            if (n > size) {
                goto fail;
            }
            memcpy(p, def.doc, n);
            p += n;
            size -= n;
        }
        else {
            n = PyOS_snprintf(p, size, "%s - no docs available", def.name);
            if (n < 0 || n >= size) {
                goto fail;
            }
            p += n;
            size -= n;
        }
    }
    else {
        PyArray_Descr *d = PyArray_DescrFromType(def.type);
        n = PyOS_snprintf(p, size, "'%c'-", d->type);
        Py_DECREF(d);
        if (n < 0 || n >= size) {
            goto fail;
        }
        p += n;
        size -= n;

        if (def.data == NULL) {
            n = format_def(p, size, def) == -1;
            if (n < 0) {
                goto fail;
            }
            p += n;
            size -= n;
        }
        else if (def.rank > 0) {
            n = format_def(p, size, def);
            if (n < 0) {
                goto fail;
            }
            p += n;
            size -= n;
        }
        else {
            n = strlen("scalar");
            if (size < n) {
                goto fail;
            }
            memcpy(p, "scalar", n);
            p += n;
            size -= n;
        }
    }
    if (size <= 1) {
        goto fail;
    }
    *p++ = '\n';
    size--;

    /* p now points one beyond the last character of the string in buf */
#if PY_VERSION_HEX >= 0x03000000
    s = PyUnicode_FromStringAndSize(buf, p - buf);
#else
    s = PyString_FromStringAndSize(buf, p - buf);
#endif

    PyMem_Free(buf);
    return s;

 fail:
    fprintf(stderr, "fortranobject.c: fortran_doc: len(p)=%zd>%zd=size:"
                    " too long docstring required, increase size\n",
            p - buf, origsize);
    PyMem_Free(buf);
    return NULL;
}

There are two memcpy() calls in this function, and flawfinder reports:

vul_fortranobject.c:216: [2] (buffer) memcpy:
  Does not check for buffer overflows when copying to destination (CWE-120).
  Make sure destination can always hold the source data.
  memcpy(p, "scalar", n);

I am not sure whether this report is a true positive. The flagged call is the one in the "scalar" branch; it is shown again below with the surrounding check.
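
For reference, this is the branch containing the flagged call, excerpted from the function above with comments added:

        else {
            n = strlen("scalar");
            if (size < n) {              /* capacity of the destination is checked first... */
                goto fail;
            }
            memcpy(p, "scalar", n);      /* ...before the copy that flawfinder flags */
            p += n;
            size -= n;
        }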


Solution

  • To answer your question: static analysis tools (like FlawFinder) can generate a LOT of "false positives".

    I Googled to find some quantifiable information for you, and found an interesting article about "DeFP":

    https://arxiv.org/pdf/2110.03296.pdf

    Static analysis tools are frequently used to detect potential vulnerabilities in software systems. However, an inevitable problem of these tools is their large number of warnings with a high false positive rate, which consumes time and effort for investigating. In this paper, we present DeFP, a novel method for ranking static analysis warnings.

    Based on the intuition that warnings which have similar contexts tend to have similar labels (true positive or false positive), DeFP is built with two BiLSTM models to capture the patterns associated with the contexts of labeled warnings. After that, for a set of new warnings, DeFP can calculate and rank them according to their likelihoods to be true positives (i.e., actual vulnerabilities).

    Our experimental results on a dataset of 10 real-world projects show that using DeFP, by investigating only 60% of the warnings, developers can find +90% of actual vulnerabilities. Moreover, DeFP improves the state-of-the-art approach 30% in both Precision and Recall.

    Apparently, the authors built a neural network that analyzes FlawFinder results and ranks them by how likely each warning is to be a true positive.

    I doubt DeFP is a practical "solution" for you. But yes: if you think that specific memcpy() warning is a false positive, I'm inclined to agree. The copy of "scalar" is preceded by an explicit "if (size < n) goto fail;" check, so the destination is known to be large enough. Flawfinder flags the call anyway because it matches call sites by lexical pattern; it does not perform data-flow or control-flow analysis, so it cannot see the guard. A minimal sketch of that situation follows.
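
    To illustrate, here is a small self-contained C sketch (the function and variable names are mine, not from NumPy) showing the same guarded-copy pattern. Running flawfinder over a file containing this function still produces the same CWE-120 warning, because the tool reports the memcpy call itself regardless of the check in front of it:

        #include <string.h>

        /* Hypothetical helper: copy src into dst only if it fits.
         * The bounds check makes the memcpy safe, yet flawfinder
         * reports it as CWE-120 anyway, since it matches the call
         * lexically and ignores the surrounding control flow. */
        static int
        copy_if_fits(char *dst, size_t dstsize, const char *src)
        {
            size_t n = strlen(src);
            if (dstsize < n) {
                return -1;           /* would overflow: refuse to copy */
            }
            memcpy(dst, src, n);     /* safe: n <= dstsize is guaranteed */
            return 0;
        }

    In practice this means flawfinder's output is best treated as a list of places to review by hand, not as a list of confirmed defects: each hit has to be judged against the checks that surround it, exactly as you did with fortran_doc().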