Search code examples
c++stdcoredumpbitsetbloom-filter

Why is bitset throwing out_of_range error?


I am implementing a bloom filter with help of bitsets in c++ for finding out malicious URLs. I have a bitset of 100 bits and a simple hash function. But still I get this error.

#include<bits/stdc++.h>
using namespace std;
typedef long long int ll;
#define m 100
#define k 1

ll hash1(string str)
{   
    ll value=0;
    for(ll i=0;i<str.size();i++)
    {
        value=(value+str[i])%m;
    }
    return value;
}

int main(int argc, char *argv[])
{   
    vector<bitset<m>>v(1);
    ifstream fin;
    fin.open(argv[1]);

    string line;
    string temp;
    while(getline(fin,line))
    {   
        vector<string>row;
        stringstream s(line);
        string word;
        while(getline(s ,word, ','))
        {
            row.push_back(word);
        }
        if(row.size()!=2) continue;
        for(ll i=0;i<k;i++)
        {
            if(row[1]=="bad")
            {   
                v[0].set(hash1(row[0]));
                cout<<row[0]<<" inserted into bloom filter\n";
            }
        }
        row.clear();
    }
    //Now bitset contains all the malicious urls
    //Generating validation dataset
    fin.clear();
    fin.seekg(0);

    vector<vector<string>>validation;
        
    while(getline(fin,line))
    {
        vector<string>row;
        
        ll x=rand()%10;
        if(x>0) continue;
        
        string word;
        stringstream s(line);
        while(getline(s,word,','))
        {
            row.push_back(word);
        }
        validation.push_back(row);
        row.clear();
    }

    for(ll i=0;i<validation.size();i++)
    {   
        if(v[0].test(hash1(validation[i][0])))
        { 
            if(validation[i][1]=="bad")
            {
                cout<<i+1<<" : True Positive\n";
            }
            else
            {
                cout<<i+1<<" : False Positive\n";
            }
        }
        else 
        {
            cout<<i+1<<" : True Negative\n";
        }
    }
    return 0;
}

The error is:- terminate called after throwing an instance of 'std::out_of_range' what(): bitset::set: __position (which is 18446744073709551587) >= _Nb (which is 100) Aborted (core dumped)

The dataset contains 2 columns viz. URLs and good/bad.

This is a link to the dataset. https://www.kaggle.com/antonyj453/urldataset


Solution

  • After reading carefully I thought the reason could be signed characters. If signed characters are involved,

    for (char i : str) {
        value = (value + i) % m;
    }
    

    could result in negative hashes. However it's unlikely that this would actually happen, since urls don't often contain high ascii (indeed you would expect their IDNA encoded version to be in the list).

    A quick check reveals 70 such domains in the list

    xn---21-6cdxogafsegjgafgrkj4opcf2a.xn--p1ai/images/aaa/,bad
    xn--cafsolo-dya.de/media/header/A.php,bad
    xn--b1afpdqbb8d.xn--p1ai/web/data/mail/-/aaaaa/Made.htm,bad
    xn-----6kcbb3dhfdcijbv8e2b.xn--p1ai/libraries/openid/Auth/OpenID/sec/RBI/index/index.htm,bad
    

    etc.

    If that's not the case, then something else is throwing out_of_range. Nothing /should/ because operator[] doesn't bounds-check according to the standard.

    However maybe some implementations do the bounds-checking in Debug builds (look at MSVC here, they have all manner of iterator debugging enabled by default in Debug builds, so maybe this as well?).

    Ironically you could use some bounds-checking yourself e.g. here:

    int main(int argc, char* argv[]) {
        std::vector<std::string_view> const args(argv, argv + argc);
        std::string const filename(args.at(1));
    

    This way you don't just invoke Undefined Behaviour when the command line argument is not given.

    Line Ends

    There's a gotcha with line-ends. The files are CRLF, sothe way you parse the columns results in the last column containing "bad\r", instead of "bad" on Linux .

    Review/Simplify

    In the search for other bugs I've simplified the code. It will perform a lot better now. Here's the run-down of improvement suggestions.

    1. The includes. Just include what you need, really

      #include <bitset>
      #include <fstream>
      #include <iostream>
      #include <sstream>
      #include <string>
      #include <vector>
      
    2. No need for wonky types or macros.

      static constexpr size_t m = 100;
      static constexpr size_t k = 1; // unused for now
      
    3. As mentioned before, protect against signed character results:

      static size_t hash1(std::string_view str) {
         size_t value = 0;
         for (unsigned char i : str) {
             value = (value + i) % m;
         }
         return value;
      }
      
    4. Also as mentioned before protect against trailing special characters in the classification text:

      enum goodbad { good, bad, invalid }; // The Good, The Bad and ...
      
      static goodbad to_classification(std::string_view text) {
          if (text == "bad") return bad;
          if (text == "good") return good;
          return invalid;
      }
      
    5. Next up, a big one. You parse through the same file twice. And repeat the code. Let's not. Instead have a function that parses it, and pass it a callback to decide what to do with the parsed data.

      While we're at it, let's stop ubiquitous vector<vector<vector<string> > > disease. Really, you know how many columns there are, and what they mean. This also reduces the allocations by a lot.

      void process_csv(std::string filename, auto callback) {
          std::ifstream fin(filename);
      
          std::stringstream ss;
          for (std::string line; getline(fin, line);) {
              std::string_view row(line);
      
              // eat line end remnants
              row = row.substr(0, row.find_last_not_of("\r\n") + 1);
      
              if (auto comma = row.find_last_of(','); comma + 1) {
                  auto url     = row.substr(0, comma);
                  auto goodbad = row.substr(comma + 1);
                  auto classif = to_classification(goodbad);
      
                  if (classif == invalid)
                      std::cerr << "Ignored unclassified line '" << row << "'\n";
                  else
                      callback(url, to_classification(goodbad));
              }
          }
      }
      

      That's all. Notice a key insight where we split using the last comma only. Because otherwise you got wrong results in case the url contained commas.

    6. Now, you can piece together the main program:

      int main(int argc, char* argv[]) {
          std::vector const args(argv, argv + argc);
          std::string const filename(args.at(1));
      

      Starting with the aforementioned safe way to use the command line arguments,

      std::bitset<m> bloom;
      

      reducing the verbosity vector<vector<> > syndrome (and improving the name - v?!)

    7. Here comes the first file reading pass:

      size_t bloom_size = 0;
      process_csv(filename,
          [&](std::string_view url, goodbad classification) {
              if (classification == bad) {
                  bloom_size += 1;
                  bloom.set(hash1(url));
                  //std::cout << url << " inserted into bloom filter\n";
              }
          });
      

      I decided it would not be necessary (and slow) to print all bad urls, so let's print the number of them instead:

      // Now bitset contains all the malicious urls
      std::cerr << "Bloom filter primed with " << bloom_size << " bad urls\n";
      
    8. Now the validation pass:

      // do a 1 in 10 validation check
      process_csv(filename,
          [&bloom, line = 0](std::string_view url,
                             goodbad classification) mutable {
              line += 1;
              if (rand() % 10) return; // TODO #include <random>
      
              auto hit      = bloom.test(hash1(url));
              bool expected = (classification == bad);
      
              std::cout << line << ": " << std::boolalpha
                        << (hit == expected)
                        << (hit ? " positive" : " negative") << "\n";
          });
      }
      

    Live Demo

    Live On Coliru

    #include <bitset>
    #include <fstream>
    #include <iostream>
    #include <sstream>
    #include <string>
    #include <vector>
    
    static constexpr size_t m = 100;
    //static constexpr size_t k = 1;
    
    static size_t hash1(std::string_view str) {
        size_t value = 0;
        for (unsigned char i : str) {
            value = (value + i) % m;
        }
        return value;
    }
    
    enum goodbad { good, bad, invalid }; // The Good, The Bad and ...
    
    static goodbad to_classification(std::string_view text) {
        if (text == "bad") return bad;
        if (text == "good") return good;
        return invalid;
    }
    
    void process_csv(std::string filename, auto callback) {
        std::ifstream fin(filename);
    
        std::stringstream ss;
        for (std::string line; getline(fin, line);) {
            std::string_view row(line);
    
            // eat line end remnants
            row = row.substr(0, row.find_last_not_of("\r\n") + 1);
    
            if (auto comma = row.find_last_of(','); comma + 1) {
                auto url     = row.substr(0, comma);
                auto goodbad = row.substr(comma + 1);
                auto classif = to_classification(goodbad);
    
                if (classif == invalid)
                    std::cerr << "Ignored unclassified line '" << row << "'\n";
                else
                    callback(url, to_classification(goodbad));
            }
        }
    }
    
    int main(int argc, char* argv[]) {
        std::vector const args(argv, argv + argc);
        std::string const filename(args.at(1));
    
        std::bitset<m> bloom;
    
        size_t bloom_size = 0;
        process_csv(filename, [&](std::string_view url, goodbad classification) {
            if (classification == bad) {
                bloom_size += 1;
                bloom.set(hash1(url));
            }
        });
    
        // Now bitset contains all the malicious urls
        std::cerr << "Bloom filter primed with " << bloom_size << " bad urls\n";
    
        // do a 1 in 10 validation check
        process_csv(filename,
            [&bloom, line = 0](std::string_view url,
                               goodbad classification) mutable {
                line += 1;
                if (rand() % 10) return;
    
                auto hit      = bloom.test(hash1(url));
                bool expected = (classification == bad);
    
                std::cout << line << ":" << std::boolalpha
                          << (hit == expected)
                          << (hit ? " positive" : " negative") << "\n";
            });
    }
    

    On Coliru only the bad dataset fits, so we never get any positives

    g++ -std=c++2a -O2 -Wall -pedantic -pthread main.cpp
    ./a.out bad.csv  | cut -d: -f2 | sort | uniq -c | sort -n
    

    Prints

    Ignored unclassified line 'url,label'
    Bloom filter primed with 75643 bad urls
    Ignored unclassified line 'url,label'
       7602 true positive
    

    On my own system:

    enter image description here

    Oh, and it runs in 0.4s instead of 3.7s before.

    After parsing bug fix even faster: 0.093s on average for the full set.