Search code examples
c++filec++11data-management

How to read a file backwards to find substring efficiently


I have a huge log file in this kind of structure:

"timestamp": {"identifier":value}

"1463403600":{"AA":74.42},
"1463403601":{"AA":29.55},
"1463403603":{"AA":24.78},
"1463403604":{"AA":8.46},
"1463403605":{"AA":44.84},
"1463403607":{"AA":87.05},
"1463403608":{"AA":54.81},
"1463403609":{"AA":93.1},
"1463403611":{"AA":77.64},
"1463403612":{"AA":33.39},
"1463403613":{"AA":69.2},

I want to extract the content after(!) a given timestamp like:

std::ifstream * myfunc( uint32_t timestamp) 

example:

myfunc(1463403611);
/* returns
"1463403611":{"AA":77.64},
"1463403612":{"AA":33.39},
"1463403613":{"AA":69.2},
*/

The logfile is long - too long to keep it in memory. The code will run on a resource limited embedded devices (80Mhz, ~10kB free memory), so Im looking for some ideas for a effective solution.

The logfile might have 500k+ entries and in 99% of the time the timestamp will be in the last 100 lines, so starting at the beginnig of the file and check every line for the right timestamp would be very inefficient.

So I guess Im looking for a solution to read the file backwards, line by line. I dont realy have a solution how to do that efficient without loading big chunks into memory.

I tried with reading in chunks of 200bytes starting from the EOF, but was faced with the issue, that the chunk cut the timestamp into half in many cases. I tried to detect that and reselect some bytes if needed, but got the feeling, that there must be a smarte solution.


Solution

  • Well I found this kind of interesting so I knocked up a proof-of-concept for the binary-search idea.

    This is poorly tested and probably a little buggy but seems to work so far and demonstrates the idea of divide-and-conquer. You check in the middle of the file and, depending if you are to high or too low, you divide the data into two and search the relevant half. You do that recursively until you get close enough.

    #include <ctime>
    #include <cmath>
    #include <cstdlib>
    #include <string>
    #include <fstream>
    #include <iostream>
    
    // Don't use this, its just to show how many reads
    // are being done to find the record.
    int global_counter;
    
    std::streampos find_stamp(std::istream& is, long stamp, std::streampos pos, std::streampos end)
    {
        ++global_counter;
    
        if(pos == 0) // can't divide zero
            return 0;
    
        std::string s;
        long found_stamp;
    
        // extract nearest timestamp after pos
        is.seekg(pos);
        if(!(std::getline(std::getline(is, s, ','), s, '"') >> found_stamp))
            return end;
    
        // if its too big check first half of this region
        if(found_stamp > stamp)
            return find_stamp(is, stamp, pos / 2, pos);
    
        // if its not within 10 timestamp seconds check end half of this region
        if(stamp - found_stamp > 10)
            return find_stamp(is, stamp, (pos + end) / 2, end);
    
        // read record by record (prolly more efficient than skipping)
        pos = is.tellg();
        while(std::getline(std::getline(is, s, ','), s, '"') >> found_stamp)
        {
            if(found_stamp > stamp)
                return pos;
            pos = is.tellg();
        }
        return end;
    }
    
    void print_after(const std::string& filename, long stamp)
    {
        // open at end of file (to get length)
        std::ifstream ifs(filename, std::ios::ate);
    
        std::streampos end = ifs.tellg();
        auto pos = end / 2; // start checking in middle
    
        // find position before required record
        // (may be in the middle of a record)
        if((pos = find_stamp(ifs, stamp, pos, end)) != end)
        {
            ifs.seekg(pos);
    
            std::string line;
            std::getline(ifs, line, ','); // skip to next whole record
    
            // print out all following recors
            while(std::getline(ifs, line, ','))
                std::cout << line;
        }
    }
    
    inline
    std::string leading_zeros(int n, int zeros = 2)
    {
        std::string s;
        for(int z = std::pow(10, zeros - 1); z; z /= 10)
            s += (n < z ? "0":"");
        return s + std::to_string(n);
    }
    
    int main()
    {
        std::srand(std::time(0));
    
        // generate some test data
        std::ofstream ofs("test.txt");
    
        for(int i = 0; i < 1000; ++i)
        {
            ofs << '"' << leading_zeros(i, 10) << '"';
            ofs << ":{\"AA\":" << (std::rand() % 100);
            ofs << '.' << (std::rand() % 100) << "},\n";
        }
    
        ofs.close();
    
        global_counter = 0;
        print_after("test.txt", 993);
    
        std::cout << "find checked " << global_counter << " places in the file\n";
    }
    

    Output:

    "0000000994":{"AA":80.6}
    "0000000995":{"AA":11.90}
    "0000000996":{"AA":16.43}
    "0000000997":{"AA":53.11}
    "0000000998":{"AA":68.43}
    "0000000999":{"AA":79.77}
    find checked 6 places in the file