Search code examples
c++boost

How can I improve the speed of my large txt processing script?


I have a program that scans a very large txt file (.pts file actually) that looks like this :

437288479
-6.9465 -20.49 -1.3345 70
-6.6835 -20.82 -1.3335 83
-7.3105 -20.179 -1.3325 77
-7.1005 -20.846 -1.3295 96
-7.3645 -20.759 -1.2585 79
...

The first line is the number of points contained in the file, and every other line corresponds to a {x,y,z,intensity} point in a 3D space. This file above is ~11 GB but I have more files to process that can be up to ~50 GB.

Here's the code I use to read this file :

#include <iostream>
#include <chrono>
#include <vector>
#include <algorithm>
#include <tuple>
#include <cmath>

// boost library
#include <boost/iostreams/device/mapped_file.hpp>
#include <boost/iostreams/stream.hpp>

struct point
{
    double x;
    double y;
    double z;
};


void readMappedFile()
{
    boost::iostreams::mapped_file_source mmap("my_big_file.pts");
    boost::iostreams::stream<boost::iostreams::mapped_file_source> is(mmap, std::ios::binary);
    std::string line;

    // get rid of the first line
    std::getline(is, line);
    
    while (std::getline(is, line))
    {
        point p;
        sscanf(line.c_str(),"%lf %lf %lf %*d", &(p.x), &(p.y), &(p.z));
        if (p.z > minThreshold && p.z < maxThreshold)
        {
            // do something with p and store it in the vector of tuples
            // O(n) complexity
        }
    }
}

int main ()
{
    readMappedFile();
    return 0;
}

For my 11 GB file, scanning all the lines and storing data in point p takes ~13 minutes to execute. Is there a way to make it way faster ? Because each time I scan a point, I also have to do some stuff with it. Which will make my program to take several hours to execute in the end.

I started looking into using several cores but it seems it could be problematic if some points are linked together for some reason. If you have any advice on how you would proceed, I'll gladly hear about it.

Edit1 : I'm running the program on a laptop with a CPU containing 8 cores - 2.9GHz, ram is 16GB and I'm using an ssd. The program has to run on similar hardware for this purpose.

Edit2 : Here's the complete program so you can tell me what I've been doing wrong. I localize each point in a sort of 2D grid called slab. Each cell will contain a certain amount of points and a z mean value.

#include <iostream>
#include <chrono>
#include <vector>
#include <algorithm>
#include <tuple>
#include <cmath>

// boost library
#include <boost/iostreams/device/mapped_file.hpp>
#include <boost/iostreams/stream.hpp>

struct point
{
    double x;
    double y;
    double z;
};

/*
    Compute Slab
*/

float slabBox[6] = {-25.,25.,-25.,25.,-1.,0.};
float dx = 0.1;
float dy = 0.1;
int slabSizeX = (slabBox[1] - slabBox[0]) / dx;
int slabSizeY = (slabBox[3] - slabBox[2]) / dy;

std::vector<std::tuple<double, double, double, int>> initSlab() 
{
    // initialize the slab vector according to the grid size
    std::vector<std::tuple<double, double, double, int>> slabVector(slabSizeX * slabSizeY, {0., 0., 0., 0});

    // fill the vector with {x,y} cells coordinates
    for (int y = 0; y < slabSizeY; y++)
    {
        for (int x = 0; x < slabSizeX; x++)
        {
            slabVector[x + y * slabSizeX] = {x * dx + slabBox[0], y * dy + slabBox[2], 0., 0};
        }
    }
    return slabVector;
}

std::vector<std::tuple<double, double, double, int>> addPoint2Slab(point p, std::vector<std::tuple<double, double, double, int>> slabVector)
{
    // find the region {x,y} in the slab in which coord {p.x,p.y} is
    int x = (int) floor((p.x - slabBox[0])/dx);
    int y = (int) floor((p.y - slabBox[2])/dy);
    
    // calculate the new z value
    double z = (std::get<2>(slabVector[x + y * slabSizeX]) * std::get<3>(slabVector[x + y * slabSizeX]) + p.z) / (std::get<3>(slabVector[x + y * slabSizeX]) + 1);

    // replace the older z
    std::get<2>(slabVector[x + y * slabSizeX]) = z;

    // add + 1 point in the cell
    std::get<3>(slabVector[x + y * slabSizeX])++;
    return slabVector;
}

/*
    Parse the file 
*/

void readMappedFile()
{
    boost::iostreams::mapped_file_source mmap("my_big_file.pts");
    boost::iostreams::stream<boost::iostreams::mapped_file_source> is(mmap, std::ios::binary);

    std::string line;
    std::getline(is, line);

    auto slabVector = initSlab();
    
    while (std::getline(is, line))
    {
        point p;
        sscanf(line.c_str(),"%lf %lf %lf %*d", &(p.x), &(p.y), &(p.z));
        if (p.z > slabBox[4] && p.z < slabBox[5])
        {
            slabVector = addPoint2Slab(p, slabVector);
        }
    }
}

int main ()
{
    readMappedFile();
    return 0;
}

Solution

  • Using memory maps is good. Using IOStreams isn't. Here's a complete take using Boost Spirit to do the parsing:

    An Easy Starter

    I'd suggest some cleanup around the typenames

    using Record = std::tuple<double, double, double, int>;
    
    std::vector<Record> initSlab()
    {
        // initialize the slab vector according to the grid size
        std::vector<Record> slabVector(slabSizeX * slabSizeY, {0., 0., 0., 0});
    
        // fill the vector with {x,y} cells coordinates
        for (int y = 0; y < slabSizeY; y++) {
            for (int x = 0; x < slabSizeX; x++) {
                slabVector[x + y * slabSizeX] = {
                    x * dx + slabBox[0],
                    y * dy + slabBox[2],
                    0.,
                    0,
                };
            }
        }
        return slabVector;
    }
    

    You could just use a struct instead of the tuple, but that's an exercise for the reader

    Don't Copy The SlabVector All The Time

    You had addPoint2Slab taking the slabVector by value (copying) and returning the modified vector. Even if that's optimized to a couple of moves, it's still at least allocating a temporary copy each time addPoint2Slab is called. Instead, make it a mutating function as intended:

    void addPoint2Slab(point const p, std::vector<Record>& slabVector)
    {
        // find the region {x,y} in the slab in which coord {p.x,p.y} is
        int x = (int) floor((p.x - slabBox[0])/dx);
        int y = (int) floor((p.y - slabBox[2])/dy);
        auto& [ix, iy, iz, icount] = slabVector[x + y * slabSizeX];
    
        iz = (iz * icount + p.z) / (icount + 1);
        icount += 1;
    }
    

    Note also that the tuple handling has been greatly simplified with structured bindings. You can even see what the code is doing, which was nearly impossible before - let alone verify.

    ReadMappedFile

    auto readMappedFile(std::string fname)
    {
        auto slabVector = initSlab();
    
        boost::iostreams::mapped_file_source mmap(fname);
    
        auto handle = [&](auto& ctx) {
            using boost::fusion::at_c;
            point p{at_c<0>(_attr(ctx)), at_c<1>(_attr(ctx)), at_c<2>(_attr(ctx))};
            //auto intensity = at_c<3>(_attr(ctx));
    
            if (p.z > slabBox[4] && p.z < slabBox[5])
                addPoint2Slab(p, slabVector);
        };
    
        namespace x3 = boost::spirit::x3;
        static auto const line_ =
            x3::float_ >> x3::float_ >> x3::float_ >> x3::int_;
    
        auto first = mmap.data(), last = first + mmap.size();
        try {
            bool ok = x3::phrase_parse( //
                first, last,
                x3::expect[x3::uint_ >> x3::eol] //
                    >> line_[handle] % x3::eol   //
                    // expect EOF here
                    >> *x3::eol >> x3::expect[x3::eoi], //
                x3::blank);
    
            // ok is true due to the expectation points
            assert(ok);
        } catch (x3::expectation_failure<char const*> const& ef) {
            auto where = ef.where();
            auto till  = std::min(last, where + 32);
            throw std::runtime_error("Expected " + ef.which() + " at #" +
                                     std::to_string(where - mmap.data()) + " '" +
                                     std::string(where, till) + "'...");
        }
    
        return slabVector;
    }
    

    Here we use Boost Spirit X3 to generate a parser that reads the lines and calls handle on each, much like you had before. A modicum of error handling has been added.

    Let's Test It

    Here's the test driver I used

    #include <fmt/ranges.h>
    #include <fstream>
    #include <random>
    #include <ranges>
    using std::ranges::views::filter;
    
    int main()
    {
        std::string const fname = "T032_OSE.pts";
    #if 0 || defined(GENERATE)
        using namespace std;
        // generates a ~12Gib file
        ofstream ofs(fname);
        mt19937  prng{random_device{}()};
        uniform_real_distribution<float> x(-25, 25), y(-25, +25), z(-1, 0);
        uniform_int_distribution<>       n(0, 100);
        auto N = 437288479;
        ofs << N << "\n";
        while (N--)
            ofs << x(prng) << " " << y(prng) << " " << z(prng) << " " << n(prng) << "\n";
    #else
        auto sv        = readMappedFile(fname);
        auto has_count = [](Record const& tup) { return get<3>(tup) > 0; };
        fmt::print("slabVector:\n{}\n", fmt::join(sv | filter(has_count), "\n"));
    #endif
    }
    

    Notice how you can use the conditionally compiled code to generate an input file (because I don't have your large file).

    On this ~13GiB file (compressed copy online) it runs in 1m14s on my machine:

    slabVector:
    (-25, -25, -0.49556059843940164, 1807)
    (-24.899999618530273, -25, -0.48971092838941654, 1682)
    (-24.799999237060547, -25, -0.49731256076256386, 1731)
    (-24.700000762939453, -25, -0.5006042266973916, 1725)
    (-24.600000381469727, -25, -0.5000671732885645, 1784)
    (-24.5, -25, -0.4940826157717386, 1748)
    (-24.399999618530273, -25, -0.5045350563593015, 1720)
    (-24.299999237060547, -25, -0.5088279537549671, 1812)
    (-24.200000762939453, -25, -0.5065565364794715, 1749)
    (-24.100000381469727, -25, -0.4933392542558793, 1743)
    (-24, -25, -0.4947248105973453, 1808)
    (-23.899999618530273, -25, -0.48640208470636714, 1696)
    (-23.799999237060547, -25, -0.4994672590531847, 1711)
    (-23.700000762939453, -25, -0.5033631130808075, 1782)
    (-23.600000381469727, -25, -0.4995593140170436, 1760)
    (-23.5, -25, -0.5009948279948179, 1737)
    (-23.399999618530273, -25, -0.4995986820225158, 1732)
    (-23.299999237060547, -25, -0.49833906199795897, 1764)
    (-23.200000762939453, -25, -0.5013796942594327, 1728)
    (-23.100000381469727, -25, -0.5072275248223541, 1700)
    (-23, -25, -0.4949060352670081, 1749)
    (-22.899999618530273, -25, -0.5026246990689665, 1740)
    (-22.799999237060547, -25, -0.493411989775698, 1746)
    // ... ~25k lines skipped...
    (24.200000762939453, 24.900001525878906, -0.508382879738258, 1746)
    (24.299999237060547, 24.900001525878906, -0.5064457874896565, 1740)
    (24.400001525878906, 24.900001525878906, -0.4990733400392924, 1756)
    (24.5, 24.900001525878906, -0.5063144518978036, 1732)
    (24.60000228881836, 24.900001525878906, -0.49988387744959534, 1855)
    (24.700000762939453, 24.900001525878906, -0.49970549673984693, 1719)
    (24.799999237060547, 24.900001525878906, -0.48656442707683384, 1744)
    (24.900001525878906, 24.900001525878906, -0.49267272688797675, 1705)
    

    Remaining Notes

    Beware of numerical error. You used float in some places, but with data sets this large it's very likely you will get noticeably large numeric errors in the running average calculation. Consider switching to [long] double or use a "professional" accumulator (many existing correlation frameworks or Boost Accumulator will do better).

    Full Code

    Live On Compiler Explorer

    #include <algorithm>
    #include <chrono>
    #include <cmath>
    #include <iostream>
    #include <tuple>
    #include <vector>
    #include <fmt/ranges.h>
    
    // boost library
    #include <boost/iostreams/device/mapped_file.hpp>
    #include <boost/iostreams/stream.hpp>
    
    struct point { double x, y, z; };
    
    /*
        Compute Slab
    */
    using Float = float; //
    
    Float slabBox[6] = {-25.,25.,-25.,25.,-1.,0.};
    Float dx = 0.1;
    Float dy = 0.1;
    int slabSizeX = (slabBox[1] - slabBox[0]) / dx;
    int slabSizeY = (slabBox[3] - slabBox[2]) / dy;
    
    using Record = std::tuple<double, double, double, int>;
    
    std::vector<Record> initSlab()
    {
        // initialize the slab vector according to the grid size
        std::vector<Record> slabVector(slabSizeX * slabSizeY, {0., 0., 0., 0});
    
        // fill the vector with {x,y} cells coordinates
        for (int y = 0; y < slabSizeY; y++) {
            for (int x = 0; x < slabSizeX; x++) {
                slabVector[x + y * slabSizeX] = {
                    x * dx + slabBox[0],
                    y * dy + slabBox[2],
                    0.,
                    0,
                };
            }
        }
        return slabVector;
    }
    
    void addPoint2Slab(point const p, std::vector<Record>& slabVector)
    {
        // find the region {x,y} in the slab in which coord {p.x,p.y} is
        int x = (int) floor((p.x - slabBox[0])/dx);
        int y = (int) floor((p.y - slabBox[2])/dy);
        auto& [ix, iy, iz, icount] = slabVector[x + y * slabSizeX];
    
        iz = (iz * icount + p.z) / (icount + 1);
        icount += 1;
    }
    
    /* Parse the file */
    #include <boost/spirit/home/x3.hpp>
    
    auto readMappedFile(std::string fname)
    {
        auto slabVector = initSlab();
    
        boost::iostreams::mapped_file_source mmap(fname);
    
        auto handle = [&](auto& ctx) {
            using boost::fusion::at_c;
            point p{at_c<0>(_attr(ctx)), at_c<1>(_attr(ctx)), at_c<2>(_attr(ctx))};
            //auto intensity = at_c<3>(_attr(ctx));
    
            if (p.z > slabBox[4] && p.z < slabBox[5])
                addPoint2Slab(p, slabVector);
        };
    
        namespace x3 = boost::spirit::x3;
        static auto const line_ =
            x3::double_ >> x3::double_ >> x3::double_ >> x3::int_;
    
        auto first = mmap.data(), last = first + mmap.size();
        try {
            bool ok = x3::phrase_parse( //
                first, last,
                x3::expect[x3::uint_ >> x3::eol] //
                    >> line_[handle] % x3::eol   //
                    // expect EOF here
                    >> *x3::eol >> x3::expect[x3::eoi], //
                x3::blank);
    
            // ok is true due to the expectation points
            assert(ok);
        } catch (x3::expectation_failure<char const*> const& ef) {
            auto where = ef.where();
            auto till  = std::min(last, where + 32);
            throw std::runtime_error("Expected " + ef.which() + " at #" +
                                     std::to_string(where - mmap.data()) + " '" +
                                     std::string(where, till) + "'...");
        }
    
        return slabVector;
    }
    
    #include <fmt/ranges.h>
    #include <fstream>
    #include <random>
    #include <ranges>
    using std::ranges::views::filter;
    
    int main()
    {
        std::string const fname = "T032_OSE.pts";
    #if 0 || defined(GENERATE)
        using namespace std;
        // generates a ~12Gib file
        ofstream ofs(fname);
        mt19937  prng{random_device{}()};
        uniform_real_distribution<Float> x(-25, 25), y(-25, +25), z(-1, 0);
        uniform_int_distribution<>       n(0, 100);
        auto N = 437288479;
        ofs << N << "\n";
        while (N--)
            ofs << x(prng) << " " << y(prng) << " " << z(prng) << " " << n(prng) << "\n";
    #else
        auto sv        = readMappedFile(fname);
        auto has_count = [](Record const& tup) { return get<3>(tup) > 0; };
        fmt::print("slabVector:\n{}\n", fmt::join(sv | filter(has_count), "\n"));
    #endif
    }