Search code examples
c#.netopenstreetmaplarge-datalarge-files

Processing and Filtering of osm-File(osm.pbf) is taking way too long in C#


Scenario: I want to write my own Autocomplete-API for Addresses, just like the One Google is offering. (Very Basic: Street, Housenumber, City, Postcode, Country). It is intended for private use and training-purposes only. I want to cover about 1 Million Addresses for a Start.

Technology Used: .Net Framework (not Core), C#, Visual Studio, OSMSharp, Microsoft SQL-Server, Web Api 2 (although i will probably switch to ASP.Net Core in the Future.)

Approach:

  • Set Up Project (Web Api 2 or Console Project for Demo-Purposes)
  • Download relevant File from OpenStreetMaps using DownloadClient() (https://download.geofabrik.de/)
  • Read in the File using OSMSharp and Filter out relevant Data.
  • Convert Filtered Data to a DataTable.
  • Use DataTable to feed SQLBulkCopy Method to import Data into Database.

Problem: Step 4 is taking way too long. For a File like "Regierungsbezirk Köln" in the Format osm.pbf which is about 160MB (the uncompressed osm file is about 2.8 GB) where talking about 4-5 Hours. I want to optimize this. The Bulk Copy of the DataTable into the Database on the other Hand (About 1 Million Rows) is taking just about 5 Seconds. (Woah. Amazing.)

Minimal Reproduction: https://github.com/Cr3pit0/OSM2Database-Minimal-Reproduction

What i tried:

  • Use a Stored Procedure in SQL-Server. This comes with a whole different Set of Problems and i didn't quite manage to get it Working (mainly because the uncompressed osm.pbf File is over 2GB and SQL Server doesn't like that)

  • Come up with a different approach to Filter and Convert the Data from the File to a DataTable (or CSV).

  • Use the Overpass-API. Although I read somewhere that the Overpass-API is not intended for DataSets above 10,000 Entries.

  • Ask the Jedi-Grandmasters on StackOverflow for Help. (Currently in Process ... :D)

Code Extract:

public static DataTable getDataTable_fromOSMFile(string FileDownloadPath)
{

    Console.WriteLine("Finished Downloading. Reading File into Stream...");

    using (var fileStream = new FileInfo(FileDownloadPath).OpenRead())
    {
        PBFOsmStreamSource source = new PBFOsmStreamSource(fileStream);

        if (source.Any() == false)
        {
            return new DataTable();
        }

        Console.WriteLine("Finished Reading File into Stream. Filtering and Formatting RawData to Addresses...");
        Console.WriteLine();

        DataTable dataTable = convertAdressList_toDataTable(
                    source.Where(x => x.Type == OsmGeoType.Way && x.Tags.Count > 0 && x.Tags.ContainsKey("addr:street"))
                    .Select(Address.fromOSMGeo)
                    .Distinct(new AddressComparer())
                );

        return dataTable;
    }
};
private static DataTable convertAdressList_toDataTable(IEnumerable<Address> addresses)
{
    DataTable dataTable = new DataTable();

    if (addresses.Any() == false)
    {
        return dataTable;
    }

    dataTable.Columns.Add("Id");
    dataTable.Columns.Add("Street");
    dataTable.Columns.Add("Housenumber");
    dataTable.Columns.Add("City");
    dataTable.Columns.Add("Postcode");
    dataTable.Columns.Add("Country");

    Int32 counter = 0;

    Console.WriteLine("Finished Filtering and Formatting. Writing Addresses From Stream to a DataTable Class for the Database-SQLBulkCopy-Process ");

    foreach (Address address in addresses)
    {
        dataTable.Rows.Add(counter + 1, address.Street, address.Housenumber, address.City, address.Postcode, address.Country);
        counter++;

        if (counter % 10000 == 0 && counter != 0)
        {
            Console.WriteLine("Wrote " + counter + " Rows From Stream to DataTable.");
        }
    }

    return dataTable;
};

Solution

  • Okay i think i got it. Im down to about 12 Minutes for a File-Size of about 600mb and about 3.1 Million Rows of Data after Filtering.

    The first Thing i tried is to replace the logic that populates my DataTable with FastMember. Which worked, but didnt give the Performance Increase i was hoping for (I canceled the Process after 3 Hours...). After more Research i stumbled upon an old Project which is called "osm2mssql" (https://archive.codeplex.com/?p=osm2mssql). I used a little part of the Code which directly read the Data from the osm.pbf File and modified it to my Use-Case ( → which is to extract Address-Data from Ways). I did actually use FastMember to write an IEnumerable<Address> to the Datatable, but i dont need OSM-Sharp and whatever extra Dependencies they have anymore. So thank you very much for the Suggestion of FastMember. I will certainly keep that Library in Mind in future Projects.

    For those who are interested, i updated my Github-Project accordingly (https://github.com/Cr3pit0/OSM2Database-Minimal-Reproduction) (although i didnt thoroughly test it, because i moved on from the Test-Project to the Real Deal, which is a Web Api)

    Im quite sure it can be further optimized but i dont think i care at the Moment. 12 Minutes for a Method which might be called once a month to update the whole Database is fine i guess. Now i can move on to opimizing my Queries for the Autocomplete.

    So thank you very much to whoever wrote "osm2mssql".