Scenario: I want to write my own Autocomplete-API for Addresses, just like the One Google is offering. (Very Basic: Street, Housenumber, City, Postcode, Country). It is intended for private use and training-purposes only. I want to cover about 1 Million Addresses for a Start.
Technology Used: .Net Framework (not Core), C#, Visual Studio, OSMSharp, Microsoft SQL-Server, Web Api 2 (although i will probably switch to ASP.Net Core in the Future.)
Approach:
Problem: Step 4 is taking way too long. For a File like "Regierungsbezirk Köln" in the Format osm.pbf which is about 160MB (the uncompressed osm file is about 2.8 GB) where talking about 4-5 Hours. I want to optimize this. The Bulk Copy of the DataTable into the Database on the other Hand (About 1 Million Rows) is taking just about 5 Seconds. (Woah. Amazing.)
Minimal Reproduction: https://github.com/Cr3pit0/OSM2Database-Minimal-Reproduction
What i tried:
Use a Stored Procedure in SQL-Server. This comes with a whole different Set of Problems and i didn't quite manage to get it Working (mainly because the uncompressed osm.pbf File is over 2GB and SQL Server doesn't like that)
Come up with a different approach to Filter and Convert the Data from the File to a DataTable (or CSV).
Use the Overpass-API. Although I read somewhere that the Overpass-API is not intended for DataSets above 10,000 Entries.
Ask the Jedi-Grandmasters on StackOverflow for Help. (Currently in Process ... :D)
Code Extract:
public static DataTable getDataTable_fromOSMFile(string FileDownloadPath)
{
Console.WriteLine("Finished Downloading. Reading File into Stream...");
using (var fileStream = new FileInfo(FileDownloadPath).OpenRead())
{
PBFOsmStreamSource source = new PBFOsmStreamSource(fileStream);
if (source.Any() == false)
{
return new DataTable();
}
Console.WriteLine("Finished Reading File into Stream. Filtering and Formatting RawData to Addresses...");
Console.WriteLine();
DataTable dataTable = convertAdressList_toDataTable(
source.Where(x => x.Type == OsmGeoType.Way && x.Tags.Count > 0 && x.Tags.ContainsKey("addr:street"))
.Select(Address.fromOSMGeo)
.Distinct(new AddressComparer())
);
return dataTable;
}
};
private static DataTable convertAdressList_toDataTable(IEnumerable<Address> addresses)
{
DataTable dataTable = new DataTable();
if (addresses.Any() == false)
{
return dataTable;
}
dataTable.Columns.Add("Id");
dataTable.Columns.Add("Street");
dataTable.Columns.Add("Housenumber");
dataTable.Columns.Add("City");
dataTable.Columns.Add("Postcode");
dataTable.Columns.Add("Country");
Int32 counter = 0;
Console.WriteLine("Finished Filtering and Formatting. Writing Addresses From Stream to a DataTable Class for the Database-SQLBulkCopy-Process ");
foreach (Address address in addresses)
{
dataTable.Rows.Add(counter + 1, address.Street, address.Housenumber, address.City, address.Postcode, address.Country);
counter++;
if (counter % 10000 == 0 && counter != 0)
{
Console.WriteLine("Wrote " + counter + " Rows From Stream to DataTable.");
}
}
return dataTable;
};
Okay i think i got it. Im down to about 12 Minutes for a File-Size of about 600mb and about 3.1 Million Rows of Data after Filtering.
The first Thing i tried is to replace the logic that populates my DataTable with FastMember. Which worked, but didnt give the Performance Increase i was hoping for (I canceled the Process after 3 Hours...). After more Research i stumbled upon an old Project which is called "osm2mssql" (https://archive.codeplex.com/?p=osm2mssql). I used a little part of the Code which directly read the Data from the osm.pbf File and modified it to my Use-Case ( → which is to extract Address-Data from Ways). I did actually use FastMember to write an IEnumerable<Address>
to the Datatable, but i dont need OSM-Sharp and whatever extra Dependencies they have anymore. So thank you very much for the Suggestion of FastMember. I will certainly keep that Library in Mind in future Projects.
For those who are interested, i updated my Github-Project accordingly (https://github.com/Cr3pit0/OSM2Database-Minimal-Reproduction) (although i didnt thoroughly test it, because i moved on from the Test-Project to the Real Deal, which is a Web Api)
Im quite sure it can be further optimized but i dont think i care at the Moment. 12 Minutes for a Method which might be called once a month to update the whole Database is fine i guess. Now i can move on to opimizing my Queries for the Autocomplete.
So thank you very much to whoever wrote "osm2mssql".