Search code examples
c#entity-frameworkproducer-consumer

How can i make this run faster?


Can I do threads instead of tasks to make this run faster? I'm trying to get 114000 products into the database. As my code is right now I get about 100 products into the database a minute.

My Tasks (producers) each scrape an XML File which contains a product data, packages it in the Product class, then queue's it for the consumer.

my Consumer takes each product from the queue and puts it into the database 1 at a time. I use Entity Framework so it's not safe for threading.

public static void GetAllProductsFromIndexes_AndPutInDB(List<IndexModel> indexes, ProductContext context)
{
    BlockingCollection<IndexModel> inputQueue = CreateInputQueue(indexes);
    BlockingCollection<Product> productsQueue = new BlockingCollection<Product>(5000);

    var consumer = Task.Run(() =>
    {
        foreach (Product readyProduct in productsQueue.GetConsumingEnumerable())
        {
            InsertProductInDB(readyProduct, context);
        }
    });

    var producers = Enumerable.Range(0, 100)
        .Select(_ => Task.Run(() =>
        {
            foreach (IndexModel index in inputQueue.GetConsumingEnumerable())
            {
                Product product = new Product();
                byte[] unconvertedByteArray;
                string xml;
                string url = @"https://data.Icecat.biz/export/freexml.int/en/";

                unconvertedByteArray = DownloadIcecatFile(index.IndexNumber.ToString() + ".xml", url);
                xml = Encoding.UTF8.GetString(unconvertedByteArray);
                XmlDocument xmlDoc = new XmlDocument();
                xmlDoc.LoadXml(xml);

                GetProductDetails(product, xmlDoc, index);

                XmlNodeList nodeList = (xmlDoc.SelectNodes("ICECAT-interface/Product/ProductFeature"));
                product.FeaturesLink = GetProductFeatures(product, nodeList);

                nodeList = (xmlDoc.SelectNodes("ICECAT-interface/Product/ProductGallery/ProductPicture"));
                product.Images = GetProductImages(nodeList);
                productsQueue.Add(product);
            }
        })).ToArray();

    Task.WaitAll(producers);
    productsQueue.CompleteAdding();
    consumer.Wait();
}

Solution

  • A couple of things you must do.

    Detach each Product entity after you isnert it, or they will all accumulate in the Change Tracker.

    Don't call SaveChanges after every product. Batch up a hundred or so. Like this:

    var consumer = Task.Run(() =>
    {
        var batch = new List<Product>();
    
        foreach (Product readyProduct in productsQueue.GetConsumingEnumerable())
        {
            batch.Add(readyProduct);
            if (batch.Count >= 100)
            {
                context.Products.AddRange(batch);
                context.SaveChanges();
                foreach (var p in batch)
                {
                    context.Entry(p).State = EntityState.Detached;
                }
                batch.Clear();
            }
    
        }
        context.Products.AddRange(batch);
        context.SaveChanges();
        foreach (var p in batch)
        {
            context.Entry(p).State = EntityState.Detached;
        }
    
    });
    

    If you're on EF Core and your provider supports it (Like SQL Server), you'll even get statement batching. You should expect several hunderd rows per second using basic best-practices here. If you need more than that, you can switch to a bulk load API ( like SqlBulkCopy for SQL Server).