
Multithreading with Task reading PDF files using C#


I'm trying to extract text from a very large PDF file (91914 pages) using iTextSharp. Extraction takes a long time, so I want to use Task or threads to speed it up.

I have something like this:

var processTask = new List<Task>();
using (PdfReader reader = new PdfReader(fileInfo.FullName))
{

   for (int startpage = 1; 
         startpage <= reader.NumberOfPages;
         startpage = startpage + num + 1)
   {
     processTask.Add(Task.Run(() => ProccesSinglePDF(
       reader, 
       sourcePath + "PDFs\\" + (object)startpage + ".pdf",
       startpage,
       startpage + num,
       new FileInfo(
            sourcePath + "PDFs\\" + (object)startpage + ".pdf"), searchText)));
   }

   foreach (var task in processTask)
   {
      await task;
   }
}

The ProcessSinglePDF method searches for the text I'm looking for and makes some database calls (reading data and updating some values). It does not seem to work correctly: it finishes very quickly and doesn't process all the pages. I know this because I added a Console.WriteLine(startpage) to confirm.


Solution

  • An observation or two, as I am not familiar with iTextSharp:

    From the GitHub repo at https://github.com/schourode/iTextSharp-LGPL/blob/master/src/core/iTextSharp/text/pdf/PdfReader.cs, it seems that this class is not thread-safe by design (many fields, no locks).

    You are sharing a single instance among many threads (this seems to be your intent, though Task.Run does not necessarily spawn a new thread for each call).

    PdfReader does expose an alternative constructor, which you should use to give each of your threads its own PdfReader instance (copied from the repo above):

        /** Creates an independent duplicate.
        * @param reader the <CODE>PdfReader</CODE> to duplicate
        */    
        public PdfReader(PdfReader reader) {
    

    Create a new task for each range of pages, each with its own PdfReader instance, and then await all of the tasks together with Task.WhenAll.
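
The suggestion above could be sketched roughly as follows. This is only an illustration under several assumptions, not tested against the asker's project: PdfRangeRunner, PageRanges, ProcessInRangesAsync, pagesPerTask and the processRange delegate are hypothetical names, with processRange standing in for the question's ProccesSinglePDF. The copies are created on the calling thread, one at a time, because reading the shared reader from several threads at once is not known to be safe. The page bounds are also copied into locals declared inside the loop body, since the question's lambda captures the `for` variable startpage, which C# shares across all iterations, so a task may see a later value by the time it runs.

```csharp
using System;
using System.Collections.Generic;
using System.Threading.Tasks;
using iTextSharp.text.pdf;

public static class PdfRangeRunner
{
    // Splits pages 1..pageCount into inclusive (first, last) ranges
    // of at most pagesPerTask pages each.
    public static List<Tuple<int, int>> PageRanges(int pageCount, int pagesPerTask)
    {
        if (pagesPerTask < 1) throw new ArgumentOutOfRangeException(nameof(pagesPerTask));
        var ranges = new List<Tuple<int, int>>();
        for (int first = 1; first <= pageCount; first += pagesPerTask)
        {
            ranges.Add(Tuple.Create(first, Math.Min(first + pagesPerTask - 1, pageCount)));
        }
        return ranges;
    }

    // processRange is a hypothetical stand-in for the per-range work;
    // it receives a reader that no other task touches.
    public static async Task ProcessInRangesAsync(
        string pdfPath, int pagesPerTask, Action<PdfReader, int, int> processRange)
    {
        var master = new PdfReader(pdfPath);
        try
        {
            var tasks = new List<Task>();
            foreach (var range in PageRanges(master.NumberOfPages, pagesPerTask))
            {
                // Locals declared per iteration: each lambda captures its own values.
                int first = range.Item1;
                int last = range.Item2;

                // Independent duplicate, created sequentially on the calling thread.
                var copy = new PdfReader(master);

                tasks.Add(Task.Run(() =>
                {
                    try { processRange(copy, first, last); }
                    finally { copy.Close(); }
                }));
            }

            // Wait for every range to finish before closing the master reader.
            await Task.WhenAll(tasks);
        }
        finally
        {
            master.Close();
        }
    }
}
```

A caller would pass the file path, a chunk size, and a delegate that does the search for one range. Each copy holds its own state, so with a file of this size the number of simultaneous copies may need to be limited, and any database access inside the delegate has to be safe for concurrent use as well.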