c# azure itext7 azure-blob-storage

How do I extract text from a PDF stored in blob storage using iText 7?


I'm using iText 7 to extract text from a PDF. Here is my code for extracting the text from a local PDF file:

    using System.Text;
    using iText.Kernel.Pdf;
    using iText.Kernel.Pdf.Canvas.Parser;
    using iText.Kernel.Pdf.Canvas.Parser.Listener;

    var pageText = new StringBuilder();
    using (PdfDocument pdfDocument = new PdfDocument(new PdfReader("E:\\es.pdf"))) {
        var pageNumbers = pdfDocument.GetNumberOfPages();
        for (int i = 1; i <= pageNumbers; i++) {
            LocationTextExtractionStrategy strategy = new LocationTextExtractionStrategy();
            PdfCanvasProcessor parser = new PdfCanvasProcessor(strategy);
            // GetPage(i) rather than GetFirstPage(), so every page is read (pages are 1-based)
            parser.ProcessPageContent(pdfDocument.GetPage(i));
            pageText.Append(strategy.GetResultantText());
        }
    }

However, I can't work out how to parse a PDF stored in Azure Blob Storage.


Solution

  • To read a PDF from Azure Blob Storage, refer to the following code:

    using System;
    using System.Text;
    using Azure.Storage;
    using Azure.Storage.Blobs;
    using Azure.Storage.Blobs.Models;
    using iText.Kernel.Pdf;
    using iText.Kernel.Pdf.Canvas.Parser;
    using iText.Kernel.Pdf.Canvas.Parser.Listener;

    string storageAccountName = "andyprivate";
    string accountKey = ""; // storage account key (left blank here)
    var blobServiceClient = new BlobServiceClient(
        new Uri($"https://{storageAccountName}.blob.core.windows.net"),
        new StorageSharedKeyCredential(storageAccountName, accountKey),
        new BlobClientOptions());

    var containerClient = blobServiceClient.GetBlobContainerClient("test");
    var blob = containerClient.GetBlobClient("sample.pdf");

    // Fetch the blob's properties so its length can be used as the read buffer size
    BlobProperties properties = await blob.GetPropertiesAsync();
    var pageText = new StringBuilder();
    using (var stream = await blob.OpenReadAsync(position: 0, bufferSize: (int)properties.ContentLength))
    using (PdfDocument pdfDocument = new PdfDocument(new PdfReader(stream)))
    {
        var pageNumbers = pdfDocument.GetNumberOfPages();
        for (int i = 1; i <= pageNumbers; i++)
        {
            // Use a fresh strategy for each page, then append that page's text
            LocationTextExtractionStrategy strategy = new LocationTextExtractionStrategy();
            PdfCanvasProcessor parser = new PdfCanvasProcessor(strategy);
            parser.ProcessPageContent(pdfDocument.GetPage(i));
            pageText.Append(strategy.GetResultantText());
            pageText.Append(Environment.NewLine);
        }

        Console.WriteLine(pageText);
    }
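
A possible variation on the code above, offered as a sketch rather than a definitive implementation: because the answer sizes the read buffer to the whole blob anyway, it may be simpler to download the blob into a MemoryStream and hand that to PdfReader. The class and method names here (BlobPdfText, ExtractText, ExtractTextFromBlobAsync) are hypothetical and only for illustration. The sketch assumes the PDF fits comfortably in memory and that the Azure.Storage.Blobs and itext7 packages are referenced.

```csharp
using System.IO;
using System.Text;
using System.Threading.Tasks;
using Azure.Storage.Blobs;
using iText.Kernel.Pdf;
using iText.Kernel.Pdf.Canvas.Parser;
using iText.Kernel.Pdf.Canvas.Parser.Listener;

// Hypothetical helper class; the names are illustrative, not part of either library.
public static class BlobPdfText
{
    // Extracts text from a readable PDF stream, one page at a time.
    public static string ExtractText(Stream pdfStream)
    {
        var text = new StringBuilder();
        using (var pdfDocument = new PdfDocument(new PdfReader(pdfStream)))
        {
            int pageCount = pdfDocument.GetNumberOfPages();
            for (int i = 1; i <= pageCount; i++) // pages are 1-based
            {
                // A fresh strategy per page, as in the answer's loop.
                var strategy = new LocationTextExtractionStrategy();
                text.AppendLine(
                    PdfTextExtractor.GetTextFromPage(pdfDocument.GetPage(i), strategy));
            }
        }
        return text.ToString();
    }

    // Downloads the whole blob into memory, then extracts its text.
    // Assumption: the blob is small enough to buffer entirely.
    public static async Task<string> ExtractTextFromBlobAsync(BlobClient blob)
    {
        using (var buffer = new MemoryStream())
        {
            await blob.DownloadToAsync(buffer);
            buffer.Position = 0; // rewind so the reader starts at the PDF header
            return ExtractText(buffer);
        }
    }
}
```

With the blob client from the answer, usage would look like `string text = await BlobPdfText.ExtractTextFromBlobAsync(blob);`. Keeping the stream-based ExtractText separate means the same method also works for a local file opened with File.OpenRead.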
    
