c#wpf winforms full-text-search pdf-reader

File content search c#

I'm trying to implement file content search in my application.

It should work just like in Windows: I type into the search box and, if "File contents" is checked in the settings, the search returns every file that contains the search string, no matter whether it is a text file or a PDF/Word file.

I already have an application for file and folder search, and its content search works pretty well for text files and Word files. I'm using Word interop for Word files.

I know I can use iTextSharp or some other third-party library to do this for PDF files, but that doesn't satisfy me. I'd like to find out how Windows does it, or whether anyone else has done it a different way. I'd prefer not to use a third-party tool, though that doesn't mean I can't; I just want to keep my application light and not weigh it down with many dependencies.

Solution

As far as I know, it is not possible to search PDF content without a third-party tool, program, or utility installed; pdfgrep is one example of such a tool. If you are writing a C# program anyway, I would include a third-party library to do the job.

I wrote a solution for something similar in this answer: "Read specific value based on label name from PDF in C#". With a bit of tweaking you can get what you are looking for. The only catch with PdfClown is that it targets .NET Framework; on the other hand, it is open source, free, and has no limitations. If you need .NET Core, you may find some free (with limitations) or paid PDF libraries.

As you requested in the comments, here is a sample solution that finds text inside PDF pages. I have left comments in the code:

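//Namespaces assume the PDF Clown for .NET assembly is referenced
using System.Collections.Generic;
using System.IO;
using org.pdfclown.documents.contents;
using org.pdfclown.documents.contents.objects;
using File = org.pdfclown.files.File; //alias avoids a clash with System.IO.File

//The found content
private List<string> _contentList;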

//Search for content in a given pdf file
public bool SearchPdf(FileInfo fileInfo, string word)
{
    _contentList = new List<string>();
    ExtractPages(fileInfo.FullName);
    var content = string.Join(" ", _contentList);
    return content.Contains(word);
}

//Extract content for each page of given pdf file
private void ExtractPages(string filePath)
{
    using (var file = new File(filePath))
    {
        var document = file.Document;

        foreach (var page in document.Pages)
        {
            Extract(new ContentScanner(page));
        }
    }
}

//Extract content of pdf page and put the found result inside _contentList
private void Extract(ContentScanner level)
{
    if (level == null)
        return;

    while (level.MoveNext())
    {
        var content = level.Current;
        switch (content)
        {
            case ShowText text:
                {
                    var font = level.State.Font;
                    _contentList.Add(font.Decode(text.Text));
                    break;
                }
            case Text _:
            case ContainerObject _:
                Extract(level.ChildLevel);
                break;
        }
    }
}

Now let's do a quick test. Assume all your invoices are in the c:\temp folder:

static void Main(string[] args)
{
    var program = new SearchPdfContent();

    DirectoryInfo d = new DirectoryInfo(@"c:\temp");
    FileInfo[] Files = d.GetFiles("*.pdf");
    var word = "Sushi";
    foreach (FileInfo file in Files)
    {
        var found = program.SearchPdf(file, word);
        if (found)
        {
            Console.WriteLine($"{file.FullName} contains word {word}");
        }
    }
}

In my case, for example, one of the invoices contains the word Sushi, so the output is:

c:\temp\invoice0001.pdf contains word Sushi
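
Note that SearchPdf above uses string.Contains, which is case-sensitive, so searching for "Sushi" would miss an invoice that only contains "sushi". If you want the search box to behave more like the Windows one, a case-insensitive comparison is probably what you need. Here is a minimal sketch; the TextMatch class and the ContainsIgnoreCase name are made up for illustration and are not part of PdfClown:

```csharp
using System;

// Hypothetical helper (not part of PdfClown): case-insensitive substring check.
public static class TextMatch
{
    // Returns true when 'content' contains 'word', ignoring case.
    // Empty or null inputs are treated as "no match".
    public static bool ContainsIgnoreCase(string content, string word)
    {
        if (string.IsNullOrEmpty(content) || string.IsNullOrEmpty(word))
            return false;

        // IndexOf with StringComparison works on .NET Framework as well as .NET Core.
        return content.IndexOf(word, StringComparison.OrdinalIgnoreCase) >= 0;
    }
}
```

With that in place, the last line of SearchPdf could return TextMatch.ContainsIgnoreCase(content, word) instead of content.Contains(word).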

All that said, this is just an example solution. You can take it from here and bring it to the next level. Enjoy your day.
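
One possible way to take it further, since you want a single search that covers text, Word and PDF files, is to pick a search routine based on the file extension. The sketch below is only an assumption about how you might wire it up: the ContentSearcher class and its Register and Contains methods are invented for this example, and only the plain-text handler is implemented.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

// Hypothetical dispatcher: maps a file extension to a content-search routine.
public sealed class ContentSearcher
{
    // Extension lookup is case-insensitive, so ".PDF" and ".pdf" share a handler.
    private readonly Dictionary<string, Func<FileInfo, string, bool>> _handlers =
        new Dictionary<string, Func<FileInfo, string, bool>>(StringComparer.OrdinalIgnoreCase);

    public ContentSearcher()
    {
        // Plain text is handled out of the box; other formats are registered by the caller.
        _handlers[".txt"] = SearchTextFile;
    }

    // Adds or replaces the handler for an extension such as ".pdf" or ".docx".
    public void Register(string extension, Func<FileInfo, string, bool> handler)
    {
        _handlers[extension] = handler;
    }

    // Returns false for extensions that have no registered handler.
    public bool Contains(FileInfo file, string word)
    {
        Func<FileInfo, string, bool> handler;
        return _handlers.TryGetValue(file.Extension, out handler) && handler(file, word);
    }

    // Streams the file line by line, so large files are not loaded into memory at once.
    private static bool SearchTextFile(FileInfo file, string word)
    {
        foreach (var line in File.ReadLines(file.FullName))
        {
            if (line.IndexOf(word, StringComparison.OrdinalIgnoreCase) >= 0)
                return true;
        }
        return false;
    }
}
```

The SearchPdf method from the sample above has a matching signature, so it could be plugged in with searcher.Register(".pdf", new SearchPdfContent().SearchPdf), and your existing Word interop code could be registered for ".docx" in the same way.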

Here are some links I came across while researching this: