c++ linux pdf poppler

Is there any way to access the page header, page footer, and page content separately using libpoppler?


I am using libpoppler to convert a PDF file to plain text, and I want to output the page header, page footer, and content separately. How can I do this? Is there any structure or class that holds them?

Thanks in advance!!


Solution

  • You can get the text of a page with poppler_page_get_text(). Could you parse the plain text afterwards? Here is some sample code. It is C rather than C++, but I hope it shows the idea.

    Tested on Debian Unstable amd64 with libpoppler-glib-dev 0.18.4-3 and gcc 4.7.1-7.

    $ gcc -Wall -g -Wextra get-text.c $(pkg-config --cflags --libs poppler-glib)

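    #include <poppler.h>
    #include <glib.h>

    int main(int argc, char *argv[])
    {
        GError *error = NULL;
        PopplerDocument *d;
        PopplerPage *p;
        gchar *f;
        gchar *u;
        gchar *text;

    #if !GLIB_CHECK_VERSION(2, 36, 0)
        g_type_init();  /* needed before GLib 2.36, deprecated afterwards */
    #endif

        if (argc < 2)
                g_error("oops: no file name given");

        /* Poppler expects a URI, so build an absolute path first. */
        if (g_path_is_absolute(argv[1])) {
                f = g_strdup(argv[1]);
        } else {
                gchar *cwd = g_get_current_dir();
                f = g_build_filename(cwd, argv[1], NULL);
                g_free(cwd);
        }

        u = g_filename_to_uri(f, NULL, &error);
        g_free(f);
        if (!u)
                g_error("oops: %s", error->message);

        d = poppler_document_new_from_file(u, NULL, &error);
        g_free(u);
        if (!d)
                g_error("oops: %s", error->message);

        /* Page indices are zero-based, so 1 is the second page. */
        p = poppler_document_get_page(d, 1);
        if (!p)
                g_error("oops: no such page");

        /* The returned string is owned by the caller. */
        text = poppler_page_get_text(p);
        g_print("%s\n", text);

        g_free(text);
        g_object_unref(p);
        g_object_unref(d);
        return 0;
    }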
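
As for parsing the plain text afterwards, here is a rough sketch of one possible heuristic. It assumes (and this is only an assumption about your documents) that the first non-blank line of a page is the header and the last non-blank line is the footer, with everything in between being the content. PageParts, split_page_text() and page_parts_free() are made-up names for this illustration, not part of Poppler, and the code uses only the C standard library.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical container for the three parts; not a Poppler type. */
typedef struct {
    char *header;  /* first non-blank line, or "" */
    char *body;    /* text between header and footer, or "" */
    char *footer;  /* last non-blank line, or "" */
} PageParts;

static int is_space(char c)
{
    return c == ' ' || c == '\t' || c == '\r' || c == '\n';
}

/* Copy [s, e) into a new string, dropping surrounding whitespace. */
static char *copy_trimmed(const char *s, const char *e)
{
    size_t n;
    char *out;

    while (s < e && is_space(*s))
        s++;
    while (e > s && is_space(e[-1]))
        e--;
    n = (size_t)(e - s);
    out = malloc(n + 1);
    assert(out != NULL);
    memcpy(out, s, n);
    out[n] = '\0';
    return out;
}

/* Split page text on the ASSUMPTION that the first line is the header
 * and the last line is the footer. A single line is treated as body. */
PageParts split_page_text(const char *text)
{
    PageParts parts;
    const char *s, *e, *first_nl, *last_nl;

    assert(text != NULL);
    s = text;
    e = text + strlen(text);

    /* Ignore blank lines at the top and bottom of the page. */
    while (s < e && is_space(*s))
        s++;
    while (e > s && is_space(e[-1]))
        e--;

    first_nl = memchr(s, '\n', (size_t)(e - s));
    if (first_nl == NULL) {
        /* Zero or one line: nothing to split. */
        parts.header = copy_trimmed(s, s);
        parts.body = copy_trimmed(s, e);
        parts.footer = copy_trimmed(e, e);
        return parts;
    }

    /* A newline exists in [s, e), so this scan stops at or after first_nl. */
    last_nl = e - 1;
    while (*last_nl != '\n')
        last_nl--;

    parts.header = copy_trimmed(s, first_nl);
    parts.body = copy_trimmed(first_nl, last_nl);  /* empty for two lines */
    parts.footer = copy_trimmed(last_nl + 1, e);
    return parts;
}

void page_parts_free(PageParts *parts)
{
    free(parts->header);
    free(parts->body);
    free(parts->footer);
}
```

You would pass it the string returned by poppler_page_get_text() and call page_parts_free() when done. Headers or footers that span several lines, or pages that have neither, will be split wrongly, so you would need to adapt the rule to your documents.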