I want to extract the text from a PDF and re-lay it out. My code is as follows:
BOOL CTextEditorDoc::loadTxt()
{
    if(m_strPDFPath.IsEmpty())
        return FALSE;
#ifdef _DEBUG
    DWORD dwTick = GetTickCount();
    CString strLog;
#endif
    CString strFile;
    fz_context *ctx;
    fz_document* doc;
    fz_matrix ctm;
    fz_page *page;
    fz_device *dev;
    fz_text_page *text;
    fz_text_sheet *sheet;
    int i,line,rotation,pagecount;
    if(!gb2312toutf8(m_strPDFPath,strFile))
        return FALSE;
    ctx = fz_new_context(NULL, NULL, FZ_STORE_UNLIMITED);
    fz_try(ctx){
        doc = fz_open_document(ctx, strFile.GetBuffer(0));
    }fz_catch(ctx){
        fz_free_context(ctx);
        return FALSE;
    }
    line = 0;
    rotation = 0;
    pagecount = 0;
    pagecount = fz_count_pages(doc);
    fz_rotate(&ctm, rotation);
    fz_pre_scale(&ctm,1.0f,1.0f);
    sheet = fz_new_text_sheet(ctx);
    for(i=0;i<pagecount;i++){
        page = fz_load_page(doc,i);
        text = fz_new_text_page(ctx);
        dev = fz_new_text_device(ctx, sheet, text);
#ifdef _DEBUG
        dwTick = GetTickCount();
#endif
        fz_run_page(doc, page, dev, &ctm, NULL);
#ifdef _DEBUG
        strLog.Format("run page:%d ms\n",GetTickCount() - dwTick);
        OutputDebugString(strLog);
        dwTick = GetTickCount();
#endif
        //m_linesInfoVector.push_back(line);
        print_text_page(ctx,m_strContent,text,line);
#ifdef _DEBUG
        strLog.Format("print text:%d ms\n",GetTickCount() - dwTick);
        OutputDebugString(strLog);
        dwTick = GetTickCount();
#endif
        fz_free_device(dev);
        fz_free_text_page(ctx,text);
        fz_free_page(doc, page);
    }
    fz_free_text_sheet(ctx,sheet);
    fz_close_document(doc);
    fz_free_context(ctx);
    return TRUE;
}
This code extracts all the text from the PDF, but it seems too slow. How can I improve it?
Most of the time is spent in fz_run_page. If I only want to extract text from the PDF, is there a way to avoid calling fz_run_page?
At a quick glance your code looks fine.
To extract text from a PDF you need to interpret the PDF operator streams. fz_run_page does this. It results in calls to whatever device you specify, in this case the structured text extraction device. That device collates the randomly positioned glyphs from all over the page into a more structured form of words/lines/paragraphs/columns etc.
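To make the interpreter/device split concrete, here is a toy sketch of the idea. It is not MuPDF code: Glyph, Op, Device, TextDevice and runPage are hypothetical names invented for this illustration, and the line grouping is a deliberately naive stand-in for what a real text device does. It assumes y grows downwards and that glyphs on one line share (almost) the same y.

```cpp
#include <algorithm>
#include <string>
#include <vector>

// Hypothetical types for illustration only; none of these are MuPDF names.
struct Glyph { float x; float y; char ch; };

// One decoded page operator: either "show a glyph" or "draw an image".
struct Op {
    enum Kind { ShowGlyph, DrawImage };
    Kind kind;
    Glyph glyph;   // only meaningful when kind == ShowGlyph
};

// A device receives the calls produced while the page is interpreted.
struct Device {
    virtual ~Device() {}
    virtual void showGlyph(const Glyph &g) = 0;
    virtual void drawImage() {}   // ignored unless a device overrides it
};

// Toy text device: records glyphs as they arrive, then collates them.
class TextDevice : public Device {
public:
    void showGlyph(const Glyph &g) override { glyphs_.push_back(g); }

    // Group glyphs into lines by y, then order each line by x.
    std::vector<std::string> lines(float tolerance = 0.5f) const {
        std::vector<Glyph> sorted(glyphs_);
        std::sort(sorted.begin(), sorted.end(),
                  [](const Glyph &a, const Glyph &b) { return a.y < b.y; });

        std::vector<std::vector<Glyph> > groups;
        for (const Glyph &g : sorted) {
            // Start a new line when the glyph sits clearly below the current one.
            if (groups.empty() || g.y - groups.back().front().y > tolerance)
                groups.emplace_back();
            groups.back().push_back(g);
        }

        std::vector<std::string> out;
        for (std::vector<Glyph> &grp : groups) {
            std::sort(grp.begin(), grp.end(),
                      [](const Glyph &a, const Glyph &b) { return a.x < b.x; });
            std::string text;
            for (const Glyph &g : grp)
                text += g.ch;
            out.push_back(text);
        }
        return out;
    }

private:
    std::vector<Glyph> glyphs_;
};

// Toy interpreter: every operator is visited and dispatched to the device,
// including the ones a text device does not care about.
void runPage(const std::vector<Op> &ops, Device &dev) {
    for (const Op &op : ops) {
        if (op.kind == Op::ShowGlyph)
            dev.showGlyph(op.glyph);
        else
            dev.drawImage();
    }
}
```

Feeding runPage a page whose glyphs arrive out of reading order (say 'i', 'o', an image, 'H', 'k') still yields the lines "Hi" and "ok" from TextDevice::lines(). Note that in this model the interpreter has to walk every operator before the device sees any text, which is why the time shows up in the "run page" step rather than in the device itself.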
So, in short you're doing the right thing.
There are currently no user-serviceable ways to improve the speed of this. It is possible that in future versions we could use a device hint to avoid reading images etc. I will ponder on this and discuss it with the other devs. But for now you're doing the right thing.
HTH.