.net, parsing, rtf

How can I split an RTF file?


I want to split an RTF file (with C# or VB.Net) into 2 or more parts at the string [BreakPage]. For example, this file contains one [BreakPage] and needs to be split into 2 parts:

{\rtf1\ansi\ansicpg1251\uc1\deff0\stshfdbch0\stshfloch0\stshfhich0\stshfbi0\deflang1049\deflangfe1049{\fonttbl{\f0\froman\fcharset204\fprq2{\*\panose 02020603050405020304}Times New Roman;}{\f38\froman\fcharset0\fprq2 Times New Roman;} {\f36\froman\fcharset238\fprq2 Times New Roman CE;}{\f39\froman\fcharset161\fprq2 Times New Roman Greek;}{\f40\froman\fcharset162\fprq2 Times New Roman Tur;}{\f41\froman\fcharset177\fprq2 Times New Roman (Hebrew);} {\f42\froman\fcharset178\fprq2 Times New Roman (Arabic);}{\f43\froman\fcharset186\fprq2 Times New Roman Baltic;}{\f44\froman\fcharset163\fprq2 Times New Roman (Vietnamese);}}{\colortbl;\red0\green0\blue0;\red0\green0\blue255;\red0\green255\blue255; \red0\green255\blue0;\red255\green0\blue255;\red255\green0\blue0;\red255\green255\blue0;\red255\green255\blue255;\red0\green0\blue128;\red0\green128\blue128;\red0\green128\blue0;\red128\green0\blue128;\red128\green0\blue0;\red128\green128\blue0; \red128\green128\blue128;\red192\green192\blue192;}{\stylesheet{\ql \li0\ri0\widctlpar\aspalpha\aspnum\faauto\adjustright\rin0\lin0\itap0 \fs24\lang1049\langfe1049\cgrid\langnp1049\langfenp1049 \snext0 Normal;}{\*\cs10 \additive \ssemihidden Default Paragraph Font;}{\*\ts11\tsrowd\trftsWidthB3\trpaddl108\trpaddr108\trpaddfl3\trpaddft3\trpaddfb3\trpaddfr3\trcbpat1\trcfpat1\tscellwidthfts0\tsvertalt\tsbrdrt\tsbrdrl\tsbrdrb\tsbrdrr\tsbrdrdgl\tsbrdrdgr\tsbrdrh\tsbrdrv \ql \li0\ri0\widctlpar\aspalpha\aspnum\faauto\adjustright\rin0\lin0\itap0 \fs20\lang1024\langfe1024\cgrid\langnp1024\langfenp1024 \snext11 \ssemihidden Normal Table;}}{\*\latentstyles\lsdstimax156\lsdlockeddef0}{\*\rsidtbl \rsid2111663\rsid7154806 \rsid15558346}{\*\generator Microsoft Word 11.0.5604;}{\info{\author Programmer}{\operator Programmer}{\creatim\yr2011\mo8\dy2\hr12\min45}{\revtim\yr2011\mo8\dy5\hr12\min34}{\version3}{\edmins1}{\nofpages1}{\nofwords5}{\nofchars34}{\nofcharsws38} {\vern24689}}\margl1701\margr850\margt1134\margb1134 
\widowctrl\ftnbj\aenddoc\noxlattoyen\expshrtn\noultrlspc\dntblnsbdb\nospaceforul\hyphcaps0\horzdoc\dghspace120\dgvspace120\dghorigin1701\dgvorigin1984\dghshow0\dgvshow3 \jcompress\viewkind1\viewscale100\nolnhtadjtbl\rsidroot15558346 \fet0\sectd \linex0\sectdefaultcl\sftnbj {\*\pnseclvl1\pnucrm\pnstart1\pnindent720\pnhang {\pntxta .}}{\*\pnseclvl2\pnucltr\pnstart1\pnindent720\pnhang {\pntxta .}}{\*\pnseclvl3 \pndec\pnstart1\pnindent720\pnhang {\pntxta .}}{\*\pnseclvl4\pnlcltr\pnstart1\pnindent720\pnhang {\pntxta )}}{\*\pnseclvl5\pndec\pnstart1\pnindent720\pnhang {\pntxtb (}{\pntxta )}}{\*\pnseclvl6\pnlcltr\pnstart1\pnindent720\pnhang {\pntxtb (}{\pntxta )}} {\*\pnseclvl7\pnlcrm\pnstart1\pnindent720\pnhang {\pntxtb (}{\pntxta )}}{\*\pnseclvl8\pnlcltr\pnstart1\pnindent720\pnhang {\pntxtb (}{\pntxta )}}{\*\pnseclvl9\pnlcrm\pnstart1\pnindent720\pnhang {\pntxtb (}{\pntxta )}}\pard\plain \ql \li0\ri0\nowidctlpar\faauto\rin0\lin0\itap0 \fs24\lang1049\langfe1049\cgrid\langnp1049\langfenp1049 {\b\insrsid7154806\charrsid7154806 Line 1 \par }{\insrsid7154806 \par }{\i\insrsid7154806\charrsid7154806 Line3}{\lang1048\langfe1049\langnp1048\insrsid7154806 \par }{\lang1048\langfe1049\langnp1048\insrsid2111663 [BreakPage] \par }{\insrsid7154806 Line4 \par \par Line5 \par }}

Can anyone help me?

Thanks!


Solution

  • The problem is that RTF keeps some (but not necessarily all) of its formatting information in a global header. To split the RTF text so that each result is again valid RTF with the formatting applied, you essentially need to know where the header information is and replicate it in every one of the split parts.

    There are two ways of doing this:

    1. Write an RTF parser
    2. Use an existing RTF parser

    (1) is doable, but will take time. Luckily, RTF parsers already exist, for example this one on CodeProject.

    Alternatively, you can also load the RTF text into a RichTextBox, then search for the split text "[BreakPage]" inside the RichTextBox, programmatically select the first and second part and retrieve the RTF text using the SelectedRtf property.
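
The RichTextBox approach could look roughly like the sketch below. It is a minimal example, not a tested production routine: the class name `RtfSplitter` and the method `Split` are made up for illustration, and it assumes a project that references System.Windows.Forms. The control never has to be shown on screen; it is used only to let the RichEdit engine rebuild a complete header for each selection.

```csharp
using System;
using System.Collections.Generic;
using System.Windows.Forms;

// Hypothetical helper (name invented for this sketch).
static class RtfSplitter
{
    // Returns one complete RTF document per part of `rtf`,
    // cut at every occurrence of `marker` (the marker itself is dropped).
    public static List<string> Split(string rtf, string marker)
    {
        if (string.IsNullOrEmpty(marker))
            throw new ArgumentException("marker must not be empty", "marker");

        var parts = new List<string>();
        using (var box = new RichTextBox())
        {
            box.Rtf = rtf;                 // parse the RTF into the control
            int length = box.TextLength;
            int start = 0;

            while (true)
            {
                // Find returns -1 when the marker is not found.
                // Skip the call once we are at the end of the text,
                // because a zero-length search range makes Find
                // search the whole text again.
                int hit = start < length
                    ? box.Find(marker, start,
                               RichTextBoxFinds.MatchCase | RichTextBoxFinds.NoHighlight)
                    : -1;

                int end = hit < 0 ? length : hit;

                // Select the part and let the control emit valid RTF for it.
                box.Select(start, end - start);
                parts.Add(box.SelectedRtf);

                if (hit < 0)
                    break;
                start = hit + marker.Length;   // continue after the marker
            }
        }
        return parts;
    }
}
```

Calling `RtfSplitter.Split(File.ReadAllText("input.rtf"), "[BreakPage]")` (the file name is a placeholder) should give one RTF string per part, each of which can be written to its own file. Note that the paragraph break following the marker stays at the start of the next part, so you may want to skip it as well, and that the RTF produced by the control is regenerated by RichEdit and will not be byte-for-byte identical to Word's output.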