
How to avoid adding double quotes to the metadata keywords in PDF file using iTextSharp in C#?


Using the iTextSharp Library I'm able to insert metadata in a PDF file using the various Schemas.

For my purposes, each keyword in the keywords metadata is enclosed in double quotes, and the keywords are separated by commas. Once the script I've written runs, however, the keywords end up enclosed in triple quotes.

Any ideas on how to avoid this or any advice on working with XMP?

Example of required metadata : "keyword1","keyword2","keyword3"

Example of current metadata : """keyword1"",""keyword2"",""keyword3"""

Code:

string[] fields = meta_line.Split(',');            // split the line once instead of seven times
string _keywords = string.Join(",", fields, 1, 7); // fields 1 through 7, comma-separated
            _keywords = _keywords.Replace('~', ',');

            Console.WriteLine(metaFile);

            foreach (string inputFile in Directory.GetFiles(source, "*.pdf", SearchOption.TopDirectoryOnly))
            {
                if (Path.GetFileName(metaFile) == Path.GetFileName(inputFile))
                {
                    string outputFile = source + @"\output\" + Path.GetFileName(inputFile);
                    PdfReader reader = new PdfReader(inputFile);

                    using (FileStream fs = new FileStream(outputFile, FileMode.Create, FileAccess.Write, FileShare.None))
                    {

                        PdfStamper stamper = new PdfStamper(reader, fs);
                        Dictionary<String, String> info = reader.Info;
                        stamper.MoreInfo = info;

                        PdfWriter writer = stamper.Writer;

                        byte[] buffer = new byte[65536];

                        System.IO.MemoryStream ms = new System.IO.MemoryStream(buffer, true);
                        try
                        {
                            iTextSharp.text.xml.xmp.XmpSchema dc = new iTextSharp.text.xml.xmp.DublinCoreSchema();

                            dc.SetProperty(iTextSharp.text.xml.xmp.DublinCoreSchema.TITLE, new iTextSharp.text.xml.xmp.LangAlt(_title));

                            iTextSharp.text.xml.xmp.XmpArray subject = new iTextSharp.text.xml.xmp.XmpArray(iTextSharp.text.xml.xmp.XmpArray.ORDERED);
                            subject.Add(_subject);
                            dc.SetProperty(iTextSharp.text.xml.xmp.DublinCoreSchema.SUBJECT, subject);

                            iTextSharp.text.xml.xmp.XmpArray author = new iTextSharp.text.xml.xmp.XmpArray(iTextSharp.text.xml.xmp.XmpArray.ORDERED);
                            author.Add(_author);
                            dc.SetProperty(iTextSharp.text.xml.xmp.DublinCoreSchema.CREATOR, author);

                            PdfSchemaAdvanced pdf = new PdfSchemaAdvanced();

                            pdf.AddKeywords(_keywords);


                            iTextSharp.text.xml.xmp.XmpWriter xmp = new iTextSharp.text.xml.xmp.XmpWriter(ms);
                            xmp.AddRdfDescription(dc);
                            xmp.AddRdfDescription(pdf);
                            xmp.Close();

                            int bufsize = buffer.Length;
                            int bufcount = 0;
                            foreach (byte b in buffer)
                            {
                                if (b == 0) break;
                                bufcount++;
                            }
                            System.IO.MemoryStream ms2 = new System.IO.MemoryStream(buffer, 0, bufcount);
                            buffer = ms2.ToArray();

                            foreach (char buff in buffer)
                            {
                                Console.Write(buff);
                            }
                            writer.XmpMetadata = buffer;
                        }
                        catch (Exception ex)
                        {
                            throw; // rethrow without resetting the stack trace
                        }
                        finally
                        {
                            ms.Close();
                            ms.Dispose();
                        }

                        stamper.Close();
                     // writer.Close();

                    }

                    reader.Close();
                }
            }

The approach below didn't add any metadata at all, and I'm not sure why (point 3 in the comments):

iTextSharp.text.xml.xmp.XmpArray keywords = new iTextSharp.text.xml.xmp.XmpArray(iTextSharp.text.xml.xmp.XmpArray.ORDERED);
                            keywords.Add("keyword1");
                            keywords.Add("keyword2");
                            keywords.Add("keyword3");


                            pdf.SetProperty(iTextSharp.text.xml.xmp.PdfSchema.KEYWORDS, keywords);

Solution

  • I don't have the newest iTextSharp version at hand; I have iTextSharp 5.1.1.0. It does not contain a PdfSchemaAdvanced class, but it does have PdfSchema and its base class XmpSchema. I bet the PdfSchemaAdvanced in your library also derives from XmpSchema.

    PdfSchema.AddKeywords does only one thing:

    base["pdf:Keywords"] = keywords;
    

    and the setter of the XmpSchema indexer in turn does:

    base[key] = XmpSchema.Escape(value);
    

    so it's very clear that the value is being, well, 'escaped', to ensure that special characters do not interfere with the storage format.

    Now, the Escape function, from what I see, simply scans the string character by character and performs these substitutions:

    " -> &quot;
    & -> &amp;
    ' -> &apos;
    < -> &lt;
    > -> &gt;
    

    and that's all. It looks like typical HTML-entity processing, at least in my version of the library. So it would not duplicate the quotes, just change their encoding.

    Then, AddRdfDescription seems to simply iterate over the stored keys and wrap each value in tags with no further processing. So it would emit a value like this:

    Escaped"Contents&OfThis"Key
    

    as:

    <pdf:Keywords>Escaped&quot;Contents&amp;OfThis&quot;Key</pdf:Keywords>
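
    To make the substitution concrete, here is a minimal sketch of the escaping described above. It is a hypothetical re-implementation based only on the five replacements listed, not the library's actual source, and the class name XmpEscapeSketch is made up for illustration.

    ```csharp
    using System;
    using System.Text;

    public static class XmpEscapeSketch
    {
        // Hypothetical stand-in for XmpSchema.Escape: applies the five
        // substitutions listed above, one character at a time.
        public static string Escape(string content)
        {
            var sb = new StringBuilder(content.Length);
            foreach (char c in content)
            {
                switch (c)
                {
                    case '"': sb.Append("&quot;"); break;
                    case '&': sb.Append("&amp;"); break;
                    case '\'': sb.Append("&apos;"); break;
                    case '<': sb.Append("&lt;"); break;
                    case '>': sb.Append("&gt;"); break;
                    default: sb.Append(c); break;
                }
            }
            return sb.ToString();
        }

        public static void Main()
        {
            string keywords = "\"keyword1\",\"keyword2\",\"keyword3\"";
            Console.WriteLine(Escape(keywords));
            // prints: &quot;keyword1&quot;,&quot;keyword2&quot;,&quot;keyword3&quot;
        }
    }
    ```

    Each quote is replaced by exactly one entity, so escaping of this kind cannot by itself turn one quote into two or three.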
    

    Aside from the AddKeywords method, you should also see an AddProperty method. It acts like AddKeywords, except that it takes the key as a parameter and does not Escape() its input value.

    So, if you are perfectly sure that your _keywords are formatted properly, you might try:

    AddProperty("pdf:Keywords", _keywords)
    

    but I discourage you from doing that. At least in my version of iTextSharp, the library seems to process the keywords properly and format them safely as RDF.

    Heh, you may also try using the PdfSchema class that I just checked instead of the Advanced one. I bet it still is present in the library.

    But, in general, I think the problem lies elsewhere.

    Double- or triple-check the contents of the _keywords variable, and then also check the raw contents of the generated PDF. Open it in a hex editor or a plain-text editor like Notepad and look for the <pdf:Keywords> tag. Check what it actually contains. It might be all OK, and it might be your PDF metadata reader that adds those quotes.
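
    Both checks can be roughly sketched in code. The helper below is hypothetical (the names KeywordsDiagnostics, ExtractKeywordsTag and CsvQuote are made up) and uses only the .NET base library. ExtractKeywordsTag assumes the XMP packet is stored uncompressed in the file, which is common but not guaranteed. CsvQuote is included only as a guess at the cause: the tripled quotes happen to be exactly what CSV-style quoting produces, so the extra quotes may come from whatever writes the metadata file or displays the result rather than from iTextSharp.

    ```csharp
    using System;
    using System.IO;
    using System.Text;
    using System.Text.RegularExpressions;

    public static class KeywordsDiagnostics
    {
        // Returns the raw text between <pdf:Keywords> and </pdf:Keywords>,
        // or null when no such tag is found in the bytes.
        public static string ExtractKeywordsTag(byte[] pdfBytes)
        {
            // ISO-8859-1 maps each byte to exactly one char, so binary data decodes without errors.
            string raw = Encoding.GetEncoding("ISO-8859-1").GetString(pdfBytes);
            Match m = Regex.Match(raw, "<pdf:Keywords>(.*?)</pdf:Keywords>", RegexOptions.Singleline);
            return m.Success ? m.Groups[1].Value : null;
        }

        // Quotes a value the way CSV writers commonly do:
        // double every embedded quote, then wrap the whole value in quotes.
        public static string CsvQuote(string value)
        {
            return "\"" + value.Replace("\"", "\"\"") + "\"";
        }

        public static void Main(string[] args)
        {
            string wanted = "\"keyword1\",\"keyword2\",\"keyword3\"";
            Console.WriteLine(CsvQuote(wanted));
            // prints: """keyword1"",""keyword2"",""keyword3"""

            if (args.Length == 1)
            {
                string tag = ExtractKeywordsTag(File.ReadAllBytes(args[0]));
                Console.WriteLine(tag ?? "(no <pdf:Keywords> tag found)");
            }
        }
    }
    ```

    If the tag in the file holds the &quot;-escaped form of the required value, the PDF is fine and the reader is adding the quotes. If _keywords already holds the tripled form before AddKeywords is called, the input line is the place to look.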