c# .net openxml diagram presentation

Extract text from SmartArt in Presentation OpenXml


I need to extract all the text from a PowerPoint Open XML file (.pptx) in .NET (VB.NET or C#). As you know, a .pptx is a zip archive with several folders inside. I'm using OpenXmlPowerTools.

I managed to extract the text from the slides folder (the slide parts), but I noticed that some of the text in my sample presentation is not being extracted. That text belongs to SmartArt and is not in the slide XML files. I know it is in the "/diagrams" folder of the .pptx, in the file "data1.xml", but I don't know how to reach that part.

For example, to get the files that hold the slide text I go through OpenXmldocument.PresentationPart.SlideParts and then parse the XML myself. What I don't know is the name of the part that corresponds to the "diagrams" folder.

Could you help me? Thank you in advance.

XML sample of the SmartArt:

    <dgm:pt type="doc" modelId="{8B66E7B5-44C3-4667-8338-7D4A025BD5D4}">
      <dgm:prSet phldr="1" csCatId="accent2" csTypeId="urn:microsoft.com/office/officeart/2005/8/colors/accent2_3" qsCatId="simple" qsTypeId="urn:microsoft.com/office/officeart/2005/8/quickstyle/simple5" loCatId="process" loTypeId="urn:microsoft.com/office/officeart/2005/8/layout/hProcess11"/>
      <dgm:spPr/>
      <dgm:t>
        <a:bodyPr/>
        <a:lstStyle/>
        <a:p>
          <a:endParaRPr lang="pt-PT"/>
        </a:p>
      </dgm:t>
    </dgm:pt>

    <dgm:pt modelId="{2E9EC466-8028-4506-882E-42A5A1CC8163}">
      <dgm:prSet custT="1" phldrT="[Texto]"/>
      <dgm:spPr/>
      <dgm:t>
        <a:bodyPr anchor="t"/>
        <a:lstStyle/>
        <a:p>
          <a:r>
            <a:rPr lang="pt-PT" smtClean="0" dirty="0" sz="1200">
              <a:latin typeface="+mn-lt"/>
              <a:cs typeface="Arial" charset="0" pitchFamily="34"/>
            </a:rPr>
            <a:t>Apresentação Conceptual - Plano</a:t>
          </a:r>
          <a:endParaRPr lang="pt-PT" dirty="0" sz="1200">
            <a:latin typeface="+mn-lt"/>
            <a:cs typeface="Arial" charset="0" pitchFamily="34"/>
          </a:endParaRPr>
        </a:p>
      </dgm:t>
    </dgm:pt>
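
Since the question mentions parsing the XML by hand, here is a minimal sketch of how the text could be pulled out of a fragment like the one above with LINQ to XML, using only the .NET base class library. The names `DiagramTextSketch`, `ExtractParagraphs` and `SampleXml` are made up for this illustration, and the sample string is a trimmed-down stand-in for data1.xml (wrapped in the `dgm:dataModel` / `dgm:ptLst` elements that normally surround the points), not the original file. It joins the `a:t` runs of each paragraph and drops paragraphs that carry no text, such as the `type="doc"` point.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;

static class DiagramTextSketch
{
    // Namespace URIs of the DrawingML diagram and main schemas.
    static readonly XNamespace Dgm = "http://schemas.openxmlformats.org/drawingml/2006/diagram";
    static readonly XNamespace A = "http://schemas.openxmlformats.org/drawingml/2006/main";

    // Trimmed-down stand-in for data1.xml, based on the sample in the question.
    public const string SampleXml = @"<dgm:dataModel
    xmlns:dgm=""http://schemas.openxmlformats.org/drawingml/2006/diagram""
    xmlns:a=""http://schemas.openxmlformats.org/drawingml/2006/main"">
  <dgm:ptLst>
    <dgm:pt modelId=""{8B66E7B5-44C3-4667-8338-7D4A025BD5D4}"" type=""doc"">
      <dgm:t><a:bodyPr/><a:lstStyle/><a:p><a:endParaRPr lang=""pt-PT""/></a:p></dgm:t>
    </dgm:pt>
    <dgm:pt modelId=""{2E9EC466-8028-4506-882E-42A5A1CC8163}"">
      <dgm:t><a:bodyPr anchor=""t""/><a:lstStyle/>
        <a:p><a:r><a:rPr lang=""pt-PT"" sz=""1200""/><a:t>Apresentação Conceptual - Plano</a:t></a:r></a:p>
      </dgm:t>
    </dgm:pt>
  </dgm:ptLst>
</dgm:dataModel>";

    // Returns one string per non-empty <a:p>, concatenating the <a:t> runs inside it.
    public static List<string> ExtractParagraphs(string dataModelXml)
    {
        return XDocument.Parse(dataModelXml)
            .Descendants(Dgm + "pt")                        // every point of the diagram
            .SelectMany(pt => pt.Descendants(A + "p"))      // its paragraphs
            .Select(p => string.Concat(p.Descendants(A + "t").Select(t => t.Value)))
            .Where(s => s.Length > 0)                       // drop paragraphs without text
            .ToList();
    }

    static void Main()
    {
        foreach (var line in ExtractParagraphs(SampleXml))
        {
            Console.WriteLine(line);
        }
    }
}
```

For the sample string this yields a single entry, "Apresentação Conceptual - Plano". With a real presentation the XML would come from the diagram data part rather than from a string constant.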


Solution

  • I had the same problem and found your question. The SmartArt text is reachable through each slide's DiagramDataParts; this is what I ended up with:

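    using System;
    using System.Linq;
    using DocumentFormat.OpenXml.Packaging;
    using A = DocumentFormat.OpenXml.Drawing;   // alias keeps Run/Text unambiguous

    class Program
    {
        static void Main(string[] args)
        {
            // Open read-only (second argument false); nothing is modified here.
            using (var p = PresentationDocument.Open(@"SmartArt.pptx", false))
            {
                // Only look at slides that are not hidden.
                foreach (var slide in p.PresentationPart.GetPartsOfType<SlidePart>().Where(sp => IsVisible(sp)))
                {
                    // Each SmartArt graphic keeps its text in a diagram data part (the dataN.xml files).
                    foreach (var diagramPart in slide.DiagramDataParts)
                    {
                        // The text sits in <a:t> inside <a:r> runs; runs without <a:t> are skipped.
                        foreach (var text in diagramPart.RootElement.Descendants<A.Run>()
                                     .Where(r => r.Text != null).Select(r => r.Text.Text))
                        {
                            Console.WriteLine(text);
                        }
                    }
                }
            }

            Console.ReadLine();
        }

        // A slide counts as visible unless its "show" attribute is explicitly false.
        private static bool IsVisible(SlidePart s)
        {
            return (s.Slide != null) &&
              ((s.Slide.Show == null) || (s.Slide.Show.HasValue &&
              s.Slide.Show.Value));
        }
    }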