
Extract bullets from a Word document using Aspose.Words in C#


I need to extract the text that has bullet styling from a Word document in C#. I am using the Aspose.Words library, but a solution with a different library is also welcome. I can already load documents and extract the text that has Heading 1 styling, but when I try the same with the bullet styling I get nothing.

I am using the code below to get the text with Heading1 styling and that works.

var heading1 = doc
    .GetChildNodes(NodeType.Paragraph, true)
    .Cast<Aspose.Words.Paragraph>()
    .ToArray()
    .Where(p => p.ParagraphFormat.StyleIdentifier == StyleIdentifier.Heading1);
    
foreach (var head1 in heading1)
{
    listBox11.Items.Add(head1.GetText().Trim());
}

I am trying to use the code below to get the text with bullet styling and this does NOT work.

var bullets = doc
    .GetChildNodes(NodeType.Paragraph, true)
    .Cast<Aspose.Words.Paragraph>()
    .ToArray()
    .Where(p => p.ParagraphFormat.StyleIdentifier == StyleIdentifier.ListBullet);
    
foreach (var bullet in bullets)
{
    listBox19.Items.Add(bullet.GetText().ToString());
}
    

I also tried the other ListBullet style identifiers (ListBullet2 through ListBullet5), but that did not fix the problem either.


Solution

  • I am now using this to successfully extract the list items from a Word file and put them into a list box.

    string fileName = listBox1.Items.Cast<string>().FirstOrDefault();

    // Open the document.
    Document doc = new Document(fileName);

    // Update the list labels for all list items in the document.
    doc.UpdateListLabels();

    NodeCollection paras = doc.GetChildNodes(NodeType.Paragraph, true);

    // Select every paragraph that is a list item, regardless of its paragraph style.
    foreach (Aspose.Words.Paragraph paragraph in paras.OfType<Aspose.Words.Paragraph>().Where(p => p.ListFormat.IsListItem))
    {
        // Export the paragraph as plain text and trim any paragraph formatting characters.
        // The exported text here starts with the bullet character, which is stripped below.
        string paragraphText = paragraph.ToString(SaveFormat.Text).Trim();

        // Remove the bullet character and the space in front of the text.
        // The length check avoids an exception on very short or empty list items.
        string bullet = paragraphText.Length >= 2 ? paragraphText.Remove(0, 2) : paragraphText;

        listBox19.Items.Add(bullet);
    }
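
A likely reason the StyleIdentifier.ListBullet filter returns nothing is that bullets applied with Word's bullet button are list formatting, not the built-in "List Bullet" paragraph style, so the style identifier never matches. The sketch below is one possible variation on the solution above, not a definitive implementation. It keeps only bulleted paragraphs by checking the list level's number style, and it reads the text with GetText(), which does not include the list label, so no characters need to be stripped. The class BulletExtractor and its methods BuildSampleDocument and ExtractBullets are hypothetical names made up for this example, and the sample document is built in memory only so that the sketch runs without a file.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using Aspose.Words;

public static class BulletExtractor
{
    // Hypothetical helper: builds a small in-memory document with two bulleted
    // paragraphs followed by one ordinary paragraph, so the sketch needs no file.
    public static Document BuildSampleDocument()
    {
        Document doc = new Document();
        DocumentBuilder builder = new DocumentBuilder(doc);

        builder.ListFormat.ApplyBulletDefault();
        builder.Writeln("First item");
        builder.Writeln("Second item");
        builder.ListFormat.RemoveNumbers();
        builder.Writeln("Not a list item");

        return doc;
    }

    // Hypothetical helper: returns the text of every bulleted paragraph.
    public static List<string> ExtractBullets(Document doc)
    {
        return doc
            .GetChildNodes(NodeType.Paragraph, true)
            .OfType<Paragraph>()
            // Keep list items only, whatever paragraph style they use.
            .Where(p => p.ListFormat.IsListItem)
            // Keep bullets and skip numbered lists.
            .Where(p => p.ListFormat.ListLevel.NumberStyle == NumberStyle.Bullet)
            // GetText() does not include the list label, so only the
            // trailing paragraph break character has to be trimmed.
            .Select(p => p.GetText().Trim())
            .ToList();
    }

    public static void Main()
    {
        foreach (string item in ExtractBullets(BuildSampleDocument()))
        {
            Console.WriteLine(item);
        }
    }
}
```

In the WinForms code above, the same filter could replace the foreach condition, with listBox19.Items.Add(...) called on each returned string. Dropping the NumberStyle check would return numbered list items as well.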