How to extract only the text from a .pptx file using Apache Tika


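import java.io.BufferedInputStream;
import java.io.File;
import java.io.FileInputStream;
import java.io.InputStream;

import org.apache.tika.config.TikaConfig;
import org.apache.tika.detect.Detector;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.mime.MediaType;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.Parser;
import org.apache.tika.sax.BodyContentHandler;
import org.xml.sax.ContentHandler;

public class Test {

    public static void main(String[] args) throws Exception {
        TikaConfig tikaConfig = TikaConfig.getDefaultConfig();
        Metadata metadata = new Metadata();
        // BufferedInputStream supports mark/reset, which the detector needs
        try (InputStream stream = new BufferedInputStream(new FileInputStream(new File("E:\\AllTypes\\PPT\\Presentation1.pptx")))) {
            Detector detector = tikaConfig.getDetector();
            Parser parser = tikaConfig.getParser();
            // Detect the media type and pass it to the parser as a hint
            MediaType type = detector.detect(stream, metadata);
            metadata.set(Metadata.CONTENT_TYPE, type.toString());
            // -1 disables the handler's write limit
            ContentHandler handler = new BodyContentHandler(-1);
            parser.parse(stream, handler, metadata, new ParseContext());
            String data = handler.toString();
            System.out.println(data);
        }
    }
}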

The input presentation contains only the text "Hello world!", so I want the output to be just "Hello world!". Instead, the output is:

[Content_Types].xml

_rels/.rels

ppt/slides/_rels/slide1.xml.rels

ppt/_rels/presentation.xml.rels

ppt/presentation.xml

ppt/slides/slide1.xml Hello world!

ppt/slideLayouts/_rels/slideLayout6.xml.rels

ppt/slideLayouts/_rels/slideLayout7.xml.rels

ppt/slideLayouts/_rels/slideLayout9.xml.rels

ppt/slideLayouts/_rels/slideLayout10.xml.rels

ppt/slideLayouts/_rels/slideLayout8.xml.rels

ppt/slideLayouts/_rels/slideLayout11.xml.rels

ppt/slideLayouts/_rels/slideLayout1.xml.rels

ppt/slideLayouts/_rels/slideLayout2.xml.rels

ppt/slideLayouts/_rels/slideLayout3.xml.rels

ppt/slideLayouts/_rels/slideLayout4.xml.rels

ppt/slideMasters/_rels/slideMaster1.xml.rels

ppt/slideLayouts/slideLayout11.xml Click to edit Master title style Click to edit Master text styles Second level Third level Fourth level Fifth level 1/30/2018 ‹#›

ppt/slideLayouts/slideLayout10.xml Click to edit Master title style Click to edit Master text styles Second level Third level Fourth level Fifth level 1/30/2018 ‹#›

ppt/slideLayouts/slideLayout3.xml Click to edit Master title style Click to edit Master text styles 1/30/2018 ‹#›

ppt/slideLayouts/slideLayout2.xml Click to edit Master title style Click to edit Master text styles Second level Third level Fourth level Fifth level 1/30/2018 ‹#›

ppt/slideLayouts/slideLayout1.xml Click to edit Master title style Click to edit Master subtitle style 1/30/2018 ‹#›

ppt/slideMasters/slideMaster1.xml Click to edit Master title style Click to edit Master text styles Second level Third level Fourth level Fifth level 1/30/2018 ‹#›

ppt/slideLayouts/slideLayout4.xml Click to edit Master title style Click to edit Master text styles Second level Third level Fourth level Fifth level Click to edit Master text styles Second level Third level Fourth level Fifth level 1/30/2018 ‹#›

ppt/slideLayouts/slideLayout5.xml Click to edit Master title style Click to edit Master text styles Click to edit Master text styles Second level Third level Fourth level Fifth level Click to edit Master text styles Click to edit Master text styles Second level Third level Fourth level Fifth level 1/30/2018 ‹#›

ppt/slideLayouts/slideLayout6.xml Click to edit Master title style 1/30/2018 ‹#›

ppt/slideLayouts/slideLayout7.xml 1/30/2018 ‹#›

ppt/slideLayouts/slideLayout8.xml Click to edit Master title style Click to edit Master text styles Second level Third level Fourth level Fifth level Click to edit Master text styles 1/30/2018 ‹#›

ppt/slideLayouts/slideLayout9.xml Click to edit Master title style Click to edit Master text styles 1/30/2018 ‹#›

ppt/slideLayouts/_rels/slideLayout5.xml.rels

ppt/theme/theme1.xml

docProps/thumbnail.jpeg

ppt/presProps.xml

ppt/tableStyles.xml

ppt/viewProps.xml

docProps/core.xml PowerPoint Presentation srinuk srinuk 1 2018-01-30T10:19:34Z 2018-01-30T10:22:05Z

docProps/app.xml 2 3 Microsoft Office PowerPoint Widescreen 1 1 0 0 0 false Fonts Used 3 Theme 1 Slide Titles 1 Arial Calibri Calibri Light Office Theme PowerPoint Presentation false false false 15.0000


Solution

  • You can try tika-app.jar. Alternatively, just use the text extraction function of the Tika facade:

    Tika tika = new Tika();
    File file = new File("path");
    String str = tika.parseToString(file);

    This code extracts only the text content of the file.
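
If the result still looks like the listing in the question (one entry per package part, which suggests the file was handled as a generic ZIP package rather than as a presentation), a stopgap is to post-process the string and keep only the slide parts. The sketch below uses only the JDK. The class name SlideTextFilter and the method slideText are hypothetical, and it assumes each part name and its text sit on the same line, as in the listing above; real output may put them on separate lines, in which case the pattern needs adjusting.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Apache Tika.
class SlideTextFilter {

    // Matches lines such as "ppt/slides/slide1.xml Hello world!" and captures
    // the text after the part name. Relationship parts (ppt/slides/_rels/...),
    // layouts and masters do not match.
    private static final Pattern SLIDE_LINE =
            Pattern.compile("^ppt/slides/slide\\d+\\.xml(?:\\s+(.*))?$");

    // Returns the text of every slide part found in the raw output, in order.
    static List<String> slideText(String rawOutput) {
        List<String> result = new ArrayList<>();
        for (String line : rawOutput.split("\\R")) {
            Matcher m = SLIDE_LINE.matcher(line.trim());
            if (m.matches() && m.group(1) != null && !m.group(1).isEmpty()) {
                result.add(m.group(1).trim());
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // A shortened copy of the output shown in the question
        String raw = String.join("\n",
                "[Content_Types].xml",
                "",
                "ppt/slides/_rels/slide1.xml.rels",
                "",
                "ppt/slides/slide1.xml Hello world!",
                "",
                "ppt/slideLayouts/slideLayout6.xml Click to edit Master title style 1/30/2018");
        System.out.println(slideText(raw)); // prints [Hello world!]
    }
}
```

This only filters text that has already been extracted; it does not change how the file is parsed, so it is best treated as a fallback rather than a fix.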