
doc URL cannot be read: Unable to read entire header; 6 bytes read; expected 32 bytes


I am trying to read a Word document from a web URL using POI version 3.6. Non-working code:

String url = "http://prevention.cancer.gov/sites/default/files/uploads/clinical_trial/Master-DMP-Template.doc";
InputStream inputStream = new URL(url).openStream();
HWPFDocument doc = new HWPFDocument(inputStream);
WordExtractor extractor = new WordExtractor(doc);
String text = extractor.getText();

Above code results in java.io.IOException: Unable to read entire header; 6 bytes read; expected 32 bytes

Attempt 2: interestingly, downloading the file manually (by pasting the URL into the browser address bar) and then running similar code against the local copy does work:

InputStream inputStream = new FileInputStream("C:\\Users\\me\\Downloads\\Master-DMP-Template (2).doc");
HWPFDocument doc = new HWPFDocument(inputStream);
WordExtractor extractor = new WordExtractor(doc);
System.out.println(extractor.getText());

Attempt 3: and now the strangest part. I assumed the file had to be downloaded first, so I downloaded it using Java and then ran the previous code against the local copy. It fails just like the first case!

final String url = "http://prevention.cancer.gov/sites/default/files/uploads/clinical_trial/Master-DMP-Template.doc";
String localPath  = FileUtils.downloadFile("C:\\Users\\me\\Downloads", url);
InputStream inputStream = new FileInputStream(localPath);
HWPFDocument doc = new HWPFDocument(inputStream);
WordExtractor extractor = new WordExtractor(doc);
System.out.println(extractor.getText());

public static String downloadFile(String targetDir, String sourceUrl) throws IOException {
    sourceUrl = StringUtils.removeEnd(sourceUrl, "/");
    String fileName = sourceUrl.substring(sourceUrl.lastIndexOf("/") + 1);
    String targetPath = targetDir + FileUtils.SEPARATOR + fileName;
    // try-with-resources closes the network stream even if the copy fails
    try (InputStream in = new URL(sourceUrl).openStream()) {
        Files.copy(in, Paths.get(targetPath), StandardCopyOption.REPLACE_EXISTING);
    }
    System.out.println("Downloaded " + sourceUrl + " to " + targetPath);
    return targetPath;
}

Any idea what is going on here?

An update: I created a separate project to try POI 4.1.0. The same code (from the first attempt) results in org.apache.poi.EmptyFileException: The supplied file was empty (zero bytes long)

I tried pasting the URL in the browser after hitting F12 and observing the Network tab. The message that appears there is: Resource interpreted as Document but transferred with MIME type application/msword: "https://prevention.cancer.gov/sites/default/files/uploads/clinical_trial/Master-DMP-Template.doc".

I am still stuck...

An update: as https://stackoverflow.com/users/3915431/axel-richter pointed out, there is a 301 redirect to https://prevention.cancer.gov/sites/default/files/uploads/clinical_trial/Master-DMP-Template.doc . However, now I am running into strange problems that are not related to Word. The following code fails:

public static void main(String[] args) {
    try {
        if (args.length > 0 && args[0].equals("disableCertValidation")) {
            SSLUtil.disableCertificateValidation(); // redirect is https
        }
        final String stringURL = "https://prevention.cancer.gov/sites/default/files/uploads/clinical_trial/Master-DMP-Template.doc";
        URL url = new URL(stringURL);
        HttpURLConnection con = (HttpURLConnection) url.openConnection();
        int responseCode = con.getResponseCode();
        System.out.println("Response code: " + responseCode); //301 Moved Permanently
        InputStream in = con.getInputStream();
        HWPFDocument doc = new HWPFDocument(in);
        WordExtractor extractor = new WordExtractor(doc);
        String text = extractor.getText();
        System.out.println(text);
        in.close();
    } catch (IOException e) {
        e.printStackTrace();
    }
}

When running main without an argument, the line

int responseCode = con.getResponseCode();

fails with the following exception: javax.net.ssl.SSLHandshakeException: sun.security.validator.ValidatorException: PKIX path building failed: sun.security.provider.certpath.SunCertPathBuilderException: unable to find valid certification path to requested target

When running the code with the disableCertValidation argument, the response code is 404 and I get the following exception:

java.io.FileNotFoundException: https://prevention.cancer.gov/sites/default/files/uploads/clinical_trial/Master-DMP-Template.doc
    at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
    at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
    at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
    at java.lang.reflect.Constructor.newInstance(Constructor.java:422)
    at sun.net.www.protocol.http.HttpURLConnection$10.run(HttpURLConnection.java:1890)
    at sun.net.www.protocol.http.HttpURLConnection$10.run(HttpURLConnection.java:1885)
    at java.security.AccessController.doPrivileged(Native Method)
    at sun.net.www.protocol.http.HttpURLConnection.getChainedException(HttpURLConnection.java:1884)
    at sun.net.www.protocol.http.HttpURLConnection.getInputStream0(HttpURLConnection.java:1457)
    at sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1441)
    at sun.net.www.protocol.https.HttpsURLConnectionImpl.getInputStream(HttpsURLConnectionImpl.java:254)
    at com.keywords.control.util.TestHTMLParser.main(TestHTMLParser.java:472)
Caused by: java.io.FileNotFoundException: https://prevention.cancer.gov/sites/default/files/uploads/clinical_trial/Master-DMP-Template.doc
    at sun.net.www.protocol.http.HttpURLConnection.getInputStream0(HttpURLConnection.java:1836)
    at sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1441)
    at java.net.HttpURLConnection.getResponseCode(HttpURLConnection.java:480)
    at sun.net.www.protocol.https.HttpsURLConnectionImpl.getResponseCode(HttpsURLConnectionImpl.java:338)
    at com.keywords.control.util.TestHTMLParser.main(TestHTMLParser.java:470)

Any ideas?


Solution

  • The initial HTTP request to your URL is answered with a redirect, 301 Moved Permanently. So we need to handle this and read the document from the new location.

    Complete example:

    import java.io.InputStream;
    import java.net.URL;
    import java.net.HttpURLConnection;
    
    import org.apache.poi.hwpf.HWPFDocument;
    import org.apache.poi.hwpf.extractor.WordExtractor;
    
    public class OpenHWPFFromURL {
    
     public static void main(String[] args) throws Exception {
    
      String stringURL = "http://prevention.cancer.gov/sites/default/files/uploads/clinical_trial/Master-DMP-Template.doc";
    
      URL url = new URL(stringURL);
      HttpURLConnection con = (HttpURLConnection)url.openConnection();
    
      int responseCode = con.getResponseCode();
      System.out.println(responseCode); //301 Moved Permanently
    
      if (responseCode != HttpURLConnection.HTTP_OK) {
       if (responseCode == HttpURLConnection.HTTP_MOVED_TEMP
           || responseCode == HttpURLConnection.HTTP_MOVED_PERM
           || responseCode == HttpURLConnection.HTTP_SEE_OTHER) {
        url = new URL(con.getHeaderField("Location")); //get new location
        con = (HttpURLConnection)url.openConnection();
       }   
      }
    
      InputStream in = con.getInputStream();
      HWPFDocument doc = new HWPFDocument(in);
      WordExtractor extractor = new WordExtractor(doc);
      String text = extractor.getText();
    
      System.out.println(text);
    
     }
    }
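
The "Unable to read entire header" and "supplied file was empty" errors in the question suggest that POI was handed a stream that ended long before a full Word header, presumably the body of the redirect response rather than the document itself. As a diagnostic, one can look at the first bytes of whatever the server returned: a binary .doc is an OLE2 compound file and starts with the 8-byte signature D0 CF 11 E0 A1 B1 1A E1. The following is a minimal sketch under that assumption; the class name Ole2HeaderCheck and its two helper methods are hypothetical and not part of POI.

```java
import java.io.IOException;
import java.io.InputStream;
import java.util.Arrays;

public class Ole2HeaderCheck {

    // Signature at the start of every OLE2 compound file (the container of binary .doc files).
    static final byte[] OLE2_MAGIC = {
        (byte) 0xD0, (byte) 0xCF, 0x11, (byte) 0xE0,
        (byte) 0xA1, (byte) 0xB1, 0x1A, (byte) 0xE1
    };

    /** Returns true if the given bytes start with the OLE2 signature. */
    static boolean looksLikeOle2(byte[] head) {
        if (head.length < OLE2_MAGIC.length) {
            return false; // too short to be a real document, e.g. an empty redirect body
        }
        for (int i = 0; i < OLE2_MAGIC.length; i++) {
            if (head[i] != OLE2_MAGIC[i]) {
                return false;
            }
        }
        return true;
    }

    /** Reads up to n bytes from the stream; returns fewer if the stream ends early. */
    static byte[] readHead(InputStream in, int n) throws IOException {
        byte[] buf = new byte[n];
        int total = 0;
        while (total < n) {
            int r = in.read(buf, total, n - total);
            if (r < 0) {
                break; // end of stream reached before n bytes were available
            }
            total += r;
        }
        return Arrays.copyOf(buf, total);
    }
}
```

Note that readHead consumes the bytes it inspects, so this is meant for a separate diagnostic request (or a downloaded file), not for the same stream that is later passed to HWPFDocument.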
    

    Note: Simply setting HttpURLConnection.setFollowRedirects to true (which is also the default) will not help if the redirect changes the protocol (from HTTP to HTTPS, for example). That is exactly the case here, so we need to read the new location manually, as shown in my code.
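
The manual redirect handling can also be generalized a little. The following is a sketch, not a definitive implementation: it follows several redirects in a row, resolves a relative Location header against the URL that produced it, and stops after a fixed number of hops. The class name RedirectFollower, the limit MAX_REDIRECTS, and the inclusion of status codes 307 and 308 are assumptions of this sketch and do not come from the code above. It also assumes every hop uses http or https; a redirect to another scheme would make the cast to HttpURLConnection fail.

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.URL;

public class RedirectFollower {

    // Hypothetical cap on redirect hops, to avoid looping forever on a redirect cycle.
    static final int MAX_REDIRECTS = 5;

    /** Returns true for status codes that redirect via a Location header. */
    static boolean isRedirect(int code) {
        return code == HttpURLConnection.HTTP_MOVED_PERM  // 301
            || code == HttpURLConnection.HTTP_MOVED_TEMP  // 302
            || code == HttpURLConnection.HTTP_SEE_OTHER   // 303
            || code == 307                                // Temporary Redirect
            || code == 308;                               // Permanent Redirect
    }

    /** Resolves a Location header value (absolute or relative) against the current URL. */
    static URL resolve(URL base, String location) throws IOException {
        return new URL(base, location);
    }

    /** Opens the URL and follows redirects manually, including HTTP to HTTPS. */
    static InputStream openFollowingRedirects(String start) throws IOException {
        URL url = new URL(start);
        for (int hop = 0; hop <= MAX_REDIRECTS; hop++) {
            HttpURLConnection con = (HttpURLConnection) url.openConnection();
            con.setInstanceFollowRedirects(false); // every hop is handled in this loop
            int code = con.getResponseCode();
            if (!isRedirect(code)) {
                return con.getInputStream(); // throws IOException for 4xx/5xx responses
            }
            String location = con.getHeaderField("Location");
            con.disconnect();
            if (location == null) {
                throw new IOException("Redirect without Location header from " + url);
            }
            url = resolve(url, location);
        }
        throw new IOException("Too many redirects starting at " + start);
    }
}
```

With this helper, the document could be opened as new HWPFDocument(RedirectFollower.openFollowingRedirects(stringURL)). Certificate problems on the HTTPS hop, such as the SSLHandshakeException from the question, are not addressed by this sketch.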