java kotlin apache-poi pdfbox http4k

Delete metadata from doc, docx, and pdf files stored as blobs on GCS (Google Cloud Storage)


Current problem: I have a GCS setup to which I upload files such as doc, docx, and pdf. The default metadata is uploaded along with each file. The files are stored as Blobs. When I access a file, I get an InputStream, from which I cannot delete the metadata directly.

What I want: to delete the default metadata (which may reveal personal information about the users who uploaded the files) while uploading the file to, or downloading it from, the GCS server.

What problem am I facing? When downloading, the file arrives as a Blob, or I get it as an InputStream, from which I cannot delete the metadata directly.

What steps do I need to follow to remove the metadata from the files while downloading and uploading?

How can I read the file metadata from an InputStream and delete it?

Tools and programming languages used: Kotlin, http4k, Apache POI, PDFBox

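        import org.apache.poi.openxml4j.opc.OPCPackage

        val opc = OPCPackage.open("demoDox.docx") // opens the package by file path
        val pp = opc.packageProperties

        println(pp.creatorProperty)
        pp.setCreatorProperty("Shubham") // core properties can be updated like this
        println(pp.creatorProperty)
        opc.close() // close() saves and closes the package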

I can remove the docx metadata only when I know the file path, but right now I get an InputStream from GCS.


Solution

  • Solved:

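    I was able to solve the problem for legacy .doc files by passing the InputStream straight to HWPFDocument:

    import org.apache.poi.hwpf.HWPFDocument

    val doc = HWPFDocument(response.body.stream) // http4k response body as an InputStream
    println("Current author = ${doc.summaryInformation.author}")

    // removeAuthor() returns nothing, so there is no result to assign
    doc.summaryInformation.removeAuthor()

    println("Removed author = ${doc.summaryInformation.author}")

    // The change is in memory only; call doc.write(outputStream) before close() to keep it
    doc.close()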
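
The answer above uses HWPFDocument, which reads legacy .doc files. For .docx, a similar approach may work, because OPCPackage.open also accepts an InputStream, so no file path is needed. The following is only a minimal sketch, assuming Apache POI's poi-ooxml artifact is on the classpath. The function name stripDocxCoreProperties is hypothetical and not part of POI. It blanks a handful of core properties and returns the cleaned file as bytes, ready to be uploaded again.

```kotlin
import org.apache.poi.openxml4j.opc.OPCPackage
import java.io.ByteArrayOutputStream
import java.io.InputStream

// Hypothetical helper: reads a .docx from a stream, blanks several core
// properties, and returns the rewritten document as a byte array.
fun stripDocxCoreProperties(input: InputStream): ByteArray {
    val out = ByteArrayOutputStream()
    // OPCPackage.open(InputStream) loads the whole package into memory.
    OPCPackage.open(input).use { pkg ->
        val props = pkg.packageProperties
        // Passing an empty string clears the corresponding property.
        props.setCreatorProperty("")
        props.setLastModifiedByProperty("")
        props.setTitleProperty("")
        props.setSubjectProperty("")
        props.setKeywordsProperty("")
        props.setDescriptionProperty("")
        props.setCategoryProperty("")
        // Write the modified package to memory instead of a file path.
        pkg.save(out)
    }
    return out.toByteArray()
}
```

With http4k this could be called as stripDocxCoreProperties(response.body.stream). Note that the sketch only touches the core properties (docProps/core.xml). Extended properties such as the company name live in docProps/app.xml and would need separate handling.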