hadoop hadoop-streaming hadoop-plugins hadoopy

How to access and manipulate PDF file data in Hadoop?

I want to read PDF files using Hadoop. How is that possible? As far as I know, Hadoop can only process text files, so is there any way to parse PDF files into text?

Any suggestions would be appreciated.

Solution

An easy way would be to create a SequenceFile to contain the PDF files. SequenceFile is a binary file format that stores key/value records, so you could make each record in the SequenceFile a PDF. To do this, you would create a class that implements Writable and holds the PDF bytes plus any metadata you need. You could then use any Java PDF library, such as PDFBox, to manipulate the PDFs.
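As a rough illustration of the record class described above, here is a minimal sketch. The class name `PdfRecord`, its fields, and the `serialize`/`deserialize` helpers are hypothetical and not part of Hadoop. To keep the example runnable without Hadoop on the classpath, the class does not declare `implements org.apache.hadoop.io.Writable`, but `write(DataOutput)` and `readFields(DataInput)` are the two methods that interface requires, so adding the declaration should be all that is needed in a real job.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

// Hypothetical record type: one PDF plus its file name as metadata.
// Assumption: in a real job this would also declare
// "implements org.apache.hadoop.io.Writable".
class PdfRecord {
    private String fileName = "";
    private byte[] pdfBytes = new byte[0];

    // Hadoop creates Writable instances reflectively, so keep a no-arg constructor.
    PdfRecord() {}

    PdfRecord(String fileName, byte[] pdfBytes) {
        this.fileName = fileName;
        this.pdfBytes = pdfBytes;
    }

    // Serialize the metadata, then the length-prefixed raw PDF bytes.
    public void write(DataOutput out) throws IOException {
        out.writeUTF(fileName);
        out.writeInt(pdfBytes.length);
        out.write(pdfBytes);
    }

    // Read the fields back in exactly the order write() emitted them.
    public void readFields(DataInput in) throws IOException {
        fileName = in.readUTF();
        pdfBytes = new byte[in.readInt()];
        in.readFully(pdfBytes);
    }

    String getFileName() { return fileName; }

    byte[] getPdfBytes() { return pdfBytes; }

    // Helper for this sketch only: round-trip through an in-memory buffer.
    static byte[] serialize(PdfRecord record) {
        try {
            ByteArrayOutputStream buffer = new ByteArrayOutputStream();
            record.write(new DataOutputStream(buffer));
            return buffer.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    static PdfRecord deserialize(byte[] data) {
        try {
            PdfRecord record = new PdfRecord();
            record.readFields(new DataInputStream(new ByteArrayInputStream(data)));
            return record;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // Placeholder bytes standing in for a real PDF file's contents.
        byte[] fake = "%PDF-1.4 placeholder".getBytes(StandardCharsets.US_ASCII);
        PdfRecord copy = deserialize(serialize(new PdfRecord("report.pdf", fake)));
        System.out.println(copy.getFileName() + ": " + copy.getPdfBytes().length + " bytes");
    }
}
```

In a job, records like this would be the values of the SequenceFile, and the mapper would pass the bytes from `getPdfBytes()` to a PDF library such as PDFBox to extract text.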
