I am using the HTTP Client origin to stream a file from an HTTP URL to a Hadoop FS destination, but the file name at the destination has a random UUID appended to it. I want the destination file to keep the name it has at the source.
Example: the source file name is README.txt, but the destination file name is README_112e5d4b-4d85-4764-ab81-1d7b6e0237b2.txt.
I want the destination file name to be README.txt.
Here is my configuration.
HTTP Client :
General
Name : HTTP Client 1
Description :
On Record Error : Send to Error
HTTP
Resource URL : http://files.data.gouv.fr/sirene/README.txt
Headers :
Mode : Streaming
Per-Status Actions
HTTP Status Code : 500 | Action for status : Retry with exponential backoff |
Base Backoff Interval (ms) : 1000 | Max Retries : 10
HTTP Method : GET
Body Time Zone : UTC (UTC)
Request Transfer Encoding : BUFFERED
HTTP Compression : None
Connect Timeout : 0
Read Timeout : 0
Authentication Type : None
Use OAuth 2
Use Proxy
Max Batch Size (records) : 1000
Batch Wait Time (ms) : 2000
Pagination
Pagination Mode : None
TLS
Use TLS
Timeout Handling
Action for timeout : Retry immediately
Max Retries : 10
Data Format
Data Format : Text
Compression Format : None
Max Line Length : 1024
Use Custom Delimiter
Charset : UTF-8
Ignore Control Characters
Logging
Enable Request Logging
Hadoop FS Destination :
General
Name : Hadoop FS 1
Description : Writing into HDFS
Stage Library : CDH 5.10.1
Produce Events
Required Fields
Preconditions
On Record Error : Send to Error
Output Files
File Type : Text Files
Files Prefix : README
File Suffix : txt
Directory in Header
Directory Template : /user/username/
Data Time Zone : UTC (UTC)
Time Basis : ${time:now()}
Max Records in File : 0
Max File Size (MB) : 0
Idle Timeout : ${1 * HOURS}
Compression Codec : None
Use Roll Attribute
Validate HDFS Permissions : ON
Skip file recovery
Late Records
Late Record Time Limit (secs) : ${1 * HOURS}
Late Record Handling : Send to error
Data Format
Data Format : Text
Text Field Path : /text
Record Separator : \n
On Missing Field : Report Error
Charset : UTF-8
You can configure a filename prefix and suffix, but it is not possible to remove the UUID.
In many circumstances, the directory is the smallest useful filesystem entity in Hadoop. Multiple clients may write files concurrently, and files may be 'rolled' (the current output file closed and a new file opened) for operational reasons such as the file size passing a given threshold, so Data Collector makes every filename unique to avoid accidental data loss.
There is a workaround if you really want to do this: enable events on the Hadoop FS destination (the Produce Events option) and route them to an HDFS File Metadata executor, which can rename the file once it has been closed. See this case study on output file management for more.
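To make the rename concrete, here is a minimal Python sketch of the name the executor would need to produce. It is not Data Collector configuration or API. It only illustrates the string transformation, assuming the generated name always has the form prefix_UUID.suffix as in the README example above. The function name strip_uuid is hypothetical.

```python
import re

# Matches "_<uuid>" sitting directly before the file extension (or at the
# very end of the name when there is no extension).
# Assumption: the UUID uses the standard 8-4-4-4-12 hexadecimal layout,
# as in README_112e5d4b-4d85-4764-ab81-1d7b6e0237b2.txt.
_UUID_SUFFIX = re.compile(
    r"_[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}"
    r"-[0-9a-fA-F]{4}-[0-9a-fA-F]{12}(?=\.[^./]+$|$)"
)


def strip_uuid(filename: str) -> str:
    """Return filename with the generated "_<uuid>" part removed.

    Names that do not contain such a part are returned unchanged.
    """
    return _UUID_SUFFIX.sub("", filename, count=1)


if __name__ == "__main__":
    print(strip_uuid("README_112e5d4b-4d85-4764-ab81-1d7b6e0237b2.txt"))
```

Keep in mind that a fixed target name brings back the collision risk described above: if the destination ever rolls a second file into the same directory, both files would map to the same name, so this only makes sense when exactly one output file is expected.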