
Appending UUID in file name when streaming via StreamSets Data Collector


I am using the HTTP Client origin to stream a file from an HTTP URL to a Hadoop FS destination, but the file name at the destination has a random UUID appended to it. I want the destination file to keep the same name as the source file.

Example: the source file name is README.txt, but the destination file name is README_112e5d4b-4d85-4764-ab81-1d7b6e0237b2.txt

I want the destination file name to be README.txt

Here is my configuration.

HTTP Client :

General

Name : HTTP Client 1

Description : 

On Record Error : Send to Error

HTTP

Resource URL : http://files.data.gouv.fr/sirene/README.txt

Headers : 

Mode : Streaming

Per-Status Actions

HTTP Status Code : 500 | Action for status : Retry with exponential backoff |

Base Backoff Interval (ms) : 1000 | Max Retries : 10

HTTP Method : GET

Body Time Zone : UTC (UTC)

Request Transfer Encoding : BUFFERED

HTTP Compression : None

Connect Timeout : 0

Read Timeout : 0

Authentication Type : None

Use OAuth 2

Use Proxy

Max Batch Size (records) : 1000

Batch Wait Time (ms) : 2000

Pagination

Pagination Mode : None

TLS

Use TLS

Timeout Handling

Action for timeout : Retry immediately

Max Retries : 10

Data Format

Data Format : Text

Compression Format : None

Max Line Length : 1024

Use Custom Delimiter

Charset : UTF-8

Ignore Control Characters

Logging 

Enable Request Logging

Hadoop FS Destination :

General

Name : Hadoop FS 1

Description : Writing into HDFS

Stage Library : CDH 5.10.1

Produce Events

Required Fields

Preconditions

On Record Error : Send to Error

Output Files

File Type : Text Files

Files Prefix : README

File Suffix : txt

Directory in Header

Directory Template : /user/username/

Data Time Zone : UTC (UTC)

Time Basis : ${time:now()}

Max Records in File : 0

Max File Size (MB) : 0

Idle Timeout : ${1 * HOURS}

Compression Codec : None

Use Roll Attribute

Validate HDFS Permissions : ON

Skip file recovery

Late Records

Late Record Time Limit (secs) : ${1 * HOURS}

Late Record Handling : Send to error

Data Format

Data Format : Text

Text Field Path : /text

Record Separator : \n

On Missing Field : Report Error

Charset : UTF-8

Solution

  • You can configure a filename prefix and suffix, but it is not possible to remove the UUID.

    In many circumstances, the directory is the smallest useful filesystem entity in Hadoop. Multiple clients may write files concurrently, and files may be 'rolled' (the current output file is closed and a new one opened) for operational reasons such as the file size passing a given threshold. Data Collector therefore makes every filename unique to avoid accidental data loss.

    There is a workaround if you really want to do this: enable events on the Hadoop FS destination and use an HDFS File Metadata Executor to rename each file once it has been closed. See this case study on output file management for more.
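
To make the rename in the workaround concrete, here is a minimal Python sketch of the name transformation only. It is not StreamSets code and does not use any Data Collector API. The helper name `strip_uuid` is hypothetical, and the sketch assumes the generated name always has the form prefix, underscore, UUID, dot, suffix, as in the README example from the question.

```python
import posixpath
import re

# Assumption: the generated part is an underscore followed by a canonical
# 8-4-4-4-12 hexadecimal UUID, as in the example from the question.
_UUID_PART = re.compile(
    r"_[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}"
    r"-[0-9a-fA-F]{4}-[0-9a-fA-F]{12}"
)


def strip_uuid(filepath: str) -> str:
    """Return filepath with the first '_<uuid>' removed from the file name.

    Hypothetical helper for illustration; only the last path component
    is changed, so the directory is left untouched.
    """
    directory, name = posixpath.split(filepath)
    cleaned = _UUID_PART.sub("", name, count=1)
    return posixpath.join(directory, cleaned)


if __name__ == "__main__":
    closed_file = "/user/username/README_112e5d4b-4d85-4764-ab81-1d7b6e0237b2.txt"
    print(strip_uuid(closed_file))
```

Bear in mind that if the destination rolls more than one file into the same directory, every file maps to the same target name. The UUID exists to prevent exactly that collision, so a fixed target name is only safe when a single output file is expected.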