Tags: apache-spark, databricks, azure-databricks

Databricks Spark Write Behavior - Files like _committed and _start


When I write data to S3 with Spark on Databricks, I see additional files such as _committed and _start alongside my actual data files. My implementation doesn't involve Delta Lake, and I'm using a standard Spark write operation.
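
For reference, the write itself is nothing special; a simplified stand-in for the job (the DataFrame and the S3 path are placeholders) looks like this:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    # Stand-in DataFrame; the real job builds it from an upstream source
    df = spark.range(100).withColumnRenamed("id", "value")

    # Plain Parquet write to S3 -- no Delta Lake involved
    df.write.mode("overwrite").parquet("s3://my-bucket/output/path")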

Environment:

  • Databricks Runtime: 9.1
  • Apache Spark: 3.1.2
  • Storage: S3

Observations:

  • Files like _committed and _start are present alongside my actual data files in the S3 output location.
  • I'm not using Delta Lake in my implementation.

Question:

  • Can someone explain why these files (_committed and _start) are being generated, and what is their purpose in a standard Spark write operation on Databricks?
  • Are there any specific Databricks Runtime or Databricks File System (DBFS) features that might be causing this behaviour?
  • Is there any way to omit these files?

Solution

  • The _committed and _start files you see next to your data files in a standard Spark write on Databricks come from the transactional commit mechanism (DBIO) that the Databricks Runtime applies when writing to external storage such as S3. They are marker files that record which output files belong to a started and successfully committed write, and they are created even when Delta Lake is not involved. Spark treats files whose names begin with an underscore as hidden, so they do not affect the data when the directory is read back (see the read-back sketch at the end of this answer).

    Taking your questions in order: the purpose of these files is internal to how Databricks and DBFS handle write operations, and unfortunately Databricks does not expose much detailed documentation about the mechanism.

    Upgrading or changing the Databricks Runtime version may also change this behaviour, since the commit mechanics are tied to the runtime rather than to your code.

    As for omitting them: I don't think there is a straightforward way to suppress these files at write time. If they get in the way of downstream consumers, you can add a post-processing step after the write that separates them from your actual data files, for example via the DBFS REST API (below) or with dbutils.fs from a notebook (sketched after the listing).

    import requests
    
    # Databricks REST API endpoint for listing files in DBFS
    api_url = "https://<your-databricks-instance>/api/2.0/dbfs/list"
    
    # Databricks personal access token for authentication
    token = "<your-personal-access-token>"
    
    # DBFS directory path (for S3 output this is typically a mount, e.g. under /mnt/)
    directory_path = "/path/to/your/directory"
    
    # Ask the DBFS API for the contents of the directory
    response = requests.get(
        api_url,
        headers={"Authorization": f"Bearer {token}"},
        params={"path": directory_path},
    )
    
    # Check if the request was successful
    if response.status_code == 200:
        # Parse the response JSON
        files = response.json().get("files", [])
    
        # Keep only the data files: drop entries whose file name starts with
        # one of the marker prefixes
        data_files = [
            f for f in files
            if not f["path"].rsplit("/", 1)[-1].startswith(("_committed", "_start"))
        ]
    
        # Work with the remaining data files (copy, register, hand off downstream, ...)
        for f in data_files:
            print(f"Processing data file: {f['path']}")
    else:
        print(f"Error: {response.status_code} - {response.text}")