
Palantir Foundry - How to load PDF files from Compass folder into code repository transform


In Palantir Foundry, my goal is:

  1. Find all PDFs in a Compass folder
  2. In a transform, copy each PDF (for example with shutil) from Compass to a dataset file system

I have retrieved a list of PDF files stored in a Compass folder from this endpoint (compass/api/folders/{compass_rid}/children), and also successfully set up a Compass File Lister. I'm stuck on where to go from either option, as I haven't figured out how to use any of the information to actually read a blobster file from a transform.

Is it possible to read these PDFs in a transform to be able to copy them to an unstructured dataset file system?

Based on other SO questions, I read through the "read files in a repository" documentation, but this seems to rely on the files actually being imported into the repository, so I'm not sure it would help me.

I also read through the Compass endpoints but I don't see a way to move/copy files from Compass to a dataset filesystem, only potentially from one Compass folder to another.


Solution

  • Sharing an updated version here that pulls all blobster files from a specified folder and writes them to a single dataset.

    from transforms.api import transform, Output, configure
    from transforms.external.systems import (
        EgressPolicy,
        Credential,
        use_external_systems
    )
    import requests
    
    
    @configure(profile=["KUBERNETES_NO_EXECUTORS_SMALL"])
    @use_external_systems(
        egress_policy=EgressPolicy("<POLICY_RID>"),
        creds=Credential("<SAVED_CREDENTIALS_RID>")
    )
    @transform(
        output=Output("<OUTPUT_RID>"),
    )
    def compute(egress_policy, creds, output):
    
        url_root = '<ROOT_URL>'
        compass_folder_read = '<FOLDER_RID>'
        resources_lister_url = f'{url_root}/compass/api/folders/{compass_folder_read}/children'
        get_blobster_url_root = f'{url_root}/blobster/api/salt/'
        TOKEN = creds.get("token")
    
        headers_compass = {
            'Authorization': f'Bearer {TOKEN}',
            'Content-Type': 'application/json',
        }
    
        headers_blobster = {
            'cookie': f'PALANTIR_TOKEN={TOKEN}'
        }
    
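        files_response = requests.get(resources_lister_url, headers=headers_compass)
        # Fail fast on auth or egress errors instead of parsing an error body.
        files_response.raise_for_status()
        # Keep only blobster-backed files, mapping each RID to its file name.
        rid_filename_map = {
            f['rid']: f['name']
            for f in files_response.json().get('values', [])
            if 'blobster' in f.get('rid', '')
        }
    
        for blobster_rid, filename in rid_filename_map.items():
            url = get_blobster_url_root + blobster_rid
            file_contents_response = requests.get(url, headers=headers_blobster)
            file_contents_response.raise_for_status()
            # The with block closes the file on exit, so no explicit close() is needed.
            with output.filesystem().open(filename, 'wb') as f:
                f.write(file_contents_response.content)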
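
The transform above copies every blobster file in the folder, while the question asks only for PDFs. A minimal sketch of narrowing the listing to PDFs follows. It assumes the children response has the same shape the transform already relies on (a `values` list of entries with `rid` and `name`) and that PDFs can be recognised by a `.pdf` file extension. The helper name `select_pdf_files` and the sample RIDs are made up for illustration.

```python
def select_pdf_files(children):
    """Map blobster RID -> file name, keeping only entries that look like PDFs.

    `children` is assumed to be the parsed JSON body of the Compass children
    listing, shaped like {'values': [{'rid': ..., 'name': ...}, ...]}.
    """
    selected = {}
    for entry in children.get('values') or []:
        rid = entry.get('rid') or ''
        name = entry.get('name') or ''
        # Assumption: a PDF is identified by its extension, case-insensitively.
        if 'blobster' in rid and name.lower().endswith('.pdf'):
            selected[rid] = name
    return selected


# Hypothetical payload; these RIDs are invented and not real Foundry identifiers.
sample = {
    'values': [
        {'rid': 'ri.blobster.example.1', 'name': 'Report.PDF'},
        {'rid': 'ri.blobster.example.2', 'name': 'notes.txt'},
        {'rid': 'ri.compass.example.3', 'name': 'subfolder.pdf'},
    ]
}
print(select_pdf_files(sample))  # {'ri.blobster.example.1': 'Report.PDF'}
```

In the transform, `rid_filename_map = select_pdf_files(files_response.json())` could replace the dictionary comprehension. The download loop would stay unchanged.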