amazon-web-services, amazon-s3, amazon-cloudfront, directory-listing

CloudFront-backed S3 bucket: listing bucket contents over HTTPS with a prefix


I have a CloudFront distribution that is backed by an S3 bucket. I'm trying to list the contents of a subdirectory (prefix) within the bucket, but I get a listing of the entire bucket rather than just the objects with the given prefix.

It's important to note that I have to do this by making an HTTPS request directly to the CloudFront domain (not by using the AWS CLI or the AWS S3 APIs).

I am able to download and upload objects through the CloudFront domain using signed cookies to authenticate. I can also list all the objects in the bucket (so the auth works), but I'm not able to return only the objects with a given prefix.

I also have a bucket policy that allows the following on the bucket itself:

  • s3:ListBucket
  • s3:ListBucketMultipartUploads

And the following on the bucket objects:

s3:GetObject, s3:GetObjectAttributes, s3:GetObjectVersion, s3:GetObjectTagging, s3:GetObjectVersionTagging, s3:PutObject, s3:PutObjectVersionTagging, s3:PutObjectTagging, s3:DeleteObject, s3:DeleteObjectTagging, s3:DeleteObjectVersionTagging, s3:AbortMultipartUpload, s3:ListMultipartUploadParts

When I try this query, the response contains the entire contents of the bucket:

import requests
cookies = {...} # cookies needed to authenticate with cloudfront; this is working correctly
res = requests.get("https://example.cloudfront.net/?prefix=myprefix/", cookies=cookies)

The response is a large XML document:

<ListBucketResult xmlns="http://s3.amazonaws.com/doc/2006-03-01/">
  <Name>mybucket</Name>
  <Prefix/>
  <Marker/>
  <MaxKeys>1000</MaxKeys>
  <IsTruncated>true</IsTruncated>
  <Contents>
    <Key>some-other-prefix-I-do-not-want/myfile.text</Key>
    ...
  </Contents>
  <Contents>
  ...
</ListBucketResult>
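
A quick way to check whether the prefix query string reached S3 at all is to read the Prefix element of the response, since S3 echoes back the prefix it applied. An empty element, as in the document above, suggests the parameter was dropped before the request got to the origin. This is a minimal sketch using only the standard library; the function name `applied_prefix` and the sample document are illustrative, not part of my actual code.

```python
import xml.etree.ElementTree as ET

# Namespace used by the S3 ListBucketResult document shown above.
S3_NS = {"s3": "http://s3.amazonaws.com/doc/2006-03-01/"}


def applied_prefix(xml_text: str) -> str:
    """Return the prefix S3 reports it applied, or "" if none was applied."""
    root = ET.fromstring(xml_text)
    # An empty <Prefix/> element has no text, so normalise it to "".
    return root.findtext("s3:Prefix", default="", namespaces=S3_NS) or ""


# Trimmed-down sample shaped like the response above (illustrative only).
sample = """<ListBucketResult xmlns="http://s3.amazonaws.com/doc/2006-03-01/">
  <Name>mybucket</Name>
  <Prefix/>
</ListBucketResult>"""

print(repr(applied_prefix(sample)))
```

With a real response, `applied_prefix(res.text)` would be the equivalent call.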

If I try the following query instead, the params have no effect and I still get the entire bucket contents (I think this form follows the S3 API, so maybe that's why this particular query doesn't work):

params = {"list-type": "2", "delimiter": "/", "prefix": "myprefix"}
res = requests.get("https://example.cloudfront.net", cookies=cookies, params=params)

When I try this query, I get a 404.

res = requests.get("https://example.cloudfront.net/myprefix/", cookies=cookies)

Does anyone know how I can list only the objects with a given prefix when listing files through a CloudFront distribution?


Solution

  • I found the answer and am posting it for anyone else who runs into this. I needed to add a custom origin request policy to my CloudFront distribution. The policy is a copy of the AWS-provided CORS-S3Origin policy, with the relevant query strings added so that those query parameters pass through CloudFront to the S3 API.

    For example, under CloudFront > Policies > Origin request > Create origin request policy, I created a policy with all of the headers from the CORS-S3Origin policy and added query strings for list-type, max-keys, delimiter, prefix, and start-after. Basically, I added every parameter that I wanted to pass through CloudFront to S3, as documented in the AWS S3 API Reference.

    Then I attached that policy to my CloudFront distribution: under CloudFront > Distributions > My Distribution, I opened the "Behaviors" tab, edited the behavior, and changed the Origin request policy to the custom policy I had just created.

    Once I made that change, I was able to filter the response as follows:

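    import requests
    from lxml import objectify

    cookies = {...}  # same signed CloudFront cookies as in the question
    res = requests.get("https://example.cloudfront.net/?prefix=path/inside/my/bucket", cookies=cookies)
    res.raise_for_status()

    # Parse the ListBucketResult XML; each <Contents> element describes one object.
    root = objectify.fromstring(res.content)
    for contents in getattr(root, "Contents", []):  # no <Contents> if nothing matches
        print(contents.Key)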
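
For anyone who would rather script the policy than click through the console, the same settings can be sketched as a boto3 `OriginRequestPolicyConfig`. This is a sketch under assumptions, not what I actually ran: the policy name `s3-list-query-strings` is hypothetical, and the three header names are what I believe the managed CORS-S3Origin policy contains, so check them against your console before relying on them. The query strings are the ones listed above.

```python
import json
from typing import Any, Dict, Sequence

# Assumption: headers believed to match the managed CORS-S3Origin policy.
CORS_S3_ORIGIN_HEADERS = (
    "origin",
    "access-control-request-headers",
    "access-control-request-method",
)

# Query strings from the answer above that should reach S3.
LIST_QUERY_STRINGS = ("list-type", "max-keys", "delimiter", "prefix", "start-after")


def build_policy_config(
    name: str,
    headers: Sequence[str] = CORS_S3_ORIGIN_HEADERS,
    query_strings: Sequence[str] = LIST_QUERY_STRINGS,
) -> Dict[str, Any]:
    """Build an OriginRequestPolicyConfig dict that whitelists the given items."""
    return {
        "Name": name,
        "Comment": "CORS-S3Origin headers plus S3 list query strings",
        "HeadersConfig": {
            "HeaderBehavior": "whitelist",
            "Headers": {"Quantity": len(headers), "Items": list(headers)},
        },
        # No cookies are forwarded to the origin in this sketch.
        "CookiesConfig": {"CookieBehavior": "none"},
        "QueryStringsConfig": {
            "QueryStringBehavior": "whitelist",
            "QueryStrings": {"Quantity": len(query_strings), "Items": list(query_strings)},
        },
    }


def create_policy(name: str) -> Dict[str, Any]:
    """Create the policy in CloudFront (requires boto3 and AWS credentials)."""
    import boto3  # imported lazily so the config builder works without boto3

    client = boto3.client("cloudfront")
    return client.create_origin_request_policy(
        OriginRequestPolicyConfig=build_policy_config(name)
    )


if __name__ == "__main__":
    # Print the config only; call create_policy(...) yourself to apply it.
    print(json.dumps(build_policy_config("s3-list-query-strings"), indent=2))
```

The policy still has to be attached to the distribution's behavior afterwards, as described above. Any extra S3 parameter you want to use would need to be added to the query string list in the same way.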