
Riak-CS Update ACL sometimes not working using Boto


I have a virtualised cluster of 5 Riak-CS nodes. Stanchion is installed on the first node. These nodes are behind an Nginx reverse proxy.

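When I upload a JPG file with my Python script, which uses the boto library, it works fine:

from boto.s3.connection import S3Connection, OrdinaryCallingFormat

# Path-style requests (host/bucket/key) rather than bucket subdomains
cf = OrdinaryCallingFormat()
conn = S3Connection(aws_access_key_id=apikey, aws_secret_access_key=secretkey,
                    is_secure=False, host=s3Host, port=s3Port, calling_format=cf)
b = conn.get_bucket(bucketName)
k = b.new_key(fileName)
# Upload and set the canned ACL in the same PUT request
k.set_contents_from_filename(fileName, policy='public-read')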

However, if I upload the file first and then set the ACL in a separate call, the ACL is only sometimes set to public:

from boto.s3.connection import S3Connection, OrdinaryCallingFormat

cf = OrdinaryCallingFormat()
conn = S3Connection(aws_access_key_id=apikey, aws_secret_access_key=secretkey,
                    is_secure=False, host=s3Host, port=s3Port, calling_format=cf)
b = conn.get_bucket(bucketName)
k = b.new_key(fileName)
# Upload with the default ACL, then set the ACL in a separate request
k.set_contents_from_filename(fileName)
k.set_acl('public-read')

I have checked the Nginx log files. In the first case they show the following:

"HEAD /test/ HTTP/1.1" 200 0 "-" "Boto/2.29.1 Python/2.7.3 Windows/7"
"PUT /test/1.jpg HTTP/1.1" 200 25 "-" "Boto/2.29.1 Python/2.7.3 Windows/7"

and in the second case we get:

"HEAD /test/ HTTP/1.1" 200 0 "-" "Boto/2.29.1 Python/2.7.3 Windows/7"
"PUT /test/1.jpg HTTP/1.1" 200 25 "-" "Boto/2.29.1 Python/2.7.3 Windows/7"
"PUT /test/1.jpg?acl HTTP/1.1" 200 0 "-" "Boto/2.29.1 Python/2.7.3 Windows/7"

Both are as expected.

I'm using "s3cmd info s3://test/1.jpg" to find out what the ACL on the file is. It seems that, depending on which Riak-CS server the PUT ACL request is sent to, the file is sometimes changed to public and sometimes not. I've checked the network traffic leaving the machine that runs the script, and the PUT ACL request is exactly the same each time, regardless of success or failure. The request passing through Nginx is also identical each time, and even when the ACL is not updated to public the response is still 200.
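The same check can also be scripted instead of read from "s3cmd info" output. The following is a minimal sketch, assuming the grants follow the S3 ACL model that boto exposes (objects with `permission` and `uri` attributes) and that public access is granted through the standard AllUsers group URI. The helper name `is_public_read` and the `FakeGrant` stand-in are hypothetical and only there so the example runs without a cluster.

```python
from collections import namedtuple

# Group URI that S3-style ACLs use for anonymous ("everyone") access.
ALL_USERS_URI = "http://acs.amazonaws.com/groups/global/AllUsers"


def is_public_read(grants):
    """Return True if any grant gives READ (or FULL_CONTROL) to AllUsers."""
    return any(
        getattr(grant, "uri", None) == ALL_USERS_URI
        and grant.permission in ("READ", "FULL_CONTROL")
        for grant in grants
    )


# Hypothetical stand-in for boto's grant objects.
FakeGrant = namedtuple("FakeGrant", ["permission", "uri"])
private_acl = [FakeGrant("FULL_CONTROL", None)]
public_acl = private_acl + [FakeGrant("READ", ALL_USERS_URI)]
print(is_public_read(private_acl), is_public_read(public_acl))  # False True

# Assumed usage with a real key object: is_public_read(k.get_acl().acl.grants)
```

Pointing the connection's `host` at one Riak-CS node at a time, rather than at the proxy, would let the read side of the test be pinned to a known node.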

I've monitored the Riak-CS log files on each of the nodes during the upload, and the failure seems to happen in only two of the five distinct upload scenarios. Here are the details:

For example, the file is PUT on node 4 and the ACL is PUT on node 3. The query of the file's ACL (s3cmd info) is done against node 1 and the result is Success: the ACL has public access set. Here is the full set of cases:

Obj PUT Node: 4  ACL PUT Node: 3  Read Node: 1 = Success
Obj PUT Node: 3  ACL PUT Node: 2  Read Node: 5 = Success
Obj PUT Node: 2  ACL PUT Node: 1  Read Node: 4 = Fail
Obj PUT Node: 1  ACL PUT Node: 5  Read Node: 3 = Success
Obj PUT Node: 5  ACL PUT Node: 4  Read Node: 2 = Fail
Obj PUT Node: 4  ACL PUT Node: 3  Read Node: 1 = Success
Obj PUT Node: 3  ACL PUT Node: 2  Read Node: 5 = Success
Obj PUT Node: 2  ACL PUT Node: 1  Read Node: 4 = Fail
Obj PUT Node: 1  ACL PUT Node: 5  Read Node: 3 = Success
Obj PUT Node: 5  ACL PUT Node: 4  Read Node: 2 = Fail

As you can see, sometimes the ACL "sticks" and other times it doesn't. I've checked the configuration of all the nodes, especially 1 and 4, and cannot see any issues.
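To make the pattern in the table easier to see, the observations can be grouped by the pair of nodes that handled the object PUT and the ACL PUT. This is only a sketch of one way to tabulate the ten rows above; the names `ROWS` and `outcomes_by_pair` are illustrative.

```python
# (object PUT node, ACL PUT node, read node, result), copied from the table.
ROWS = [
    (4, 3, 1, "Success"),
    (3, 2, 5, "Success"),
    (2, 1, 4, "Fail"),
    (1, 5, 3, "Success"),
    (5, 4, 2, "Fail"),
    (4, 3, 1, "Success"),
    (3, 2, 5, "Success"),
    (2, 1, 4, "Fail"),
    (1, 5, 3, "Success"),
    (5, 4, 2, "Fail"),
]


def outcomes_by_pair(rows):
    """Map each (object node, ACL node) pair to the set of results seen."""
    outcomes = {}
    for obj_node, acl_node, _read_node, result in rows:
        outcomes.setdefault((obj_node, acl_node), set()).add(result)
    return outcomes


by_pair = outcomes_by_pair(ROWS)
# True if a given pair of nodes always produced the same result.
consistent = all(len(results) == 1 for results in by_pair.values())
failing = sorted(pair for pair, results in by_pair.items() if "Fail" in results)
print(consistent, failing)  # True [(2, 1), (5, 4)]
```

Each pair gives the same result every time it occurs, and the failures are confined to the pairs (2, 1) and (5, 4), so the outcome depends on which nodes are involved rather than on chance.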

Does anyone know why sometimes this isn't working or have any ideas how I might continue to investigate what is going on here?


Solution

  • This is caused by a bug in Riak CS [1] combined with unsynchronized clocks between the servers. For a detailed description of the bug, see [1].

    The current workaround is to synchronize the server clocks. My guess is that failures become unlikely if the clocks agree to within about 100 milliseconds (this depends on the interval between PUT Object and PUT ACL at the client, and also on the network latency between the client and Riak CS). If that does not work, add a wait after PUT Object in the client code :P

    Thank you very much for the detailed Success/Failure pattern analysis, Mark. It led to quick bug identification :)

    [1] https://github.com/basho/riak_cs/issues/879
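The "wait after PUT Object" workaround could look roughly like the sketch below. It assumes a boto-style key object with `set_contents_from_filename` and `set_acl` methods; the function name `upload_then_set_acl`, the `delay` and `retries` defaults, and the optional `verify` callback are all hypothetical choices for illustration, and the right delay depends on how far apart the server clocks actually are.

```python
import time


def upload_then_set_acl(key, filename, acl="public-read", delay=1.0,
                        retries=3, verify=None, sleep=time.sleep):
    """Upload a file, wait, then set the ACL; return the attempts used.

    If verify is given, it is called with the key after each set_acl and
    the ACL is re-applied (after another wait) until verify returns True.
    """
    key.set_contents_from_filename(filename)
    for attempt in range(1, retries + 1):
        # Pause so the ACL request is issued well after the object PUT.
        sleep(delay)
        key.set_acl(acl)
        if verify is None or verify(key):
            return attempt
    raise RuntimeError("ACL %r not applied after %d attempts" % (acl, retries))


class FakeKey(object):
    """Hypothetical stand-in for a boto Key that records the calls made."""

    def __init__(self):
        self.calls = []

    def set_contents_from_filename(self, filename):
        self.calls.append(("put", filename))

    def set_acl(self, acl):
        self.calls.append(("acl", acl))


demo_key = FakeKey()
upload_then_set_acl(demo_key, "1.jpg", delay=0)
print(demo_key.calls)  # [('put', '1.jpg'), ('acl', 'public-read')]
```

Where a separate ACL request is not actually needed, passing `policy='public-read'` to `set_contents_from_filename`, as in the question's first snippet, sets the ACL in the same PUT and was reported to work reliably.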