tl;dr; It's faster to list objects with the prefix set to the full key path than to use HEAD to find out if an object is in an S3 bucket.
Background
I have a piece of code that opens up a user-uploaded .zip file and extracts its content. Then it uploads each file into an AWS S3 bucket if the file size is different or if the file didn't exist at all before.
It looks like this:
for filename, filesize, fileobj in extract(zip_file):
    size = _size_in_s3(bucket, filename)
    if size is None or size != filesize:
        upload_to_s3(bucket, filename, fileobj)
        # Compare against None so an existing 0-byte object isn't reported as new
        print('Updated!' if size is not None else 'New!')
    else:
        print('Ignored')
I'm using the boto3 S3 client so there are two ways to ask if the object exists and get its metadata.
Option 1: client.head_object
Option 2: client.list_objects_v2 with Prefix=${keyname}
But why the two different approaches?
The problem with client.head_object is that it's odd in how it works. Sane but odd. If the object does not exist, boto3 raises a botocore.exceptions.ClientError which contains a response, and in it you can look for exception.response['Error']['Code'] == '404'.
What I noticed was that if you use a try:except ClientError: approach to figure out if an object exists, you reset the client's connection pool in urllib3. So after an exception has happened, any other operation on the client causes it to have to, internally, create a new HTTPS connection. That can cost time.
I wrote and filed this issue on github.com/boto/boto3.
So I wrote two different functions to return an object's size if it exists:
from botocore.exceptions import ClientError


def _key_existing_size__head(client, bucket, key):
    """return the key's size if it exists, else None"""
    try:
        obj = client.head_object(Bucket=bucket, Key=key)
        return obj['ContentLength']
    except ClientError as exc:
        # A 404 means the object doesn't exist; anything else is a real error
        if exc.response['Error']['Code'] != '404':
            raise
        return None
And the contender...:
def _key_existing_size__list(client, bucket, key):
    """return the key's size if it exists, else None"""
    response = client.list_objects_v2(
        Bucket=bucket,
        Prefix=key,
    )
    # The prefix can match longer keys too, so require an exact match
    for obj in response.get('Contents', []):
        if obj['Key'] == key:
            return obj['Size']
    return None
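To see why the exact-match check in the list variant matters, here is a minimal sketch that runs the same logic against a made-up in-memory client. FakeS3Client, the bucket name, and the key names are hypothetical and exist only for illustration; the response shape ('Contents' missing when nothing matches) is an assumption modeled on what list_objects_v2 returns.

```python
class FakeS3Client:
    """Hypothetical in-memory stand-in for the boto3 client (illustration only)."""

    def __init__(self, objects):
        # objects maps key name -> size in bytes
        self.objects = objects

    def list_objects_v2(self, Bucket, Prefix):
        # Assumed response shape: 'Contents' is absent when nothing matches.
        matches = [
            {'Key': key, 'Size': size}
            for key, size in sorted(self.objects.items())
            if key.startswith(Prefix)
        ]
        return {'Contents': matches} if matches else {'KeyCount': 0}


def key_existing_size_list(client, bucket, key):
    """Same logic as _key_existing_size__list above."""
    response = client.list_objects_v2(Bucket=bucket, Prefix=key)
    for obj in response.get('Contents', []):
        if obj['Key'] == key:
            return obj['Size']
    return None


client = FakeS3Client({'report.csv': 120, 'report.csv.bak': 95})
print(key_existing_size_list(client, 'my-bucket', 'report.csv'))   # 120
print(key_existing_size_list(client, 'my-bucket', 'report'))       # None
print(key_existing_size_list(client, 'my-bucket', 'missing.txt'))  # None
```

The second call is the interesting one: the prefix 'report' matches both stored keys, but neither is an exact match, so the function correctly reports that no such object exists.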
They both work. That was easy to test. But which is faster?
Before we begin, which do you think is faster? The head_object approach feels like it'll be able to send an operation to S3 internally to do a key lookup directly. But S3 isn't a normal database.
Here's the script, partially cleaned up, but it should be easy to run.
The results
So I wrote a loop that ran 1,000 times, and I made sure the bucket was empty so that on every one of the 1,000 iterations it sees that the file doesn't exist and has to do a client.put_object.
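A loop like that could be sketched roughly as follows. This is not the original script. It assumes that one of the two functions is picked at random on each iteration (which would explain the uneven "Used N times" counts), and the names benchmark, report, check_with_list and check_with_head are made up for illustration. The stand-in functions do nothing, so the sketch runs without a network.

```python
import random
import statistics
import time


def benchmark(functions, make_args, iterations=1000):
    """Call one randomly chosen function per iteration and record its duration."""
    timings = {func.__name__: [] for func in functions}
    for i in range(iterations):
        func = random.choice(functions)
        args = make_args(i)
        t0 = time.perf_counter()
        func(*args)
        timings[func.__name__].append(time.perf_counter() - t0)
    return timings


def report(timings):
    """Print per-function statistics, labeled like the results in this post."""
    for name, times in timings.items():
        print('FUNCTION:', name, 'Used', len(times), 'times')
        if not times:
            continue  # never picked, nothing to summarize
        print('  SUM   ', sum(times))
        print('  MEAN  ', statistics.mean(times))
        print('  MEDIAN', statistics.median(times))
        if len(times) > 1:  # stdev needs at least two samples
            print('  STDEV ', statistics.stdev(times))


# Stand-ins so the sketch runs offline. In a real measurement each candidate
# would do the existence check and, if needed, the client.put_object.
def check_with_list(key):
    return None


def check_with_head(key):
    return None


timings = benchmark(
    [check_with_list, check_with_head],
    make_args=lambda i: ('file-{}.txt'.format(i),),
    iterations=100,
)
report(timings)
```

Picking the function at random per iteration, rather than running one function 500 times and then the other, spreads any temporary network slowdown across both candidates.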
Here are the results:
FUNCTION: _key_existing_size__list Used 511 times
    SUM    148.2740752696991
    MEAN   0.2901645308604679
    MEDIAN 0.2569708824157715
    STDEV  0.17742598775696436
FUNCTION: _key_existing_size__head Used 489 times
    SUM    249.79622673988342
    MEAN   0.510830729529414
    MEDIAN 0.4780092239379883
    STDEV  0.14352671121877011
Because it's network bound, it's really important to avoid the 'MEAN' and instead look at the 'MEDIAN'. My home broadband can cause temporary spikes.
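As a small illustration of why the median is the safer number here, consider a handful of made-up timings (not measurements from this benchmark) where a single request hits a network spike:

```python
import statistics

# Made-up timings in seconds; the last one simulates a broadband hiccup.
timings = [0.25, 0.26, 0.24, 0.27, 3.10]

mean = statistics.mean(timings)
median = statistics.median(timings)

print('MEAN  ', mean)    # pulled up to roughly 0.82 by the single spike
print('MEDIAN', median)  # stays at 0.26, the typical request time
```

One outlier more than triples the mean, while the median still reflects what a typical request costs.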
Clearly, using client.list_objects_v2 is faster. Going by the medians (about 0.26 versus 0.48 seconds), it's close to 90% faster than client.head_object.
But note! This was 1,000 rounds of A) "does the file already exist?" and B) "No? OK, upload it". So the times there include all the client.put_object calls.
So why did I measure both together? I.e. _key_existing_size__list + client.put_object versus _key_existing_size__head + client.put_object? The reason is that the approach of using try:except ClientError: followed by a client.put_object causes boto3 to create a new HTTPS connection in its pool. Again, see the issue, which demonstrates this in different words.
What if the object always exists?
So, I simply ran the benchmark again. The first time, it uploaded all 1,000 uniquely named objects. So when running it a second time, the answer every time is that the object exists and its size hasn't changed, so it never triggers the client.put_object.
Here are the results this time:
FUNCTION: _key_existing_size__list Used 495 times
    SUM    54.60546112060547
    MEAN   0.11031406286991004
    MEDIAN 0.08583354949951172
    STDEV  0.06339202669609442
FUNCTION: _key_existing_size__head Used 505 times
    SUM    44.59347581863403
    MEAN   0.0883039125121466
    MEDIAN 0.07310152053833008
    STDEV  0.054452842190700346
In this case, using client.head_object is faster, by about 20%. But the median time is 0.08 seconds! Even on a home broadband connection. In other words, I don't think that difference is significant.
One more time, excluding the client.put_object
The point of using client.list_objects_v2 instead of client.head_object was to avoid breaking the connection pool in urllib3 that boto3 manages somehow. Having to create a new HTTPS connection (and adding it to the pool) costs time, but what if we disregard that and compare the two functions "purely" on how long they take when the file does NOT exist? Remember, the second measurement above was when every object exists.
So we know it took 0.09 seconds and 0.07 seconds respectively for the two functions to figure out that the object does exist. How long does it take to figure out that the object does not exist, independent of any other operation? I.e. just try each one without doing a client.put_object afterwards. That means we avoid the bug, so the comparison is fair.
The results:
FUNCTION: _key_existing_size__list Used 499 times
    SUM    123.57429671287537
    MEAN   0.247643881188127
    MEDIAN 0.2196049690246582
    STDEV  0.18622877427652743
FUNCTION: _key_existing_size__head Used 501 times
    SUM    112.99495434761047
    MEAN   0.22553883103315464
    MEDIAN 0.2828958034515381
    STDEV  0.15342842113446084
client.list_objects_v2 beats client.head_object by 30%. And it matters. Above I said that a 20% difference didn't matter, but now it does. That's because the time difference when it always finds the object was 0.013 seconds. When it comes to figuring out that the object did not exist, the time difference is 0.063 seconds. That's still a pretty small number but, hey, you gotta draw the line somewhere.
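To make the arithmetic concrete, here is a small sketch that recomputes those differences from the medians reported in the two result listings above. The variable names are made up. The percentages come out at 17% and 29%, which the text rounds to 20% and 30%.

```python
# Median seconds per call, copied from the results above.
exists = {'list': 0.08583354949951172, 'head': 0.07310152053833008}
missing = {'list': 0.2196049690246582, 'head': 0.2828958034515381}

# When the object exists, head_object is the faster one.
diff_exists = exists['list'] - exists['head']
# When the object does not exist, list_objects_v2 is the faster one.
diff_missing = missing['head'] - missing['list']

print(round(diff_exists, 3))   # 0.013
print(round(diff_missing, 3))  # 0.063

# Relative difference, measured against the faster function in each case.
print(round(100 * diff_exists / exists['head']))    # 17
print(round(100 * diff_missing / missing['list']))  # 29
```

The relative gaps are in the same ballpark, but the absolute gap is about five times larger when the object is missing, which is why the second one matters more.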
In conclusion
Using client.list_objects_v2 is a better alternative to using client.head_object.
If you think you'll often find that the object doesn't exist and needs a client.put_object, then using client.list_objects_v2 is 90% faster. If you think you'll rarely need client.put_object (i.e. most objects don't change), then client.list_objects_v2 has almost the same performance.
Comments
I was just searching for this solution.
I think listing objects won't match for buckets with a large number of files.
Really? What makes you think that? Are you saying the result might differ between HeadObject and ListObjectsV2 when the bucket is huuuge?
Was this ever investigated?
I believe you should use markers to iterate over list of files in the bucket by using Prefix.
I think I understand the comment but it's not entirely applicable. If you use a "vague" prefix like "myprefix/files/" that might yield so many results that, due to pagination I guess, you might miss the file you're looking for.
In my case I use the whole file name as the prefix.
Imagine you have thousands of other objects like 'keya', 'keyb', 'keyc' that are also returned when you list for prefix 'key'.... are you guaranteed that the object 'key' you are searching for will come on the first request, and that you don't need to paginate through?
for future users:
'key' is promised to appear first in this case because "List results are always returned in UTF-8 binary order." [https://docs.aws.amazon.com/AmazonS3/latest/dev/ListingKeysUsingAPIs.html]
so an exact match will always come before any key that merely starts with the search term
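That ordering argument can be checked locally with a minimal sketch. It assumes that sorting by the UTF-8 encoded bytes models the "UTF-8 binary order" quoted above; the key names are made up.

```python
# Hypothetical keys that all share the prefix 'key'.
keys = ['keyb', 'key', 'keyc', 'keya', 'key/child']

# Sort by raw UTF-8 bytes, as a model of the documented listing order.
ordered = sorted(keys, key=lambda k: k.encode('utf-8'))

print(ordered)  # ['key', 'key/child', 'keya', 'keyb', 'keyc']
```

A string always sorts before any longer string that starts with it, so the exact key, if it exists, is the first entry returned for that prefix.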
Is there a pricing difference between the 2 for large data sets?
No idea myself. But it's 1 request in both cases.