Fastest way to find out if a file exists in S3 (with boto3)

Friday, Jun 16, 2017
9 comments Python, Web development

URL: https://gist.github.com/peterbe/25b9b7a7a9c859e904c305ddcf125f90

tl;dr; It's faster to list objects with prefix being the full key path, than to use HEAD to find out of a object is in an S3 bucket.

Background

I have a piece of code that opens up a user uploaded .zip file and extracts its content. Then it uploads each file into an AWS S3 bucket if the file size is different or if the file didn't exist at all before.

It looks like this:


for filename, filesize, fileobj in extract(zip_file):
    size = _size_in_s3(bucket, filename)
    if size is None or size != filesize:
        upload_to_s3(bucket, filename, fileobj)
        print('Updated!' if size else 'New!')
    else:
        print('Ignored')

I'm using the boto3 S3 client so there are two ways to ask if the object exists and get its metadata.

Option 1: client.head_object

Option 2: client.list_objects_v2 with Prefix=${keyname}.

But why the two different approaches?

The problem with client.head_object is that it's odd in how it works. Sane but odd. If the object does not exist, boto3 raises a botocore.exceptions.ClientError which contains a response and in it you can look for exception.response['Error']['Code'] == '404'.

What I noticed was that if you use a try:except ClientError: approach to figure out if an object exists, you reset the client's connection pool in urllib3. So after an exception has happened, any other operations on the client causes it to have to, internally, create a new HTTPS connection. That can cost time.

I wrote and filed this issue on github.com/boto/boto3.

So I wrote two different functions to return an object's size if it exists:


def _key_existing_size__head(client, bucket, key):
    """return the key's size if it exist, else None"""
    try:
        obj = client.head_object(Bucket=bucket, Key=key)
        return obj['ContentLength']
    except ClientError as exc:
        if exc.response['Error']['Code'] != '404':
            raise

And the contender...:


def _key_existing_size__list(client, bucket, key):
    """return the key's size if it exist, else None"""
    response = client.list_objects_v2(
        Bucket=bucket,
        Prefix=key,
    )
    for obj in response.get('Contents', []):
        if obj['Key'] == key:
            return obj['Size']

They both work. That was easy to test. But which is fastest?

Before we begin, which do you think is fastest? The head_object feels like it'll be able to send an operation to S3 internally to do a key lookup directly. But S3 isn't a normal database.

Here's the script partially cleaned up but should be easy to run.

The results

So I wrote a loop that ran 1,000 times and I made sure the bucket was empty so that 1,000 times the result of the iteration is that it sees that the file doesn't exist and it has to do a client.put_object.

Here are the results:

FUNCTION: _key_existing_size__list Used 511 times
    SUM    148.2740752696991
    MEAN   0.2901645308604679
    MEDIAN 0.2569708824157715
    STDEV  0.17742598775696436

FUNCTION: _key_existing_size__head Used 489 times
    SUM    249.79622673988342
    MEAN   0.510830729529414
    MEDIAN 0.4780092239379883
    STDEV  0.14352671121877011

Because it's network bound, it's really important to avoid the 'MEAN' and instead look at the 'MEDIAN'. My home broadband can cause temporary spikes.

Clearly, using client.list_objects_v2 is faster. It's 90% faster than client.head_object.

But note! this was 1,000 times of B) "does the file already exist?" and B) "No? Ok upload it". So the times there include all the client.put_object calls.

So why did I measure both? I.e. _key_existing_size__list+client.put_object versus. _key_existing_size__head+client.put_object? The reason is that the approach of using try:except ClientError: followed by a client.put_object causes boto3 to create a new HTTPS connection in its pool. Again, see the issue which demonstrates this in different words.

What if the object always exists?

So, I simply run the benchmark again. The first time, it uploaded all 1,000 uniquely named objects. So running it a second time, every time the answer is that the object exists, and its size hasn't changed, so it never triggers the client.put_object.

Here are the results this time:

FUNCTION: _key_existing_size__list Used 495 times
    SUM    54.60546112060547
    MEAN   0.11031406286991004
    MEDIAN 0.08583354949951172
    STDEV  0.06339202669609442

FUNCTION: _key_existing_size__head Used 505 times
    SUM    44.59347581863403
    MEAN   0.0883039125121466
    MEDIAN 0.07310152053833008
    STDEV  0.054452842190700346

In this case, using client.head_object is faster. By 20% but the median time is 0.08 seconds! Even on a home broadband connection. In other words, I don't think that difference is significant.

One more time, excluding the `client.put_object`

The point of using client.list_objects_v2 instead of client.head_object was to avoid breaking the connection pool in urllib3 that boto3 manages somehow. Having to create a new HTTPS connection (and adding it to the pool) costs time, but what if we disregard that and compare the two functions "purely" on how long they take when the file does NOT exist? Remember, the second measurement above was when every object exists.

So we know it took 0.09 seconds and 0.07 seconds respectively for the two functions to figure out that the object does exist. How long does it take to figure out that the object does not exist independent of any other op. I.e. just try each one without doing a client.put_object afterwards. That means we avoid the bug so the comparison is fair.

The results:

FUNCTION: _key_existing_size__list Used 499 times
    SUM    123.57429671287537
    MEAN   0.247643881188127
    MEDIAN 0.2196049690246582
    STDEV  0.18622877427652743

FUNCTION: _key_existing_size__head Used 501 times
    SUM    112.99495434761047
    MEAN   0.22553883103315464
    MEDIAN 0.2828958034515381
    STDEV  0.15342842113446084

The client.list_objects_v2 beats client.head_object by 30%. And it matters. Above I said that 20% difference didn't matter but now it does. That's because the time difference when it always finds the object was 0.013 seconds. When it comes to figuring out that the object did not exist the time difference is 0.063 seconds. That's still a pretty small number but, hey, you gotto draw the line somewhere.

In conclusion

Using client.list_objects_v2 is a better alternative to using client.head_object.

If you think you'll often find that the object doesn't exist and needs a client.put_object then using client.list_objects_v2 is 90% faster. If you think you'll rarely need client.put_object (i.e. that most objects don't change) then client.list_objects_v2 is almost the same performance.

Comments

Post your own comment

Sergey Titievsky August 20, 2017

I just searching for the solution.
I think list object is not matching for buckets with large amount of files.

Peter Bengtsson August 25, 2017

Really? What makes you think that? Are you saying it the result might different between HeadObject vs. ListObjectsV2 ...when the bucket is huuuge?

Cameron Spiller August 30, 2017

Was this ever investigated?

Alexandr February 1, 2018

I believe you should use markers to iterate over list of files in the bucket by using Prefix.

Peter Bengtsson February 1, 2018

I think I understand the comment but it's not entirely applicable. If you use a "vague" prefix like "myprefix/files/" that might yield so many results that, due to pagination I guess, you might miss the file you're looking for.

In my case I use the whole file name as the prefix.

Anonymous September 29, 2017

Imagine you have thousands of other objects like 'keya', 'keyb', 'keyc' that are also returned when you list for prefix 'key'.... are you guaranteed that the object 'key' you are searching for will come on the first request, and that you don't need to paginate through?

Oz Yadgar May 28, 2020

for future users:
'key' is promised to appear first in this case because "List results are always returned in UTF-8 binary order." [https://docs.aws.amazon.com/AmazonS3/latest/dev/ListingKeysUsingAPIs.html]
so exact match will always be higher than a match as only a prefix for the search term

Anonymous May 4, 2022

Is there a pricing difference between the 2 for large data sets?

Peter Bengtsson May 4, 2022

No idea myself. But it's 1 request in both cases.