
cache_memoize - a pretty decent cache decorator for Django

September 11, 2017
4 comments Python, Web development, Django

UPDATE - Oct 27, 2017 This snippet has now become its own PyPI package. See https://pypi.python.org/pypi/django-cache-memoize

This is something that's grown organically while working on Mozilla Symbol Server. It has served me very well and perhaps it's worth extracting into its own lib.

Usage

Basically, you are probably used to this in Django:


from django.core.cache import cache

def compute_something(user, special=False):
    cache_key = 'meatycomputation:{}:special={}'.format(user.id, special)
    value = cache.get(cache_key)
    if value is None:
        value = _call_the_meat(user.id, special)  # some really slow function
        cache.set(cache_key, value, 60 * 5)
    return value

Here's instead how you can do exactly the same with cache_memoize:


from wherever.decorators import cache_memoize

@cache_memoize(60 * 5)
def compute_something(user, special=False):
    return _call_the_meat(user.id, special)  # some really slow function

Cache invalidation

If you ever need to do non-trivial caching you know it's important to be able to invalidate the cache. Usually, to be able to do that you need to be involved in how the cache key was created.

Consider our two examples above. First, here's the common thing to do:


def save_user(user):
    do_something_that_will_need_to_cache_invalidate(user)

    cache_key = 'meatycomputation:{}:special={}'.format(user.id, False)
    cache.delete(cache_key)
    # And when it was special=True
    cache_key = 'meatycomputation:{}:special={}'.format(user.id, True)
    cache.delete(cache_key)

This works but it involves repeating the code that generates the cache key. You could extract that into its own function of course.
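
A minimal sketch of what that extraction might look like (the helper name here is mine, not part of the original code):

def _compute_something_cache_key(user, special=False):
    # The one single place that knows how the cache key is built
    return 'meatycomputation:{}:special={}'.format(user.id, special)

def compute_something(user, special=False):
    cache_key = _compute_something_cache_key(user, special=special)
    value = cache.get(cache_key)
    if value is None:
        value = _call_the_meat(user.id, special)  # some really slow function
        cache.set(cache_key, value, 60 * 5)
    return value

def save_user(user):
    do_something_that_will_need_to_cache_invalidate(user)
    for special in (False, True):
        cache.delete(_compute_something_cache_key(user, special=special))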

Here's how you do it with the cache_memoize decorator:


def save_user(user):
    do_something_that_will_need_to_cache_invalidate(user)

    compute_something.invalidate(user, special=False)
    compute_something.invalidate(user, special=True)    

Other features

There are actually two ways to "invalidate" the cache: calling the new myoriginalfunction.invalidate(...) function, or passing a custom extra keyword argument called _refresh. For example: compute_something(user, _refresh=True).
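
Roughly, the difference between the two looks like this (using the compute_something function from above):

# Alternative 1: delete the cached value; the next call will recompute and re-cache it
compute_something.invalidate(user, special=True)

# Alternative 2: force a recompute (and a re-cache) right now and get the fresh value back
fresh_value = compute_something(user, special=True, _refresh=True)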

You can pass callables that get called when the cache works in your favor or when it's a cache miss. For example:


def increment_hits(user, special=None):
    # use your imagination
    metrics.incr(user.email)

def cache_miss(user, special=None):
    print("cache miss on {}".format(user.email))

@cache_memoize(
    60 * 5,
    hit_callable=increment_hits,
    miss_callable=cache_miss,
)
def compute_something(user, special=False):
    return _call_the_meat(user.id, special)  # some really slow function

Sometimes you just want to use the memoizer to make sure something only gets called "once" (or once per time interval). In that case it might be smart to not flood your cache backend with the value of the function output if there is one. For example:


@cache_memoize(60 * 60, store_result=False)  # idempotent guard
def calculate_and_update(user):
    # do something expensive here that is best to only do once per hour

Internally cache_memoize will basically try to convert every argument and keyword argument to a string with, kinda, str(). That might not always be appropriate because you might know that you have two distinct objects whose __str__ will yield the same result. For that you can use the args_rewrite parameter. For example:


def simplify_special_objects(obj):
    # use your imagination
    return obj.hostname 

@cache_memoize(60 * 5, args_rewrite=simplify_special_objects)
def compute_something(special_obj):
    return _call_the_meat(special_obj.hostname)

In conclusion

I've uploaded the code as a gist.

It's quite possible that there's already a perfectly good lib that does exactly this. If so, thanks for letting me know. If not, perhaps I ought to wrap this up and publish it on PyPI. Again, thanks for letting me know.

UPDATE

I found a bug in the original gist. Updated 2017-10-05.
The bug was that the calling of miss_callable and hit_callable was reversed.

Mozilla Symbol Server (aka. Tecken) load testing

September 6, 2017
0 comments Python, Web development, Django, Mozilla

(Thanks Miles Crabil not only for being an awesome Ops person but also for reviewing this blog post!)

My project over the summer, here at Mozilla, has been a project called Mozilla Symbol Server. It's a web service that uploads C++ symbol files, downloads C++ symbol files and symbolicates C++ crash stacktraces. It went into production last week which was fun but there's still lots of work to do on adding beyond-parity features and more optimizations.

What Is Mozilla Symbol Server?

The code name for this project is Tecken and it's written in Python (Django, Gunicorn) and uses PostgreSQL, Redis and Celery. The frontend is entirely static and developed (almost) as a separate project within. The frontend is written in React (using create-react-app and react-router). Everything is run as Docker containers. And if you ask me more details about how it's configured/deployed I'm afraid I have to defer to the awesome Mozilla CloudOps team.

One of the challenges I faced developing Tecken is that symbol downloads need to be fast to handle high volumes of traffic. Today I did some load testing on our stage deployment and managed to start 14 concurrent clients that bombarded our staging server with realistic HTTPS GET queries based on log files. It's actually 7 + 1 + 4 + 2 concurrent clients. 7 of them from an m3.2xlarge EC2 node (8 vCPUs), 1 from an m3.large EC2 node (1 vCPU), 2 from two separate NYC based DigitalOcean personal servers and 2 clients here from my laptop on my home broadband. Basically, each loadtest script process got its own CPU.

Total req/s
It's hard to know how much more each client could push if it wasn't slowed down. Either way, the server managed to sustain about 330 requests per second. Our production baseline goal is to be able to handle at least 40 requests per second.

After running for a while, the caches started getting warm, but about 1-5% of requests do have to make a boto3 roundtrip to an S3 bucket located on the other side of America, in Oregon. There is also a ~5% penalty in that some requests trigger a write to a central Redis ElastiCache server. That's cheaper than the boto3 S3 call but still a hefty latency cost to pay.

The ELB in our staging environment spreads the load between 2 c4.large (2 vCPUs, 3.75GB RAM) EC2 web heads, each running preloaded Gunicorn workers between Nginx and Django. Each web head has its own local memcached server so the workers can share memory, but only within that web head.

Is this a lot?

How long is a rope? Hard to tell. Tecken's performance is certainly more than enough, and the sheer fact that it was only just deployed to production last week tells me we can probably find a lot of low-hanging-fruit optimizations on the deployment side over time.

One way of answering that is to compare it with our lightest endpoint. One that involves absolutely no external resources. It's just pure Python in the form of ELB → Nginx → Gunicorn → Django. If I run hey from the same server I did the load testing from, I get a topline of 1,300 requests per second.

$ hey -n 10000 -c 10 https://symbols.stage.mozaws.net/__lbheartbeat__
Summary:
  Total:    7.6604 secs
  Slowest:  0.0610 secs
  Fastest:  0.0018 secs
  Average:  0.0075 secs
  Requests/sec: 1305.4199
...

That basically means that all the extra "stuff" (memcache key prep, memcache key queries and possibly other high-latency network requests) it needs to do in the Django view takes up roughly 3x the time of the absolute minimal Django request-response rendering.

Also, if I use the same technique to bombard a single URL that involves most of the code steps but definitely doesn't require any slow ElastiCache writes or boto3 S3 reads, I get 800 requests per second:

$ hey -n 10000 -c 10 https://symbols.stage.mozaws.net/advapi32.pdb/5EFB9BF42CC64024AB64802E467394642/advapi32.sy
Summary:
  Total:    12.4160 secs
  Slowest:  0.0651 secs
  Fastest:  0.0024 secs
  Average:  0.0122 secs
  Requests/sec: 805.4150
  Total data:   300000 bytes
  Size/request: 30 bytes
...

Lesson learned

Max CPU Used
It's a recurring reminder that performance is almost all about latency. If it's not RAM or disk, it's networking. See the "Max CPU Used" graph, which basically shows that user, system and stolen CPU ("CPU spent waiting for the hypervisor to service another virtual CPU") never sum to more than 50%.

A neat trick to zip a git repo with a version number

September 1, 2017
4 comments Linux, Web development

I have this WebExtension addon. It's not very important. Just a web extension that does some hacks to GitHub pages when I open them in Firefox. The web extension is a folder with a manifest.json, icons/icon-48.png, tricks.js, README.md etc. To upload it to addons.mozilla.org I first have to turn the whole thing into a .zip file that I can upload.

So I discovered a neat way to make that zip file. It looks like this:

#!/bin/bash

DESTINATION=build-`cat manifest.json | jq -r .version`.zip
git archive --format=zip master > $DESTINATION

echo "Created..."
ls -lh $DESTINATION

You run it and it creates a build-1.0.zip file containing all the files that are checked into the git repo. So it discards my local "junk" such as backup files or other things that are mentioned in .gitignore (and .git/info/exclude).

I bet someone's going to laugh and say "Duhh! Of course!" but I didn't know you could do this so easily. Hopefully posting this will help someone trying to do something similar.

Note: this depends on jq, which is an amazing little program.

Ultrafast loading of CSS

September 1, 2017
3 comments Web development, JavaScript

tl;dr; The ideal web performance, with regards to CSS, is to inline the minimal CSS and lazy load the rest after load.

Two key things to understand/appreciate:

  1. The fastest performing web page is one that isn't blocked on rendering.

  2. You use some CSS framework kitchen sink because you're not a CSS guru.

How to deal with this?

Things like HTTP2 and CDNs and preload are nice because they make the network lookup for your main.88c468ef.css file as fast as possible. But what's even faster is to include the CSS in the HTML that the server responds with in the first place. Why? Because when the browser downloads your HTML (e.g. GET /) and parses the HTML document, it sees the <link rel="stylesheet" href="/main.88c468ef.css"> there and decides not to render any DOM to screen until that CSS file has been downloaded and parsed. It does this because it doesn't want to paint the DOM (as it would look without CSS) and then repaint it again, this time with the CSS rules applied.

Point number 2 basically boils down to the likely fact that your app depends on somecssframework.min.css like Bootstrap, Bulma or Foundation. They're large blobs of CSS for styling all sorts of HTML patterns (e.g. cards, tables, navbar menus etc.). These CSS frameworks are super useful because they make your app look pretty. But they're usually big. Really big.

Popular CSS frameworks:

Framework Size Gzipped
bootstrap.min.css 122K 18K
foundation.min.css 115K 16K
semantic.min.css 553K 93K
bulma.min.css 141K 18K

Actually the size difference isn't hugely important. What's important is that it's yet another thing that needs to be downloaded before the page can start to render. If the URL is in the user's cache, great. Even better, if it's cached by a service worker. However if you care about loading performance (judging by the fact that you're still reading), you know that a large majority of your visitors only come to your site sometimes (according to Google Analytics, 92.7% of my visitors are "new visitors"). Perhaps from a Google search. Or perhaps they visit sometimes but rarely enough that by the time they return their browser cache will have "moved on" and reset (to save disk space) what was previously cached.

CSS is a render blocking resource

With and without render blocking CSS
See Ilya Grigorik's primer on Render Blocking CSS.

It's also easy to demonstrate. Check out this Webpagetest Visual Comparison that compares two pages that are both styled with bootstrap.min.css except one of them uses a piece of JavaScript at the bottom of the page that enables the stylesheet after the page has loaded.

So if it's blocking, what to do about it? Well, make it non-blocking. But how?

Solution 1

The simplest solution is to simply move any <link rel="stylesheet" href="bootstrap.min.css"> out of the <head> and put them just before the </body> tag. Here's an example of that.

It's valid HTML5 and seems to work just fine in Safari iOS. The only problem is that pesky "Flash of Unstyled Content" (aka "FOUC") effect where the user is presented with the page very briefly without any styling, and then the whole page re-renders once the stylesheets have loaded. Chrome and iOS actually still block the rendering, so it's not like JavaScript, where putting it late in the DOM avoids the blocking. In other words, not really a good solution at all.

You can see in this Webpagetest that the "Start render" happens after the .css files have been loaded and parsed.

Solution 2

With JavaScript you can put in code that's definitely going to be executed after the rendering starts and, also, after the first rendering is finished (i.e. "DOM Content Loaded").

This technique is best done with loadCSS, which works really well if you tune it. In particular, the rel="preload" feature is getting more and more established. It used to only work in Chrome and Opera but will soon work in Firefox and iOS Safari too. Note that loadCSS contains a polyfill for rel="preload".

The basic idea is that you load a piece of JavaScript late which, as soon as it can, puts the <link rel="stylesheet" href="bootstrap.min.css"> into the DOM. You still have the Flash of Unstyled Content effect to contend with, and that's annoying.

Here's an example implementation. It uses the scripts and techniques laid out by filamentgroup's loadCSS.

It works, and the rel="preload" is a bonus for Chrome and Opera users because once the JavaScript "kicks in" the network loading is quite possibly already done. As seen in this Webpagetest using Chrome, the .css files start downloading before the lazyloadcss.js file has even started downloading.

It's not as hot in Firefox because the downloading of all the .css files is delayed until after lazyloadcss.js has loaded and executed.

Solution 3

Just inline all the CSS. Instead of <link rel="stylesheet" href="bootstrap.min.css"> you just make it inline. Like:


<style type="text/css">
/*!
 * Bootstrap v4.0.0-beta (https://getbootstrap.com)
 * Copyright 2011-2017 The Bootstrap Authors
 * Copyright 2011-2017 Twitter, Inc.
 * Licensed under MIT (https://github.com/twbs/bootstrap/blob/master/LICENSE)
 */@media print{*,::after,::before{text-shadow:none!important;box-shadow:none!important}......
</style>

All 123KB of it. Why not?! It has to be downloaded sooner or later anyway, might as well nip it in the bud straight away. The Flash of Unstyled Content problem goes away. So does the problem of having to load JavaScript tricks to make the CSS loading non-blocking.

The obvious and immediate caveat is that now the whole HTML document is huge! In this example page the whole HTML document is 127KB (20KB gzipped) whereas the regular one is 4.1KB (1.4KB gzipped). And if your visitors, if you're so lucky, click on any other internal link, that's another 127KB that has to be downloaded again.

The biggest caveat is that downloading a large HTML document is bad because no other resources (images for example) can be downloaded in parallel whilst the browser is working on rendering the page with what it's downloaded so far. If you compare this Webpagetest with the regular traditional one, you can see that it takes 354ms to download the HTML with all CSS inlined compared to 262ms when the CSS was linked. That's roughly 100ms wasted where the browser could start downloading other resources, like images.

Solution 4

Solution 3 was kinda good because it avoided the Flash of Unstyled Content and it avoided all extra resource loading. However, we can do better.

Instead of inlining all CSS, how about we take out exactly only the CSS we need out of bootstrap.min.css and just inline that. Then, after the page has loaded, we can download the rest of bootstrap.min.css and that way it's ready with all the other selectors and stuff it needs as the page probably changes and morphs depending on interactive JavaScript which is stuff that can and will happen after the initial load.

But how do you know exactly which CSS you need for that initial load? Really, you don't. You have two options:

  1. Manually inspect what DOM elements you have in your initial HTML and start slowly plucking the CSS you need out of the Bootstrap CSS file.

  2. Automate the inspection of what DOM elements you have in your initial HTML.

Before we dig deeper into how to automate the inspection, let's look at what it'd look like: This page and when Webpagetested. What's cool here is that the DOM is ready in 265ms (it was 262ms when there was no linked CSS).

Notice that there's no Flash of Unstyled Content. No external dependencies. It's basically an inline <style> block with exactly the selectors that are needed and nothing more. The HTML is larger, at 13KB (3.3KB gzipped), but remember it was 4.1KB when we started and the solution where we inlined everything was 127KB.

The immediate problem with this is that we're missing some nice CSS for things that haven't been needed yet. For example, there might be some JavaScript that changes the DOM based on something the user does with the page. For example, clicking on something that adds more elements to the DOM. Or, equally likely, after the DOM has loaded, an XHR query is made to download some data and display it in a way that needs CSS selectors that weren't included in the minimal set.

By the way, this very blog post builds on this solution. If you're on your desktop browser you can view source and see that there are only inline style blocks.

Solution 5

This builds on Solution 4. The HTML contains the minimal CSS needed for that first render and as soon as possible we additionally download the whole bootstrap.min.css so that it's available if/when the DOM mutates and needs the full CSS not in the minimal CSS.

Basically, let's take Solution 2 (JavaScript lazy-loads the CSS) + Solution 4 (the minimal CSS inlined). Here is one such solution.

And there we have it! The ideal solution. The only thing remaining is to verify that it actually makes a difference.

The Webpagetest Final Showdown

We have 5 solutions. Each one different from the next. Let's compare them against each other.

Here it is in its full glory

Visual comparison on WebPagetest.org
(image if you can't open the Webpagetest page right now)

What we notice:

  1. The regular do-nothing solution is 50% slower than the best solution. 3.2 seconds versus 2.2 seconds.
  2. Putting the <link rel="stylesheet" ...> tags at the bottom of the document doesn't work in Chrome and doesn't do anything good.
  3. Lazy loading the CSS with JavaScript (with no initial CSS) displays content very early but the repaint means it takes unnecessarily longer to load the whole thing.
  4. The ideal solution (Solution 5) loads as fast, visually, as Solution 4 but has the advantage that all CSS is there, eventually.
  5. Inlining all CSS (Solution 3) is only 23% slower than the ideal solution (Solution 5). But, it's much easier to implement. Seriously consider this if your tooling is limited.

Conclusion

One humbling thing to notice is that the difference isn't actually that huge. In this particular example we managed to go from 3.2 seconds to 2.2 seconds (using a 3G connection). The example playground used in this experiment is very far from a real site. Most likely, a real site is a lot more complex and full of many more potential bottlenecks that slow things down. For example, instead of obsessing over the CSS payload, perhaps you can make a bigger impact by simply dropping some excessive JavaScript plugins that might not necessarily be needed. Or you can focus on your 2.5MB total of big images.

However, a key ingredient of web performance is to leverage the loading time in the best possible way. If you get the CSS to not block rendering, your users' browsers can spend more time, sooner, on other resources such as images and XHR.

UPDATE March 2018

A lot of this work of figuring out the minimal CSS from a DOM has now been put into a rapidly maturing and well-tested Node.js project called minimalcss.

React lifecycle hooks must-have

August 13, 2017
1 comment Web development, JavaScript, React

I don't know who made this flowchart originally, but whoever you are: Thank you!

At this point in my React learning, I think I've memorized much of this, but it's taken me a lot of time and a lot of digging up the documentation again. (Also, not to mention the number of times I've typo'ed componentWillReciveProps and componentWillRecevieProps etc.)

Remember this: you don't need to know all of these by heart to be good at React. In fact, there are several of these that I almost never use.

React lifecycle hooks flowchart

UPDATE

The above link is dead. Use this blog post instead.

UPDATE April 2018

Here's an even better one from @dan_abramov:

React life-cycle hooks

Fastest *local* cache backend possible for Django

August 4, 2017
11 comments Python, Web development, Django

I did another couple of benchmarks of different cache backends in Django. This is an extension/update on Fastest cache backend possible for Django published a couple of months ago. This benchmarking isn't as elaborate as the last one. Fewer tests and fewer variables.

I have another app where I use a lot of caching. This web application will run its cache server on the same virtual machine. So no separation of cache server and web head(s). Just one Django server talking to localhost:11211 (memcached's default port) and localhost:6379 (Redis's default port).

Also in this benchmark, the keys were slightly smaller. To simulate my application's "realistic needs" I made the benchmark fall on roughly 80% cache hits and 20% cache misses. The cache keys were 1 to 3 characters long and the cache values were lists of strings, always 30 items long (e.g. len(['abc', 'def', 'cba', ... , 'cab']) == 30).

Also, in this benchmark I was too lazy to test all different parsers, serializers and compressors that django-redis supports. I only test python-memcached==1.58 versus django-redis==4.8.0 versus django-redis==4.8.0 && msgpack-python==0.4.8.
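
For a sense of what the measurement looks like, here's a rough sketch of that kind of benchmark loop (this is my own simplification, not the actual benchmark script):

import random
import time

from django.core.cache import caches

def benchmark(cache_alias, iterations=1000):
    cache = caches[cache_alias]          # e.g. 'memcache', 'redis' or 'redis_msgpack'
    value = ['abc', 'def', 'cba'] * 10   # always a list of 30 short strings
    keys = [str(i) for i in range(100)]  # short cache keys
    for key in keys[:80]:                # pre-warm 80% of the keys
        cache.set(key, value, 60)
    start = time.time()
    for _ in range(iterations):
        key = random.choice(keys)        # roughly 80% hits, 20% misses
        if cache.get(key) is None:
            cache.set(key, value, 60)
            cache.delete(key)            # keep the hit/miss ratio roughly constant
    return time.time() - start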

The results are quite "boring". There's basically not enough difference to matter.

Config Average Median Compared to fastest
memcache 4.51s 3.90s 100%
redis 5.41s 4.61s 84.7%
redis_msgpack 5.16s 4.40s 88.8%

UPDATE

As Hal pointed out in the comments, when you know the web server and the memcached server are on the same computer you should use UNIX sockets. They're "obviously" faster since they skip the TCP networking overhead, at the cost of not working over a network.

Because running memcached on a socket on OSX is a hassle I only have one benchmark. Note! This basically compares good old django.core.cache.backends.memcached.MemcachedCache with two different locations.

Config Average Median Compared to fastest
127.0.0.1:11211 3.33s 3.34s 81.3%
unix:/tmp/memcached.sock 2.66s 2.71s 100%
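
In Django settings terms, the only thing that changes between those two rows is the LOCATION. A sketch (the socket path is whatever memcached was started with via its -s option):

CACHES = {
    "default": {
        "BACKEND": "django.core.cache.backends.memcached.MemcachedCache",
        "LOCATION": "127.0.0.1:11211",
        # ...or, if memcached was started with `-s /tmp/memcached.sock`:
        # "LOCATION": "unix:/tmp/memcached.sock",
    }
}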

But there's more! Another option is to use pylibmc, which is a Python memcached client written in C. By the way, the Python I use for these microbenchmarks is Python 3.5.

Unfortunately I'm too lazy/too busy to do a matrix comparison of pylibmc on TCP versus UNIX socket. Here are the comparison results of using python-memcached versus pylibmc:

Client Average Median Compared to fastest
python-memcached 3.52s 3.52s 62.9%
pylibmc 2.31s 2.22s 100%
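
Switching client library doesn't require any application code changes, just a different cache backend in settings. A minimal sketch (assumes pylibmc is installed):

CACHES = {
    "default": {
        # Django's built-in pylibmc-based memcached backend
        "BACKEND": "django.core.cache.backends.memcached.PyLibMCCache",
        "LOCATION": "127.0.0.1:11211",
    }
}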

UPDATE 2

As luck would have it, someone else has already done the matrix comparison of python-memcached vs pylibmc on TCP vs UNIX socket:

https://plot.ly/~jensens/36.embed

Why I'm ditching AdBuff on Songsear.ch

July 20, 2017
0 comments Web development

I'm a performance nerd and if something isn't as fast as it can be it hurts my soul.
I have this side project called SongSear.ch. It's a lyrics search engine with over 2 million songs.

To try to make a buck to pay for the hosting cost, I put in ads. The only one I could find that does NOT use document.write is AdBuff. But their technical implementation, pardon my French, sucks! It's redirects upon redirects in an iframe over HTTP. Granted, it loads async but it's still dragging down performance for people on low-end devices and on mobile networks.

Today I decided to take the AdBuff ads off. Let's see if it made a difference performance-wise:

Before

Load waterfall WITH ads

After

Load waterfall WITHOUT ads

Basically, 2.7 seconds (on LTE) instead of 14.3 seconds. And 211KB of data instead of 1MB.

So how much money do I stand to lose for ditching these ads? Well, I've earned a grand total of $10.82 for 1,214,072 impressions. That's what I spend on hosting this project every 6 days. Clearly this isn't working out.

UPDATE

Adversal.com emailed me that they rolled out some new features including "Blazing fast ad code". Sure. Let's try it then.
Elements of it might be fast but nothing is fast when it needs some ~80 extra requests.

Adversal downloads a LOT of stuff

Fastest way to find out if a file exists in S3 (with boto3)

June 16, 2017
9 comments Python, Web development

tl;dr; It's faster to list objects with the prefix being the full key path than to use HEAD to find out if an object is in an S3 bucket.

Background

I have a piece of code that opens up a user uploaded .zip file and extracts its content. Then it uploads each file into an AWS S3 bucket if the file size is different or if the file didn't exist at all before.

It looks like this:


for filename, filesize, fileobj in extract(zip_file):
    size = _size_in_s3(bucket, filename)
    if size is None or size != filesize:
        upload_to_s3(bucket, filename, fileobj)
        print('Updated!' if size else 'New!')
    else:
        print('Ignored')

I'm using the boto3 S3 client so there are two ways to ask if the object exists and get its metadata.

Option 1: client.head_object

Option 2: client.list_objects_v2 with Prefix=${keyname}.

But why the two different approaches?

The problem with client.head_object is that it's odd in how it works. Sane but odd. If the object does not exist, boto3 raises a botocore.exceptions.ClientError which contains a response and in it you can look for exception.response['Error']['Code'] == '404'.

What I noticed was that if you use a try:except ClientError: approach to figure out if an object exists, you reset the client's connection pool in urllib3. So after an exception has happened, any other operations on the client causes it to have to, internally, create a new HTTPS connection. That can cost time.

I wrote and filed this issue on github.com/boto/boto3.

So I wrote two different functions to return an object's size if it exists:


from botocore.exceptions import ClientError

def _key_existing_size__head(client, bucket, key):
    """return the key's size if it exists, else None"""
    try:
        obj = client.head_object(Bucket=bucket, Key=key)
        return obj['ContentLength']
    except ClientError as exc:
        if exc.response['Error']['Code'] != '404':
            raise

And the contender...:


def _key_existing_size__list(client, bucket, key):
    """return the key's size if it exist, else None"""
    response = client.list_objects_v2(
        Bucket=bucket,
        Prefix=key,
    )
    for obj in response.get('Contents', []):
        if obj['Key'] == key:
            return obj['Size']

They both work. That was easy to test. But which is fastest?

Before we begin, which do you think is fastest? The head_object feels like it'll be able to send an operation to S3 internally to do a key lookup directly. But S3 isn't a normal database.

Here's the script, partially cleaned up, but it should be easy to run.

The results

So I wrote a loop that ran 1,000 times, and I made sure the bucket was empty, so that on every iteration it sees that the file doesn't exist and has to do a client.put_object.
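
Roughly, the loop looked something like this (a simplified sketch of the linked script, using the two functions above; the bucket name and body are placeholders):

import random
import time
from collections import defaultdict

import boto3

client = boto3.client('s3')
bucket = 'my-benchmark-bucket'  # placeholder
functions = [_key_existing_size__list, _key_existing_size__head]
timings = defaultdict(list)

for i in range(1000):
    key = 'benchmark/{}.txt'.format(i)  # uniquely named keys
    func = random.choice(functions)
    start = time.time()
    size = func(client, bucket, key)
    if size is None:                    # not there yet, so upload it
        client.put_object(Bucket=bucket, Key=key, Body=b'some bytes')
    timings[func.__name__].append(time.time() - start)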

Here are the results:

FUNCTION: _key_existing_size__list Used 511 times
    SUM    148.2740752696991
    MEAN   0.2901645308604679
    MEDIAN 0.2569708824157715
    STDEV  0.17742598775696436

FUNCTION: _key_existing_size__head Used 489 times
    SUM    249.79622673988342
    MEAN   0.510830729529414
    MEDIAN 0.4780092239379883
    STDEV  0.14352671121877011

Because it's network bound, it's really important to avoid the 'MEAN' and instead look at the 'MEDIAN'. My home broadband can cause temporary spikes.

Clearly, using client.list_objects_v2 is faster. It's 90% faster than client.head_object.

But note! This was 1,000 times of A) "does the file already exist?" and B) "No? OK, upload it". So the times there include all the client.put_object calls.

So why did I measure both? I.e. _key_existing_size__list + client.put_object versus _key_existing_size__head + client.put_object? The reason is that the approach of using try:except ClientError: followed by a client.put_object causes boto3 to create a new HTTPS connection in its pool. Again, see the issue which demonstrates this in different words.

What if the object always exists?

So, I simply ran the benchmark again. The first time, it had uploaded all 1,000 uniquely named objects. So running it a second time, every time the answer is that the object exists and its size hasn't changed, so it never triggers the client.put_object.

Here are the results this time:

FUNCTION: _key_existing_size__list Used 495 times
    SUM    54.60546112060547
    MEAN   0.11031406286991004
    MEDIAN 0.08583354949951172
    STDEV  0.06339202669609442

FUNCTION: _key_existing_size__head Used 505 times
    SUM    44.59347581863403
    MEAN   0.0883039125121466
    MEDIAN 0.07310152053833008
    STDEV  0.054452842190700346

In this case, using client.head_object is faster. By 20%, but the median time is 0.08 seconds! Even on a home broadband connection. In other words, I don't think that difference is significant.

One more time, excluding the client.put_object

The point of using client.list_objects_v2 instead of client.head_object was to avoid breaking the connection pool in urllib3 that boto3 manages somehow. Having to create a new HTTPS connection (and adding it to the pool) costs time, but what if we disregard that and compare the two functions "purely" on how long they take when the file does NOT exist? Remember, the second measurement above was when every object exists.

So we know it took 0.09 seconds and 0.07 seconds respectively for the two functions to figure out that the object does exist. How long does it take to figure out that the object does not exist, independent of any other operation? I.e. just try each one without doing a client.put_object afterwards. That way we avoid the bug, so the comparison is fair.

The results:

FUNCTION: _key_existing_size__list Used 499 times
    SUM    123.57429671287537
    MEAN   0.247643881188127
    MEDIAN 0.2196049690246582
    STDEV  0.18622877427652743

FUNCTION: _key_existing_size__head Used 501 times
    SUM    112.99495434761047
    MEAN   0.22553883103315464
    MEDIAN 0.2828958034515381
    STDEV  0.15342842113446084

The client.list_objects_v2 beats client.head_object by 30%. And it matters. Above I said that the 20% difference didn't matter but now it does. That's because the time difference when it always finds the object was 0.013 seconds. When it comes to figuring out that the object did not exist, the time difference is 0.063 seconds. That's still a pretty small number but, hey, you gotta draw the line somewhere.

In conclusion

Using client.list_objects_v2 is a better alternative to using client.head_object.

If you think you'll often find that the object doesn't exist and needs a client.put_object, then using client.list_objects_v2 is 90% faster. If you think you'll rarely need client.put_object (i.e. most objects don't change), then client.list_objects_v2 has almost the same performance.

Experimenting with Guetzli

May 24, 2017
0 comments Linux, Web development, macOS

tl;dr; Guetzli, the new JPEG compression program from Google, can save bytes with little loss of quality.

Inspired by this blog post about Guetzli I thought I'd try it out with something that's relevant to my project, 300x300 JPGs that can be heavily compressed.

So I installed it (with Homebrew) on my MacBook Pro (late 2013) and picked 7 JPGs that I had and use in SongSearch. This is interesting because these JPEGs have already been compressed once: they were created by converting much larger PNGs with PIL (Pillow) at a quality rating of 80%. In other words, this is Guetzli on top of PIL.

I ran one iteration for every image for the following qualities: 85%, 90%, 95%, 99%, 100%.
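
The runs themselves were just the guetzli command line applied to each file. A rough sketch of that kind of loop (the directory names are made up; it assumes the guetzli binary is on your $PATH):

import os
import subprocess

QUALITIES = (85, 90, 95, 99, 100)

for filename in os.listdir('originals'):
    source = os.path.join('originals', filename)
    for quality in QUALITIES:
        destination = os.path.join('compressed', '{}-{}'.format(quality, filename))
        subprocess.check_call([
            'guetzli', '--quality', str(quality), source, destination,
        ])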

The results on the size are as follows:

Image Average Size (bytes) % Smaller
original 23497.0 0
85% 16025.4 32%
90% 18829.4 20%
95% 21338.1 9.2%
99% 22705.3 3.4%
100% 22919.7 2.5%

So, for example, if you choose the 90% quality you save, on average, 4,667B (4.6KB).

As you might already know, Guetzli is incredibly memory hungry and very, very slow. Each image compression took on average 4-6 seconds (higher quality, shorter times). Meaning, if you like Guetzli you probably need to build around it so that the compression happens in a build step or asynchronously somewhere, and ideally you don't want to run too many compressions in parallel as it might overload CPU and memory.

Now, how does it look?

Go to https://codepen.io/peterbe/pen/rmPMpm and stare at the screen to see if you can A) see which one is more compressed and B) if the one that is more compressed is too low quality.

What do you think?

Is it worth it?

Is the quality drop too much to save 10% on image sizes?

Please share your thoughts. Perhaps we can re-do this experiment with some slightly larger JPGs.

Fastest Redis configuration for Django

May 11, 2017
1 comment Python, Linux, Web development, Django

I have an app that does a lot of Redis queries. It all runs in AWS with ElastiCache Redis. Due to the nature of the app, it stores really large hash tables in Redis. The application then depends on querying Redis for these. The question is; What is the best configuration possible for the fastest service possible?

Note! Last month I wrote Fastest cache backend possible for Django which looked at comparing Redis against Memcache. Might be an interesting read too if you're not sold on Redis.

Options

All options are variations on the compressor, serializer and parser, which are things you can override in django-redis. All have an effect on performance. Even compression matters: if the number of bytes sent between Redis and the application is smaller, network throughput should be better.

Without further ado, here are the variations:


CACHES = {
    "default": {
        "BACKEND": "django_redis.cache.RedisCache",
        "LOCATION": config('REDIS_LOCATION', 'redis://127.0.0.1:6379') + '/0',
        "OPTIONS": {
            "CLIENT_CLASS": "django_redis.client.DefaultClient",
        }
    },
    "json": {
        "BACKEND": "django_redis.cache.RedisCache",
        "LOCATION": config('REDIS_LOCATION', 'redis://127.0.0.1:6379') + '/1',
        "OPTIONS": {
            "CLIENT_CLASS": "django_redis.client.DefaultClient",
            "SERIALIZER": "django_redis.serializers.json.JSONSerializer",
        }
    },
    "ujson": {
        "BACKEND": "django_redis.cache.RedisCache",
        "LOCATION": config('REDIS_LOCATION', 'redis://127.0.0.1:6379') + '/2',
        "OPTIONS": {
            "CLIENT_CLASS": "django_redis.client.DefaultClient",
            "SERIALIZER": "fastestcache.ujson_serializer.UJSONSerializer",
        }
    },
    "msgpack": {
        "BACKEND": "django_redis.cache.RedisCache",
        "LOCATION": config('REDIS_LOCATION', 'redis://127.0.0.1:6379') + '/3',
        "OPTIONS": {
            "CLIENT_CLASS": "django_redis.client.DefaultClient",
            "SERIALIZER": "django_redis.serializers.msgpack.MSGPackSerializer",
        }
    },
    "hires": {
        "BACKEND": "django_redis.cache.RedisCache",
        "LOCATION": config('REDIS_LOCATION', 'redis://127.0.0.1:6379') + '/4',
        "OPTIONS": {
            "CLIENT_CLASS": "django_redis.client.DefaultClient",
            "PARSER_CLASS": "redis.connection.HiredisParser",
        }
    },
    "zlib": {
        "BACKEND": "django_redis.cache.RedisCache",
        "LOCATION": config('REDIS_LOCATION', 'redis://127.0.0.1:6379') + '/5',
        "OPTIONS": {
            "CLIENT_CLASS": "django_redis.client.DefaultClient",
            "COMPRESSOR": "django_redis.compressors.zlib.ZlibCompressor",
        }
    },
    "lzma": {
        "BACKEND": "django_redis.cache.RedisCache",
        "LOCATION": config('REDIS_LOCATION', 'redis://127.0.0.1:6379') + '/6',
        "OPTIONS": {
            "CLIENT_CLASS": "django_redis.client.DefaultClient",
            "COMPRESSOR": "django_redis.compressors.lzma.LzmaCompressor"
        }
    },
}

As you can see, they each have a variation on the OPTIONS.PARSER_CLASS, OPTIONS.SERIALIZER or OPTIONS.COMPRESSOR.

The default configuration is to use redis-py and to pickle the Python objects to a bytestring. Pickling in Python is pretty fast but it has the disadvantage that it's Python specific so you can't have a Ruby application reading the same Redis database.

The Experiment

Note how I have one LOCATION per configuration. That's crucial for the sake of testing. That way one database is all JSON and another is all gzip etc.

What the benchmark does is measure how long it takes to READ a specific key (called benchmarking). Then, once it's done that, it appends that time to the previous value (or [] if it was the first time). And lastly it writes that list back into the database. That way, towards the end you have 1 key whose value looks something like this: [0.013103008270263672, 0.003879070281982422, 0.009411096572875977, 0.0009970664978027344, 0.0002830028533935547, ..... MANY MORE ....].

Towards the end, each of these lists is pretty big: about 500 to 1,000 measurements, depending on the benchmark run.
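
In Django view terms, a rough sketch of what each request to /random does (the view and names here are mine, not the actual benchmark code):

import random
import time

from django.core.cache import caches
from django.http import HttpResponse

CONFIGURATIONS = ['default', 'json', 'ujson', 'msgpack', 'hires', 'zlib', 'lzma']

def random_benchmark(request):
    alias = random.choice(CONFIGURATIONS)
    cache = caches[alias]
    start = time.time()
    times = cache.get('benchmarking')   # the READ being measured
    elapsed = time.time() - start
    times = (times or []) + [elapsed]
    cache.set('benchmarking', times, 60 * 60)
    return HttpResponse(alias)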

In the experiment I used wrk to basically bombard the Django server on the URL /random (which makes a measurement with a random configuration). On the EC2 experiment node, it settles at around 1,300 requests per second, which is a decent number for an application that does a fair amount of writes.

The way I run the Django server is with uwsgi like this:

uwsgi --http :8000 --wsgi-file fastestcache/wsgi.py --master --processes 4 --threads 2

And the wrk command like this:

wrk -d30s  "http://127.0.0.1:8000/random"

(that, by default, runs 2 threads on 10 connections)

At the end of the benchmarking, I open http://localhost:8000/summary which spits out a table and some simple charts.

An Important Quirk

Time measurements over time
One thing I noticed when I started was that the final numbers' average was very different from the medians. That would indicate that there are spikes. The graph on the right shows the times put into that huge Python list for the default configuration for the first 200 measurements. Note that there are little spikes, but it's generally quite flat over time once it gets past the beginning.

Sure enough, it turns out that in almost all configurations, the time it takes to make the query in the beginning is almost an order of magnitude slower than the times once the benchmark has been running for a while.

So in the test code you'll see that it chops off the first 10 times. Perhaps it should be more than 10. After all, if you don't like the spikes you can simply look at the median as the best source of conclusive truth.

The Code

The benchmarking code is here. Please be aware that this is quite rough. I'm sure there are many things that can be improved, but I'm not sure I'm going to keep this around.

The Equipment

The ElastiCache Redis I used was a cache.m3.xlarge (13 GiB, High network performance) with 0 shards and 1 node and no multi-zone enabled.

The EC2 node was a m4.xlarge Ubuntu 16.04 64-bit (4 vCPUs and 16 GiB RAM with High network performance).

Both the Redis and the EC2 were run in us-west-1c (North Virginia).

The Results

Here are the results! Sorry if it looks terrible on mobile devices.

root@ip-172-31-2-61:~# wrk -d30s  "http://127.0.0.1:8000/random" && curl "http://127.0.0.1:8000/summary"
Running 30s test @ http://127.0.0.1:8000/random
  2 threads and 10 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency     9.19ms    6.32ms  60.14ms   80.12%
    Req/Sec   583.94    205.60     1.34k    76.50%
  34902 requests in 30.03s, 2.59MB read
Requests/sec:   1162.12
Transfer/sec:     88.23KB
                         TIMES        AVERAGE         MEDIAN         STDDEV
json                      2629        2.596ms        2.159ms        1.969ms
msgpack                   3889        1.531ms        0.830ms        1.855ms
lzma                      1799        2.001ms        1.261ms        2.067ms
default                   3849        1.529ms        0.894ms        1.716ms
zlib                      3211        1.622ms        0.898ms        1.881ms
ujson                     3715        1.668ms        0.979ms        1.894ms
hires                     3791        1.531ms        0.879ms        1.800ms

Best Averages (shorter better)
###############################################################################
██████████████████████████████████████████████████████████████   2.596  json
█████████████████████████████████████                            1.531  msgpack
████████████████████████████████████████████████                 2.001  lzma
█████████████████████████████████████                            1.529  default
███████████████████████████████████████                          1.622  zlib
████████████████████████████████████████                         1.668  ujson
█████████████████████████████████████                            1.531  hires
Best Medians (shorter better)
###############################################################################
███████████████████████████████████████████████████████████████  2.159  json
████████████████████████                                         0.830  msgpack
████████████████████████████████████                             1.261  lzma
██████████████████████████                                       0.894  default
██████████████████████████                                       0.898  zlib
████████████████████████████                                     0.979  ujson
█████████████████████████                                        0.879  hires


Size of Data Saved (shorter better)
###############################################################################
█████████████████████████████████████████████████████████████████  60K  json
██████████████████████████████████████                             35K  msgpack
████                                                                4K  lzma
█████████████████████████████████████                              35K  default
█████████                                                           9K  zlib
████████████████████████████████████████████████████               48K  ujson
█████████████████████████████████████                              34K  hires

Discussion Points

  • There is very little difference once you avoid the json serialized one.
  • msgpack is the fastest by a tiny margin. I prefer median over average because it's more important how it behaves over a long period of time.
  • The default (which is pickle) is fast too.
  • lzma and zlib compress the strings very well. Worth thinking about the fact that zlib is a very universal tool and makes the app "Python agnostic".
  • You probably don't want to use the json serializer. It's fat and slow.
  • Using hires makes very little difference. That's a bummer.
  • Considering how useful zlib is (since you can fit so much much more data in your Redis) it's impressive that it's so fast too!
  • I quite like zlib. If you use that on the pickle serializer you're able to save ~3.5 times as much data.
  • Laugh all you want but until today I had never heard of lzma. So based on that odd personal fact, I'm pessimistic towards it as a compression choice.

Conclusion

This experiment has led me to the conclusion that the best serializer is msgpack and the best compression is zlib. That is the best configuration for django-redis.

msgpack has implementation libraries for many other programming languages. Right now that doesn't matter for my application but if msgpack is both faster and more versatile (because it supports multiple languages) I conclude that to be the best serializer instead.