To enable Tracking Protection for performance

November 1, 2017
0 comments Web development, Mozilla

In Firefox it's really easy to switch Tracking Protection on and off. I usually have it off in my main browser because, as a web developer, it gives me an insight into what others would see, and that's often helpful.

Tracking Protection has been mentioned many times before, not just as a means to avoid leaving digital footprints but also as a way to browse faster.

An Example of Performance

I just wanted to show such an example. In both examples I load a blog post on seriouseats.com which has real content and lots of images (11 non-ad images, totalling ~600KB).

First, without Tracking Protection

Without Tracking Protection

Next, with Tracking Protection

With Tracking Protection

This is the Network waterfall view of all the requests needed to make up the page. The numbers at the very bottom are the interesting ones.

Without Tracking Protection

  • 555 requests
  • 8.61MB transferred
  • Total loading time 59 seconds

With Tracking Protection

  • 48 requests
  • 1.87MB transferred
  • Total loading time 3.3 seconds

It should be pointed out that the first version, without Tracking Protection, actually reaches document complete at about 20 seconds. That's when the page-loading icon in the browser toolbar stops spinning. The remaining time is spent on more downloads (for ads), triggered internally by JavaScript that executes when the page load event fires. So you can already start reading the main content, and all content-related images are ready at that point, but there's still ~30 seconds of excessive browser and network activity just to download the ads.

Enable Tracking Protection

Enabling Tracking Protection in Firefox is really easy. It's not an addon or anything else that needs to be installed. Just click Preferences, then click the Privacy & Security tab, scroll down a little and look for Tracking Protection. There choose the Always option. That's it.

Discussion

Tracking Protection is a big and involved topic. It digs into the realm of privacy and the right to your digital footprint. It also digs into the realm of making it harder (or easier, depending on how you phrase it) for content creators to generate revenue to be able to keep creating content.

The takeaway is that it can mean many things, but for people who just want to browse the web much, much faster, it can be just about performance.

django-cache-memoize

October 27, 2017
3 comments Python, Django

Released a new package today: django-cache-memoize

It's actually quite simple; a Python memoize function that uses Django's cache, plus the added trick that you can invalidate the cache by making the same function call with the same parameters if you just add .invalidate to your function.
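In code, a minimal sketch of what that looks like (expensive_lookup is a hypothetical function; the decorator and its .invalidate attribute are the package's API):


from cache_memoize import cache_memoize


@cache_memoize(60 * 5)  # cache the result for 5 minutes in Django's cache
def expensive_lookup(user_id):
    ...  # imagine something really slow here


expensive_lookup(123)             # slow; the result gets cached
expensive_lookup(123)             # fast; comes straight from the cache
expensive_lookup.invalidate(123)  # deletes the cached value for these args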

The history of it is from my recent Mozilla work on Symbols.

I originally copy and pasted the snippet into a blog post, and today I extracted it out into its own project with tests, docs, CI and a setup.py.

I'm still amazed how long it takes to make a package with all the "fluff" around it. A lot of the bits in here (like setup.py and pytest.ini etc.) are copied from other nicely maintained Python packages. For example, I straight up copied the tox.ini from Jannis Leidel's python-dockerflow. The actual code writing (including tests!) is far outweighed by the packaging sit-ups. But I "complain with a pinch of salt" because a lot of the time was spent writing documentation, and that's probably just as important as the code.

Concurrent Gzip in Python

October 13, 2017
11 comments Python, Linux, Docker

Suppose you have a bunch of files you need to Gzip in Python; what's the optimal way to do that? In serial, to avoid saturating the GIL? In multiprocessing, to spread the load across CPU cores? Or with threads?

I needed to know this for symbols.mozilla.org since it does a lot of Gzip'ing. In symbols.mozilla.org clients upload a zip file full of files. A lot of them are plain text and when uploaded to S3 it's best to store them gzipped. Basically it does this:


import gzip
import os
from io import BytesIO


def upload_sym_file(s3_client, payload, bucket_name, key_name):
    file_buffer = BytesIO()
    with gzip.GzipFile(fileobj=file_buffer, mode='w') as f:
        f.write(payload)
    file_buffer.seek(0, os.SEEK_END)
    size = file_buffer.tell()
    file_buffer.seek(0)
    s3_client.put_object(
        Bucket=bucket_name,
        Key=key_name,
        Body=file_buffer
    )
    print(f"Uploaded {size}")

Another important thing to consider before jumping into the benchmark is to appreciate the context of this application; the bundles of files I need to gzip are often many but smallish. The average file size of the files that need to be gzip'ed is ~300KB. And each bundle is between 5 and 25 files.

The Benchmark

For the sake of the benchmark here, all it does is figure out the size of each gzipped buffer and report that as a list.
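The _get_size helper isn't shown below; a minimal sketch of what it might look like, gzipping the payload in memory much like upload_sym_file above:


import gzip
from io import BytesIO


def _get_size(payload):
    # Gzip the payload into an in-memory buffer and return the
    # compressed size in bytes.
    file_buffer = BytesIO()
    with gzip.GzipFile(fileobj=file_buffer, mode='w') as f:
        f.write(payload)
    return file_buffer.tell()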

f1 - Basic serial


def f1(payloads):
    sizes = []
    for payload in payloads:
        sizes.append(_get_size(payload))
    return sizes

f2 - Using multiprocessing.Pool


def f2(payloads):  # multiprocessing
    with multiprocessing.Pool() as p:
        sizes = p.map(_get_size, payloads)
    return sizes

f3 - Using concurrent.futures.ThreadPoolExecutor


def f3(payloads):  # concurrent.futures.ThreadPoolExecutor
    sizes = []
    futures = []
    with concurrent.futures.ThreadPoolExecutor() as executor:
        for payload in payloads:
            futures.append(
                executor.submit(
                    _get_size,
                    payload
                )
            )
        for future in concurrent.futures.as_completed(futures):
            sizes.append(future.result())
    return sizes

f4 - Using concurrent.futures.ProcessPoolExecutor


def f4(payloads):  # concurrent.futures.ProcessPoolExecutor
    sizes = []
    futures = []
    with concurrent.futures.ProcessPoolExecutor() as executor:
        for payload in payloads:
            futures.append(
                executor.submit(
                    _get_size,
                    payload
                )
            )
        for future in concurrent.futures.as_completed(futures):
            sizes.append(future.result())
    return sizes

Note that when using asynchronous methods like this, the order of the returned items is not the same as the order they were submitted in. An easy remedy, if you need the results back in order, is to use a dictionary instead of a list. Then you can map each key (or index, if you like) to a value.
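For example, a sketch of that dictionary trick applied to f3 (f3_ordered is a made-up name):


import concurrent.futures


def f3_ordered(payloads):
    # Map each future back to the index it was submitted with, so the
    # results can be reassembled in submission order afterwards.
    with concurrent.futures.ThreadPoolExecutor() as executor:
        future_to_index = {
            executor.submit(_get_size, payload): index
            for index, payload in enumerate(payloads)
        }
        sizes = {}
        for future in concurrent.futures.as_completed(future_to_index):
            sizes[future_to_index[future]] = future.result()
    return [sizes[index] for index in range(len(payloads))]


(And if you don't need as_completed at all, executor.map returns the results in submission order.)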

The Results

I ran this on three different .zip files of different sizes. To get some sanity into the benchmark I made it print out how many bytes it has to process and how many bytes the gzipped output ends up being.

# files 66
Total bytes to gzip 140.69MB
Total bytes gzipped 14.96MB
Total bytes shaved off by gzip 125.73MB

# files 103
Total bytes to gzip 331.57MB
Total bytes gzipped 66.90MB
Total bytes shaved off by gzip 264.67MB

# files 26
Total bytes to gzip 86.91MB
Total bytes gzipped 8.28MB
Total bytes shaved off by gzip 78.63MB

Sorry for being aesthetically handicapped when it comes to using Google Docs, but here goes...


This demonstrates the median time it takes each function to complete, for each of the three different files.

In all three files I tested, doing it serially (f1) is clearly the worst. Presumably because my laptop has more than one CPU core and the serial version leaves the others unused. Another pertinent thing to notice is that when the work is really big (the middle 4 bars), the difference between serial and concurrent isn't as big.

That second zip file contained a single file that was 80MB. The largest in the other two files were 18MB and 22MB.


This is the mean across all medians grouped by function and each compared to the slowest.

I call this the "bestest graph". It's a combination across all different sizes and basically concludes which one is the best, which clearly is function f3 (the one using concurrent.futures.ThreadPoolExecutor).

CPU Usage

This is probably the best way to explain how the CPU is used; I ran each function repeatedly, then opened gtop and took a screenshot of the list of processes sorted by CPU percentage.

f1 - Serially

f1
No distractions but it takes 100% of one CPU to work.

f2 - multiprocessing.Pool

f2
My laptop has 8 CPU cores, but I see 9 Python processes here; presumably the 8 pool workers plus the parent process.
I don't know why each CPU isn't at 100% but I guess there's some administrative overhead to starting processes in Python.

f3 - concurrent.futures.ThreadPoolExecutor

f3
One process, with roughly 5 x 8 = 40 threads (the ThreadPoolExecutor default in Python 3.6 is 5 × the number of CPU cores) GIL-swapping back and forth, but all in all it manages to keep itself very busy, since threads are lightweight and share memory.

f4 - concurrent.futures.ProcessPoolExecutor

f4
This is actually kinda like multiprocessing.Pool but with a different (arguably easier) API.

Conclusion

By a small margin, concurrent.futures.ThreadPoolExecutor won. That's despite not being able to use all CPU cores. This, pseudo-scientifically, proves that the low overhead of starting threads (remember, the average number of files in each .zip is ~65) is worth more than being able to use all the CPUs.

Discussion

There's an interesting twist to this! At least for my use case...

In the application I'm working on, there's actually a lot more that needs to be done than just gzip'ping some blobs of files. For each file I need to do a HEAD query to AWS S3 and a PUT query to AWS S3 too. So what I actually need to do is create an instance of client = botocore.client.S3 that I use to call client.list_objects_v2 and client.put_object.

When you create an instance of botocore.client.S3, botocore will automatically instantiate itself with credentials from os.environ['AWS_ACCESS_KEY_ID'] etc. (or read them from a ~/.aws file). Once created, if you ask it to do many different network operations, internally it relies on urllib3.poolmanager.PoolManager, which is a pool of 10 HTTP connections that get reused.

So when you run the serial version you can re-use the client instance for every file you process, but you can only use one HTTP connection in the pool. With the concurrent.futures.ThreadPoolExecutor it can not only re-use the same instance of botocore.client.S3, it can also cycle through all the HTTP connections in the pool.

The process-based alternatives, like multiprocessing.Pool and concurrent.futures.ProcessPoolExecutor, cannot re-use the botocore.client.S3 instance since it's not picklable. And each worker has to create a new HTTP connection for every single file.
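For illustration, a rough sketch of the thread-sharing version (the bucket name and the payloads dict are made up; boto3 clients are documented to be thread-safe):


import concurrent.futures

import boto3

# One client, created once, shared by all threads. Its internal urllib3
# connection pool lets concurrent threads re-use HTTP connections.
s3_client = boto3.client('s3')


def upload_one(key_name, payload):
    s3_client.put_object(
        Bucket='my-symbols-bucket',  # hypothetical bucket name
        Key=key_name,
        Body=payload,
    )


def upload_all(payloads):  # payloads is a dict of key_name -> bytes
    with concurrent.futures.ThreadPoolExecutor() as executor:
        futures = [
            executor.submit(upload_one, key_name, payload)
            for key_name, payload in payloads.items()
        ]
        for future in concurrent.futures.as_completed(futures):
            future.result()  # re-raises any upload exception here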

So, the conclusion of the above rambling is that concurrent.futures.ThreadPoolExecutor is really awesome! Not only did it perform excellently in the Gzip benchmark, it has the added bonus that it can share instance objects and HTTP connections.

Simple or fancy UPSERT in PostgreSQL with Django

October 11, 2017
7 comments Python, Web development, Django, PostgreSQL

As of PostgreSQL 9.5 we have UPSERT support. Technically, it's ON CONFLICT, but it's basically a way to execute an UPDATE statement in case the INSERT triggers a conflict on some column value. By the way, here's a great blog post that demonstrates how to use ON CONFLICT.

In this Django app I have a model that has a field called hash which has a unique=True index on it. What I want to do is either insert a row, or if the hash is already in there, it should increment the count and the modified_at timestamp instead.

The Code(s)

Here's the basic version in "pure Django ORM":


from django.db.models import F
from django.utils import timezone


if MissingSymbol.objects.filter(hash=hash_).exists():
    MissingSymbol.objects.filter(hash=hash_).update(
        count=F('count') + 1,
        modified_at=timezone.now()
    )
else:
    MissingSymbol.objects.create(
        hash=hash_,
        symbol=symbol,
        debugid=debugid,
        filename=filename,
        code_file=code_file or None,
        code_id=code_id or None,
    )

Here's that same code rewritten in "pure SQL":


from django.db import connection


with connection.cursor() as cursor:
    cursor.execute("""
        INSERT INTO download_missingsymbol (
            hash, symbol, debugid, filename, code_file, code_id,
            count, created_at, modified_at
        ) VALUES (
            %s, %s, %s, %s, %s, %s,
            1, CLOCK_TIMESTAMP(), CLOCK_TIMESTAMP()
          )
        ON CONFLICT (hash)
        DO UPDATE SET
            count = download_missingsymbol.count + 1,
            modified_at = CLOCK_TIMESTAMP()
        WHERE download_missingsymbol.hash = %s
        """, [
            hash_, symbol, debugid, filename,
            code_file or None, code_id or None,
            hash_
        ]
    )

Both work.

Note the use of CLOCK_TIMESTAMP() instead of NOW(). Since Django wraps all writes in transactions, if you use NOW() it will evaluate to the same value for the whole transaction, which makes unit testing really hard.
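A quick sketch that demonstrates the difference (PostgreSQL freezes NOW() at the start of the transaction, whereas CLOCK_TIMESTAMP() is the actual wall clock):


from django.db import connection, transaction

with transaction.atomic():
    with connection.cursor() as cursor:
        cursor.execute("SELECT NOW(), CLOCK_TIMESTAMP()")
        first = cursor.fetchone()
        cursor.execute("SELECT NOW(), CLOCK_TIMESTAMP()")
        second = cursor.fetchone()

assert first[0] == second[0]  # NOW() is identical within the transaction
assert first[1] != second[1]  # CLOCK_TIMESTAMP() keeps moving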

But which is fastest?

The Results

First of all, this is hard to benchmark locally because my Postgres is running in Docker on the same machine, so the latency of talking to it is tiny. Against a Postgres on a real network, each round trip would cost more, so the version that needs two separate executions would be penalized more.

I ran a simple benchmark where it randomly picked one of the two code blocks (above) with a 50% chance and timed each call; a minimal sketch of the harness is shown below.
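Assuming the two code blocks above are wrapped up as upsert_orm and upsert_sql, and test_rows is some realistic sample data (all made-up names):


import random
import statistics
import time

durations = {'ORM': [], 'SQL': []}

for row in test_rows:  # each row is a dict of column values (test data)
    method = random.choice(['ORM', 'SQL'])
    t0 = time.time()
    if method == 'ORM':
        upsert_orm(**row)
    else:
        upsert_sql(**row)
    durations[method].append((time.time() - t0) * 1000)  # milliseconds

for method, times in durations.items():
    print(method, statistics.mean(times), statistics.median(times))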
The results are:

METHOD     MEAN       MEDIAN
SQL        6.99ms     6.61ms
ORM        10.28ms    9.86ms

So doing it with a block of raw SQL instead is 1.5 times faster. But this difference would surely grow when the network latency is higher.

Discussion

There's an alternative and that's to use django-postgres-extra but I'm personally hesitant. The above little raw SQL hack is the only thing I need and adding more libraries makes far-future maintenance harder.

Beyond the time optimization of being able to send only 1 SQL instruction to PostgreSQL, the biggest benefit is avoiding concurrency race conditions. From the documentation:

"ON CONFLICT DO UPDATE guarantees an atomic INSERT or UPDATE outcome; provided there is no independent error, one of those two outcomes is guaranteed, even under high concurrency. This is also known as UPSERT — "UPDATE or INSERT"."

I'm going to keep this little hack. It's not beautiful but it works and saves time and gives me more comfort around race conditions.

"No space left on device" on OSX Docker

October 3, 2017
13 comments Web development, macOS, Docker

UPDATE 2020

As Greg Brown pointed out, the new way is:

docker container prune
docker image prune

Original blog post...


 

If you run out of disk space in your Docker containers on OSX, this is probably the best thing to run:

docker rm $(docker ps -q -f 'status=exited')
docker rmi $(docker images -q -f "dangling=true")

The Problem

This isn't the first time it's happened, so I'm blogging about it to not forget. The postgres image in my docker-compose.yml didn't start and, since it's linked, its problem is "hidden". Running it in the foreground instead, you can see what the problem is:

▶ docker-compose run db
The files belonging to this database system will be owned by user "postgres".
This user must also own the server process.

The database cluster will be initialized with locale "en_US.utf8".
The default database encoding has accordingly been set to "UTF8".
The default text search configuration will be set to "english".

Data page checksums are disabled.

fixing permissions on existing directory /var/lib/postgresql/data ... ok
initdb: could not create directory "/var/lib/postgresql/data/pg_xlog": No space left on device
initdb: removing contents of data directory "/var/lib/postgresql/data"

Docker on OSX

I admit that I have so much to learn about Docker and the learning is slow. Docker is amazing but I think I'm slow to learn because I'm just not that interested as long as it works and I can work on my apps.

It seems that on OSX all storage for all Docker containers lives in one big file, capped at 64GB:

▶ cd ~/Library/Containers/com.docker.docker/Data/com.docker.driver.amd64-linux/

com.docker.docker/Data/com.docker.driver.amd64-linux
▶ ls -lh Docker.qcow2
-rw-r--r--@ 1 peterbe  staff    63G Oct  3 08:51 Docker.qcow2

If you run the above mentioned commands (docker rm ...) this file does not shrink, but space is freed up inside it. Just like how MongoDB (used to) allocate much more disk space than it actually uses.

If you delete that Docker.qcow2 and restart Docker the space problem goes away but then the problem is that you lose all your active containers which is especially annoying if you have useful data in database containers.

cache_memoize - a pretty decent cache decorator for Django

September 11, 2017
4 comments Python, Web development, Django

UPDATE - Oct 27, 2017 This snippet has now become its own PyPI package. See https://pypi.python.org/pypi/django-cache-memoize

This is something that's grown up organically when working on Mozilla Symbol Server. It has served me very well and perhaps it's worth extracting into its own lib.

Usage

Basically, you are probably used to this in Django:


from django.core.cache import cache

def compute_something(user, special=False):
    cache_key = 'meatycomputation:{}:special={}'.format(user.id, special)
    value = cache.get(cache_key)
    if value is None:
        value = _call_the_meat(user.id, special)  # some really slow function
        cache.set(cache_key, value, 60 * 5)
    return value

Here's instead how you can do exactly the same with cache_memoize:


from wherever.decorators import cache_memoize

@cache_memoize(60 * 5)
def compute_something(user, special=False):
    return _call_the_meat(user.id, special)  # some really slow function

Cache invalidation

If you ever need to do non-trivial caching you know it's important to be able to invalidate the cache. Usually, to be able to do that, you need to know how the cache key was created.

Consider our two examples above, here's first the common thing to do:


def save_user(user):
    do_something_that_will_need_to_cache_invalidate(user)

    cache_key = 'meatycomputation:{}:special={}'.format(user.id, False)
    cache.delete(cache_key)
    # And when it was special=True
    cache_key = 'meatycomputation:{}:special={}'.format(user.id, True)
    cache.delete(cache_key)

This works but it involves repeating the code that generates the cache key. You could extract that into its own function of course.

Here's how you do it with the cache_memoize decorator:


def save_user(user):
    do_something_that_will_need_to_cache_invalidate(user)

    compute_something.invalidate(user, special=False)
    compute_something.invalidate(user, special=True)    

Other features

There are actually two ways to "invalidate" the cache. Calling the new myoriginalfunction.invalidate(...) function or passing a custom extra keyword argument called _refresh. For example: compute_something(user, _refresh=True).

You can pass callables that get called when the cache works in your favor or when it's a cache miss. For example:


def increment_hits(user, special=None):
    # use your imagination
    metrics.incr(user.email)

def cache_miss(user, special=None):
    print("cache miss on {}".format(user.email))

@cache_memoize(
    60 * 5,
    hit_callable=increment_hits,
    miss_callable=cache_miss,
)
def compute_something(user, special=False):
    return _call_the_meat(user.id, special)  # some really slow function

Sometimes you just want to use the memoizer to make sure something only gets called "once" (or once per time interval). In that case it might be smart to not flood your cache backend with the value of the function output if there is one. For example:


@cache_memoize(60 * 60, store_result=False)  # idempotent guard
def calculate_and_update(user):
    # do something expensive here that is best to only do once per hour
    ...

Internally cache_memoize will basically try to convert every argument and keyword argument to a string with, kinda, str(). That might not always be appropriate because you might know that you have two distinct objects whose __str__ will yield the same result. For that you can use the args_rewrite parameter. For example:


def simplify_special_objects(obj):
    # use your imagination
    return obj.hostname 

@cache_memoize(60 * 5, args_rewrite=simplify_special_objects)
def compute_something(special_obj):
    return _call_the_meat(special_obj.hostname)

In conclusion

I've uploaded the code as a gist.

It's quite possible that there's already a perfectly good lib that does exactly this. If so, thanks for letting me know. If not, perhaps I ought to wrap this up and publish it on PyPI. Again, thanks for letting me know.

UPDATE

I found a bug in the original gist. Updated 2017-10-05.
The bug was that the calling of miss_callable and hit_callable was reversed.

Mozilla Symbol Server (aka. Tecken) load testing

September 6, 2017
0 comments Python, Web development, Django, Mozilla

(Thanks Miles Crabil not only for being an awesome Ops person but also for reviewing this blog post!)

My project over the summer, here at Mozilla, has been a project called Mozilla Symbol Server. It's a web service that uploads C++ symbol files, downloads C++ symbol files and symbolicates C++ crash stacktraces. It went into production last week which was fun but there's still lots of work to do on adding beyond-parity features and more optimizations.

What Is Mozilla Symbol Server?

The code name for this project is Tecken and it's written in Python (Django, Gunicorn) and uses PostgreSQL, Redis and Celery. The frontend is entirely static and developed (almost) as a separate project within. The frontend is written in React (using create-react-app and react-router). Everything is run as Docker containers. And if you ask me more details about how it's configured/deployed I'm afraid I have to defer to the awesome Mozilla CloudOps team.

One of the challenges I faced developing Tecken is that symbol downloads need to be fast to handle high volumes of traffic. Today I did some load testing on our stage deployment and managed to start 14 concurrent clients that bombarded our staging server with realistic HTTPS GET queries based on log files. It's actually 7 + 1 + 4 + 2 concurrent clients. 7 of them from an m3.2xlarge EC2 node (8 vCPUs), 1 from an m3.large EC2 node (1 vCPU), 2 from two separate NYC-based DigitalOcean personal servers and 2 clients here from my laptop on my home broadband. Basically, each loadtest script process got its own CPU.

Total req/s
It's hard to know how much more each client could push if it wasn't slowed down. Either way, the server managed to sustain about 330 requests per second. Our production baseline goal is to be able to handle at least 40 requests per second.

After running for a while the caches started getting warm but about 1-5% of requests do have to make a boto3 roundtrip to an S3 bucket located on the other side of America in Oregon. There is also a ~5% penalty in that some requests trigger a write to a central Redis ElastiCache server. That's cheaper than the boto3 S3 call but still hefty latency costs to pay.

The ELB in our staging environment spreads the load between 2 c4.large (2 vCPUs, 3.75GB RAM) EC2 web heads. Each running with preloaded Gunicorn workers between Nginx and Django. Each web head has its own local memcached server to share memory between each worker but only local to the web head.

Is this a lot?

How long is a rope? Hard to tell. Tecken's performance is certainly more than enough, and the sheer fact that it was only deployed to production last week tells me we can probably find a lot of low-hanging optimization fruit on the deployment side over time.

One way of answering that is to compare it with our lightest endpoint. One that involves absolutely no external resources. It's just pure Python in the form of ELB → Nginx → Gunicorn → Django. If I run hey from the same server I did the load testing I get a topline of 1,300 requests per second.

$ hey -n 10000 -c 10 https://symbols.stage.mozaws.net/__lbheartbeat__
Summary:
  Total:    7.6604 secs
  Slowest:  0.0610 secs
  Fastest:  0.0018 secs
  Average:  0.0075 secs
  Requests/sec: 1305.4199
...

That basically means that all the extra "stuff" (memcache key prep, memcache key queries and possibly other high-latency network requests) the Django view needs to do takes up roughly 3x the time of the absolute minimal Django request-response rendering.

Also, if I use the same technique to bombard a single URL, but one that actually involves most code steps yet definitely avoids any slow ElastiCache writes or boto3 S3 reads, I get 800 requests per second:

$ hey -n 10000 -c 10 https://symbols.stage.mozaws.net/advapi32.pdb/5EFB9BF42CC64024AB64802E467394642/advapi32.sy
Summary:
  Total:    12.4160 secs
  Slowest:  0.0651 secs
  Fastest:  0.0024 secs
  Average:  0.0122 secs
  Requests/sec: 805.4150
  Total data:   300000 bytes
  Size/request: 30 bytes
...

Lesson learned

Max CPU Used
It's a recurring reminder that performance is almost all about latency. If it's not RAM or disk, it's networking. See the graph of the "Max CPU Used", which basically shows that the CPU time of user, system and stolen ("CPU spent waiting for the hypervisor to service another virtual CPU") never totals over 50%.

A neat trick to zip a git repo with a version number

September 1, 2017
4 comments Linux, Web development

I have this WebExtension addon. It's not very important. Just a web extension that does some hacks to GitHub pages when I open them in Firefox. The web extension is a folder with a manifest.json, icons/icon-48.png, tricks.js, README.md etc. To upload it to addons.mozilla.org I first have to turn the whole thing into a .zip file that I can upload.

So I discovered a neat way to make that zip file. It looks like this:

#!/bin/bash

DESTINATION=build-`cat manifest.json | jq -r .version`.zip
git archive --format=zip master > $DESTINATION

echo "Created..."
ls -lh $DESTINATION

You run it and it creates a build-1.0.zip file containing all the files that are checked into the git repo. So it discards my local "junk" such as backup files or other things that are mentioned in .gitignore (and .git/info/exclude).

I bet someone's going to laugh and say "Duhh! Of course!" but I didn't know you could do that so easily. Hopefully posting this will help someone trying to do something similar.

Note: this depends on jq, which is an amazing little program.

Ultrafast loading of CSS

September 1, 2017
3 comments Web development, JavaScript

tl;dr; The ideal web performance, with regards to CSS, is to inline the minimal CSS and lazy load the rest after load.

Two key things to understand/appreciate:

  1. The fastest performing web page is one that isn't blocked on rendering.

  2. You use some CSS framework kitchen sink because you're not a CSS guru.

How to deal with this?

Things like HTTP2, CDNs and preload are nice because they make the network lookup for your main.88c468ef.css file as fast as possible. But what's even faster is to include the CSS in the HTML that the server responds with in the first place. Why? Because when the browser downloads your HTML (e.g. GET /) and parses the document, it sees that <link rel="stylesheet" href="/main.88c468ef.css"> and decides not to render any DOM to screen until that CSS file has been downloaded and parsed. It does this because it doesn't want to paint the DOM (as it would look without CSS) and then repaint it again, this time with CSS rules.

Point number 2 basically boils down to the likely fact that your app depends on somecssframework.min.css like Bootstrap, Bulma or Foundation. They're large blobs of CSS for doing all sorts of types of HTML (e.g. cards, tables, navbar menus etc.). These CSS frameworks are super useful because they make your app look pretty. But they're usually big. Really big.

Popular CSS frameworks:

Framework Size Gzipped
bootstrap.min.css 122K 18K
foundation.min.css 115K 16K
semantic.min.css 553K 93K
bulma.min.css 141K 18K

Actually the size difference isn't hugely important. What's important is that it's yet another thing that needs to be downloaded before the page can start to render. If the URL is in the user's cache, great. Even better, if it's cached by a service worker. However if you care about loading performance (judging by the fact that you're still reading), you know that a large majority of your visitors only come to your site sometimes (according to Google Analytics, 92.7% of my visitors are "new visitors"). Perhaps from a Google search. Or perhaps they visit sometimes but rarely enough that by the time they return their browser cache will have "moved on" and reset (to save disk space) what was previously cached.

CSS is a render blocking resource

With and without render blocking CSS
See Ilya Grigorik's primer on Render Blocking CSS.

It's also easy to demonstrate. Check out this Webpagetest Visual Comparison that compares two pages that are both styled with bootstrap.min.css except one of them uses a piece of JavaScript at the bottom of the page that enables the stylesheet after the page has loaded.

So if it's blocking. What to do about it? Well, make it not blocking. But how?

Solution 1

The simplest solution is to simply move any <link rel="stylesheet" href="bootstrap.min.css"> out of the <head> and put them just before the </body> tag. Here's an example of that.

It's valid HTML5 and seems to work just fine in Safari iOS. The only problem is that pesky "Flash of Unstyled Content" (aka. "FOUT") effect where the user is presented with the page very briefly without any styling, then the whole page re-renders once the stylesheets have loaded. Chrome and iOS actually still block the rendering. So it's not like JavaScript, where putting it late in the DOM makes it non-blocking. In other words, not really a good solution at all.

You can see in this Webpagetest that the "Start render" happens after the .css files have been loaded and parsed.

Solution 2

With JavaScript you can put in code that's definitely going to be executed after the rendering starts and, also, after the first rendering is finished (i.e. "DOM Content Loaded").

This technique is best done with loadCSS which can be done really well if you tune it. In particular the rel="preload" feature is getting more and more established. It used to only work in Chrome and Opera but will soon work in Firefox and iOS Safari too. Note, loadCSS contains a polyfill solution to the rel="preload" thing.

The basic idea is that you load a piece of JavaScript late which, as soon as it can, puts the <link rel="stylesheet" href="bootstrap.min.css"> into the DOM. You still have the Flash of Unstyled Content effect to confront, and that's annoying.

Here's an example implementation. It uses the scripts and techniques laid out by filamentgroup's loadCSS.

It works, and the rel="preload" is a bonus for Chrome and Opera users because by the time the JavaScript "kicks in" the network loading is quite possibly already done. As seen in this Webpagetest using Chrome, the .css files start downloading before the lazyloadcss.js file has even started downloading.

It's not as hot in Firefox because the downloading of all the .css files is delayed until after lazyloadcss.js has loaded and executed.

Solution 3

Just inline all the CSS. Instead of <link rel="stylesheet" href="bootstrap.min.css"> you just make it inline. Like:


<style type="text/css">
/*!
 * Bootstrap v4.0.0-beta (https://getbootstrap.com)
 * Copyright 2011-2017 The Bootstrap Authors
 * Copyright 2011-2017 Twitter, Inc.
 * Licensed under MIT (https://github.com/twbs/bootstrap/blob/master/LICENSE)
 */@media print{*,::after,::before{text-shadow:none!important;box-shadow:none!important}......
</style>

All 123KB of it. Why not?! It has to be downloaded sooner or later anyway, might as well nip it in the bud straight away. The Flash of Unstyled Content problem goes away. So does the problem of having to load JavaScript tricks to make the CSS loading non-blocking.

The obvious and immediate caveat is that now the whole HTML document is huge! In this example page the whole HTML document is 127KB (20KB gzipped) whereas the regular one is 4.1KB (1.4KB gzipped). And if your visitors, if you're so lucky, click on any internal link, that's another 127KB that has to be downloaded again.

The biggest caveat is that downloading a large HTML document is bad because no other resources (images for example) can be downloaded in parallel whilst the browser is working on rendering the page with what it's downloaded so far. If you compare this Webpagetest with the regular traditional one, you can see that it takes 354ms to download the HTML with all CSS inlined, compared to 262ms when the CSS was linked. That's roughly 100ms wasted during which the browser could have started downloading other resources, like images.

Solution 4

Solution 3 was kinda good because it avoided the Flash of Unstyled Content and it avoided all extra resource loading. However, we can do better.

Instead of inlining all CSS, how about we take out exactly the CSS we need from bootstrap.min.css and inline just that. Then, after the page has loaded, we download the rest of bootstrap.min.css so that all the other selectors are ready as the page changes and morphs in response to interactive JavaScript, which can and will happen after the initial load.

But how do you know exactly which CSS you need for that initial load? Really, you don't. You have two options:

  1. Manually inspect which DOM elements you have in your initial HTML and slowly start plucking the matching selectors out of the Bootstrap CSS file.

  2. Automate the inspection of what DOM elements you have in your initial HTML.

Before we dig deeper into how to automate the inspection, let's look at what it'd look like: This page and when Webpagetested. What's cool here is that the DOM is ready in 265ms (it was 262ms when there was no linked CSS).

Notice that there's no Flash of Unstyled Content. No external dependencies. It's basically an inline <style> block with exactly the selectors that are needed and nothing more. The HTML is larger, at 13KB (3.3KB gzipped), but remember it was 4.1KB when we started and the solution where we inlined everything was 127KB.

The immediate problem with this is that we're missing some nice CSS for things that haven't been needed yet. For example, there might be some JavaScript that changes the DOM based on something the user does with the page, like clicking on something that adds more elements to the DOM. Or, equally likely, after the DOM has loaded, an XHR query is made to download some data and display it in a way that needs CSS selectors that weren't included in the minimal set.

By the way, this very blog post builds on this solution. If you're on your desktop browser you can view source and see that there's only inline style blocks.

Solution 5

This builds on Solution 4. The HTML contains the minimal CSS needed for that first render and as soon as possible we additionally download the whole bootstrap.min.css so that it's available if/when the DOM mutates and needs the full CSS not in the minimal CSS.

Basically, let's take Solution 2 (JavaScript lazy loads in the CSS) + Solution 4 (the minimal CSS inlined). Here is one such solution.

And there we have it! The ideal solution. The only thing remaining is to verify that it actually makes a difference.

The Webpagetest Final Showdown

We have 5 solutions. Each one different from the next. Let's compare them against each other.

Here it is in its full glory

Visual comparison on WebPagetest.org
(image if you can't open the Webpagetest page right now)

What we notice:

  1. The regular do-nothing solution is 50% slower than the best solution. 3.2 seconds versus 2.2 seconds.
  2. Putting the <link rel="stylesheet" ...> tags at the bottom of the document doesn't work in Chrome and doesn't do anything good.
  3. Lazy loading the CSS with JavaScript (with no initial CSS) displays content very early but the repaint means it takes unnecessarily longer to load the whole thing.
  4. The ideal solution (Solution 5) loads as fast, visually, as Solution 4 but has the advantage that all CSS is there, eventually.
  5. Inlining all CSS (Solution 3) is only 23% slower than the ideal solution (Solution 5). But, it's much easier to implement. Seriously consider this if your tooling is limited.

Conclusion

One humbling thing to notice is that the difference isn't actually that huge. In this particular example we managed to go from 3.2 seconds to 2.2 seconds (using a 3G connection). The example playground used in this experiment is very far from a real site. Most possibly, a real site is a lot more complex and full of lots more potential bottlenecks that slows things down. For example, instead of obsessing over the CSS payload, perhaps you can make a bigger impact by simply dropping some excessive JavaScript plugins that might not necessarily be needed. Or you can focus on your 2.5MB total of big images.

However, a key ingredient of web performance is to leverage the loading time in the best possible way. If you get the CSS to not block rendering, your users' browsers can spend more time, sooner, on other resources such as images and XHR.

UPDATE March 2018

A lot of this work of figuring out the minimal CSS from a DOM has now been put in a rapidly maturing and well tested Nodejs project called minimalcss.

Fastest way to match a filename's extension in Python

August 31, 2017
4 comments Python

tl;dr; By a slim margin, the fastest way to check if a filename matches a list of extensions is filename.endswith(extensions), where extensions is a tuple.

This turned out to be premature optimization. The context is that I want to check if a filename matches one of the file extensions in a list of 6.

The list being ['.sym', '.dl_', '.ex_', '.pd_', '.dbg.gz', '.tar.bz2']. Meaning, it should return True for foo.sym or foo.dbg.gz. But it should return False for bar.exe or bar.gz.

I put together a little benchmark, ran it a bunch of times and looked at the results. Here are the functions I wrote:


import re

extensions = ['.sym', '.dl_', '.ex_', '.pd_', '.dbg.gz', '.tar.bz2']
extensions_tuple = tuple(extensions)


def f1(filename):
    for each in extensions:
        if filename.endswith(each):
            return True
    return False


def f2(filename):
    return filename.endswith(extensions_tuple)


regex = re.compile(r'({})$'.format(
    '|'.join(re.escape(x) for x in extensions)
))


def f3(filename):
    return bool(regex.findall(filename))


def f4(filename):
    return bool(regex.search(filename))

The results are boring. But I guess that's a result too:

FUNCTION             MEDIAN               MEAN
f1 9543 times        0.0110ms             0.0116ms
f2 9523 times        0.0031ms             0.0034ms
f3 9560 times        0.0041ms             0.0045ms
f4 9509 times        0.0041ms             0.0043ms

For a list of ~40,000 realistic filenames (with result True 75% of the time), I ran each function 10 times. So, it means it took on average 0.0116ms to run f1 10 times here on my laptop with Python 3.6.

More premature optimization

Upon looking at the data and thinking about how this will be used: if I reorder the list of extensions so the most common one is first, the second most common second, etc., then the performance improves a bit for f1 but slows down slightly for f3 and f4.

Conclusion

That .endswith(some_tuple) is neat and it's hair-splittingly faster. But really, this turned out to not make a huge difference in the grand scheme of things. On average it takes less than 0.001ms to do one filename match.