How to deploy a create-react-app

November 4, 2016
2 comments Web development, JavaScript, React

First of all, create-react-app is an amazing kit. It's a zero configuration bundle that gives you a react app boilerplate with a dev server, linting and a deployment tool. All are awesome but not perfect.

I could go on giving this project praise but if you're here reading this you might be convinced already.

Anyway, the way you deploy a create-react-app project is actually stunningly simple, but there is one major caveat to look out for. Basically running yarn run build will first delete existing files in the ./build/ directory. Files that it indents to replace. For example your ./build/index.html or your ./build/static/js/main.94a86fe3.js.

So, what I suggest is that you deploy it like this:

#!/bin/bash

# Go into the project where the package.json exists
cd myproject
# Upgrade any libraries
yarn
# Use 
yarn run build
mv build build_final

Note! This tip is only applicable if you deploy "in place" as opposed to building a whole new container/image and swapping an old container/image for a new one.

Now, for your Nginx point to the ./build_final directory instead. For example:

# /etc/nginx/sites-enabled/mysite.conf
server {
    server_name mydomain.example.com;
    root /full/path/to/myproject/build_final;

    location / {
        try_files $uri /index.html;
        add_header   Cache-Control public;
        expires      1d;
    }
}

The whole point of this tip is that it's a good idea to not point Nginx to the ./build directory (but to a copy of it instead) because otherwise, during the seconds that yarn run build runs (1-5 seconds) a bunch of files will be missing and Nginx will send 404 errors to the clients unlucky enought to connect during the deployment.

Optimization of QuerySet.get() with or without select_related

November 3, 2016
1 comment Python, Django, PostgreSQL

If you know you're going to look up a related Django ORM object from another one, Django automatically takes care of that for you.

To illustrate, imaging a mapping that looks like this:


class Artist(models.Models):
    name = models.CharField(max_length=200)
    ...


class Song(models.Models):
    artist = models.ForeignKey(Artist)
    ...

And with that in mind, suppose you do this:


>>> Song.objects.get(id=1234567).artist.name
'Frank Zappa'

Internally, what Django does is that it looks the Song object first, then it does a look up automatically on the Artist. In PostgreSQL it looks something like this:

SELECT "main_song"."id", "main_song"."artist_id", ... FROM "main_song" WHERE "main_song"."id" = 1234567
SELECT "main_artist"."id", "main_artist"."name", ... FROM "main_artist" WHERE "main_artist"."id" = 111

Pretty clear. Right.

Now if you know you're going to need to look up that related field you can ask Django to make a join before the lookup even happens. It looks like this:


>>> Song.objects.select_related('artist').get(id=1234567).artist.name
'Frank Zappa'

And the SQL needed looks like this:

SELECT "main_song"."id", ... , "main_artist"."name", ... 
FROM "main_song" INNER JOIN "main_artist" ON ("main_song"."artist_id" = "main_artist"."id") WHERE "main_song"."id" = 1234567

The question is; which is fastest?

Well, there's only one way to find out and that is to measure with some relatistic data.

Here's the benchmarking code:


def f1(id):
    try:
        return Song.objects.get(id=id).artist.name
    except Song.DoesNotExist:
        pass

def f2(id):
    try:
        return Song.objects.select_related('artist').get(id=id).artist.name
    except Song.DoesNotExist:
        pass

def _stats(r):
    #returns the median, average and standard deviation of a sequence
    tot = sum(r)
    avg = tot/len(r)
    sdsq = sum([(i-avg)**2 for i in r])
    s = list(r)
    s.sort()
    return s[len(s)//2], avg, (sdsq/(len(r)-1 or 1))**.5

times = defaultdict(list)
functions = [f1, f2]
for id in range(100000, 103000):
    for f in functions:
        t0 = time.time()
        r = f(id)
        t1 = time.time()
        if r:
            times[f.__name__].append(t1-t0)
    # Shuffle the order so that one doesn't benefit more
    # from deep internal optimizations/caching in Postgre.
    random.shuffle(functions)

for k, values in times.items():
    print(k, [round(x * 1000, 2) for x in _stats(values)])

For the record, here are the parameters of this little benchmark:

  • The server is a 4Gb DigitialOcean server that is being used but not super busy
  • The PostgreSQL server is running on the same virtual machine as the Python process used in the benchmark
  • Django 1.10, Python 3.5, PostgreSQL 9.5.4
  • 1,588,258 "Song" objects, 102,496 "Artist" objects
  • Note that the range of IDs has gaps but they're not counted

The Result

Function Median Average Std Dev
f1 3.19ms 9.17ms 19.61ms
f2 2.28ms 6.28ms 15.30ms

The Conclusion

If you use the median, using select_related is 30% faster and if you use the average, using select_related is 46% faster.

So, if you know you're going to need to do that lookup put in .select_related(relation) before every .get(id=...) in your Django code.

Deep down in PostgreSQL, the inner join is ultimately two ID-by-index lookups. And that's what the first method is too. It's likely that the reason the inner join approach is faster is simply because there's less connection overheads.

Lastly, YOUR MILEAGE WILL VARY. Every benchmark is flawed but this quite realistic because it's not trying to be optimized in either way.

Django test optimization with no-op PIL engine

October 27, 2016
6 comments Python, Django

The Air Mozilla project is a regular Django webapp. It's reasonably big for a more or less one man project. It's ~200K lines of Python and ~100K lines of JavaScript. There are 816 "unit tests" at the time of writing. Most of them are kinda typical Django tests. Like:


def test_some_feature(self):
    thing = MyModel.objects.create(key='value')
    url = reverse('namespace:name', args=(thing.id,))
    response = self.client.get(url)
    ....

Also, the site uses sorl.thumbnail to automatically generate thumbnails from uploaded images. It's a great library.

However, when running tests, you almost never actually care about the image itself. Your eyes will never feast on them. All you care about is that there is an image, that it was resized and that nothing broke. You don't write tests that checks the new image dimensions of a generated thumbnail. If you need tests that go into that kind of detail, it best belongs somewhere else.

So, I thought, why not fake ALL operations that are happening inside sorl.thumbnail to do with resizing and cropping images.

Here's the changeset that does it. Note, that the trick is to override the default THUMBNAIL_ENGINE that sorl.thumbnail loads. It usually defaults to sorl.thumbnail.engines.pil_engine.Engine and I just wrote my own that does no-ops in almost every instance.

I admittedly threw it together quite quickly just to see if it was possible. Turns out, it was.


# Depends on setting something like:
#    THUMBNAIL_ENGINE = 'airmozilla.base.tests.testbase.FastSorlEngine'
# in your settings specifically for running tests.


from sorl.thumbnail.engines.base import EngineBase


class _Image(object):
    def __init__(self):
        self.size = (1000, 1000)
        self.mode = 'RGBA'
        self.data = '\xa0'


class FastSorlEngine(EngineBase):

    def get_image(self, source):
        return _Image()

    def get_image_size(self, image):
        return image.size

    def _colorspace(self, image, colorspace):
        return image

    def _scale(self, image, width, height):
        image.size = (width, height)
        return image

    def _crop(self, image, width, height, x_offset, y_offset):
        image.size = (width, height)
        return image

    def _get_raw_data(self, image, *args, **kwargs):
        return image.data

    def is_valid_image(self, raw_data):
        return bool(raw_data)

So, was it much faster?

It's hard to measure because the time it takes to run the whole test suite depends on other stuff going on on my laptop during the long time it takes to run the tests. So I ran them 8 times with the old code and 8 times with this new hack.

Iteration Before After
1 82.789s 73.519s
2 82.869s 67.009s
3 77.100s 60.008s
4 74.642s 58.995s
5 109.063s 80.333s
6 100.452s 81.736s
7 85.992s 61.119s
8 82.014s 73.557s
Average 86.865s 69.535s
Median 82.869s 73.519s
Std Dev 11.826s 9.0757s

So rougly 11% faster. Not a lot but it adds up when you're doing test-driven development or debugging where you run a suite or a test over and over as you're saving the files/tests you're working on.

Room for improvement

In my case, it just worked with this simple solution. Your site might do fancier things with the thumbnails. Perhaps we can combine forces on this and finalize a working solution into a standalone package.

hashin 0.7.0 and multiple packages

August 30, 2016
0 comments Python

My colleague Andrew Halberstadt stepped up with a great contribution on hashin (on PyPI). Now you can install multiple packages in one sweep. Like this:

$ hashin requests Django premailer mincss

And if you need to specify a different requirements file than the default (./requirements.txt) or a different algorithm than the default (sha256) you can do that like this:

$ hashin requests Django premailer mincss --algorithm=sha512 --requirements-file=dev/reqs.txt

or

$ hashin requests Django premailer mincss -a sha512 -r dev/reqs.txt

This is an important change if you were used to typing:

$ hashin somepackage dev/reqs.txt

...because if you continue to do that it's going to try to fetch the hash for a PyPI package supposedly called "dev/reqs.txt".

Thanks @ahal!

Note! The operation is not atomic. So if you do hashin requests somejunk it will hash in the latest requests to your requirements.txt and error on the second one.

django-html-validator - now locally, fast!

August 12, 2016
1 comment Python, Web development, Django

A couple of years ago I released a project called django-html-validator (GitHub link) and it's basically a Django library that takes the HTML generated inside Django and sends it in for HTML validation.

The first option is to send the HTML payload, over HTTPS, to https://validator.nu/. Not only is this slow but it also means sending potentially revealing HTML. Ideally you don't have any passwords in your HTML and if you're doing HTML validation you're probably testing against some test data. But... it sucked.

The other alternative was to download a vnu.jar file from the github.com/validator/validator project and executing it in a subprocess with java -jar vnu.jar /tmp/file.html. Problem with this is that it's really slow because java programs take such a long time to boot up.

But then, at the beginning of the year some contributors breathed fresh life into the project. Python 3 support and best of all; the ability to start the vnu.jar as a local server on http://localhost:8888 and HTTP post HTML over to that. Now you don't have to pay the high cost of booting up a java program and you don't have to rely on a remote HTTP call.

Now it becomes possible to have HTML validation checked on every rendered HTML response in the Django unit tests.

To try it, check out the new instructions on "Setting the vnu.jar path".

The contributor who's made this possible is Ville "scop" Skyttä, as well as others. Thanks!!

How to identify/classify what language a piece of text is

August 9, 2016
0 comments Misc. links, Python

Suppose you have a piece of text but you don't know what language it is. If you speak English and the text looks English, it's easy. But what about "Den snabba bruna räven hoppar över den lata hunden" or "haraka kahawia mbweha anaruka juu ya mbwa wavivu" or "A ligeira raposa marrom ataca o cão preguiçoso"? Can you guess?

MeaningCloud can guess. They have a Language Identification API that you can use for free. Their freemium plan allows for 40,000 API requests per month.

So to get started, you have to register, verify your email and sig in to get your "license key". Now when you have that you simply use it like this:

>>> import requests
>>> url = 'http://api.meaningcloud.com/lang-1.1'
>>> payload={'key': 'b49....................ee',
... 'txt': 'Den snabba bruna räven hoppar över den lata hunden'}
>>>
>>> requests.post(url, data=payload).json()
{'status': {'remaining_credits': '39999', 'credits': '1', 'msg': 'OK', 'code': '0'}, 'lang_list': ['sv', 'da', 'no', 'es']}
>>>

If you look at the lang_list list, the first one is sv for Swedish.

If you want the full name of a language code, look it up in the "ISO 639-1 Code" table.

Let's do the other ones too:

>>> payload['txt'] = 'A ligeira raposa marrom ataca o cão preguiçoso'
>>> # Portugese
>>> requests.post(url, data=payload).json()
{'status': {'remaining_credits': '39998', 'credits': '1', 'msg': 'OK', 'code': '0'}, 'lang_list': ['pt', 'ro']}
>>> payload['txt'] = 'haraka kahawia mbweha anaruka juu ya mbwa wavivu'
>>> # Swahili
>>> requests.post(url, data=payload).json()
{'status': {'remaining_credits': '37363', 'credits': '1', 'msg': 'OK', 'code': '0'}, 'lang_list': ['sw']}

The service isn't perfect. It struggles on shorter texts using non-western alphabet. But it's pretty easy to use and delivers pretty good results.

UPDATE

Note! If you intend to do this in bulk and you have access to Python and NLTK use this script instead.

I tried it on my nltk install and I have 14 languages that it can detect.

UPDATE 2

A much better solution than NLTK is guess_language-spirit. It's superfast and I spotchecked a bunch of its outputs and put the non-English text into Google Translate and a it almost always gets it right.

json-schema-reducer

August 2, 2016
0 comments Python

Last week I made a little library called json-schema-reducer. It's a simple function that takes a JSON Schema and dict (or a JSON string or a .json file path), and makes a new dict that only contains the keys listed in the JSON Schema.

This is handy if you have a JSON Schema which dictates what you can/want to share/publish/save, but you have a data structure that contains keys and values you don't want to share/publish/save.

I built this because there are a couple of projects that can turn data structures into models from a JSON Schema but none that have the ability to reduce stuff from a data structure. Here's an example:

Sample JSON Schema (schema.json)

{
    "type": "object", 
    "$schema": "http://json-schema.org/draft-04/schema#", 
    "title": "Sample JSON Schema", 
    "required": [
        "name", 
        "sex"
    ], 
    "properties": {
        "name": {
            "type": "string"
        }, 
        "sex": {
            "type": "string"
        }
        "title": {
            "type": "string"
        }    
    }
}

Sample data structure (sample.json)

{
    "name": "Peter",
    "sex": "male",
    "email": "peterbe@example.com",
}

Usage


>>> from json_schema_reducer import make_reduced_dict
>>> make_reduced_dict('schema.json', 'sample.json')
{'name': 'Peter', 'sex': 'male'}  # Note! No "email" key

The project works in Python 2 and Python 3. See tests.

Also, the function tries to be convenient in that it can accept either a dict, a JSON string or path to a .json file.

Premailer 3.0.0 - classes kept by default

June 7, 2016
0 comments Python, Web development

Today I released a new major version of premailer where the only difference is that one of the default options have changed from True to False.
The git commit for this change might look big but the only difference is that now, by default, the HTML class attribute is kept in the output HTML.

When premailer started, the land of HTML emails was very different. Basically, you used to not use CSS media queries, so, no reason to keep the class attribute. Now, these days, all pretty HTML emails need media queries and for that to work you need to have the class attribute kept in the HTML.

So fear not the major version upgrade! If you used to use premailer like this:


from premailer import Premailer

transformer = Premailer(html)
output_html = transformer.transform()

You now need to change it to:


from premailer import Premailer

transformer = Premailer(html, remove_classes=True)
output_html = transformer.transform()

As always, you can play with it on premailer.io.

CSS Bloat Comparison

June 3, 2016
0 comments Web development, JavaScript

tl;dr; How much web performance negative overhead does including a CSS stylesheet (that you don't use) add to the rendering time? I don't know. But WebPagetest gives us some clues.

To jump straight to the results, check out this video which is the slow-motion rendering of 1 + 5 pages. Each page has one more big fat CSS stylesheet linked than the other. I.e. the 5th one links to 5 different .css URLs. There's also a 0th one which only loads 1 .css file which has nothing in it.

WebPagetest results
The full results are here.

I love CSS frameworks and use them ALL the time. But I'm also interested in web performance and using techniques like real-time static analysis to figure out what CSS that doesn't need to be loaded. Some of those techniques surely lead to less stuff needing to be downloaded but how big is the gain? Not sure.

So I made 6 pages that loads a CSS framework but doesn't actually use any of it. The browser will be instructed to download the file and parse it. That takes time and CPU-work will surely will have an effect on the total rendering time.

In every page, I lastly load a little piece of JavaScript just to make something appear on the page. That means, the page will not fully render until AFTER the it has loaded the .css files and one little .js file which prints something on the DOM. The reason there's a 5 second delay until it uses AJAX (fetch) to figure out their sizes is because I don't want that effect to affect the rendering on WebPagetest.

css Bytes

Noteworthy

Notice that there are two cases of "outliers". According to the measurements, bloat3.html and bloat5.html take shorter time to render compared to their smaller files (bloat2.html and bloat4.html respectively). That seems to indicate that - even though WebPagetest does 3 runs each - that "network I/O luck" plays a big role.

Also, interesting is that the first file bloat1.html which loads 118Kb more CSS than bloat0.html but clearly it doesn't seem to have a big impact.

Conclusion

It does have an effect of reduced web performance. I.e. longer loading time. The page with just 1 .css file takes 0.5 seconds and the one with 5 .css files takes 0.8 seconds. However, the results also indicate that much of the total time is spent waiting for the download. Once the browser has downloaded the payload, it appears to be very fast at parsing it to get ready to render the page accordingly. In other words, don't worry so much about the "bloat" of the content of the CSS file. Worry more about the excessive HTTP requests needed in total.

It would be interesting to inline every CSS file into the .html page and re-run. That means only the .html needs to be downloaded from the network and although the last one will be bigger when all are gzipped the difference isn't huge.

In conclusion, I'm not sure it's a huge performance loss to add big bloated CSS frameworks to your site. Most likely your big wins lie with optimizing the images and JavaScript.

UPDATE

After publishing this, I decided to inline every .css as big <style> tags. For example, bloat4.html (View source).

Here's the result.

And here's the video.

What this proves is that the difference we saw earlier was almost entirely due to "network I/O luck". All pages are gzipped. The smallest one is only 0.19Kb and the largest one is 183Kb. But there's no noticeable difference in the total time it takes to render these two. Basically, the browser's ability to parse CSS is FAST! Don't worry so much about the size of the CSS payload itself. Go forth and make pretty web pages!

hashin 0.5.0 bug fix

May 17, 2016
0 comments Python

Thank you @davehunt for finding a ugly bug in in hashin.

The bug happened when you tried to add a package that had a similar name to a package that was already in your requirements.txt. By similar name, it was if the package name was inside another name. E.g. 'selenium==' in 'pytest-selenium=='.

Here's the fix.

So make sure you upgrade to version >0.5.