
Optimization of QuerySet.get() with or without select_related

November 3, 2016
1 comment Python, Django, PostgreSQL

If you know you're going to look up a related Django ORM object from another one, Django automatically takes care of that for you.

To illustrate, imagine a mapping that looks like this:


class Artist(models.Model):
    name = models.CharField(max_length=200)
    ...


class Song(models.Model):
    artist = models.ForeignKey(Artist)
    ...

And with that in mind, suppose you do this:


>>> Song.objects.get(id=1234567).artist.name
'Frank Zappa'

Internally, what Django does is look up the Song object first, then automatically do a second lookup for the Artist. In PostgreSQL it looks something like this:

SELECT "main_song"."id", "main_song"."artist_id", ... FROM "main_song" WHERE "main_song"."id" = 1234567
SELECT "main_artist"."id", "main_artist"."name", ... FROM "main_artist" WHERE "main_artist"."id" = 111

Pretty clear, right?

Now if you know you're going to need to look up that related field you can ask Django to make a join before the lookup even happens. It looks like this:


>>> Song.objects.select_related('artist').get(id=1234567).artist.name
'Frank Zappa'

And the SQL needed looks like this:

SELECT "main_song"."id", ... , "main_artist"."name", ... 
FROM "main_song" INNER JOIN "main_artist" ON ("main_song"."artist_id" = "main_artist"."id") WHERE "main_song"."id" = 1234567

The question is: which is faster?

Well, there's only one way to find out and that is to measure with some realistic data.

Here's the benchmarking code:


import random
import time
from collections import defaultdict


def f1(id):
    try:
        return Song.objects.get(id=id).artist.name
    except Song.DoesNotExist:
        pass

def f2(id):
    try:
        return Song.objects.select_related('artist').get(id=id).artist.name
    except Song.DoesNotExist:
        pass

def _stats(r):
    #returns the median, average and standard deviation of a sequence
    tot = sum(r)
    avg = tot/len(r)
    sdsq = sum([(i-avg)**2 for i in r])
    s = list(r)
    s.sort()
    return s[len(s)//2], avg, (sdsq/(len(r)-1 or 1))**.5

times = defaultdict(list)
functions = [f1, f2]
for id in range(100000, 103000):
    for f in functions:
        t0 = time.time()
        r = f(id)
        t1 = time.time()
        if r:
            times[f.__name__].append(t1-t0)
    # Shuffle the order so that one doesn't benefit more
    # from deep internal optimizations/caching in PostgreSQL.
    random.shuffle(functions)

for k, values in times.items():
    print(k, [round(x * 1000, 2) for x in _stats(values)])

For the record, here are the parameters of this little benchmark:

  • The server is a 4 GB DigitalOcean server that is in use but not super busy
  • The PostgreSQL server is running on the same virtual machine as the Python process used in the benchmark
  • Django 1.10, Python 3.5, PostgreSQL 9.5.4
  • 1,588,258 "Song" objects, 102,496 "Artist" objects
  • Note that the range of IDs has gaps but they're not counted

The Result

Function    Median    Average    Std Dev
f1          3.19ms    9.17ms     19.61ms
f2          2.28ms    6.28ms     15.30ms

The Conclusion

If you compare the medians, using select_related is 30% faster, and if you compare the averages, using select_related is 46% faster.

So, if you know you're going to need to do that lookup, put a .select_related(relation) before the .get(id=...) in your Django code.

Deep down in PostgreSQL, the inner join is ultimately two ID-by-index lookups, and that's what the first method is too. It's likely that the inner join approach is faster simply because there's less connection overhead.

Lastly, YOUR MILEAGE WILL VARY. Every benchmark is flawed, but this one is quite realistic because it isn't tuned to favor either approach.

Django test optimization with no-op PIL engine

October 27, 2016
6 comments Python, Django

The Air Mozilla project is a regular Django webapp. It's reasonably big for a more or less one man project. It's ~200K lines of Python and ~100K lines of JavaScript. There are 816 "unit tests" at the time of writing. Most of them are kinda typical Django tests. Like:


def test_some_feature(self):
    thing = MyModel.objects.create(key='value')
    url = reverse('namespace:name', args=(thing.id,))
    response = self.client.get(url)
    ....

Also, the site uses sorl.thumbnail to automatically generate thumbnails from uploaded images. It's a great library.

However, when running tests, you almost never actually care about the image itself. Your eyes will never feast on them. All you care about is that there is an image, that it was resized and that nothing broke. You don't write tests that check the new dimensions of a generated thumbnail. If you need tests that go into that kind of detail, they best belong somewhere else.

So, I thought, why not fake ALL the operations inside sorl.thumbnail that have to do with resizing and cropping images?

Here's the changeset that does it. Note that the trick is to override the default THUMBNAIL_ENGINE that sorl.thumbnail loads. It usually defaults to sorl.thumbnail.engines.pil_engine.Engine and I just wrote my own that does no-ops in almost every instance.

I admittedly threw it together quite quickly just to see if it was possible. Turns out, it was.


# Depends on setting something like:
#    THUMBNAIL_ENGINE = 'airmozilla.base.tests.testbase.FastSorlEngine'
# in your settings specifically for running tests.


from sorl.thumbnail.engines.base import EngineBase


class _Image(object):
    def __init__(self):
        self.size = (1000, 1000)
        self.mode = 'RGBA'
        self.data = '\xa0'


class FastSorlEngine(EngineBase):

    def get_image(self, source):
        return _Image()

    def get_image_size(self, image):
        return image.size

    def _colorspace(self, image, colorspace):
        return image

    def _scale(self, image, width, height):
        image.size = (width, height)
        return image

    def _crop(self, image, width, height, x_offset, y_offset):
        image.size = (width, height)
        return image

    def _get_raw_data(self, image, *args, **kwargs):
        return image.data

    def is_valid_image(self, raw_data):
        return bool(raw_data)

So, was it much faster?

It's hard to measure because the time it takes to run the whole test suite depends on other stuff going on on my laptop during the long time it takes to run the tests. So I ran them 8 times with the old code and 8 times with this new hack.

Iteration    Before       After
1            82.789s      73.519s
2            82.869s      67.009s
3            77.100s      60.008s
4            74.642s      58.995s
5            109.063s     80.333s
6            100.452s     81.736s
7            85.992s      61.119s
8            82.014s      73.557s
Average      86.865s      69.535s
Median       82.869s      73.519s
Std Dev      11.826s      9.0757s

So roughly 11% faster. Not a lot, but it adds up when you're doing test-driven development or debugging, where you run a suite or a test over and over as you save the files/tests you're working on.

Room for improvement

In my case, it just worked with this simple solution. Your site might do fancier things with the thumbnails. Perhaps we can combine forces on this and finalize a working solution into a standalone package.

hashin 0.7.0 and multiple packages

August 30, 2016
0 comments Python

My colleague Andrew Halberstadt stepped up with a great contribution on hashin (on PyPI). Now you can install multiple packages in one sweep. Like this:

$ hashin requests Django premailer mincss

And if you need to specify a different requirements file than the default (./requirements.txt) or a different algorithm than the default (sha256) you can do that like this:

$ hashin requests Django premailer mincss --algorithm=sha512 --requirements-file=dev/reqs.txt

or

$ hashin requests Django premailer mincss -a sha512 -r dev/reqs.txt

This is an important change if you were used to typing:

$ hashin somepackage dev/reqs.txt

...because if you continue to do that it's going to try to fetch the hash for a PyPI package supposedly called "dev/reqs.txt".

Thanks @ahal!

Note! The operation is not atomic. So if you do hashin requests somejunk it will hash in the latest requests to your requirements.txt and error on the second one.

django-html-validator - now locally, fast!

August 12, 2016
1 comment Python, Web development, Django

A couple of years ago I released a project called django-html-validator (GitHub link) and it's basically a Django library that takes the HTML generated inside Django and sends it in for HTML validation.

The first option is to send the HTML payload, over HTTPS, to https://validator.nu/. Not only is this slow but it also means sending potentially revealing HTML. Ideally you don't have any passwords in your HTML and if you're doing HTML validation you're probably testing against some test data. But... it sucked.

The other alternative was to download a vnu.jar file from the github.com/validator/validator project and execute it in a subprocess with java -jar vnu.jar /tmp/file.html. The problem with this is that it's really slow, because Java programs take such a long time to boot up.

But then, at the beginning of the year some contributors breathed fresh life into the project. Python 3 support and, best of all, the ability to start the vnu.jar as a local server on http://localhost:8888 and HTTP POST HTML over to that. Now you don't have to pay the high cost of booting up a Java program and you don't have to rely on a remote HTTP call.
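
Running it as a local server looks something like this (a sketch; check the vnu.jar documentation for the exact invocation, and use whatever port you configure django-html-validator to talk to):

$ java -cp vnu.jar nu.validator.servlet.Main 8888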

Now it becomes possible to have HTML validation checked on every rendered HTML response in the Django unit tests.

To try it, check out the new instructions on "Setting the vnu.jar path".

The contributor who made this possible is Ville "scop" Skyttä, along with others. Thanks!!

How to identify/classify what language a piece of text is

August 9, 2016
0 comments Misc. links, Python

Suppose you have a piece of text but you don't know what language it is. If you speak English and the text looks English, it's easy. But what about "Den snabba bruna räven hoppar över den lata hunden" or "haraka kahawia mbweha anaruka juu ya mbwa wavivu" or "A ligeira raposa marrom ataca o cão preguiçoso"? Can you guess?

MeaningCloud can guess. They have a Language Identification API that you can use for free. Their freemium plan allows for 40,000 API requests per month.

So to get started, you have to register, verify your email and sign in to get your "license key". Now when you have that you simply use it like this:

>>> import requests
>>> url = 'http://api.meaningcloud.com/lang-1.1'
>>> payload={'key': 'b49....................ee',
... 'txt': 'Den snabba bruna räven hoppar över den lata hunden'}
>>>
>>> requests.post(url, data=payload).json()
{'status': {'remaining_credits': '39999', 'credits': '1', 'msg': 'OK', 'code': '0'}, 'lang_list': ['sv', 'da', 'no', 'es']}
>>>

If you look at the lang_list list, the first one is sv for Swedish.

If you want the full name of a language code, look it up in the "ISO 639-1 Code" table.
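
If you need that mapping in code, a plain dict covering the codes you care about is enough (a hypothetical subset, not a complete table):

>>> ISO_639_1 = {'sv': 'Swedish', 'pt': 'Portuguese', 'sw': 'Swahili', 'en': 'English'}
>>> ISO_639_1['sv']
'Swedish'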

Let's do the other ones too:

>>> payload['txt'] = 'A ligeira raposa marrom ataca o cão preguiçoso'
>>> # Portuguese
>>> requests.post(url, data=payload).json()
{'status': {'remaining_credits': '39998', 'credits': '1', 'msg': 'OK', 'code': '0'}, 'lang_list': ['pt', 'ro']}
>>> payload['txt'] = 'haraka kahawia mbweha anaruka juu ya mbwa wavivu'
>>> # Swahili
>>> requests.post(url, data=payload).json()
{'status': {'remaining_credits': '37363', 'credits': '1', 'msg': 'OK', 'code': '0'}, 'lang_list': ['sw']}

The service isn't perfect. It struggles with shorter texts in non-Western alphabets. But it's pretty easy to use and delivers pretty good results.

UPDATE

Note! If you intend to do this in bulk and you have access to Python and NLTK use this script instead.

I tried it on my nltk install and I have 14 languages that it can detect.
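
The gist of that approach, as far as I understand it, is to tokenize the text and see which language's stopwords overlap with it the most. A minimal sketch of that idea (assuming the NLTK stopwords corpus is downloaded, e.g. via nltk.download('stopwords')):


from nltk import wordpunct_tokenize
from nltk.corpus import stopwords

def guess_language_nltk(text):
    words = set(word.lower() for word in wordpunct_tokenize(text))
    # Score each language by how many of its stopwords appear in the text
    scores = {
        lang: len(words & set(stopwords.words(lang)))
        for lang in stopwords.fileids()
    }
    return max(scores, key=scores.get)

print(guess_language_nltk('Den snabba bruna räven hoppar över den lata hunden'))
# should print something like 'swedish'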

UPDATE 2

A much better solution than NLTK is guess_language-spirit. It's super fast, and I spot-checked a bunch of its outputs by putting the non-English text into Google Translate; it almost always gets it right.
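
For example, something like this (I believe it exposes a guess_language() function that returns a language code; check its README to be sure):

>>> from guess_language import guess_language
>>> guess_language('Den snabba bruna räven hoppar över den lata hunden')
'sv'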

json-schema-reducer

August 2, 2016
0 comments Python

Last week I made a little library called json-schema-reducer. It's a simple function that takes a JSON Schema and a dict (or a JSON string or a .json file path), and makes a new dict that only contains the keys listed in the JSON Schema.

This is handy if you have a JSON Schema which dictates what you can/want to share/publish/save, but you have a data structure that contains keys and values you don't want to share/publish/save.

I built this because there are a couple of projects that can turn data structures into models from a JSON Schema but none that have the ability to reduce stuff from a data structure. Here's an example:

Sample JSON Schema (schema.json)

{
    "type": "object",
    "$schema": "http://json-schema.org/draft-04/schema#",
    "title": "Sample JSON Schema",
    "required": [
        "name",
        "sex"
    ],
    "properties": {
        "name": {
            "type": "string"
        },
        "sex": {
            "type": "string"
        },
        "title": {
            "type": "string"
        }
    }
}

Sample data structure (sample.json)

{
    "name": "Peter",
    "sex": "male",
    "email": "peterbe@example.com"
}

Usage


>>> from json_schema_reducer import make_reduced_dict
>>> make_reduced_dict('schema.json', 'sample.json')
{'name': 'Peter', 'sex': 'male'}  # Note! No "email" key

The project works in Python 2 and Python 3. See tests.

Also, the function tries to be convenient in that it can accept either a dict, a JSON string or a path to a .json file.
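
For example, passing already-parsed dicts instead of file paths should work the same way (a sketch based on the description above; check the project README for the exact accepted inputs):

>>> from json_schema_reducer import make_reduced_dict
>>> schema = {'properties': {'name': {'type': 'string'}, 'sex': {'type': 'string'}}}
>>> make_reduced_dict(schema, {'name': 'Peter', 'sex': 'male', 'email': 'peterbe@example.com'})
{'name': 'Peter', 'sex': 'male'}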

Premailer 3.0.0 - classes kept by default

June 7, 2016
0 comments Python, Web development

Today I released a new major version of premailer where the only difference is that one of the default options has changed from True to False.
The git commit for this change might look big but the only difference is that now, by default, the HTML class attribute is kept in the output HTML.

When premailer started, the land of HTML emails was very different. Basically, you didn't use CSS media queries back then, so there was no reason to keep the class attribute. These days, all pretty HTML emails need media queries, and for those to work you need to keep the class attribute in the HTML.

So fear not the major version upgrade! If you used to use premailer like this:


from premailer import Premailer

transformer = Premailer(html)
output_html = transformer.transform()

...and you depended on the class attributes being removed, you now need to change it to:


from premailer import Premailer

transformer = Premailer(html, remove_classes=True)
output_html = transformer.transform()

As always, you can play with it on premailer.io.

hashin 0.5.0 bug fix

May 17, 2016
0 comments Python

Thank you @davehunt for finding an ugly bug in hashin.

The bug happened when you tried to add a package that had a similar name to a package that was already in your requirements.txt. By similar, I mean the new package's name was contained inside another name. E.g. 'selenium==' is inside 'pytest-selenium=='.
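
To illustrate the failure mode (a sketch, not the actual code in hashin; the version number is made up): a naive substring check matches the wrong line, whereas anchoring the match to the start of a line does not:


import re

requirements_txt = "pytest-selenium==1.2.1\n"

# Naive substring check: wrongly concludes "selenium" is already listed
print('selenium==' in requirements_txt)  # True

# Anchored check: only matches a line that starts with the package name
print(bool(re.search(r'^selenium==', requirements_txt, re.MULTILINE)))  # False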

Here's the fix.

So make sure you upgrade to version >0.5.

Time to do concurrent CPU bound work

May 13, 2016
3 comments Python, Linux, macOS

Did you see my blog post about Decorated Concurrency - Python multiprocessing made really really easy? If not, fear not. There, I demonstrate how I take a task of creating 100 thumbnails from a large JPG. First in serial, then concurrently, with a library called deco. The total time to get through the work drops massively when you do it concurrently. No surprise. But what's interesting is that each individual task takes a lot longer. Instead of 0.29 seconds per image it took 0.65 seconds per image (...inside each dedicated processor).

The simple explanation, even from a layman like myself, must be that when doing so much more concurrently, the whole operating system struggles to keep up with its other little tasks.

With deco you can either let Python's multiprocessing use as many CPUs as your computer has (8 in the case of my MacBook Pro) or you can set it manually. E.g. @concurrent(processes=5) would spread the work across a maximum of 5 CPUs.

So, I ran my little experiment again for every number from 1 to 8 and plotted the results:

Time elapsed vs. work time

What to take away...

The blue bars are the time it takes, in total, from starting the program till the program ends. The lower the better.

The red bars are the time it takes, in total, to complete each individual task.

Meaning, when the number of CPUs is low you have to wait longer for all the work to finish, and when the number of CPUs is high the computer needs more time to finish each individual piece of work. This is an insight into over-use of operating system resources.

If the work is much, much more demanding than in this experiment (the JPG is only 3.3 MB and one thumbnail only takes 0.3 seconds to make) you might have a red bar on the far right that is too expensive for your server. Or worse, it might break things so that everything stops.

In conclusion...

Choose wisely. Be aware how "bound" the task is.

Also, remember that if the work of each individual task is too "light", the overhead of messing with multiprocessing might actually cost more than it's worth.

The code

Here's the messy code I used:


import time
from PIL import Image
from deco import concurrent, synchronized
import sys

processes = int(sys.argv[1])
assert processes >= 1
assert processes <= 8


@concurrent(processes=processes)
def slow(times, offset):
    t0 = time.time()
    path = '9745e8.jpg'
    img = Image.open(path)
    size = (100 + offset * 20, 100 + offset * 20)
    img.thumbnail(size, Image.ANTIALIAS)
    img.save('thumbnails/{}.jpg'.format(offset), 'JPEG')
    t1 = time.time()
    times[offset] = t1 - t0


@synchronized
def run(times):
    for index in range(100):
        slow(times, index)

t0 = time.time()
times = {}
run(times)
t1 = time.time()
print "TOOK", t1-t0
print "WOULD HAVE TAKEN", sum(times.values())

UPDATE

I just wanted to verify that the experiment is valid, i.e. that CPU bound work hogs resources across CPUs in a way that affects their individual performance.

Let's try a similar but totally different workload: a network bound task. This time, instead of resizing JPEGs, each task waits for an HTTP GET request to finish.

Network bound

So clearly it makes sense. The individual work within each process is not generally slowed down much. A tiny bit, but not much. Also, I like the smoothness of the curve of the blue bars going from left to right. You can clearly see that it's reverse logarithmic.

Decorated Concurrency - Python multiprocessing made really really easy

May 13, 2016
17 comments Python

tl;dr There's a new interesting wrapper on Python multiprocessing called deco, written by Alex Sherman and Peter Den Hartog, both at University of Wisconsin - Madison. It makes Python multiprocessing really really easy.

The paper is here (PDF) and the code is here: https://github.com/alex-sherman/deco.

This library is based on something called Pydron which, if I understand it correctly, is still a piece of research with no code released. ("We currently estimate that we will be ready for the release in the first quarter of 2015.")

Apart from using simple decorators on functions, the big difference with deco is that it makes it really easy to get started and that there's a hard restriction on how the results of sub-process calls are gathered. In deco, you pass in a mutable object that has a keyed index (e.g. a Python dict). A Python list is also mutable, but appending to it isn't index based, meaning you could get race conditions on mylist.append().

"However, DECO does impose one important restriction on the program: all mutations may only by index based."

Some basic example

Just look at this example:


# before.py

import time

def slow(index):
    time.sleep(5)

def run():
    for index in list('123'):
        slow(index)
run()

And when run, you clearly expect it to take 15 seconds:

$ time python before.py

real    0m15.090s
user    0m0.057s
sys 0m0.022s

Ok, let's parallelize this with deco. First pip install deco, then:


# after.py

import time

from deco import concurrent, synchronized

@concurrent
def slow(index):
    time.sleep(5)

@synchronized
def run():
    for index in list('123'):
        slow(index)

run()

And when run, it should be less than 15 seconds:

$ time python after.py

real    0m5.145s
user    0m0.082s
sys 0m0.038s

About the order of execution

Let's put some logging into that slow() function above.


def slow(index):
    time.sleep(5)
    print 'done with {}'.format(index)

Run the example a couple of times and note that the order is not predictable:

$ python after.py
done with 1
done with 3
done with 2
$ python after.py
done with 1
done with 2
done with 3
$ python after.py
done with 3
done with 2
done with 1

That probably doesn't come as a surprise to those familiar with async stuff, but it's worth a reminder so you don't accidentally depend on the order.

@synchronized or .wait()

Remember the run() function in the example above? The @synchronized decorator is magic. It basically figures out that within the function call there are calls out to sub-process work. What it does is that it "pauses" until all those have finished. An alternative approach is to call the .wait() method on the decorated concurrent function:


def run():
    for index in list('123'):
        slow(index)
    slow.wait()

That works the same way. This could potentially be useful if you, on the next line, need to depend on the results. But if that's the case you could just split up the function and slap a @synchronized decorator on the split-out function.
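
In other words, something like this (a sketch of that split, not code from the deco docs):


@synchronized
def do_all_the_slow_work():
    for index in list('123'):
        slow(index)

def run():
    do_all_the_slow_work()
    # At this point all the slow() calls have finished,
    # so it's safe to depend on their results here.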

No Fire-and-forget

It might be tempting to not set the @synchronized decorator and not call .wait() hoping the work will be finished anyway somewhere in the background. The functions that are concurrent could be, for example, functions that generate thumbnails from a larger image or something time consuming where you don't care when it finishes, as long as it finishes.


# fireandforget.py
# THIS DOES NOT WORK
# And it's not expected to either.

@concurrent
def slow(index):
    time.sleep(5)

def run():
    for index in list('123'):
        slow(index)

run()

When you run it, you don't get an error:

$ time python fireandforget.py

real    0m0.231s
user    0m0.079s
sys 0m0.047s

But if you dig deeper, you'll find that it never actually executes those concurrent functions.

If you want to do fire-and-forget you need to have another service/process that actually keeps running and waiting for all work to be finished. That's how the likes of a message queue work.

Number of concurrent workers

multiprocessing.Pool automatically, as far as I can understand, figures out how many concurrent jobs it can run. On my Mac, where I have 8 CPUs, the number is 8.

This is easy to demonstrate. In the example above it does exactly 3 concurrent jobs, because len(list('123')) == 3. If I make it 8 items, the whole demo run takes, still, 5 seconds (plus a tiny amount of overhead). If I make it 9 items, it now takes 10 seconds.

How multiprocessing figures this out I don't know but I can't imagine it being anything but a standard lib OS call to ask the operating system how many CPUs it has.
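
My guess would be something like this standard library call (speculation on my part about the internals, but multiprocessing.Pool does default its number of workers to the CPU count):

>>> import multiprocessing
>>> multiprocessing.cpu_count()
8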

You can actually override this with your own number. It looks like this:


from deco import concurrent

@concurrent(processes=5)
def really_slow_and_intensive_thing():
    ...

So that way, the operating system doesn't get too busy. It's like a throttle.

A more realistic example

Let's actually use the mutable object for something, and let's do something that isn't just a time.sleep(). Also, let's do something that is CPU bound. A lot of the time concurrency is useful when you're network bound, because running many network-waiting things at the same time doesn't stop the system from being able to do other things.

Here's the code:


import time

from PIL import Image
from deco import concurrent, synchronized


@concurrent
def slow(times, offset):
    t0 = time.time()
    path = '9745e8.jpg'
    img = Image.open(path)
    size = (100 + offset * 20, 100 + offset * 20)
    img.thumbnail(size, Image.ANTIALIAS)
    img.save('thumbnails/{}.jpg'.format(offset), 'JPEG')
    t1 = time.time()
    times[offset] = t1 - t0

@synchronized
def run(times):
    for index in range(100):
        slow(times, index)

t0 = time.time()
times = {}
run(times)
t1 = time.time()
print "TOOK", t1-t0
print "WOULD HAVE TAKEN", sum(times.values())

It generates 100 different thumbnails from a very large original JPG. Running this on my MacBook Pro takes 8.4 seconds, but the individual times add up to a total of 65.1 seconds. The numbers make sense, because 65 seconds / 8 cores ~= 8 seconds.

But, where it gets really interesting is that if you remove the deco decorators and run 100 thumbnail creations in serial, on my laptop, it takes 28.9 seconds. Now, 28.9 seconds is much more than 8.4 seconds, so it's still a win for multiprocessing for this kind of CPU bound work. However, a stampeding herd of 8 CPU-intensive tasks at the same time can put some serious strain on your system. Also, it could cause high spikes in memory allocation that wouldn't have happened if freed space could be re-used, as in the serial pattern.

Here, by the way, is the difference in what this looks like in the Activity Monitor:

Fully concurrent PIL work

Running PIL in all CPUs

Same work but in serial

In serial

One more "realistic" pattern

Let's do this again with a network bound task. Let's download 100 webpages from my blog. We'll do this by keeping an index where the URL is the key and the value is the time it took to download that one individual URL. This time, let's start with the serial pattern:

(Note! I ran these two experiments a couple of times so that the server-side cache would get a chance to clear out outliers)


import time, requests

urls = """
https://www.peterbe.com/plog/blogitem-040212-1
https://www.peterbe.com/plog/geopy-distance-calculation-pitfall
https://www.peterbe.com/plog/app-for-figuring-out-the-best-car-for-you
https://www.peterbe.com/plog/Mvbackupfiles
...a bunch more...
https://www.peterbe.com/plog/swedish-holidays-explaine
https://www.peterbe.com/plog/wing-ide-versus-jed
https://www.peterbe.com/plog/worst-flash-site-of-the-year-2010
""".strip().splitlines()
assert len(urls) == 100

def download(url, data):
    t0 = time.time()
    assert requests.get(url).status_code == 200
    t1 = time.time()
    data[url] = t1-t0

def run(data):
    for url in urls:
        download(url, data)

somemute = {}
t0 = time.time()
run(somemute)
t1 = time.time()
print "TOOK", t1-t0
print "WOULD HAVE TAKEN", sum(somemute.values()), "seconds"

When run, the output is:

TOOK 35.3457410336
WOULD HAVE TAKEN 35.3454759121 seconds

Now, let's add the deco decorators, so basically these changes:

from deco import concurrent, synchronized

@concurrent
def download(url, data):
    t0 = time.time()
    assert requests.get(url).status_code == 200
    t1 = time.time()
    data[url] = t1-t0

@synchronized
def run(data):
    for url in urls:
        download(url, data)

And the output this time:

TOOK 5.13103795052
WOULD HAVE TAKEN 39.7795288563 seconds

So, instead of it having to take 39.8 seconds it only needed to take 5 seconds with extremely little modification. I call that a win!

What's next

Easy: actually build something that uses this.