Peterbe.com

Peter Bengtsson's blog

Page 30

Don't forget your sets in Python!

March 10, 2017
4 comments Python

I had this piece of code:


new_all_ids = set(
    x for x in all_ids if x not in to_process_ids
)

The all_ids is a list object of 1.1 million IDs. Some repeated. And to_process_ids is a list sample of 1,000 randomly selected IDs from that list but all unique. The objective of the code was to first extract 1,000 IDs, do something when them, then remove those 1,000 from the original list and once that's done, update all_ids back with those from to_process_ids removed.

Only problem was that I noticed that this operation took 30+ seconds! How can it take 30 seconds to do a little bit of list comprehension? The explanation is the lack of index lookups.

What the code actually does is this:


new_all_ids = set()
for x in all_ids:
    not_found = False
    for y in to_process_ids:
        if y != x:
            not_found = True
    if not_found:
         new_all_ids.add(x)

Can you see it? It's doing a nested loop. Or, O(n^2) in computer lingo. That's 1.1 million * 1,000 iterations of that conditional.

What sets do, on the other hand, is that they convert the list of keys in the set to a hash table. Kinda similar to how dict works except it doesn't point to a value. That means that asking a set if it contains a certain key is a O(1) operation.

You might have spotted the solution already.
If you instead do:


to_process_ids_set = set(to_process_ids)
new_all_ids = set(
    x for x in all_ids if x not in to_process_ids_set
)

Now, in my particular example, instead of taking 30+ seconds this implementation only takes 0.026 seconds. 1,000 times faster.

You might also have noticed that there's a much more convenient way to do this very same thing, namely set operations! The desired new_all_ids is going to become a set anyway. And if we can first convert it to a set, then do the operation on it we can avoid looping over repeats as we do the checking.

Final solution:


new_all_ids = set(all_ids) - set(to_process_ids)

You can try it yourself with this little benchmark:


"""
Demonstrate three ways how to reduce a non-unique list of integers
by 1,000 randomly selected unique ones.
Demonstrates the huge difference between lists and set lookups.
"""
import time
import random


# Range numbers chosen quite arbitrarily to get a benchmark that lasts
# a couple of seconds with a realistic proportion of unique and 
# repeated integers.
items = 200000
all_ids = [random.randint(1, items * 2) for _ in range(items)]

print('About', 100. * len(set(all_ids)) / len(all_ids), '% unique')

to_process_ids = random.sample(set(all_ids), 1000)


def f1(to_process_ids):
    return set(x for x in all_ids if x not in to_process_ids)


def f2(to_process_ids):
    to_process_ids_set = set(to_process_ids)
    return set(x for x in all_ids if x not in to_process_ids_set)


def f3(to_process_ids):
    return set(all_ids) - set(to_process_ids)


functions = [f1, f2, f3]
results = {f.__name__: [] for f in functions}
for i in range(10):
    random.shuffle(functions)
    for f in functions:
        t0 = time.time()
        f(to_process_ids)
        t1 = time.time()
        results[f.__name__].append(t1 - t0)

for function in sorted(results):
    print(function, sum(results[function])/ len(results[function]))

When I run that with my Python 3.5.1 I get:

About 78.616 % unique
f1 3.9494598865509034
f2 0.041156983375549315
f3 0.02245485782623291

Seems to match expectations.

crontabber now supports locking, both high- and low-level

March 4, 2017
0 comments Python, Mozilla, PostgreSQL

tl;dr; In other words, you can now have multiple servers with crontabber, all talking to a central PostgreSQL for its state, and not have to worry about jobs being started more than exactly once. This will be super useful if your crontabber apps are such that they kick of stored procedures that would freak out if run more than once with the same parameters.

crontabber is an advanced Python program to run cron-like applications in a predictable way. Every app is a Python class with a run method. Example here. Until version 0.18 you had to do locking outside but now the locking has been "internalized". Meaning, if you open two terminals and run python crontabber.py --admin.conf=myconfig.ini in both you don't have to worry about it starting the same apps in parallel.

General, business logic locking

Every app has a state. It's stored in PostgreSQL. It looks like this:

# \d crontabber
              Table "public.crontabber"
    Column    |           Type           | Modifiers
--------------+--------------------------+-----------
 app_name     | text                     | not null
 next_run     | timestamp with time zone |
 first_run    | timestamp with time zone |
 last_run     | timestamp with time zone |
 last_success | timestamp with time zone |
 error_count  | integer                  | default 0
 depends_on   | text[]                   |
 last_error   | json                     |
 ongoing      | timestamp with time zone |
Indexes:
    "crontabber_unique_app_name_idx" UNIQUE, btree (app_name)

The last column, ongoing used to be just for the "curiosity". For example, in Socorro we used that to display a flashing message about which jobs are ongoing right now.

As of version 0.18, this ongoing column is actually used to NOT run apps again. Basically, when started, crontabber figures out which app to run next (assuming it's time to run it) and now the first thing it does is look up if it's ongoing already, and if it is the whole crontabber application exits with an error code of 3.

Sub-second locking

What might happen is that two separate servers which almost perfectly synchronoized clocks might have cron run crontabber at the "exact" same time. Or rather, only a few milliseconds apart. But the database is central so what might happen is that two distinct PostgreSQL connection tries to send a... UPDATE crontabber SET ongoing=now() WHERE app_name='some-app-name' at the very same time.

So how is this solved? The answer is row-level locking. The magic sauce is here. You make a select, by app_name with a suffix of FOR UPDATE WAIT. Imagine two distinct PostgreSQL connections sending this:


BEGIN;
SELECT ongoing FROM crontabber WHERE app_name = 'my-app-name'
FOR UPDATE NOWAIT;

-- do some other stuff in Python

UPDATE crontabber SET ongoing = now() WHERE app_name = 'my-app-name';
COMMIT;

One of them will succeed the other will raise an error. Now all you need to do is catch that raised error, check that it's a row-level locking error and not some other general error. Instead of worrying about the raised error you just accept it and exit the program early.

This screenshot of a test.sql script demonstrates this:

Two terminals lined up and I start one and quickly switch and start the other one

Another way to demonstrate this is to use psycopg2 in a little script:


import threading
import psycopg2


def updater():
    connection = psycopg2.connect('dbname=crontabber_exampleapp')
    cursor = connection.cursor()
    cursor.execute("""
    SELECT ongoing FROM crontabber WHERE app_name = 'bar'
    FOR UPDATE NOWAIT
    """)
    cursor.execute("""
    UPDATE crontabber SET ongoing = now() WHERE app_name = 'bar'
    """)
    print("JOB CAN START!")
    connection.commit()


# Use threads to simulate starting two connections virtually 
# simultaneously.
threads = [
    threading.Thread(target=updater),
    threading.Thread(target=updater),
]
for thread in threads:
    thread.start()
for thread in threads:
    thread.join()

The output of this is:

▶ python /tmp/test.py
JOB CAN START!
Exception in thread Thread-1:
Traceback (most recent call last):
...
OperationalError: could not obtain lock on row in relation "crontabber"

With threads, you never know exactly which one will work and which one will not. In this case it was Thread-1 that sent its SQL a couple of nanoseconds too late.

In conclusion...

As of version 0.18 of crontabber, all locking is now dealt with inside crontabber. You still kick off crontabber from cron or crontab but if your cron does kick it off whilst it's still in the midst of running a job, it will simply exit with an error code of 2 or 3.

In other words, you can now have multiple servers with crontabber, all talking to a central PostgreSQL for its state, and not have to worry about jobs being started more than exactly once. This will be super useful if your crontabber apps are such that they kick of stored procedures that would freak out if run more than once with the same parameters.

Podcasttime.io - How Much Time Do Your Podcasts Take To Listen To?

February 13, 2017
3 comments Python, Web development, Django, JavaScript, React

tl;dr; It's a web app where you search and find the podcasts you listen to. It then gives you a break down how much time that requires to keep up, per day, per week and per month. Podcasttime.io

First I wrote some scripts to scrape various sources of podcasts. This is basically a RSS feed URL from which you can fetch the name and an image. And with some cron jobs you can download and parse each podcast feed and build up an index of how many episodes they have and how long each episode is. Together with each episodes "publish date" you can easily figure out an average of how much content each podcast puts out over time.

Suppose you listen to JavaScript Air, Talk Python To Me and Google Cloud Platform Podcast for example, that means you need to listen to podcasts for about 8 minutes per day to keep up.

The Back End

The technology is exciting. The backend is a Django 1.10 server. It manages a PostgreSQL database of all the podcasts, episodes, cron jobs etc. Through Django ORM signals is packages up each podcast with its metadata and stores it in an Elasticsearch database. All the communication between Django and ElasticSearch is done with Elasticsearch DSL.

Also, all the downloading and parsing of feeds is done as background tasks in Celery. This got really interesting/challenging because sooo many podcasts are poorly marked up and many a times the only way to find out how long an episode is is to use ffmpeg to probe it and that takes time.

Another biggish challenge is that fact that often things simply don't work because of networks being what they are, unreliable. So you have to re-attempt network calls without accidentally getting caught in infinite loops of accidentally putting a bad/broken RSS feed back into the background queue again and again and again.

The Front End

Actually, the first prototype of this app was written with Django as the front end plus some jQuery to tie things together. On a plane ride, and as an excuse to learn it, I re-wrote the whole thing in React with Redux. To be honest, I never really enjoyed that and it felt like everything was hard and I had to do more jumping-around-files than actual coding. In particular, Redux is nice but when you have a lot of AJAX both inside components and upon mounting it gets quite messy in my humble opinion.

So, on another plane ride (to Hawaii, so I had more time) I re-wrote it from scratch but this time using three beautiful pieces of front end technology: create-react-app, Mobx and mobx-router. Suddenly it became fun again. Mobx (or Redux or something "fluxy") is necessary if you want fancy pushState URLs AND a central (aka global) state management.

To be perfectly honest, I never actually tried combining Mobx with something like react-router or if it's even possible. But with mobx-router it's quite neat. You write a "views route map" (see example) where you can kick off AJAX before entering (and leaving) routes. Then you use that to populate a global store and now all components can be almost entirely about simply rendering the store. There is some AJAX within the mounted components (e.g. the search and autocomplete).

On the home page, there's a chart that rather unscientifically plots episode durations over time as a line chart. I'm trying a library called Plotly which is actually a online app for building charts but they offer a free JavaScript library too for generating graphs. Not entirely sure how I feel about it yet but apart from looking a big crowded on mobile, it's working really well.

A Killer Feature

This is a pattern I've wanted to build but never managed to get right. The way to get data about a podcast (and its episodes) is to do an Elasticsearch search. From the homepage you basically call /find?q=Planet%20money when you search. That gives you almost all the information you need. So you store that in the global store. Then, if the user clicks on that particular podcast to go to its "perma page" you can simply load that podcast's individual route and you don't need to do something like /find?id=727 because you already have everything you need. If the user then opens that page in a new tab or reloads you now have to fetch just the one podcast, so you simply call /find?id=727. In other words, subsequent page loads load instantly! (Basically, it updates the store's podcast object upon clicking any of the podcasts iterated over from the listing. Code here)

And to top that - and this is where a good router shines - if you make a search or something, click something and click back since you have a global store of state, you can simply reuse that without needing another AJAX query.

The State of the Future

First of all, this is a fun little side project and it's probably buggy. My goal is not to make money on it but to build up a graph. Every time someone uses the site and finds the podcasts they listen to that slowly builds up connections. If you listen to "The Economist", "Planet Money" and "Freakonomics", that tie those together loosely. It's hard to programmatically know that those three podcasts are "related" but they are by "peoples' taste".

The ultimate goal of this is; now I can recommend other podcasts based on a given set. It's a little bit like LastFM used to work. Using Audioscrobbler LastFM was able to build up a graph based on what people preferred to listen to and using that network of knowledge they can recommend things you have not listened to but probably would appreciate.

At the moment, there's a simple Picks listing of "lists" (aka "picks") that people have chosen. With enough time and traffic I'll try to use Elasticsearch's X-Pack Graph capabilities to develop a search engine based on this.

At the time of writing, I've indexed 4,669 podcasts, spanning 611,025 episodes which equates to 549,722 hours of podcast content.

The Code

The front end code is available on github.com/peterbe/podcasttime2 and is relatively neat and tidy. The most interesting piece is probably the views/index.js which is the "controller" of things. That's where it decides which component to render, does the AJAX queries and manages the global store.

The back end code is a bit messier. It's done as an "app" as part of this very blog. The way the Elasticsearch indexing is configured is here and the hotch potch code for scraping and parsing RSS feeds is here.

Please try it out and show me your selection. You can drop feedback here.

Autocompeter is Dead. Long live Autocompeter!

January 9, 2017
0 comments Python, Web development, Go

About 2 years ago I launched Autocompeter.com. It was two parts:

1) A autocompeter.js pure JavaScript solution to add autocomplete to a search input field.
2) A REST API where you can submit titles with a HTTP header key, and a fancy autocomplete search.

Only Rewrote the Go + Redis part

The second part has now been completely re-written. The server was originally written in Go and used Redis. Now it's Django and ElasticSearch.

The ultimate reason for this was that Redis was, by far, the biggest memory consumer on my shared DigitalOcean server. The way it worked was that every prefix of every word in every title was indexes as a key. For example the words p, pe, pet, pete, peter and peter$ are all keys and they point to an array of IDs that you then look up to get the distinct set of titles and their URLs. This makes it really really fast but since redis doesn't support namespaces, or multiple columns it means that for every prefix it needs a prefix of its own for the domain they belong to. So the hash for www.peterbe.com is eb9f747 so the strings to store are instead eb9f747p, eb9f747pe, eb9f747pet, eb9f747pete, eb9f747peter and eb9f747peter$.

ElasticSearch on the other hand has ALL of this built in deep in Lucene. AND you can filter. So the way it's queried now instead is something like this:


search = TitleDoc.search()
search = search.filter('term', domain=domain.name)
search = search.query(Q('match_phrase', title=request.GET['q']))
search = search.sort('-popularity', '_score')
search = search[:size]
response = search.execute()
...

And here's how the mapping is defined:


from elasticsearch_dsl import (
    DocType,
    Float,
    Text,
    Index,
    analyzer,
    Keyword,
    token_filter,
)


edge_ngram_analyzer = analyzer(
    'edge_ngram_analyzer',
    type='custom',
    tokenizer='standard',
    filter=[
        'lowercase',
        token_filter(
            'edge_ngram_filter', type='edgeNGram',
            min_gram=1, max_gram=20
        )
    ]
)


class TitleDoc(DocType):
    id = Keyword()
    domain = Keyword(required=True)
    url = Keyword(required=True, index=False)
    title = Text(
        required=True,
        analyzer=edge_ngram_analyzer,
        search_analyzer='standard'
    )
    popularity = Float()
    group = Keyword()

I'm learning ElasticSearch rapidly but I still feel like I have so much to learn. This solution I have here is quite good and I'm pretty happy with the results but I bet there's a lot of things I can learn to make it even better.

Why Ditch Go?

I actually had a lot of fun building the first server version of Autocompeter in Go but Django is just so many times more convenient. It's got management commands, ORM, authentication system, CSRF protection, awesome error reporting, etc. All built in! With Go I had to build everything from scratch.

Also, I felt like the important thing here is the JavaScript client and the database. Now that I've proven this to work with Django and elasticsearch-dsl I think it wouldn't be too hard to re-write the critical query API in Go or in something like Sanic for maximum performance.

All Dockerized

Oh, one of the reasons I wanted to do this new server in Python is because I want to learn Docker better and in particular Docker with Python projects.

The project is now entirely contained in Docker so you can start the PostgreSQL, ElasticSearch 5.1.1 and Django with docker-compose up. There might be a couple of things I've forgot to document for how to configure things but this is actually the first time I've developed something entirely in Docker.

ElasticSearch 5 in Travis-CI

January 6, 2017
0 comments Python, Linux, Web development

tl;dr; Here's a working .travis.yml file that works with ElasticSearch 5.1.1

I had to jump through hoops to get Travis-CI to run with ElasticSearch 5.1.1 and I thought I'd share. If you just do:

services:
  - elasticsearch

This is from the Travis-CI documentation but this installs ElasticSearch 1.4. Not good enough. The instructions on the same page for using higher versions did not work for me.

To get a specific version you need to download it yourself and install it with dpkg -i but the problem is that if you want to use ElasticSearch version 5, you need to have Java 1.8. The short answer is that this is how you install Java 1.8:

addons:
  apt:
    packages:
      - oracle-java8-set-default

But now you need to sudo so you need to add sudo: true in your .travis.yml. Bummer, because it makes the build a bit slower. However, a necessary evil.

The critical line I use to install it is this:

curl -O https://artifacts.elastic.co/downloads/elasticsearch/elasticsearch-5.1.1.deb && \
sudo dpkg -i --force-confnew elasticsearch-5.1.1.deb && \
sudo service elasticsearch start

I thought I could "upgrade" the existing install, but that breaks thinks. In other words you have to remove the services: - elasticsearch line or else it can't upgrade.

Now, during debugging I was not getting errors on the line:

sudo service elasticsearch start

So I add this to be sure the right version got installed:

#!/bin/bash
curl -v http://localhost:9200/

and then I can see that the right version was installed. It should look something like this:

* About to connect() to localhost port 9200 (#0)
*   Trying 127.0.0.1... connected
> GET / HTTP/1.1
> User-Agent: curl/7.22.0 (x86_64-pc-linux-gnu) libcurl/7.22.0 OpenSSL/1.0.1 zlib/1.2.3.4 libidn/1.23 librtmp/2.3
> Host: localhost:9200
> Accept: */*
> 
< HTTP/1.1 200 OK
< content-type: application/json; charset=UTF-8
< content-length: 327
< 
{
  "name" : "m_acpqT",
  "cluster_name" : "elasticsearch",
  "cluster_uuid" : "b4_KnK6KQmSx64C9o-81Ug",
  "version" : {
    "number" : "5.1.1",
    "build_hash" : "5395e21",
    "build_date" : "2016-12-06T12:36:15.409Z",
    "build_snapshot" : false,
    "lucene_version" : "6.3.0"
  },
  "tagline" : "You Know, for Search"
}
* Connection #0 to host localhost left intact
* Closing connection #0

Note the line that says "number" : "5.1.1",.

So, yay! Hopefully this will help someone else because it took me quite a while to get right.

10 Reasons I Love create-react-app

January 4, 2017
4 comments Web development, JavaScript

I now have two finished side projects (1 & 2) done and deployed based on create-react-app which I've kinda fallen in love with. Here are my reasons why I love create-react-app:

1. It Just Works

That wonderful Applesque feeling of "Man! That wasn't hard at all!". It's been such a solid experience with so few hiccups that I'm tickled really impressed. I think it works so well because it's a big project (174 contributors at the time of writing) and the scope is tight. Many other modern web framework boilerplatey projects feels like "This is how I do it. See if you like it. But you better think like I do."

2. No Hard Decisions To Be Made

Webpack still intimidates me. Babel still intimidates me. ESlint still intimidates me. I know how to use all three but whenever I do I feel like I'm not doing it right and that I'm missing out on something that everyone else knows. create-react-app encompasses all of these tools with a set of hard defaults. I know the Babel it ships with just works and I know that it isn't configured to support decorators but it works nicely and so far I've not had any desires to extend it.

Some people will get claustrophic not being able to configure things. But I'm "a making kinda person" so I like this focus on shipping rather than doing it the rightest way. In fact, I think it's a feature that you can't configure things in create-react-app. The option to yarn run eject isn't really an option because once you've ejected, and now can configure, you're no longer using create-react-app.

3. I Feel Like I'm Using Up-to-date Tech/Tools

I've tried other React (and AngularJS) boilerplate projects before and once started you always get nervous upgrading underlying bits with fear that it's going to stop working. Once you start a create-react-app project there are only two (three technically) things you need to update in your package.json and that's react-scripts and react (and react-dom technically).

If you read about a new cool way of doing something with Webpack|Babel|ESlint most likely that will be baked into react-scripts soon for you. So just stay cool and stay up to date.

By the way, as a side-note. If you've gotten used to keeping your package.json up to date with npm outdated make sure you switch to yarn and run yarn outdated. With yarn outdated it only checks the packages you've listed in package.json against npmjs.com. With npm oudated it will mention dependencies within listed packages that are outdated and that's really hard to evaluate.

4. The Server Proxy

Almost all my recent projects are single-page apps that load a behemoth of .js code and once started it starts getting data from a REST server built in Django or Go or something. So, realistically the app looks like this:


  componentDidMount() {
    fetch('/api/userdata/')
    .then(r = r.json())
    .then(response => {
      this.setState({userData: response.user_data})
    })
  }

With create-react-app you just add this line to your package.json:

"proxy": "http://localhost:8000"

Now, when you open http://localhost:3000 in the browser and the AJAX kicks off the browser thinks it's talking to http://localhost:3000/api/userdata/ but the Node server that create-react-app ships with just automatically passes that on to http://localhost:8000/api/userdata/. (This example just changes the port but note that it's a URL so the domain can be different too).

Now you don't have to worry about CORS or having to run everything via a local Nginx and use rewrite rules to route the /api/* queries.

5. One Page Documentation

This single page is all you need.

At the time of writing that documentation uses npm instead of yarn and I don't know why that is but, anyways, it's nice that it's all there in one single page. To find out how to do something I can just ⌘-f and find.

6. Has Nothing To Do With SCSS, Less or PostCSS

SCSS, Less and PostCSS are amazing but don't work with create-react-app. You have to use good old CSS. Remember that one? This is almost always fine by me because the kinds of changes I make to the CSS are mainly nudging one margin here or floating one div there. For that it's fine to keep is simple.

Although I haven't used it yet I'm very intrigued to use styled components as a really neat trick to style things in a contained and decentralized way that looks neat.

7. Has Nothing To Do With Redux, react-router or MobX

I'm sure you've seen it too. You discover some juicy looking React boilerplate project that looks really powerful but then you discover that it "forces" you to use Redux|react-router|MobX. Individually these are amazing libraries but if the boilerplate forces a choice that will immediately put some people off and the boilerplate project will immediately suffer from lack of users/contributors because it made a design decision too early.

It's really not hard to chose one of the fancier state management apps or other fancy routing libraries since your create-react-app project starts from scratch. They're all good but since I get to start from scratch and I build with my chosen libraries and learn how they work since I have to do it myself.

8. You Still Get Decent Hot Module Reloading

I have never been comfortable with react-hot-loader. Even after I tried the version 3 branch there were lots of times where I had to pause to think if I have to refresh the page or not.

By default, when you start a create-react-app project, it has a websocket thing connected to the server that notices when you edit source files and when doing so it refreshes the page you're on. That might feel brutish compared to the magic of proper hot module reloading. But for a guy who prefers console.log over debugger breakpoints this feels more than good enough to me.

In fact, if you add this to then bottom of your src/index.js now, when you edit a file, it won't refresh the page. Just reload the app:


if (module.hot) {
  module.hot.accept()
}

It will reload the state but the web console log doesn't disappear and a full browser refresh is usually slower.

9. Running Tests Included

I'll be honest and admit that I don't write tests. Not for side projects. Reason being that I'm almost always a single developer plus the app is most likely more a proof of concept than something supposed to stand the test of time.

But usually writing tests in JavaScript projects is scary. Not because it's hard but because it's tricky to get started. Which libraries should I use? Usually when you have a full suite up and running you realize that you depend on so many different libraries and once it works you cry a little when you read on Hacker News about some new fancy suite runner that literally has bells and whistles. create-react-app builds on jest but the documentation carefully takes you through the steps to start with simple unit tests to start rendering components, code coverage and continuous CI.

10. Dan Abramov

Dan is a powerhouse in the React community. Not just because he wrote Hot Reloading, redux and most of create-react-app, but because he's so incredibly nice.

I follow him on Twitter and his humble persona (aka #juniordevforlife) and positive attitude makes me feel close to him. And he oozes both-feet-on-the-ground-ism when it comes to the crazy world of JavaScript framework/library frenzy we're going through.

The fact that he built most of create-react-app, and that he encourages its uptake, makes me feel like I'm about to climb up on the shoulder of a giant.

Using Fanout.io in Django

December 13, 2016
1 comment Python, Web development, Django, Mozilla, JavaScript

Earlier this year we started using Fanout.io in Air Mozilla to enhance the experience for users awaiting content updates. Here I hope to flesh out its details a bit to inspire others to deploy a similar solution.

What It Is

First of all, Fanout.io is basically a service that handles your WebSockets. You put in some of Fanout's JavaScript into your site that handles a persistent WebSocket connection between your site and Fanout.io. And to push messages to your user you basically send them to Fanout.io from the server and they "forward" it to the WebSocket.

The HTML page looks like this:


<html>
<body>

  <h1>Web Page</h1>

<!-- replace the FANOUT_REALM_ID with the ID you get in the Fanout.io admin page -->
<script 
  src="https://{{ FANOUT_REALM_ID }}.fanoutcdn.com/bayeux/static/faye-browser-1.1.2-fanout1-min.js"
></script>
<script src="fanout.js"></script>
</body>
</html>

And the fanout.js script looks like this:


window.onload = function() {
  // replace the FANOUT_REALM_ID with the ID you get in the Fanout.io admin page
  var client = new Faye.Client('https://{{ FANOUT_REALM_ID }}.fanoutcdn.com/bayeux')
  client.subscribe('/mycomments', function(data) {  
     console.log('Incoming updated data from the server:', data);
  })
};

And in server it looks something like this:


from django.conf import settings
import fanout

fanout.realm = settings.FANOUT_REALM_ID
fanout.key = settings.FANOUT_REALM_KEY


def post_comment(request):
    """A django view function that saves the posted comment"""
   text = request.POST['comment']
   saved_comment = Comment.objects.create(text=text, user=request.user)
   fanout.publish('mycomments', {'new_comment': saved_comment.id})
   return http.JsonResponse({'comment_posted': True})

Note that, in the client-side code, there's no security since there's no authentication. Any client can connect to any channel. So it's important that you don't send anything sensitive. In fact, you should think of this pattern simply as a hint that something has changed. For example, here's a slightly more fleshed out example of how you'd use the subscription.


window.onload = function() {
  // replace the FANOUT_REALM_ID with the ID you get in the Fanout.io admin page
  var client = new Faye.Client('https://{{ FANOUT_REALM_ID }}.fanoutcdn.com/bayeux')
  client.subscribe('/mycomments', function(data) {  
    if (data.new_comment) {
      // server says a new comment has been posted in the server
      $.json('/comments', function(response) {
        $('#comments .comment').remove();
        $.each(response.comments, function(comment) {        
          $('<div class="comment">')
          .append($('<p>').text(comment.text))
          .append($('<span>').text('By: ' + comment.user.name))
          .appendTo('#comments');
        });
      });
    }
  })
};

Yes, I know jQuery isn't hip but it demonstrates the pattern well. Also, in the real world you might not want to ask the server for all comments (and re-render) but instead do an AJAX query to get all new comments since some parameter or something.

Why It's Awesome

It's awesome because you can have a simple page that updates near instantly when the server's database is updated. The alternative would be to do a setInterval loop that frequently does an AJAX query to see if there's new content to update. This is cumbersome because it requires a lot heavier AJAX queries. You might want to make it secure so you engage sessions that need to be looked up each time. Or, since you're going to request it often you have to write a very optimized server-side endpoint that is cheap to query often.

And last but not least, if you rely on an AJAX loop interval, you have to pick a frequency that your server can cope with and it's likely to be in the range of several seconds or else it might overload the server. That means that updates are quite delayed.

But maybe most important, you don't need to worry about running a WebSocket server. It's not terribly hard to do one yourself on your laptop with a bit of Node Express or Tornado but now you have yet another server to maintain and it, internally, needs to be connected to a "pub-sub framework" like Redis or a full blown message queue.

Alternatives

Fanout.io is not the only service that offers this. The decision to use Fanout.io was taken about a year ago and one of the attractive things it offers is that it's got a freemium option which is ideal for doing local testing. The honest truth is that I can't remember the other justifications used to chose Fanout.io over its competitors but here are some alternatives that popped up on a quick search:

It seems they all (including Fanout.io) has freemium plans, supports authentication, REST APIs (for sending and for querying connected clients' stats).

There are also some more advanced feature packed solutions like Meteor, Firebase and GunDB that act more like databases that are connected via WebSockets or alike. For example, you can have a database as a "conduit" for pushing data to a client. Meaning, instead of sending the data from the server directly you save it in a database which syncs to the connected clients.

Lastly, I've heard that Heroku has a really neat solution that does something similar whereby it sets up something similar as an extension.

Let's Get Realistic

The solution sketched out above is very simplistic. There are a lot more fine-grained details that you'd probably want to zoom in to if you're going to do this properly.

Throttling

In Air Mozilla, we call fanout.publish(channel, message) from a post_save ORM signal. If you have a lot of saves for some reason, you might be sending too many messages to the client. A throttling solution, per channel, simply makes sure your "callback" gets called only once per channel per small time frame. Here's the solution we employed:


window.Fanout = (function() {
  var _locks = {};
  return {
    subscribe: function subscribe(channel, callback) {
      _client.subscribe(channel, function(data) {
          if (_locks[channel]) {
              // throttled
              return;
          }
          _locks[channel] = true;
          callback(data);
          setTimeout(function() {
              _locks[channel] = false;
          }, 500);
      });        
    };
  }
})();

Subresource Integrity

Subresource integrity is an important web security technique where you know in advance a hash of the remote JavaScript you include. That means that if someone hacks the result of loading https://cdn.example.com/somelib.js the browser compares the hash of that with a hash mentioned in the <script> tag and refuses to load it if the hash doesn't match.

In the example of Fanout.io it actually looks like this:


<script 
  src="https://{{ FANOUT_REALM_ID }}.fanoutcdn.com/bayeux/static/faye-browser-1.1.2-fanout1-min.js"
  crossOrigin="anonymous"
  integrity="sha384-/9uLm3UDnP3tBHstjgZiqLa7fopVRjYmFinSBjz+FPS/ibb2C4aowhIttvYIGGt9"
></script>

The SHA you get from the Fanout.io documentation. It requires, and implies, that you need to use an exact version of the library. You can't use it like this: <script src="https://cdn.example/somelib.latest.min.js" ....

WebSockets vs. Long-polling

Fanout.io's JavaScript client follows a pattern that makes it compatible with clients that don't support WebSockets. The first technique it uses is called long-polling. With this the server basically relys on standard HTTP techniques but the responses are long lasting instead. It means the request simply takes a very long time to respond and when it does, that's when data can be passed.

This is not a problem for modern browsers. They almost all support WebSocket but you might have an application that isn't a modern browser.

Anyway, what Fanout.io does internally is that it first creates a long-polling connection but then shortly after tries to "upgrade" to WebSockets if it's supported. However, the projects I work only need to support modern browsers and there's a trick to tell Fanout to go straight to WebSockets:


var client = new Faye.Client('https://{{ FANOUT_REALM_ID }}.fanoutcdn.com/bayeux', {
    // What this means is that we're opting to have
    // Fanout start with fancy-pants WebSocket and
    // if that doesn't work it falls back on other
    // options, such as long-polling.
    // The default behaviour is that it starts with
    // long-polling and tries to "upgrade" itself
    // to WebSocket.
    transportMode: 'fallback'
});

Fallbacks

In the case of Air Mozilla, it already had a traditional solution whereby it does a setInterval loop that does an AJAX query frequently.

Because the networks can be flaky or because something might go wrong in the client, the way we use it is like this:


var RELOAD_INTERVAL = 5;  // seconds

if (typeof window.Fanout !== 'undefined') {
    Fanout.subscribe('/' + container.data('subscription-channel-comments'), function(data) {
        // Supposedly the comments have changed.
        // For security, let's not trust the data but just take it
        // as a hint that it's worth doing an AJAX query
        // now.
        Comments.load(container, data);
    });
    // If Fanout doesn't work for some reason even though it
    // was made available, still use the regular old
    // interval. Just not as frequently.
    RELOAD_INTERVAL = 60 * 5;
}
setInterval(function() {
    Comments.reload_loop(container);
}, RELOAD_INTERVAL * 1000);

Use Fanout Selectively/Progressively

In the case of Air Mozilla, there are lots of pages. Some don't ever need a WebSocket connection. For example, it might be a simple CRUD (Create Update Delete) page. So, for that I made the whole Fanout functionality "lazy" and it only gets set up if the page has some JavaScript that knows it needs it.

This also has the benefit that the Fanout resource loading etc. is slightly delayed until more pressing things have loaded and the DOM is ready.

You can see the whole solution here. And the way you use it here.

Have Many Channels

You can have as many channels as you like. Don't create a channel called comments when you can have a channel called comments-123 where 123 is the ID of the page you're on for example.

In the case of Air Mozilla, there's a channel for every single page. If you're sitting on a page with a commenting widget, it doesn't get WebSocket messages about newly posted comments on other pages.

Conclusion

We've now used Fanout for almost a year in our little Django + jQuery app and it's been great. The management pages in Air Mozilla use AngularJS and the integration looks like this in the event manager page:


window.Fanout.subscribe('/events', function(data) {
    $scope.$apply(lookForModifiedEvents);
});

Fanout.io's been great to us. Really responsive support and very reliable. But if I were to start a fresh new project that needs a solution like this I'd try to spend a little time to investigate the competitors to see if there are some neat features I'd enjoy.

UPDATE

Fanout reached out to help explain more what's great about Fanout.io

"One of Fanout's biggest differentiators is that we use and promote open technologies/standards. For example, our service supports the open Bayeux protocol, and you can connect to it with any compatible client library, such as Faye. Nearly all competing services have proprietary protocols. This "open" aspect of Fanout aligns pretty well with Mozilla's values, and in fact you'd have a hard time finding any alternative that works the same way."

Cope with JSONDecodeError in requests.get().json() in Python 2 and 3

November 16, 2016
6 comments Python

UPDATE June 2020

Thanks to commentors, I've been made aware that requests will use simplejson if it's installed to handle the deserialization of the JSON. So if you have simplejson in your requirements.txt or pyproject.toml you have to change to this:


import requests
import simplejson

response = requests.get(url)
try:
    print(response.json())
except simplejson.errors.JSONDecodeError:
    print("N'est pas JSON")

Or, make your app more solid by doing the same trick that requests does:


import requests
try:
    from simplejson.errors import JSONDecodeError
except ImportError:
    from json.decoder import JSONDecodeError

response = requests.get(url)
try:
    print(response.json())
except JSONDecodeError:
    print("N'est pas JSON")

Now, back to the old blog post...

Suppose you don't know with a hundred percent certainty that an API will respond in with a JSON payload you need to protect yourself.

This is how you do it in Python 3:


import json
import requests

response = requests.get(url)
try:
    print(response.json())
except json.decoder.JSONDecodeError:
    print("N'est pas JSON")

This is how you do it in Python 2:


import requests

response = requests.get(url)
try:
    print response.json()
except ValueError:
    print "N'est pas JSON"

Here's how you make the code work across both:


import json
import requests

try:
    from json.decoder import JSONDecodeError
except ImportError:
    JSONDecodeError = ValueError

response = requests.get(url)
try:
    print(response.json())
except JSONDecodeError:
    print("N'est pas JSON")

How to deploy a create-react-app

November 4, 2016
2 comments Web development, JavaScript, React

First of all, create-react-app is an amazing kit. It's a zero configuration bundle that gives you a react app boilerplate with a dev server, linting and a deployment tool. All are awesome but not perfect.

I could go on giving this project praise but if you're here reading this you might be convinced already.

Anyway, the way you deploy a create-react-app project is actually stunningly simple, but there is one major caveat to look out for. Basically running yarn run build will first delete existing files in the ./build/ directory. Files that it indents to replace. For example your ./build/index.html or your ./build/static/js/main.94a86fe3.js.

So, what I suggest is that you deploy it like this:

#!/bin/bash

# Go into the project where the package.json exists
cd myproject
# Upgrade any libraries
yarn
# Use 
yarn run build
mv build build_final

Note! This tip is only applicable if you deploy "in place" as opposed to building a whole new container/image and swapping an old container/image for a new one.

Now, for your Nginx point to the ./build_final directory instead. For example:

# /etc/nginx/sites-enabled/mysite.conf
server {
    server_name mydomain.example.com;
    root /full/path/to/myproject/build_final;

    location / {
        try_files $uri /index.html;
        add_header   Cache-Control public;
        expires      1d;
    }
}

The whole point of this tip is that it's a good idea to not point Nginx to the ./build directory (but to a copy of it instead) because otherwise, during the seconds that yarn run build runs (1-5 seconds) a bunch of files will be missing and Nginx will send 404 errors to the clients unlucky enought to connect during the deployment.

Optimization of QuerySet.get() with or without select_related

November 3, 2016
1 comment Python, Django, PostgreSQL

If you know you're going to look up a related Django ORM object from another one, Django automatically takes care of that for you.

To illustrate, imaging a mapping that looks like this:


class Artist(models.Models):
    name = models.CharField(max_length=200)
    ...


class Song(models.Models):
    artist = models.ForeignKey(Artist)
    ...

And with that in mind, suppose you do this:


>>> Song.objects.get(id=1234567).artist.name
'Frank Zappa'

Internally, what Django does is that it looks the Song object first, then it does a look up automatically on the Artist. In PostgreSQL it looks something like this:

SELECT "main_song"."id", "main_song"."artist_id", ... FROM "main_song" WHERE "main_song"."id" = 1234567
SELECT "main_artist"."id", "main_artist"."name", ... FROM "main_artist" WHERE "main_artist"."id" = 111

Pretty clear. Right.

Now if you know you're going to need to look up that related field you can ask Django to make a join before the lookup even happens. It looks like this:


>>> Song.objects.select_related('artist').get(id=1234567).artist.name
'Frank Zappa'

And the SQL needed looks like this:

SELECT "main_song"."id", ... , "main_artist"."name", ... 
FROM "main_song" INNER JOIN "main_artist" ON ("main_song"."artist_id" = "main_artist"."id") WHERE "main_song"."id" = 1234567

The question is; which is fastest?

Well, there's only one way to find out and that is to measure with some relatistic data.

Here's the benchmarking code:


def f1(id):
    try:
        return Song.objects.get(id=id).artist.name
    except Song.DoesNotExist:
        pass

def f2(id):
    try:
        return Song.objects.select_related('artist').get(id=id).artist.name
    except Song.DoesNotExist:
        pass

def _stats(r):
    #returns the median, average and standard deviation of a sequence
    tot = sum(r)
    avg = tot/len(r)
    sdsq = sum([(i-avg)**2 for i in r])
    s = list(r)
    s.sort()
    return s[len(s)//2], avg, (sdsq/(len(r)-1 or 1))**.5

times = defaultdict(list)
functions = [f1, f2]
for id in range(100000, 103000):
    for f in functions:
        t0 = time.time()
        r = f(id)
        t1 = time.time()
        if r:
            times[f.__name__].append(t1-t0)
    # Shuffle the order so that one doesn't benefit more
    # from deep internal optimizations/caching in Postgre.
    random.shuffle(functions)

for k, values in times.items():
    print(k, [round(x * 1000, 2) for x in _stats(values)])

For the record, here are the parameters of this little benchmark:

The server is a 4Gb DigitialOcean server that is being used but not super busy
The PostgreSQL server is running on the same virtual machine as the Python process used in the benchmark
Django 1.10, Python 3.5, PostgreSQL 9.5.4
1,588,258 "Song" objects, 102,496 "Artist" objects
Note that the range of IDs has gaps but they're not counted

The Result

Function	Median	Average	Std Dev
f1	3.19ms	9.17ms	19.61ms
f2	2.28ms	6.28ms	15.30ms

The Conclusion

If you use the median, using select_related is 30% faster and if you use the average, using select_related is 46% faster.

So, if you know you're going to need to do that lookup put in .select_related(relation) before every .get(id=...) in your Django code.

Deep down in PostgreSQL, the inner join is ultimately two ID-by-index lookups. And that's what the first method is too. It's likely that the reason the inner join approach is faster is simply because there's less connection overheads.

Lastly, YOUR MILEAGE WILL VARY. Every benchmark is flawed but this quite realistic because it's not trying to be optimized in either way.