How to JSON schema validate 10x (or 100x) faster in Python

November 4, 2018
9 comments Python

This is perhaps insanely obvious but it was a measurement I had to do and it might help you too if you use python-jsonschema a lot.

I have this project which has a migration script that needs to transfer about 1M records from one PostgreSQL database, transform them a bit, validate them, and store them in another PostgreSQL database. The validation step was done like this:


from jsonschema import validate

...

with open(os.path.join(settings.BASE_DIR, "schema.yaml")) as f:
    SCHEMA = yaml.load(f)["schema"]

...


class Build(models.Model):

    ...

    @classmethod
    def validate_build(cls, build):
        validate(build, SCHEMA)

That works fine when you have a slow trickle of these coming in, seconds or minutes apart. But when you have to do about 1M of them, the speed overhead starts to really matter. Granted, in this context it's just a migration which is hopefully only done once, but it helps if it doesn't take too long since that makes it easier to avoid downtime.

What about python-fastjsonschema?

The name python-fastjsonschema just sounds very appealing but I'm not sure how mature it is or what the subtle differences are between it and the more established python-jsonschema, which I was already using.

It has two ways of using it. Either...


fastjsonschema.validate(schema, data)

...or...


validator = fastjsonschema.compile(schema)
validator(data)

That got me thinking, why don't I just do that with regular python-jsonschema!
All you need to do is crack open the validate function and you can re-use one instance for multiple pieces of data:


from jsonschema.validators import validator_for


klass = validator_for(schema)
klass.check_schema(schema)  # optional
instance = klass(schema)
instance.validate(data)

I rewrote my project's code to this:


from jsonschema.validators import validator_for

...

with open(os.path.join(settings.BASE_DIR, "schema.yaml")) as f:
    SCHEMA = yaml.load(f)["schema"]
_validator_class = validator_for(SCHEMA)
_validator_class.check_schema(SCHEMA)
validator = _validator_class(SCHEMA)

...


class Build(models.Model):

    ...

    @classmethod
    def validate_build(cls, build):
        validator.validate(build)

How do they compare, performance-wise?

Let this simple benchmark code speak for itself:


from buildhub.main.models import Build, SCHEMA

import fastjsonschema
from jsonschema import validate, ValidationError
from jsonschema.validators import validator_for


def f1(qs):
    for build in qs:
        validate(build.build, SCHEMA)


def f2(qs):
    validator = validator_for(SCHEMA)
    for build in qs:
        validate(build.build, SCHEMA, cls=validator)


def f3(qs):
    cls = validator_for(SCHEMA)
    cls.check_schema(SCHEMA)
    instance = cls(SCHEMA)
    for build in qs:
        instance.validate(build.build)


def f4(qs):
    for build in qs:
        fastjsonschema.validate(SCHEMA, build.build)


def f5(qs):
    validator = fastjsonschema.compile(SCHEMA)
    for build in qs:
        validator(build.build)


# Reporting
import time
import statistics
import random

functions = f1, f2, f3, f4, f5
times = {f.__name__: [] for f in functions}


for _ in range(3):
    qs = list(Build.objects.all().order_by("?")[:1000])
    for func in functions:
        t0 = time.time()
        func(qs)
        t1 = time.time()
        times[func.__name__].append((t1 - t0) * 1000)


def f(ms):
    return f"{ms:.1f}ms"


for name, numbers in times.items():
    print("FUNCTION:", name, "Used", len(numbers), "times")
    print("\tBEST  ", f(min(numbers)))
    print("\tMEDIAN", f(statistics.median(numbers)))
    print("\tMEAN  ", f(statistics.mean(numbers)))
    print("\tSTDEV ", f(statistics.stdev(numbers)))

Basically, 3 times for each of the alternative implementations, do a validation of 1,000 JSON blobs (technically Python dicts) that are each around 1KB in size.

The results:

FUNCTION: f1 Used 3 times
    BEST   1247.9ms
    MEDIAN 1309.0ms
    MEAN   1330.0ms
    STDEV  94.5ms
FUNCTION: f2 Used 3 times
    BEST   1266.3ms
    MEDIAN 1267.5ms
    MEAN   1301.1ms
    STDEV  59.2ms
FUNCTION: f3 Used 3 times
    BEST   125.5ms
    MEDIAN 131.1ms
    MEAN   133.9ms
    STDEV  10.1ms
FUNCTION: f4 Used 3 times
    BEST   2032.3ms
    MEDIAN 2033.4ms
    MEAN   2143.9ms
    STDEV  192.3ms
FUNCTION: f5 Used 3 times
    BEST   16.7ms
    MEDIAN 17.1ms
    MEAN   21.0ms
    STDEV  7.1ms

Basically, if you use python-jsonschema and create a reusable instance, it's 10 times faster than the "default way". And if you do the same but with python-fastjsonschema, it's 100 times faster.

By the way, in version f5 it validated 1,000 1KB records in 16.7ms. That's insanely fast!

React.memo instead of React.PureComponent

November 2, 2018
0 comments JavaScript, React

React Hooks aren't here yet but when they come I'll be all over them, replacing many of my classes with functions.

However, as of React 16.6 there's this awesome new React.memo() thing which is such a neat solution. Why didn't I think of that, myself, sooner?!

Anyway, one of the subtle benefits of it is that functions minify a lot better than classes when Babel'ifying your ES6 code.

To test that, I took one of my project's classes, which needed to be "PureComponent":


class ShowAutocompleteSuggestionSong extends React.PureComponent {
  render() {
    const { song } = this.props;
    return (
      <div className="media autocomplete-suggestion-song">
        <div className="media-left">
          <img
            className={
              song.image && song.image.preview
                ? 'img-rounded lazyload preview'
                : 'img-rounded lazyload'
            }
            src={
              song.image && song.image.preview
                ? song.image.preview
                : placeholderImage
            }
            data-src={
              song.image ? absolutifyUrl(song.image.url) : placeholderImage
            }
            alt={song.name}
          />
        </div>
        <div className="media-body">
          <h5 className="artist-name">
            <b>{song.name}</b>
            {' by '}
            <span>{song.artist.name}</span>
          </h5>
          {song.fragments.map((fragment, i) => {
            return <p key={i} dangerouslySetInnerHTML={{ __html: fragment }} />;
          })}
        </div>
      </div>
    );
  }
}

Minified it weighs 1,893 bytes and looks like this:

Minified PureComponent class

When re-written with React.memo it looks like this:


const ShowAutocompleteSuggestionSong = React.memo(({ song }) => {
  return (
    <div className="media autocomplete-suggestion-song">
      <div className="media-left">
        <img
          className={
            song.image && song.image.preview
              ? 'img-rounded lazyload preview'
              : 'img-rounded lazyload'
          }
          src={
            song.image && song.image.preview
              ? song.image.preview
              : placeholderImage
          }
          data-src={
            song.image ? absolutifyUrl(song.image.url) : placeholderImage
          }
          alt={song.name}
        />
      </div>
      <div className="media-body">
        <h5 className="artist-name">
          <b>{song.name}</b>
          {' by '}
          <span>{song.artist.name}</span>
        </h5>
        {song.fragments.map((fragment, i) => {
          return <p key={i} dangerouslySetInnerHTML={{ __html: fragment }} />;
        })}
      </div>
    </div>
  );
});

Minified it weighs 783 bytes and looks like this:

Minified React.memo function

Highly scientific measurement. Yeah, I know. (Joking)
Perhaps it's stating the obvious, but part of the ES5 code that Babel generates from classes can be reused across classes.

Anyway, it's neat and worth considering to squeeze some bytes out. And the bonus is that it gets you prepared for Hooks in React 16.7.

React 16.6 with Suspense and lazy loading components with react-router-dom

October 26, 2018
7 comments Web development, JavaScript, React

If you're reading this, you might have thought one of two thoughts about this blog post title (or both); "Cool buzzwords!" or "Yuck! So much hyped buzzwords!"

Either way, React v16.6 came out a couple of days ago and it brings with it React.lazy: Code-Splitting with Suspense.

React.lazy is React's built-in way of lazy loading components. With Suspense you can make that lazy loading be smart and know to render a fallback component (or JSX element) whilst waiting for that slowly loading chunk for the lazy component.

The sample code in the announcement was deliciously simple but I was curious; how does that work with react-router-dom??

Without further ado, here's a complete demo/example. The gist is an app that has two sub-components loaded with react-router-dom:


<Router>
  <div className="App">
    <Switch>
      <Route path="/" exact component={Home} />
      <Route path="/:id" component={Post} />
    </Switch>
  </div>
</Router>

The idea is that the Home component will list all the blog posts and the Post component will display the full details of that blog post. In my demo, the Post component never bothers to actually do the fetching of the full details to display. It just displays the passed in ID from the react-router-dom match prop. You get the idea.

That's standard React with react-router-dom stuff. Next up, lazy loading. Basically, instead of importing the Post component, you make it lazy:


-import Post from "./post";
+const Post = React.lazy(() => import("./post"));

And here comes the magic sauce. Instead of referencing component={Post} in the <Route/> you use this badboy:


function WaitingComponent(Component) {
  return props => (
    <Suspense fallback={<div>Loading...</div>}>
      <Component {...props} />
    </Suspense>
  );
}

Complete prototype

The final thing looks like this:


import React, { lazy, Suspense } from "react";
import ReactDOM from "react-dom";
import { MemoryRouter as Router, Route, Switch } from "react-router-dom";

import Home from "./home";
const Post = lazy(() => import("./post"));

function App() {
  return (
    <Router>
      <div className="App">
        <Switch>
          <Route path="/" exact component={Home} />
          <Route path="/:id" component={WaitingComponent(Post)} />
        </Switch>
      </div>
    </Router>
  );
}

function WaitingComponent(Component) {
  return props => (
    <Suspense fallback={<div>Loading...</div>}>
      <Component {...props} />
    </Suspense>
  );
}

const rootElement = document.getElementById("root");
ReactDOM.render(<App />, rootElement);


And it totally works! It's hard to show this with the demo but if you don't believe me, you can download the whole codesandbox as a .zip, run yarn && yarn run build && serve -s build and then you can see it doing its magic as if this was the complete foundation of a fully working client-side app.

1. Loading the "Home" page, then click one of the links

Loading the "Home" page

2. Lazy loading the Post component

Lazy loading the Post component

3. Post component lazily loaded once and for all

Post component lazily loaded once and for all

Bonus

One thing that can happen is that you might load the app when the WiFi is hunky dory, but when you eventually make a click that causes a lazy load to actually go out on the Internet and download that .js file, it might fail. For example, because the file has been removed from the server or your network just fails for some reason. To deal with that, simply wrap the whole <Suspense> component in an error boundary component.

See this demo which is a fork of the main demo but with error boundaries added.

In conclusion

No surprise that it works. React is pretty awesome. I just wasn't sure how it would look with react-router-dom.

A word of warning, from the v16.6 announcement: "This feature is not yet available for server-side rendering. Suspense support will be added in a later release."

I think lazy loading isn't actually that big of a deal. It's nice that it works, but how likely is it really that you have a sub-tree of components so big and slow that you can't just pay for it up front as part of one big fat build? If you really care about great web performance for people who reach your app rarely and sporadically, the true ticket to success is server-side rendering: ship a gzipped HTML document with all the React client-side code rendering in a non-blocking way, so the user can download the HTML and start reading/consuming it immediately, and while they're doing that you download the rest of the .js that's going to be needed once they click around. Start there.

How much HTML is too much for optimal web performance

October 17, 2018
4 comments Web development, Web Performance

Right off the bat; I don't know. All I know is that it's complicated.

I have this page which is just a blog post page. It's entirely rendered on the server, comments and all. At the time of writing, the total size of the HTML document is 119KB (30KB gzipped). If you remove all the comments, which make up the bulk of the HTML, it reduces down to 31KB (7KB gzipped). Fair enough. That's 23KB less to download over the wire. But, does it matter (much)?
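By the way, if you want to do this raw-vs-gzipped measurement on your own pages, a minimal Python sketch (the URL is a stand-in, and gzip.compress defaults to maximum compression so a real server's numbers may differ slightly):


import gzip

import requests

html = requests.get("https://example.com/some-blog-post").content
print(f"raw:     {len(html) / 1024:.0f}KB")
print(f"gzipped: {len(gzip.compress(html)) / 1024:.0f}KB")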

Downloading

First of all, I noticed this:

WebPagetest with iPhone 6, 4G on the same US coast as the datacenter

That's a WebPagetest run using an iPhone 6 on 4G and, lemme emphasize this, it took 126ms to download the HTML document. If you subtract "DNS Lookup" (283ms), "Initial Connection" (1013ms), and "SSL Negotiation" (733ms), a combined 2029ms of connection overhead, it took 684ms to serve the file, download it, and parse it. Remember, this is all on 4G. Pretty fast. In conclusion, it's probably not too much HTML in that page to download. This downloadingness is a fraction of the total "web performance cost". Let's dig deeper.

Note! With WebPagetest all those numbers like DNS Lookup, Initial Connection and SSL Negotiation are wildly unpredictable between tests. Chances are, the numbers are very different the next time you run a test using the exact same input. Who knows. Deep internet plumbings beyond the control of WebPagetest.

Note! I ran it one more time with the exact same parameters and this time it was 535ms (instead of 684ms) to serve, download, and parse.

Parsing & layout

Parsing is hard to measure but here's what I found when using the Google Chrome dev tools:

Google Chrome Performance devtools
Google Chrome Performance devtools

It says it took...

  • parsed HTML - 94ms
  • recalculate style - 43ms
  • layout - 386ms

That's half a second just loading and rendering. Definitely sucks. But note, this test uses 4x CPU slowdown and 3G simulation. So perhaps it's not so bad.

Let's try again with a smaller HTML document

So I butchered up a hybrid version that has almost the same HTML except it drops all but 1 of those 166-ish div.comment DOM nodes. It's now 31KB (7KB gzipped) to download instead of 119KB (30KB gzipped).

Same WebPagetest parameters but now with this smaller HTML document:

WebPagetest with a much smaller HTML footprint

Now it says it only took 39ms to download and 232ms (it was 684ms before) to serve the file, download it, and parse it. Interesting!

Note! I ran it one more time with exact same parameters and this time it was 237ms (instead of 232ms) to serve, download, and parse.

Clearly it's working. The smaller the HTML document the faster it performs. No surprise. But stick around for the conclusion.

Parsing & layout with a smaller HTML document

Check this out:

Google Chrome Performance devtools (smaller HTML document)

It says it took...

  • parsed HTML - 91ms
  • recalculate style - 6ms
  • layout - 29ms

Mind you, all of these numbers are at the mercy of what my laptop is up to at the moment as it can affect Chrome's rendering if it has, at that moment, less (or more) access to CPU and memory caching.

Either way, it parses and lays out in 126ms instead of 523ms for the larger HTML document.

Side-by-side

The best test to see how much faster the smaller HTML document variant is, is to compare them side-by-side. It looks like this:

Side-by-side
Visual comparison on WebPagetest (using 4G)

Three major takeaways from this:

  1. The smaller HTML version starts rendering half a second before the original one.
  2. The complete time favors the smaller HTML version by 2.5 seconds, but that's possibly influenced more by the ads that load than by any slow layout rendering.
  3. This is using 4G which isn't unheard of but definitely much less common than better speeds.

Here they are compared on "Desktop" which appears to give the smaller HTML version a 0.2 second advantage:

Visual comparison on WebPagetest (using "Desktop")

And here are the Lighthouse reports side-by-side:

Side-by-side using Lighthouse

Discussion

The above concludes rather unsurprisingly that a smaller HTML footprint downloads, parses and lays out quicker.

The killer reason that page is so large, with all those comments rendered in the original HTML, is simple: SEO. Google loves comments because comments indicate that the page is thriving and a place where people go, spend time, and stick around. I've experimented with this in the past and found that if I make the HTML document smaller (or load the rest after document load) the SEO takes a big hit. Yes, Google's bot renders with JavaScript, but not always, and even if it does, I assume it's smart enough to appreciate that content loaded late (async or post-DOMContentLoaded) is less important and thus not what the page is about.

Regarding SEO, we know that Google loves fast sites. Especially for mobile. But content is still king, my gut tells me. Left as an exercise to the reader to take a stand on this.

Another problem with lazy loading the comments (or whatever else might be applicable to your site) is that it might cause "flicker". I put that word in quotes because sometimes flicker is literally visual flicker and sometimes it's moments of browser sluggishness. The XHR request and the subsequent post-rendering cause a bunch of work that strains the browser and might make it unpleasant when your eyes and brain are in the midst of committing to consuming the page.

Basically, there are significant real benefits to not trying to squeeze every little millisecond out by making the HTML smaller upfront. And remember that the "smaller HTML" version in this test is drastic. I butchered it from 119KB to 31KB, which might be so drastic that it's not necessarily applicable at all. In other words, had I reduced the HTML size by just 20% it might not even register on the performance graph, but it could be significant in terms of SEO keywords.

Conclusion

The majority of the time spent making a web page useful to a user is a sum of all sorts of metrics. The size of the HTML document does matter but remember that it's just one of multiple aspects to watch out for.

In conclusion, it's complicated and depends on your needs and context. I hope you can benefit a little bit from the metrics above.

Switching from AWS S3 (boto3) to Google Cloud Storage (google-cloud-storage) in Python

October 12, 2018
1 comment Python

I'm in the midst of rewriting a big app that currently uses AWS S3 and will soon be switched over to Google Cloud Storage. This blog post is a rough attempt to log various activities in both Python libraries:

Disclaimer: I'm manually copying these snippets from a real project and I have to manually scrub the code clean of unimportant quirks, hacks, and other unrelated things that would just add noise.

Install

boto3


$ pip install boto3
$ emacs ~/.aws/credentials

google-cloud-storage


$ pip install google-cloud-storage
$ cat ./google_service_account.json

Note: You need to create a service account and then that gives you a .json file which you download and make sure you pass its path when you create a client.

I suspect there are more/other ways to do this with environment variables alone but I haven't got there yet.
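One that I believe works is the GOOGLE_APPLICATION_CREDENTIALS environment variable, which the client is supposed to pick up automatically. A minimal sketch:


# Sketch: point GOOGLE_APPLICATION_CREDENTIALS at the service account
# .json file and the client finds it on its own.
import os

from google.cloud import storage

os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = "./google_service_account.json"
client = storage.Client()  # no explicit path needed now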

Making a "client"

boto3

Note, there are easier shortcuts for this but with this pattern you can have full control over things like read_timeout, connect_timeout, etc. with that config_params keyword.


import boto3
from botocore.config import Config


def get_s3_client(region_name=None, **config_params):
    options = {"config": Config(**config_params)}
    if region_name:
        options["region_name"] = region_name
    session = boto3.session.Session()
    return session.client("s3", **options)

google-cloud-storage


from google.cloud import storage


def get_gcs_client():
    return storage.Client.from_service_account_json(
        settings.GOOGLE_APPLICATION_CREDENTIALS_PATH
    )

Checking if a bucket exists and if you have access to it

boto3 (for s3_client here, see above)


from botocore.exceptions import ClientError, EndpointConnectionError


try:
    s3_client.head_bucket(Bucket=bucket_name)
except ClientError as exception:
    if exception.response["Error"]["Code"] in ("403", "404"):
        raise BucketHardError(
            f"Unable to connect to bucket={bucket_name!r} "
            f"ClientError ({exception.response!r})"
        )
    else:
        raise
except EndpointConnectionError:
    raise BucketSoftError(
        f"Unable to connect to bucket={bucket.name!r} "
        f"EndpointConnectionError"
    )
else:
    print("It exists and we have access to it.")

google-cloud-storage


from google.api_core.exceptions import BadRequest


try:
    gcs_client.get_bucket(bucket_name)
except BadRequest as exception:
    raise BucketHardError(
        f"Unable to connect to bucket={bucket_name!r}, "
        f"because bucket not found due to {exception}"
    )
else:
    print("It exists and we have access to it.")

Checking if an object exists

boto3


from botocore.exceptions import ClientError


def key_existing(client, bucket_name, key):
    """return a tuple of (
        key's size if it exists or 0,
        S3 key metadata
    )
    If the object doesn't exist, return None for the metadata.
    """
    try:
        response = client.head_object(Bucket=bucket_name, Key=key)
        return response["ContentLength"], response.get("Metadata")
    except ClientError as exception:
        if exception.response["Error"]["Code"] == "404":
            return 0, None
        raise

Note, if you do this a lot and often find that the object doesn't exist, then using list_objects_v2 is probably faster.
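For example, something like this sketch; note that list_objects_v2 doesn't raise when nothing matches, it just returns a response without any Contents:


def key_existing_cheaper(client, bucket_name, key):
    """Sketch of the same check using list_objects_v2.
    Note: this variant doesn't return the object's Metadata."""
    response = client.list_objects_v2(Bucket=bucket_name, Prefix=key, MaxKeys=1)
    for obj in response.get("Contents", []):
        if obj["Key"] == key:
            return obj["Size"], None
    return 0, None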

google-cloud-storage


def key_existing(client, bucket_name, key):
    """return a tuple of (
        key's size if it exists or 0,
        S3 key metadata
    )
    If the object doesn't exist, return None for the metadata.
    """
    bucket = client.get_bucket(bucket_name)
    blob = bucket.get_blob(key)
    if blob:
        return blob.size, blob.metadata
    return 0, None

Uploading a file with a special Content-Encoding

Note: You have to use your imagination with regards to the source. In this example, I'm assuming that the source is a file on disk and that it might have already been compressed with gzip.

boto3


def upload(file_path, bucket_name, key_name, metadata=None, compressed=False):
    content_type = get_key_content_type(key_name)

    metadata = metadata or {}

    # boto3 will raise a botocore.exceptions.ParamValidationError
    # error if you try to do something like:
    #
    #  s3.put_object(Bucket=..., Key=..., Body=..., ContentEncoding=None)
    #
    # ...because apparently 'NoneType' is not a valid type.
    # We /could/ set it to something like '' but that feels like an
    # actual value/opinion. Better just avoid if it's not something
    # really real.
    extras = {}
    if content_type:
        extras["ContentType"] = content_type
    if compressed:
        extras["ContentEncoding"] = "gzip"
    if metadata:
        extras["Metadata"] = metadata

     with open(file_path, "rb") as f:
         s3_client.put_object(Bucket=bucket_name, Key=key_name, Body=f, **extras)

google-cloud-storage


def upload(file_path, bucket_name, key_name, metadata=None, compressed=False):
    content_type = get_key_content_type(key_name)

    metadata = metadata or {}
    bucket = gcs_client.get_bucket(bucket_name)
    blob = bucket.blob(key_name)

    if content_type:
        blob.content_type = content_type
    if compressed:
        blob.content_encoding = "gzip"
    blob.metadata = metadata
    with open(file_path, "rb") as f:
        blob.upload_from_file(f)

Downloading and uncompressing a gzipped object

boto3


from io import BytesIO
from gzip import GzipFile
from botocore.exceptions import ClientError

from .utils import iter_lines


def get_stream(bucket_name, key_name):
    try:
        response = s3_client.get_object(
            Bucket=bucket_name, Key=key_name
        )
    except ClientError as exception:
        if exception.response["Error"]["Code"] == "NoSuchKey":
            raise KeyHardError("key not in bucket")
        raise

    stream = response["Body"]
    # But if the content encoding is gzip we have to re-wrap the stream.
    if response.get("ContentEncoding") == "gzip":
        body = response["Body"].read()
        bytestream = BytesIO(body)
        stream = GzipFile(None, "rb", fileobj=bytestream)

    for line in iter_lines(stream):
        yield line.decode("utf-8")

google-cloud-storage


from io import BytesIO

from .utils import iter_lines


def get_stream(bucket_name, key_name):
    bucket = gcs_client.get_bucket(bucket_name)
    blob = bucket.get_blob(key_name)
    if blob is None:
        raise KeyHardError("key not in bucket")

    bytestream = BytesIO()
    blob.download_to_file(bytestream)
    bytestream.seek(0)

    for line in iter_lines(bytestream):
        yield line.decode("utf-8")

Note that blob.download_to_file here works a bit like requests.get() in that it automatically notices the Content-Encoding metadata and does the gunzip on the fly.

Conclusion

It's not fair to compare them on style because I think boto3 came out of boto, which probably started back in the day when Google was just web search and email.

I wanted to include a section about how to unit test against these, especially how to mock them. But what I had for a draft was getting ugly. Yes, it works for the testing needs I have in my app but it's very personal taste (aka. appropriate for the context) and admittedly quite messy.

Fancy linkifying of text with Bleach and domain checks (with Python)

October 10, 2018
2 comments Python, Web development

Bleach is awesome. Thank you for it @willkg! It's a Python library for sanitizing text as well as "linkifying" text for HTML use. For example, consider this:

>>> import bleach
>>> bleach.linkify("Here is some text with a url.com.")
'Here is some text with a <a href="http://url.com" rel="nofollow">url.com</a>.'

Note that sanitizing is a separate thing, but if you're curious, consider this example:

>>> bleach.linkify(bleach.clean("Here is <script> some text with a url.com."))
'Here is &lt;script&gt; some text with a <a href="http://url.com" rel="nofollow">url.com</a>.'

With that output you can confidently interpolate that string straight into your HTML.

Getting fancy

That's a great start but I wanted more. For one, I don't always want the rel="nofollow" attribute on all links, in particular for links that are within the site. Secondly, a lot of things look like a domain but aren't. For example, "This is a text.at the start", which would naively become...:

>>> bleach.linkify("This is a text.at the start")
'This is a <a href="http://text.at" rel="nofollow">text.at</a> the start'

...because text.at looks like a domain.

So here is how I use it here on www.peterbe.com to linkify blog comments:


from urllib.parse import urlparse

import requests
from requests.exceptions import ConnectionError


def custom_nofollow_maker(attrs, new=False):
    href_key = (None, u"href")

    if href_key not in attrs:
        return attrs

    if attrs[href_key].startswith(u"mailto:"):
        return attrs

    p = urlparse(attrs[href_key])
    if p.netloc not in settings.NOFOLLOW_EXCEPTIONS:
        # Before we add the `rel="nofollow"` let's first check that this is a
        # valid domain at all.
        root_url = p.scheme + "://" + p.netloc
        try:
            response = requests.head(root_url)
            if response.status_code == 301:
                redirect_p = urlparse(response.headers["location"])
                # If the only difference is that it redirects to https instead
                # of http, then amend the href.
                if (
                    redirect_p.scheme == "https"
                    and p.scheme == "http"
                    and p.netloc == redirect_p.netloc
                ):
                    attrs[href_key] = attrs[href_key].replace("http://", "https://")

        except ConnectionError:
            return None

        rel_key = (None, u"rel")
        rel_values = [val for val in attrs.get(rel_key, "").split(" ") if val]
        if "nofollow" not in [rel_val.lower() for rel_val in rel_values]:
            rel_values.append("nofollow")
        attrs[rel_key] = " ".join(rel_values)

    return attrs

html = bleach.linkify(text, callbacks=[custom_nofollow_maker])

This basically takes the default nofollow callback and extends it a bit.

By the way, here is the complete code I use for sanitizing and linkifying blog comments here on this site: render_comment_text.

Caveats

This is slow because it requires network IO every time a piece of text needs to be linkified (if it has domain-looking things in it) but that's best alleviated by only doing it once and either caching or persistently storing the cleaned and rendered output.

Also, the check uses try: requests.head() except requests.exceptions.ConnectionError: as the method to see if the domain works. I considered doing a whois lookup or something but that felt a little wrong because just because a domain exists doesn't mean there's a website there. Either way, it could be that the domain/URL is perfectly fine but in that very unlucky instant you checked, your own server's internet or some other DNS lookup thing was busted. Perhaps it would be better to wrap it in a retry and do try: requests.head() except requests.exceptions.RetryError: instead.
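Something like this sketch, using the urllib3 retry machinery that requests exposes (the URL is a stand-in for root_url from the code above):


import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

# Retry transient failures a couple of times before giving up.
session = requests.Session()
retries = Retry(total=2, backoff_factor=0.2, status_forcelist=(500, 502, 504))
session.mount("http://", HTTPAdapter(max_retries=retries))
session.mount("https://", HTTPAdapter(max_retries=retries))

try:
    response = session.head("http://example.com")
except requests.exceptions.RetryError:
    pass  # retries on bad status codes exhausted; treat as a dead domain
except requests.exceptions.ConnectionError:
    pass  # still possible, e.g. if the domain doesn't resolve at all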

Lastly, the business logic I chose was to rewrite all http:// to https:// only if the URL http://domain does a 301 redirect to https://domain. So if the original link was http://bit.ly/redirect-slug it leaves it as is. Perhaps a fancier version would look at the domain name ending. For example, HEAD http://google.com 301 redirects to https://www.google.com, so you could use the fact that "www.google.com".endswith("google.com").

UPDATE Oct 10 2018

Moments after publishing this, I discovered a bug where it would fail badly if the text contained a URL with an ampersand in it. Turns out, it was a known bug in Bleach. It only happens when you try to pass a filter to the bleach.Cleaner() class.

So I simplified my code and now things work. Apparently, using bleach.Cleaner(filters=[...]) is faster so I'm losing that. But, for now, that's OK in my context.

Also, in another later fix, I improved the function some more by avoiding non-HTTP links (with the exception of mailto: and tel:). Otherwise it would attempt to run requests.head('ssh://server.example.com') which doesn't make sense.

The ideal number of workers in Jest

October 8, 2018
0 comments Python, React

tl;dr; Use --runInBand when running jest in CI and use --maxWorkers=3 on your laptop.

Running out of memory on CircleCI

We have a test suite that covers 236 tests across 68 suites and runs mainly a bunch of enzyme renderings of React components but also some plain old JavaScript function tests. We hit a problem where tests utterly failed in CircleCI due to running out of memory. Several individual tests, before the suite gave up or failed, were reported to take up to 45 seconds.
Turns out, jest tried to use 36 workers because the Docker OS it was running on was reporting 36 CPUs.


circleci@9e4c489cf76b:~/repo$ node
> var os = require('os')
undefined
> os.cpus().length
36

After forcibly setting --maxWorkers=2 to the jest command, the tests passed and it took 20 seconds. Yay!

But that got me thinking, what is the ideal number of workers when I'm running the suite here on my laptop? To find out, I wrote a Python script that would wrap the call CI=true yarn run test --maxWorkers=%(WORKERS) repeatedly and report which number is ideal for my laptop.
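A simplified sketch of what that wrapper can look like (not the exact script I ran):


# Run the jest suite a few times per worker count and keep the best time.
import subprocess
import time

results = {}
for workers in range(1, 9):
    timings = []
    for _ in range(3):
        t0 = time.time()
        subprocess.run(
            f"CI=true yarn run test --maxWorkers={workers}",
            shell=True,
            check=True,
            capture_output=True,
        )
        timings.append(time.time() - t0)
    results[workers] = min(timings)

print("SORTED BY BEST TIME:")
for workers, seconds in sorted(results.items(), key=lambda kv: kv[1]):
    print(f"{workers} {seconds:.2f}s")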

After leaving it running for a while it spits out this result:

SORTED BY BEST TIME:
3 8.47s
4 8.59s
6 9.12s
5 9.18s
2 9.51s
7 10.14s
8 10.59s
1 13.80s

The conclusion is vague. There is some benefit to using a small number greater than 1. If you attempt a bigger number it might backfire and take longer than necessary, and if you do that your laptop is likely to crawl and cough.


Inline scripts in create-react-app 2.0 and CSP hashes

October 5, 2018
0 comments Web development, JavaScript, React

UPDATE (1)

My understanding of how to generate the CSP nonces was wrong. What I initially posted was a confusion between nonces and hashes. Sorry. The blog post has been updated to use hashing.

UPDATE (2)

Shortly after publishing this I changed my mind entirely. I decided I don't want any inline scripts, no matter how small. Reasons are: 1) with HTTP2 it's cheap to send another file and thus that critical, precious, first HTML document becomes smaller, and 2) when you load it as an external file you have the power to load it async if applicable.

Check out this new script, it's hackish but works: uninline_scripts.js

UPDATE (Oct 18, 2018)

If you use INLINE_RUNTIME_CHUNK=false yarn run build no scripts, independent of size, are inlined. See this pull request for details.

END UPDATES

I have an app that is hosted on github-pages and because I can't control Content Security Policy HTTP headers I have to do it with a <meta http-equiv="Content-Security-Policy" content="${csp}"> tag in the HTML. That's working fine and the way I do it is that I have a script that looks like this:


#!/usr/bin/env node
const fs = require("fs");
const crypto = require("crypto");

const CSP_TEMPLATE = `
default-src 'none';
connect-src 'self' kinto.workon.app peterbecom.auth0.com;
frame-src peterbecom.auth0.com;
img-src 'self' avatars2.githubusercontent.com https://*.googleusercontent.com;
script-src 'self'%SCRIPT_HASHES%;
style-src 'self' 'unsafe-inline';
font-src 'self' data:;
manifest-src 'self'
`.trim();

const htmlFile = process.argv[2];
if (!htmlFile) throw new Error("missing file argument");
let html = fs.readFileSync(htmlFile, "utf8");

let hashes = "";
let csp = CSP_TEMPLATE;
const matches = html.match(/<script>.*?<\/script>/g);
if (matches) {
  matches.forEach(scriptTag => {
    const hash = crypto.createHash("sha256");
    hash.update(scriptTag.replace(/<script>/, "").replace("</script>", ""));
    // CSP wants the base64 form of the digest, not the hex string.
    const digest = hash.digest("base64");
    hashes += ` 'sha256-${digest}'`;
  });
}
csp = csp.replace(/%SCRIPT_HASHES%/, hashes);

const metatag = `
  <meta http-equiv="Content-Security-Policy" content="${csp}">
`
  .replace(/\n/g, "")
  .trim();
if (html.search(metatag) > -1)
  throw new Error("already has CSP metatag in HTML");
const anchor = '<meta charset="utf-8">';
const newHtml = html.replace(anchor, `${anchor}${metatag}`);
fs.writeFileSync(htmlFile, newHtml, "utf8");

Laugh all you like at my hurried node scripting but it works. It finds any <script>ANYTHING</script> tags (which means it disregards any <script src="... tags), calculates a sha256 hash of the contents, and puts that into the CSP block.

The output becomes something like this:


<!DOCTYPE html>
<html lang="en">
  <head>
    <meta charset="utf-8">
    <meta 
      http-equiv="Content-Security-Policy" 
      content="default-src 'none';script-src 'self' 'sha256-bb84aa7f904e73495b9e99f08531053f3a86f3c1b2e232e3abbac252bf723f1f';">
  </head>
  <body>
    ...
    <script>....</script>
  </body>
</html>

I don't know if I've done it right but at least what didn't use to work now works; the page loads in my browsers now.
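One way to double-check a hash is to compute it by hand and compare. Here's a minimal sketch in Python; the key detail is that CSP wants the base64 encoding of the raw digest:


# Compute the CSP source value for an inline script body.
import base64
import hashlib

script_body = "console.log('hi');"  # whatever sits between <script> and </script>
digest = hashlib.sha256(script_body.encode("utf-8")).digest()
print(f"'sha256-{base64.b64encode(digest).decode()}'")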

An awesome snippet to web performance test a page programmatically

October 1, 2018
0 comments Web development, JavaScript, Web Performance

I found this in an issue discussing measuring page performance with puppeteer and it's pure gold. Especially because it's so accessible and easy to use.

Here's the code:


const puppeteer = require('puppeteer');

async function run() {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();

  await page.goto('https://www.peterbe.com/');

  console.log('\n==== performance.getEntries() ====\n');
  console.log(
    await page.evaluate(() =>
      JSON.stringify(performance.getEntries(), null, '  ')
    )
  );

  console.log('\n==== performance.toJSON() ====\n');
  console.log(
    await page.evaluate(() => JSON.stringify(performance.toJSON(), null, '  '))
  );

  console.log('\n==== page.metrics() ====\n');
  const perf = await page.metrics();
  console.log(JSON.stringify(perf, null, '  '));

  browser.close();
}

run();

Network waterfall Google Chrome

When you run it you get this output: https://gist.github.com/peterbe/afb09bf9277e5fa9242f8d270c687640
To run it you need to have a decently up-to-date version of puppeteer installed.

I don't claim (far from it actually!) to understand all the metric points in there but I believe this is basically what the Network panel in the Google Chrome dev tools is built upon. But some details and facts are easy to figure out and use in your analysis. For example, the fact that getEntries() lists all the resources that had to be downloaded, in the order they were downloaded. Also, at the end of getEntries() you get the first-paint entry which is often a useful metric.

Anyway, give it a spin. Wrap this up in a platform and see if you can build something really simple and really tailored to your web projects web performance testing.

Comparing KeyCDN and DigitalOcean's new Spaces CDN

September 28, 2018
2 comments Web development

The News

This week DigitalOcean added a free CDN option for retrieving public files on their Spaces product. If you haven't heard about it, Spaces is like AWS S3 but from DigitalOcean instead. You use the same tools as you use for S3, like s3cmd. It's super easy to use in the admin control panel and you can do all sorts of neat and nifty things via the web. And the documentation about how to set up and use s3cmd is very good.
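And because the API is S3-compatible, you can point boto3 at it too. A hedged sketch (the bucket name, region, and keys here are made up):


# Spaces speaks the S3 protocol; just aim the client at the Spaces endpoint.
import boto3

session = boto3.session.Session()
client = session.client(
    "s3",
    region_name="nyc3",
    endpoint_url="https://nyc3.digitaloceanspaces.com",
    aws_access_key_id="YOUR_SPACES_KEY",
    aws_secret_access_key="YOUR_SPACES_SECRET",
)
client.upload_file(
    "myfile.jpg", "myspace", "myfile.jpg", ExtraArgs={"ACL": "public-read"}
)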

If we just focus on the CDN functionality, it's just a URL to a distributed resource that can be reached and retrieved faster because the server you get it from is hopefully geographically as near to you as possible. That's also what AWS CloudFront, Akamai CDN, and KeyCDN do. However, what you can do with the likes of KeyCDN is have what's called a "pull zone". You host the files on your regular server (aka. the "origin server") and the CDN will automatically pick them up from there.

With DigitalOcean Spaces CDN...

1) GET https://myspace.nyc3.cdn.digitaloceanspaces.com/myfile.jpg
2) CDN doesn't have it because it's never been requested before
3) CDN does GET https://myspace.nyc3.digitaloceanspaces.com/myfile.jpg and serves it
4) If it didn't exist on https://myspace.nyc3.digitaloceanspaces.com the client gets a 404 Not Found.

With KeyCDN CDN...

1) GET https://myzone.kxcdn.com/myfile.jpg
2) CDN doesn't have it because it's never been requested before
3) CDN does GET https://www.mypersonalnginx.example.com/myfile.jpg and serves it
4) If it didn't exist on https://www.mypersonalnginx.example.com/myfile.jpg the client gets a 404 Not Found

The critical difference is that with DigitalOcean Spaces CDN, it won't go and check your web server. On your web server it's not enough to have the files on disk. You have to upload them to Spaces.

That's often not a problem because disks are not something you want to rely on anyway. It's better to store all files to something like Spaces or S3 and then feel free to just throw away the web server and recreate it. However, it's an important distinction.

On Performance

It's hard to measure these things but when shopping around for a CDN provider/solution you want to make sure you get one that is fast and reliable. There are lots of sites that compare CDNs, like cdnperf.com, but I often find that they either don't have the CDN I would choose or there's something else weird about the comparison.

So I set up a test using Hyperping.io. I created a URL that uses KeyCDN and a URL that uses DigitalOcean Spaces CDN. It's the same 32KB JPEG image. Then I created two monitors on Hyperping, each checked from San Francisco, New York, London, Frankfurt, Mumbai, São Paulo, and Sydney.

Now they've been doing GET requests to these respective URLs for about 24h and the results so far are as follows:

Side by side

  • KeyCDN: Response Time: 139ms
  • DigitalOcean Spaces: Response Time: 305ms

I don't know how much the response times are "skewed" because of the long tail of response times for Mumbai and Sydney.
But if you take an average of the times for San Francisco, New York, London, and Frankfurt you get:

  • KeyCDN: Average 33ms
  • DigitalOcean Spaces: Average 36ms

KeyCDN on Hyperping

DigitalOcean Spaces on Hyperping

In Conclusion

Being able to offload all the files from disk and put them somewhere safe is an important feature. In my side project Song Search I actually, currently, host about 7GB of images (~500k files) directly on disk and it's making me nervous. I need to move them off to something like Spaces. But it's cumbersome to have to make sure every single file generated on disk is correctly synced. Especially if post-processing happens on the file using the local file system. (This project runs mozjpeg and guetzli on every image.)

Either way, this is a non-trivial dev-ops topic with many angles and opinions. I just thought I'd share about my quick research and performance testing.

And last but not least, the difference between 20ms and 30ms isn't important if 90+% of your visitors are from the US. All of these considerations depend on your context.

UPDATE - Oct 1, 2018

I'll continue to take the risk of looking like an idiot who doesn't understand networks and the science of data analysis.

I wrote a Python script which downloads random images from each CDN URL. You leave it running for a long time and hopefully the median will start to even out and be less affected by your own network saturation.
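It's roughly this (a simplified sketch; the real script picks random images, here one hypothetical image path stands in):


import statistics
import time

import requests

DOMAINS = (
    "songsearch.nyc3.cdn.digitaloceanspaces.com",
    "songsearch-2916.kxcdn.com",
)
times = {domain: [] for domain in DOMAINS}

for _ in range(100):
    for domain in DOMAINS:
        t0 = time.time()
        response = requests.get(f"https://{domain}/some-image.jpg")
        response.raise_for_status()
        times[domain].append(time.time() - t0)

print(f"{'DOMAIN':<50} MEDIAN     MEAN")
for domain, numbers in times.items():
    median = statistics.median(numbers)
    mean = statistics.mean(numbers)
    print(f"{domain:<50} {median:.3f}s    {mean:.3f}s")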

I ran it on my laptop; from South Carolina, USA:

DOMAIN                                             MEDIAN     MEAN
songsearch.nyc3.cdn.digitaloceanspaces.com         0.140s     0.335s
songsearch-2916.kxcdn.com                          0.165s     0.250s
2,705 iterations.

And my colleague Mathieu ran it; from Barcelona, Spain:

DOMAIN                                             MEDIAN     MEAN
songsearch.nyc3.cdn.digitaloceanspaces.com         1.027s     1.168s    
songsearch-2916.kxcdn.com                          0.574s     0.648s    
134 iterations.

And my colleague Ethan ran it; from New York City, USA:

DOMAIN                                             MEDIAN     MEAN
songsearch.nyc3.cdn.digitaloceanspaces.com         0.335s     0.595s
songsearch-2916.kxcdn.com                          0.066s     0.076s
122 iterations.

I think that if you're in Barcelona you're so far away from the nearest edge location that the latency is the king of the difference. Mathieu found the KeyCDN median download times to be almost twice as good!

If you're in New York, where I suspect DigitalOcean has an edge location and I know KeyCDN has one, the latency probably matters less and what matters more is the speed of the CDN web servers. In Ethan's case (only 122 measurement points) the median for KeyCDN is almost five times better! Not sure what that means but it definitely raises thoughts about how this actually matters a lot.

Also, if you're in Barcelona you would definitely want to download the little JPEGs half a second faster if you could.