Filtered by Web development

Page 7

Reset

When Docker is too slow, use your host

January 11, 2018
3 comments Web development, Django, macOS, Docker

I have a side-project that is basically a React frontend, a Django API server and a Node universal React renderer. The killer feature is its Elasticsearch database that searches almost 2.5M large texts and 200K named objects. All the data is stored in a PostgreSQL and there's some Python code that copies that stuff over to Elasticsearch for indexing.

Timings for searches in Songsearch
The PostgreSQL database is about 10GB and the Elasticsearch (version 6.1.0) indices are about 6GB. It's moderately big and even though individual searches take, on average ~75ms (in production) it's hefty. At least for a side-project.

On my MacBook Pro, laptop I use Docker to do development. Docker makes it really easy to run one command that starts memcached, Django, a AWS Product API Node app, create-react-app for the search and a separate create-react-app for the stats web app.

At first I tried to also run PostgreSQL and Elasticsearch in Docker too, but after many attempts I had to just give up. It was too slow. Elasticsearch would keep crashing even though I extended my memory in Docker to 4GB.

This very blog (www.peterbe.com) has a similar stack. Redis, PostgreSQL, Elasticsearch all running in Docker. It works great. One single docker-compose up web starts everything I need. But when it comes to much larger databases, I found my macOS host to be much more performant.

So the dark side of this is that I have remember to do more things when starting work on this project. My PostgreSQL was installed with Homebrew and is always running on my laptop. For Elasticsearch I have to open a dedicated terminal and go to a specific location to start the Elasticsearch for this project (e.g. make start-elasticsearch).

The way I do this is that I have this in my Django projects settings.py:


import dj_database_url
from decouple import config


DATABASES = {
    'default': config(
        'DATABASE_URL',
        # Hostname 'docker.for.mac.host.internal' assumes
        # you have at least Docker 17.12.
        # For older versions of Docker use 'docker.for.mac.localhost'
        default='postgresql://peterbe@docker.for.mac.host.internal/songsearch',
        cast=dj_database_url.parse
    )
}

ES_HOSTS = config('ES_HOSTS', default='docker.for.mac.host.internal:9200', cast=Csv())

(Actually, in reality the defaults in the settings.py code is localhost and I use docker-compose.yml environment variables to override this, but the point is hopefully still there.)

And that's basically it. Now I get Docker to do what various virtualenvs and terminal scripts used to do but the performance of running the big databases on the host.

Whatsdeployed facelift

January 5, 2018
0 comments Python, Web development, Mozilla, Docker

tl;dr; Whatsdeployed.io is an impressively simple web app to help web developers and web ops people quickly see what GitHub commits have made it into your Dev, Stage or Prod environment. Today it got a facelift.

The code is now more than 5 years old and has served me well. It's weird to talk too positively about the app because I actually wrote it but because it's so simple in terms of design and effort it feels less personal to talk about it.

Here's what's in the facelift

  • Upgraded to Bootstrap 4.
  • Instead of relying on downloading a heavy Glyphicon web font, just to display a single checkmark, that's now a simple image.
  • Ability to use a GitHub developer personal token to avoid rate limitations on GitHub's API.
  • The first lookup to get all commits is now done via the Flask app to use my auth token to avoid the rate limit.
  • Much better error handling if any of the underlying requests.get() that the Flask app does, fails. Also includes which URL it failed on.
  • Basic validation to prevent submitting the main form without typing anything in.
  • You can hack on it with Docker. Thanks @willkg.
  • Improved the code that extracts Bugzilla bug numbers out of commit messages. Thanks @edmorely.
  • Refreshed screenshots in the README.md
  • A brand new introduction text on the home page for people who end up on the site not knowing what it is.
  • If any XHR errors happen figuring out the "culprits", you now get a pretty error describing this instead of swallowing it all.

Please let me know if there's anything broken or missing.

CSS selector simplifier regular expression in JavaScript

December 20, 2017
0 comments Web development, JavaScript

The Problem

I'm working on a project where it needs to evaluate CSS as a string. Basically, it compares CSS selectors against a DOM to see if the CSS selector is used in the DOM.

But CSS has pseudo classes. A common one a lot of people are familiar with is: a:hover { text-decoration: crazy }. So that :hover part is not relevant when evaluating the CSS selector against the DOM. So you chop off the :hover bit and is left with a which you can then look for in the DOM.

But there are some tricks and make this less trivial. Consider, this from Bootstrap 3


a[href^="#"]:after,
a[href^="javascript:"]:after {
    content: "";
}

In this case we can't simply split on a : character.

Another non-trivial example comes from Semantic UI:


.ui[class*="4:3"].embed {
  padding-bottom: 75%;
}
.ui[class*="16:9"].embed {
  padding-bottom: 56.25%;
}
.ui[class*="21:9"].embed {
  padding-bottom: 42.85714286%;
}

Basically, if you just split the selectors (e.g. a:hover) on the first : and keep everything to the left (e.g. a), with these non-trivial CSS selectors you'd get this:

a[href^="javascript

and

.ui[class*="4

etc. These CSS selectors will fail. Both Firefox and Chrome seem to swallow any errors but cheerio will raise a SyntaxError and not just that but the problem is that the CSS selector is just the wrong one to look for.

The Solution

The solution has to be to split by the : character when it's not between two quotation marks.

This Stackoverflow post helped me with the regex. It was trivial to extend now my final solution looks like this:


/**
 * Reduce a CSS selector to be without any pseudo class parts.
 * For example, from 'a:hover' return 'a'. And from 'input::-moz-focus-inner'
 * to 'input'.
 * Also, more advanced ones like 'a[href^="javascript:"]:after' to
 * 'a[href^="javascript:"]'.
 * The last example works too if the input was 'a[href^='javascript:']:after'
 * instead (using ' instead of ").
 *
 * @param {string} selector
 * @return {string}
 */
const reduceCSSSelector = selector => {
  return selector.split(
    /:(?=([^"'\\]*(\\.|["']([^"'\\]*\\.)*[^"'\\]*['"]))*[^"']*$)/g
  )[0]
}

Extra; About regexes

I've been coding for about 20 years and would like to think I know my way around writing regular expressions in various languages. However, I'm also eager to admit that I often fumble and rely on googling/stackoverflow more than actually understanding what the heck I'm doing. That's why I found this comment so amusing:

Thank you! Didn't think it was possible. I understand 100% of the theory, about 60% of the regex, and I'm down to 0% when it comes to writing it on my own. Oh, well, maybe one of these days. – Azmisov

Msgpack vs JSON (with gzip)

December 19, 2017
14 comments Python, Web development

tl;dr; I see no reason worth switching to Msgpack instead of good old JSON.

I was curious, how much more efficient is Msgpack at packing a bunch of data into a file I can emit from a web service.

In this experiment I take a massive JSON file that is used in a single-page-app I worked on. If I download the file locally as a .json file, the file is 2.1MB.

Converting it to Msgpack:


>>> import json, msgpack
>>> with open('events.json') as f:
...   events=json.load(f)
...
>>> len(events)
3
>>> events.keys()
dict_keys(['max_modified', 'events', 'urls'])
>>> with open('events.msgpack', 'wb') as f:
...   f.write(msgpack.packb(events))
...
1880266

events.json vs events.msgpack
Now, let's compared the two file formats, as seen on disk:

▶ ls -lh events*
-rw-r--r--  1 peterbe  wheel   2.1M Dec 19 10:16 events.json
-rw-r--r--  1 peterbe  wheel   1.8M Dec 19 10:19 events.msgpack

But! How well does it compress?

More common than not your web server can return content encoded in Gzip as content-encoding: gzip. So, let's compare that:

▶ gzip events.json ; gzip events.msgpack
▶ ls -l events*
-rw-r--r--  1 peterbe  wheel  304416 Dec 19 10:16 events.json.gz
-rw-r--r--  1 peterbe  wheel  305905 Dec 19 10:19 events.msgpack.gz

Msgpack vs JSON (with gzip)

Oh my! When you gzip the files the .json file ultimately becomes smaller. By a whopping 0.5%!

What about speed?

First let's open the files a bunch of times and see how long it takes to unpack:


def f1():
    with open('events.json') as f:
        s = f.read()
    t0 = time.time()
    events = json.loads(s)
    t1 = time.time()
    assert len(events['events']) == 4365
    return t1 - t0


def f2():
    with open('events.msgpack', 'rb') as f:
        s = f.read()
    t0 = time.time()
    events = msgpack.unpackb(s, encoding='utf-8')
    t1 = time.time()
    assert len(events['events']) == 4365
    return t1 - t0


def f3():
    with open('events.json') as f:
        s = f.read()
    t0 = time.time()
    events = ujson.loads(s)
    t1 = time.time()
    assert len(events['events']) == 4365
    return t1 - t0

(Note that the timing is around the json.loads() etc without measuring how long it takes to get the files to strings)

json.loads() vs msgpack.unpack() vs. ujson.loads()
Result (using Python 3.6.1): All about the same.

FUNCTION: f1 Used 56 times
    MEDIAN 30.509352684020996
    MEAN   31.09178798539298
    STDEV  3.5620914333233595
FUNCTION: f2 Used 68 times
    MEDIAN 27.882099151611328
    MEAN   28.704492484821994
    STDEV  3.353800228776872
FUNCTION: f3 Used 76 times
    MEDIAN 27.746915817260742
    MEAN   27.920340236864593
    STDEV  2.21554251130519

Same benchmark using PyPy 3.5.3, but skipping the f3() which uses ujson:

FUNCTION: f1 Used 99 times
    MEDIAN 20.905017852783203
    MEAN   22.13949386519615
    STDEV  5.142071370453135
FUNCTION: f2 Used 101 times
    MEDIAN 36.96393966674805
    MEAN   40.54664857316725
    STDEV  17.833577642246738

Dicussion and conclusion

One of the benefits of Msgpack is that it can used for streaming. "Streaming unpacking" as they call it. But, to be honest, I've never used it. That can useful when you have structured data trickling in and you don't want to wait for it all before using the data.

Another cool feature Msgpack has is ability to encode custom types. E.g. datetime.datetime. Like bson can do. With JSON you have to, for datetime objects do string conversions back and forth and the formats are never perfectly predictable so you kinda have to control both ends.

But beyond some feature differences, it seems that JSON compressed just as well as Msgpack when Gzipped. And unlike Msgpack JSON is not binary so it's easy to poke around with any tool. And decompressing JSON is just as fast. Almost. But if you need to squeeze out a couple of extra free milliseconds from your JSON files you can use ujson.

Conclusion; JSON is fine. It's bigger but if you're going to Gzip anyway, it's just as small as Msgpack.

Bonus! BSON

Another binary encoding format that supports custom types is BSON. This one is a pure Python implementation. BSON is used by MongoDB but this bson module is not what PyMongo uses.

Size comparison:

▶ ls -l events*son
-rw-r--r--  1 peterbe  wheel  2315798 Dec 19 11:07 events.bson
-rw-r--r--  1 peterbe  wheel  2171439 Dec 19 10:16 events.json

So it's 7% larger than JSON uncompressed.

▶ ls -l events*son.gz
-rw-r--r--  1 peterbe  wheel  341595 Dec 19 11:07 events.bson.gz
-rw-r--r--  1 peterbe  wheel  304416 Dec 19 10:16 events.json.gz

Meaning it's 12% fatter than JSON when Gzipped.

Doing a quick benchmark with this:


def f4():
    with open('events.bson', 'rb') as f:
        s = f.read()
    t0 = time.time()
    events = bson.loads(s)
    t1 = time.time()
    assert len(events['events']) == 4365
    return t1 - t0

Compared to the original f1() function:

FUNCTION: f1 Used 106 times
    MEDIAN 29.58393096923828
    MEAN   30.289863640407347
    STDEV  3.4766612593557173
FUNCTION: f4 Used 94 times
    MEDIAN 231.00042343139648
    MEAN   231.40889786659403
    STDEV  8.947746458066405

In other words, bson is about 600% slower than json.

This blog post was supposed to be about how well the individual formats size up against each other on disk but it certainly would be interesting to do a speed benchmark comparing Msgpack and JSON (and maybe BSON) where you have a bunch of datetimes or decimal.Decimal objects and see if the difference is favoring the binary formats.

Another win for Tracking Protection in Firefox

December 13, 2017
0 comments Web development, Mozilla

When I read Swedish news I usually open DN.se. At least I used to. Opening it used to feel like an "investment". It takes too long to load and it makes the browser all janky because of the weight of all ad videos and whatnot.

With Tracking Protection, which is built in to recent versions of Firefox, all of those load speed concerns goes away pretty much.

Without Tracking Protection

Without

  • 322 requests
  • 8.78MB transferred
  • 16.12s to finish fully loading

With Tracking Protection

With Tracking Protection

  • 67 requests
  • 1.15MB transferred
  • 8.42s to finish fully loading

I've blogged about this before but I'm just so excited I felt like repeating it. With Tracking Protection now enabled it actually feels fun to open heavy-on-ads news sites again. Perhaps even worth paying for.

Synonyms with elasticsearch-dsl

December 5, 2017
0 comments Python, Web development, PostgreSQL

The documentation about how to use synonyms in Elasticsearch is good but because it's such an advanced topic, even if you read the documentation carefully, you're still left with lots of questions. Let me show you some things I've learned about how to use synonyms in Python with elasticsearch-dsl.

What's the nature of your documents?

I'm originally from Sweden but moved to London, UK in 1999 and started blogging a few years after. So I wrote most of my English with British English spelling. E.g. "centre" instead of "center". Later I moved to California in the US and slowly started to change my own English over to American English. I kept blogging but now I would prefer to write "center" instead of "centre".

Another example... Certain technical words or namings are tricky. For example, is it "go" or is it "golang"? Is it "React" or is it "ReactJS"? Is it "PostgreSQL" or "Postgres". I never know. Not only is it sometimes hard to know which is right because people use them differently, but also sometimes "brands" like that change over time since inception, the creator might have preferred something but the masses of people call it something else.

So with all that in mind, not only has the nature of my documents (my blog post texts) changed in terminology over the years. My visitors are also coming both from British English and American English. Or, suppose that I knew the perfect way to phrase that relational database that starts with "Postg...". Even if my text is always spelled one particular way, perfectly, my visitors will most likely refer to it as "postgres" sometimes and "postgresql" sometimes.

The simple solution, match all!

Create a custom analyzer

Let's jump straight into the code. People who have used elasticsearch_dsl should be familiar with most of this:


from elasticsearch_dsl import (
    DocType,
    Text,
    Index,
    analyzer,
    Keyword,
    token_filter,
)
from django.conf import settings


index = Index(settings.ES_INDEX)
index.settings(**settings.ES_INDEX_SETTINGS)


synonym_tokenfilter = token_filter(
    'synonym_tokenfilter',
    'synonym',
    synonyms=[
        'reactjs, react',  # <-- important
    ],
)

text_analyzer = analyzer(
    'text_analyzer',
    tokenizer='standard',
    filter=[
        # The ORDER is important here.
        'standard',
        'lowercase',
        'stop',
        synonym_tokenfilter,
        # Note! 'snowball' comes after 'synonym_tokenfilter'
        'snowball',
    ],
    char_filter=['html_strip']
)

class BlogItemDoc(DocType):
    oid = Keyword(required=True)
    title = Text(
        required=True, 
        analyzer=text_analyzer
    )
    text = Text(analyzer=text_analyzer)

index.doc_type(BlogItemDoc)

This code above is copied from the "real code" but a lot of distracting things that aren't important to the point, have been removed.

The magic sauce here is that you create a token_filter and you can call it whatever you want. I called mine synonym_tokenfilter and that's also what the instance variable is called.

Notice the list of synonyms. It's a plain list of strings. Specifically, it's a list of 1 string reactjs, react.

Let's see how Elasticsearch analyzes this:
First with the text react.

$ curl -XGET 'http://127.0.0.1:9200/peterbecom/_analyze?analyzer=text_analyzer&text=react&pretty=1'
{
  "tokens" : [
    {
      "token" : "react",
      "start_offset" : 0,
      "end_offset" : 5,
      "type" : "<ALPHANUM>",
      "position" : 0
    },
    {
      "token" : "reactj",
      "start_offset" : 0,
      "end_offset" : 5,
      "type" : "SYNONYM",
      "position" : 0
    }
  ]
}

Note that the analyzer snowball, converted reactjs to reactj which is wrong in a sense, because there's not plural "reacts", but it ultimately doesn't matter much. At least not in this particular case.

Secondly, analyze it with the text reactjs:

$ curl -XGET 'http://127.0.0.1:9200/peterbecom/_analyze?analyzer=text_analyzer&text=reactjs&pretty=1'
{
  "tokens" : [
    {
      "token" : "reactj",
      "start_offset" : 0,
      "end_offset" : 7,
      "type" : "<ALPHANUM>",
      "position" : 0
    },
    {
      "token" : "react",
      "start_offset" : 0,
      "end_offset" : 7,
      "type" : "SYNONYM",
      "position" : 0
    }
  ]
}

Same tokens! Just different order.

Test it for reals

Now, the real proof is in actually doing a search on this. Look at these two screenshots:

Search for 'react'

Search for 'reactjs'

It worked! Different ways of phrasing your search but ultimately found all the documents that matched independent of different people or different authors might prefer to spell it.

Try it for yourself:

What it looked like before

Check out these two screenshots of how it would look like before, when synonyms for postgres and postgresql had not been set up yet:

Searching for 'postgresql'

Searching for 'postgres'

One immediate thought I have is what a mess I've been in blogging about that database. Clearly I struggled to pick one way to spell it consistently.

And here's what it would look like once that synonym has been set up:

Synonym set up for 'postgres' and 'postgresql'

"go" versus "golang"

Go is a programming language. That term, too, struggles with a name ambiguity. Granted, I rarely hear people say "golang", but it's definitely a written word that turns up a lot.

The problem with setting up a synonym for go == golang is that "go" is common English word. It's also the stem of the word "going" and such. So if you set up a synonym, like I did for react and reactjs above, this is what happens:

Search for 'golang'

This is now the exact search results as if I had searched for go. But look what it matched! It matched "Go" (good) but also "Going real simple..." (bad) and "...I should go" (bad).

If someone searches for the simple term "go" they probably intend to search for the Go programming language. All that snowball stemming is critical for a bunch of other non-computer-term searches so we can't remove the stemming.

The solution is to use what's called "Simple Contraction". And it looks like this:


all_synonyms = [
    'go => golang',
    'react => reactjs',
    'postgres => postgresql',
]

That basically means that a search for go is a search for golang. And a document that uses the word go (alone) is indexed as golang.

What happens is that the word go gets converted to golang which doesn't get stemming converted down to any other forms.

However, this is no silver bullet. Any search for the term go is ultimately a search for the word golang and the regular English word go. So the benefit of all of this was that we got rid of search results matching on going and gone.

What you have to decide...

The case for go is similar to the case for react. Both of these words are nouns but they're also verbs.

Should people find "reacting to events" when they search for "react"? If so, use react, reactjs in the synonyms list.

Should people only find documents related to noun "React" when they search for "event handing in react"? If so, use react => reactjs in the synonyms list.

It's up to you and your documents and what your users tend to search for.

Bonus! For American vs British English

AVKO.org publishes a list of all British to American English synonyms. You can download the whole list here. Unfortunately I can't find a license for this file but the compiled synonyms file is part of this repo which is licensed under MIT.

I download this list and keep it in the repo. Then when setting up the analyzer and token filters I load it in like this:


synonyms_root = os.path.join(
    settings.BASE_DIR, 'peterbecom/es-synonyms'
)
american_british_syns_fn = os.path.join(
    synonyms_root, 'be-ae.synonyms'
)

with open(american_british_syns_fn) as f:
    for line in f:
        if (
            '=>' not in line or 
             line.strip().startswith('#')
         ):
            continue
        all_synonyms.append(line.strip())

Now I can finally enjoy not having to worry about the fact that sometimes I spell it "license" and sometimes I spell it "licence". It's all the same now. Brits and Americans, rejoice on common ground!

Bonus! For terrible spellers

Although I don't have a big problem with this on my techy blog but you can use the Simple Contraction technique to list unambiguously bad spelling. Add dont => don't to the list of synonyms and a search for dont is a search for don't.

Last but not least, the official Elasticsearch documentation is the place to go. This blog post hopefully phrases it in more approachable terms. Especially for Python peeps.

How to create-react-app with Docker

November 17, 2017
31 comments Linux, Web development, JavaScript, React, Docker

Why would you want to use Docker to do React app work? Isn't Docker for server-side stuff like Python and Golang etc? No, all the benefits of Docker apply to JavaScript client-side work too.

So there are three main things you want to do with create-react-app; dev server, running tests and creating build artifacts. Let's look at all three but using Docker.

Create-react-app first

If you haven't already, install create-react-app globally:

▶ yarn global add create-react-app

And, once installed, create a new project:

▶ create-react-app docker-create-react-app
...lots of output...

▶ cd docker-create-react-app
▶ ls
README.md    node_modules package.json public       src          yarn.lock

We won't need the node_modules here in the project directory. Instead, when building the image we're going let node_modules stay inside the image. So you can go ahead and... rm -fr node_modules.

Create the Dockerfile

Let's just dive in. This Dockerfile is the minimum:

FROM node:8

ADD yarn.lock /yarn.lock
ADD package.json /package.json

ENV NODE_PATH=/node_modules
ENV PATH=$PATH:/node_modules/.bin
RUN yarn

WORKDIR /app
ADD . /app

EXPOSE 3000
EXPOSE 35729

ENTRYPOINT ["/bin/bash", "/app/run.sh"]
CMD ["start"]

A couple of things to notice here.
First of all we're basing this on the official Node v8 repository on Docker Hub. That gives you a Node and Yarn by default.

Note how the NODE_PATH environment variable puts the node_modules in the root of the container. That's so that it doesn't get added in "here" (i.e. the current working directory). If you didn't do this, the node_modules directory would be part of the mounted volume which not only slows down Docker (since there are so many files) it also isn't necessary to see those files.

Note how the ENTRYPOINT points to run.sh. That's a file we need to create too, alongside the Dockerfile file.

#!/usr/bin/env bash
set -eo pipefail

case $1 in
  start)
    # The '| cat' is to trick Node that this is an non-TTY terminal
    # then react-scripts won't clear the console.
    yarn start | cat
    ;;
  build)
    yarn build
    ;;
  test)
    yarn test $@
    ;;
  *)
    exec "$@"
    ;;
esac

Lastly, as a point of convenience, note that the default CMD is "start". That's so that when you simply run the container the default thing it does is to run yarn start.

Build container

Now let's build it:

▶ docker image build -t react:app .

The -t react:app is up to you. It doesn't matter so much what it is unless you're going to upload your container the a registry. Then you probably want the repository to be something unique.

Let's check that the build is there:

▶ docker image ls react:app
REPOSITORY          TAG                 IMAGE ID            CREATED             SIZE
react               app                 3ee5c7596f57        13 minutes ago      996MB

996MB! The base Node image is about ~700MB and the node_modules directory (for a clean new create-react-app) is ~160MB (at the time of writing). What the remaining difference is, I'm not sure. But it's empty calories and easy to lose. When you blow away the built image (docker image rmi react:app) your hard drive gets all that back and no actual code is lost.

Before we run it, lets go inside and see what was created:

▶ docker container run -it react:app bash
root@996e708a30c4:/app# ls
Dockerfile  README.md  package.json  public  run.sh  src  yarn.lock
root@996e708a30c4:/app# du -sh /node_modules/
148M    /node_modules/
root@996e708a30c4:/app# sw-precache
Total precache size is about 355 kB for 14 resources.
service-worker.js has been generated with the service worker contents.

The last command (sw-precache) was just to show that executables in /node_modules/.bin are indeed on the $PATH and can be run.

Run container

Now to run it:

▶ docker container run -it -p 3000:3000 react:app
yarn run v1.3.2
$ react-scripts start
Starting the development server...

Compiled successfully!

You can now view docker-create-react-app in the browser.

  Local:            http://localhost:3000/
  On Your Network:  http://172.17.0.2:3000/

Note that the development build is not optimized.
To create a production build, use yarn build.

Default app running

Pretty good. Open http://localhost:3000 in your browser and you should see the default create-react-app app.

Next step; Warm reloading

create-react-app does not support hot reloading of components. But it does support web page reloading. As soon as a local file is changed, it sends a signal to the browser (using WebSockets) to tell it to... document.location.reload().

To make this work, we need to do two things:
1) Mount the current working directory into the Docker container
2) Expose the WebSocket port

The WebSocket thing is set up by exposing port 35729 to the host (-p 35729:35729).

Below is an example running this with a volume mount and both ports exposed.

▶ docker container run -it -p 3000:3000 -p 35729:35729 -v $(pwd):/app react:app
yarn run v1.3.2
$ react-scripts start
Starting the development server...

Compiled successfully!

You can now view docker-create-react-app in the browser.

  Local:            http://localhost:3000/
  On Your Network:  http://172.17.0.2:3000/

Note that the development build is not optimized.
To create a production build, use yarn build.

Compiling...
Compiled successfully!
Compiling...
Compiled with warnings.

./src/App.js
  Line 7:  'neverused' is assigned a value but never used  no-unused-vars

Search for the keywords to learn more about each warning.
To ignore, add // eslint-disable-next-line to the line before.

Compiling...
Failed to compile.

./src/App.js
Module not found: Can't resolve './Apps.css' in '/app/src'

In the about example output. First I make a harmless save in the src/App.js file just to see that the dev server notices and that my browser reloads when I did that. That's where it says

Compiling...
Compiled successfully!

Secondly, I make an edit that triggers a warning. That's where it says:

Compiling...
Compiled with warnings.

./src/App.js
  Line 7:  'neverused' is assigned a value but never used  no-unused-vars

Search for the keywords to learn more about each warning.
To ignore, add // eslint-disable-next-line to the line before.

And lastly I make an edit by messing with the import line

Compiling...
Failed to compile.

./src/App.js
Module not found: Can't resolve './Apps.css' in '/app/src'

This is great! Isn't create-react-app wonderful?

Build build :)

There are many things you can do with the code you're building. Let's pretend that the intention is to build a single-page-app and then take the static assets (including the index.html) and upload them to a public CDN or something. To do that we need to generate the build directory.

The trick here is to run this with a volume mount so that when it creates /app/build (from the perspective) of the container, that directory effectively becomes visible in the host.

▶ docker container run -it -v $(pwd):/app react:app build
yarn run v1.3.2
$ react-scripts build
Creating an optimized production build...
Compiled successfully.

File sizes after gzip:

  35.59 KB  build/static/js/main.591fd843.js
  299 B     build/static/css/main.c17080f1.css

The project was built assuming it is hosted at the server root.
To override this, specify the homepage in your package.json.
For example, add this to build it for GitHub Pages:

  "homepage" : "http://myname.github.io/myapp",

The build folder is ready to be deployed.
You may serve it with a static server:

  yarn global add serve
  serve -s build

Done in 5.95s.

Now, on the host:

▶ tree build
build
├── asset-manifest.json
├── favicon.ico
├── index.html
├── manifest.json
├── service-worker.js
└── static
    ├── css
    │   ├── main.c17080f1.css
    │   └── main.c17080f1.css.map
    ├── js
    │   ├── main.591fd843.js
    │   └── main.591fd843.js.map
    └── media
        └── logo.5d5d9eef.svg

4 directories, 10 files

The contents of that file you can now upload to a CDN some public Nginx server that points to this as the root directory.

Running tests

This one is so easy and obvious now.

▶ docker container run -it -v $(pwd):/app react:app test

Note the that we're setting up a volume mount here again. Since the test runner is interactive it sits and waits for file changes and re-runs tests immediately, it's important to do the mount now.

All regular jest options work too. For example:

▶ docker container run -it -v $(pwd):/app react:app test --coverage
▶ docker container run -it -v $(pwd):/app react:app test --help

Debugging the node_modules

First of all, when I say "debugging the node_modules", in this context, I'm referring to messing with node_modules whilst running tests or running the dev server.

One way to debug the node_modules used is to enter a bash shell and literally mess with the files inside it. First, start the dev server (or start the test runner) and give the container a name:

▶ docker container run -it -p 3000:3000 -p 35729:35729 -v $(pwd):/app --name mydebugging react:app

Now, in a separate terminal start bash in the container:

▶ docker exec -it mydebugging bash

Once you're in you can install an editor and start editing files:

root@2bf8c877f788:/app# apt-get update && apt-get install jed
root@2bf8c877f788:/app# jed /node_modules/react/index.js

As soon as you make changes to any of the files, the dev server should notice and reload.

When you stop the container all your changes will be reset. So if you had to sprinkle the node_modules with console.log('WHAT THE HECK!') all of those disappear when the container is stopped.

NodeJS shell

This'll come as no surprise by now. You basically run bash and you're there:

▶ docker container run -it -v $(pwd):/app react:app bash
root@2a21e8206a1f:/app# node
> [] + 1
'1'

Conclusion

When I look back at all the commands above, I can definitely see how it's pretty intimidating and daunting. So many things to remember and it's got that nasty feeling where you feel like your controlling your development environment through unwieldy levers rather than your own hands.

But think of the fundamental advantages too! It's all encapsulated now. What you're working on will be based on the exact same version of everything as your teammate, your dev server and your production server are using.

Pros:

  • All packaged up and all team members get the exact same versions of everything, including Node and Yarn.
  • The node_modules directory gets out of your hair.
  • Perhaps some React code is just a small part of a large project. E.g. the frontend is React, the backend is Django. Then with some docker-compose magic you can have it all running with one command without needing to run the frontend in a separate terminal.

Cons:

  • Lack of color output in terminal.
  • The initial (or infrequent) wait for building the docker image is brutal on a slow network.
  • Lots of commands to remember. For example, How do you start a shell again?

In my (Mozilla Services) work, the projects I work on, I actually use docker-compose for all things. And I have a Makefile to help me remember all the various docker-compose commands (thanks Jannis & Will!). One definitely neat thing you can do with docker-compose is start multiple containers. Then you can, with one command, start a Django server and the create-react-app dev server with one command. Perhaps a blog post for another day.

To enable Tracking Protection for performance

November 1, 2017
0 comments Web development, Mozilla

In Firefox it's really easy to switch Tracking Protection on and off. I usually have it off in my main browser because, as a web developer, it often gives me an insight into what others would see and that's often helpful.

Tracking Protection as a means for performance boost as been mentioned many times before to not just avoid leaving digital footprints but also a way to browse faster.

An Example of Performance

I just wanted to show such an example. In both examples I load a blog post on seriouseats.com which has real content and lots of images (11 non-ad images, totalling ~600KB).

First, without Tracking Protection

Without Tracking Protection

Next, with Tracking Protection

With Tracking Protection

This is the Network waterfall view of all the requests needed to make up the page. The numbers at the very bottom are the interesting ones.

Without Tracking Protection

  • 555 requests
  • 8.61MB transferred
  • Total loading time 59 seconds

With Tracking Protection

  • 48 requests
  • 1.87MB transferred
  • Total loading time 3.3 seconds

It should be pointed out that the first version, without Tracking Protection, actually is document complete at about 20 seconds. That means that page loading icon in the browser toolbar stops spinning. That's because internally it starts triggering more downloads (for ads) by JavaScript that executes when the page load event is executed. So you can already start reading the main content and all content related images are ready at this point but it's still ~30 seconds of excessive browser and network activity just to download the ads.

Enable Tracking Protection

Enabling Tracking Protection in Firefox is really easy. It's not an addon or anything else that needs to be installed. Just click Preferences, then click the Privacy & Security tab, scroll down a little and look for Tracking Protection. There choose the Always option. That's it.

Discussion

Tracking Protection is big and involved topic. It digs into the realm of privacy and the right to your digital footprint. It also digs into the realm of making it harder (or easier, depending on how you phrase it) for content creators to generate revenue to be able to keep create content.

The takeaway is that it can mean many things but for people who just want to browse the web much much faster it can be just about performance.

Simple or fancy UPSERT in PostgreSQL with Django

October 11, 2017
7 comments Python, Web development, Django, PostgreSQL

As of PostgreSQL 9.5 we have UPSERT support. Technically, it's ON CONFLICT, but it's basically a way to execute an UPDATE statement in case the INSERT triggers a conflict on some column value. By the way, here's a great blog post that demonstrates how to use ON CONFLICT.

In this Django app I have a model that has a field called hash which has a unique=True index on it. What I want to do is either insert a row, or if the hash is already in there, it should increment the count and the modified_at timestamp instead.

The Code(s)

Here's the basic version in "pure Django ORM":


if MissingSymbol.objects.filter(hash=hash_).exists():
    MissingSymbol.objects.filter(hash=hash_).update(
        count=F('count') + 1,
        modified_at=timezone.now()
    )
else:
    MissingSymbol.objects.create(
        hash=hash_,
        symbol=symbol,
        debugid=debugid,
        filename=filename,
        code_file=code_file or None,
        code_id=code_id or None,
    )

Here's that same code rewritten in "pure SQL":


from django.db import connection


with connection.cursor() as cursor:
    cursor.execute("""
        INSERT INTO download_missingsymbol (
            hash, symbol, debugid, filename, code_file, code_id,
            count, created_at, modified_at
        ) VALUES (
            %s, %s, %s, %s, %s, %s,
            1, CLOCK_TIMESTAMP(), CLOCK_TIMESTAMP()
          )
        ON CONFLICT (hash)
        DO UPDATE SET
            count = download_missingsymbol.count + 1,
            modified_at = CLOCK_TIMESTAMP()
        WHERE download_missingsymbol.hash = %s
        """, [
            hash_, symbol, debugid, filename,
            code_file or None, code_id or None,
            hash_
        ]
    )

Both work.

Note the use of CLOCK_TIMESTAMP() instead of NOW(). Since Django wraps all writes in transactions if you use NOW() it will be evaluated to the same value for the whole transaction, thus making unit testing really hard.

But which is fastest?

The Results

First of all, this hard to test locally because my Postgres is running locally in Docker so the network latency in talking to a network Postgres means that the latency is less and having to do two different executions would cost more if the network latency is more.

I ran a simple benchmark where it randomly picked one of the two code blocks (above) depending on a 50% chance.
The results are:

METHOD     MEAN       MEDIAN
SQL        6.99ms     6.61ms
ORM        10.28ms    9.86ms

So doing it with a block of raw SQL instead is 1.5 times faster. But this difference would surely grow when the network latency is higher.

Discussion

There's an alternative and that's to use django-postgres-extra but I'm personally hesitant. The above little raw SQL hack is the only thing I need and adding more libraries makes far-future maintenance harder.

Beyond the time optimization of being able to send only 1 SQL instruction to PostgreSQL, the biggest benefit is avoiding concurrency race conditions. From the documentation:

"ON CONFLICT DO UPDATE guarantees an atomic INSERT or UPDATE outcome; provided there is no independent error, one of those two outcomes is guaranteed, even under high concurrency. This is also known as UPSERT — "UPDATE or INSERT"."

I'm going to keep this little hack. It's not beautiful but it works and saves time and gives me more comfort around race conditions.

"No space left on device" on OSX Docker

October 3, 2017
13 comments Web development, macOS, Docker

UPDATE 2020

As Greg Brown pointed out, the new way is:

docker container prune
docker image prune

Original blog post...


 

If you run out of disk space in your Docker containers on OSX, this is probably the best thing to run:

docker rm $(docker ps -q -f 'status=exited')
docker rmi $(docker images -q -f "dangling=true")

The Problem

This isn't the first time it's happened so I'm blogging about it to not forget. My postgres image in my docker-compose.yml didn't start and since it's linked its problem is "hidden". Running it in the foreground instead you can see what the problem is:

▶ docker-compose run db
The files belonging to this database system will be owned by user "postgres".
This user must also own the server process.

The database cluster will be initialized with locale "en_US.utf8".
The default database encoding has accordingly been set to "UTF8".
The default text search configuration will be set to "english".

Data page checksums are disabled.

fixing permissions on existing directory /var/lib/postgresql/data ... ok
initdb: could not create directory "/var/lib/postgresql/data/pg_xlog": No space left on device
initdb: removing contents of data directory "/var/lib/postgresql/data"

Docker on OSX

I admit that I have so much to learn about Docker and the learning is slow. Docker is amazing but I think I'm slow to learn because I'm just not that interested as long as it works and I can work on my apps.

It seems to me that there's a cap of all storage of all Docker containers in one big file in OSX. It's capped to 64GB:

▶ cd ~/Library/Containers/com.docker.docker/Data/com.docker.driver.amd64-linux/

com.docker.docker/Data/com.docker.driver.amd64-linux
▶ ls -lh Docker.qcow2
-rw-r--r--@ 1 peterbe  staff    63G Oct  3 08:51 Docker.qcow2

If you run the above mentioned commands (docker rm ...) this file does not shrink but space is freed up. Just like how MongoDB (used to) allocates much more disk space than it actually uses.

If you delete that Docker.qcow2 and restart Docker the space problem goes away but then the problem is that you lose all your active containers which is especially annoying if you have useful data in database containers.