Docker gotcha with building a Dockerfile in sub directory

March 2, 2018
1 comment Docker

tl;dr; Watch out for .dockerignore causing no such file or directory when building Docker images

First I tried to use docker-compose:

▶ docker-compose build ui
Building ui
Step 1/8 : FROM node:9
 ---> 29831ba76d93
Step 2/8 : ADD ui/package.json /package.json
ERROR: Service 'ui' failed to build: ADD failed: stat /var/lib/docker/tmp/docker-builder079614651/ui/package.json: no such file or directory

What the heck? Did I typo the name in the ui/Dockerfile?

The docker-compose.yml directive for this was:

yaml
  ui:
    build:
      context: .
      dockerfile: ui/Dockerfile
    environment:
      - NODE_ENV=development
    ports:
      - "3000:3000"
      - "35729:35729"
    volumes:
      - $PWD/ui:/app
    command: start

I don't know if it's the awesomest way to do docker-compose but I did copy this exactly from a different (working!) project. That other project called the web stuff frontend instead of ui in this project.

The Dockerfile looked like this:

FROM node:9

ADD ./ui/package.json ./
ADD ./ui/package-lock.json ./

RUN npm install

EXPOSE 3000
EXPOSE 35729

CMD [ "npm", "start" ]

Let's try without docker-compose. Or rather, do with docker what docker-compose does for me automatically.

▶ docker build ui -f ui/Dockerfile
Sending build context to Docker daemon  158.2MB
Step 1/8 : FROM node:9
 ---> 29831ba76d93
Step 2/8 : ADD ui/package.json /package.json
ADD failed: stat /var/lib/docker/tmp/docker-builder001494654/ui/package.json: no such file or directory

So I thought I perhaps have misunderstood how relative paths worked. I tried EVERYTHING!

I tried changing to docker build . -f ui/Dockerfile. No luck.

I tried all sorts of combinations of this with also changing the line to ADD ui/package.json ... or ADD /ui/package.json ... or ADD ./package.json ... or ADD package.json .... Nothing worked.

Finally I got it to work by doing

▶ cd ui
▶ docker build . -f Dockerfile

but for that to work I had to remove all references of the directory name ui in the Dockerfile. Nice, but this is not going to work in docker-compose.yml since that starts outside the directory ./ui/.

Sigh!

So then I learned about contexts in docker. Well, I skimmed the docs rapidly. To keep things clean it's a good idea to do things within the directory that matters. So I made this change in the docker-compose.yml:

  ui:
    build:
      context: ui
      dockerfile: Dockerfile
    environment:
      - NODE_ENV=development
    ports:
      - "3000:3000"
      - "35729:35729"
    volumes:
      - $PWD/ui:/app
    command: start

(Note the build.context: and build.dockerfile:)

Still doesn't work! 😫 Still various variations of no such file or directory.

The solution

Turns out, in projectroot/.dockerignore it had ui/ as a line entry!!!

I believe this project used to do some of the Python stuff with Docker and the React web app was done "locally" on the host. And since the ui/node_modules directory is so huge someone decided it was smart of avoid Docker mounting that.

Now the .dockerignore has .ui/node_modules and now everything works. I can build it with plain docker and docker-compose from outside the directory.

Perhaps I should have spent the time I spent writing this blog post to instead file a structured GitHub issue on Docker itself somewhere. I.e. that it should have warned be better. Any takers?

How to throttle AND debounce an autocomplete input in React

March 1, 2018
18 comments Web development, JavaScript, React

Let's start with some best practices for a good autocomplete input:

'f' - most common search term on Google

  • You want to start suggesting something as soon user starts typing. Apparently the most common search term on Google is f because people type that and Google's autocomplete starts suggesting Facebook (www.facebook.com).

  • If your autocomplete depends on a list of suggestions that is huge, such that you can't have all possible options preloaded in memory in JavaScript, starting a XHR request for every single input is not feasible. You have to throttle the XHR network requests.

  • Since networks are unreliable and results come back asynchronously in a possible different order from when they started, you should only populate your autocomplete list based on the latest XHR request.

  • If people type a lot in and keep ignoring autocomplete suggestions, you can calm your suggestions.

  • Unless you're Google or Amazon.com it might not make sense to suggest new words to autocomplete if what's been typed previously is not going to yield any results anyway. I.e. User's typed "Xjhfgxx1m 8cxxvkaspty efx8cnxq45jn Pet", there's often little value in suggesting "Peter" for that later term you're typing.

  • Users don't necessarily type one character at a time. On mobile, you might have some sort of autocomplete functionality with the device's keyboard. Bear that in mind.

  • You should not have to make an XHR request for the same input as done before. I.e. user types "f" then types in "fa" then backspaces so it's back to "f" again. This should only be at most 2 lookups.

To demonstrate these best practises, I'm going to use React with a mocked-out network request and mocked out UI for actual drop-down of options that usually appears underneath the input widget.

The Most Basic Version

In this version we have an event listener on every onChange and send the value of the input to the autocomplete function (called _fetch in this example):


class App extends React.Component {
  state = { q: "" };

  changeQuery = event => {
    this.setState({ q: event.target.value }, () => {
      this.autocompleteSearch();
    });
  };

  autocompleteSearch = () => {
    this._fetch(this.state.q);
  };

  _fetch = q => {
    const _searches = this.state._searches || [];
    _searches.push(q);
    this.setState({ _searches });
  };

  render() {
    const _searches = this.state._searches || [];
    return (
      <div>
        <input
          placeholder="Type something here"
          type="text"
          value={this.state.q}
          onChange={this.changeQuery}
        />
        <hr />
        <ol>
          {_searches.map((s, i) => {
            return <li key={s + i}>{s}</li>;
          })}
        </ol>
      </div>
    );
  }
}

You can try it here: No Throttle or Debounce

Note, when use it that an autocomplete lookup is done for every single change to the input (characters typed in or whole words pasted in). Typing in "Alask" at a normal speed our make an autocomplete lookup for "a", "al", "ala", "alas", and "alask".

Also worth pointing out, if you're on a CPU limited device, even if the autocomplete lookups can be done without network requests (e.g. you have a local "database" in-memory) there's still expensive DOM updates for that needs to be done for every single character/word typed in.

Throttled

What a throttle does is that it triggers predictably after a certain time. Every time. Basically, it prevents excessive or repeated calling of another function but doesn't get reset.

So if you type "t h r o t t l e" at a speed of 1 key press per 500ms the whole thing will take 8x500ms=3s and if you have a throttle on that, with a delay of 1s, it will fire 4 times.

I highly recommend using throttle-debounce to actually do the debounce. Let's rewrite our demo to use debounce:


import { throttle } from "throttle-debounce";

class App extends React.Component {
  constructor(props) {
    super(props);
    this.state = { q: "" };
    this.autocompleteSearchThrottled = throttle(500, this.autocompleteSearch);
  }

  changeQuery = event => {
    this.setState({ q: event.target.value }, () => {
      this.autocompleteSearchThrottled(this.state.q);
    });
  };

  autocompleteSearch = q => {
    this._fetch(q);
  };

  _fetch = q => {
    const _searches = this.state._searches || [];
    _searches.push(q);
    this.setState({ _searches });
  };

  render() {
    const _searches = this.state._searches || [];
    return (
      <div>
        <h2>Throttle</h2>
        <p>½ second Throttle triggering the autocomplete on every input.</p>
        <input
          placeholder="Type something here"
          type="text"
          value={this.state.q}
          onChange={this.changeQuery}
        />
        <hr />
        {_searches.length ? (
          <button
            type="button"
            onClick={event => this.setState({ _searches: [] })}
          >
            Reset
          </button>
        ) : null}
        <ol>
          {_searches.map((s, i) => {
            return <li key={s + i}>{s}</li>;
          })}
        </ol>
      </div>
    );
  }
}

One thing to notice on the React side is that the autocompleteSearch method can no longer use this.state.q because the function gets executed by the throttle function so the this is different. That's why, in this version we pass the search term as an argument instead.

You can try it here: Throttle

If you type something reasonably fast you'll notice it fires a couple of times. It's quite possible that if you type a bunch of stuff, with your eyes on the keyboard, by the time you're done you'll see it made a bunch of (mocked) autocomplete lookups whilst you weren't paying attention. You should also notice that it fired on the very first character you typed.

A cool feature about this is that if you can afford the network lookups, the interface will feel snappy. Hopefully, if your server is fast to respond to the autocomplete lookups there are quickly some suggestions there. At least it's a great indicator that the autocomplete UX is a think the user can expect as she types more.

Debounce

An alternative approach is to use a debounce. From the documentation of throttle-debounce:

"Debouncing, unlike throttling, guarantees that a function is only executed a single time, either at the very beginning of a series of calls, or at the very end."

Basically, ever time you "pile something on" it discards all the other delayed executions. Changing to this version is easy. just change import { throttle } from "throttle-debounce"; to import { debounce } from "throttle-debounce"; and change this.autocompleteSearchThrottled = throttle(1000, this.autocompleteSearch); to this.autocompleteSearchDebounced = debounce(1000, this.autocompleteSearch);

Here is the debounce version:


import { debounce } from "throttle-debounce";

class App extends React.Component {
  constructor(props) {
    super(props);
    this.state = { q: "" };
    this.autocompleteSearchDebounced = debounce(500, this.autocompleteSearch);
  }

  changeQuery = event => {
    this.setState({ q: event.target.value }, () => {
      this.autocompleteSearchDebounced(this.state.q);
    });
  };

  autocompleteSearch = q => {
    this._fetch(q);
  };

  _fetch = q => {
    const _searches = this.state._searches || [];
    _searches.push(q);
    this.setState({ _searches });
  };

  render() {
    const _searches = this.state._searches || [];
    return (
      <div>
        <h2>Debounce</h2>
        <p>
          ½ second Debounce triggering the autocomplete on every input.
        </p>
        <input
          placeholder="Type something here"
          type="text"
          value={this.state.q}
          onChange={this.changeQuery}
        />
        <hr />
        {_searches.length ? (
          <button
            type="button"
            onClick={event => this.setState({ _searches: [] })}
          >
            Reset
          </button>
        ) : null}
        <ol>
          {_searches.map((s, i) => {
            return <li key={s + i}>{s}</li>;
          })}
        </ol>
      </div>
    );
  }
}

You can try it here: Throttle

If you try it you'll notice that if you type at a steady pace (under 1 second for each input), it won't really trigger any autocomplete lookups at all. It basically triggers when you take your hands off the keyboard. But the silver lining with this approach is that if you typed "This is my long search input" it didn't bother looking things up for "this i", "this is my l", "this is my long s", "this is my long sear", "this is my long search in" since they are probably not very useful.

Best of Both World; Throttle and Debounce

The throttle works great in the beginning when you want the autocomplete widget to seem eager but if the user starts typing in a lot, you'll want to be more patient. It's quite human. If a friend is trying to remember something you're probably at first really quick to try to help with suggestions, but once you friend starts to remember and can start reciting, you patiently wait a bit more till they have said what they're going to say.

In this version we're going to use throttle (the eager one) in the beginning when the input is short and debounce (the patient one) when user has ignored the first autocomplete inputs and starting typing something longer.

Here is the version that uses both:


import { throttle, debounce } from "throttle-debounce";

class App extends React.Component {
  constructor(props) {
    super(props);
    this.state = { q: ""};
    this.autocompleteSearchDebounced = debounce(500, this.autocompleteSearch);
    this.autocompleteSearchThrottled = throttle(500, this.autocompleteSearch);
  }

  changeQuery = event => {
    this.setState({ q: event.target.value }, () => {
      const q = this.state.q;
      if (q.length < 5) {
        this.autocompleteSearchThrottled(this.state.q);
      } else {
        this.autocompleteSearchDebounced(this.state.q);
      }
    });
  };

  autocompleteSearch = q => {
    this._fetch(q);
  };

  _fetch = q => {
    const _searches = this.state._searches || [];
    _searches.push(q);
    this.setState({ _searches });
  };

  render() {
    const _searches = this.state._searches || [];
    return (
      <div>
        <h2>Throttle and Debounce</h2>
        <p>
          ½ second Throttle when input is small and ½ second Debounce when
          the input is longer.
        </p>
        <input
          placeholder="Type something here"
          type="text"
          value={this.state.q}
          onChange={this.changeQuery}
        />
        <hr />
        {_searches.length ? (
          <button
            type="button"
            onClick={event => this.setState({ _searches: [] })}
          >
            Reset
          </button>
        ) : null}
        <ol>
          {_searches.map((s, i) => {
            return <li key={s + i}>{s}</li>;
          })}
        </ol>
      </div>
    );
  }
}

In this version I cheated a little bit. The delays are different. The throttle has a delay of 500ms and the debounce as a delay of 1000ms. That makes it feel little bit more snappy there in the beginning when you start typing but once you've typed more than 5 characters, it switches to the more patient debounce version.

You can try it here: Throttle and Debounce

With this version, if you, in a steady pace typed in "south carolina" you'd notice that it does autocomplete lookups for "s", "sout" and "south carolina".

Avoiding wrongly ordered async responses

Suppose the user slowly types in "p" then "pe" then "pet", it would trigger 3 XHR requests. I.e. something like this:


fetch('/autocomplete?q=p')

fetch('/autocomplete?q=pe')

fetch('/autocomplete?q=pet')

But because all of these are asynchronous and sometimes there's unpredictable slowdowns on the network, it's not guarantee that they'll all come back in the same exact order. The solution to this is to use a "global variable" of the latest search term and then compare that to the locally scoped search term in each fetch callback promise. That might sound harder than it is. The solution basically looks like this:


class App extends React.Component {

  makeAutocompleteLookup = q => {
    // Store the latest input here scoped in the App instance.
    this.waitingFor = q;
    fetch('/autocompletelookup?q=' + q)
    .then(response => {
      if (response.status === 200) {
        // Only bother with this XHR response
        // if this query term matches what we're waiting for.
        if (q === this.waitingFor) {
          response.json()
          .then(results => {
              this.setState({results: results});
          })
        }
      }
    })
  }
}

Bonus feature; Caching

For caching the XHR requests, to avoid unnecessary network requests if the user uses backspace, the simplest solution is to maintain a dictionary of previous results as a component level instance. Let's assume you do the XHR autocomplete lookup like this initially:


class App extends React.Component {

  makeAutocompleteLookup = q => {
    const url = '/autocompletelookup?q=' + q;
    fetch(url)
    .then(response => {
      if (response.status === 200) {
        response.json()
        .then(results => {
            this.setState({ results });
        })
      }
    })
  }

}

To add caching (also a form of memoization) you can simply do this:


class App extends React.Component {

  _autocompleteCache = {};

  makeAutocompleteLookup = q => {
    const url = '/autocompletelookup?q=' + q;

    const cached = this._autocompleteCache[url];
    if (cached) {
      return Promise.resolve(cached).then(results => {
        this.setState({ results });
        });
      });
    }

    fetch(url)
    .then(response => {
      if (response.status === 200) {
        response.json()
        .then(results => {
            this.setState({ results });
        })
      }
    })
  }

}

In a more real app you might want to make that whole method always return a promise. And you might want to do something slightly smarter when response.status !== 200.

Bonus feature; Watch out for spaces

So the general gist of these above versions is that you debounce the XHR autocomplete lookups to only trigger sometimes. For short strings we trigger every, say, 300ms. When the input is longer, we only trigger when it appears the user has stopped typing. A more "advanced" approach is to trigger after a space. If I type "south carolina is a state" it's hard for a computer to know if "is", "a", or "state" is a complete word. Humans know and some English words can easily be recognized as stop words. However, what you can do is take advantage of the fact that a space almost always means the previous word was complete. It would be nice to trigger an autocomplete lookup after "south carolina" and "south carolina is" and "south carolina is a". These are also easier to deal with on the server side because, depending on your back-end, it's easier to search your database if you don't include "broken" words like "south carolina is a sta". To do that, here's one such implementation:


class App extends React.Component {

  // Just overriding the changeQuery method in this example.

  changeQuery = event => {
    const q = event.target.value
    this.setState({ q }, () => {

      // If the query term is short or ends with a
      // space, trigger the more impatient version.
      if (q.length < 5 || q.endsWith(' ')) {
        this.autocompleteSearchThrottled(q);
      } else {
        this.autocompleteSearchDebounced(q);
      }
    });
  };

  // Just overriding the changeQuery method in this example.

}

You can try it here: Throttle and Debounce with throttle on ending spaces.

Next level stuff

There is so much more that you can do for that ideal user experience. A lot depends on the context.

For example, when the input is small instead of doing a search on titles or names or whatever, you instead return a list of possible full search terms. So, if I have typed "sou" the back-end could return things like:

{
  "matches": [
     {"term": "South Carolina", "count": 123},
     {"term": "Southern", "count": 469},
     {"term": "South Dakota", "count": 98},
  ]
}

If the user selects one of these autocomplete suggestions, instead of triggering a full search you just append the selected match back into the search input widget. This is what Google does.

And if the input is longer you go ahead and actually search for the full documents. So if the input was "south caro" you return something like this:

{
  "matches": [
     {
       "title": "South Carolina Is A State", 
       "url": "/permapage/x19v093d"
     },
     {
       "title": "Best of South Carolina Parks", 
       "url": "/permapage/9vqif3z"
     },
     {
       "title": "I Live In South Carolina", 
       "url": "/permapage/abc300a1y"
     },
  ]
}

And when the XHR completes you look at what the user clicked and do something like this:


  return (<ul className="autocomplete">
    {this.state.results.map(result => {
      return <li onClick={event => {
        if (result.url) {
          document.location.href = result.url;
        } else {
          this.setState({ q: result.term });
        }
      }}>
        {result.url ? (
          <p className="document">{result.title}</p>
        ) : (
            <p className="new-term">{result.term}</p>
          )}
      </li>
    })
    }
    </ul>
  )

This is an incomplete example and more pseudo-code than a real solution but the pattern is quite nice. You're either helping the user type the full search term or if it's already a good match you can go skip the actual searching and go to the result directly.

This is how SongSearch works for example:

Suggestions for full search terms
Suggestions for full search terms

Suggestions for actual documents
Suggestions for actual documents

csso and django-pipeline

February 28, 2018
0 comments Python, Django, JavaScript

This is a quick-and-dirty how-to on how to use csso to handle the minification/compression of CSS in django-pipeline.

First create a file called compressors.py somewhere in your project. Make it something like this:


import subprocess
from pipeline.compressors import CompressorBase
from django.conf import settings


class CSSOCompressor(CompressorBase):

    def compress_css(self, css):
        proc = subprocess.Popen(
            [
                settings.PIPELINE['CSSO_BINARY'], 
                '--restructure-off'
            ],
            stdin=subprocess.PIPE,
            stdout=subprocess.PIPE,
        )
        css_out = proc.communicate(
            input=css.encode('utf-8')
        )[0].decode('utf-8')
        # was_size = len(css)
        # new_size = len(css_out)
        # print('FROM {} to {} Saved {}  ({!r})'.format(
        #     was_size,
        #     new_size,
        #     was_size - new_size,
        #     css_out[:50]
        # ))
        return css_out

In your settings.py where you configure django-pipeline make it something like this:


PIPELINE = {
    'STYLESHEETS': PIPELINE_CSS,
    'JAVASCRIPT': PIPELINE_JS,

    # These two important lines. 
    'CSSO_BINARY': path('node_modules/.bin/csso'),
    # Adjust the dotted path name to where you put your compressors.py
    'CSS_COMPRESSOR': 'peterbecom.compressors.CSSOCompressor',

    'JS_COMPRESSOR': ...

Next, install csso-cli in your project root (where you have the package.json). It's a bit confusing. The main package is called csso but to have a command line app you need to install csso-cli and when that's been installed you'll have a command line app called csso.

$ yarn add csso-cli

or

$ npm i --save csso-cli

Check that it installed:

$ ./node_modules/.bin/csso --version
3.5.0

And that's it!

--restructure-off

So csso has an advanced feature to restructure the CSS and not just remove whitespace and not needed semicolons. It costs a bit of time to do that so if you want to squeeze the extra milliseconds out, enable it. Trading time for space.
See this benchmark for a comparison with and without --restructure-off in csso.

Why csso you might ask

Check out the latest result from css-minification-benchmark. It's not super easy to read by it seems the best performing one in terms of space (bytes) is crass written by my friend and former colleague @mattbasta. However, by far the fastest is csso when using --restructre-off. Minifiying font-awesome.css with crass takes 326.52 ms versus 3.84 ms in csso.

But what's great about csso is Roman @lahmatiy Dvornov. I call him a friend too for all the help and work he's done on minimalcss (not a CSS minification tool by the way). Roman really understands CSS and csso is actively maintained by him and other smart people who actually get into the scary weeds of CSS browser hacks. That gives me more confidence to recommend csso. Also, squeezing a couple bytes extra out of your .min.css files isn't important when gzip comes into play. It's better that the minification tool is solid and stable.

Check out Roman's slides which, even if you don't read it all, goes to show that CSS minification is so much more than just regex replacing whitespace.
Also crass admits as one of its disadvantages: "Certain "CSS hacks" that use invalid syntax are unsupported".

Items function in JavaScript for looping over dictionaries like Python

February 23, 2018
1 comment JavaScript, React

Too many times I've written code like this:


class MyComponent extends React.PureComponent {
  render() {
    return <ul>
      {Object.keys(this.props.someDictionary).map(key => {
        return <li key={key}><b>{key}:</b> {this.props.someDictionary[key]}</li> 
      })}
    </ul>
  }
}

The clunky thing about this is that you have to reference the dictionary twice. Makes it harder to refactor. In Python, you do this instead:


for key, value in some_dictionary.items():
    print(f'$key: $value')

To do the same in JavaScript make a function like this:


function items(dict, fn) {
  return Object.keys(dict).map((key, i) => {
    return fn(key, dict[key], i)
  })
}

Now you can use it "more like Python":


class MyComponent extends React.PureComponent {
  render() {
    return <ul>
      {items(this.props.someDictionary, (key, value) => {
        return <li key={key}><b>{key}:</b> {value}</li> 
      })}
    </ul>
  }
}

Example on CodeSandbox here

UPDATE

Thanks to @Osmose and @saltycrane for alerting me to Object.entries().


class MyComponent extends React.PureComponent {
  render() {
    return <ul>
      {Object.entries(this.props.someDictionary).map(([key, value]) => {
        return <li key={key}><b>{key}:</b> {value}</li> 
      })}
    </ul>
  }
}

Updated CodeSandbox here

Run something forever in bash until you want to stop it

February 13, 2018
6 comments Linux

I often use this in various projects. I find it very useful. Thought I'd share to see if others find it useful.

Running something forever

Suppose you have some command that you want to run a lot. One way is to do this:

$ ./manage.py run-some-command && \
  ./manage.py run-some-command && \
  ./manage.py run-some-command && \
  ./manage.py run-some-command && \
  ./manage.py run-some-command && \
  ./manage.py run-some-command && \
  ./manage.py run-some-command && \
  ./manage.py run-some-command && \
  ./manage.py run-some-command && \
  ./manage.py run-some-command

That runs the command 10 times. Clunky but effective.

Another alternative is to hijack the watch command. By default it waits 2 seconds between each run but if the command takes longer than 2 seconds, it'll just wait. Running...

$ watch ./manage.py run-some-command

Is almost the same as running...:

$ clear && sleep 2 && ./manage.py run-some-command && \
  clear && sleep 2 && ./manage.py run-some-command && \
  clear && sleep 2 && ./manage.py run-some-command && \
  clear && sleep 2 && ./manage.py run-some-command && \
  clear && sleep 2 && ./manage.py run-some-command && \
  clear && sleep 2 && ./manage.py run-some-command && \
  ...
  ...forever until you Ctrl-C it...

But that's clunky too because you might not want it to clear the screen between each run and you get an un-necessary delay between each run.

The biggest problem is that with using watch or copy-n-paste the command many times with && between is that if you need to stop it you have to Ctrl-C and that might kill the command at a precious time.

A better solution

The important thing is that if you want to stop the command repeater, is that it gets to finish what it's working on at the moment.

Here's a great and simple solution:


#!/usr/bin/env bash
set -eo pipefail

_stopnow() {
  test -f stopnow && echo "Stopping!" && rm stopnow && exit 0 || return 0
}

while true
do
    _stopnow
    # Below here, you put in your command you want to run:

    ./manage.py run-some-command
done

Save that file as run-forever.sh and now you can do this:

$ bash run-forever.sh

It'll sit there and do its thing over and over. If you want to stop it (from another terminal):

$ touch stopnow

(the file stopnow will be deleted after it's spotted once)

Getting fancy

Instead of taking this bash script and editing it every time you need it to run a different command you can make it a globally available command. Here's how I do it:


#!/usr/bin/env bash
set -eo pipefail


count=0

_stopnow() {
    count="$(($count+1))"
    test -f stopnow && \
      echo "Stopping after $count iterations!" && \
      rm stopnow && exit 0 || return 0
}

control_c()
# run if user hits control-c
{
  echo "Managed to do $count iterations"
  exit $?
}

# trap keyboard interrupt (control-c)
trap control_c SIGINT

echo "To stop this forever loop created a file called stopnow."
echo "E.g: touch stopnow"
echo ""
echo "Now going to run '$@' forever"
echo ""
while true
do
    _stopnow

    eval $@

    # Do this in case you accidentally pass an argument
    # that finishes too quickly.
    sleep 1

done

This code in a Gist here.

Put this file in ~/bin/run-forever.sh and chmod +x ~/bin/run-forever.sh.

Now you can do this:

$ run-forever.sh ./manage.py run-some-command

If the command you want to run, forever, requires an operator you have to wrap everything in single quotation marks. For example:

$ run-forever.sh './manage.py run-some-command && echo "Cooling CPUs..." && sleep 10'

Be very careful with your add_header in Nginx! You might make your site insecure

February 11, 2018
17 comments Linux, Web development, Nginx

tl;dr; When you use add_header in a location block in Nginx, it undoes all "parent" add_header directives. Dangerous!

Gist of the problem is this:

There could be several add_header directives. These directives are inherited from the previous level if and only if there are no add_header directives defined on the current level.

From the documentation on add_header

The grand but subtle mistake

Basically, I had this:

server {
    server_name example.com;

    ...gzip...
    ...ssl...
    ...root...

    # Great security headers...
    add_header X-Frame-Options SAMEORIGIN;
    add_header X-XSS-Protection "1; mode=block";
    ...more security headers...

    location / {
        try_files    $uri /index.html;
    }
}

And when you curl it, you can see that it works:

$ curl -I https://example.com
[snip]
X-Frame-Options: SAMEORIGIN
X-Content-Type-Options: nosniff
X-XSS-Protection: 1; mode=block
Strict-Transport-Security: max-age=63072000; includeSubdomains; preload

The mistake I had, was that I added a new add_header inside a relevant location block. If you do that, all the other "global" add_headers are dropped.
E.g.

server {
    server_name example.com;

    ...gzip...
    ...ssl...
    ...root...

    # Great security headers...
    add_header X-Frame-Options SAMEORIGIN;
    add_header X-XSS-Protection "1; mode=block";
    ...more security headers...

    location / {
        try_files    $uri /index.html;
        # NOTE! Adding some more headers here
+       add_header X-debug-whats-going-on on; 
    }
}

Now, same curl command:

$ curl -I https://example.com
[snip]
X-debug-whats-going-on: on

Bad score on Observatory for www.peterbe.com
Yikes! Now those other useful security headers are gone!

Here are your options:

  1. Don't add headers like that inside location blocks. Yeah, that's not always a choice.
  2. Copy-n-paste all the general security add_header blocks into the location blocks where you have to have "custom" add_header entries.
  3. Use an include file, see below.

How to include files

First create a new file, like /etc/nginx/snippets/general-security-headers.conf then put this into it:

# Great security headers...
add_header X-Frame-Options SAMEORIGIN;
add_header X-XSS-Protection "1; mode=block";
...more security headers...
# More realistically, see https://gist.github.com/plentz/6737338

Now, instead of saying these add_header lines in your /etc/nginx/sites-enabled/example.conf change that to:

server {
    server_name example.com;

    ...gzip...
    ...ssl...
    ...root...

    include /etc/nginx/snippets/general-security-headers.conf;

    location / {
        try_files    $uri /index.html;
        # Note! This gets included *again* because
        # this location block needs its own custom add_header
        # directives.
        include /etc/nginx/snippets/general-security-headers.conf;
        # NOTE! Adding some more headers here
        add_header X-debug-whats-going-on on; 
    }
}

(You need to use your imagination that a real Nginx config site probably has many different more complex location directives)

It's arguably a bit clunky but it works and it's the best of both worlds. The right security headers for all locations and ability to set custom add_header directives for specific locations.

Discussion

I'm most disappointed in myself for not noticing. Not for not noticing this in the Nginx documentation, but that I didn't check my security headers on more than one path. But I'm also quite disappointed in Nginx for this rather odd behaviour. To quote my security engineer at Mozilla, April King:

"add" doesn't usually mean "subtract everything else"

She agreed with me that the way it works is counter-intuitive and showed me this snippet which uses include files the same way.

Show size of every PostgreSQL database you have

February 7, 2018
0 comments PostgreSQL

tl;dr; SELECT pg_database.datname, pg_database_size(pg_database.datname), pg_size_pretty(pg_database_size(pg_database.datname)) FROM pg_database ORDER by 2 DESC;

I recently had to transfer all my postgres data for my local databases on the laptop. Although I, unfortunately, can't remember what it was before I deleted a bunch of databases, now after some clean up, all my PostgreSQL databases weighs 12GB.

To find out the size of all your database, start psql like this:

$ psql postgres

Then run this:


SELECT pg_database.datname, 
pg_database_size(pg_database.datname), 
pg_size_pretty(pg_database_size(pg_database.datname)) 
FROM pg_database ORDER by 2 DESC;

Here's what it looked like on my laptop:

postgres=# SELECT pg_database.datname,
postgres-# pg_database_size(pg_database.datname),
postgres-# pg_size_pretty(pg_database_size(pg_database.datname))
postgres-# FROM pg_database ORDER by 2 DESC;
         datname          | pg_database_size | pg_size_pretty
--------------------------+------------------+----------------
 songsearch               |      10689224876 | 10194 MB
 air_mozilla_org          |        355639812 | 339 MB
 kl                       |        239297028 | 228 MB
 kl2                      |        239256068 | 228 MB
 peterbecom               |        191914500 | 183 MB
 kintobench               |        125968556 | 120 MB
 airmozilla               |         41640452 | 40 MB
 socorro_webapp           |         32530948 | 31 MB
 tecken                   |         26706092 | 25 MB
 dailycookie              |         12935684 | 12 MB
 autocompeter             |         12313092 | 12 MB
 socorro_integration_test |         11428356 | 11 MB
 breakpad                 |         11313668 | 11 MB
 test_peterbecom          |          9298436 | 9081 kB
 battleshits              |          9028100 | 8817 kB
 thuawood2                |          8716804 | 8513 kB
 thuawood                 |          8667652 | 8465 kB
 fastestdb                |          8012292 | 7825 kB
 premailer                |          7676420 | 7497 kB
 crontabber               |          7586308 | 7409 kB
 postgres                 |          7536812 | 7360 kB
 crontabber_exampleapp    |          7488004 | 7313 kB
 whatsdeployed            |          7414276 | 7241 kB
 socorro_test             |          7315972 | 7145 kB
 template1                |          7307780 | 7137 kB
 template0                |          7143940 | 6977 kB
(26 rows)

Component, component function or plain function in React

February 6, 2018
1 comment React

tl;dr; Use React.PureComponent (or React.Component) if your component contains, or might contain, non-trivial logic that might affect it rendering or not. For all other cases, use a function, especially if it's not React specific.

Your choices

When you have state, especially good old this.state and this.setState you have to use React.PureComponent (or React.Component if you must).

For stateless functions, where you're just getting some props in, perhaps massaging them and rendering some JSX, you have choices.
You can write a React component in these three different ways:

Component


class MyComponent extends React.PureComponent {
  render() {
    return <h1>Hello {this.props.name}</h1>
  }
}

Component function


const MyComponent = ({ name }) => {
  return <h1>Hello {name}</h1>
}

Plain function


const MyComponent = name => {
  return <h1>Hello {name}</h1>
}

The first two can be used like this:


return (
  <div>
    <MyComponent name="Peter"/>
  </div>
)

The last one can be called directly:


return (
  <div>
    {MyComponent("Peter")}/>
  </div>
)

To be exact, you can actually call the second, component function, like this too:


return (
  <div>
    {MyComponent({name: "Peter"})}
  </div>
)

Example CodeSandbox here.

Each one has its strength and weaknesses.

Pros & cons for class MyComponent extends React.PureComponent

  • You have access to life-cycle hooks such as componentDidMount.
  • It's easy to add class-level public field functions, e.g. onButtonClick = event => {...} if inlining isn't convenient.
  • It has a shouldComponentUpdate method which means it can avoid a potentially expensive render execution when the props (or state) haven't changed.
  • If you decide you want to add some state, you just need to add a constructor that sets up this.state = {...}
  • If you use prop-types you can, depending on your babel transforms, set it as a static class member inside the class without having to repeat the name of the component outside (e.g. not having to do MyComponent.propTypes = {...} outside)
  • Maybe slower. I heard this rumor. Let's see later if it checks out.
  • Doesn't feel as simple because it's not a regular old function.
  • Might make you feel uncertain that it might have side-effects that aren't obvious.

Pros & cons for const MyComponent = (...) =>

  • Clearly it's just doing one thing. Rendering. No buts, ifs, or maybes.
  • Doesn't say React all over and thus should be easy to reason about outside React, such as in a unit test.
  • If it doesn't have to do all the mounting life-cycle hooks, perhaps that's valuable time saved.
  • There's something hip about writing functions and feeling "functional" without all that verbosity and boilerplate like class and extends and render etc.
  • Maybe faster. We'll see.

Benchmarking the difference

I don't know with confidence if this is the right way to test this but I really wanted to avoid process.env.NODE_ENV==='development' and I wanted to run each variant a bunch of times, because it feels more realistic, so as to avoid the slowness of the initial mounting.

So I made an app that looks like this:


class Components extends React.Component {
  render() {
    return <Component100 count={this.props.count} />;
  }
}

export default Components;

class Component100 extends React.PureComponent {
  render() {
    return <Component99 count={this.props.count} />;
  }
}
class Component99 extends React.PureComponent {
  render() {
    return <Component98 count={this.props.count} />;
  }
}

//...
//...you can imagine...
//...

class Component1 extends React.PureComponent {
  render() {
    return <Component0 count={this.props.count} />;
  }
}

class Component0 extends React.PureComponent {
  render() {
    collect('Components', performance.now());
    return <h1>Component0: {this.props.count}</h1>;
  }
}

This long chain of components calling "sub-components" starts right after the prop at the top changes. In the App that parents all of the variants, when the state changes the props change and it trickles down to that final last component. By taking a timestamp right before changing the state and during that last render you get a rough timeline for how long the whole chain took to render.

See the variants here:

Perhaps it's best to skim the code of the App.js too. It's a bit messy and there's a bunch of whacky code that uses global window to log all the timestamps but the gist is that it measures the few milliseconds it takes before a re-render is triggered until the final components render function gets called.

The app has a little hacky interval function that randomly switches between the different variants every 2 seconds and every 300 milliseconds it clicks a button, which changes the state which triggers a re-render.

Benchmark results

Results

Component style Median Comparison
Components 3.46ms 100%
ComponentFunctions 3.04ms 14% faster
Functions 2.02ms 71% faster

This was done using React 16.2.0 with process.env.NODE_ENV === 'production' in Firefox 60.

Sample app
You can try for yourself here: https://peterbe.github.io/function-or-component/

It might break when you click Reset. If it doesn't work very well in github.io, just download it and test locally.

Discussion

Here's my rule of thumb, the life-cycle hooks are awesome. I often write a component, using ...extends React.PureComponent even though it could be a plain function. But over time, eventually you expand it and realize you need some life cycle hook. Or you might find that writing inline functions is getting messy. Or, you realize that this component is sometimes unnecessarily called by a more complicated parent, with the same props as last time!

The performance penalty, for using full React components, is small. It exists, but it's probably dwarfed by other costs such as mounting not to mention actual DOM updating. It's also very likely that your components could benefit more from avoiding render (which only shouldComponentUpdate really can do) than the cost of calling it. Meaning, if the slower component only has to render 500 times, marginally slower, than the function component rendering 1,000 times, then the slower sometimes-not-needing-to-render will eventually win the performance battle.

There is still value in the functional stateless component. See the pros & cons above. But one rule of thumb I have is that if the component is really simple and contains no fancy logic that might affect its rendering or not rendering, then use components as functions. They're "sending a message" (to the code reader) by being brief and simple. For example, I have this little snippet in my Common.js module:


export const formatFileSize = (bytes, decimals = 0) => {
  if (!bytes) return '0 bytes'
  const k = 1024
  const dm = decimals + 1 || 3
  const sizes = ['bytes', 'KB', 'MB', 'GB', 'TB', 'PB', 'EB', 'ZB', 'YB']
  const i = Math.floor(Math.log(bytes) / Math.log(k))
  return parseFloat((bytes / Math.pow(k, i)).toFixed(dm)) + ' ' + sizes[i]
}

It's got nothing to do with React and that becomes extra obvious simply my looking at it. It's cleary got just one job. It's used a lot and often by more complicated components.

Last but not least; I'm very aware that the much more experienced React gurus of the world have already said something similar but with more accuracy. But I didn't want to just blurt out my opinion without adding some meat and some numbers to it. And I've always disliked the confusion that there's a choice at all so hopefully this blog post will help someone else who still suffers from having to wonder when to use which.

This tweet sums it up well:
Craig Kerstiens tweet

Convert web page to PDF, nicely

February 4, 2018
0 comments Misc. links, Web development

I saw about this service on Hacker News and I'm impressed. It can convert any article-like web page into a PDF. Not the first time we've seen that but this service really gets it right.

Here's example, when I print one of my own blog posts:

Blog post on Simple Print
Blog post on Simple Print. PDF download

Blog post via Firefox print to PDF
Regular print to PDF

Does that look scrumptious? It drops one of the images but it really gets the layout right.

I'm not sure this beats the neat integration that Pocket has but it certainly is a nice hack. Which reminds me, I really need to improve my print.css stylesheet.

Fastest way to unzip a zip file in Python

January 31, 2018
15 comments Python

So the context is this; a zip file is uploaded into a web service and Python then needs extract that and analyze and deal with each file within. In this particular application what it does is that it looks at the file's individual name and size, compares that to what has already been uploaded in AWS S3 and if the file is believed to be different or new, it gets uploaded to AWS S3.

Uploads today
The challenge is that these zip files that come in are huuuge. The average is 560MB but some are as much as 1GB. Within them, there are mostly plain text files but there are some binary files in there too that are huge. It's not unusual that each zip file contains 100 files and 1-3 of those make up 95% of the zip file size.

At first I tried unzipping the file, in memory, and deal with one file at a time. That failed spectacularly with various memory explosions and EC2 running out of memory. I guess it makes sense. First you have the 1GB file in RAM, then you unzip each file and now you have possibly 2-3GB all in memory. So, the solution, after much testing, was to dump the zip file to disk (in a temporary directory in /tmp) and then iterate over the files. This worked much better but I still noticed the whole unzipping was taking up a huge amount of time. Is there perhaps a way to optimize that?

Baseline function

First it's these common functions that simulate actually doing something with the files in the zip file:


def _count_file(fn):
    with open(fn, 'rb') as f:
        return _count_file_object(f)


def _count_file_object(f):
    # Note that this iterates on 'f'.
    # You *could* do 'return len(f.read())'
    # which would be faster but potentially memory 
    # inefficient and unrealistic in terms of this 
    # benchmark experiment. 
    total = 0
    for line in f:
        total += len(line)
    return total

Here's the simplest one possible:


def f1(fn, dest):
    with open(fn, 'rb') as f:
        zf = zipfile.ZipFile(f)
        zf.extractall(dest)

    total = 0
    for root, dirs, files in os.walk(dest):
        for file_ in files:
            fn = os.path.join(root, file_)
            total += _count_file(fn)
    return total

If I analyze it a bit more carefully, I find that it spends about 40% doing the extractall and 60% doing the looping over files and reading their full length.

First attempt

My first attempt was to try to use threads. You create an instance of zipfile.ZipFile, extract every file name within and start a thread for each name. Each thread is given a function that does the "meat of the work" (in this benchmark, iterating over the file and getting its total size). In reality that function does a bunch of complicated S3, Redis and PostgreSQL stuff but in my benchmark I just made it a function that figures out the total length of file. The thread pool function:


def f2(fn, dest):

    def unzip_member(zf, member, dest):
        zf.extract(member, dest)
        fn = os.path.join(dest, member.filename)
        return _count_file(fn)

    with open(fn, 'rb') as f:
        zf = zipfile.ZipFile(f)
        futures = []
        with concurrent.futures.ThreadPoolExecutor() as executor:
            for member in zf.infolist():
                futures.append(
                    executor.submit(
                        unzip_member,
                        zf,
                        member,
                        dest,
                    )
                )
            total = 0
            for future in concurrent.futures.as_completed(futures):
                total += future.result()
    return total

Result: ~10% faster

Second attempt

So perhaps the GIL is blocking me. The natural inclination is to try to use multiprocessing to spread the work across multiple available CPUs. But doing so has the disadvantage that you can't pass around a non-pickleable object so you have to send just the filename to each future function:


def unzip_member_f3(zip_filepath, filename, dest):
    with open(zip_filepath, 'rb') as f:
        zf = zipfile.ZipFile(f)
        zf.extract(filename, dest)
    fn = os.path.join(dest, filename)
    return _count_file(fn)



def f3(fn, dest):
    with open(fn, 'rb') as f:
        zf = zipfile.ZipFile(f)
        futures = []
        with concurrent.futures.ProcessPoolExecutor() as executor:
            for member in zf.infolist():
                futures.append(
                    executor.submit(
                        unzip_member_f3,
                        fn,
                        member.filename,
                        dest,
                    )
                )
            total = 0
            for future in concurrent.futures.as_completed(futures):
                total += future.result()
    return total

Result: ~300% faster

That's cheating!

The problem with using a pool of processors is that it requires that the original .zip file exists on disk. So in my web server, to use this solution, I'd first have to save the in-memory ZIP file to disk, then invoke this function. Not sure what the cost of that it's not likely to be cheap.

Well, it doesn't hurt to poke around. Perhaps, it could be worth it if the extraction was significantly faster.

But remember! This optimization depends on using up as many CPUs as it possibly can. What if some of those other CPUs are needed for something else going on in gunicorn? Those other processes would have to patiently wait till there's a CPU available. Since there's other things going on in this server, I'm not sure I'm willing to let on process take over all the other CPUs.

Conclusion

Doing it serially turns out to be quite nice. You're bound to one CPU but the performance is still pretty good. Also, just look at the difference in the code between f1 and f2! With concurrent.futures pool classes you can cap the number of CPUs it's allowed to use but that doesn't feel great either. What if you get the number wrong in a virtual environment? Of if the number is too low and don't benefit any from spreading the workload and now you're just paying for overheads to move the work around?

I'm going to stick with zipfile.ZipFile(file_buffer).extractall(temp_dir). It's good enough for this.

Want to try your hands on it?

I did my benchmarking using a c5.4xlarge EC2 server. The files can be downloaded from:

wget https://www.peterbe.com/unzip-in-parallel/hack.unzip-in-parallel.py
wget https://www.peterbe.com/unzip-in-parallel/symbols-2017-11-27T14_15_30.zip

The .zip file there is 34MB which is relatively small compared to what's happening on the server.

The hack.unzip-in-parallel.py is a hot mess. It contains a bunch of terrible hacks and ugly stuff but hopefully it's a start.