Avoid async when all you have is (SSD) disk I/O in NodeJS

October 24, 2019
1 comment Node, JavaScript

tl;dr; If you know that the only I/O you have is disk and the disk is SSD, then synchronous is probably more convenient, faster, and more memory lean.

I'm not a NodeJS expert so I could really do with some eyes on this.

There is little doubt in my mind that it's smart to use asynchronous ideas when your program has to wait for network I/O. Because network I/O is slow, it's better to let your program work on something else whilst waiting. But disk is actually fast. Especially if you have SSD disk.

The context

I'm working on a Node program that walks a large directory structure and looks for certain file patterns, reads those files, does some processing and then exits. It's a cli basically and it's supposed to work similar to jest where you tell it to go and process files and if everything worked, exit with 0 and if anything failed, exit with something >0. Also, it needs to be possible to run it so that it exits immediately on the first error encountered. This is similar to running jest --bail.

My program needs to process thousands of files and although there are thousands of files, they're all relatively small. So first I wrote a simple reference program: https://github.com/peterbe/megafileprocessing/blob/master/reference.js
What it does is that it walks a directory looking for certain .json files that have certain keys that it knows about. Then, just computes the size of the values and tallies that up. My real program will be very similar except it does a lot more with each .json file.

You run it like this:


▶ CHAOS_MONKEY=0.001 node reference.js ~/stumptown-content/kumadocs -q
Error: Chaos Monkey!
    at processDoc (/Users/peterbe/dev/JAVASCRIPT/megafileprocessing/reference.js:37:11)
    at /Users/peterbe/dev/JAVASCRIPT/megafileprocessing/reference.js:80:21
    at Array.forEach (<anonymous>)
    at main (/Users/peterbe/dev/JAVASCRIPT/megafileprocessing/reference.js:78:9)
    at Object.<anonymous> (/Users/peterbe/dev/JAVASCRIPT/megafileprocessing/reference.js:99:20)
    at Module._compile (internal/modules/cjs/loader.js:956:30)
    at Object.Module._extensions..js (internal/modules/cjs/loader.js:973:10)
    at Module.load (internal/modules/cjs/loader.js:812:32)
    at Function.Module._load (internal/modules/cjs/loader.js:724:14)
    at Function.Module.runMain (internal/modules/cjs/loader.js:1025:10)
Total length for 4057 files is 153953645
1 files failed.

(The environment variable CHAOS_MONKEY=0.001 makes it so there's a 0.1% chance it throws an error)

It processed 4,057 files and one of those failed (thanks to the "chaos monkey").
In its current state that (on my MacBook) that takes about 1 second.

It's not perfect but it's a good skeleton. Everything is synchronous. E.g.


function main(args) {
  // By default, don't exit if any error happens
  const { bail, quiet, root } = parseArgs(args);
  const files = walk(root, ".json");
  let totalTotal = 0;
  let errors = 0;
  files.forEach(file => {
    try {
      const total = processDoc(file, quiet);
      !quiet && console.log(`${file} is ${total}`);
      totalTotal += total;
    } catch (err) {
      if (bail) {
        throw err;
      } else {
        console.error(err);
        errors++;
      }
    }
  });
  console.log(`Total length for ${files.length} files is ${totalTotal}`);
  if (errors) {
    console.warn(`${errors} files failed.`);
  }
  return errors ? 1 : 0;
}

And inside the processDoc function it used const content = fs.readFileSync(fspath, "utf8");.

I/Os compared

@amejiarosario has a great blog post called "What every programmer should know about Synchronous vs. Asynchronous Code". In it, he has this great bar chart:

Latency vs. System Event

If you compare "SSD I/O" with "Network SFO/NCY" the difference is that SSD I/O is 456 times "faster" than SFO-to-NYC network I/O. I.e. the latency is 456 times less.

Another important aspect when processing lots of files is garbage collection. When running synchronous, it can garbage collect as soon as it has processed one file before moving on to the next. If it was asynchronous, as soon as it yields to move on to the next file, it might hold on to memory from the first file. Why does this matter? Because if the memory-usage when processing many files asynchronously bloat so hard that it actually crashes with an out-of-memory error. So what matters is avoiding that. It's OK if the program can use lots of memory if it needs to, but it's really bad if it crashes.

One way to measure this is to use /usr/bin/time -l (at least that's what it's called on macOS). For example:

▶ /usr/bin/time -l node reference.js ~/stumptown-content/kumadocs -q
Total length for 4057 files is 153970749
        0.75 real         0.58 user         0.23 sys
  57221120  maximum resident set size
         0  average shared memory size
         0  average unshared data size
         0  average unshared stack size
     64160  page reclaims
         0  page faults
         0  swaps
         0  block input operations
         0  block output operations
         0  messages sent
         0  messages received
         0  signals received
         0  voluntary context switches
      1074  involuntary context switches

Its maximum memory usage total was 57221120 bytes (55MB) in this example.

Introduce asynchronous file reading

Let's change the reference implementation to use const content = await fsPromises.readFile(fspath, "utf8");. We're still using files.forEach(file => { but within the loop the whole function is prefixed with async function main() { now. Like this:


async function main(args) {
  // By default, don't exit if any error happens
  const { bail, quiet, root } = parseArgs(args);
  const files = walk(root, ".json");
  let totalTotal = 0;
  let errors = 0;

  let total;
  for (let file of files) {
    try {
      total = await processDoc(file, quiet);
      !quiet && console.log(`${file} is ${total}`);
      totalTotal += total;
    } catch (err) {
      if (bail) {
        throw err;
      } else {
        console.error(err);
        errors++;
      }
    }
  }
  console.log(`Total length for ${files.length} files is ${totalTotal}`);
  if (errors) {
    console.warn(`${errors} files failed.`);
  }
  return errors ? 1 : 0;
}

Let's see how it works:

▶ /usr/bin/time -l node async1.js ~/stumptown-content/kumadocs -q
Total length for 4057 files is 153970749
        1.31 real         1.01 user         0.49 sys
  68898816  maximum resident set size
         0  average shared memory size
         0  average unshared data size
         0  average unshared stack size
     68107  page reclaims
         0  page faults
         0  swaps
         0  block input operations
         0  block output operations
         0  messages sent
         0  messages received
         0  signals received
         0  voluntary context switches
     62562  involuntary context switches

That means it maxed out at 68898816 bytes (65MB).

You can already see a difference. 0.79 seconds and 55MB for synchronous and 1.31 seconds and 65MB for asynchronous.

But to really measure this, I wrote a simple Python program that runs this repeatedly and reports a min/median on time and max on memory:

▶ python3 wrap_time.py /usr/bin/time -l node reference.js ~/stumptown-content/kumadocs -q
...
TIMES
BEST:   0.74s
WORST:  0.84s
MEAN:   0.78s
MEDIAN: 0.78s
MAX MEMORY
BEST:   53.5MB
WORST:  55.3MB
MEAN:   54.6MB
MEDIAN: 54.8MB

And for the asynchronous version:

▶ python3 wrap_time.py /usr/bin/time -l node async1.js ~/stumptown-content/kumadocs -q
...
TIMES
BEST:   1.28s
WORST:  1.82s
MEAN:   1.39s
MEDIAN: 1.31s
MAX MEMORY
BEST:   65.4MB
WORST:  67.7MB
MEAN:   66.7MB
MEDIAN: 66.9MB

Promise.all version

I don't know if the async1.js is realistic. More realistically you'll want to not wait for one file to be processed (asynchronously) but start them all at the same time. So I made a variation of the asynchronous version that looks like this instead:


async function main(args) {
  // By default, don't exit if any error happens
  const { bail, quiet, root } = parseArgs(args);
  const files = walk(root, ".json");
  let totalTotal = 0;
  let errors = 0;

  let values;
  values = await Promise.all(
    files.map(async file => {
      try {
        total = await processDoc(file, quiet);
        !quiet && console.log(`${file} is ${total}`);
        return total;
      } catch (err) {
        if (bail) {
          console.error(err);
          process.exit(1);
        } else {
          console.error(err);
          errors++;
        }
      }
    })
  );
  totalTotal = values.filter(n => n).reduce((a, b) => a + b);
  console.log(`Total length for ${files.length} files is ${totalTotal}`);
  if (errors) {
    console.warn(`${errors} files failed.`);
    throw new Error("More than 0 errors");
  }
}

You can see the whole file here: async2.js

The key difference is that it uses await Promise.all(files.map(...)) instead of for (let file of files) {.
Also, to accomplish the ability to bail on the first possible error it needs to use process.exit(1); within the callbacks. Not sure if that's right but from the outside, you get the desired effect as a cli program. Let's measure it too:

▶ python3 wrap_time.py /usr/bin/time -l node async2.js ~/stumptown-content/kumadocs -q
...
TIMES
BEST:   1.44s
WORST:  1.61s
MEAN:   1.52s
MEDIAN: 1.52s
MAX MEMORY
BEST:   434.0MB
WORST:  460.2MB
MEAN:   453.4MB
MEDIAN: 456.4MB

Note how this uses almost 10x max. memory. That's dangerous if the processing is really memory hungry individually.

When asynchronous is right

In all of this, I'm assuming that the individual files are small. (Roughly, each file in my experiment is about 50KB)
What if the files it needs to read from disk are large?

As a simple experiment read /users/peterbe/Downloads/Keybase.dmg 20 times and just report its size:


for (let x = 0; x < 20; x++) {
  fs.readFile("/users/peterbe/Downloads/Keybase.dmg", (err, data) => {
    if (err) throw err;
    console.log(`File size#${x}: ${Math.round(data.length / 1e6)} MB`);
  });
}

See the simple-async.js here. Basically it's this:


for (let x = 0; x < 20; x++) {
  fs.readFile("/users/peterbe/Downloads/Keybase.dmg", (err, data) => {
    if (err) throw err;
    console.log(`File size#${x}: ${Math.round(data.length / 1e6)} MB`);
  });
}

Results are:

▶ python3 wrap_time.py /usr/bin/time -l node simple-async.js
...
TIMES
BEST:   0.84s
WORST:  4.32s
MEAN:   1.33s
MEDIAN: 0.97s
MAX MEMORY
BEST:   1851.1MB
WORST:  3079.3MB
MEAN:   2956.3MB
MEDIAN: 3079.1MB

And the equivalent synchronous simple-sync.js here.


for (let x = 0; x < 20; x++) {
  const largeFile = fs.readFileSync("/users/peterbe/Downloads/Keybase.dmg");
  console.log(`File size#${x}: ${Math.round(largeFile.length / 1e6)} MB`);
}

It performs like this:

▶ python3 wrap_time.py /usr/bin/time -l node simple-sync.js
...
TIMES
BEST:   1.97s
WORST:  2.74s
MEAN:   2.27s
MEDIAN: 2.18s
MAX MEMORY
BEST:   1089.2MB
WORST:  1089.7MB
MEAN:   1089.5MB
MEDIAN: 1089.5MB

So, almost 2x as slow but 3x as much max. memory.

Lastly, instead of an iterative loop, let's start 20 readers at the same time (simple-async2.js):


Promise.all(
  [...Array(20).fill()].map((_, x) => {
    return fs.readFile("/users/peterbe/Downloads/Keybase.dmg", (err, data) => {
      if (err) throw err;
      console.log(`File size#${x}: ${Math.round(data.length / 1e6)} MB`);
    });
  })
);

And it performs like this:

▶ python3 wrap_time.py /usr/bin/time -l node simple-async2.js
...
TIMES
BEST:   0.86s
WORST:  1.09s
MEAN:   0.96s
MEDIAN: 0.94s
MAX MEMORY
BEST:   3079.0MB
WORST:  3079.4MB
MEAN:   3079.2MB
MEDIAN: 3079.2MB

So quite naturally, the same total time as the simple async version but uses 3x max. memory every time.

Ergonomics

I'm starting to get pretty comfortable with using promises and async/await. But I definitely feel more comfortable without. Synchronous programs read better from an ergonomics point of view. The async/await stuff is just Promises under the hood and it's definitely an improvement but the synchronous versions just have a simpler "feeling" to it.

Conclusion

I don't think it's a surprise that the overhead of event switching adds more time than its worth when the individual waits aren't too painful.

A major flaw with synchronous programs is that they rely on the assumption that there's no really slow I/O. So what if the program grows and morphs so that it someday does depend on network I/O then your synchronous program is "screwed" since an asynchronous version would run circles around it.

The general conclusion is; if you know that the only I/O you have is disk and the disk is SSD, then synchronous is probably more convenient, faster, and more memory lean.

Update to speed comparison for Redis vs PostgreSQL storing blobs of JSON

September 30, 2019
2 comments Redis, Nginx, Web Performance, Python, Django, PostgreSQL

Last week, I blogged about "How much faster is Redis at storing a blob of JSON compared to PostgreSQL?". Judging from a lot of comments, people misinterpreted this. (By the way, Redis is persistent). It's no surprise that Redis is faster.

However, it's a fact that I have do have a lot of blobs stored and need to present them via the web API as fast as possible. It's rare that I want to do relational or batch operations on the data. But Redis isn't a slam dunk for simple retrieval because I don't know if I trust its integrity with the 3GB worth of data that I both don't want to lose and don't want to load all into RAM.

But is it entirely wrong to look at WHICH database to get the best speed?

Reviewing this corner of Song Search helped me rethink this. PostgreSQL is, in my view, a better database for storing stuff. Redis is faster for individual lookups. But you know what's even faster? Nginx

Nginx??

The way the application works is that a React web app is requesting the Amazon product data for the sake of presenting an appropriate affiliate link. This is done by the browser essentially doing:


const response = await fetch('https://songsear.ch/api/song/5246889/amazon');

Internally, in the app, what it does is that it looks this up, by ID, on the AmazonAffiliateLookup ORM model. Suppose it wasn't there in the PostgreSQL, it uses the Amazon Affiliate Product Details API, to look it up and when the results come in it stores a copy of this in PostgreSQL so we can re-use this URL without hitting rate limits on the Product Details API. Lastly, in a piece of Django view code, it carefully scrubs and repackages this result so that only the fields used by the React rendering code is shipped between the server and the browser. That "scrubbed" piece of data is actually much smaller. Partly because it limits the results to the first/best match and it deletes a bunch of things that are never needed such as ProductTypeName, Studio, TrackSequence etc. The proportion is roughly 23x. I.e. of the 3GB of JSON blobs stored in PostgreSQL only 130MB is ever transported from the server to the users.

Again, Nginx?

Nginx has a built in reverse HTTP proxy cache which is easy to set up but a bit hard to do purges on. The biggest flaw, in my view, is that it's hard to get a handle of how much RAM this it's eating up. Well, if the total possible amount of data within the server is 130MB, then that is something I'm perfectly comfortable to let Nginx handle cache in RAM.

Good HTTP performance benchmarking is hard to do but here's a teaser from my local laptop version of Nginx:

▶ hey -n 10000 -c 10 https://songsearch.local/api/song/1810960/affiliate/amazon-itunes

Summary:
  Total:    0.9882 secs
  Slowest:  0.0279 secs
  Fastest:  0.0001 secs
  Average:  0.0010 secs
  Requests/sec: 10119.8265


Response time histogram:
  0.000 [1] |
  0.003 [9752]  |■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■
  0.006 [108]   |
  0.008 [70]    |
  0.011 [32]    |
  0.014 [8] |
  0.017 [12]    |
  0.020 [11]    |
  0.022 [1] |
  0.025 [4] |
  0.028 [1] |


Latency distribution:
  10% in 0.0003 secs
  25% in 0.0006 secs
  50% in 0.0008 secs
  75% in 0.0010 secs
  90% in 0.0013 secs
  95% in 0.0016 secs
  99% in 0.0068 secs

Details (average, fastest, slowest):
  DNS+dialup:   0.0000 secs, 0.0001 secs, 0.0279 secs
  DNS-lookup:   0.0000 secs, 0.0000 secs, 0.0026 secs
  req write:    0.0000 secs, 0.0000 secs, 0.0011 secs
  resp wait:    0.0008 secs, 0.0001 secs, 0.0206 secs
  resp read:    0.0001 secs, 0.0000 secs, 0.0013 secs

Status code distribution:
  [200] 10000 responses

10,000 requests across 10 clients at rougly 10,000 requests per second. That includes doing all the HTTP parsing, WSGI stuff, forming of a SQL or Redis query, the deserialization, the Django JSON HTTP response serialization etc. The cache TTL is controlled by simply setting a Cache-Control HTTP header with something like max-age=86400.

Now, repeated fetches for this are cached at the Nginx level and it means it doesn't even matter how slow/fast the database is. As long as it's not taking seconds, with a long Cache-Control, Nginx can hold on to this in RAM for days or until the whole server is restarted (which is rare).

Conclusion

If you the total amount of data that can and will be cached is controlled, putting it in a HTTP reverse proxy cache is probably order of magnitude faster than messing with chosing which database to use.

How much faster is Redis at storing a blob of JSON compared to PostgreSQL?

September 28, 2019
65 comments Python, PostgreSQL, Redis

tl;dr; Redis is 16 times faster at reading these JSON blobs.

In Song Search when you've found a song, it loads some affiliate links to Amazon.com. (In case you're curious it's earning me lower double-digit dollars per month). To avoid overloading the Amazon Affiliate Product API, after I've queried their API, I store that result in my own database along with some metadata. Then, the next time someone views that song page, it can read from my local database. With me so far?

Example view of affiliate links

The other caveat is that you can't store these lookups locally too long since prices change and/or results change. So if my own stored result is older than a couple of hundred days, I delete it and fetch from the network again. My current implementation uses PostgreSQL (via the Django ORM) to store this stuff. The model looks like this:


class AmazonAffiliateLookup(models.Model, TotalCountMixin):
    song = models.ForeignKey(Song, on_delete=models.CASCADE)
    matches = JSONField(null=True)
    search_index = models.CharField(max_length=100, null=True)
    lookup_seconds = models.FloatField(null=True)
    created = models.DateTimeField(auto_now_add=True, db_index=True)
    modified = models.DateTimeField(auto_now=True)

At the moment this database table is 3GB on disk.

Then, I thought, why not use Redis for this. Then I can use Redis's "natural" expiration by simply setting as expiry time when I store it and then I don't have to worry about cleaning up old stuff at all.

The way I'm using Redis in this project is as a/the cache backend and I have it configured like this:


CACHES = {
    "default": {
        "BACKEND": "django_redis.cache.RedisCache",
        "LOCATION": REDIS_URL,
        "TIMEOUT": config("CACHE_TIMEOUT", 500),
        "KEY_PREFIX": config("CACHE_KEY_PREFIX", ""),
        "OPTIONS": {
            "COMPRESSOR": "django_redis.compressors.zlib.ZlibCompressor",
            "SERIALIZER": "django_redis.serializers.msgpack.MSGPackSerializer",
        },
    }
}

The speed difference

Perhaps unrealistic but I'm doing all this testing here on my MacBook Pro. The connection to Postgres (version 11.4) and Redis (3.2.1) are both on localhost.

Reads

The reads are the most important because hopefully, they happen 10x more than writes as several people can benefit from previous saves.

I changed my code so that it would do a read from both databases and if it was found in both, write down their time in a log file which I'll later summarize. Results are as follows:

PG:
median: 8.66ms
mean  : 11.18ms
stdev : 19.48ms

Redis:
median: 0.53ms
mean  : 0.84ms
stdev : 2.26ms

(310 measurements)

It means, when focussing on the median, Redis is 16 times faster than PostgreSQL at reading these JSON blobs.

Writes

The writes are less important but due to the synchronous nature of my Django, the unlucky user who triggers a look up that I didn't have, will have to wait for the write before the XHR request can be completed. However, when this happens, the remote network call to the Amazon Product API is bound to be much slower. Results are as follows:

PG:
median: 8.59ms
mean  : 8.58ms
stdev : 6.78ms

Redis:
median: 0.44ms
mean  : 0.49ms
stdev : 0.27ms

(137 measurements)

It means, when focussing on the median, Redis is 20 times faster than PostgreSQL at writing these JSON blobs.

Conclusion and discussion

First of all, I'm still a PostgreSQL fan-boy and have no intention of ceasing that. These times are made up of much more than just the individual databases. For example, the PostgreSQL speeds depend on the Django ORM code that makes the SQL and sends the query and then turns it into the model instance. I don't know what the proportions are between that and the actual bytes-from-PG's-disk times. But I'm not sure I care either. The tooling around the database is inevitable mostly and it's what matters to users.

Both Redis and PostgreSQL are persistent and survive server restarts and crashes etc. And you get so many more "batch related" features with PostgreSQL if you need them, such as being able to get a list of the last 10 rows added for some post-processing batch job.

I'm currently using Django's cache framework, with Redis as its backend, and it's a cache framework. It's not meant to be a persistent database. I like the idea that if I really have to I can just flush the cache and although detrimental to performance (temporarily) it shouldn't be a disaster. So I think what I'll do is store these JSON blobs in both databases. Yes, it means roughly 6GB of SSD storage but it also potentially means loading a LOT more into RAM on my limited server. That extra RAM usage pretty much sums of this whole blog post; of course it's faster if you can rely on RAM instead of disk. Now I just need to figure out how RAM I can afford myself for this piece and whether it's worth it.

UPDATE September 29, 2019

I experimented with an optimization of NOT turning the Django ORM query into a model instance for each record. Instead, I did this:


+from dataclasses import dataclass


+@dataclass
+class _Lookup:
+    modified: datetime.datetime
+    matches: list

...

+base_qs = base_qs.values_list("modified", "matches")
-lookup = base_qs.get(song__id=song_id)
+lookup_tuple = base_qs.get(song__id=song_id)
+lookup = _Lookup(*lookup_tuple)

print(lookup.modified)

Basically, let the SQL driver's "raw Python" content come through the Django ORM. The old difference between PostgreSQL and Redis was 16x. The new difference was 14x instead.

uwsgi weirdness with --http

September 19, 2019
2 comments Python, Linux

Instead of upgrading everything on my server, I'm just starting from scratch. From Ubuntu 16.04 to Ubuntu 19.04 and I also upgraded everything else in sight. One of them was uwsgi. I copied various user config files but for uwsgi things didn't very well. On the old server I had uwsgi version 2.0.12-debian and on the new one 2.0.18-debian. The uWSGI changelog is pretty hard to read but I sure don't see any mention of this.

You see, on SongSearch I have it so that Nginx talks to Django via a uWSGI socket. But the NodeJS server talks to Django via 127.0.0.1:PORT. So I need my uWSGI config to start both. Here was the old config:

[uwsgi]
plugins = python35
virtualenv = /var/lib/django/songsearch/venv
pythonpath = /var/lib/django/songsearch
user = django
uid = django
master = true
processes = 3
enable-threads = true
touch-reload = /var/lib/django/songsearch/uwsgi-reload.touch
http = 127.0.0.1:9090
module = songsearch.wsgi:application
env = LANG=en_US.utf8
env = LC_ALL=en_US.UTF-8
env = LC_LANG=en_US.UTF-8

(The only difference on the new server was the python37 plugin instead)

I start it and everything looks fine. No errors in the log files. And netstat looks like this:

# netstat -ntpl | grep 9090
tcp        0      0 127.0.0.1:9090          0.0.0.0:*               LISTEN      1855/uwsgi

But every time I try to curl localhost:9090 I kept getting curl: (52) Empty reply from server. Nothing in the log files! It seemed no matter what I tried I just couldn't talk to it over HTTP. No, I'm not a sysadmin. I'm just a hobbyist trying to stand up my little server with the tools and limited techniques I know but I was stumped.

The solution

After endless Googling for a resolution and trying all sorts of uwsgi commands directly, I somehow stumbled on the solution.


[uwsgi]
plugins = python35
virtualenv = /var/lib/django/songsearch/venv
pythonpath = /var/lib/django/songsearch
user = django
uid = django
master = true
processes = 3
enable-threads = true
touch-reload = /var/lib/django/songsearch/uwsgi-reload.touch
-http = 127.0.0.1:9090
+http-socket = 127.0.0.1:9090
module = songsearch.wsgi:application
env = LANG=en_US.utf8
env = LC_ALL=en_US.UTF-8
env = LC_LANG=en_US.UTF-8

With this one subtle change, I can now curl localhost:9090 and I still have the /var/run/uwsgi/app/songsearch/socket socket. So, yay!

I'm blogging about this in case someone else ever gets stuck in the same nasty surprise as me.

Also, I have to admit, I was fuming with rage from this frustration. It's really inspired me to revive the quest for an alternative to uwsgi because I'm not sure it's that great anymore. There are new alternatives such as gunicorn, gunicorn with Meinheld, bjoern etc.

Fastest Python function to slugify a string

September 12, 2019
4 comments Python

In MDN I noticed a function that turns a piece of text (Python 2 unicode) into a slug. It looks like this:


    non_url_safe = ['"', '#', '$', '%', '&', '+',
                    ',', '/', ':', ';', '=', '?',
                    '@', '[', '\\', ']', '^', '`',
                    '{', '|', '}', '~', "'"]

    def slugify(self, text):
        """
        Turn the text content of a header into a slug for use in an ID
        """
        non_safe = [c for c in text if c in self.non_url_safe]
        if non_safe:
            for c in non_safe:
                text = text.replace(c, '')
        # Strip leading, trailing and multiple whitespace, convert remaining whitespace to _
        text = u'_'.join(text.split())
        return text

The code is 7-8 years old and relates to a migration when MDN was created as a Python fork from an existing PHP solution.

I couldn't help but to react to the fact that it's a list and it's looped over every single time. Twice, in a sense. Python has built-in tools for this kinda stuff. Let's see if I can make it faster.

The candidates


translate_table = {ord(char): u'' for char in non_url_safe}
non_url_safe_regex = re.compile(
    r'[{}]'.format(''.join(re.escape(x) for x in non_url_safe)))


def _slugify1(self, text):
    non_safe = [c for c in text if c in self.non_url_safe]
    if non_safe:
        for c in non_safe:
            text = text.replace(c, '')
    text = u'_'.join(text.split())
    return text

def _slugify2(self, text):
    text = text.translate(self.translate_table)
    text = u'_'.join(text.split())
    return text

def _slugify3(self, text):
    text = self.non_url_safe_regex.sub('', text).strip()
    text = u'_'.join(re.split(r'\s+', text))
    return text

I wrote a thing that would call each one of the candidates, assert that their outputs always match and store how long each one took.

The results

The slowest is fast enough. But if you're still reading, here are the results:

_slugify1 0.101ms
_slugify2 0.019ms
_slugify3 0.033ms

So using a translate table is 5 times faster. And a regex 3 times faster. But they're all sufficiently fast.

Conclusion

This is the least of your problems in a world of real I/O such as databases and other genuinely CPU intense stuff. Well, it was fun little side-trip.

Also, aren't there better solutions that just blacklist all control characters?

NodeJS fs walk() or glob or fast-glob

August 31, 2019
3 comments JavaScript

It started with this:


function walk(directory, filepaths = []) {
    const files = fs.readdirSync(directory);
    for (let filename of files) {
        const filepath = path.join(directory, filename);
        if (fs.statSync(filepath).isDirectory()) {
            walk(filepath, filepaths);
        } else if (path.extname(filename) === '.md') {
            filepaths.push(filepath);
        }
    }
    return filepaths;
}

And you use it like this:


const foundFiles = walk(someDirectoryOfMine);
console.log(foundFiles.length);

I thought, perhaps it's faster or better to use glob. So I installed that.
Then I found, fast-glob which sounds faster. You use both in a synchronous way.

I have a directory with about 450 files, of which 320 of them are .md files. Let's compare:

walk: 10.212ms
glob: 37.492ms
fg: 14.200ms

I measured it using console.time like this:


console.time('walk');
const foundFiles = walk(someDirectoryOfMine);
console.timeEnd('walk');
console.log(foundFiles.length);

I suppose those packages have other fancier features but, I guess this just goes to show, keep it simple.

UPDATE June 2021

The origins of this blog post were that I need a simple function to find files on disk. Later, the requirements became a bit more complex so I needed something a bit more advanced. In shopping around I found fdir which, from testing, performed excellently and has a great API (and documentation). I would handsdown use that again.

Train your own spell corrector with TextBlob

August 23, 2019
0 comments Python

TextBlob is a wonderful Python library it. It wraps nltk with a really pleasant API. Out of the box, you get a spell-corrector. From the tutorial:


>>> from textblob import TextBlob
>>> b = TextBlob("I havv goood speling!")
>>> str(b.correct())
'I have good spelling!'

The way it works is that, shipped with the library, is this text file: en-spelling.txt It's about 30,000 lines long and looks like this:

;;;   Based on several public domain books from Project Gutenberg
;;;   and frequency lists from Wiktionary and the British National Corpus.
;;;   http://norvig.com/big.txt
;;;   
a 21155
aah 1
aaron 5
ab 2
aback 3
abacus 1
abandon 32
abandoned 72
abandoning 27

That gave me an idea! How about I use the TextBlob API but bring my own text as the training model. It doesn't have to be all that complicated.

The challenge

(Note: All the code I used for this demo is available here: github.com/peterbe/spellthese)

I found this site that lists "Top 1,000 Baby Boy Names". From that list, randomly pick a couple of out and mess with their spelling. Like, remove letters, add letters, and swap letters.

So, 5 random names now look like this:

▶ python challenge.py
RIGHT: jameson  TYPOED: jamesone
RIGHT: abel     TYPOED: aabel
RIGHT: wesley   TYPOED: welsey
RIGHT: thomas   TYPOED: thhomas
RIGHT: bryson   TYPOED: brysn

Imagine some application, where fat-fingered users typo those names on the right-hand side, and your job is to map that back to the correct spelling.

First, let's use the built in TextBlob.correct. A bit simplified but it looks like this:


from textblob import TextBlob


correct, typo = get_random_name()
b = TextBlob(typo)
result = str(b.correct())
right = correct == result
...

And the results:

▶ python test.py
ORIGIN         TYPO           RESULT         WORKED?
jesus          jess           less           Fail
austin         ausin          austin         Yes!
julian         juluian        julian         Yes!
carter         crarter        charter        Fail
emmett         emett          met            Fail
daniel         daiel          daniel         Yes!
luca           lua            la             Fail
anthony        anthonyh       anthony        Yes!
damian         daiman         cabman         Fail
kevin          keevin         keeping        Fail
Right 40.0% of the time

Buuh! Not very impressive. So what went wrong there? Well, the word met is much more common than emmett and the same goes for words like less, charter, keeping etc. You know, because English.

The solution

The solution is actually really simple. You just crack open the classes out of textblob like this:


from textblob import TextBlob
from textblob.en import Spelling

path = "spelling-model.txt"
spelling = Spelling(path=path)
# Here, 'names' is a list of all the 1,000 correctly spelled names.
# e.g. ['Liam', 'Noah', 'William', 'James', ...
spelling.train(" ".join(names), path)

Now, instead of corrected = str(TextBlob(typo).correct()) we do result = spelling.suggest(typo)[0][0] as demonstrated here:


correct, typo = get_random_name()
b = spelling.suggest(typo)
result = b[0][0]
right = correct == result
...

So, let's compare the two "side by side" and see how this works out. Here's the output of running with 20 randomly selected names:

▶ python test.py
UNTRAINED...
ORIGIN         TYPO           RESULT         WORKED?
juan           jaun           juan           Yes!
ethan          etha           the            Fail
bryson         brysn          bryan          Fail
hudson         hudsn          hudson         Yes!
oliver         roliver        oliver         Yes!
ryan           rnyan          ran            Fail
cameron        caeron         carron         Fail
christopher    hristopher     christopher    Yes!
elias          leias          elias          Yes!
xavier         xvaier         xvaier         Fail
justin         justi          just           Fail
leo            lo             lo             Fail
adrian         adian          adrian         Yes!
jonah          ojnah          noah           Fail
calvin         cavlin         calvin         Yes!
jose           joe            joe            Fail
carter         arter          after          Fail
braxton        brxton         brixton        Fail
owen           wen            wen            Fail
thomas         thoms          thomas         Yes!
Right 40.0% of the time

TRAINED...
ORIGIN         TYPO           RESULT         WORKED?
landon         landlon        landon         Yes
sebastian      sebstian       sebastian      Yes
evan           ean            ian            Fail
isaac          isaca          isaac          Yes
matthew        matthtew       matthew        Yes
waylon         ywaylon        waylon         Yes
sebastian      sebastina      sebastian      Yes
adrian         darian         damian         Fail
david          dvaid          david          Yes
calvin         calivn         calvin         Yes
jose           ojse           jose           Yes
carlos         arlos          carlos         Yes
wyatt          wyatta         wyatt          Yes
joshua         jsohua         joshua         Yes
anthony        antohny        anthony        Yes
christian      chrisian       christian      Yes
tristan        tristain       tristan        Yes
theodore       therodore      theodore       Yes
christopher    christophr     christopher    Yes
joshua         oshua          joshua         Yes
Right 90.0% of the time

See, with very little effort you can got from 40% correct to 90% correct.

Note, that the output of something like spelling.suggest('darian') is actually a list like this: [('damian', 0.5), ('adrian', 0.5)] and you can use that in your application. For example:

<li><a href="?name=damian">Did you mean <b>damian</b></a></li>
<li><a href="?name=adrian">Did you mean <b>adrian</b></a></li>

Bonus and conclusion

Ultimately, what TextBlob does is a re-implementation of Peter Norvig's original implementation from 2007. I too, have written my own implementation in 2007. Depending on your needs, you can just figure out the licensing of that source code and lift it out and implement in your custom ways. But TextBlob wraps it up nicely for you.

When you use the textblob.en.Spelling class you have some choices. First, like I did in my demo:


path = "spelling-model.txt"
spelling = Spelling(path=path)
spelling.train(my_space_separated_text_blob, path)

What that does is creating a file spelling-model.txt that wasn't there before. It looks like this (in my demo):

▶ head spelling-model.txt
aaron 1
abel 1
adam 1
adrian 1
aiden 1
alexander 1
andrew 1
angel 1
anthony 1
asher 1

The number (on the right) there is the "frequency" of the word. But what if you have a "scoring" number of your own. Perhaps, in your application you just know that adrian is more right than damian. Then, you can make your own file:

Suppose the text file ("spelling-model-weighted.txt") contains lines like this:

...
adrian 8
damian 3
...

Now, the output becomes:

>>> import os
>>> from textblob.en import Spelling
>>> import os
>>> path = "spelling-model-weighted.txt"
>>> assert os.path.isfile(path)
>>> spelling = Spelling(path=path)
>>> spelling.suggest('darian')
[('adrian', 0.7272727272727273), ('damian', 0.2727272727272727)]

Based on the weighting, these numbers add up. I.e. 3 / (3 + 8) == 0.2727272727272727

I hope it inspires you to write your own spelling application using TextBlob.

For example, you can feed it the names of your products on an e-commerce site. The .txt file might bloat if you have too much but note that the 30K lines en-spelling.txt is only 314KB and it loads in...:

>>> from textblob import TextBlob
>>> from time import perf_counter
>>> b = TextBlob("I havv goood speling!")
>>> t0 = perf_counter(); right = b.correct() ; t1 = perf_counter()
>>> t1 - t0
0.07055813199999861

...70ms for 30,000 words.

function expandFiles(directoriesPatternsOrFiles)

August 15, 2019
0 comments JavaScript

I'm working on a CLI in Node. What the CLI does it that it takes one set of .json files, compute some stuff, and spits out a different set of .json files. But what it does is not important. I wanted the CLI to feel flexible and powerful but also quite forgiving. And if you typo something, it should bubble up an error rather than redirecting it to something like console.error("not a valid file!").

Basically, you use it like this:


node index.js /some/directory
# or
node index.js /some/directory /some/other/directory
# or 
node index.js /some/directory/specificfile.json
# or
node index.js /some/directory/specificfile.json /some/directory/otherfile.json
# or
node index.js "/some/directory/*.json"
# or 
node index.js "/some/directory/**/*.json"

(Note that when typing patterns in the shell you have quote them, otherwise the shell will do the expansion for you)

Or, any combination of all of these:


node index.js "/some/directory/**/*.json" /other/directory /some/specific/file.json 

Whatever you use, with patterns, in particular, it has to make the final list of found files distinct and ordered by the order of the initial arguments.

Here's what I came up with:


import fs from "fs";
import path from "path";
// https://www.npmjs.com/package/glob
import glob from "glob";


/** Given an array of "things" return all distinct .json files.
 *
 * Note that these "things" can be a directory, a file path, or a
 * pattern.
 * Only if each thing is a directory do we search for *.json files
 * in there recursively.
 */
function expandFiles(directoriesPatternsOrFiles) {
  function findFiles(directory) {
    const found = glob.sync(path.join(directory, "*.json"));

    fs.readdirSync(directory, { withFileTypes: true })
      .filter(dirent => dirent.isDirectory())
      .map(dirent => path.join(directory, dirent.name))
      .map(findFiles)
      .forEach(files => found.push(...files));

    return found;
  }

  const filePaths = [];
  directoriesPatternsOrFiles.forEach(thing => {
    let files = [];
    if (thing.includes("*")) {
      // It's a pattern!
      files = glob.sync(thing);
    } else {
      const lstat = fs.lstatSync(thing);
      if (lstat.isDirectory()) {
        files = findFiles(thing);
      } else if (lstat.isFile()) {
        files = [thing];
      } else {
        throw new Error(`${thing} is neither file nor directory`);
      }
    }
    files.forEach(p => filePaths.includes(p) || filePaths.push(p));
  });
  return filePaths;
}

This is where I'm bracing myself for comments that either point out something obvious that Node experts know or some awesome npm package that already does this but better.

If you have a typo, you get an error thrown that looks something like this:

Error: ENOENT: no such file or directory, lstat 'mydirectorrry'

(assuming mydirectory exists but mydirectorrry is a typo)

A React vs. Preact case study for a widget

July 24, 2019
0 comments Web development, React, Web Performance, JavaScript

tl;dr; The previous (React) total JavaScript bundle size was: 36.2K Brotli compressed. The new (Preact) JavaScript bundle size was: 5.9K. I.e. 6 times smaller. Also, it appears to load faster in WebPageTest.

I have this page that is a Django server-side rendered page that has on it a form that looks something like this:


<div id="root">  
  <form action="https://songsear.ch/q/">  
    <input type="search" name="term" placeholder="Type your search here..." />
    <button>Search</button>
  </form>  
</div>

It's a simple search form. But, to make it a bit better for users, I wrote a React widget that renders, into this document.querySelector('#root'), a near-identical <form> but with autocomplete functionality that displays suggestions as you type.

Anyway, I built that React bundle using create-react-app. I use the yarn run build command that generates...

  • css/main.83463791.chunk.css - 1.4K
  • js/main.ec6364ab.chunk.js - 9.0K (gzip 2.8K, br 2.5K)
  • js/runtime~main.a8a9905a.js - 1.5K (gzip 754B, br 688B)
  • js/2.b944397d.chunk.js - 119K (gzip 36K, br 33K)

Then, in Python, a piece of post-processing code copies the files from the build/static/ directory and inserts it into the rendered HTML file. The CSS gets injected as an inline <style> tag.

It's a simple little widget. No need for any service-workers or react-router or any global state stuff. (Actually, it only has 1 single runtime dependency outside the framework) I thought, how about moving this to Preact?

In comes preact-cli

The app used a couple of React hooks but they were easy to transform into class components. Now I just needed to run:


npx preact create --yarn widget name-of-my-preact-project
cd name-of-my-preact-project
mkdir src
cp ../name-of-React-project/src/App.js src/
code src/App.js

Then, I slowly moved over the src/App.js from the create-react-app project and slowly by slowly I did the various little things that you need to do. For example, to learn to build with preact build --no-prerender --no-service-worker and how I can override the default template.

Long story short, the new built bundles look like this:

  • style.82edf.css - 1.4K
  • bundle.d91f9.js - 18K (gzip 6.4K, br 5.9K)
  • polyfills.9168d.js - 4.5K (gzip 1.8K, br 1.6K)

(The polyfills.9168d.js gets injected as a script tag if window.fetch is falsy)

Unfortunately, when I did the move from React to Preact I did make some small fixes. Doing the "migration" I noticed a block of code that was never used so that gives the build bundle from Preact a slight advantage. But I think it's nominal.

In conclusion: The previous total JavaScript bundle size was: 36.2K (Brotli compressed). The new JavaScript bundle size was: 5.9K (Brotli compressed). I.e. 6 times smaller. But if you worry about the total amount of JavaScript to parse and execute, the size difference uncompressed was 129K vs. 18K. I.e. 7 times smaller. I can only speculate but I do suspect you need less CPU/battery to process 18K instead of 129K if CPU/batter matters more (or closer to) than network I/O.

WebPageTest - Visual Comparison - Mobile Slow 3G

Rendering speed difference

Rendering speed is so darn hard to measure on the web because the app is so small. Plus, there's so much else going on that matters.

However, using WebPageTest I can do a visual comparison with the "Mobile - Slow 3G" preset. It'll be a somewhat decent measurement of the total time of downloading, parsing and executing. Thing is, the server-side rended HTML form has a button. But the React/Preact widget that takes over the DOM hides that submit button. So, using the screenshots that WebPageTest provides, I can deduce that the Preact widget completes 0.8 seconds faster than the React widget. (I.e. instead of 4.4s it became 3.9s)

Truth be told, I'm not sure how predictable or reproducible is. I ran that WebPageTest visual comparison more than once and the results can vary significantly. I'm not even sure which run I'm referring to here (in the screenshot) but the React widget version was never faster.

Conclusion and thoughts

Unsurprisingly, Preact is smaller because you simply get less from that framework. E.g. synthetic events. I was lucky. My app uses onChange which I could easily "migrate" to onInput and I managed to get it to work pretty easily. I'm glad the widget app was so small and that I don't depend on any React specific third-party dependencies.

But! In WebPageTest Visual Comparison it was on "Mobile - Slow 3G" which only represents a small portion of the traffic. Mobile is a huge portion of the traffic but "Slow 3G" is not. When you do a Desktop comparison the difference is roughtly 0.1s.

Also, in total, that page is made up of 3 major elements

  1. The server-side rendered HTML
  2. The progressive JavaScript widget (what this blog post is about)
  3. A piece of JavaScript initiated banner ad

That HTML controls the "First Meaningful Paint" which takes 3 seconds. And the whole shebang, including the banner ad, takes a total of about 9s. So, all this work of rewriting a React app to Preact saved me 0.8s out of the total of 9s.

Web performance is hard and complicated. Every little counts, but keep your eye on the big ticket items assuming there's something you can do about them.

At the time of writing, preact-cli uses Preact 8.2 and I'm eager to see how Preact X feels. Apparently, since April 2019, it's in beta. Looking forward to giving it a try!

Find out all localStorage keys and their value sizes

July 13, 2019
0 comments Web development, JavaScript

I use localhost:3000 for a lot of different projects. It's the default port on create-react-app's dev server. The browser profile remains but projects come and go. There's a lot of old stuff in there that I have no longer any memory of adding.

My Storage tab in Firefox

Working in a recent single page app, I tried to use localStorage as a cache for some XHR requests and got: DOMException: "The quota has been exceeded.".
Wat?! I'm only trying to store a ~250KB JSON string. Surely that's far away from the mythical 5MB limit. Do I really have to lzw compress the string in and out to save room and pay for it in CPU cycles?

Better yet, find out what junk I still have in there.

Paste this into your Web Console (it's safe as milk):


Object.entries(localStorage).forEach(([k,v]) => console.log(k.padEnd(50), v.length, (v.length / 1024).toFixed(1) + 'KB'))

The output looks something like this:

Web Console output

Or, sorted and filtered a bit:


Object.entries(localStorage).sort((a, b) => b[1].length -a[1].length).slice(0,5).forEach(
([k,v]) => console.log(k.padEnd(50), v.length, (v.length / 1024).toFixed(1) + 'KB'));

Looks like this:

Sorted and sliced

And for the record, summed total in kilobytes:


(Object.values(localStorage).map(x => x.length).reduce((a, b) => a + b) / 1024).toFixed(1) + 'KB';

Summed in KB

Wrapping up

Seems my Firefox browser's localStorage limit is still 5MB.

Also, you can do the loop using localStorage.length and localStorage.key(n) and localStorage.getItem(localStorage.key(n)).length but using Object.entries(localStorage) seems neater.

I guess this means I can still use localStorage in my app. It seems I just need to localStorage.removeItem('massive-list:items') which sounds like an experiment, from eons ago, for seeing how much I can stuff in there.