How to count the most common lines in a file

October 7, 2022
0 comments Bash, macOS, Linux

tl;dr sort myfile.log | uniq -c | sort -n -r

I wanted to count recurring lines in a log file and started writing a complicated Python script, but then wondered if I could just do it with bash basics.
And after some poking and experimenting I found a really simple one-liner that I'm going to try to remember for next time:

You can't argue with the nice results :)

▶ cat myfile.log
one
two
three
one
two
one
once
one

▶ sort myfile.log | uniq -c | sort -n -r
   4 one
   2 two
   1 three
   1 once
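
For comparison, here is roughly what the Python version could have looked like. This is only a sketch using collections.Counter from the standard library; the function name most_common_lines is made up for illustration.

```python
from collections import Counter


def most_common_lines(lines):
    """Return (line, count) pairs, most frequent first."""
    # Strip the trailing newline so "one\n" and "one" count as the same line.
    counts = Counter(line.rstrip("\n") for line in lines)
    return counts.most_common()


sample = ["one", "two", "three", "one", "two", "one", "once", "one"]
for line, count in most_common_lines(sample):
    print(f"{count:4d} {line}")
```

To run it on a real file you would pass the open file object, e.g. most_common_lines(open("myfile.log")). The one-liner is still a lot less typing.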

Find the largest node_modules directories with bash

September 30, 2022
0 comments Bash, macOS, Linux

tl;dr fd -I -t d node_modules | rg -v 'node_modules/(\w|@)' | xargs du -sh | sort -hr

It's very possible that there's a tool that does this, but if so please enlighten me.
The objective is to find which of your various projects' node_modules directories are eating up the most disk space.
The challenge is that a node_modules directory often contains nested node_modules directories, and those shouldn't be listed separately.

The command uses fd which comes from brew install fd and it's a fast alternative to the built-in find. Definitely worth investing in if you like to live fast on the command line.
The other important command here is rg which comes from brew install ripgrep and is a fast alternative to built-in grep. Sure, I think one can use find and grep but that can be left as an exercise to the reader.

▶ fd -I -t d node_modules | rg -v 'node_modules/(\w|@)' | xargs du -sh | sort -hr
1.1G    ./GROCER/groce/node_modules/
1.0G    ./SHOULDWATCH/youshouldwatch/node_modules/
826M    ./PETERBECOM/django-peterbecom/adminui/node_modules/
679M    ./JAVASCRIPT/wmr/node_modules/
546M    ./WORKON/workon-fire/node_modules/
539M    ./PETERBECOM/chiveproxy/node_modules/
506M    ./JAVASCRIPT/minimalcss-website/node_modules/
491M    ./WORKON/workon/node_modules/
457M    ./JAVASCRIPT/battleshits/node_modules/
445M    ./GITHUB/DOCS/docs-internal/node_modules/
431M    ./GITHUB/DOCS/docs/node_modules/
418M    ./PETERBECOM/preact-cli-peterbecom/node_modules/
418M    ./PETERBECOM/django-peterbecom/adminui0/node_modules/
399M    ./GITHUB/THEHUB/thehub/node_modules/
...

How it works:

  • fd -I -t d node_modules: Find all directories called node_modules but ignore any .gitignore directives in their parent directories.
  • rg -v 'node_modules/(\w|@)': Exclude all finds where node_modules/ is followed by a @ or a word character (\w, i.e. [A-Za-z0-9_]). That drops everything nested inside a node_modules directory.
  • xargs du -sh: For each line, run du -sh on it. That's like doing cd some/directory && du -sh, where du means "disk usage" and -s means total and -h means human-readable.
  • sort -hr: Sort by the first column as a "human numeric sort" meaning it understands that "1M" is more than "20K"
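
If you'd rather not install fd and rg, the same idea can be sketched in Python with nothing but the standard library. This is an approximation, not a replacement: the function names are made up, and it sums file sizes rather than disk blocks, so the numbers won't match du exactly.

```python
import os
from pathlib import Path


def top_level_node_modules(root):
    """Yield node_modules directories under root, skipping nested ones."""
    for dirpath, dirnames, _ in os.walk(root):
        if "node_modules" in dirnames:
            yield Path(dirpath) / "node_modules"
            # Prune it so os.walk never descends into it; that is what
            # keeps nested node_modules directories out of the result.
            dirnames.remove("node_modules")


def dir_size(path):
    """Sum the sizes (in bytes) of all regular files below path."""
    total = 0
    for dirpath, _, filenames in os.walk(path):
        for name in filenames:
            file_path = os.path.join(dirpath, name)
            if not os.path.islink(file_path):
                total += os.path.getsize(file_path)
    return total


def largest_node_modules(root="."):
    """Return (size_in_bytes, path) pairs, largest first."""
    sizes = [(dir_size(p), p) for p in top_level_node_modules(root)]
    return sorted(sizes, key=lambda pair: pair[0], reverse=True)
```

Calling largest_node_modules(".") from the directory that holds all your projects gives a list you can print or slice to the top few entries.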

Now, if I want to free up some disk space, I can look through the list and if I recognize a project I almost never work on any more, I just send it to rm -fr.

Spot the JavaScript bug with recursion and incrementing

September 28, 2022
0 comments JavaScript

What will this print?


function doSomething(iterations = 0) {
  if (iterations < 10) {
    console.log("Let's do this again!")
    doSomething(iterations++)    
  }
}
doSomething()

The answer is it will print

Let's do this again!
Let's do this again!
Let's do this again!
Let's do this again!
Let's do this again!
Let's do this again!
Let's do this again!
Let's do this again!
...and so on, until the call stack overflows (RangeError: Maximum call stack size exceeded)

The bug is the use of a "postfix increment", which returns the value from before the increment, so every recursive call receives 0. It's a bug I had in some production code (almost; it never shipped).

The solution is simple:


     console.log("Let's do this again!")
-    doSomething(iterations++)
+    doSomething(++iterations)    

That's called "prefix increment", which means it not only changes the variable but also returns what the value became, rather than what it was before the increment.

The beautiful solution is actually the simplest solution:


     console.log("Let's do this again!")
-    doSomething(iterations++)
+    doSomething(iterations + 1)    

Now, you don't even mutate the value of the iterations variable but create a new one for the recursion call.

All in all, a pretty simple mistake, but it can easily happen. Particularly if you feel inclined to look cool by using the spiffy ++ shorthand because it looks neater or something.

Create a large empty file for testing

September 8, 2022
0 comments Linux

Because I always end up Googling this and struggling to find it easily, I'm going to jot it down here so it's more present on the web for others (and myself!) to quickly find.

Suppose you want to test something like a benchmark; for example, a unit test that has to process a largish file. You can use the dd command which is available on macOS and most Linuxes.

▶ dd if=/dev/zero of=big.file count=1024 bs=1024

▶ ls -lh big.file
-rw-r--r--  1 peterbe  staff   1.0M Sep  8 15:54 big.file

So count=1024 together with bs=1024 (1024 blocks of 1024 bytes each) creates a 1MB file. To create a 500KB one you simply use...

▶ dd if=/dev/zero of=big.file count=500 bs=1024

▶ ls -lh big.file
-rw-r--r--  1 peterbe  staff   500K Sep  8 15:55 big.file
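
If the test itself is written in Python, you may not need dd at all. Here's a small sketch that writes the same kind of zero-filled file; create_zero_file and its chunk_size parameter are names I made up for the example.

```python
import os


def create_zero_file(path, size_bytes, chunk_size=1024 * 1024):
    """Write size_bytes zero bytes to path and return the resulting file size."""
    chunk = b"\0" * chunk_size
    remaining = size_bytes
    with open(path, "wb") as f:
        # Write in chunks so a very large file doesn't need to fit in memory.
        while remaining > 0:
            n = min(chunk_size, remaining)
            f.write(chunk[:n])
            remaining -= n
    return os.path.getsize(path)
```

For example, create_zero_file("big.file", 500 * 1024) should give the same 500KB file as the dd command above.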

It creates a binary file so you can't meaningfully view it with cat. But if you try to use less, for example, you'll see this:

▶ less big.file
"big.file" may be a binary file.  See it anyway? [Enter]

^@^@^@...snip...^@^@^@
big.file (END)

Programmatically render a NextJS page without a server in Node

September 6, 2022
0 comments Web development, Node, JavaScript

If you use getServerSideProps() in Next you can render a page by visiting it. E.g. GET http://localhost:3000/mypages/page1
Or if you use getStaticProps() with getStaticPaths(), you can use npm run build to generate the HTML file (e.g. in the .next/server/pages directory).
But what if you don't want to start a server? What if you have a particular page/URL in mind that you want to generate but without starting a server and sending an HTTP GET request to it? This blog post shows a way to do this with a plain Node script.

Here's a solution to programmatically render a page:


#!/usr/bin/env node

import http from "http";

import next from "next";

async function main(uris) {
  const nextApp = next({});
  const nextHandleRequest = nextApp.getRequestHandler();
  await nextApp.prepare();

  const htmls = Object.fromEntries(
    await Promise.all(
      uris.map((uri) => {
        try {
          // If it's a fully qualified URL, make it its pathname
          uri = new URL(uri).pathname;
        } catch {}
        return renderPage(nextHandleRequest, uri);
      })
    )
  );
  console.log(htmls);
}

async function renderPage(handler, url) {
  const req = new http.IncomingMessage(null);
  const res = new http.ServerResponse(req);
  req.method = "GET";
  req.url = url;
  req.path = url;
  req.cookies = {};
  req.headers = {};
  await handler(req, res);
  if (res.statusCode !== 200) {
    throw new Error(`${res.statusCode} on rendering ${req.url}`);
  }
  for (const { data } of res.outputData) {
    const [, body] = data.split("\r\n\r\n");
    if (body) return [url, body];
  }
  throw new Error("No output data has a body");
}

main(process.argv.slice(2)).catch((err) => {
  console.error(err);
  process.exit(1);
});

To demonstrate I created this sample repo: https://github.com/peterbe/programmatically-render-next-page

Note that you need to run npm run build first so Next can have all the static assets ready.

In conclusion

The alternative, in automation, would be to run something like this:


▶ npm run build && npm run start &
▶ sleep 5  # give the server a chance to start
▶ xh http://localhost:3000/aboutus
HTTP/1.1 200 OK
Connection: keep-alive
Content-Encoding: gzip
Content-Type: text/html; charset=utf-8
Date: Tue, 06 Sep 2022 12:23:42 GMT
Etag: "m8ff9sdduo1hk"
Keep-Alive: timeout=5
Transfer-Encoding: chunked
Vary: Accept-Encoding
X-Powered-By: Next.js

<!DOCTYPE html><html><head><meta charSet="utf-8"/><meta name="viewport" content="width=device-width"/><title>About Us page</title><meta name="description" content="We do things. I hope."/><link rel="icon" href="/favicon.ico"/><meta name="next-head-count" content="5"/><link rel="preload" href="/_next/static/css/ab44ce7add5c3d11.css" as="style"/><link rel="stylesheet" href="/_next/static/css/ab44ce7add5c3d11.css" data-n-g=""/><link rel="preload" href="/_next/static/css/ae0e3e027412e072.css" as="style"/><link rel="stylesheet" href="/_next/static/css/ae0e3e027412e072.css" data-n-p=""/><noscript data-n-css=""></noscript><script defer="" nomodule="" src="/_next/static/chunks/polyfills-c67a75d1b6f99dc8.js"></script><script src="/_next/static/chunks/webpack-7ee66019f7f6d30f.js" defer=""></script><script src="/_next/static/chunks/framework-db825bd0b4ae01ef.js" defer=""></script><script src="/_next/static/chunks/main-3123a443c688934f.js" defer=""></script><script src="/_next/static/chunks/pages/_app-deb173bd80cbaa92.js" defer=""></script><script src="/_next/static/chunks/996-f1475101e84cf548.js" defer=""></script><script src="/_next/static/chunks/pages/aboutus-41b1f037d974ef60.js" defer=""></script><script src="/_next/static/REJUWXI26y-lp9JVmzJB5/_buildManifest.js" defer=""></script><script src="/_next/static/REJUWXI26y-lp9JVmzJB5/_ssgManifest.js" defer=""></script></head><body><div id="__next"><div class="Home_container__bCOhY"><main class="Home_main__nLjiQ"><h1 class="Home_title__T09hD">About Use page</h1><p class="Home_description__41Owk"><a href="/">Go to the <b>Home</b> page</a></p></main><footer class="Home_footer____T7K"><a href="/">Home page</a></footer></div></div><script id="__NEXT_DATA__" type="application/json">{"props":{"pageProps":{}},"page":"/aboutus","query":{},"buildId":"REJUWXI26y-lp9JVmzJB5","nextExport":true,"autoExport":true,"isFallback":false,"scriptLoader":[]}</script></body></html>

There are probably many great ideas that this can be used for. At work we use getServerSideProps() and we have too many pages to build them all statically. We need a solution like this to do custom analysis of the rendered HTML to check for broken links by analyzing every generated <a href> tag.
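
The link-analysis part doesn't have to happen in Node. As a rough sketch of that second step, here's how the href values could be pulled out of a rendered HTML string with Python's built-in html.parser. The class and function names are made up, and it only collects the links; deciding whether a link is broken is a separate problem that this doesn't attempt.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect the href attribute of every <a> tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value is not None:
                    self.hrefs.append(value)


def extract_hrefs(html):
    parser = LinkCollector()
    parser.feed(html)
    parser.close()
    return parser.hrefs


# A shortened piece of the rendered page above.
html = (
    '<main><p><a href="/">Go to the <b>Home</b> page</a></p></main>'
    '<footer><a href="/">Home page</a></footer>'
)
print(extract_hrefs(html))
```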

Join a list with a bitwise or operator in Python

August 22, 2022
0 comments Python

The bitwise OR operator in Python is often convenient when you want to combine multiple things into one thing. For example, with the Django ORM you might do this:


from django.db.models import Q

filter_ = Q(first_name__icontains="peter") | Q(first_name__icontains="ashley")

for contact in Contact.objects.filter(filter_):
    print((contact.first_name, contact.last_name))

See how it hardcodes the filtering on strings peter and ashley.
But what if that was a bit more complicated:


from django.db.models import Q

filter_ = Q(first_name__icontains="peter")
if include("ashley"):
    filter_ |= Q(first_name__icontains="ashley")

for contact in Contact.objects.filter(filter_):
    print((contact.first_name, contact.last_name))

So far, same functionality.

But what if the business logic is more complicated? You can't do this:


filter_ = None
if include("peter"):
    filter_ |= Q(first_name__icontains="peter")  # WILL NOT WORK
if include("ashley"):
    filter_ |= Q(first_name__icontains="ashley")

for contact in Contact.objects.filter(filter_):
    print((contact.first_name, contact.last_name))

What if the things you want to filter on come from a list of unknown length? You'd need to do the |= stuff "dynamically". One way to solve that is with functools.reduce. Suppose you've collected the things you want to bitwise-OR together in a list:


from django.db.models import Q
from operator import or_
from functools import reduce


def include(_):
    import random
    return random.random() > 0.5

filters = []
if include("peter"):
    filters.append(Q(first_name__icontains="peter"))
if include("ashley"):
    filters.append(Q(first_name__icontains="ashley"))

assert len(filters), "must have at least one filter"
filter_ = reduce(or_, filters)  # THE MAGIC!

for contact in Contact.objects.filter(filter_):
    print((contact.first_name, contact.last_name))

And finally, if it's a list already:


from django.db.models import Q
from operator import or_
from functools import reduce

names = ["peter", "ashley"]
qs = [Q(first_name__icontains=x) for x in names]
filter_ = reduce(or_, qs)

for contact in Contact.objects.filter(filter_):
    print((contact.first_name, contact.last_name))
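
The reduce(or_, ...) trick isn't specific to Django; it works for anything that implements the | operator. Here's a sketch you can run without a database. The Permission flags are hypothetical, purely for illustration.

```python
from enum import IntFlag
from functools import reduce
from operator import or_


class Permission(IntFlag):  # hypothetical flags, for illustration only
    READ = 1
    WRITE = 2
    EXECUTE = 4


wanted = [Permission.READ, Permission.EXECUTE]
combined = reduce(or_, wanted)  # same as Permission.READ | Permission.EXECUTE

# reduce() raises TypeError on an empty list unless you give it a
# starting value as the third argument.
nothing = reduce(or_, [], Permission(0))

# Sets support | too (union).
tags = reduce(or_, [{"a"}, {"b", "c"}], set())
```

The third argument to reduce is what makes an empty list safe, which is an alternative to the assert used earlier.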

Side note

Django's django.db.models.Q is actually quite flexible when used with MyModel.objects.filter(...) because this actually works:


from django.db.models import Q

def include(_):
    import random
    return random.random() > 0.5

filter_ = Q()  # MAGIC SAUCE
if include("peter"):
    filter_ |= Q(first_name__icontains="peter")
if include("ashley"):
    filter_ |= Q(first_name__icontains="ashley")

for contact in Contact.objects.filter(filter_):
    print((contact.first_name, contact.last_name))

Comparing compression commands with hyperfine

July 6, 2022
0 comments Bash, macOS, Linux

Today I stumbled across a neat CLI for benchmarking and comparing the speed of other CLIs: hyperfine, by David Peter (@sharkdp).
It's a great tool in your arsenal for quick benchmarks in the terminal.

It's written in Rust and is easily installed with brew install hyperfine. For example, let's compare a couple of different commands for compressing a file into a new compressed file. I know it's comparing apples and oranges but it's just an example:

hyperfine usage example

It basically executes the following commands over and over and then compares how long each one took on average:

  • apack log.log.apack.gz log.log
  • gzip -k log.log
  • zstd log.log
  • brotli -3 log.log
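
To make the "over and over, then compare averages" idea concrete, here's a very rough imitation in Python. It is not how hyperfine is implemented (hyperfine does much more, such as warmup runs and outlier detection); time_command and compare are names made up for this sketch.

```python
import statistics
import subprocess
import time


def time_command(args, runs=5):
    """Run a command `runs` times (at least 2) and return (mean, stdev) in seconds."""
    durations = []
    for _ in range(runs):
        start = time.perf_counter()
        subprocess.run(
            args,
            check=True,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
        )
        durations.append(time.perf_counter() - start)
    return statistics.mean(durations), statistics.stdev(durations)


def compare(commands, runs=5):
    """Return {command: mean / fastest mean}, similar to a "Relative" column."""
    means = {" ".join(args): time_command(args, runs)[0] for args in commands}
    fastest = min(means.values())
    return {label: mean / fastest for label, mean in means.items()}
```

You'd call it with something like compare([["gzip", "-k", "log.log"], ["zstd", "log.log"]]), but note that it has no equivalent of --prepare, so the output files from the previous run would get in the way.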

If you're curious about the ~results~ apples vs oranges, the final result is:

▶ ls -lSh log.log*
-rw-r--r--  1 peterbe  staff    25M Jul  3 10:39 log.log
-rw-r--r--  1 peterbe  staff   2.4M Jul  5 22:00 log.log.apack.gz
-rw-r--r--  1 peterbe  staff   2.4M Jul  3 10:39 log.log.gz
-rw-r--r--  1 peterbe  staff   2.2M Jul  3 10:39 log.log.zst
-rw-r--r--  1 peterbe  staff   2.1M Jul  3 10:39 log.log.br

The point is that you type hyperfine followed by each command in quotation marks. The --prepare command is run before each timing run, and you can also use --cleanup="{cleanup command here}".

It's versatile so it doesn't have to be different commands but it can be: hyperfine "python optimization1.py" "python optimization2.py" to compare two Python scripts.

🎵 You can also export the output to a Markdown file. Here, I used:

▶ hyperfine "apack log.log.apack.gz log.log" "gzip -k log.log" "zstd log.log" "brotli -3 log.log" --prepare="rm -fr log.log.*" --export-markdown log.compress.md
▶ cat log.compress.md | pbcopy

and it becomes this:

| Command | Mean [ms] | Min [ms] | Max [ms] | Relative |
|:---|---:|---:|---:|---:|
| `apack log.log.apack.gz log.log` | 291.9 ± 7.2 | 283.8 | 304.1 | 4.90 ± 0.19 |
| `gzip -k log.log` | 240.4 ± 7.3 | 232.2 | 256.5 | 4.03 ± 0.18 |
| `zstd log.log` | 59.6 ± 1.8 | 55.8 | 65.5 | 1.00 |
| `brotli -3 log.log` | 122.8 ± 4.1 | 117.3 | 132.4 | 2.06 ± 0.09 |

How to know if a PR has auto-merge enabled in a GitHub Action workflow

May 24, 2022
0 comments GitHub

tl;dr


      - name: Only if auto-merge is enabled
        if: ${{ github.event.pull_request.auto_merge }}
        run: echo "Auto-merge IS ENABLED"

      - name: Only if auto-merge is NOT enabled
        if: ${{ !github.event.pull_request.auto_merge }}
        run: echo "Auto-merge is NOT enabled"

The use case that I needed was that I have a workflow that does a bunch of things that aren't really critical to test the PR, but they also take a long time. In particular, every pull request deploys a "preview environment" so you get a "staging" site for each pull request. Well, if you know with confidence that you're not going to be clicking around on that preview/staging site, why bother deploying it (again)?

Also, a lot of PRs get the "Auto-merge" enabled because whoever pressed that button knows that as long as it builds OK, it's ready to merge in.

What's cool about the if: statements above is that they will work in all of these cases too:


on:
  workflow_dispatch:
  pull_request:
  push:
    branches:
      - main

I.e. if this runs because it was a push to main, github.event.pull_request does not exist, so ${{ !github.event.pull_request.auto_merge }} resolves to true. The same applies when the workflow is started manually via workflow_dispatch.
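
If a step in the workflow runs a script rather than a one-line echo, the same check can be made from inside the script by reading the event payload that GitHub Actions writes to the file named by the GITHUB_EVENT_PATH environment variable. A minimal Python sketch, with made-up function names:

```python
import json
import os


def auto_merge_enabled(event):
    """True if the event payload is for a pull request with auto-merge enabled."""
    # Push and workflow_dispatch payloads have no "pull_request" key at all.
    pull_request = event.get("pull_request") or {}
    # "auto_merge" is null when disabled and an object when enabled.
    return bool(pull_request.get("auto_merge"))


def load_event(path=None):
    """Load the JSON payload of the event that triggered the workflow."""
    path = path or os.environ["GITHUB_EVENT_PATH"]
    with open(path) as f:
        return json.load(f)
```

In a workflow step you would call auto_merge_enabled(load_event()). Just like the if: expressions, it treats "not a pull request" the same as "auto-merge not enabled".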

Auto-merge GitHub pull requests based on "partial required checks"

May 3, 2022
0 comments GitHub

Auto-merge is a fantastic GitHub feature. You first need to set up some branch protections and then, as soon as you've created the PR, you can press the "Enable auto-merge (squash)" button. It will merge the PR (as a "Squash and merge") as soon as all branch protection checks have succeeded. Neat.

But what if you have a workflow that is made up of half critical and half not-so-important stuff? In particular, what if there's stuff in the workflow that is really slow and you don't want to wait? One example is that you might have a build-and-deploy workflow where you've decided that the "build" part of that is a required check, but the (slow) deployment is just a nice-to-have. Here's an example of that:


name: Build and Deploy stuff

on:
  workflow_dispatch:
  pull_request:


permissions:
  contents: read

jobs:
  build-stuff:
    runs-on: ubuntu-latest
    steps:
      - name: Slight delay
        run: sleep 5

  deploy-stuff:
    needs: build-stuff
    runs-on: ubuntu-latest
    steps:
      - name: Do something
        run: sleep 26

It's a bit artificial but perhaps you can see beyond that. What you can do is set up a required status check, as a branch protection, just for the build-stuff job.

Note how the workflow is made up of two jobs, build-stuff and deploy-stuff, where the latter depends on the first. Now set up branch protection purely based on build-stuff. This option should appear as you start typing buil there in the "Status checks that are required." section of Branch protections.

Branch protection

Now, when the PR is created it immediately starts working on that build-stuff job. While that's running you press the "Enable auto-merge (squash)" button:

Checks started

What will happen is that as soon as the build-stuff job (technically the full name becomes "Build and Deploy stuff / build-stuff") goes green, the PR is auto-merged. But the next (dependent) job deploy-stuff now starts so even if the PR is merged you still have an ongoing workflow job running. Note the little orange dot (instead of the green checkmark).

Still working

It's quite an advanced pattern and perhaps you don't have the use case yet, but it's good to know it's possible. Our use case at work was that we use auto-merge a lot in automation, and our complete workflow depended on a slow step that is actually conditional. So we didn't want the auto-merge to be delayed because of something that might be slow and might also turn out to not be necessary.

How to sort case insensitively with empty strings last in Django

April 3, 2022
1 comment Django, Python, PostgreSQL

Imagine you have something like this in Django:


class MyModel(models.Model):
    last_name = models.CharField(max_length=255, blank=True)
    ...

The most basic sorting is either: queryset.order_by('last_name') or queryset.order_by('-last_name'). But what if you want entries with a blank string last? And, you want it to be case insensitive. Here's how you do it:


from django.db.models.functions import Lower, NullIf
from django.db.models import Value


if reverse:
    order_by = Lower("last_name").desc()
else:
    # nulls_last is an option of .asc()/.desc(), not of Lower()
    order_by = Lower(NullIf("last_name", Value(""))).asc(nulls_last=True)

queryset = queryset.order_by(order_by)


ALL = list(queryset.values_list("last_name", flat=True))
print("FIRST 5:", ALL[:5])
# Will print either...
#   FIRST 5: ['Zuniga', 'Zukauskas', 'Zuccala', 'Zoller', 'ZM']
# or 
#   FIRST 5: ['A', 'aaa', 'Abrams', 'Abro', 'Absher']
print("LAST 5:", ALL[-5:])
# Will print...
#   LAST 5: ['', '', '', '', '']

This is only tested with PostgreSQL but it works nicely.
If you're curious about what the SQL becomes, it's:


SELECT "main_contact"."last_name" FROM "main_contact" 
ORDER BY LOWER(NULLIF("main_contact"."last_name", '')) ASC NULLS LAST

or


SELECT "main_contact"."last_name" FROM "main_contact" 
ORDER BY LOWER("main_contact"."last_name") DESC

Note that if your table column can hold either a string, an empty string, or null, the reverse needs to be: Lower("last_name").desc(nulls_last=True).
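
For a quick sanity check of the intended ordering without a database, the same rules (case insensitive, blanks and nulls last in both directions) can be sketched in plain Python. The function name sort_last_names is made up for this example.

```python
def sort_last_names(names, reverse=False):
    """Sort case insensitively, keeping empty strings and None at the end."""
    filled = sorted((n for n in names if n), key=str.lower, reverse=reverse)
    blanks = [n for n in names if not n]
    return filled + blanks


names = ["Zuniga", "", "aaa", "Abrams", None, "A"]
print(sort_last_names(names))
print(sort_last_names(names, reverse=True))
```

That's handy in a unit test where you want to compare what the queryset returns against a known-good ordering.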