Peterbe.com

Peter Bengtsson's blog

Filtered by Web development

Page 13

Fastest way to take screencaps out of videos

December 19, 2014
1 comment Linux, Web development, Mozilla

tl;dr Don't run ffmpeg over HTTP(S) and use ffmpegthumbnailer

UPDATE tl;dr Download the file then run ffmpeg with -ss HH:MM:SS first. Don't bother with ffmpegthumbnailer

At work I work on something called Air Mozilla. It's a site for hosting live video broadcasts and then archiving those so they can be retrieved later.

Unlike sites like YouTube we can't take a screencap from the video because many videos are future (aka. "upcoming") videos so instead we use a little placeholder thumbnail (for example, the Rust logo).

However, once it has been recorded we want to switch from the logo to an actual screen capture from the video itself. We set up a cronjob that uses ffmpeg to extract these as JPGs and then the users can go in and select whichever picture they like the best.

This is all work in progress by the way (as of December 2014).

One problem is that we have is that the command for extracting JPGs is really slow. So slow that we can't wrap the subprocess in a Django database connection because it's so slow that the database connection is often killed.

The command to extract them looks something like this:

ffmpeg -i https://cdnexample.com/url/to/file.mp4 -r 0.0143 /tmp/screencaps-%02d.jpg

Where the number r is based on the duration and how many pictures we want out. E.g. 0.0143 = 15 * 1049 where 15 is how many JPGs we want and 1049 is a duration of 17 minutes and 29 seconds.

The script I used first was: ffmpeg1.sh

My first experiment was to try to extract one picture at a time, hoping that way, internally, ffmpeg might be able to optimize something.

The second script I used was: ffmpeg2.sh

The third alternative was to try ffmpegthumbnailer which is an intricate wrapper on ffmpeg and it has the benefit that you can produce slightly higher picture quality too.

The third script I used was: ffmpeg3.sh

And running these three depend very much on the state of my DSL at the time.

For a video clip that is 17 minutes long and a 138Mb mp4 file.

ffmpeg1.sh   2m0.847s
ffmpeg2.sh   11m46.734s
ffmpeg3.sh   0m29.780s

Clearly it's not efficient to do one screenshot at a time.
Because with ffmpegthumbnailer you can tell it not to reduce the picture quality the total weight of the produced JPGs from ffmpeg1.sh was 784Kb and the total weight from ffmpeg3.sh was 1.5Mb.

Just to try again, I ran a similar experiment with a 35 minutes long and 890Mb mp4 file. And this time I didn't bother with ffmpeg2.sh. The results were:

ffmpeg1.sh   18m21.330s
ffmpeg3.sh   2m48.656s

So that means that using ffmpegthumbnailer is about 5 times faster than ffmpeg. Huge difference!

And now, a curveball!

The reason for doing ffmpeg -i https://... was so that we don't have to first download the whole beast and run the command on a local file. However, in light of how so much longer this takes and my disdain to have to install and depend on a new tool (ffmpegthumbnailer) across all servers. Why not download the whole file and run the ffmpeg command locally.

So I download the file and it's slow because of my, currently, terrible home DSL. Then I run and time them again but just a local file instead:

ffmpeg1.sh   0m20.426s
ffmpeg3.sh   0m0.635s

Did you see that!? That's an insane difference. Clearly doing this command over HTTP(S) is a bad idea. It'll be worth downloading it first.

UPDATE

On Stackoverflow, LordNeckBeard gave a great tip of using the -ss option before in the input file and now it's much faster. At this point. I'm no longer interested in having to bother with ffmpegthumbnailer.

Let's fork ffmpeg2.sh into two versions.

ffmpeg2.1.sh same as ffmpeg2.sh but a downloaded file instead of a remote HTTPS URL.

ffmpeg2.2.sh as ffmpeg2.1.sh except we put the -ss HH:MM:SS before the input file.

Now, let's run them again on the 138Mb file:

# the 138Mb mp4.mp4 file
ffmpeg2.1.sh   2m10.898s
ffmpeg2.2.sh   0m0.672s

187 times faster

And again, I re-ran this again against a bigger file that is 1.4Gb:

# the 1.4Gb mp4-1.44Gb.mp4 file
ffmpeg2.1.sh   10m1.143s
ffmpeg2.2.sh   0m1.428s

420 times faster

Why is it important to escape & in href attributes in tags?

November 11, 2014
13 comments Web development

Here’s an example of unescaped & characters in a A HREF tag attribute.
http://jsfiddle.net/32zbogfw/ It’s working fine.

I know it might break XML and possibly XHTML but who uses that still?

And I know an unescaped & in a href shows as red in the View Source color highlighting.

What can go wrong? Why is it important? Perhaps it used to be in 2009 but no longer the case.

This all started because I was reviewing some that uses python urllib.urlencode(...) and inserts the results into a Django template with href="{{ result_of_that_urlencode }}" which would mean you get un-escaped & characters and then I tried to find how and why that is bad but couldn't find any examples of it.

django-html-validator

October 20, 2014
2 comments Python, Web development, Django

A couple of weeks ago we had accidentally broken our production server (for a particular report) because of broken HTML. It was an unclosed tag which rendered everything after that tag to just plain white. Our comprehensive test suite failed to notice it because it didn't look at details like that. And when it was tested manually we simply missed the conditional situation when it was caused. Neither good excuses. So it got me thinking how can we incorporate HTML (html5 in particular) validation into our test suite.

So I wrote a little gist and used it a bit on a couple of projects and was quite pleased with the results. But I thought this might be something worthwhile to keep around for future projects or for other people who can't just copy-n-paste a gist.

With that in mind I put together a little package with a README and a setup.py and now you can use it too.

There are however some caveats. Especially if you intend to run it as part of your test suite.

Caveat number 1

You can't flood htmlvalidator.nu. Well, you can I guess. It would be really evil of you and kittens will die. If you have a test suite that does things like response = self.client.get(reverse('myapp:myview')) and there are many tests you might be causing an obscene amount of HTTP traffic to them. Which brings us on to...

Caveat number 2

The htmlvalidator.nu site is written in Java and it's open source. You can basically download their validator and point django-html-validator to it locally. Basically the way it works is java -jar vnu.jar myfile.html. However, it's slow. Like really slow. It takes about 2 seconds to run just one modest HTML file. So, you need to be patient.

Aggressively prefetching everything you might click

August 20, 2014
13 comments This site, Web development, JavaScript

I just rolled out a change here on my personal blog which I hope will make my few visitors happy.

**Basically; when you hover over a link (local link) long enough it prefetches it (with AJAX) so that if you do click it's hopefully already cached in your browser. **

If you hover over a link and almost instantly hover out it cancels the prefetching. The assumption here is that if you deliberately put your mouse cursor over a link and proceed to click on it you want to go there. Because your hand is relatively slow I'm using the opportunity to prefetch it even before you have clicked. Some hands are quicker than others so it's not going to help for the really quick clickers.

What I also had to do was set a Cache-Control header of 1 hour on every page so that the browser can learn to cache it.

The effect is that when you do finally click the link, by the time your browser loads it and changes the rendered output it'll hopefully be able to do render it from its cache and thus it becomes visually ready faster.

Let's try to demonstrate this with this horrible animated gif:
(or download the screencast.mov file)

Screencast
1. Hover over a link (in this case the "Now I have a Gmail account" from 2004)
2. Notice how the Network panel preloads it
3. Click it after a slight human delay
4. Notice that when the clicked page is loaded, its served from the browser cache
5. Profit!

So the code that does is is quite simply:


$(function() {
  var prefetched = [];
  var prefetch_timer = null;
  $('div.navbar, div.content').on('mouseover', 'a', function(e) {
    var value = e.target.attributes.href.value;
    if (value.indexOf('/') === 0) {
      if (prefetched.indexOf(value) === -1) {
        if (prefetch_timer) {
          clearTimeout(prefetch_timer);
        }
        prefetch_timer = setTimeout(function() {
          $.get(value, function() {
            // necessary for $.ajax to start the request :(
          });
          prefetched.push(value);
        }, 200);
      }
    }
  }).on('mouseout', 'a', function(e) {
    if (prefetch_timer) {
      clearTimeout(prefetch_timer);
    }
  });
});

Also, available on GitHub.

I'm excited about this change because of a couple of reasons:

On mobile, where you might be on a non-wifi data connection you don't want this. There you don't have the mouse event onmouseover triggering. So people on such devices don't "suffer" from this optimization.
It only downloads the HTML which is quite light compared to static assets such as pictures but it warms up the server-side cache if needs be.
It's much more targetted than a general prefetch meta header.
Most likely content will appear rendered to your eyes faster.

Optimizing MozTrap

June 4, 2014
3 comments Web development, JavaScript, Mozilla

MozTrap is what's called a "test case management system". Basically, software QA people need a structure and pattern to their testing. What to test, what versions to test on and what hardware/operatting system etc all is part of a "test suite". That's what MozTrap manages.

So this project was built by Mozilla's automation and tools team. It is currently not an actively developed project. Not because it's not needed or used but because it basically maps all the features we need. A large part of the code base was originally written by a personal friend of mine who I respect wholeheartedly; Carl Meyer of Django/pip/virtualenv/etc fame. I'm grateful for the awesome documentation he left behind amongst many other things.

Together with the team we sat down and listed all the biggest pain points as of today. Basically, the number one thing is speed. Pages load too slowly. Normally when web developers worry themselves with web performance it's a matter of shaving milliseconds off a page where a clients perception equals lost or gained profits. Here's not a problem of milliseconds but a problem of seconds. After some quick poking around on the production site and looking at some code the conclusion is simple: The site is so darn slow because the HTML sent from the server is way too MASSIVE. And baked into that is a mixture of the poor web server having to produce a massive HTML blob and it being sent over the wire.

One test run I made said it took 14 seconds to render a certain page.

Why is it so slow?

So how did this happen and why is it not Carl's fault? :) The reason it happened was because of the underestimated number of options added to the advanced filtering drop-downs. On a local dev version you never notice these things because you set up some options, for example tags, and the drop-down never gets larger than 10-20 options. For example, the "Creator" drop-down today has 1,664 different choices.

If you take all those choices and turn thing into a HTML like this: <option value="1">Adam</option>\n<option value="2">Bram</option>... etc. you get 66Kb of just HTML. However, MozTrap doesn't work like that. Instead it uses pretty drop-downs that don't look like regular HTML drop-downs. See for yourself; go to https://moztrap.mozilla.org/results/runs/ and click the "Advanced Filtering" button.
So, that means that the HTML for each option instead looks like this:


<li class="filter-item">
  <input name="filter-creator" data-name="creator" value="1" id="id-filter-creator-1" class="check" type="checkbox">
  <span class="onoff">
    <label for="id-filter-creator-1" class="onoffswitch">Adam</label>
                <span class="pinswitch"></span>
    <span class="content" title="creator: Adam">Adam</span>
  </span>
</li>

Now you get 620Kb of just HTML just for the "Creators" field. Granted, that is the biggest field of all the drop-downs but lots of them are massive.

So, this makes the page weigh a total of about 1.1Mb just for the HTML. Not only is it a lot of work for the (Django) server to generate this but it's also a heck of a lot of data to send across the Internet on every page request.

So what was the solution?

An ideal solution would have been a significant re-write whereby much of the values of the page gets rewritten as later AJAX calls. I.e. load a skeleton that loads superfast, and then load some AJAX in the background. That AJAX could potentially be cached in the browser with localStorage or something so that you get something to show very quickly whilst you wait for the AJAX request to complete. But this would have been too big a change and the way the filtering works on these pages, you actually need all the options in the drop-downs on immediate load because of the way "pinned filters" work.

So the solution was to replace all the repeated HTML chunks with 1 JSON string and then a piece of Javascript template rendering. So, in the Django template code instead of:


{% for field in filters %}
  {% include "lists/_filter_group.html" with advanced=1 prefix="filter" pinable=1 %}
{% endfor %}

We now replace this with:


<script>
var FILTERS = {% filterset_to_json filters with advanced=1 prefix="filter" pinable=1 %}
</script>
<script id="filter_group" type="text/html">
<section class="filter-group {{ field.cls }}" data-name="{{ field.key }}">
  <h5 class="category-title">
    {{ _field_name_lower }}
    {{# field.switchable }}
    ...

What that basically is is some Mustache code that I use to generate the HTML DOM nodes and insert into the page after load.

In conclusion

So basically nothing changes. Nothing of the Django view had to change. Visually there's no difference and the same actual user data is sent from the server to the client but just packed in a more optimal way.

There are multiple pages where these massive "Advanced Filtering" options exist but on one page I measured the whole page went from weighing 1.1Mb down to 132Kb.

HTML Tree on Hacker News

May 18, 2014
0 comments Web development, JavaScript

On Friday I did a Show HN and got featured on the front page for HTML Tree.

Amazingly, out of the 3,858 visitors (according to Google Analytics today) 2,034 URLs were submitted and tested on the app. Clearly a lot of people just clicked the example submission but out of those 1,634 were unique. Granted, some people submitted more than one URL but I think a large majority of people came up with a URL of their own to try. Isn't that amazing! What a turnout of a Friday afternoon hack (with some Sunday night hacking to make it into a decent looking website).

The lesson to learn here is that the Hacker News crowd is excellent for getting engagement. Yes, there are a lot of blather and almost repetitive submissions but by and large it's a very engaging community. Suck on that those who make fun of HN!

UPDATE

HTMLTree has been moved to https://htmltree.herokuapp.com/

GitHub PR triage across multiple projects

April 28, 2014
0 comments Web development, JavaScript, Mozilla

I have now closed issue #2 on github-pr-triage. So, now you can have a dashboad of every GitHub project whose pull requests you care about.

The only format of using just 1 repo works too. E.g. /owner/project) and should hopefully not break anybody's bookmarks. The new format for having multiple repos across (possibly) multiple owners is like this:

owner1:projectA,projectB;owner2:projectX,projectY,projectZ

See screenshot:

To set yours up, here's a running instance available on https://prs.paas.allizom.org

Grymt - because I didn't invent Grunt here

April 18, 2014
3 comments Python, Web development, JavaScript

grymt is a python tool that takes a directory full of .html, .css and .js and prepares the html for optimial production use.

For a teaser:

Look at the "input"
Look at the "output" (Note! You have to right-click and view source)

So why did I write my own tool and not use Grunt?!

Glad you asked! The reason is simple: I couldn't get Grunt to work.

Grunt is a framework. It's a place where you say which "recipes" to execute and how. It's effectively a common config framework. Like make.
However, I tried to set up a bunch of recipes in my Gruntfile.js and most of them worked well individually but it was a hellish nightmare to get it all to work together just the way I want it.

For example, the grunt-contrib-uglify is fine for doing the minification but it doesn't work with concatenation and it doesn't deal with taking one input file and outputting to a different file.
Basically, I spent two evenings getting things to work but I could never get exactly what I wanted. So I wrote my own and because I'm quite familiar with this kind of stuff, I did it in Python. Not because it's better than Node but just because I had it near by and was able to quicker build something.

So what sweet features do you get out of grymt?

You can easily make an output file have a hash in the filename. E.g. vendor-$hash.min.js becomes vendor-64f7425.min.js and thus the filename is always unique but doesn't change in between deployments unless you change the files.
It automatically notices which files already have been minified. E.g. no need to minify somelib.min.js but do minify otherlib.js.
You can put $git_revision anywhere in your HTML and this gets expanded automatically. For example, view the source of buggy.peterbe.com and look at the first 20 lines.
Images inside CSS get rewritten to have unique names (based on files' modified time) so they can be far-future cached aggresively too.
You never have to write down any lists of file names in soome Gruntfile.js equivalent file
It copies ALL files from a source directory. This is important in case you have something like this inside your javascript code: $('<img>').attr('src', 'picture.jpg') for example.
You can chose to inline all the minified and concatenated CSS or javascript. Inlining CSS is neat for single page apps where you have a majority of primed cache hits. Instead of one .html and one .css you get just one .html and the amount of bytes is the same. Not having to do another HTTP request can save a lot of time on web performance.
The generated (aka. "dist" directory) contains everything you need. It does not refer back to the source directory in any way. This means you can set up your apache/nginx to point directly at the root of your "dist" directory.

So what's the catch?

It's not Grunt. It's not a framework. It does only what it does and if you want it to do more you have to work on grymt itself.
The files you want to analyze, process and output all have to be in a sub directory.
Look at how I've laid out the files here in this project for example. ALL files that you need is all in one sub-directory called app. So, to run grymt I simply run: grymt app.
The HTML files you throw into it have to be plain HTML files. No templates for server-side code.

How do you use it?

pip install grymt

Then you need a directory it can process, e.g ./client/ (assumed to contain a .html file(s)).

grymt ./client

For more options, check out

grymt --help

What's in the future of grymt?

If people like it and want to add features, I'm more than happy to accept pull requests. Some future potential feature work:

I haven't needed it immediately, yet, myself, but it would be nice to add things like coffeescript, less, sass etc into pre-processing hooks.
It would be easy to automatically generate and insert a reference to a appcache manifest. Since every file used and mentioned is noticed, we could very accurately generate an appcache file that is less prone to human error.
Spitting out some stats about number bytes saved and number of files reduced.

Buggy - A sexy Bugzilla offline webapp

March 13, 2014
1 comment Web development, Mozilla, JavaScript

Buggy is a singe-page webapp that relies entirely on the Bugzilla Native REST API. And it works offline. Sort of. I say "sort of" because obviously without a network connection you're bound to have outdated information from the bugzilla database but at least you'll have what you had when you went offline.

When you post a comment from Buggy, the posted comment is added to an internal sync queue and if you're online it immediately processes that queue. There is, of course, always a risk that you might close a bug when you're in a tunnel or on a plane without WiFi and when you later get back online the sync fails because of some conflict.

The reason I built this was partly to scratch an itch I had ("What's the ideal way possible for me to use Bugzilla?") and also to experiment with some new techniques, namely AngularJS and localforage.

So, the way it works is:

You pick your favorite product and components.
All bugs under these products and components are downloaded and stored locally in your browser (thank you localforage).
When you click any bug it then proceeds to download its change history and its comments.
Periodically it checks each of your chosen product and components to see if new bugs or new comments have been added.
If you refresh your browser, all bugs are loaded from a local copy stored in your browser and in the background it downloads any new bugs or comments or changes.
If you enter your username and password, an auth token is stored in your browser and you can thus access secure bugs.

Pros and cons

The main advantage of Buggy compared to Bugzilla is that it's fast to navigate. You can instantly filter bugs by status(es), components and/or by searching in the bug summary.

The disadvantage of Buggy is that you can't see all fields, file new bugs or change all fields.

The code

The code is of course open source. It's available on https://github.com/peterbe/buggy and released under a MPL 2 license.

The code requires no server. It's just an HTML page with some CSS and Javascript.

Everything is done using AngularJS. It's only my second AngularJS project but this is also part of why I built this. To learn AngularJS better.

Much of the inspiration came from the CSS framework Pure and one of their sample layouts which I started with and hacked into shape.

The deployment

Because Buggy doesn't require a server, this is the very first time I've been able to deploy something entirely on CDN. Not just the images, CSS and Javascript but the main HTML page as well. Before I explain how I did that, let me explain about the make.py script.

I really wanted to use Grunt but it just didn't work for me. There are many positive things about Grunt such as the ease with which you can easily add plugins and I like how you just have one "standard" file that defines how a bunch of meta tasks should be done. However, I just couldn't get the concatenation and minification and stuff to work together. Individually each tool works fine, such as the grunt-contrib-uglify plugin but together none of them appeared to want to work. Perhaps I just required too much.

In the end I wrote a script in python that does exactly what I want for deployment. Its features are:

Hashes in the minified and concatenated CSS and Javascript files (e.g. vendor-8254f6b.min.js)
Custom names for the minified and concatenated CSS and Javascript files so I can easily set far-future cache headers (e.g. /_cache/vendor-8254f6b.min.js)
Ability to fold all CSS minified into the HTML (since there's only one page, theres little reason to make the CSS external)
A Git revision SHA into the HTML of the generated ./dist/index.html file
All files in ./client/static/ copied intelligently into ./dist/static/
Images in CSS to be given hashes so they too can have far-future cache headers

So, the way I have it set up is that, on my server, I have a it run python make.py and that generates a complete site in a ./dist/ directory. I then point Nginx to that directory and run it under http://buggy-origin.peterbe.com. Then I set up a Amazon Cloudfront distribution to that domain and then lastly I set up a CNAME for buggy.peterbe.com to point to the Cloudfront distribution.

The future

I try my best to maintain a TODO file inside the repo. That's where I write down things to come. (it's also works as a changelog) since I also use this file to write down what's been done.

One of the main features I want to add is the ability to add bugs that are outside your chosen products and components. It'll be a "fake" component called "Misc". This is for bugs outside the products and components you usually monitor and work in but perhaps bugs you've filed or been assigned to. Or just other bugs you're interested in in general.

Another major feature to work on is the ability to choose to see more fields and ability to edit these too. This will require some configuration on the individual users' behalf. For example, some people use the "Target Milestone" a lot. Some use the "Importance" a lot. So, some generic solution is needed to accomodate all these non-basic fields.

And last but not least, the Bugzilla team here at Mozilla is working on a very exciting project that allows you to register a certain list of bugs with a WebSocket and have it push to you as soon as these bugs change. That means that I won't have to periodically query bugzilla every 30 seconds if certain bugs have changed but instead get instant notifications when they do. That's going to be major! I confidently speculate that that will be implemented some time summer this year.

Give it a go. What are you waiting for? :) Go to http://buggy.peterbe.com/, pick your favorite products and components and try to use it for a week.

Github Pull Request Triage tool

March 6, 2014
0 comments Web development, AngularJS

Last week I built a little tools called github-pr-triage. It's a single page app that sits on top of the wonderful GitHub API v3.

Its goal is to try to get an overview of what needs to happen next to open pull requests. Or rather, what needs to happen next to get it closed. Or rather, who needs to act next to get it closed.

It's very common, at least in my team, that someone puts up a pull request, asks someone to review it and then walks away from it. She then doesn't notice that perhaps the integrated test runner fails on it and the reviewer is thinking to herself "I'll review the code once the tests don't fail" and all of a sudden the ball is not in anybody's court. Or someone makes a comment on a pull request that the author of the pull requests misses in her firehose of email notifictions. Now she doesn't know that the comment means that the ball is back in her court.

Ultimately, the responsibility lies with the author of the pull request to pester and nag till it gets landed or closed but oftentimes the ball is in someone elses court and hopefully this tool makes that clearer.

Here's an example instance: https://prs.paas.allizom.org/mozilla/socorro

Currently you can use prs.paas.allizom.org for any public Github repo but if too many projects eat up all the API rate limits we have I might need to narrow it down to use mozilla repos. Or, you can simply host your own. It's just a simple Flask server

About the technology

I'm getting more and more productive with Angular but I still consider myself a beginner. Saying that also buys me insurance when you laugh at my code.

So it's a single page app that uses HTML5 pushState and an angular $routeProvider to make different URLs.

The server simply acts as a proxy for making queries to api.github.com and bugzilla.mozilla.org/rest and the reason for that is for caching.

Every API request you make through this proxy gets cached for 10 minutes. But here's the clever part. Every time it fetches actual remote data it stores it in two caches. One for 10 minutes and one for 24 hours. And when it stores it for 24 hours it also stores its last ETag so that I can make conditional requests. The advantage of that is you quickly know if the data hasn't changed and more importantly it doesn't count against you in the rate limiter.