Peterbe.com

Peter Bengtsson's blog

Filtered by JavaScript, Python

Autocompeter 1.1.8 and smooth typing

April 6, 2015
0 comments Go, JavaScript

One of the most constructive pieces of feedback I got when Autocompeter was on Hacker News was that when you type something with lots of predictable results the results overlay would "flicker".

E.g. you type "javascript" at a nice and steady pace and the overlay would shrink and grow and shrink and grow very rapidly.

The reason it happened was a bug in the JavaScript code that filtered results whilst waiting for the next AJAX request from the latest typed character. E.g. you type "ash" and the results come back with "ashley", "ashes", "Ashford". Then you add an "l" so now we start a new AJAX query for "ashl" and whilst waiting for that output from the server we can start filtering out "ashes" and "Ashford" because we can pre-emptively know that those won't be in the new result set.

The bug was a bad function that filtered the existing results on a second rendering whilst waiting for the next AJAX. It was easy to fix and this is included in version 1.1.8.
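The pre-emptive filtering idea can be sketched roughly like this. This is not Autocompeter's actual code (that is JavaScript); the function name `filter_pending` and the rule "a result matches if any of its words starts with the typed term" are assumptions made for illustration.

```python
def filter_pending(results, term):
    """Keep only the already-displayed results that can still match `term`.

    Assumption: a result matches if any of its words starts with the
    typed term, compared case-insensitively.
    """
    needle = term.lower()
    return [
        title for title in results
        if any(word.startswith(needle) for word in title.lower().split())
    ]


# Results that came back for "ash"; the user has now typed "ashl".
displayed = ["ashley", "ashes", "Ashford"]
still_valid = filter_pending(displayed, "ashl")
print(still_valid)  # ['ashley']
```

The filtered list is always a subset of what is already on screen, so it can be shown immediately and then replaced when the real response for "ashl" arrives.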

The reason I failed to notice this was that I had inserted some necessary optimizations for when the network latency was very very slow but hadn't tested it in a realistic network latency environment. E.g. a decent DSL connection but nevertheless something more advanced than just connecting to localhost.

Autocompeter.com

April 2, 2015
8 comments Go, JavaScript

About a year ago I found a great article about how to use Redis to index every prefix of every word as an index and thus a super fast way of building an autocomplete service. The idea is that you take all your titles and index them like this; if the title is "My Title" you store a key for m, my, t, ti, tit, titl and title. That means you can do very fast lookups as someone is typing unfinished words.
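As a minimal sketch of that prefix idea, here is an in-memory version in Python. It uses a plain dictionary instead of Redis, and the helper names `prefixes` and `build_index` are made up for this example, so treat it as an illustration of the technique rather than how the service is implemented.

```python
from collections import defaultdict


def prefixes(title):
    """Yield every prefix of every word in the title, lowercased."""
    for word in title.lower().split():
        for end in range(1, len(word) + 1):
            yield word[:end]


def build_index(titles):
    """Map each prefix to the set of titles that have a word starting with it."""
    index = defaultdict(set)
    for title in titles:
        for prefix in prefixes(title):
            index[prefix].add(title)
    return index


print(sorted(set(prefixes("My Title"))))
# ['m', 'my', 't', 'ti', 'tit', 'titl', 'title']

index = build_index(["My Title", "Another Title", "Mystery"])
print(sorted(index["my"]))
# ['My Title', 'Mystery']
```

A lookup is then a single key fetch per typed word, which is what makes it fast while someone is still typing.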

Anyway. I was running this merrily here on my personal blog but I liked it so much and I wanted to use it on another site for work that I thought it'd be time to extract it into its own little microservice. All I needed was a name and my friend and colleague jezdez suggested I call it "autocompeter". So that it became.

The original implementation was written in Python and at the time I was learning Go and was eager to have something to build in Go. So I built this microservice in Go using the Negroni web framework.

The idea is that you own and run a website. You have a search feature on your website but you don't have a nifty autocomplete (aka. live search) thing on it. So, you send me all your titles, URLs and optionally their "popularity ranking" (basically a score). I'll index them on autocompeter.com under your domain. You have to sign in with GitHub to set up an API Auth Key.

Then, you put this into your HTML:


<script src="//cdn.jsdelivr.net/autocompeter/1/autocompeter.min.js"></script>
<script>
Autocompeter(document.querySelector('input[name="q"]'));
</script>

Also, you'll need to download the CSS and put it into your site. I don't recommend pointing to a CDN for CSS.

And that's all you have to do. The REST API, the options for the JavaScript integration and the CSS integration are all covered in the documentation.

The JavaScript is framework-free, meaning it's just pure DOM manipulation, and works in IE and modern browsers. The minified file is only 4.2Kb (2Kb gzipped).

All code is Open Source under a BSD license. Everything is free but there's no SLA as of yet.

I'm going to be blogging more and more about feature development, benchmarks and other curious things I learn developing this further.

Bye bye httpretty. Welcome back good old mock!

March 19, 2015
4 comments Python

After a day of pushing 9 commits to a PR to finally get Travis to build a simple Python package on Python 2.6, 2.7, 3.3 and 3.4 I finally gave up and ripped out all of httpretty and replaced it with good old mock.patch().

I was getting all sorts of strange warnings in py3.3 and 3.4 got stuck all the time.
This is not the first time httpretty has been causing confusion so from now on I'm giving up on httpretty. I think it was too good to be true to work reliably. Honestly, it might be Python's fault for not making itself more available to cool libs like httpretty.

By the way, here's one of those errors where Python 3.4 just hangs which stopped being the case once I took out httpretty. And here you can see the clear failure to deactivate the monkeypatch even after the test is complete in Python 3.3.
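For anyone wondering what the mock.patch() approach looks like, here is a minimal sketch. It assumes Python 3 (where mock lives in the standard library as `unittest.mock`; on Python 2.6/2.7 it was the separate `mock` package), and `fetch_name` and the example.com URL are hypothetical stand-ins, not code from the package in question.

```python
import json
import urllib.request
from unittest import mock


def fetch_name(url):
    """Hypothetical function under test: GET a URL and return its 'name' field."""
    with urllib.request.urlopen(url) as response:
        return json.loads(response.read().decode("utf-8"))["name"]


def test_fetch_name():
    # A fake response object that behaves like a context manager.
    fake = mock.MagicMock()
    fake.read.return_value = b'{"name": "Peter"}'
    fake.__enter__.return_value = fake

    # Inside this block no real network traffic happens.
    with mock.patch("urllib.request.urlopen", return_value=fake) as mocked:
        assert fetch_name("https://example.com/user") == "Peter"

    mocked.assert_called_once_with("https://example.com/user")


test_fetch_name()
```

The patch is undone as soon as the `with` block exits, so nothing stays monkeypatched after the test.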

Median size of Javascript libs on jsDelivr

February 24, 2015
0 comments JavaScript

If you haven't heard of jsDelivr then you've missed out on a great project! It's basically a free CDN to use for Open Source projects. I added my own yesterday and it was as easy as making a pull request with the initial file, some metadata and a file that tells them where to pick up new versions from (e.g. GitHub).

Anyway, they now host A LOT of files. 8,941 to be exact (1,927 unique file names), at the time of writing. So I thought I'd check out what the median size is.

The median size is: 7.4Kb

The average is 28.6Kb and the standard deviation is 73Kb! So we can basically ignore the average and just focus on the median.
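To see why a few huge files make the average misleading, here is a small sketch. The sizes below are made-up numbers for illustration, not the jsDelivr data.

```python
import statistics

# Hypothetical file sizes in Kb: mostly small files plus two large outliers.
sizes_kb = [2.1, 3.5, 4.2, 6.0, 7.4, 8.8, 12.0, 150.0, 400.0]

median = statistics.median(sizes_kb)
mean = statistics.mean(sizes_kb)
spread = statistics.pstdev(sizes_kb)

print("median: %.1f Kb" % median)  # median: 7.4 Kb
print("mean:   %.1f Kb" % mean)
print("stdev:  %.1f Kb" % spread)
```

Two outliers are enough to drag the mean far above what a typical file looks like, while the median barely notices them.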

Check out the histogram

That's pretty big! If you exclude those bigger than 100Kb the median shrinks to 6.5Kb. Still pretty big.

I'm proud to say that my own is only 50% the size of the median size.

Best non-cryptographic hashing function in Python (size and speed)

February 21, 2015
11 comments Python

First of all; hashing is hard. But fortunately it gets a little bit easier if it doesn't have to be cryptographic. A non-cryptographic hashing function is basically something that takes a string and converts it to another string in a predictable fashion and it tries to do it with as few clashes as possible and as fast as possible.

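MD5 was designed as a cryptographic hashing function but it has long been considered broken for security purposes, so these days it's more sensibly used as a non-cryptographic one. Unlike things like sha256 or sha512, you can't rely on MD5 to resist deliberately constructed collisions.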

Now, how do you make a hashing function that yields a string that is as short as possible? The simple answer is to make the output use as many different characters as possible. If a hashing function only returns integers you only have 10 possible values per character. If you instead use a-z and A-Z and 0-9 you now have 26 + 26 + 10 possible values per character.

A hex on the other hand only uses 0-9 and a-f which is only 10 + 6 possible values. So you need a longer string to be sure it's unique and can't clash with another hash output. Git for example uses a 40 character long hex string to represent a git commit. GitHub is using an abbreviated version of that in some of the web UI of only 7 characters which they get away with because things are often in a context of a repo name or something like that. For example github.com/peterbe/django-peterbecom/commit/462ae0c
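To put numbers on the alphabet-size argument, here is a small sketch that works out how many characters each alphabet needs to cover the same number of distinct values. The 128-bit target is an assumption picked because that is the size of an MD5 digest.

```python
import string


def chars_needed(alphabet_size, bits=128):
    """Smallest output length whose alphabet can represent 2**bits distinct values."""
    length = 0
    capacity = 1
    while capacity < 2 ** bits:
        capacity *= alphabet_size
        length += 1
    return length


alphabets = {
    "digits (0-9)": len(string.digits),                          # 10
    "hex (0-9a-f)": 16,
    "a-z, A-Z, 0-9": len(string.ascii_letters + string.digits),  # 62
}
for name, size in alphabets.items():
    print(name, chars_needed(size))
# digits (0-9) 39
# hex (0-9a-f) 32
# a-z, A-Z, 0-9 22
```

So the same amount of information that takes 32 hex characters fits in 22 characters once upper and lower case letters are allowed.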

So, what other choices do you have when it comes to returning a hash output that is sufficiently long that it's "almost guaranteed" to be unique but sufficiently short that it becomes practical in terms of storage space? I have an app for example that turns URLs into unique IDs because they're shorter that way and more space efficient to store as values in a big database. One such solution is to use a base64 encoding.

Base64 uses a-z, A-Z and 0-9 (plus "+" and "/") but you'll notice it doesn't have the "hashing" nature in that it's just a direct translation of the input, a few characters at a time. E.g.


>>> base64.encodestring('peterbengtsson')
'cGV0ZXJiZW5ndHNzb24=\n'
>>> base64.encodestring('peterbengtsson2')
'cGV0ZXJiZW5ndHNzb24y\n'

I.e. these two strings are different, but if you were to take only the first 10 characters they would be the same. Basically, here's a terrible hashing function:


def hasher(s):  # this is not a good hashing function
    return base64.encodestring(s)[:10]
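The snippets above are Python 2 (`base64.encodestring` was removed in Python 3.9). A roughly equivalent sketch for Python 3, with `bad_hasher` as a made-up name, shows the clash directly:

```python
import base64


def bad_hasher(s):
    """Not a real hash: just the first 10 characters of the base64 encoding."""
    return base64.b64encode(s.encode("utf-8"))[:10].decode("ascii")


print(bad_hasher("peterbengtsson"))   # cGV0ZXJiZW
print(bad_hasher("peterbengtsson2"))  # cGV0ZXJiZW
```

Any two inputs that share their first few characters end up with the same output, which is exactly what a hashing function must not do.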

So, what we want is a hashing function that returns output that is short and very rarely clashing and does this as fast as possible.

To test this I wrote a script that tried a bunch of different ad-hoc hashing functions. I generate a list of 130,000+ different words with an average length of 15 characters. Then I loop over these words until a hashed output is repeated for a second time. And for each, I take the time it takes to generate the 130,000+ hashes and I multiply that with the total number of bytes. For example, if the hash output is 9 characters each in length that's (130000 * 9) / 1024 ~= 1142Kb. And if it took 0.25 seconds to generate all of those the combined score is 1142 * 0.25 ~= 286 kilobytes seconds.
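The clash check and the scoring arithmetic can be sketched like this. It is not the original benchmark script; `first_clash`, `size_kb` and `score` are names invented for this example.

```python
def first_clash(words, hasher):
    """Return the first word whose hash was already produced, or None."""
    seen = set()
    for word in words:
        digest = hasher(word)
        if digest in seen:
            return word
        seen.add(digest)
    return None


def size_kb(word_count, hash_length):
    """Total size of all hash outputs in Kb, at one byte per character."""
    return word_count * hash_length / 1024


def score(word_count, hash_length, seconds):
    """Size multiplied by time, in 'kilobytes seconds'. Lower is better."""
    return size_kb(word_count, hash_length) * seconds


# A deliberately bad hasher (first letter only) clashes almost immediately.
print(first_clash(["apple", "banana", "avocado"], lambda w: w[:1]))  # avocado

print("%.1f" % size_kb(130000, 9))      # 1142.6
print(round(score(130000, 9, 0.25)))    # 286
```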

Anyway, here are the results:

h11  100.00  0.217s  1184.4 Kb  257.52 kbs
h6   100.00  1.015s   789.6 Kb  801.52 kbs
h10  100.00  1.096s   789.6 Kb  865.75 kbs
h1   100.00  0.215s  4211.2 Kb  903.46 kbs
h4   100.00  1.017s   921.2 Kb  936.59 kbs

(kbs means "kilobytes seconds")

These are the functions that returned 0 clashes amongst 134,758 unique words. There were others too that I'm not bothering to include because they had clashes. So let's look at these functions:



import hashlib

# Note: this is Python 2 code; str.encode('base64') does not exist in Python 3.

def h11(w):
    return hashlib.md5(w).hexdigest()[:9]

def h6(w):
    h = hashlib.md5(w)
    return h.digest().encode('base64')[:6]

def h10(w):
    h = hashlib.sha256(w)
    return h.digest().encode('base64')[:6]

def h1(w):
    return hashlib.md5(w).hexdigest()

def h4(w):
    h = hashlib.md5(w)
    return h.digest().encode('base64')[:7]
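For reference, here is a sketch of how the same five functions might look in Python 3, where hashlib wants bytes and the 'base64' string codec is gone. The `_b64` helper and the choice of UTF-8 are assumptions of this sketch; the timings above were not measured with this version.

```python
import base64
import hashlib


def _b64(digest_bytes, length):
    """Base64-encode raw digest bytes and keep the first `length` characters."""
    return base64.b64encode(digest_bytes)[:length].decode("ascii")


def h11(w):
    return hashlib.md5(w.encode("utf-8")).hexdigest()[:9]


def h6(w):
    return _b64(hashlib.md5(w.encode("utf-8")).digest(), 6)


def h10(w):
    return _b64(hashlib.sha256(w.encode("utf-8")).digest(), 6)


def h1(w):
    return hashlib.md5(w.encode("utf-8")).hexdigest()


def h4(w):
    return _b64(hashlib.md5(w.encode("utf-8")).digest(), 7)


for func in (h11, h6, h10, h1, h4):
    print(func.__name__, func("peterbengtsson"))
```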

It's kinda arbitrary to say the "best" one is the one that takes the shortest time multiplied by size. Perhaps the size matters more to you; in that case the h6() function is better because it returns 6 character strings instead of the 9 character strings of h11().

I'm apprehensive about publishing this blog post because I bet I'm doing this entirely wrong. Perhaps there are better ways to digest a hashing function that returns strings that don't need to be base64 encoded. I just haven't found any in the standard library yet.

Almost premature optimization

January 2, 2015
0 comments Python, Web development, Django

In airmozilla the tests almost all derive from one base class whose tearDown deletes the automatically generated settings.MEDIA_ROOT directory and everything in it.

Then there's some code that makes sure a certain thing from the fixtures has a picture uploaded to it.

That means it has to do that shutil.rmtree(directory) and that shutil.copy(src, dst) on almost every single test. Some might not even need or depend on it but it's convenient to put it here.

Anyway, I thought this is all a bit excessive and I could probably optimize that by defining a custom test runner that is first responsible for creating a clean settings.MEDIA_ROOT with the necessary file in it and secondly, when the test suite ends, it deletes the directory.

But before I write that, let's measure how many gazillion milliseconds this is chewing up.

Basically, the tearDown was called 361 times and the _upload_media 281 times. In total, this adds up to a whopping 0.21 seconds! (of the total of 69.133 seconds it takes to run the whole thing).
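One way to take that kind of measurement is a small decorator that accumulates call counts and elapsed time. This is a sketch under assumptions, not the code used in airmozilla: `timed`, `upload_media` and `tear_down` are stand-in names and the loop just simulates a test run against a temporary directory.

```python
import os
import shutil
import tempfile
import time
from collections import defaultdict
from functools import wraps

# function name -> [number of calls, total seconds spent]
timings = defaultdict(lambda: [0, 0.0])


def timed(func):
    """Accumulate call count and wall-clock time for `func` in `timings`."""
    @wraps(func)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        try:
            return func(*args, **kwargs)
        finally:
            record = timings[func.__name__]
            record[0] += 1
            record[1] += time.perf_counter() - start
    return wrapper


@timed
def upload_media(src, media_root):
    os.makedirs(media_root, exist_ok=True)
    shutil.copy(src, media_root)


@timed
def tear_down(media_root):
    shutil.rmtree(media_root)


# Simulate a small test run against throwaway temporary paths.
workdir = tempfile.mkdtemp()
fixture = os.path.join(workdir, "picture.jpg")
with open(fixture, "wb") as f:
    f.write(b"\x00" * 1024)
media_root = os.path.join(workdir, "media")

for _ in range(50):
    upload_media(fixture, media_root)
    tear_down(media_root)

shutil.rmtree(workdir)
for name, (calls, seconds) in sorted(timings.items()):
    print("%s: %d calls, %.4f seconds" % (name, calls, seconds))
```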

I think I'll cancel that optimization idea. Doing some light shutil operations is dirt cheap.

AJAX or not

December 22, 2014
1 comment Web development, AngularJS, JavaScript

From the It-Depends-on-What-You're-Building department.

As a web developer you have a job:

Display a certain amount of database data on the screen
Do it as fast as possible

The first point is these days easily taken care of with the likes of Django or Rails which make it über easy to write queries that you then use in templates to generate the HTML and voila you have a web page.

The second point is taken care of with a myriad of techniques. It's almost a paradox. The fastest way to render something on the screen is to generate everything on the server and send it whole. It means the browser can very quickly (and boosted by GPU) render something on the screen. But if you have a lot of data that needs to be displayed it's often better to send just a little bit of HTML and then let some JavaScript kick in and take care of extracting the rest of the information using AJAX.

Here I have prepared three different versions of ways to display a bunch of information on the screen:

https://www.peterbe.com/ajaxornot/

What you should note and take away from this little experimental playground:

All server-side work is done in Django but it's served straight out of memcache so it should be fast server-side.
The content is NOT important. It's just a list of blog posts and their categories and keywords.
To make it somewhat realistic, each version needs to 1) display a JPG and 2) have a Javascript onclick event that throws a confirm() dialog box.
The AngularJS version loads significantly slower but it's not because AngularJS is slow, but because it's able to do so much more later. Loading a Javascript framework is like an investment. Big cost upfront and small cost later when you need more magic to happen without having a complete server refresh.
View 1, 2 and 3 are all imperfect versions but they illustrate the three major approaches to solving the problem stated at the top of this blog post. The other views are attempts at optimizations.
Clearly the "visually fastest" version is the optimization version 5 which is a fork of version 2 which loads, on the server-side, everything that is above the fold and then takes care of the content below the fold with AJAX.
See this visual comparison
Optimization version 4 was a silly optimization. It depends on the fact that JSON is more "compact" than HTML. When you Gzip the content, the difference in size doesn't matter anymore. However, it's an interesting technique because it means you can do all business logic rendering stuff in one language without having to depend on AJAX.
Open the various versions in your browser and try to "feel" how the pages load. Ask your inner guttural heart which version you prefer; do you prefer a completely blank screen and a browser loading spinner or do you prefer to see some skeleton structure first whilst waiting for the bulk content to come in?
See this as a basis of thoughts and demonstration. Remember the very first sentence in this blog post.

One-way data bindings in AngularJS 1.3

December 11, 2014
1 comment AngularJS, JavaScript

You might have heard that AngularJS 1.3 has "one-time bindings" which is that you can print the value of a scope variable with {{ ::somevar }} and that this is really good for performance because it means that once rendered it doesn't add to the list of things that the angular app needs to keep worrying about. I.e. it's one less thing to watch.

But what's a good use case of this? This is a good example.

Because ng-if="true" will cause the DOM element to be re-created it will go back to the scope variable and re-evaluate it.

To then() or to success() in AngularJS

November 27, 2014
19 comments AngularJS, JavaScript

By writing this I'm taking a risk of looking like an idiot who has failed to read the docs. So please be gentle.

AngularJS uses a promise module called $q. It originates from this beast of a project.

You use it like this for example:


angular.module('myapp')
.controller('MainCtrl', function($scope, $q) {
  $scope.name = 'Hello ';
  var wait = function() {
    var deferred = $q.defer();
    setTimeout(function() {
      // Reject 3 out of 10 times to simulate 
      // some business logic.
      if (Math.random() > 0.7) deferred.reject('hell');
      else deferred.resolve('world');
    }, 1000);
    return deferred.promise;
  };

  wait()
  .then(function(rest) {
    $scope.name += rest;
  })
  .catch(function(fallback) {
    $scope.name += fallback.toUpperCase() + '!!';
  });
});

Basically you construct a deferred object and return its promise. Then you can expect the .then and .catch to be called back if all goes well (or not).

There are other ways you can use it too but let's stick to the basics to drive home this point to come.

Then there's the $http module. It's where you do all your AJAX stuff and it's really powerful. However, it uses an abstraction of $q and because it is an abstraction it renames what it calls back. Instead of .then and .catch it's .success and .error and the arguments you get are different. Both expose a catch-all function called .finally. You can, if you want to, bypass this abstraction and do what the abstraction does yourself. So instead of:


$http.get('https://api.github.com/users/peterbe/gists')
.success(function(data) {
  $scope.gists = data;
})
.error(function(data, status) {
  console.error('Repos error', status, data);
})
.finally(function() {
  console.log("finally finished repos");
});

...you can do this yourself...:


$http.get('https://api.github.com/users/peterbe/gists')
.then(function(response) {
  $scope.gists = response.data;
})
.catch(function(response) {
  console.error('Gists error', response.status, response.data);
})
.finally(function() {
  console.log("finally finished gists");
});

It's like it's built specifically for doing HTTP stuff. The $q module doesn't know that the response body, the HTTP status code and the HTTP headers are important.

However, there's a big caveat. You might not always know you're doing AJAX stuff. You might be using a service from somewhere and you don't care how it gets its data. You just want it to deliver some data. For example, suppose you have an AJAX request cached so that only the first time it needs to do an HTTP GET but all subsequent times you can use the stuff already in memory. E.g. something like this:


angular.module('myapp')
.controller('MainCtrl', function($scope, $q, $http, $timeout) {

  $scope.name = 'Hello ';
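  var name = null;  // cached between calls to getName()
  var getName = function() {
    var deferred = $q.defer();
    if (name !== null) {
      // Already fetched once; resolve from memory without any AJAX.
      deferred.resolve(name);
      return deferred.promise;
    }
    $http.get('https://api.github.com/users/peterbe')
    .success(function(data) {
      name = data.name;
      deferred.resolve(name);
    }).error(deferred.reject);
    return deferred.promise;
  };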

  // Even though we're calling this 3 different times
  // you'll notice it only starts one AJAX request.
  $timeout(function() {
    getName().then(function(name) {
      $scope.name = "Hello " + name;
    });    
  }, 1000);

  $timeout(function() {
    getName().then(function(name) {
      $scope.name = "Hello " + name;
    });    
  }, 2000);

  $timeout(function() {
    getName().then(function(name) {
      $scope.name = "Hello " + name;
    });    
  }, 3000);
});

And with all the other promise frameworks lying around, like jQuery's, you will sooner or later forget if it's success() or then() or done() and your goldfish memory (like mine) will cause confusion and bugs.

So is there a way to make $http.<somemethod> return a $q like promise but with the benefit of the abstractions that the $http layer adds?

Here's one such possible solution maybe:


var app = angular.module('myapp');

app.factory('httpq', function($http, $q) {
  return {
    get: function() {
      var deferred = $q.defer();
      $http.get.apply(null, arguments)
      .success(deferred.resolve)
      .error(deferred.reject);
      return deferred.promise;
    }
  }
});

app.controller('MainCtrl', function($scope, httpq) {

  httpq.get('https://api.github.com/users/peterbe/gists')
  .then(function(data) {
    $scope.gists = data;
  })
  .catch(function(data) {
    // A $q promise passes a single value to its callbacks,
    // so there is no separate status argument here.
    console.error('Gists error', data);
  })
  .finally(function() {
    console.log("finally finished gists");
  });
});

That way you get the benefit of one consistent way of handling all things that get you data some way or another and you get the nice AJAXy signatures you like.

This is just a prototype and clearly it's not generic to work with any of the shortcut functions in $http like .post(), .put() etc. That can maybe be solved with a Proxy object or some other hack I haven't had time to think of yet.

So, what do you think? Am I splitting hairs or is this something attractive?

A "perma search" in AngularJS

November 18, 2014
0 comments AngularJS, JavaScript

A common thing in many (AngularJS) apps is to have an ng-model input whose content is used as a filter on an ng-repeat somewhere within the page. Something like this:


<input ng-model="search">
<div ng-repeat="item in items | filter:search">...

Well, what if you want the search you make to automatically become part of the URL so that if you bookmark the search or copy the URL to someone else, the search is still there? It would be really practical. Granted, it's not always that you want this but that's something you can decide.

AngularJS 1.2 (I think) introduced the ability to set reloadOnSearch: false on a route provider and that means that you can do things like $location.hash('something') without it triggering the route provider to re-map the URL and re-start the relevant controller.

So here's a good example of (ab)using that to do a search filter which automatically updates the URL.

Check out the demo: https://www.peterbe.com/permasearch/index.html

This works in HTML5 mode too if you're wondering.

Suppose you use many more things in your filter function other than just a free text ng-model. Like this:


<input type="text" ng-model="filters.search">
<select ng-model="filters.year">
<option value="">All</option>
<option value="2014">2014</option>
<option value="2013">2013</option>
</select>

You might have some checkboxes and stuff too. All you need to do then is to encode that information in the hash. Something like this might be a good start:


$scope.filters = {};
$scope.$watchCollection('filters', function(value) {
    $location.hash($.param(value)); // a jQuery function
});

And something like this to "unparse" the params.