On Wednesday this week, I managed to get a link to Wish List Granted onto Hacker News. It had enough upvotes to be featured on the front page for a couple of hours. I'm very grateful for the added traffic but not quite so impressed with the ultimate conversions.
So that's a 1% conversion rate of people setting up a wish list. But it's kinda disappointing that nobody ever made a payment. Actually, one friend did make a payment, but he's a colleague and a friend, not a stranger who stumbled onto it from Hacker News.
Also, it's now been 3 days since those 43 wish lists were created and still no payments. That's kinda disappointing too.
I'm starting to fear that Wish List Granted is one of those ideas that people think is great but have no interest in actually using.
It's a mash-up using Amazon.com's Wish List functionality. You hook up your Amazon wish list onto wishlistgranted.com and pick one item. Then you share that page with friends and family and they can each contribute a small amount. When the full amount is reached, Wish List Granted will purchase the item and send it to you.
The Rules page has more details if you're interested.
The problem it tries to solve is that your friends want things, and even for a good friend you might be hesitant to spend $50 on a gift. I'm sure you can afford it, but if you have many friends it gets impractical. However, spending $5 is another matter. Hopefully Wish List Granted solves that problem.
Wish List Granted started as one of those late-night insomnia projects. I first wrote a scraper using pyQuery, then a couple of Django models and views, and then tied it up by integrating Balanced Payments. It was actually working on the first night. Flawed, but working start to finish.
When it all started, I used Persona to require people to authenticate to set up a Wish List. After some thought I decided to ditch that and use "email authentication", meaning they have to enter an email address and click a secure link I send them.
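The post doesn't show how those secure links are built, but in Django one way to do this kind of email authentication is with the signing framework. This is a minimal sketch; the URL, salt and helper names are made up, not taken from Wish List Granted:

from django.core import signing
from django.core.mail import send_mail

def send_login_link(email):
    # a signed, tamper-proof token that encodes the email address
    token = signing.dumps({'email': email}, salt='wishlist-auth')
    link = 'https://wishlistgranted.com/start/?token=%s' % token
    send_mail(
        'Set up your Wish List',
        'Click this link to continue: %s' % link,
        'noreply@wishlistgranted.com',
        [email],
    )

def email_from_token(token):
    # raises BadSignature/SignatureExpired if the token was tampered with or is too old
    data = signing.loads(token, salt='wishlist-auth', max_age=60 * 60 * 24)
    return data['email']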
One thing I'm very proud of about Wish List Granted is that it does NOT store any passwords, credit cards or personal shipping addresses. Despite being so void of personal data, I thought it'd look nicer if the whole site was served over HTTPS.
I looked around for Javascript libs that do automatic input formatting for credit card inputs.
The first one I found was formatter.js, which looked promising, but it weighs over 6KB minified and, when you apply it, the placeholder attribute on the input disappears.
So, in true software engineering fashion I wrote my own:
function cc_format(value) {
  // strip whitespace and any non-digit characters
  var v = value.replace(/\s+/g, '').replace(/[^0-9]/gi, '');
  // grab the first run of 4-16 digits
  var matches = v.match(/\d{4,16}/g);
  var match = (matches && matches[0]) || '';
  var parts = [];
  // chop the number into groups of 4
  for (var i = 0, len = match.length; i < len; i += 4) {
    parts.push(match.substring(i, i + 4));
  }
  if (parts.length) {
    return parts.join(' ');
  } else {
    return value;
  }
}
This has served me well over the last couple of years of using Django:
from django import forms
class _BaseForm(object):
    def clean(self):
        cleaned_data = super(_BaseForm, self).clean()
        for field in cleaned_data:
            if isinstance(cleaned_data[field], basestring):
                cleaned_data[field] = (
                    cleaned_data[field].replace('\r\n', '\n')
                    .replace(u'\u2018', "'").replace(u'\u2019', "'").strip()
                )
        return cleaned_data


class BaseModelForm(_BaseForm, forms.ModelForm):
    pass


class BaseForm(_BaseForm, forms.Form):
    pass
So instead of doing...
class SignupForm(forms.Form):
    name = forms.CharField(max_length=100)
    nick_name = forms.CharField(max_length=100, required=False)
...you do:
class SignupForm(BaseForm):
    name = forms.CharField(max_length=100)
    nick_name = forms.CharField(max_length=100, required=False)
What it does is make sure that any form field that takes a string has its leading and trailing whitespace stripped. It also replaces the strange "curved" apostrophes that Microsoft Windows sometimes uses.
Yes, this might all seem trivial, and I'm sure there's something as good or better out there, but isn't it nice to never have to worry about doing things like this again:
class SignupForm(forms.Form):
    ...

    def clean_name(self):
        return self.cleaned_data['name'].strip()

# ...or...

form = SignupForm(request.POST)
if form.is_valid():
    name = form.cleaned_data['name'].strip()
No, this is not about the new JSON type added in Postgres 9.2. This is about how best to get a record set from a Postgres database into a JSON string using Python.
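The original snippet isn't reproduced here, but the straightforward psycopg2 version looks something like this (the table and column names are made up for illustration):

import json
import psycopg2

connection = psycopg2.connect('dbname=talks')  # assumed connection details
cursor = connection.cursor()
cursor.execute("""
    SELECT id, topic, duration
    FROM talks
""")
# repeat the column names to turn each tuple into a dict
records = [
    {'id': row[0], 'topic': row[1], 'duration': row[2]}
    for row in cursor.fetchall()
]
print(json.dumps(records))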
This is plain and nice but it's kinda annoying that you have to write down the columns you're selecting twice.
Also, it's annoying that you have to convert the results of fetchall() into a list of dicts in an extra loop.
So, there's a trick to the rescue! You can use the cursor_factory parameter. See below:
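Something along these lines, using psycopg2.extras.RealDictCursor (again with made-up table and column names):

import json
import psycopg2
import psycopg2.extras

connection = psycopg2.connect('dbname=talks')
cursor = connection.cursor(cursor_factory=psycopg2.extras.RealDictCursor)
cursor.execute("""
    SELECT id, topic, duration
    FROM talks
""")
# every row is already a dict keyed by column name
print(json.dumps(list(cursor.fetchall())))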
Isn't that much nicer? It's shorter and only lists the columns once.
But is it much faster? Sadly, no. I ran various benchmarks comparing different ways of doing this and basically concluded that there's no significant difference. The latter one, using RealDictCursor, is around 5% faster. But I suspect most of the time in the benchmark is spent on things (the I/O) that don't differ between the versions.
Anyway. It's a keeper. I think it just looks nicer.
Having moved hugepic.io over to my new EC2 server, I now have all my working sites under one server.
If I list all the sites in /etc/nginx/sites-enabled/ I count 14 sites, this blog being one of them. More are listed here.
All but one of these services are Python. The exception is a Node server. About half of the Python services are Django and the other half are Tornado. There are four databases (Postgres, Redis, Memcache, MongoDB) and two message queues (RabbitMQ and Python RQ).
I have this little script called ps_mem.py which does a decent job of summarizing how much memory all of these take. Its output currently looks like this:
It's sorted by "RAM used" and I've only shown the bottom 7 here.
Anyway, 3.3GB to run 14 sites isn't bad. All through one Nginx (which only uses 10MB, by the way).
The server is a Debian 7 on a reserved Large instance. I'll try to post an update later about this server with more details. I have a lot of work to do to set up all monitoring and backups for all these things.
I've started experimenting with my home page to make it load even faster.
Amazon famously does this too, which you can read more about in this Steve Souders post. They make sure everything that needs to be visible above the fold is loaded first; then they start loading all the other "stuff" below the fold. The assumption is that the user requests the page, watches it render, and some time after it has rendered reaches for the mouse and starts scrolling down for more content. Or perhaps never bothers to scroll down at all. Either way, everything below the fold can wait. We have more time to load that in later.
What we want to avoid is a load graph like this:
The graph is deliberately zoomed out so that we don't get stuck on the details of that particular graph. But basically, you have a very heavy document to load, which needs to be fully loaded (and partially rendered) before the browser can load all the other stuff that the page entails. As you can see, the first load (the HTML document) takes up a majority of the load time. Once that's downloaded, the browser can start parsing and rendering it. Simultaneously, it can start downloading all the referenced resources such as images, JavaScript and CSS.
On WebPagetest they call this Speed Index; "The Speed Index is the average time at which visible parts of the page are displayed."
So basically, you want to display as much as you possibly can and then load in other things that are necessary but can wait in the background.
So, how did I accomplish this on my site?
Basically, the home page uses a piece of Django code that picks up the 10 most recent blog posts and includes them in the template. Instead, I made it only pick up the first 2, and then, after window.onload, a piece of AJAX code loads the HTML for the remaining 8 blog posts.
That means that much less is required to load the home page. The page is smaller and references fewer images. The AJAX code is very crude and simple but works well enough.
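The JavaScript snippet itself isn't included here, but the server-side half of the idea can be sketched in Django roughly like this (the view, template and model names are hypothetical):

from django.shortcuts import render
from .models import BlogItem  # hypothetical model

def home(request):
    # only the 2 most recent posts go into the initial HTML
    posts = BlogItem.objects.order_by('-pub_date')[:2]
    return render(request, 'home.html', {'posts': posts})

def home_rest(request):
    # fetched by AJAX after window.onload; returns the remaining 8 posts as HTML
    posts = BlogItem.objects.order_by('-pub_date')[2:10]
    return render(request, 'home_rest.html', {'posts': posts})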
The user probably won't notice a huge difference if she avoids looking at the loading spinner of her browser. Only if she is really really fast at scrolling down will she notice that the rest of the page (about 80% of its vertical space) comes in a little bit later.
So, did it work?
I hope so! The theory is sound. However, my home page is, unlike an Amazon.com product page, very sparse. The page weighs a total of 77KB (excluding external resources) but now only the first 25KB is loaded up front and the rest later.
Here's a measurement before and one after. It's kinda hard to compare because "fluctuations" in network I/O make measurements like this quite unpredictable. Also, there are various odd requests, like New Relic and Google Analytics, which cloud the waterfall view. However, what really matters is in the "First View" of the after measurement. If you look closely you'll see that a bunch of images now aren't loaded until after the "Document Complete" event has fired. That, to me, is a big win.
If you're interested in how it was done, check out this changeset.
When you use a web framework like Tornado, which is single-threaded with an event loop (like Node.js, if you're familiar with that), and you need persistence (i.e. a database), there is one important question you need to ask yourself:
Is the query fast enough that I don't need to do it asynchronously?
If it's going to be a really fast query (for example, selecting a small recordset by an indexed key) it'll be quicker to just do it in a blocking fashion. It means less CPU work spent jumping between events.
However, if the query is going to be potentially slow (like a complex and data-intensive report), it's better to execute it asynchronously, do something else, and continue once the database comes back with a result. If you don't, all other requests to your web server might time out.
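To make the trade-off concrete, here's a rough sketch (not code from any of my apps) using blocking pymongo for the fast case and the asynchronous motor driver for the slow one; the collection and field names are made up:

import motor.motor_tornado
import pymongo
import tornado.web

blocking_db = pymongo.MongoClient().mydb
async_db = motor.motor_tornado.MotorClient().mydb

class FastLookupHandler(tornado.web.RequestHandler):
    def get(self, slug):
        # an indexed, single-document lookup: blocking the IOLoop briefly is fine
        talk = blocking_db.talks.find_one({'slug': slug})
        self.write({'topic': talk['topic']})

class SlowReportHandler(tornado.web.RequestHandler):
    async def get(self):
        # a potentially slow query: await it so other requests get served meanwhile
        count = await async_db.talks.count_documents({'duration': {'$gt': 0.5}})
        self.write({'long_talks': count})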
Another important question whenever you work with a database is:
Would it be a disaster if something you intend to store ends up not getting stored on disk?
This question is related to the D in ACID and doesn't have anything specific to do with Tornado. However, the reason you're using Tornado is probably because it's much more performant than more convenient alternatives like Django. So, if performance is so important, are durable writes important too?
Let's cut to the chase... I wanted to see how different databases perform when integrating them in Tornado. But let's not just look at different databases, let's also evaluate different ways of using them; either blocking or non-blocking.
What the benchmark does is:
On one single Python process...
For each database engine...
Create X records of something containing a string, a datetime, a list and a floating point number...
Edit each of these records which will require a fetch and an update...
Delete each of these records...
I can vary the number of records ("X") and sum the total wall clock time it takes for each database engine to complete all of these tasks. That way you get an insert, a select, an update and a delete. Realistically, it's likely you'll get a lot more selects than any of the other operations.
And the winner is:
pymongo!! Using the blocking version without doing safe writes.
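For concreteness, here is roughly what that create/edit/delete cycle looks like with blocking pymongo. This is a standalone sketch with made-up field names, not the benchmark's actual code:

import datetime
import time
import uuid
import pymongo

collection = pymongo.MongoClient().benchmark.talks

def run(how_many):
    t0 = time.time()
    ids = []
    # create
    for _ in range(how_many):
        result = collection.insert_one({
            'topic': 'Talk %s' % uuid.uuid4().hex,
            'when': datetime.datetime.utcnow(),
            'tags': ['python', 'tornado'],
            'duration': 3.14,
        })
        ids.append(result.inserted_id)
    # edit: a fetch followed by an update
    for _id in ids:
        doc = collection.find_one({'_id': _id})
        collection.update_one({'_id': _id}, {'$set': {'duration': doc['duration'] * 2}})
    # delete
    for _id in ids:
        collection.delete_one({'_id': _id})
    return time.time() - t0

print(run(1000))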
Let me explain some of those engines:
pymongo is the blocking pure python engine
with redis, toredis and memcache, a document ID is generated with uuid4 and the record is converted to JSON and stored under that key
when it says (safe) on the engine it means MongoDB is told not to acknowledge the write until it has, with some confidence, written the data (see the sketch after this list)
motor is an asynchronous MongoDB driver specifically for Tornado
MySQL doesn't support arrays (unlike PostgreSQL), so instead the tags field is stored as text and transformed back and forth as JSON
None of these databases have been tuned for performance. They're all fresh out-of-the-box installs on OS X with Homebrew
None of these databases have indexes, apart from Elasticsearch where everything is indexed
momoko is an awesome wrapper for psycopg2 which works asynchronously specifically with Tornado
memcache is not persistent but I wanted to include it as a reference
All JSON encoding and decoding is done using ultrajson which should work to memcache, redis, toredis and mysql's advantage.
mongokit is a thin wrapper on pymongo that makes it feel more like an ORM
A lot of these can be optimized by doing bulk operations but I don't think that's fair
I don't yet have a way of measuring memory usage for each driver+engine but that's not really what this blog post is about
I'd love to do more work on running these benchmarks with concurrent hits to the server. However, with blocking drivers what would happen is that each request (other than the first one) would have to sit there and wait, so the user experience would be poor, but it wouldn't be any faster in total time.
I use the official elasticsearch driver but am curious to also add Tornado-es some day which will do asynchronous HTTP calls over to ES.
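To make the (safe) distinction mentioned above concrete, here's roughly what it looks like with today's pymongo API (a sketch, not the benchmark's code; the collection name is made up):

import pymongo
from pymongo import WriteConcern

talks = pymongo.MongoClient().benchmark.talks

# "unsafe" writes: fire and forget, the client doesn't wait for acknowledgement
unsafe_talks = talks.with_options(write_concern=WriteConcern(w=0))
unsafe_talks.insert_one({'topic': 'Fast, fingers crossed'})

# "safe" writes (the default): the server acknowledges the write before we move on
talks.insert_one({'topic': 'Slower, but acknowledged'})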
You can run the benchmark yourself
The code is here on github. The following steps should work:
Then fire up http://localhost:8000/benchmark?how_many=10 and see if you can get it running.
Note: You might need to mess around with some of the hardcoded connection details in the file tornado_app.py.
Discussion
Before the lynch mob of Hacker News kills me for saying something positive about MongoDB: I'm perfectly aware of the discussions about large datasets and the complexities of managing them. Any flame-troll comments about "web scale" will be deleted.
I think MongoDB does a really good job here. It's faster than Redis and Memcache, but unlike those key-value stores, with MongoDB you can, if you need to, do actual queries (e.g. select all talks where the duration is greater than 0.5). MongoDB does its serialization between Python and the database using a binary format called BSON but, mind you, the Redis and Memcache drivers also get to use a fast JSON encoder/decoder (ultrajson).
The conclusion is: be aware of what you want to do with your data, and of where performance versus durability matters.
What's next
Some of those drivers will work on PyPy, which I'm looking forward to testing. It should work with cffi-based drivers, like psycopg2cffi for PostgreSQL, for example.
Also, an asynchronous version of elasticsearch should be interesting.
UPDATE 1
Today I installed RethinkDB 2.0 and included it in the test.
My first website was launched in 1997, but that one is long lost. The next version, which actually used a database and a real web framework, was launched in 2001, and this is the oldest screenshot I could find.
Back then the site was built in Zope, which at the time was the coolest shit you could possibly use. Back in 2003 I was renting a room in an apartment in London while I was studying at City University. The broadband (Americans know this as DSL) we had came with a static IP address, so I could basically tie my domain name directly to my bedroom. If you were born in the nineties or later you wouldn't remember this, but for almost 20 years you could either buy a laptop (small but slow) or a stationary computer (clunky but fast), and the laptop I was running on was no exception. Not to mention it was an abandoned laptop too. I think it had about 8 MB of RAM. I ran a stripped-down version of Debian on it without any graphical interface. I managed the code by scp'ing files onto it from my Windows computer.
Anyway, running on a home DSL line on a rusty old laptop blinking away under my bed meant the site would be ultra-slow if I didn't pre-optimize it. And that was something I did. The site had a Squid cache in front of it, and the HTML, CSS and JavaScript were compressed by a script I wrote called slimmer.
Back in 2003 blogging was getting hotter than celebrity spotting, and I was very much interested in something that later came to be called "SEO". The rumor at the time was that "blogs" got penalized by Google because blogs usually just re-posted stuff from real web pages. So I decided to prefix all my content with the word "plog". It was a mix of "p" for Peter and sufficiently different from the word "blog".
In the first couple of years of blogging I would blog about all sorts of stuff that caught my interest. Not just genuine thoughts or real technology notes, but any fun link I came across. That later became a massive trend (and still is, I guess) with giants like Digg and Reddit, so I stopped doing that with my own blog. For the last 7 years (give or take) I've only blogged about things that are genuinely close to heart or something I've actually worked on.
Some stats:
Total number of blog posts: 949
Total number of approved blog comments: 8,086
Number of email addresses collected: 4,292
Maximum number of comments on any one post: 2,749
Number of Cease or Desist letters received: 1
To me, blogging used to be a way of shouting out to the world what I found interesting, in the hope that you'd also find it interesting and thank me for finding it. Now it's a way for me to document something I've learned recently, or to make some announcement related to technical work I do.
I wonder how this will change for me in the next 10 years.
By default, the sort method and the sorted built-in function notice that the items are tuples, so they sort on the first element first and on the second element second.
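The original example list isn't reproduced here, but a list like this illustrates the point:

>>> items = [(0, 'B'), (1, 'a'), (0, 'a'), (1, 'B')]
>>> sorted(items)
[(0, 'B'), (0, 'a'), (1, 'B'), (1, 'a')]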
However, notice how you get (0, 'B') appearing before (0, 'a'). That's because upper-case characters come before lower-case characters. Now, suppose you wanted to apply some "humanization" to that and sort case-insensitively. You might try:
>>> sorted(items, key=str.lower)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
TypeError: descriptor 'lower' requires a 'str' object but received a 'tuple'
which is an error we deserve, because str.lower is handed each whole tuple, not the string inside it.
We could try to write a lambda function (e.g. sorted(items, key=lambda x: x.lower() if isinstance(x, str) else x)), but that's not going to work either, because the key function receives each tuple as a whole, so the .lower() never gets applied to the strings inside.
Without further ado, here's how you do it. A lambda function that returns a tuple:
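Using the illustrative items list from above, that looks something like this:

>>> sorted(items, key=lambda x: (x[0], x[1].lower()))
[(0, 'a'), (0, 'B'), (1, 'a'), (1, 'B')]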
As a bonus item for people still reading...
I'm sure you know that you can reverse a sort order simply by passing in sorted(items, reverse=True, ...), but what if you want different directions depending on which part of the key you're sorting on?
Using the technique of a lambda function that returns a tuple, here's how we sort a slightly more advanced structure:
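The original data isn't shown here either, but a made-up list of (name, salary) tuples works to illustrate it:

>>> users = [('Ted', 1000), ('Bill', 500), ('Ted', 500), ('Bill', 1000)]
>>> sorted(users, key=lambda x: (x[0], x[1]))
[('Bill', 500), ('Bill', 1000), ('Ted', 500), ('Ted', 1000)]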
Makes sense, right? Bill comes before Ted and 500 comes before 1000. But how do you sort it like that on the name but reverse on the salary? Simple, negate it:
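Still with the made-up users list from above:

>>> sorted(users, key=lambda x: (x[0], -x[1]))
[('Bill', 1000), ('Bill', 500), ('Ted', 1000), ('Ted', 500)]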