This is perhaps insanely obvious, but it was a measurement I had to do and it might help you too if you use python-jsonschema a lot.
I have this project which has a migration script that needs to transfer about 1M records from one PostgreSQL database, transform them a bit, validate them, and store them in another PostgreSQL database. The validation step was done like this:
from jsonschema import validate
...
with open(os.path.join(settings.BASE_DIR, "schema.yaml")) as f:
    SCHEMA = yaml.load(f)["schema"]
...
class Build(models.Model):
    ...

    @classmethod
    def validate_build(cls, build):
        validate(build, SCHEMA)
That works fine when you have a slow trickle of these coming in, many seconds or minutes apart. But when you have to do about 1M of them, the speed overhead starts to really matter. Granted, in this context it's just a migration which is hopefully only done once, but it helps if it doesn't take too long since that makes it easier to avoid downtime.
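To make that concrete, the per-record loop looks roughly like this. It's just a sketch; the transform step and the destination object are made-up stand-ins for illustration:

from jsonschema import validate


def transform(row):
    # Stand-in for the real transform step (hypothetical).
    return dict(row)


def migrate(source_rows, schema, destination):
    # Sketch of the migration loop: transform, validate, store.
    # `source_rows` and `destination` are placeholders; `destination`
    # is assumed to have a save() method.
    for row in source_rows:
        record = transform(row)
        # The hot spot: jsonschema.validate() builds a brand new
        # validator (and re-checks the schema) on every single call.
        validate(record, schema)
        destination.save(record)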
What about python-fastjsonschema?
The name python-fastjsonschema just sounds very appealing, but I wasn't sure how mature it is or what the subtle differences are between it and the more established python-jsonschema, which I was already using.
There are two ways of using it, either...
fastjsonschema.validate(schema, data)
...or...
validator = fastjsonschema.compile(schema)
validator(data)
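For context, here's a tiny self-contained example of both modes; the schema and data are made up for illustration:

import fastjsonschema

# Made-up schema and data, just for illustration.
schema = {
    "type": "object",
    "properties": {"name": {"type": "string"}},
    "required": ["name"],
}
data = {"name": "buildhub"}

# One-shot mode: the schema is compiled on every call.
fastjsonschema.validate(schema, data)

# Compile once, reuse the returned callable many times.
validator = fastjsonschema.compile(schema)
validator(data)
validator({"name": "another record"})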
That got me thinking, why don't I just do that with regular python-jsonschema!
All you need to do is crack open the validate function, and you can then re-use one validator instance for multiple pieces of data:
from jsonschema.validators import validator_for

klass = validator_for(schema)
klass.check_schema(schema)  # optional
instance = klass(schema)
instance.validate(data)
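Spelled out with a throwaway schema so it actually runs, the reuse pattern looks like this (schema and sample data are made up):

from jsonschema.validators import validator_for

# Made-up schema for illustration.
schema = {
    "type": "object",
    "properties": {"version": {"type": "string"}},
    "required": ["version"],
}

klass = validator_for(schema)  # picks the right validator class for the draft
klass.check_schema(schema)     # optional, and only done once
validator = klass(schema)      # one instance, reused for every record

validator.validate({"version": "57.0"})  # passes silently
print(validator.is_valid({"nope": 1}))   # False, without raising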
I rewrote my project's code to this:
from jsonschema.validators import validator_for
...
with open(os.path.join(settings.BASE_DIR, "schema.yaml")) as f:
    SCHEMA = yaml.load(f)["schema"]
_validator_class = validator_for(SCHEMA)
_validator_class.check_schema(SCHEMA)
validator = _validator_class(SCHEMA)
...
class Build(models.Model):
    ...

    @classmethod
    def validate_build(cls, build):
        validator.validate(build)
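Callers of validate_build stay exactly the same, and invalid records still raise the usual ValidationError. For example, something along these lines in the migration script; the helper and its error handling are just a sketch:

from jsonschema import ValidationError


def migrate_record(build_dict):
    # Sketch: validate a transformed record and skip it if it's invalid.
    # `Build` is the model defined above.
    try:
        Build.validate_build(build_dict)
    except ValidationError as exception:
        print("Skipping invalid record:", exception.message)
        return False
    return True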
How do they compare, performance-wise?
Let this simple benchmark code speak for itself:
from buildhub.main.models import Build, SCHEMA
import fastjsonschema
from jsonschema import validate, ValidationError
from jsonschema.validators import validator_for


def f1(qs):
    for build in qs:
        validate(build.build, SCHEMA)


def f2(qs):
    validator = validator_for(SCHEMA)
    for build in qs:
        validate(build.build, SCHEMA, cls=validator)


def f3(qs):
    cls = validator_for(SCHEMA)
    cls.check_schema(SCHEMA)
    instance = cls(SCHEMA)
    for build in qs:
        instance.validate(build.build)


def f4(qs):
    for build in qs:
        fastjsonschema.validate(SCHEMA, build.build)


def f5(qs):
    validator = fastjsonschema.compile(SCHEMA)
    for build in qs:
        validator(build.build)


# Reporting
import time
import statistics
import random

functions = f1, f2, f3, f4, f5
times = {f.__name__: [] for f in functions}

for _ in range(3):
    qs = list(Build.objects.all().order_by("?")[:1000])
    for func in functions:
        t0 = time.time()
        func(qs)
        t1 = time.time()
        times[func.__name__].append((t1 - t0) * 1000)


def f(ms):
    return f"{ms:.1f}ms"


for name, numbers in times.items():
    print("FUNCTION:", name, "Used", len(numbers), "times")
    print("\tBEST  ", f(min(numbers)))
    print("\tMEDIAN", f(statistics.median(numbers)))
    print("\tMEAN  ", f(statistics.mean(numbers)))
    print("\tSTDEV ", f(statistics.stdev(numbers)))
Basically, for each of the alternative implementations, run the validation 3 times over 1,000 JSON blobs (technically Python dicts) that are each around 1KB in size.
The results:
FUNCTION: f1 Used 3 times
    BEST   1247.9ms
    MEDIAN 1309.0ms
    MEAN   1330.0ms
    STDEV  94.5ms
FUNCTION: f2 Used 3 times
    BEST   1266.3ms
    MEDIAN 1267.5ms
    MEAN   1301.1ms
    STDEV  59.2ms
FUNCTION: f3 Used 3 times
    BEST   125.5ms
    MEDIAN 131.1ms
    MEAN   133.9ms
    STDEV  10.1ms
FUNCTION: f4 Used 3 times
    BEST   2032.3ms
    MEDIAN 2033.4ms
    MEAN   2143.9ms
    STDEV  192.3ms
FUNCTION: f5 Used 3 times
    BEST   16.7ms
    MEDIAN 17.1ms
    MEAN   21.0ms
    STDEV  7.1ms
Basically, if you use python-jsonschema and create a reusable instance, it's 10 times faster than the "default way". And if you do the same but with python-fastjsonschema, it's roughly 100 times faster.
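For what it's worth, here's a quick back-of-the-envelope check of those multipliers, using the medians from the output above:

# Medians from the benchmark output above, in milliseconds.
medians = {"f1": 1309.0, "f3": 131.1, "f4": 2033.4, "f5": 17.1}

print(f"f1 / f3 = {medians['f1'] / medians['f3']:.0f}x")  # ~10x, reused jsonschema instance vs. default
print(f"f1 / f5 = {medians['f1'] / medians['f5']:.0f}x")  # ~77x, compiled fastjsonschema vs. default jsonschema
print(f"f4 / f5 = {medians['f4'] / medians['f5']:.0f}x")  # ~119x, compiled vs. one-shot fastjsonschema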
By the way, in version f5 it validated 1,000 1KB records in 16.7ms. That's insanely fast!