Before you even go there, I’ll preface this with YMMV.
This little post documents a benchmark I ran for an internal use case, in the hope that it'll be helpful for others. As a benchmark, I wasn't attempting to fully characterize the performance of a specific system – I just wanted a baseline against which to measure changes in the environment or underlying infrastructure. I was curious what the performance of Celery was, using RabbitMQ as the broker. The tests I ran were pretty much straight sample code from the project (a simple "add" task) and a client making multiple requests and combining the results with the veritable Python timeit.
The code
All the code for this test (and more, I got excited…) is stashed up on Github: https://github.com/heckj/openstack-benchmarks/tree/CeleryBenchmark. Really, the parts you're likely to be interested in are the worker class: tasks.py, the configuration: celeryconfig.py, and the actual benchmarking code: celery-benchmark.py.
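For context, the worker task is about as simple as a Celery task gets. A sketch of what tasks.py boils down to in the Celery 2.2 era (the canonical version is in the repo linked above):

from celery.task import task

@task
def add(x, y):
    # trivially cheap work, so the benchmark measures the messaging
    # round trip rather than any computation
    return x + y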
And before you ask, no – the OpenStack project doesn't currently use Celery; in fact, it uses Carrot right now. I'm just intending to add more benchmarks and profiling tools to this codebase around the OpenStack project in the future.
The config
The configuration was held constant – a stock Ubuntu 10.10 server with all the various dependencies installed. Since I’m sure someone will want to know about the versions:
- rabbitmq-server 1.8.0-1ubuntu2
- python2.6 2.6.6-5ubuntu1
- python-amqplib 0.6.1-1
- celery 2.2.7
- kombu 1.1.6
- anyjson 0.3.1
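For reference, a celeryconfig.py in this era of Celery amounts to a handful of broker settings. A sketch of the shape of the file, assuming a local RabbitMQ with the default guest credentials (the actual configuration is in the repo):

# celeryconfig.py -- minimal Celery 2.x broker configuration (sketch)
BROKER_HOST = "localhost"       # RabbitMQ runs on the same host in this test
BROKER_PORT = 5672
BROKER_USER = "guest"
BROKER_PASSWORD = "guest"
BROKER_VHOST = "/"

CELERY_RESULT_BACKEND = "amqp"  # results return over the broker
CELERY_IMPORTS = ("tasks",)     # module containing the add task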
The host was a Shuttle PC with 8GB of RAM and a Core 2 Duo processor. The host was never heavily burdened by the processing that took place (load < 1.0, no significant swapping). The benchmarking was done on the same host as RabbitMQ to remove any network latency effects.
The results
I was totally abusing MS Excel's "stock graph" to show variability in the results (of which there wasn't a hell of a lot). In the graph, the thin line represents the range (min to max) and the thicker box in the middle is the average result +/- one standard deviation. The gist: the round-trip time held nearly constant at 160ms per request, sampled over 1,000,000 requests. The image above shows a portion of the sequence. The relevant code:
from benchmark.celerybench.tasks import add

result = add.apply_async(args=[4, 4])
result.get()
(As I mentioned earlier, you can see the whole code on github).
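Wrapping that round trip in timeit is essentially all the benchmark loop does. A sketch of the measurement, assuming the task module from the repo is importable – the iteration counts here are illustrative, not what I actually ran:

import timeit

setup = "from benchmark.celerybench.tasks import add"
stmt = "add.apply_async(args=[4, 4]).get()"
number = 1000  # round trips per sample

# repeat the sample a few times to get a feel for the variability
for elapsed in timeit.repeat(stmt, setup=setup, repeat=5, number=number):
    print "%.1f ms per round trip" % (elapsed / number * 1000.0)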
I did more tests, but I need to keep some of those to myself, as they’re testing variations of configurations for my job.
Random side notes
I hadn't done anything in depth with Celery before. I'd heard about it from friends, and in the community in general. The author pinged me a couple of times with help as well. Overall, I found Celery incredibly easy to set up, with a very straightforward API (always nice). There were lots of options available, but everything was set with very usable defaults from the start. I'm totally looking forward to using Celery in some projects, as well as taking advantage of Kombu – a messaging framework that provides a drop-in compatibility layer for Carrot.
Update:
Ask mentioned some suggestions for optimizing on Twitter – this seemed a good place to put them (a sketch of how they'd look in celeryconfig.py follows the list). Try:
- CELERYD_PREFETCH_MULTIPLIER=0
- CELERY_DISABLE_RATE_LIMITS=True
- BROKER_TRANSPORT = "librabbitmq" to use the pylibrabbitmq C library
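In celeryconfig.py terms, those suggestions would land roughly as follows – a sketch only, since I haven't re-run the benchmark with them:

# suggested tuning, as celeryconfig.py entries (untested in this benchmark)
CELERYD_PREFETCH_MULTIPLIER = 0    # 0 means no prefetch limit
CELERY_DISABLE_RATE_LIMITS = True  # skip the rate-limiting bookkeeping
BROKER_TRANSPORT = "librabbitmq"   # C-based transport instead of pure-Python amqplib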