Is it possible? Probably not until recently. Many large companies have been investigating migrations to other programming languages to boost their operating performance and save on server costs, but there is no real need. Python can be the right tool for the job, and there is a lot of work happening around performance in the community. CPython 3.6 boosted overall interpreter performance with its new dictionary implementation, and CPython 3.7 is going to be even faster thanks to a faster calling convention and dictionary lookup caches. For number-crunching tasks you can use PyPy with its just-in-time code compilation. PyPy can recently run the NumPy test suite and has improved overall compatibility with C extensions drastically. Later this year PyPy is expected to reach Python 3.5 conformance.
All this great work inspired me to innovate in one of the areas in which Python is used extensively: web and micro-services development.
Enter Japronto!
Japronto is a brand new micro-framework tailored for your micro-services needs. Its main goals are to be fast, scalable and lightweight. It lets you do both synchronous and asynchronous programming with asyncio, and it’s shamelessly fast. Even faster than NodeJS and Go.
This micro-benchmark was done with a “Hello world!” application, but it clearly demonstrates the server-framework overhead of a number of solutions. These results were obtained on an AWS c4.2xlarge instance with 8 vCPUs, launched in the São Paulo region with default shared tenancy, HVM virtualization and magnetic storage. The machine was running Ubuntu 16.04.1 LTS (Xenial Xerus) with a Linux 4.4.0–53-generic x86_64 kernel. The OS reported a Xeon® CPU E5–2666 v3 @ 2.90GHz. I used Python 3.6, freshly compiled from source. To be fair, all the contestants (including Go) were running a single worker process. Servers were load tested using wrk with 1 thread, 100 connections and 24 simultaneous (pipelined) requests per connection (a cumulative parallelism of 2,400 requests).
HTTP pipelining is crucial here since it’s one of the optimizations that Japronto takes into account when executing requests. Most servers execute requests from pipelining clients in the same fashion they would from non-pipelining clients and don’t try to optimize for it (in fact Sanic and Meinheld would also silently drop requests from pipelining clients, which is a violation of the HTTP 1.1 protocol). In simple terms, pipelining is a technique in which the client doesn’t need to wait for a response before sending subsequent requests over the same TCP connection. To ensure the integrity of the communication, the server sends back the responses in the same order the requests were received.
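On the wire, pipelining simply means several requests concatenated back to back on one connection; a minimal sketch in Python (hypothetical payload, for illustration only):

```python
# Sketch: what a pipelined HTTP/1.1 exchange looks like on the wire.
# The client writes several requests without waiting for responses;
# a compliant server must answer them in the order they arrived.
request = (
    b"GET /hello HTTP/1.1\r\n"
    b"Host: example.com\r\n"
    b"\r\n"
)

# Three requests leave the client in a single send() call.
pipelined = request * 3

# The server splits the buffer on the empty-line delimiter and
# must produce three responses, preserving order on the connection.
requests = [r for r in pipelined.split(b"\r\n\r\n") if r]
```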
The gory details of optimizations
When many small GET requests are pipelined together by the client, there is a great chance they are going to arrive in one TCP packet (thanks to Nagle’s algorithm) on the server side and be read back by one system call. Doing a system call and moving data from kernel space to user space is a very expensive operation compared to, say, moving memory inside process space. That’s why doing as few system calls as possible (but no fewer) is important. When Japronto receives data and successfully parses several requests out of it, it tries to execute all of them as fast as possible, glue the responses back together in the correct order and write them back in one system call. In fact the kernel can aid with the gluing part, thanks to scatter/gather I/O system calls, which Japronto doesn’t use yet. Beware that this is not always possible, since some of the requests could take too long and waiting for them would needlessly increase latency. Care needs to be taken when tuning heuristics that weigh the cost of system calls against the expected request completion time.
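The gluing part can be sketched with os.writev, the gather half of the scatter/gather I/O mentioned above (which, as noted, Japronto itself doesn’t use yet). This sketch assumes a POSIX system and uses a pipe in place of a socket:

```python
import os

# Responses for three pipelined requests, kept as separate buffers.
responses = [
    b"HTTP/1.1 200 OK\r\nContent-Length: 2\r\n\r\nok",
    b"HTTP/1.1 200 OK\r\nContent-Length: 2\r\n\r\nok",
    b"HTTP/1.1 200 OK\r\nContent-Length: 2\r\n\r\nok",
]

r, w = os.pipe()
# writev hands all buffers to the kernel in a single system call,
# so user space never has to copy them into one contiguous buffer.
written = os.writev(w, responses)
os.close(w)

# The other end sees one contiguous, ordered byte stream.
data = os.read(r, 4096)
os.close(r)
```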
Besides delaying writes for pipelined clients, there are several other techniques employed in the code. Japronto is written almost entirely in C. The parser, protocol, connection reaper, router, request and response objects are written as C extensions. Japronto tries hard to delay the creation of Python counterparts of its internal structures until asked for explicitly. For example, the headers dictionary won’t be created until it is requested in a view. All the token boundaries are already marked during parsing, but normalization of the header keys and creation of the str objects happens lazily, when they are accessed for the first time.
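The lazy-materialization idea can be sketched in pure Python (hypothetical names and boundary format; Japronto implements this in C):

```python
class Request:
    """Keeps the raw header bytes plus token boundaries; the Python
    dict is only built on first access (a pure-Python sketch of the
    lazy approach described above)."""

    def __init__(self, raw, boundaries):
        self._raw = raw                # raw bytes as read from the socket
        self._boundaries = boundaries  # (name_start, name_end, val_start, val_end)
        self._headers = None           # dict created lazily

    @property
    def headers(self):
        if self._headers is None:
            # Normalize key case and decode to str only now.
            self._headers = {
                self._raw[ns:ne].decode('ascii').title():
                self._raw[vs:ve].decode('ascii')
                for ns, ne, vs, ve in self._boundaries
            }
        return self._headers


raw = b"host: example.com\r\ncontent-length: 0\r\n"
req = Request(raw, [(0, 4, 6, 17), (19, 33, 35, 36)])
# Nothing has been built yet; the dict appears on first access.
```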
Japronto relies on the excellent picohttpparser C library for parsing the status line, headers and chunked HTTP message body. picohttpparser directly employs the text-processing instructions found in modern CPUs with SSE4.2 extensions (almost any x86_64 CPU from the last 10 years has them) to quickly match the boundaries of HTTP tokens. The I/O is handled by the super awesome uvloop, which itself is a wrapper around libuv. At the lowest level this is a bridge to the epoll system call, providing asynchronous notifications on read/write readiness.
Python is a garbage-collected language, and care needs to be taken when designing high-performance systems not to needlessly increase pressure on the GC. The internal design of Japronto tries to avoid reference cycles and do as few allocations/deallocations as possible. It does so by preallocating some objects in so-called arenas. It also tries to reuse Python objects for future requests if they are no longer referenced, instead of throwing them away.
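The reuse idea can be sketched as a simple freelist (a hypothetical sketch, not Japronto’s actual arena code):

```python
class Pool:
    """Tiny freelist: hands back recycled objects instead of
    allocating new ones, reducing allocator and GC pressure."""

    def __init__(self, factory, reset):
        self._factory = factory  # creates a fresh object when the pool is empty
        self._reset = reset      # wipes an object before it is reused
        self._free = []

    def acquire(self):
        if self._free:
            return self._free.pop()
        return self._factory()

    def release(self, obj):
        self._reset(obj)
        self._free.append(obj)


pool = Pool(factory=dict, reset=dict.clear)
a = pool.acquire()
a['path'] = '/'
pool.release(a)       # object goes back to the pool, wiped
b = pool.acquire()    # same object comes back instead of a fresh allocation
```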
All the allocations are done as multiples of 4KB, and the internal structures are laid out carefully so that data used frequently together is close enough in memory, minimizing the possibility of cache misses. Japronto tries not to copy between buffers unnecessarily and does many operations in place. For example, the path is percent-decoded in place before it is matched in the router.
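The decode-before-matching order can be sketched with the standard library (hypothetical route table; note that Python strings are immutable, so unlike the in-place C version this sketch copies):

```python
from urllib.parse import unquote

routes = {'/hello world': 'hello_view'}

def match(raw_path):
    # Decode once, before routing, so the router always compares
    # canonical paths; Japronto does this step in place in C.
    return routes.get(unquote(raw_path))
```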
Call for help
I’ve been working on Japronto continuously for the last 3 months, often during weekends as well as normal working days. This was only possible thanks to taking a break from my regular job as a programmer and putting all my effort into this project. I think it’s time to share the fruit of my work with the community.
Currently Japronto implements pretty solid feature-set:
- HTTP 1.x implementation with support for chunked uploads
- Full support for HTTP pipelining
- Keep-alive connections with configurable reaper
- Support for synchronous and asynchronous views
- Master-multiworker model based on forking
- Support for code reloading on changes
- Simple routing
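The synchronous/asynchronous views point means a handler may be either a plain function or a coroutine; a hypothetical dispatcher built on the standard library (not Japronto’s actual code) can branch on that distinction:

```python
import asyncio
import inspect

def sync_view(request):
    # An ordinary function: returns its result directly.
    return 'sync: ' + request

async def async_view(request):
    # A coroutine: may await I/O (e.g. a database call) in between.
    await asyncio.sleep(0)
    return 'async: ' + request

async def dispatch(view, request):
    # Accept both kinds of views, as a framework supporting
    # synchronous and asynchronous handlers must.
    if inspect.iscoroutinefunction(view):
        return await view(request)
    return view(request)

async def main():
    return await asyncio.gather(dispatch(sync_view, '/a'),
                                dispatch(async_view, '/b'))

results = asyncio.run(main())
```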
I would like to look into WebSockets and streaming HTTP responses asynchronously next. There is a lot of work to be done on documentation, and testing could definitely benefit from some help. If you would like to help, please contact me directly on Twitter (@squeaky_pl) or GitHub. The project repository is located at https://github.com/squeaky-pl/japronto.
Also, if your company is looking for a Python developer who is a performance freak and also does DevOps, I am open to hearing about it. I will consider positions worldwide.
The other contestants
Looking closer at the other contestants, we can see that shiny new NodeJS is almost as fast as Go (to be honest, I was pretty disappointed with Go’s performance in this micro-benchmark). We can also see that the Meinheld WSGI server is almost on par with NodeJS and Go. Despite its inherently blocking design, it is a great performer compared to the preceding four, which are asynchronous Python solutions. So never trust anyone who says that asynchronous systems are always speedier; they are almost always more concurrent, but there is much more to it than that.
Final words
None of the techniques mentioned here are really specific to Python. They could probably be employed in other languages like Ruby, JavaScript or even PHP. I would be interested in doing such work as well, but sadly this will not happen unless somebody funds it.
I would like to thank the Python community for their continuous investment in performance engineering: namely Victor Stinner @VictorStinner, INADA Naoki @methane, Yury Selivanov @1st1, and the entire PyPy team.
For the love of Python.