15 Oct 2014: Letter to David Wilson and Howard Chu regarding Python-LMDB

hi david, hi howard,

i just wanted to say thank you for the work you did and the help you provided in investigating the use of py-lmdb in the proprietary secure environment i was working in.

during the initial investigation, my immediate assertion when the requirements were presented to me was that it would be flat-out impossible to meet them with a standard SQL or NoSQL database, and that much simpler key-value stores or advanced memory-layout systems (such as http://datadraw.sf.net) would be needed. this assertion was met with significant skepticism. after a thorough three-week investigation, the extreme nature of the requirements and the drastic shortfall between those requirements and the available SQL and NoSQL technology (postgresql, mariadb and others) was abundantly clear. the requirements called for a minimum of 160,000 reads plus writes per second; the best available SQL and NoSQL technology could barely manage 30,000, and that with absolutely no indexing, heavy optimisation and safety checks disabled. even then there were frequent pauses of several seconds for write-back of various caches and transaction logs, many of which grew to completely unacceptable levels within minutes of the tests starting, as the number of records increased.

after eliminating SQL and NoSQL, the second phase of the evaluation moved on to simpler key-value stores, and even SQLite, which is known in the literature to use a highly efficient B-Tree store as its back-end. both LevelDB and SQLite ultimately also proved inadequate.

by complete accident, the Lightning Memory-Mapped Database (LMDB) came up. the python bindings have no debian packaging, so it was not discovered earlier. the initial tests were so staggering that it was immediately obvious that LMDB would be the correct choice: its random-access performance was around an order of magnitude greater than the required read-write rate, leaving plenty of room for expansion at a later date.

to give you some idea of what was involved (because it has to go on my CV, but without giving away too many details because this was in a secure environment, we will use the word "hypothetical" a lot): hypothetically, the requirements were to receive data (at a hypothetical example rate of 20,000 packets per second), execute some tasks (say, hypothetically, 10 tasks per packet) and (hypothetically) report some hypothetical information. the tasks, however, needed access to data which had to be updated on every incoming packet. so we were looking at, hypothetically, a sustained and simultaneous load of 200,000 database reads, 200,000 writes and 200,000 deletes per second, in this entirely hypothetical, non-real-world scenario.

so when benchmark tests of py-lmdb showed a random-access write speed in excess of 250,000 records per second (bearing in mind that this is python) and random read access of almost 900,000 records per second from a single process, it was pretty obvious that LMDB was the right tool for the job.
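(for anyone curious, the micro-benchmark was roughly the following shape. this is only a sketch: the path, record count, key/value sizes and map_size are hypothetical stand-ins, though lmdb.open, putmulti and get are the genuine py-lmdb calls.)

    import os
    import random
    import time

    import lmdb

    N = 100000                                   # hypothetical record count
    env = lmdb.open('/tmp/bench.lmdb',           # hypothetical path
                    map_size=2 * 1024 ** 3)      # 2GB address-space reservation

    # bulk random-order writes, all inside one write transaction, via putmulti
    items = [(os.urandom(16), b'x' * 64) for _ in range(N)]
    t0 = time.time()
    with env.begin(write=True) as txn:
        consumed, added = txn.cursor().putmulti(items)
    print('writes/sec: %.0f' % (added / (time.time() - t0)))

    # random-order reads inside a single read-only transaction
    keys = [k for k, _ in items]
    random.shuffle(keys)
    t0 = time.time()
    with env.begin() as txn:
        for k in keys:
            txn.get(k)
    print('reads/sec: %.0f' % (N / (time.time() - t0)))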
to summarise: py-lmdb has been successfully deployed as the core back-end in quite literally the most complex programming task i have ever done, to date.

i especially wanted to say thank you to you both for the very responsive and useful help you gave on the openldap-devel mailing list: firstly, david, for working with me to add putmulti (we should really also work on adding getmulti as well, perhaps?), and secondly, both you and howard for helping to resolve the read-transaction issues, which turned out to be due to not shutting down the application correctly. this ensured that we were able to deliver a stable, reliable and high-performance product to the client, which is absolutely fantastic.

lastly i should cover a minor issue as a "gotcha" for other people; it should probably also be raised as a performance bug against the linux kernel. i had not realised that i had made the mistake of opening (hypothetically) 10 LMDB databases (each with its own environment) prior to forking (hypothetically, let us say) 20 processes. bear in mind that that is now, hypothetically, something like 200 shared open file handles to the same 10 shm mmap'd files, none of them used by the parent at all. also, until much later in the debugging process, i did not realise that, at the same time, a huge amount of CPU-based processing was being carried out inside a write transaction rather than outside of it.

these two mistakes (corrected now) had the most extreme and drastic effect on the loadavg of a multi-core system that i have ever encountered. even under the most modest task-handling rate, the loadavg jumped to over 30 within a few seconds. the effect was so extreme that i was able to discover a race condition in vim's file-save capabilities: i have never lost a file under vim before, but the linux kernel was so I/O overloaded that vim was blocked for several seconds just trying to create a file, immediately after it had deleted the old one.

if i am honest, several days, possibly weeks, were spent wondering whether other areas of the application were causing such drastic slow-downs. networking and inter-process communication were improved considerably: such an unrealistic amount of time was being spent in epoll, select, read and write that the problem was believed to lie there. this had the ironic and comical side-effect that, once the true source of the problem was found, networking and inter-process communication dropped almost out of sight in python cProfile's list of top time-consuming functions! so it was not epoll, select, read or write themselves that were consuming the time, but the linux kernel being so heavily I/O bound that those system calls merely appeared to be at fault.

now, what was interesting was that when i created a single LMDB environment containing multiple named databases, and let the child processes open that single environment themselves rather than have the parent do it, the loadavg dropped to levels that almost matched htop's reported CPU usage: a clear sign that the source of the problem had been found. (note: this was done before fixing the issue of the huge CPU-intensive delay inside write transactions.) a rough sketch of the corrected pattern is in the p.s. below.

whilst this is a fairly crazy scenario, it does leave me concerned that there is a fundamental performance-related bug in the linux kernel's handling of shm memory-mapped files. unfortunately, i can no longer investigate this issue.

anyway, that was all. summary: fantastic software, great support, and please don't stop the improvements.
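p.s. for anyone who hits the same gotcha, the corrected pattern was roughly as follows. again, only a sketch: NUM_WORKERS, DB_NAMES, iter_packets and expensive_tasks are hypothetical stand-ins, though lmdb.open(..., max_dbs=...), open_db and put(..., db=...) are the genuine py-lmdb calls. the important points are that the parent never opens the environment, each child opens it once after fork(), and the CPU-heavy work happens outside the write transaction.

    import os

    import lmdb

    NUM_WORKERS = 20                                       # hypothetical process count
    DB_NAMES = [('db%d' % i).encode() for i in range(10)]  # hypothetical db names

    def iter_packets(count=10000):
        # hypothetical stand-in for the real packet feed
        for _ in range(count):
            yield os.urandom(32)

    def expensive_tasks(packet):
        # hypothetical stand-in for the ~10 CPU-heavy tasks per packet;
        # returns (database-name, key, value) triples to be written
        return [(DB_NAMES[packet[0] % len(DB_NAMES)], packet[:16], packet)]

    def worker():
        # ONE environment per process, opened only after fork();
        # max_dbs permits multiple named databases inside it
        env = lmdb.open('/tmp/app.lmdb', max_dbs=len(DB_NAMES),
                        map_size=2 * 1024 ** 3)
        dbs = dict((name, env.open_db(name)) for name in DB_NAMES)
        for packet in iter_packets():
            results = expensive_tasks(packet)    # CPU work: outside the txn
            with env.begin(write=True) as txn:   # write txn kept as short as possible
                for name, key, value in results:
                    txn.put(key, value, db=dbs[name])

    # the parent forks the workers but never opens the environment itself
    for _ in range(NUM_WORKERS):
        if os.fork() == 0:
            worker()
            os._exit(0)
    for _ in range(NUM_WORKERS):
        os.wait()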