15 Oct 2014: Letter to David Wilson and Howard Chu regarding Python-LMDB

hi david, hi howard,

i just wanted to say thank you for the work you did and the help you
provided in investigating the use of py-lmdb in the proprietary secure
environment i was working in.  during the initial investigation, my
immediate assertion, when the requirements were presented to me, was
that it would be flat-out impossible to meet them with a
standard SQL or NOSQL database, and that much simpler key-value stores or
advanced memory layout systems (such as http://datadraw.sf.net) would be
needed.

this assertion was met with some significant skepticism.

after a thorough 3-week investigation, the extreme nature of the
requirements and the drastic shortfall between the requirements and the
available SQL and NOSQL technology (postgresql, mariadb and others) was
very clear.  the requirements were for a minimum of 160,000 reads
plus writes per second; the best available SQL and NOSQL technology
was barely able to manage 30,000 with absolutely no indexing, heavy
optimisation and safety checks disabled, and even then there were often
huge pauses of several seconds for write-back of various caches and
transaction logs, many of which grew dramatically to completely
unacceptable levels within minutes of the tests starting, as the number
of records increased.

after eliminating SQL and NOSQL, the second phase of the evaluation
continued with a search for simpler key-value stores, and even SQLite,
due to it being known in the literature to use a highly efficient B-Tree
store for its back-end.  both LevelDB and SQLite ultimately also proved
inadequate.

by complete accident, the Lightning Memory-Mapped Database (LMDB) came up.
the python bindings have no debian packaging, so it was not discovered earlier.
the initial tests were so staggering that it was immediately obvious that
LMDB would be the correct choice.  the performance for random access was
around an order of magnitude greater than the required read-write cycles,
leaving plenty of room for expansion at a later date.

to give you some idea of what was involved (because it has to go on my
CV, but without giving away too many details, because this was
in a secure environment), we will use the word "hypothetical" a lot.
hypothetically, the requirements were to receive data (at a
hypothetical example rate of 20,000 packets per second), execute some tasks
(say, hypothetically for example 10 tasks per packet) and (hypothetically)
report some hypothetical information.  however, the tasks needed access to
data which needed to be updated on every incoming packet.  so we were
(say) looking at, hypothetically, a sustained and simultaneous set of
200,000 database reads, 200,000 writes and 200,000 deletes
per second, in this hypothetical example of a non-real-world but
entirely hypothetical scenario.

so when benchmark tests of py-lmdb showed a random-access write speed - bearing
in mind that this is python - in excess of 250,000 records per second, and
random read access of almost 900,000 records per second from a single
process, it was pretty obvious that LMDB was the right tool for the job.

to summarise: py-lmdb has been successfully deployed as the core back-end in
quite literally the most complex programming task i have ever done, to date.

i especially wanted to say thank you to you both for the very
responsive and useful help you both gave on the openldap-devel mailing
list.  firstly, david, for working with me to add putmulti (we
should really also work on adding getmulti, perhaps?), and
secondly, to both you and howard for helping to resolve the read transaction
issues, which turned out to be due to not shutting down the application
correctly.  this ensured that we were able to deliver a stable, reliable
and high-performance product to the client, which is absolutely fantastic.

lastly, i should cover a minor issue as a "gotcha" for other people, and
it should probably also be raised as a bug with the linux kernel.  i had
not realised that i had made the mistake of opening (hypothetically)
10 LMDB databases (each with a separate environment) prior to
forking (hypothetically, let us say) 20 child processes.  bear in mind
that that's now - hypothetically - something like 200 shared open file handles
to the same 10 shm mmap'd files, none of them used at all by the parent.
also, until much later in the debugging process, i did not realise that, at
the same time, a huge amount of CPU-based processing was being
carried out inside a write transaction rather than outside of it.

these two mistakes (now corrected) had the most extreme and drastic
effect on the loadavg of a multi-core system that i have ever encountered.
under even the most modest task handling rate, with the above mistakes in
place, the loadavg jumped to over 30 within a few seconds.  the effect was so
extreme that i was able to discover a race condition in vim's file-save
capabilities: i have never lost a file under vim before, but the linux kernel
was so I/O-overloaded that vim was blocked for several seconds merely trying
to create a file just after it had deleted the old one.

if i am honest, several days, possibly weeks, were spent wondering if
other areas of the application were causing such drastic
slow-downs.  networking and inter-process communication were improved
considerably: so much time was apparently being spent
in epoll, select, read and write that it was believed the problem
lay there.  this had the ironic and comical side-effect that, once the
true source of the problem was found, network and inter-process communication
dropped almost out of sight in python cProfile's list of top time-consuming
functions!  so it was not epoll, select, read or write themselves that
were consuming the time, but the linux kernel being so heavily I/O-bound
that these system calls merely appeared to be at fault.

now, what was interesting was that when a single LMDB environment was
created containing multiple named databases, and the child processes
opened that environment themselves rather than inheriting it from the
parent, the loadavg dropped to levels that almost matched htop's reported
CPU usage: a clear sign that the source of the problem had been found.
(note: this was done before fixing the issue of the huge CPU-intensive
work inside write transactions.)

whilst this is a fairly crazy scenario, it does leave me concerned that
there is a fundamental performance-related bug in the linux kernel's
handling of shm memory-mapped files.  unfortunately, i can no longer
investigate this issue.

anyway, that was all.  summary: fantastic software, great support, and
please don't stop the improvements.