This cuts down the number of bisections dramatically, basically by pre-caching the first step. On real-world data it drops the steps from 587 to 156. Or from 4.9/key to 1.3/key. This drops the time to lookup from 23.7us to 20.3us. Note that (k in dict) is 12.2us. I do wish we were just a bit closer to that. However, with _LeafNode inherited from dict, I get 26us, so maybe there is something in the interpreter that does a PyDict_CheckExact call, and there isn't much we can do about it.