For the past several years, I’ve thought Stackless Python was just plain cool. It’s a variant of the standard C implementation of Python that is extraordinarily well optimized for concurrent programming, thanks to its implementation of continuations and tasklets. If you’re curious for a read-up on the whole “stackless thing”, I’d recommend Why Stackless by Grant Olson.
The thing that slowed me down for ages was that it was not a straight-up copy of Python (meaning I had to compile the binaries myself), and I was honestly not too sure what its future would be. So I vaguely kept my eye on it; I caught the Stackless Python in EVE slides, and so forth.
While I was working on other projects a year or so ago, I ran across the Kamaelia project. It’s been somewhat quiet of late, but it is still definitely under active development. What I caught from this project was that I could implement something that looked very much like a Stackless tasklet and run it in straight Python. Like another bit of Python code called fibra, it takes advantage of generators to enable explicitly shared, cooperative concurrency. Most of Kamaelia is optimized for somewhat different goals, but the underlying component class (Axon.Component.component) is pretty much exactly the thing I needed.
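To make that concrete, here’s a toy sketch of the idea (mine, not Kamaelia’s actual internals): each task is a plain generator, a trivial scheduler round-robins through them, and every yield is an explicit context switch back to the scheduler:

    # Toy illustration of generator-based "tasklets" (not Kamaelia's real
    # scheduler): each task is a generator; yield hands control back.
    import collections

    def task(name, count):
        for i in range(count):
            print "%s: step %d" % (name, i)  # do a slice of work...
            yield 1                          # ...then switch back to the scheduler

    def run(tasks):
        queue = collections.deque(tasks)
        while queue:
            t = queue.popleft()
            try:
                t.next()         # resume the generator (next(t) in Python 3)
                queue.append(t)  # still running: put it back on the run queue
            except StopIteration:
                pass             # generator finished: drop it

    run([task('a', 3), task('b', 3)])

No threads, no locks: a task only loses control when it chooses to yield, which is the same explicit-switch property that makes Stackless tasklets pleasant.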
Of course the proof is in the pudding… so I poked and hacked away over the past day or two to replicate the hackysack example from Why Stackless. And yeah – then I timed it.
The good news is that it happily handled 10,000 and even 100,000 concurrent objects. The bad news is that it’s really just about as slow as threaded Python; Stackless still has the edge in raw speed. My quick results for timing the code, based on my implementation of the hackysack example from Why Stackless:
10 concurrent objects, 1000 loops
python timeit.py -s "import hackysack" "hackysack.runit(10,1000)"
10 loops, best of 3: 127 msec per loop
100 concurrent objects, 1000 loops
python timeit.py -s "import hackysack" "hackysack.runit(100,1000)"
10 loops, best of 3: 587 msec per loop
1000 concurrent objects, 1000 loops
python timeit.py -s "import hackysack" "hackysack.runit(1000,1000)"
10 loops, best of 3: 6.05 sec per loop
10000 concurrent objects, 1000 loops
python timeit.py -s "import hackysack" "hackysack.runit(10000,1000)"
10 loops, best of 3: 60.4 sec per loop
(these are running on a MacBook Pro: 2GHz Intel Core Duo, 2GB RAM, Python 2.5 – and that last iteration took up 728MB of RAM to run)
I’m going to assume I’m missing some efficiencies here that I could take advantage of – I’m more than willing to hear about a “better way” to do this.
If you’re curious about the code, here it is:
    import Axon
    import time
    import random
    import sys

    class hackymsg:
        def __init__(self, name):
            self.name = name

    class counter:
        def __init__(self):
            self.count = 0
        def inc(self):
            self.count += 1

    class hackysacker(Axon.Component.component):
        def __init__(self, name, circle, cntr, loops):
            Axon.Component.component.__init__(self)
            self.cntr = cntr
            self.name = name
            self.loops = loops    # terminating condition
            self.circle = circle  # a list of all the hackysackers
            circle.append(self)

        def main(self):
            yield 1
            while 1:
                if self.dataReady('inbox'):
                    msg = self.recv('inbox')
                    if msg == 'exit':
                        return
                    if self.cntr.count > self.loops:
                        # enough kicks: tell everyone in the circle to shut down
                        for z in self.circle:
                            z.inboxes['inbox'].append('exit')
                        return
                    #print "%s got hackysack from %s" % (self.name, msg.name)
                    # pick someone else in the circle to kick to
                    kickto = self.circle[random.randint(0, len(self.circle) - 1)]
                    while kickto is self:
                        kickto = self.circle[random.randint(0, len(self.circle) - 1)]
                    #print "%s kicking hackysack to %s" % (self.name, kickto.name)
                    msg = hackymsg(self.name)
                    kickto.inboxes['inbox'].append(msg)
                    self.cntr.inc()
                    #print self.cntr.count
                yield 1

    def runit(num_hackysackers=5, loops=100):
        cntr = counter()
        circle = []
        first_hackysacker = hackysacker('1', circle, cntr, loops)
        first_hackysacker.activate()
        for i in range(num_hackysackers):
            foo = hackysacker(str(i), circle, cntr, loops)
            foo.activate()
        # throw in the first sack...
        msg = hackymsg('me')
        first_hackysacker.inboxes['inbox'].append(msg)
        Axon.Component.scheduler.run.runThreads()

    if __name__ == "__main__":
        runit(num_hackysackers=1000, loops=1000)
Hi,
Yes the project is still going :) Interesting benchmark.
First of all, your code, on my machine:
/media/usbdisk/Kamaelia> python /usr/lib/python2.5/timeit.py "import hackysack" "hackysack.runit(10,1000)"
10 loops, best of 3: 182 msec per loop
/media/usbdisk/Kamaelia> python /usr/lib/python2.5/timeit.py "import hackysack" "hackysack.runit(100,1000)"
10 loops, best of 3: 820 msec per loop
/media/usbdisk/Kamaelia> python /usr/lib/python2.5/timeit.py "import hackysack" "hackysack.runit(1000,1000)"
10 loops, best of 3: 9.23 sec per loop
I kinda lost patience at this point – so I made a simple change:
Changing this part of your code:
    def main(self):
        yield 1
        while 1:
            if self.dataReady('inbox'):

To

    def main(self):
        yield 1
        while 1:
            while not self.anyReady():
                self.pause()
                yield 1
            while self.dataReady('inbox'):  # Note change from "if"
That yields the results below. I suppose the difference here from threads is that when we pause, we do actually pause and get removed from the run queue.
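To illustrate (with made-up names, not Axon’s real internals): a paused component is parked off the run queue entirely, and only requeued when a message delivery wakes it, so idle components cost no timeslices at all:

    # Toy sketch of pause/wake scheduling (hypothetical names, not Axon's
    # actual code): paused tasks leave the run queue; message delivery
    # is what puts them back.
    import collections

    class ToyScheduler:
        def __init__(self):
            self.runqueue = collections.deque()  # tasks that get timeslices
            self.paused = set()                  # tasks that cost nothing

        def pause(self, task):
            # with busy-polling, an idle task would stay on the run queue
            # and burn a timeslice every pass; here it leaves entirely
            self.paused.add(task)

        def wake(self, task):
            # called when a message lands in a paused task's inbox
            if task in self.paused:
                self.paused.remove(task)
                self.runqueue.append(task)

    sched = ToyScheduler()
    sched.pause('component-1')
    sched.wake('component-1')  # e.g. triggered by an inbox append

With busy-polling, every pass over the run queue touches every idle component; with pause/wake, scheduler work tracks only the components that actually have messages, which is (I think) why the larger runs drop so dramatically.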
/media/usbdisk/Kamaelia> python /usr/lib/python2.5/timeit.py "import hackysack" "hackysack.runit(10,1000)"
10 loops, best of 3: 206 msec per loop
/media/usbdisk/Kamaelia> python /usr/lib/python2.5/timeit.py "import hackysack" "hackysack.runit(100,1000)"
10 loops, best of 3: 267 msec per loop
/media/usbdisk/Kamaelia> python /usr/lib/python2.5/timeit.py "import hackysack" "hackysack.runit(1000,1000)"
10 loops, best of 3: 816 msec per loop
I can’t reliably test above that because I don’t have enough memory. However, even with that caveat:
/media/usbdisk/Kamaelia> python /usr/lib/python2.5/timeit.py "import hackysack" "hackysack.runit(5000,1000)"
10 loops, best of 3: 2.77 sec per loop
That figure includes my machine swapping. Despite that, I think that’s somewhat better :)
I’ve got a feeling that this is much more memory hungry – probably need to look into that. The optimisations that allow this were aimed at letting the system use next to zero CPU when it can (since we were making changes to allow shelling out to transcoders to utilise as much CPU as possible). As a result, memory wasn’t a consideration at the time.
(these are running on an openSUSE 10.3 laptop: Intel(R) Pentium(R) M processor 1.60GHz, 496MB RAM, Python 2.5.1, and Firefox sitting in the background eating up 372MB RAM...)
Looking at the Why Stackless link, this is still slower than Stackless (which makes sense; Stackless *does* remove a layer of complexity), but it appears to have similar scaling properties. Also, I suspect part of the issue is the fact that our scheduler is implemented in Python rather than C.
It’s tempting to see what would happen with the same code running under Stackless, changed so that the components were tasklets rather than generators (inboxes/outboxes could then be implemented using channels, after all).
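Something like this shape (a rough, illustrative sketch only; the hackysack logic is elided):

    # Rough sketch of the tasklet/channel shape under Stackless (illustrative
    # only): one tasklet per component, a channel standing in for its inbox.
    import stackless

    def component(name, inbox):
        while 1:
            msg = inbox.receive()  # blocks just this tasklet, not the process
            if msg == 'exit':
                return
            # ... pick another component and send to its channel here ...

    inbox = stackless.channel()
    stackless.tasklet(component)('1', inbox)

    def kickoff():
        inbox.send('exit')         # throw in the first sack

    stackless.tasklet(kickoff)()
    stackless.run()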
Oh, the reason why the Stackless version would be interesting is that it’d give a relatively simple optimisation path for the code if needed :)
You helped me a while back on the Kamaelia mailing list and got me headed in this direction – so thank you! And thank you for this tweak on the code!
I expected there were some optimizations in there that I could make but hadn’t yet teased out. In fact, I knew about pause() keeping things off the run loop until a component was “woken”, but I hadn’t gone back and read through the code to see whether the way I was injecting messages into an “inbox” would properly wake the target.
It is more memory hungry than Stackless – to be expected, since they ripped out the whole stack frame setup; their smallest tasklets take just tiny amounts of memory. Even so, it was the scaling behavior that I really wanted to dig into and see.
I fully expect I’ll keep fiddling and teasing with this setup for a while. I’ll post more on my blog as I roll it forward.