Yes indeed, I am talking about a single, very specific, low-level performance optimisation here. Hence the post in 'tools and methodologies' rather than somewhere tied to an individual project.

Having looked at the Linux kernel implementation since I first posted, I am more certain that for very high-bandwidth work (10Gbit networking is one example, but by no means the only one) this is the way to do it.

They have a very elegant mechanism, essentially shared memory with status flags, that allows synchronised access to buffers between the two parties. This is a much better idea than buffer passing, and both better and simpler than what I was thinking of.
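
To make the idea concrete, here is a minimal sketch in C of shared buffers guarded by per-slot status flags, as I understand the general technique. It is not the kernel's code: the names (`struct ring`, `ring_put`, `ring_get`, `SLOT_FREE`, `SLOT_FULL`) and the sizes are my own assumptions for illustration, and it assumes exactly one producer and one consumer. Each slot's flag says which side currently owns it, so no buffer is ever handed over by copying pointers around or taking a lock.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <string.h>

#define RING_SLOTS 4   /* hypothetical number of shared buffers */
#define SLOT_BYTES 64  /* hypothetical capacity of each buffer */

/* The status flag records which party owns the slot right now. */
enum slot_status { SLOT_FREE = 0, SLOT_FULL = 1 };

struct slot {
    _Atomic int status;             /* SLOT_FREE: producer owns, SLOT_FULL: consumer owns */
    size_t len;                     /* valid bytes in data[] */
    unsigned char data[SLOT_BYTES];
};

struct ring {
    struct slot slots[RING_SLOTS];  /* this array is what would live in shared memory */
    unsigned head;                  /* producer's private cursor */
    unsigned tail;                  /* consumer's private cursor */
};

void ring_init(struct ring *r)
{
    assert(r != NULL);
    for (unsigned i = 0; i < RING_SLOTS; i++) {
        atomic_init(&r->slots[i].status, SLOT_FREE);
        r->slots[i].len = 0;
    }
    r->head = 0;
    r->tail = 0;
}

/* Producer side: returns 0 on success, -1 if the payload is too big
 * or the next slot still belongs to the consumer (ring full). */
int ring_put(struct ring *r, const void *buf, size_t len)
{
    struct slot *s = &r->slots[r->head];

    if (len > SLOT_BYTES)
        return -1;
    if (atomic_load_explicit(&s->status, memory_order_acquire) != SLOT_FREE)
        return -1;

    memcpy(s->data, buf, len);
    s->len = len;
    /* Release: the payload is visible before the flag flips to FULL. */
    atomic_store_explicit(&s->status, SLOT_FULL, memory_order_release);
    r->head = (r->head + 1) % RING_SLOTS;
    return 0;
}

/* Consumer side: returns the payload length, or -1 if the next slot
 * is still owned by the producer (ring empty) or out is too small. */
int ring_get(struct ring *r, void *out, size_t cap)
{
    struct slot *s = &r->slots[r->tail];

    if (atomic_load_explicit(&s->status, memory_order_acquire) != SLOT_FULL)
        return -1;
    if (s->len > cap)
        return -1;

    int n = (int)s->len;
    memcpy(out, s->data, s->len);
    /* Release: we are done reading before the slot goes back to the producer. */
    atomic_store_explicit(&s->status, SLOT_FREE, memory_order_release);
    r->tail = (r->tail + 1) % RING_SLOTS;
    return n;
}
```

In a real setup the slot array would sit in memory mapped by both parties, with each side keeping its own cursor privately; I have put everything in one struct only to keep the sketch short. The point is that the only synchronisation is the flag flip on each slot.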

I should have thought of this, because I recall that many years ago Falcon 4 used a similar technique to provide state information, rather than providing a full API. It worked very well.

Thanks for all the thoughts.