Hmm... this idea may sound a bit excessive, but could you not 'mix in' the click sound right into the buffer? Or have a pre-buffer that does this mixing? You may end up using some pretty low-level code (inline ASM) to get the job done fast, but it may allow you to keep to the one sound buffer.

Only question is, can it be done fast enough?
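
To make the idea concrete, here is a rough sketch of the mixing step in plain C. It assumes 16-bit signed mono PCM for both the buffer and the click, which may not match your format; the function name mix_click and its parameters are made up for illustration, not taken from any audio API.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: add the click samples into an existing sound
 * buffer in place, starting at sample index `offset`.
 * Assumes 16-bit signed mono PCM (an assumption, adjust for your format).
 * Returns the number of samples actually mixed. */
size_t mix_click(int16_t *buffer, size_t buffer_len,
                 const int16_t *click, size_t click_len,
                 size_t offset)
{
    if (offset >= buffer_len) {
        return 0; /* click would start past the end of the buffer */
    }

    /* Mix only as many samples as fit in the remaining buffer. */
    size_t n = buffer_len - offset;
    if (click_len < n) {
        n = click_len;
    }

    for (size_t i = 0; i < n; i++) {
        /* Sum in 32 bits so the addition itself cannot overflow. */
        int32_t sum = (int32_t)buffer[offset + i] + (int32_t)click[i];

        /* Clamp to the 16-bit range instead of wrapping around,
         * which would otherwise produce a loud crackle. */
        if (sum > INT16_MAX) sum = INT16_MAX;
        if (sum < INT16_MIN) sum = INT16_MIN;

        buffer[offset + i] = (int16_t)sum;
    }
    return n;
}
```

That is one add and one clamp per sample, and only over the length of the click, so it may be worth timing a loop like this before reaching for in-line ASM.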