dlang wrote: one longer-term thing to look at is alternate queueing approaches.
there is work being done for the linux kernel on a high-performance tracing infrastructure (info at http://lwn.net/Articles/300994/ )
among the things that they are doing as part of this is a very high performance circular buffer that doesn't require locking for inserting or removing messages (it also does multiple buffers that get combined after the fact, but you shouldn't need that extra layer of things)
I am very skeptical about moving in this direction. First of all, it is obviously very platform-specific, so at least two different code paths would need to be maintained. But, even more importantly, I don't think it would bring much benefit. The queue is much more than just a circular memory buffer. The queue object does all the multithreading, schedules the flow of messages through the various parts of the system and also maintains persistence across sessions. Not to mention extended disk-buffering or the ultra-reliable disk queue mode. The actual in-memory circular buffer part is minimalistic. For example, the linkedlist queue driver is around 100 lines of code. It may be possible, though, to utilize this kernel work in the form of a new queue driver, which could provide lock-free enqueueing and dequeueing. But then we would need to limit the number of workers with this approach to a maximum of 1, because, as you say, it may not be possible for multiple workers to dequeue. I am not sure if that is desirable.
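To make the single-worker restriction concrete, here is a minimal sketch of a single-producer/single-consumer ring buffer. It is neither the kernel's implementation nor an rsyslog queue driver; the class and method names are made up for illustration, and messages are assumed never to be `None`. Python only shows the index discipline here; a real lock-free version in C would additionally need atomic loads/stores and memory barriers.

```python
class SpscRing:
    """Single-producer/single-consumer ring buffer (illustrative sketch).

    Only the producer writes `_tail` and only the consumer writes `_head`.
    That ownership split is what removes the need for a lock, and it is
    also why a second consumer would break it: two consumers could read
    the same `_head` and return the same message twice.
    """

    def __init__(self, capacity):
        # One slot always stays empty so that "full" and "empty" differ.
        self._slots = [None] * (capacity + 1)
        self._head = 0  # next slot to read; written by the consumer only
        self._tail = 0  # next slot to write; written by the producer only

    def try_enqueue(self, msg):
        """Store msg and return True, or return False if the ring is full."""
        nxt = (self._tail + 1) % len(self._slots)
        if nxt == self._head:
            return False
        self._slots[self._tail] = msg
        self._tail = nxt  # publish only after the slot has been filled
        return True

    def try_dequeue(self):
        """Return the oldest message, or None if the ring is empty."""
        if self._head == self._tail:
            return None
        msg = self._slots[self._head]
        self._slots[self._head] = None
        self._head = (self._head + 1) % len(self._slots)
        return msg
```

Everything else the queue object does (worker scheduling, persistence, disk buffering) would still sit on top of such a buffer, which is the point made above about the in-memory part being the small piece.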
dlang wrote: this thread would
  read from its input queue
  for each output that this thread loops through
    if it goes to another queue, add it to that queue (which would have another thread like this one reading from it)
    else filter and format the message, then send it to the output routine
hehe - we are getting closer

This is very close to what is actually done today! It is done in a more abstracted way by utilizing DIRECT queue mode where appropriate, but from the point of view of "what" happens, it is correct. It is even mostly correct from the "how it happens" point of view, at least if you tear apart object encapsulation.
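The loop dlang describes can be sketched roughly as follows. This is an illustration of the idea only, not rsyslog's actual code: `run_worker`, `STOP` and the `(matches, fmt, send)` triple are hypothetical names, and treating the triple as the analogue of an output processed directly on the calling thread is my reading of the description above.

```python
import queue
import threading

STOP = object()  # sentinel that tells a worker to exit


def run_worker(in_queue, outputs):
    """Drain in_queue and hand each message to every configured output.

    An output is either another queue.Queue (the message is forwarded and a
    separate worker picks it up) or a (matches, fmt, send) triple that is
    filtered, formatted and sent directly on this thread.
    """
    while True:
        msg = in_queue.get()
        if msg is STOP:
            break
        for out in outputs:
            if isinstance(out, queue.Queue):
                out.put(msg)
            else:
                matches, fmt, send = out
                if matches(msg):
                    send(fmt(msg))


def demo():
    main_q, file_q = queue.Queue(), queue.Queue()
    console, logfile = [], []
    direct = (lambda m: m["severity"] <= 3, lambda m: "ALERT " + m["text"], console.append)
    to_file = (lambda m: True, lambda m: m["text"], logfile.append)

    main_worker = threading.Thread(target=run_worker, args=(main_q, [direct, file_q]))
    file_worker = threading.Thread(target=run_worker, args=(file_q, [to_file]))
    main_worker.start()
    file_worker.start()

    main_q.put({"severity": 2, "text": "disk failure"})
    main_q.put({"severity": 6, "text": "user login"})
    main_q.put(STOP)
    main_worker.join()  # all messages have been forwarded to file_q by now
    file_q.put(STOP)
    file_worker.join()
    return console, logfile
```

Calling `demo()` sends the severity-2 message to both outputs and the severity-6 message to the file output only, with the file output formatted on its own thread.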
dlang wrote:depending on the filters involved it may make sense for this thread to do some filtering even if it's sending it to another queue (cheap filters like priority would make sense to do here, expensive ones like regex matches would not)
this is very similar to what you are doing today as I understand it, with a couple (possibly important) differences.
1. today multiple output modules read from the same message queue. under this approach each queue will only ever have one reader.
Nope! It is not different outputs that read the same queue, but different workers. The number of workers is a configurable limit, so one could limit it to 1.
dlang wrote: this may simplify locking/contention issues on reads if you don't have to allow for multiple readers, and it simplifies the queue entry as you don't have to record which outputs have accessed it and which haven't yet.
All of this does NOT happen today, because we do not need it.
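A small sketch of why no per-output bookkeeping is needed when workers, not outputs, are the readers: each message is dequeued exactly once, and the worker that got it runs every output for it. The function and parameter names (`process_all`, `num_workers`) are hypothetical and do not correspond to rsyslog configuration settings; messages are assumed never to be `None`.

```python
import queue
import threading


def process_all(messages, outputs, num_workers=1):
    """Run num_workers threads over one shared queue.

    outputs is a list of predicates, one per output. Returns one list per
    output containing the messages that were delivered to it.
    """
    q = queue.Queue()
    delivered = [[] for _ in outputs]
    lock = threading.Lock()  # protects the result lists, not the queue

    def worker():
        while True:
            msg = q.get()
            if msg is None:  # one None per worker signals shutdown
                return
            # The worker that dequeued the message owns it and runs every
            # output for it, so the queue entry carries no per-output state.
            for i, wants in enumerate(outputs):
                if wants(msg):
                    with lock:
                        delivered[i].append(msg)

    threads = [threading.Thread(target=worker) for _ in range(num_workers)]
    for t in threads:
        t.start()
    for m in messages:
        q.put(m)
    for _ in threads:
        q.put(None)
    for t in threads:
        t.join()
    return delivered
```

With `num_workers=1` the queue has a single reader and delivery order matches arrival order; with more workers the same messages are delivered, but their relative order across workers is no longer guaranteed.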

dlang wrote:2. if you have separate queues for different outputs, all the formatting and filtering for that output is going to be done in separate threads from any other output, allowing for the output load to be spread across more CPUs and keeping the overhead of the main message queue loop very light
I mostly agree, and this is how it works today.

I do not agree on the filtering, as filtering decides which output is needed. This is done in the main message queue. If it were not, you would submit messages to the output queue that the output does not need, something that would happen frequently in a typical configuration. So the overall processing time would increase if you unnecessarily submit messages to the output queue. The main message queue uses multiple workers to do the decision-making in parallel, so the fact that the main message queue does this decision-making does not limit parallel processing.
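The cost argument can be illustrated by counting enqueue operations under both placements of the filter. This is a toy model with made-up rules and messages, not a measurement; `route`, `enqueue_count` and the output names are hypothetical.

```python
def route(messages, rules, filter_in_main_queue):
    """Return {output_name: [messages enqueued to that output's queue]}.

    rules maps an output name to a predicate. If filter_in_main_queue is
    True, a message is enqueued only where the predicate matches; otherwise
    every message is enqueued everywhere and filtering is left to the output.
    """
    output_queues = {name: [] for name in rules}
    for msg in messages:
        for name, wanted in rules.items():
            if not filter_in_main_queue or wanted(msg):
                output_queues[name].append(msg)
    return output_queues


def enqueue_count(output_queues):
    """Total number of enqueue operations across all output queues."""
    return sum(len(q) for q in output_queues.values())


# Assumed example configuration: three outputs, mostly low-priority traffic.
rules = {
    "all": lambda m: True,
    "errors": lambda m: m["severity"] <= 3,
    "mail": lambda m: m["severity"] <= 1,
}
messages = [{"severity": s} for s in (6, 6, 5, 3, 6, 1, 6, 6)]

filtered_first = enqueue_count(route(messages, rules, filter_in_main_queue=True))
filtered_late = enqueue_count(route(messages, rules, filter_in_main_queue=False))
```

For this made-up input, filtering in the main queue causes 11 enqueues while filtering in the outputs causes 24, since every message is copied to every output queue before most of them are discarded.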