FYI: while gathering the information, it occurred to me that there are some situations where it is safe to enqueue messages without locking. I have modified the algorithm so that it enables lock-free enqueue when it is safe to do so. One criterion is that there must be at least 100 messages in the queue, so the initial enqueues are not lock-free. But once the system gets busy, they are. The queue must NOT be disk-enabled at this point (I can enhance the algorithm to support this, but have not done so yet). Also, the current implementation does NOT correctly start up all worker threads. Again, this can be done, but I would like to keep it simple for now.
In my (low-end) test environment, I did NOT see more work done in parallel. However, I have checked the mutex calls, and the enqueue does not make any. So I think/hope the missing concurrency is just due to my test environment. I would be really interested to see how it works on good equipment.
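
To make the idea concrete, here is a minimal sketch of such a gated enqueue. It is NOT the actual patch: all names (queue_t, queue_enqueue, LOCKFREE_MIN_DEPTH and so on) are made up for illustration, and the linked-list layout with an atomically swapped tail pointer is just one well-known technique I picked, assuming C11 atomics and POSIX threads. In this sketch the linking step would be safe without the mutex at any depth; the 100-message threshold simply mirrors the condition described above, and the locked branch marks where mutex-dependent work (such as waking workers) would go.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

#define LOCKFREE_MIN_DEPTH 100 /* threshold taken from the text above */

typedef struct node {
    _Atomic(struct node *) next;
    void *msg;
} node_t;

typedef struct {
    _Atomic(node_t *) tail;    /* producers swap their node in here */
    node_t *head;              /* dummy node, only touched under mut */
    atomic_size_t depth;       /* upper bound on queued messages */
    atomic_size_t locked_enq;  /* statistics: enqueues that took the mutex */
    atomic_size_t lockfree_enq;/* statistics: enqueues that did not */
    bool disk_enabled;         /* set once at init, read-only afterwards */
    pthread_mutex_t mut;
} queue_t;

int queue_init(queue_t *q, bool disk_enabled) {
    node_t *dummy = malloc(sizeof *dummy);
    if (dummy == NULL) return -1;
    atomic_init(&dummy->next, NULL);
    dummy->msg = NULL;
    q->head = dummy;
    atomic_init(&q->tail, dummy);
    atomic_init(&q->depth, 0);
    atomic_init(&q->locked_enq, 0);
    atomic_init(&q->lockfree_enq, 0);
    q->disk_enabled = disk_enabled;
    return pthread_mutex_init(&q->mut, NULL) == 0 ? 0 : -1;
}

/* Append n without a lock: claim the tail slot, then publish the link. */
static void link_node(queue_t *q, node_t *n) {
    atomic_fetch_add(&q->depth, 1); /* count first, so depth never underflows */
    node_t *prev = atomic_exchange(&q->tail, n);
    atomic_store(&prev->next, n);
}

int queue_enqueue(queue_t *q, void *msg) {
    assert(msg != NULL); /* NULL is reserved for "queue empty" on dequeue */
    node_t *n = malloc(sizeof *n);
    if (n == NULL) return -1;
    atomic_init(&n->next, NULL);
    n->msg = msg;

    if (!q->disk_enabled && atomic_load(&q->depth) >= LOCKFREE_MIN_DEPTH) {
        link_node(q, n); /* busy queue: no mutex call at all */
        atomic_fetch_add(&q->lockfree_enq, 1);
    } else {
        pthread_mutex_lock(&q->mut);
        link_node(q, n);
        /* a real engine would wake or start a worker here */
        pthread_mutex_unlock(&q->mut);
        atomic_fetch_add(&q->locked_enq, 1);
    }
    return 0;
}

/* Returns the oldest message, or NULL if none is visible yet. */
void *queue_try_dequeue(queue_t *q) {
    void *msg = NULL;
    pthread_mutex_lock(&q->mut);
    node_t *head = q->head;
    node_t *next = atomic_load(&head->next);
    if (next != NULL) {
        msg = next->msg;
        q->head = next; /* next becomes the new dummy node */
        free(head);
        atomic_fetch_sub(&q->depth, 1);
    }
    pthread_mutex_unlock(&q->mut);
    return msg;
}
```

With a freshly initialized, non-disk queue, the first 100 calls to queue_enqueue() take the mutex and every later call skips it for as long as the depth stays at 100 or above. Note that the sketch leaves out blocking workers on purpose: once some enqueues no longer signal anyone, a worker sleeping on a condition variable can miss a wakeup, so that part needs separate care.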

Beware that the code may be racy: I made serious changes to the engine and have NOT done any testing beyond the simple cases. For anything complex, it may run into trouble (but let's first see if there is any benefit in what I did).
Rainer