rsyslog performance

This is the place for developers to discuss bugs, new features and everything else about code changes.

Re: rsyslog performance

Postby rgerhards on Fri Oct 03, 2008 4:11 pm

dlang wrote:
rgerhards wrote:
dlang wrote:one longer-term thing to look at is alternate queing approaches.

there is work being done for the linux kernel on a high-performance tracing infrastructure (info at http://lwn.net/Articles/300994/ )
among the things that they are doing as part of this is a very high performance circular buffer that doesn't require locking for inserting or removing messages (it also does multiple buffers that get combined after the fact, but you shouldn't need that extra layer of things)


I am very skeptic for moving into this direction. First of all, it is obviously very platform-specific, so at least two different code pathes would need to be maintained. But, even more importantly, I don't think that would bring so much benefit. The queue is much more than just a circular memory buffer. The queue object does all multithreading, schedules flow of messages through the various parts of the system and also maintains persistence across sessions. Not to mention extended disk-buffering or the ultra-reliable disk queue mode. The actual in-memory circular buffer part is minimalistic. For example, the linkedlist queue driver is around 100 lines of code. It may be possible, though, to utilize this kernel work in the form of a new queue driver, which could provide lock-free enqueuing and dequeueing. But than we would need the number of workers with this approach to a maximum of 1, because, as you say, it may not be possible that multiple workers dequeue. I am not sure if that is desirable.

the intent was not that you would use this feature on linus (it's an in-kernel tool, I don't think userspace would have access to it anyway), but rather as code that would be used for a new queue driver (possibly replacing some existing queue drivers if it's enough better).

dlang wrote:this thread would
read from it's input queue
for each output that this thread loop through
if it goes to another queue, add it to that queue (which would have another thread like this one reading from it)
else filter and format the message then send it to the output routine


hehe - we are getting closer ;) This is very close to what is actually done today! It's done in a more abstracted way by utilizing DIRECT queue mode where appropriate, but from a point of "what" happens it is correct. It is even mostly correct from the "how it happens" point, at least if you tear apart object encapsulation.
dlang wrote:depending on the filters involved it may make sense for this thread to do some filtering even if it's sending it to another queue (cheap filters like priority would make sense to do here, expensive ones like regex matches would not)

this is very similar to what you are doing today as I understand it, with a couple (possibly important) differences.

1. today multiple output modules read from the same message queue. under this approach each queue will only ever have one reader.

Nope! not different outputs read the same queue, but different workers. The number of workers is a configuration limit, so one could limit them to 1.
dlang wrote: this may simplify locking/contention issues on reads if you don't have to allow for multiple readers, and it simplifies the queue entry as you don't have to record which outputs have accessed it and which haven't yet.

All of this does NOT happen today, because we do not need it ;)
you say this isn't happening today, if you have multiple workers, how does this work?


I was answering in the context of different outputs. Different outputs do not access the main message queue multiple times. Different workers of course do. But this does no carry a co-notation of that we need to track anything in regard to the outputs.
User avatar
rgerhards
Site Admin
 
Posts: 1659
Joined: Thu Feb 13, 2003 11:57 am

Professional Services Information

  • Custom written rsyslog.conf?
  • Maintenance Contract?
  • Installation support?

Re: rsyslog performance

Postby rgerhards on Fri Oct 03, 2008 4:17 pm

dlang wrote:That can work. since that would only kick in while you have a lot of messages queued, and even with two gettimeofday calls per syscall (in a systrace) handling one loop takes around 0.00015 seconds, the loss of precision should not matter to anyone. you may want to make this a tunable so that people who care can set it to 1, but default it to 10 or 100.

the advantage of the seperate thread is that it would be able to eliminate gettimeofday calls everywhere, including in output modules. I agree that adding a seperate thread is not a pretty thing to do, but it's doing almost nothing (a select sleep for just under a second, followed by a gettimeofday call and updating one variable). If another thread is still a concern could you do this in the otherwise idle main thread instead of a seperate thread?


The issue I see is not only the additional thread. It is that this means rsyslogd runs on a tick. I just did a lot of work to keep it tickless so that we do not prevent the system it is running on to go to powersafe mode.

Also, you would need to synrchonize access to that timestamp. Granted, if you assume it to be 32 bits and if you assume that all machines will have atomic operations to do so, that's not a big deal. But it is far from clean programming and it is highly platform-dependent.

Also, I do not see the ultimate value in it (assuming we could solve the "atomic issue"): If you receive 80,000 msgs/sec and set the counter to query the time every 1,000 record (which should bring an even better precision in this case), then you do 80 time calls per second. So you could potentially save 79 of them if you do it in a separate thread. Now let's assume that the cost of the memory barrier is zero (very questionable). During this 79 time calls you could probably process another 300 to 500 messages. Does that really matter and justify such a problematic change? I don't think so... And if you expect that scenario, you could set the setting to get the date only every 10,000 time - which leads you to only 7 additional calls...
User avatar
rgerhards
Site Admin
 
Posts: 1659
Joined: Thu Feb 13, 2003 11:57 am

Re: rsyslog performance

Postby dlang on Fri Oct 03, 2008 4:31 pm

rgerhards wrote:On re-ordering: as soon as you run things asynchronous, messages can be re-ordered. There is no way around this. If order of events is important, you need to limit the main message queue to a single worker thread and do NOT run the outputs in any asynchronous mode.

On multi-line messages: there is no such beast in syslog. However, depending on the transport and framing used, rsyslog supports embedded LF's inside the message and treats them as single messages. A multi-line message in the sense of multiple messages being a single one does also cause potential problem on the UDP layer. UDP does not guarantee order of delivery.

good point on UDP potentially re-ordering messages.

while syslog does not define multi-line messages, there is a limit to the message length, and as a result some devices send what's logicly one message as multiple syslog messages (sidewinder firewalls for example will send some lines that have [continued] at the beginning of the message).

thanks for clarifying on the multiple worker queues, that clarifies things a lot.
dlang
Frequent Poster
 
Posts: 125
Joined: Mon Sep 15, 2008 7:44 am

Re: rsyslog performance

Postby rgerhards on Fri Oct 03, 2008 4:35 pm

dlang wrote:while syslog does not define multi-line messages, there is a limit to the message length, and as a result some devices send what's logicly one message as multiple syslog messages (sidewinder firewalls for example will send some lines that have [continued] at the beginning of the message).


If it just contains "[continued]" and not at least an unique message ID, I would consider this implementation to be broken. If so, this forces you do do anything serially (and also forces you to hope that the network, IP stack and any other component did not drop or reorder things). I don't know what to do against this. If you need order, you need to stay away from multiple workers and async outputs. You can configure the engine to do that, but obviously at a high performance cost. I don't see any way to prevent this.

As a side note, the upcoming syslog RFC series will require an higher max message size limit and also permits to use larger sizes. In the long term, this can be a solution to the problem with sidewinder. In the short term, they should add an ID. Very easily to do (but, yes, granted, you need to convince them ;)).
User avatar
rgerhards
Site Admin
 
Posts: 1659
Joined: Thu Feb 13, 2003 11:57 am

Re: rsyslog performance

Postby rgerhards on Fri Oct 03, 2008 4:42 pm

rgerhards wrote:The issue I see is not only the additional thread. It is that this means rsyslogd runs on a tick. I just did a lot of work to keep it tickless so that we do not prevent the system it is running on to go to powersafe mode.

Also, you would need to synrchonize access to that timestamp. Granted, if you assume it to be 32 bits and if you assume that all machines will have atomic operations to do so, that's not a big deal. But it is far from clean programming and it is highly platform-dependent.

Also, I do not see the ultimate value in it (assuming we could solve the "atomic issue"): If you receive 80,000 msgs/sec and set the counter to query the time every 1,000 record (which should bring an even better precision in this case), then you do 80 time calls per second. So you could potentially save 79 of them if you do it in a separate thread. Now let's assume that the cost of the memory barrier is zero (very questionable). During this 79 time calls you could probably process another 300 to 500 messages. Does that really matter and justify such a problematic change? I don't think so... And if you expect that scenario, you could set the setting to get the date only every 10,000 time - which leads you to only 7 additional calls...


Trying it out is probably a good idea in this case: I'll add this option the next time I touch the code. It requires a bit more work than it initially looks like,because the message constructor must be changed. Not a big deal, but needs to be done.
User avatar
rgerhards
Site Admin
 
Posts: 1659
Joined: Thu Feb 13, 2003 11:57 am

Re: rsyslog performance

Postby dlang on Fri Oct 03, 2008 4:58 pm

rgerhards wrote:I was answering in the context of different outputs. Different outputs do not access the main message queue multiple times. Different workers of course do. But this does no carry a co-notation of that we need to track anything in regard to the outputs.

right, because each message is handled in it's entirety by a single de-queueing thread.

thanks for your patience in explaining this to me.

David Lang
dlang
Frequent Poster
 
Posts: 125
Joined: Mon Sep 15, 2008 7:44 am

Re: rsyslog performance

Postby dlang on Fri Oct 03, 2008 5:24 pm

rgerhards wrote:
dlang wrote:That can work. since that would only kick in while you have a lot of messages queued, and even with two gettimeofday calls per syscall (in a systrace) handling one loop takes around 0.00015 seconds, the loss of precision should not matter to anyone. you may want to make this a tunable so that people who care can set it to 1, but default it to 10 or 100.

the advantage of the seperate thread is that it would be able to eliminate gettimeofday calls everywhere, including in output modules. I agree that adding a seperate thread is not a pretty thing to do, but it's doing almost nothing (a select sleep for just under a second, followed by a gettimeofday call and updating one variable). If another thread is still a concern could you do this in the otherwise idle main thread instead of a seperate thread?


The issue I see is not only the additional thread. It is that this means rsyslogd runs on a tick. I just did a lot of work to keep it tickless so that we do not prevent the system it is running on to go to powersafe mode.

Also, you would need to synrchonize access to that timestamp. Granted, if you assume it to be 32 bits and if you assume that all machines will have atomic operations to do so, that's not a big deal. But it is far from clean programming and it is highly platform-dependent.
I believe that every unix system does atomic actions on 4-byte aligned 32 bit values. 64 bit values are a different issue
Also, I do not see the ultimate value in it (assuming we could solve the "atomic issue"): If you receive 80,000 msgs/sec and set the counter to query the time every 1,000 record (which should bring an even better precision in this case), then you do 80 time calls per second. So you could potentially save 79 of them if you do it in a separate thread. Now let's assume that the cost of the memory barrier is zero (very questionable). During this 79 time calls you could probably process another 300 to 500 messages. Does that really matter and justify such a problematic change? I don't think so... And if you expect that scenario, you could set the setting to get the date only every 10,000 time - which leads you to only 7 additional calls...
if you were only doing timestamps in the input modules you would be right, but you are currently doing timestamps at other times (I know you do it during the output modules as well), and all of those cases may need to be mitigated.

for the output side of things, if it were possible to output multiple messages at once, then doing one timestamp per output cycle would be a reasonable approach, but since that doesn't exist it would be a lot harder to mitigate things there.
dlang
Frequent Poster
 
Posts: 125
Joined: Mon Sep 15, 2008 7:44 am

Re: rsyslog performance

Postby rgerhards on Fri Oct 03, 2008 5:43 pm

time_t is not guaranteed to be 32 bits, and with multiple CPUs (and caches) operations are only atomic if a memory barrier is issued (aka it is not atomic unless you request it). On some CPUs (Intel being an example), atomic operations are quite limited. With more parallel hardware, things get worse. Memory barriers are a big overall system performance hit.

On the timestamps: We are already past this - I guess you overlooked this in the volume of messages exchanged ;) You and the feedback from the mailing list convinced me that time() shall no longer be called in the output part. The current perf branch does no longer do that (but use mktime() as a not well working interim solution). So the input is the only place where the time is taken.
User avatar
rgerhards
Site Admin
 
Posts: 1659
Joined: Thu Feb 13, 2003 11:57 am

Re: rsyslog performance

Postby dlang on Fri Oct 03, 2008 6:29 pm

rgerhards wrote:time_t is not guaranteed to be 32 bits,
in theory you are correct, in practice decades of programs have been written with the assumption that time is a 32 bit signed number of the number of seconds since Jan 1 1970, when this runs out in 2038 it's going to cause a _lot_ of problems (the type of things that Y2K was 'supposed' to cause ;)), but because of the huge installed base of programs that do it wrong it can't just be changed
and with multiple CPUs (and caches) operations are only atomic if a memory barrier is issued (aka it is not atomic unless you request it). On some CPUs (Intel being an example), atomic operations are quite limited. With more parallel hardware, things get worse. Memory barriers are a big overall system performance hit.
yes, most atomic operations are expensive on multiple CPUs, but I believe that writing one word to properly aligned memory (where you don't care what was there before) is atomic on everything.

not that it really matters much since you have a better approach for the input modules and have eliminated it from the output modules (which I had missed)

David Lang
dlang
Frequent Poster
 
Posts: 125
Joined: Mon Sep 15, 2008 7:44 am

Re: rsyslog performance

Postby rgerhards on Fri Oct 03, 2008 8:43 pm

Just to clarify, agreeing that it does no longer matter ;)

dlang wrote:
rgerhards wrote:time_t is not guaranteed to be 32 bits,
in theory you are correct, in practice decades of programs have been written with the assumption that time is a 32 bit signed number of the number of seconds since Jan 1 1970, when this runs out in 2038 it's going to cause a _lot_ of problems (the type of things that Y2K was 'supposed' to cause ;)), but because of the huge installed base of programs that do it wrong it can't just be changed


Well, as far as I know there is work being done on moving that interface to 64 bits (via new library calls). Even thought the problem is roughly 29 years in the future, I plan to eliminate it as soon as possible.

dlang wrote:yes, most atomic operations are expensive on multiple CPUs, but I believe that writing one word to properly aligned memory (where you don't care what was there before) is atomic on everything.


To the best of my knowledge, this is not true on current CPUs. The problem is that a write to a word does not go to memory but instead to the CPU-internal caches. Then, at some time, the cache line is propagated to memory (probably via stages of different intermediary caches). Even then, the other CPUs do not necessarily have the correct value, but rather what is in (one of) their caches. It's probably not a big deal for what we inteded to do, but it is racy - and becomes more problematic as cache levels and cache sizes increase. So we may run into trouble we do not yet notice but have an extremely hard to find problem in 5 years or so. This is why I don't like to introduce potentially racy constructs.

Rainer
User avatar
rgerhards
Site Admin
 
Posts: 1659
Joined: Thu Feb 13, 2003 11:57 am

Re: rsyslog performance

Postby dlang on Fri Oct 03, 2008 9:56 pm

rgerhards wrote:Just to clarify, agreeing that it does no longer matter ;)

dlang wrote:
rgerhards wrote:time_t is not guaranteed to be 32 bits,
in theory you are correct, in practice decades of programs have been written with the assumption that time is a 32 bit signed number of the number of seconds since Jan 1 1970, when this runs out in 2038 it's going to cause a _lot_ of problems (the type of things that Y2K was 'supposed' to cause ;)), but because of the huge installed base of programs that do it wrong it can't just be changed


Well, as far as I know there is work being done on moving that interface to 64 bits (via new library calls). Even thought the problem is roughly 29 years in the future, I plan to eliminate it as soon as possible.
I'm in the camp that believes that by that time all our systems are going to be 64 bit and so the rules that currently apply to 32 bit will apply to 64 bit by then.
dlang wrote:yes, most atomic operations are expensive on multiple CPUs, but I believe that writing one word to properly aligned memory (where you don't care what was there before) is atomic on everything.


To the best of my knowledge, this is not true on current CPUs. The problem is that a write to a word does not go to memory but instead to the CPU-internal caches. Then, at some time, the cache line is propagated to memory (probably via stages of different intermediary caches). Even then, the other CPUs do not necessarily have the correct value, but rather what is in (one of) their caches. It's probably not a big deal for what we inteded to do, but it is racy - and becomes more problematic as cache levels and cache sizes increase. So we may run into trouble we do not yet notice but have an extremely hard to find problem in 5 years or so. This is why I don't like to introduce potentially racy constructs.
In this case the exact syncing between the CPUs doesn't matter, as long as any CPU will see either the old 32 bit value or the new 32 bit value you are OK. the only time you would run into problems would be if one CPU was able to see something that was a combination of the two values.

this is why 64 bit counters are so expensive to do on 32 bit machines, there's a possibility that another CPU will see the top 32 bits of the old value and the bottom 32 bits of the new value. on 64 bit machines (again with properly aligned variables) this can't happen and all CPUs in the system will see one complete value or the other.

but if you use word-sized objects and only have one thread updating them, all other threads will see either the old value or the new value. if it's absolutely critical that all CPUs will see the new value as soon as it changes then you may need to use memory barriers, but you almost never need that (becouse the thread on the other CPU could have run a few ns earlier anyway when the old value was correct)
dlang
Frequent Poster
 
Posts: 125
Joined: Mon Sep 15, 2008 7:44 am

Re: rsyslog performance

Postby rgerhards on Sat Oct 04, 2008 5:19 pm

Not that it really matters, but... Of course, you are right under the current state of technology and goal to achieve. However, much of that is really tied into things "that are because they are this way" and not into things "that are because they are specified to be this way". The subtle difference is that things not specified may change over time. There are too many examples of people assuming something because the machinery worked in that way at a given point in time. But technology changes over time and machinery may work totally different in 5 or 10 or whatever years from now. How do we know our assumption are still valid then? Problems of this kind have happened. They are extremely hard to find at the time the surface, because people have either forgotten about the assumption or nobody is reading the subtle issues inside the doc ;) I think those things are for good reason considered a questionable programming practice.

This is why I am very skeptic about utilizing things that I would call a "side-effect" at best. However, I concur with you that in some situations it may be good to do so. The decision driver here is importance. And, even if we don't agree on my view, I think we have agreed that the situation here is not worth going the risk. So I guess I am beating the dead cow, but I could not resist to clarify on my POV ;)

Rainer
User avatar
rgerhards
Site Admin
 
Posts: 1659
Joined: Thu Feb 13, 2003 11:57 am

Re: rsyslog performance

Postby rgerhards on Mon Oct 06, 2008 11:29 am

Just a side-note: looking at the message parser (early stages), I see some room for performance improvement there. However, this is obviously much less potential than when saving system calls. Just FYI.
User avatar
rgerhards
Site Admin
 
Posts: 1659
Joined: Thu Feb 13, 2003 11:57 am

Re: rsyslog performance

Postby rgerhards on Mon Oct 06, 2008 1:24 pm

OK, I now have added a $UDPServerTimeRequery <nbr> config option (in the perf branch). The enables to reduce the number of gettime() calls in the input. Note that the output still has the previous logic, which consumes a bit more performance.

Specify
Code: Select all
$UDPServerTimeRequery 100

to have a time query only done every 100th message. Default is 2, value 0 means "always" (same as 1).

I would appreciate if you could run this version and let me know both the CPU utilization as well as absolute throughput. The later is of interest so that I can see whether or not we have improved or not (thinking about lokcing contention).

Thanks,
Rainer
User avatar
rgerhards
Site Admin
 
Posts: 1659
Joined: Thu Feb 13, 2003 11:57 am

Re: rsyslog performance

Postby dlang on Mon Oct 06, 2008 7:52 pm

this doesn't seem to be helping much. I see from the strace that it's doing the smaller number of gettimeofday calls, but the overall cpu utilization is still low (and the throughput corresponds to the cpu utilization)

I'm wondering if it would be useful to make a 'fake input' module to help test this, something that would just generate messages and queue them.

I'm thinking of something like
$DummyInputMessageSize <size of message>
$DummyInputMessageCount <# of messages to send in a burst>
$DummyInputMessageDelay <time to delay between bursts>
$DummyInputServerTimeRequery <# of messages in a burst to use the same timestamp on>

the input module would make a dummy message something with something like sprintf( "test %(SIZE-5)d", count) where it sends MessageCount messages as quickly as it can queue them, then sleep, recount the counters, and repeat.

I've started looking at doing this, but it is probably something trivial for someone familiar with the codebase to do, while it will take me longer to do this (I probably won't have a chance to try to code this for a couple of days)
dlang
Frequent Poster
 
Posts: 125
Joined: Mon Sep 15, 2008 7:44 am

Google Ads


PreviousNext

Return to Developer's Corner

Who is online

Users browsing this forum: No registered users and 0 guests

cron