[RFC] trappy: add support for event callbacks #241
This patch reduces the frequency of calling getrusage, resulting in a ~3-5% performance improvement in parsing. We check once at the beginning and then once every 10000 events. Signed-off-by: Joel Fernandes <[email protected]>
In preparation for introducing event callbacks, split out string parsing into a separate function. Signed-off-by: Joel Fernandes <[email protected]>
Event callbacks are called per-event depending on the trace class; the callback is passed the event's timestamp and a dict of all parsed fields. Signed-off-by: Joel Fernandes <[email protected]>
For event-callback-driven scripts, it's useful to save time by not building data frames that aren't going to be used anyway, so add a parameter to skip building data frames. This saves ~30% of the time an event-callback-based script takes. Signed-off-by: Joel Fernandes <[email protected]>
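For illustration, a minimal sketch of what a callback-driven run might look like; the constructor arguments event_callbacks and build_df here are assumptions for the example, not necessarily the API added by this series:

import trappy

def on_sched_switch(time, data_dict):
    # data_dict carries the parsed event fields, e.g. data_dict["next_comm"]
    print(time, data_dict.get("next_comm"))

# Hypothetical arguments: register a per-event callback and skip
# DataFrame construction entirely for speed.
trace = trappy.FTrace("trace.txt",
                      events=["sched_switch"],
                      event_callbacks={"sched_switch": on_sched_switch},
                      build_df=False)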
bjackman
left a comment
Haven't properly read your example notebook so might be misunderstanding this - is this basically doing the same thing as parsing a trace into a DataFrame then using DataFrame.apply? Is the idea to avoid building the DataFrame for performance reasons? Is it parsing the trace that takes too long? Because if so, it may be preferable to speed up trace parsing rather than adding a new API to get around perf shortcomings.
# to be getting close to the system's limit, check it only once
# in the beginning and then every CHECK_MEM_COUNT events
check_memory_count -= 1
if check_memory_count != 0:
I find this confusing - IMO this would be much clearer (no continue):
if check_memory_usage and check_memory_count == 0:
    kb_used = (getrusage(RUSAGE_SELF).ru_maxrss - starting_maxrss)
    if kb_free and kb_used > kb_free * 0.9:
        warnings.warn("TRAPpy: Appear to be low on memory. "
                      "If errors arise, try providing more RAM")
        check_memory_usage = False
    check_memory_count = CHECK_MEM_COUNT

yield data_dict
Not sure if this is just me..
PS this getrusage stuff was originally added (by me) to get around the fact that Pandas was segfaulting, instead of raising a proper Python error, when it ran out of memory. It might be worth checking that the latest Pandas still has that problem; if it doesn't maybe we can drop this stuff entirely as it's pretty ugly.
Brendan, the memory limit has never been a problem for me, even for large data frames. I don't mind ripping it out altogether unless anyone else has an objection?
Not sure how much RAM you had, but why not rip it out yourself and see if you still hit it with new pandas? I agree it's pretty ugly.
Sure, I'll give it a try (I was using a VM with 1 GB of memory).
OK, I tested it and pandas still segfaults when you run out of memory, so we'll need to keep this hack.
|
Hi Brendan, regarding your comment "Is the idea to avoid building the DataFrame for performance reasons? Is it parsing the trace that takes too long?": performance is part of the reason, but it's also that it's impossible to use data frames in all situations. The data frames are built per-event, so you have one df per event, e.g. one for sched_switch and one for sched_wakeup. In order to build an analysis script that processes a state machine, .apply() will not solve the problem, because you would .apply the first data frame (which corresponds to the first event) and then .apply the second data frame for the next event. However, what is needed is something that processes events as a function of time. So for example, if the events are:

I don't see how this can be done using data frames. Even if it can be done using dfs, it will still not be as fast: using data frames in the WIP script [1], I found that it is 3 times slower than [2]. As such, the API I am proposing is clean, so I don't see an issue with it if someone doesn't want to use data frames for a particular use case, as I don't. |
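To make the state-machine use case concrete, here is a minimal sketch of a callback-driven wakeup-latency analysis; the callbacks follow the (time, data_dict) signature from the patch description, and the field names "pid" and "next_pid" are assumptions about the parsed dicts:

wakeup_time = {}   # pid -> timestamp of the last sched_wakeup
latencies = []

def on_sched_wakeup(time, data):
    wakeup_time[data["pid"]] = time

def on_sched_switch(time, data):
    # Latency is the delay between a task's wakeup and it being switched in.
    pid = data["next_pid"]
    if pid in wakeup_time:
        latencies.append(time - wakeup_time.pop(pid))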
|
Ah, I see. I think I will have to take some time to get to grips with your use case and how it compares to the latency analysis stuff in LISA. It can take a bit of experimentation, but Pandas can be really fast when you find the right way of expressing your operations. The API addition could definitely make sense but first I think we should be really certain that it's necessary. |
|
An alternative and more Pandas-friendly way to perform that apply "in order" is to "join" the two DFs and operate on the new one, which is something I did here: BTW, that API returns the latency DF which, AFAIU, is exactly what your example above suggests. This notebook: |
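As a rough sketch of the join-based approach described here (toy data and column names for illustration, not TRAPpy's actual schema):

import pandas as pd

# Toy stand-ins for the per-event frames TRAPpy builds (index = time).
wakeup_df = pd.DataFrame({"pid": [10, 20]}, index=[1.0, 2.0])
switch_df = pd.DataFrame({"next_pid": [10, 20]}, index=[1.5, 2.2])

# Align both frames on time, then walk the combined frame once in order.
joined = wakeup_df.join(switch_df, how="outer").sort_index()
print(joined.apply(lambda row: row.to_dict(), axis=1))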
|
It's totally possible to do this with join and sorting, but this trappy feature is useful for analyses where you cannot join, for example, and still need to process events in a time-correlated fashion (and do it fast). |
derkling
left a comment
Wondering if this feature can be extended to also cover the use case I'm trying to address in #243.
if not self.callback:
    return
data_dict = self.generate_data_dict(comm, pid, cpu, data)
self.callback(time, data_dict)
Why not make the callback even more generic by passing back the exact signature used by generate_data_dict?
I'm asking that for two main reasons:
- the generate_data_dict can still be used from the callback if required
- I have a WIP series which aims at adding parsing for non-key-value-formatted events, like for example the events injected from user space by the Android run-time. This is an example of these events:
surfaceflinger-543 ( 543) [002] ...1 3210.843141: tracing_mark_write: B|543|query
surfaceflinger-543 ( 543) [002] ...1 3210.843145: tracing_mark_write: E
surfaceflinger-543 ( 543) [002] ...1 3210.843149: tracing_mark_write: B|543|updateTexImage
surfaceflinger-543 ( 543) [002] ...1 3210.843168: tracing_mark_write: B|543|acquireBuffer
surfaceflinger-543 ( 543) [002] ...1 3210.843178: tracing_mark_write: B|543|com.prefabulated.touchlatency/com.prefabulated.touchlatency.TouchLatencyActivity#0: 1
surfaceflinger-543 ( 543) [002] ...1 3210.843180: tracing_mark_write: E
surfaceflinger-543 ( 543) [002] ...1 3210.843187: tracing_mark_write: C|543|com.prefabulated.touchlatency/com.prefabulated.touchlatency.TouchLatencyActivity#0|0
surfaceflinger-543 ( 543) [002] ...1 3210.843208: tracing_mark_write: E
As you can see, the data portion of these events cannot be parsed by generate_data_dict.
I didn't follow how you wanted this done, can you give an example?
I think if this is to be made generic, it should be done internally to trappy. The user's callback implementation shouldn't have any business parsing strings, in my opinion; trappy should parse them in a generic way and pass some generic data type to the callback. So for your systrace example, it should pass the callback something like { '__comm': 'surfaceflinger', '__pid': 543, 'mark': 'begin', 'name': 'query' }
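For illustration, a sketch of how the atrace-style payloads shown above could be parsed into such a generic dict; the key names follow the suggestion in this comment and are not an existing TRAPpy format:

_MARK_TYPES = {"B": "begin", "E": "end", "C": "count"}

def parse_mark_write(data):
    # Payloads look like "B|543|query", "E" or "C|543|name|0".
    parts = data.split("|")
    event = {"mark": _MARK_TYPES.get(parts[0], parts[0])}
    if len(parts) > 1:
        event["pid"] = int(parts[1])
    if len(parts) > 2:
        event["name"] = parts[2]
    if len(parts) > 3:
        event["value"] = int(parts[3])
    return event

parse_mark_write("B|543|query")  # {'mark': 'begin', 'pid': 543, 'name': 'query'}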
|
I assigned myself to this PR since it seems like an interesting feature; I will take a look at it later in the day. |
|
I took a look through your example scripts and got up to speed on LISA's

I can't think of any use cases where you can't solve it using joins. Also, I can't see how the callbacks can be (necessarily) faster than using

I'm not a big fan of the fact that the callbacks can only be set up when you construct the trace object. I feel the constructor should really be about building & collating data structures, and rich analysis tools should be provided as methods on an existing object. This isn't just an abstract artisanal concern; it's also true that having methods that call your callbacks is a more flexible solution, as you can have your callbacks called multiple times. So you start thinking "how would I implement methods to provide these callbacks?" and you pretty quickly get to using |
|
@bjackman @derkling, how do you feel about an API that looks like:

Internally, trappy can then go over the dataframes and do whatever it wants with them, like join them, and then apply the callbacks passed in. Do you think that's something you're OK with? |
Yeah, I think if we were to have a callback API, that would be a sensible design. But the reason I brought it up is that if it's just a thin wrapper around DataFrame operations, it's not much use - although it might be useful for encapsulating some awkwardness that's necessary to make it fast. |
Instead of processing the line at create_dataframe time, move it up to the append stage; this avoids passing "==>" to the callbacks. No change in performance or functionality for existing users. Signed-off-by: Joel Fernandes <[email protected]>
Normalize the time earlier than before; this is needed to make sure normalized time is passed to callbacks, and it also removes the need to reset the index after dataframe generation, thus improving performance for the regular cases. Signed-off-by: Joel Fernandes <[email protected]>
|
Posted some more patches to rearrange things and improve the callbacks. I tried converting the data_dict to a pandas.Series and passing that to the callback, but parsing my 100MB trace for latency analysis went from 16s to 1m 42s, so I am leaving it in data_dict form for now. |
|
I think I broke a test that expects the normalize_time function, which I removed for lack of need/users. Since this is still an RFC, I'll update it a bit later. |
|
One of the drawbacks of using df joins instead of the method I am proposing is that the join messes up time correlation when several events have the same timestamp. In that case it's not possible to know in which order the events were received on the same CPU, which can lead to incorrect analysis. Whereas when iterating over a trace file, as my patch does, you go over the events in the same order as they were recorded on the local CPU, even if their timestamps are identical.

Here's what happens after a join of the sched_switch and sched_wakeup dfs: you essentially lose all information about whether switches happened first or wakeups did.

time    wake-pid    switch-in-pid    switch-out-pid

Also, I am now down to 45 seconds and I can't get it any lower (barring the obvious correctness issues). For this and the above reasons, I feel joining is simply inadequate, not scalable and just plain wrong, so I am still falling back to my proposal for event callbacks. It's also confusing in the code: when you do a join, each row holds columns from events that have nothing to do with each other, which leads to confusing logic. |
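A small pandas sketch of the ordering loss described above (toy data, not TRAPpy output):

import pandas as pd

# A wakeup and a switch recorded at exactly the same timestamp on one CPU.
wakeup = pd.DataFrame({"wake_pid": [100]}, index=[3210.843141])
switch = pd.DataFrame({"switch_in_pid": [100]}, index=[3210.843141])

joined = wakeup.join(switch, how="outer")
# The join collapses both events into one row keyed by time only; whether
# the wakeup or the switch came first in the trace file is no longer known.
print(joined)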
|
I tried a multi-level index (where I store the line number along with the time) and used both as a combined index for the df. While this solves the problem of events with the same timestamp, '.apply' has some serious performance issues with multi-level indexes. Here are the lengths of the data frames:

I think I and others have spent enough time trying to get dfs to do what the event callbacks do, and it's clear to me at least that there are enough disadvantages with df apply that event callbacks are a good idea and there's a need for them. |
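For reference, a sketch of the multi-level index experiment mentioned above; __line_num is a hypothetical column holding the trace-file line number:

import pandas as pd

# Toy event frame with duplicate timestamps.
df = pd.DataFrame({"Time": [1.0, 1.0, 2.0],
                   "__line_num": [10, 11, 12],
                   "next_pid": [5, 6, 7]})

# Time + line number as a combined index keeps duplicate timestamps distinct
# and in file order, but .apply() over a MultiIndex was reported to be slow.
df = df.set_index(["Time", "__line_num"]).sort_index()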
|
Hi, so I got it down to 33s using just join! But for that I had to do the following:
|
|
With #247, I am down from 37s to 24s in my use case (when I process traces for the second time; the first time is always slow). Almost all of the time is spent in .apply(). This is again with single-level indexing. |
|
Yep, I think using time as a separate column is valid. I wonder if in this particular case we can just use a priori knowledge about the nature of

However, I think your solution of using the line number as the index and making time a column is better - I think it's more generic and simpler. There's no way we can make line-number-as-index the default format for TRAPpy's DataFrames; that's just too fundamental a change. But I think it could be worth having some abstraction to solve this kind of issue, i.e. something like this callback API, implemented using the |
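A minimal sketch of the line-number-as-index layout being discussed (again, __line_num is a hypothetical column name):

import pandas as pd

# Toy per-event frames; __line_num records each event's position in the file.
wakeup = pd.DataFrame({"Time": [1.0], "wake_pid": [100], "__line_num": [41]})
switch = pd.DataFrame({"Time": [1.0], "next_pid": [100], "__line_num": [42]})

# Index by line number and keep Time as an ordinary column: concatenating
# and sorting by the index reproduces trace-file order even when the
# timestamps collide.
merged = pd.concat([wakeup.set_index("__line_num"),
                    switch.set_index("__line_num")]).sort_index()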
|
So @sinkap , what do you think? ;-) |
|
So, I remember handling such a situation before: Line 29 in 5d488a4
Where we simply add a very small value to the conflicting index. The problem with this approach, and with using the line number as the index as well, is: what if the sequence is not correctly generated and we end up getting

We should probably have an interface to allow the user to specify a way to resolve conflicts, e.g. a list of events in chronological order, and we could default to using line numbers (or handle_duplicate_index). |
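For reference, a minimal sketch of the epsilon idea (not the actual handle_duplicate_index implementation referenced above):

def dedup_index(times, delta=1e-6):
    # Nudge each repeated timestamp by a tiny delta so the index becomes
    # unique, at the cost of no longer matching the raw trace exactly.
    counts = {}
    result = []
    for t in times:
        n = counts.get(t, 0)
        counts[t] = n + 1
        result.append(t + n * delta)
    return result

dedup_index([1.0, 1.0, 1.0, 2.0])  # [1.0, 1.000001, 1.000002, 2.0]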
|
Ok. To begin with, can we add a __line_num column to every event data frame? That will be needed anyway to do the default line number resolving you're suggesting. |
|
Also, how would the ordering problem apply to line numbers too? Maybe you're talking about some other line number? The line number I mentioned is incremented for every line in the trace during parsing, so it's guaranteed to be unique. |
|
I am not saying it won't be unique. What I am saying is that if the timing ends up colliding and we get a

For example

We do resolve the problem of index collision but not of semantically correct ordering. |
|
and yes we can add |
|
Ok, can we do that as a next step, and first add line numbers to all data frames? That is the most common case; without it you literally get duplication of events during joins of event dfs. |
|
Sorry, I missed your last comment; just refreshed. Can you explain more about the epsilon adding? I didn't get that. |
|
Have a look at Line 29 in 5d488a4
Why not just use this to handle index duplication? |
|
Oh, I think you were referring to your earlier comment about adding deltas to resolve the time. Actually, what I don't like about that is that it kind of screws up the timestamp as reported by the trace, in the sense that it's not what's in the trace anymore, right? I would like to preserve the trace data as much as possible. I do like the idea of users resolving conflicts themselves, so they could probably pass a lambda function if they really want to resolve it their way - like by adding epsilons and whatnot. |
|
Regarding only using that index duplication stuff in the commit you pointed at instead of line numbers, consider the case where you have multiple different events on the same CPU that have the same timestamp. Now since you separate them out during parsing you lose the trace ordering completely. Then when you join them later you cannot reconstruct the original ordering because you lost that info no matter what index resolution algorithm you choose. But with line numbers added to the dfs, you preserve it and order correctly on join. |
Can this actually happen? |
|
We should consider that |
|
About Patrick's point, I don't think ftrace does any intelligent merging in the unlikely event of timestamps on different CPUs colliding. In that case no post-processing index resolution method can correct the ordering, because the ordering information is lost, so I feel it's beyond the scope of this discussion (it's more of an ftrace issue than a trappy one). Where line numbers will help is timestamps colliding on the same CPU, which we should address first. |
|
So, just summarizing the email discussion to list the cons of DF apply vs. event callbacks (this PR). The following are the disadvantages of doing an analysis with DF join+apply rather than the event callback approach I initially posted in this PR:
It would be nice if we could all agree on an approach by next week, since it has been a while since this was posted (I would even suggest we get on a call next week with @sinkap and @JaviMerino as well and finalize which direction to go). |
|
Also, one more comment: since #243 is also going in the direction of adding callbacks to events for parsing, this could help that PR. |
|
So I tried something like the following on a 100MB systrace, but the trace-processing time went up from 33s to 1 minute. This does solve problems 1, 2 and 3 in the last comment; however, the apply time doubles. Any ideas on how to speed it up? This is for measuring wakeup latencies:
Do you mean you tried it with a |
|
No, with join and apply it was much faster. But I was wondering how fast it would be if, in the future, we didn't want to use join and apply for the reasons mentioned above. Also, I think your example wasn't working for my particular trace, as it wasn't handling all corner cases (I'm not sure whether join and apply will work there, but I will give it a look soon). |
Force-pushed from 6f40d93 to a84dd23
This series adds event callback support to trappy so that trappy users can register callbacks for trace events, which can be used to drive a state machine to do some analysis.
Example analysis script using this is here:
https://github.com/joelagnel/lisa/blob/40047eab82ed43152ef9d587ba89c5145cdb5d5b/get_wlat_cbs.py
(Disclaimer: the example script in the above link is very WIP and needs to be cleaned up and polished properly.)