
Conversation

@joelagnel joelagnel commented Apr 23, 2017

This series adds event callback support to trappy so that users can register callbacks for trace events; these callbacks can then drive a state machine to do some analysis.

Example analysis script using this is here:
https://github.com/joelagnel/lisa/blob/40047eab82ed43152ef9d587ba89c5145cdb5d5b/get_wlat_cbs.py
(Disclaimer: the example script in the above link is very much a WIP and needs to be cleaned up and polished.)

Joel Fernandes added 4 commits April 23, 2017 04:01
This patch reduces the frequency of calling getrusage, resulting in a ~3-5%
performance improvement in parsing. We check once at the beginning and then
once every 10000 events.

Signed-off-by: Joel Fernandes <[email protected]>
In preparation for introducing event callbacks, split out string
parsing into a separate function.

Signed-off-by: Joel Fernandes <[email protected]>
Event callbacks are called per event depending on the trace class; each callback
is passed the event's timestamp and a dict of all parsed fields.

Signed-off-by: Joel Fernandes <[email protected]>
For event-callback-driven scripts, it's useful to skip building data frames
that aren't going to be used anyway. Add a parameter to skip building data
frames; this saves ~30% of the time an event-callback-based script takes.

Signed-off-by: Joel Fernandes <[email protected]>
@joelagnel joelagnel changed the title trappy: Improve performance of parser [RFC] trappy: add support for event callbacks Apr 23, 2017
Contributor

@bjackman bjackman left a comment

Haven't properly read your example notebook, so I might be misunderstanding this - is this basically doing the same thing as parsing a trace into a DataFrame and then using DataFrame.apply? Is the idea to avoid building the DataFrame for performance reasons? Is it parsing the trace that takes too long? Because if so, it may be preferable to speed up trace parsing rather than adding a new API to get around perf shortcomings.

# to be getting close to the system's limit, check it only once
# in the beginning and then every CHECK_MEM_COUNT events
check_memory_count -= 1
if check_memory_count != 0:
Contributor

@bjackman bjackman Apr 24, 2017

I find this confusing - IMO this would be much clearer (no continue):

        if check_memory_usage and check_memory_count == 0:
            kb_used = getrusage(RUSAGE_SELF).ru_maxrss - starting_maxrss
            if kb_free and kb_used > kb_free * 0.9:
                warnings.warn("TRAPpy: Appear to be low on memory. "
                              "If errors arise, try providing more RAM")
                check_memory_usage = False
            check_memory_count = CHECK_MEM_COUNT

        yield data_dict
        yield data_dict

Not sure if this is just me..

Contributor

PS this getrusage stuff was originally added (by me) to get around the fact that Pandas was segfaulting, instead of raising a proper Python error, when it ran out of memory. It might be worth checking that the latest Pandas still has that problem; if it doesn't maybe we can drop this stuff entirely as it's pretty ugly.

Author

Brendan, the memory limit has never been a problem for me, even for large data frames. I don't mind ripping it out altogether unless anyone else has an objection?

Not sure how much RAM you had, but why not rip it out yourself and see if you still hit it with a new pandas? I agree it's pretty ugly.

Contributor

Sure, I'll give it a try (I was using a VM with 1 GB of memory).

Contributor

OK, I tested it and pandas still segfaults when you run out of memory, so we'll need to keep this hack.

@joelagnel
Author

joelagnel commented Apr 25, 2017

Hi Brendan, regarding your comment "Is the idea to avoid building the DataFrame for performance reasons? Is it parsing the trace that takes too long?".

Performance is part of the reason, but it's also impossible to use data frames in all situations. The data frames are built per event, so you have one df per event, e.g. one for sched_switch and one for sched_wakeup. In order to build an analysis script that drives a state machine, .apply() will not solve the problem, because you would .apply over the first data frame (which corresponds to the first event type), and then .apply over the second data frame for the next event type.

However, what is needed is something that processes events as a function of time. So, for example, if the events are:

Time   event
1       switch
2       wakeup
3       switch

Then, using .apply, the program will perceive:
switch_df.apply(..)
wake_df.apply(..)

As:
apply(1)
apply(3)
apply(2)

Instead what is needed is:
apply(1)
apply(2)
apply(3)

I don't see how this can be done using data frames, and even if it can, it will still not be as fast. Using data frames in the WIP script [1], I found it is 3 times slower than [2].
I don't think we should make data frames a rule if something can be solved in a different, cleaner, faster way. Also, citing the above example, I am not convinced that joins and other data frame operations can solve all analysis problems, but I'd love to be proved wrong!

As such, the API I am proposing is clean, so I don't see an issue with it if someone doesn't want to use data frames for a particular use case, as I don't.
[1] https://github.com/joelagnel/lisa/blob/wake-lat-script/get_wlat.py
[2] https://github.com/joelagnel/lisa/blob/wake-lat-script/get_wlat_cbs.py
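
To make the intent concrete, here is a minimal sketch of the kind of state machine these callbacks are meant to drive. The (time, data_dict) callback signature is what this series passes; the registration call and field names like 'next_pid' are illustrative assumptions, not a settled API:

# A minimal wakeup-latency state machine driven by time-ordered callbacks.
wakeup_time = {}   # pid -> time of the most recent sched_wakeup
latencies = []

def wake_cb(time, data):
    wakeup_time[data['pid']] = time

def switch_cb(time, data):
    pid = data['next_pid']        # assumed field name
    if pid in wakeup_time:
        latencies.append(time - wakeup_time.pop(pid))

# Hypothetical registration; the callbacks then fire in trace order
# while the trace is being parsed:
# trappy.FTrace("trace.txt", events=["sched_wakeup", "sched_switch"],
#               event_callbacks={"sched_wakeup": wake_cb,
#                                "sched_switch": switch_cb})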

@bjackman
Contributor

bjackman commented Apr 25, 2017

Ah, I see. I think I will have to take some time to get to grips with your use case and how it compares to the latency analysis stuff in LISA.

It can take a bit of experimentation, but Pandas can be really fast when you find the right way of expressing your operations. The API addition could definitely make sense but first I think we should be really certain that it's necessary.

@derkling
Contributor

An alternative and more Pandas-friendly way to perform that apply "in order" is to "join" the two DFs and operate on the new one, which is something I did here:
https://github.com/ARM-software/lisa/blob/master/libs/utils/analysis/latency_analysis.py#L53

BTW, that API returns the latency DF which, AFAIU, is exactly what your example above suggests.
IOW, if you are after latency analysis support, perhaps we already have what you need in LISA.

This notebook:
https://github.com/ARM-software/lisa/blob/master/ipynb/examples/trace_analysis/TraceAnalysis_TasksLatencies.ipynb
shows how to get DF, stats and plots for both wakeup and preemption latencies.
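
For reference, a minimal sketch of that join-based approach on toy data (the column names are assumed for illustration, not LISA's actual ones):

import pandas as pd

# Toy stand-ins for trappy's per-event DataFrames, both indexed by time.
wake_df = pd.DataFrame({"pid": [5176, 5263]}, index=[0.2, 0.4])
switch_df = pd.DataFrame({"next_pid": [5176]}, index=[0.5])

# An outer join on the time index interleaves both events in time order;
# prefixes keep colliding column names apart, and NaN marks absent fields.
joined = (wake_df.add_prefix("wake_")
                 .join(switch_df.add_prefix("switch_"), how="outer"))

def process_row(row):
    return row.notna().sum()   # the user's state-machine step would go here

joined.apply(process_row, axis=1)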

@joelagnel
Author

It's totally possible to do this with join and sorting, but this trappy feature is useful for analysis cases where you cannot join, for example, and still need to process events in a time-correlated fashion (and also do it fast).

Contributor

@derkling derkling left a comment

Wondering if this feature can be extended to also cover the use case I'm trying to address in #243.

if not self.callback:
return
data_dict = self.generate_data_dict(comm, pid, cpu, data)
self.callback(time, data_dict)
Contributor

Why not make the callback even more generic by passing back the exact signature used by generate_data_dict?

I'm asking that for two main reasons:

  1. generate_data_dict can still be used from the callback if required
  2. I have a WIP series which aims at adding parsing for non-key-value-formatted events, for example the events injected from user space by the Android runtime. Here is an example of these events:
surfaceflinger-543   (  543) [002] ...1  3210.843141: tracing_mark_write: B|543|query
surfaceflinger-543   (  543) [002] ...1  3210.843145: tracing_mark_write: E
surfaceflinger-543   (  543) [002] ...1  3210.843149: tracing_mark_write: B|543|updateTexImage
surfaceflinger-543   (  543) [002] ...1  3210.843168: tracing_mark_write: B|543|acquireBuffer
surfaceflinger-543   (  543) [002] ...1  3210.843178: tracing_mark_write: B|543|com.prefabulated.touchlatency/com.prefabulated.touchlatency.TouchLatencyActivity#0: 1
surfaceflinger-543   (  543) [002] ...1  3210.843180: tracing_mark_write: E
surfaceflinger-543   (  543) [002] ...1  3210.843187: tracing_mark_write: C|543|com.prefabulated.touchlatency/com.prefabulated.touchlatency.TouchLatencyActivity#0|0
surfaceflinger-543   (  543) [002] ...1  3210.843208: tracing_mark_write: E

As you can see, the data portion of these events cannot be parsed by generate_data_dict.

Author

I didn't follow how you want this done; can you give an example?

I think if this is to be made generic, it should be done internally to trappy. The user's callback implementation shouldn't have any business parsing strings, in my opinion; trappy should parse them in a generic way and pass some generic data type to the callback. So for your systrace example, it should pass the callback something like { '__comm': 'surfaceflinger', '__pid': 543, 'mark': 'begin', 'name': 'query' }.
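
For illustration, a rough sketch of how trappy could turn those tracing_mark_write payloads into such a dict. The B|pid|name / E / C|pid|name|value convention is Android's atrace format; the output keys just follow the example above and are not an agreed interface:

# Android atrace payloads: B|pid|name (begin), E (end), C|pid|name|value (counter).
def parse_mark(payload):
    if payload.startswith("B|"):
        _, pid, name = payload.split("|", 2)
        return {"mark": "begin", "__pid": int(pid), "name": name}
    if payload.startswith("E"):
        return {"mark": "end"}
    if payload.startswith("C|"):
        _, pid, name, value = payload.split("|", 3)
        return {"mark": "counter", "__pid": int(pid), "name": name, "value": int(value)}
    return None   # unrecognised payload

parse_mark("B|543|query")   # {'mark': 'begin', '__pid': 543, 'name': 'query'}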

@sinkap
Collaborator

sinkap commented Apr 26, 2017

I assigned myself to this PR since it seems like an interesting feature; I will take a look at it later in the day.

@bjackman
Contributor

bjackman commented Apr 26, 2017

I took a look through your example scripts and got up to speed on LISA's latency_analysis module; they do the same thing. Have you tried LISA's analysis module to see how it performs on the traces you're looking at?

Its totally possible to do this with join and sorting, but this trappy feature is useful for cases of analysis where you cannot join for example and still need to process in a time-correlated fashion (and also do it fast)

I can't think of any use cases where you can't solve it using joins. Also, I can't see how the callbacks can (necessarily) be faster than using DataFrame.apply - the only thing that adds overhead is constructing the joined DataFrame, which can be done pretty fast.


I'm not a big fan of the fact that the callbacks can only happen when you construct the trace object. I feel the constructor should really be about building & collating data structures, and rich analysis tools should be provided as methods on an existing object. This isn't just an abstract, artisanal concern; having methods that call your callbacks is also a more flexible solution, as you can have your callbacks called multiple times.

So... you start thinking "how would I implement methods to provide these callbacks?" and you pretty quickly get to using DataFrame joins and applies...

@joelagnel
Author

joelagnel commented Apr 28, 2017

@bjackman @derkling, how do you feel about an API that looks like:
ftrace_object.apply(callbacks = { 'sched_switch': switch_cb, 'sched_wakeup': wake_cb })

Internally, trappy can then go over the dataframes and do whatever it wants with them, like joining them, and then apply the callbacks passed in. Is that something you're OK with?
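
In usage this would look something like the following (a hypothetical sketch: trace.apply with callbacks is the proposal above, not an existing trappy API, and the (time, data) callback signature is assumed from this series):

import trappy

def switch_cb(time, data):
    ...   # state-machine step for sched_switch

def wake_cb(time, data):
    ...   # state-machine step for sched_wakeup

trace = trappy.FTrace("trace.txt")   # parse and build the per-event dfs as usual
# Proposed: trappy orders/joins the events internally, then dispatches each
# one to the matching callback in time order.
trace.apply(callbacks={"sched_switch": switch_cb, "sched_wakeup": wake_cb})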

@bjackman
Contributor

bjackman commented Apr 28, 2017

@bjackman @derkling, how do you feel about an API that looks like:
ftrace_object.apply(callbacks = { 'sched_switch': switch_cb, 'sched_wakeup': wake_cb })

Internally, trappy can then go over the dataframes and do whatever it wants with them, like joining them, and then apply the callbacks passed in. Is that something you're OK with?

Yeah, I think if we were to have a callback API, that would be a sensible design. But the reason I brought it up is that if it's just a thin wrapper around DataFrame operations, it's not much use - although it might be useful for encapsulating some awkwardness that's necessary to make it fast.

Joel Fernandes added 2 commits April 28, 2017 21:59
Instead of processing the line at create_dataframe time, move it up to
during the append; this avoids passing "==>" to the callbacks. No change
in performance or functionality for existing users.

Signed-off-by: Joel Fernandes <[email protected]>
Normalize the time earlier than before. This is needed to make sure
normalized time is passed to callbacks, and it also removes the need
to reset the index after dataframe generation, thus improving perf
for the regular cases.

Signed-off-by: Joel Fernandes <[email protected]>
@joelagnel
Author

Posted some more patches to rearrange things and improve the callbacks. I tried to convert the data_dict to a pandas.Series and pass that to the callback, but parsing my 100 MB trace for latency analysis went from 16 s to 1 m 42 s, so I am leaving it in data_dict form for now.
Also, about the join idea: I don't think that'll work well in cases where you have a lot of events. You would have N_events * (avg columns/event) columns in the final joined frame. Considering the slowness and the memory usage we've seen, I don't even want to go there.
At the moment, I think my implementation is the best solution we have performance-wise, and it doesn't affect functionality for folks not using this stuff, so I think it's worthwhile to consider. The modifications I made are also simple and non-invasive. I will let you experts decide ;-)

@joelagnel
Author

I think I broke a test that expects the normalize_time function, which I removed for lack of need/users. Since this is still an RFC, I'll update it a bit later.

@joelagnel
Author

joelagnel commented May 14, 2017

One of the drawbacks of using df joins instead of the method I am proposing is that the join messes up time correlation when several events have the same timestamp. In this case it's not possible to know the order in which the events were received on the same CPU. This can lead to incorrect analysis...

Whereas when iterating over a trace file, as my patch does, you go over the events in the same order as they were recorded on the local CPU, even when events on that CPU share a timestamp.

Here's what happens after a join of the sched_switch and sched_wakeup dfs. You essentially lose all information about whether the switches or the wakeups happened first.
In fact, after the join, the df incorrectly shows that the switch (pid 2700 -> 701) happened twice, whereas in the trace it happened just once.

time       wake-pid   switch-in-pid   switch-out-pid
0.557056   5176.0     701.0           2700.0
0.557056   5263.0     701.0           2700.0

Also, I am now down to 45 seconds and I can't get it any lower (barring the obvious correctness issues). For these reasons, I feel joining is simply inadequate, not scalable, and just plain wrong, so I am still falling back to my proposal for event callbacks. The code is also confusing: when you do a join, each row contains columns from events that have nothing to do with each other, which leads to confusing logic.

@joelagnel
Author

joelagnel commented May 14, 2017

I tried a multi-level index (where I store the line number along with the time) and used both as a combined index for the df. While this solves the problem of events with the same timestamp, .apply has some serious performance issues with multi-level indexes. Even when my .apply function just returns and does nothing, .apply still takes forever:
2017-05-14 18:16:17 <- start
2017-05-14 18:16:38 <- parse trace file and build df (21 s)
2017-05-14 18:16:39 <- join (1 s)
2017-05-14 18:17:22 <- .apply f() which just does return (43 s!)

Here are the lengths of the data frames:

len(wake)        # 164045
len(switch_in)   # 270216
len(df)          # 434261 (joined)
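
A toy illustration of that experiment (names assumed): a (Time, __line) MultiIndex is unique even when timestamps collide, but row-wise .apply over it was very slow:

import pandas as pd

df = pd.DataFrame({
    "Time": [0.557056, 0.557056],
    "__line": [13, 14],
    "pid": [5176, 5263],
}).set_index(["Time", "__line"])     # unique despite the duplicate timestamps

df.apply(lambda row: None, axis=1)   # a no-op apply; this step took ~43 s on 434k rows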

I think I and others have spent enough time trying to get dfs to do what the event callbacks do, and it's clear to me at least that df apply has enough disadvantages that event callbacks are a good idea and there's a need for them.

@joelagnel
Author

joelagnel commented May 15, 2017

Hi, so I got it down to 33 s using just join! But for that I had to do the following:

  • Instead of a multi-level index with time + line#, I just used a single-level index with line#. That solves the problem with duplicates that I had and makes .apply fast. I added a separate column for time in base.py since it isn't the index anymore (see the sketch below). Also, normalize broke, but that can probably be fixed easily later.
    Total time: 16 s
    Analysis (.apply): 17 s
    Do you think it's OK to make 'time' a column instead of an index, since it looks like it isn't unique anyway? Let me know if that's OK to do (I know other parts or users may need updates, so it might be a bit painful, but maybe those users should access time using a more generic/abstracted API anyway).
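
A sketch of that layout (names assumed; trappy's real parser differs):

import pandas as pd

# Line number as the (unique) index; time demoted to an ordinary column.
df = pd.DataFrame({
    "__line": [13, 14],
    "Time": [0.557056, 0.557056],   # duplicate timestamps are now harmless
    "event": ["sched_switch", "sched_wakeup"],
}).set_index("__line")

df.loc[13, "Time"]   # time is still accessible, just no longer the index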

@joelagnel
Author

With #247, I am down from 37 s to 24 s in my use case (when I process traces for the second time; the first time is always slow). Almost all of the time is spent in .apply(). This is again with single-level indexing.

@bjackman
Contributor

bjackman commented May 15, 2017

Yep, I think using time as a separate column is valid. I wonder if in this particular case we can just use a priori knowledge about the nature of sched_switch and sched_wakeup to solve the issue? I.e. if we see a task waking up and switching in with the same timestamp, we know that the wakeup happened first.

However I think your solution of using the line number as index and making time a column is better - I think it's more generic and simpler.

There's no way we can make line-number-as-index the default format for TRAPpy's DataFrames; that's just too fundamental a change. But I think it could be worth having some abstraction to solve this kind of issue, i.e. something like this callback API, implemented using the join and apply with a line-number index. Or perhaps even just an API to get a DataFrame of combined events with a line-number index (i.e. just the join part). I'm interested to get @sinkap's input on this...

@joelagnel
Author

So @sinkap , what do you think? ;-)

@sinkap
Collaborator

sinkap commented May 29, 2017

So, I remember handling such a situation before: see handle_duplicate_index(), where we simply add a very small value to the conflicting index. The problem with this approach, and with using the line number as the index as well, is: what if the sequence is not correctly generated and we end up getting sched_switch events before the sched_wakeup events? This would break some correctness assumptions for any behavioural code.

We should probably have an interface to allow the user to specify a way to resolve conflicts, e.g. a list of events in chronological order, and we could default to using line numbers (or handle_duplicate_index).

@joelagnel
Author

OK. To begin with, can we add a __line_num column to every event data frame? That will be needed anyway for the default line-number resolution you're suggesting.

@joelagnel
Author

Also, how would the ordering problem apply to line numbers? Maybe you're talking about some other line number? The line number I mentioned is incremented for every line in the trace during parsing, so it's guaranteed to be unique.

@sinkap
Collaborator

sinkap commented May 30, 2017

I am not saying it won't be unique. What I am saying is that if the timestamps end up colliding and we get a sched_switch on a line before the sched_wakeup, we could get semantically incorrect data, which is why we should provide some way for the user to specify a conflict resolution.

For example

LineNum   TimeStamp   event          pid
13        000012      sched_switch   8086
14        000012      sched_wakeup   8086

We do resolve the problem of index collision but not of semantically correct ordering.

@sinkap
Collaborator

sinkap commented May 30, 2017

And yes, we can add __line_num, but remember we would still get collisions on a graph when plotted against time (e.g. on a timeline). Would it not be better to add an epsilon to the timestamp as well?

@joelagnel
Author

OK, can we do that as a next step and first add line numbers to all data frames? This is the most common case. Without this, you literally get duplication of events during joins of event dfs.

@joelagnel
Author

Sorry, I missed your last comment; just refreshed. Can you explain more about the epsilon adding? I didn't get that.

@sinkap
Collaborator

sinkap commented May 30, 2017

Have a look at handle_duplicate_index().

Why not just use this to handle index duplication?
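
A rough re-sketch of the idea behind handle_duplicate_index (not trappy's actual implementation): nudge repeated index values by multiples of a tiny epsilon so the index becomes unique:

import pandas as pd

def dedup_index(df, epsilon=1e-6):
    idx = pd.Series(df.index)
    # 0, eps, 2*eps, ... for each run of duplicated index values
    offset = idx.groupby(idx).cumcount() * epsilon
    out = df.copy()
    out.index = idx + offset
    return out

df = pd.DataFrame({"pid": [5176, 5263]}, index=[0.557056, 0.557056])
dedup_index(df).index   # 0.557056, 0.557057 - unique, but no longer the raw trace times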

@joelagnel
Author

Oh, I think you were referring to your earlier comment about adding deltas to resolve the time. Actually, what I don't like about that is that it kind of screws up the timestamp as reported by the trace, in the sense that it's not what's in the trace anymore, right? I would like to preserve the trace data as much as possible. I do like the idea of users resolving conflicts themselves, though - they could pass a lambda function if they really want to resolve it their way, like by adding epsilons and whatnot.


@joelagnel
Author

Regarding using only that index-duplication handling from the commit you pointed at, instead of line numbers: consider the case where you have multiple different events on the same CPU with the same timestamp. Since you separate them out during parsing, you lose the trace ordering completely. Then, when you join them later, you cannot reconstruct the original ordering, because you lost that info, no matter which index-resolution algorithm you choose. But with line numbers added to the dfs, you preserve it and order correctly on join.
Line numbers provide hidden information about ordering on the same CPU, so they're kind of a timestamp in some sense.

@bjackman
Contributor

@sinkap

What I am saying is that if the timestamps end up colliding and we get a sched_switch on a line before the sched_wakeup, we could get semantically incorrect data

Can this actually happen?

@derkling
Contributor

derkling commented May 30, 2017

We should consider that the sched_switch and sched_wakeup events for a given task can be generated on different CPUs, right?
In that case we could expect a time delta between the clocks of the two CPUs, but I would say ftrace takes care of timing differences... What I'm not sure about is whether ftrace can somehow force an ordering when it merges events from different CPUs' buffers that appear to have the same timestamp... in that case it's probably just appending events in per-CPU ID order... :-/

@joelagnel
Author

joelagnel commented May 31, 2017

About Patrick's point: I don't think ftrace does any intelligent merging in the unlikely event of timestamps colliding across different CPUs. In that case no post-processing index-resolution method can correct the ordering, because the ordering information is lost, so I feel it's beyond the scope of this discussion (it's more of an ftrace issue than a trappy one). Where line numbers will help is timestamps colliding on the same CPU, which we should address first.

@joelagnel
Author

So, just summarizing the email discussion to list the cons of DF join+apply vs. event callbacks (this PR):

The following are the disadvantages of doing an analysis with DF join+apply instead of the event-callback approach I initially posted in this PR:

  1. You need to specify which columns to filter before doing the joins.
  2. Prefixes need to be added to colliding columns just to keep the join happy.
  3. Columns not relevant to an event still appear during the callback, although with NaN values. This is a waste of memory, I feel, especially when doing apply on more than 2 events.
  4. Speed - when you have more than 2 events it is slow, as we've seen (the cache PR should improve things and is still good to go, but it's still just working around the real problem). Also, you will have NR_COLUMNS_PER_EVENT * NR_EVENTS total columns in the final join, which is a waste of memory.
  5. Line numbers need to be used to prevent timestamp collisions, as mentioned in the last few comments.

It would be nice if we could all agree on an approach by next week, since it has been a while since this was posted (I would even suggest we get on a call next week with @sinkap and @JaviMerino as well and finalize which direction to go in).

@joelagnel
Author

Also, one more comment: since #243 is also going in the direction of adding callbacks to events for parsing, this could help that PR.

@joelagnel
Author

joelagnel commented Jun 10, 2017

So I tried something like the following on a 100 MB systrace, but the trace time went up from 33 s to 1 minute. This does solve problems 1, 2 and 3 in the last comment; however, the apply time doubles. Any ideas on how to speed it up? This is for measuring wakeup latencies:

# Merge-iterate the two per-event DataFrames in trace (line-number) order.
sit = switch_in.iterrows()
wit = wake.iterrows()

s = next(sit, None)   # each item is an (index, row) tuple
w = next(wit, None)

while s is not None and w is not None:
        if s[1]['__line'] < w[1]['__line']:
                switch_cb(s)
                s = next(sit, None)
        else:
                wake_cb(w)
                w = next(wit, None)

# Drain whichever iterator still has events left.
while s is not None:
        switch_cb(s)
        s = next(sit, None)

while w is not None:
        wake_cb(w)
        w = next(wit, None)

@bjackman
Contributor

This does solve problems 1, 2 and 3 in the last comment; however, the apply time doubles.

Do you mean you tried it with a join and apply like in my example, and it took twice as long?

@joelagnel
Author

No, with join and apply it was much faster. But I was wondering how fast it would be if, in the future, we didn't want to use join and apply for the reasons mentioned above. Also, I think your example wasn't working for my particular trace, as it wasn't handling all the corner cases (I'm not sure whether join and apply will work there, but I will take a look soon).

@sinkap sinkap force-pushed the master branch 3 times, most recently from 6f40d93 to a84dd23 Compare June 14, 2017 21:41
@joelagnel joelagnel closed this Jul 8, 2017