Conversation
Signed-off-by: Tom Bentley <tbentley@redhat.com>
> * Similarly events covering the following API Keys: `CREATE_DELEGATION_TOKEN`, `RENEW_DELEGATION_TOKEN`, `EXPIRE_DELEGATION_TOKEN`, `DESCRIBE_DELEGATION_TOKEN`
> * `ClientClose` — Emitted when a client connection is closed (whether client or proxy initiated)
> * `BrokerClose` — Emitted when a broker connection is closed (whether broker or proxy initiated).
What about KMS events? Should we think about how those would be modelled?
Simply knowing that a KEK has been used at least once seems to be good enough for answering questions like:
- "How many KEKs does the proxy use?"
- "What KEKs have been used by the proxy (over the last N days)?"
- "Has the proxy used this key which we believe has been accidentally disclosed?"
More broadly, this is "Can plugins generate security-relevant events?". Probably.
In any case, I'm inclined not to specify such events right now, but to aim for a way for plugins to be able to publish security events of their own. That way we can roll out support for better audit logging piecemeal, based on identified requirements, rather than go imagining all the things we think might be useful.
> Goals:
>
> * enable users to _easily_ collect a _complete_ log of security-related events
We should be clear that a complete log should include both the actions performed by the Kafka client and any (async) operations caused by the filters themselves.
This is a good point.
I suppose for the purpose of being able to correlate with broker logs it would be better to know that a certain request originated in the proxy rather than with a client accessing the proxy. The alternative, of not audit logging proxy-originated requests, would be confusing at best, and possibly indistinguishable from log tampering to someone who was looking at the broker logs closely enough.
It should be noted that there can be things like queries to Authorizers which should not be logged, because they're not an attempt to perform the action being queried (e.g. implementing `IncludeTopicAuthorizedOperations` in a Metadata request).
So the answer to the question of "what to log?" isn't always "everything". I think if we tried to make it "everything" we could end up in a mire of event modelling for the many edge cases which in theory someone might care about distinguishing from each other, but in practice someone or something has to analyse those logs and draw conclusions. The closer we model the complex and evolving reality, the harder it is for someone to draw the correct conclusions, and the more we end up being constrained by the API aspect of this proposal.
How to allow for logging of events within plugins? The Authorization plugin provides a great example. The runtime doesn't really know about Authorizers in a deep way (to it they're just plugins), but they're actually implementing logic which deserves specific audit logging. And ideally that logging would be consistent across Authorizer implementations (e.g. a Deny from the AclAuthorizer is the same as a Deny from an OpaAuthorizer).
One way to do this, I think, is for the Filter API to provide a method for logging an event. At the level of the Filter API we don't need to be prescriptive about what those events look like (we could just say `java.lang.Record`, so we know they're Jackson-serializable). We're just promising that they'll be emitted to the same places as the events generated natively by the runtime, and with the right attributes (like the event time, the sessionId and, I guess, the filterId). The Authorization filter would then take on responsibility for calling that method. Crucially, the event classes could be defined alongside the Authorizer API, which is how we'd end up with consistency of the event schema across different Authorizer impls.
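A minimal sketch of what that might look like. All names here (`SecurityEventSink`, `OperationDenied`, the field names) are illustrative assumptions, not the actual Kroxylicious API:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: the runtime promises only that filter-emitted events
// reach the same emitters as runtime-native events, with attributes like the
// event time, sessionId and filterId attached on the way through.
interface SecurityEventSink {
    void emit(Record event); // java.lang.Record, so events are Jackson-serializable
}

// Defined alongside the Authorizer API, so a Deny from an AclAuthorizer and a
// Deny from an OpaAuthorizer share exactly the same schema.
record OperationDenied(String op, String resourceType, String resourceName) {}

public class FilterEventSketch {
    public static void main(String[] args) {
        List<Record> captured = new ArrayList<>();
        SecurityEventSink sink = captured::add; // in-memory sink, for illustration only
        // An Authorization filter would call this when its Authorizer denies access:
        sink.emit(new OperationDenied("READ", "Topic", "my-topic"));
        System.out.println(captured.get(0));
    }
}
```

Keying the contract on `java.lang.Record` gives the runtime a known-serializable shape without being prescriptive about individual event schemas.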
> * enable users to _easily_ collect a _complete_ log of security-related events
> * for the security events to be structured and amenable to automated post-processing
> * for the security events to be an API of the project, with the same compatibility guarantees as other APIs
Filters can effectively rename entities in Kafka (e.g. map a topic or group name). It needs to be up to the user to decide which point(s) along the filter chain should be "tapped" for audit.
I've not yet described how any of this would work, but I think the most natural way for it to work for the events which arise from requests and responses is obviously to use a filter. Using that approach would allow the user to place it where in the chain they wished.
I wasn't suggesting you describe a solution in this section, just call out that it is something a proposed solution must handle.
> I wasn't suggesting you describe a solution in this section
I haven't described it in the document at all yet. Still cogitating...
> - `resourceType` — The type of the resource (e.g. `Topic`)
> - `resourceName` — The name of the resource (e.g. `my-topic`)
> * `Read` — Emitted when a client successfully reads records from a topic. It is called `Read` rather than `Fetch` because it covers reads generally, including the `ShareFetch` API key. It will be possible to disable these events, because of the potential for high volume.
> - `topicName` — The name of the topic.
What about the audit of client ids, group names and, possibly, transactional ids?
None of those pertain to the record data itself. I suppose a bad actor might try (and possibly succeed) to use the transactional id of some other service to cause a kind of denial-of-service attack by fencing off the legitimate producer. Likewise with groups: maybe Eve can prevent processing of some partitions by getting them assigned to her rogue app. But those things just seem a bit far-fetched, so I'm not super keen to go adding them up-front.
> None of those pertain to the record data itself.
Why aren't we considering events such as resetting a consumer group offset a security event? Causing a consumer to skip a record or fetch a record twice seems very interesting.
On the one hand you're right. Someone could use that as an attack vector in the right circumstances.
But I think there are lots of reasons not to go over-broad on what we're trying to cover:
- For offset commit... well, it doesn't look like a terribly strong signal of something security-related going on. Clients commit offsets all the time. Re-processing happens and is not unusual most of the time. The one thing I can think of which could be a bit more specific is fetching from the start of the log. A data exfiltration might look like that. But even that is quite weak: consumers don't have to store their offsets in Kafka at all, so such a check is easily evaded.
- The Kafka broker's logging already covers requests which get as far as the broker. All we need in the proxy is logging which allows correlation with that. This might be enough of a reason to scale back parts of this proposal.
- I think we could come up with a security angle for most RPCs. People expect systems to work and anything which makes them not work could hypothetically manifest, at least, as an attack on the availability of the system/DoS. So then we would end up with an audit log that is really more like a protocol trace.
- The more types of event you define the more API you're committing to. That inhibits our ability to evolve things in the future.
- If we were going to implement something in the proxy we should use normal logging for protocol tracing. It doesn't really need to be an API, as is proposed here.
- The more types of event we define, and the more data produced, the harder it is to analyse.
- If you research what sorts of events, and event categories, SIEM systems are interested in they're relatively coarse grained.
- We can always add more events in the future: We don't need to achieve perfect coverage in this proposal, so long as it's not too inflexible for the future.
> This might be enough of a reason to scale back parts of this proposal.
@k-wall I was thinking about what this would look like if we took the position of not logging all the details of requests and responses in the proxy, but taking the position that those should be logged on the broker cluster if you want that kind of depth. We would still log all the runtime-local things, like connections, authentications, authorizations and so on, as described in this proposal. I think if we did that we could model events like this:
- `RequestIngress` (from client)
- `RequestEgress` (to broker)
- `RequestInject` (originator is a filter)
- `RequestShortcircuit`
- `ResponseEgress` (to the client)
If we took that position then we'd only need to log the correlationId, sessionId (and maybe the API key) for `RequestEgress`, because you could recover what was sent by correlation with the broker's `kafka.request.logger` logger. We could reduce the scope of this proposal, because we'd not be ending up with a "higher level" API that, for example, was trying to have a single read event which covered Fetch and ShareFetch. This seems to me to be a better decomposition into event types than what I've proposed.
Aside: This starts to feel like OTel traces and spans. However, it doesn't seem to be compatible with OTel. OTel (i.e. app-level) "requests" would tend to correspond with Kafka records. But you can't meaningfully propagate an OTel context kept within records with the events above because records can be batched together, so there's no single "parent span".
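The decomposition described above could be modelled as plain records. A hedged sketch; the field names are assumptions, chosen only to show that `RequestEgress` can stay minimal when the broker's `kafka.request.logger` supplies the detail:

```java
import java.util.UUID;

// Illustrative only: the event names come from the comment above, the fields
// are assumptions. RequestEgress deliberately carries just enough to join
// against the broker's own kafka.request.logger output.
record RequestIngress(UUID sessionId, int correlationId, short apiKey) {}
record RequestInject(UUID sessionId, int correlationId, String filterId) {}
record RequestShortcircuit(UUID sessionId, int correlationId) {}
record RequestEgress(UUID sessionId, int correlationId, short apiKey) {}
record ResponseEgress(UUID sessionId, int correlationId) {}

public class LifecycleSketch {
    public static void main(String[] args) {
        UUID session = UUID.randomUUID();
        // apiKey 1 is Fetch in the Kafka protocol
        RequestEgress egress = new RequestEgress(session, 42, (short) 1);
        System.out.println("egress correlationId=" + egress.correlationId());
    }
}
```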
@tombentley thanks for getting this ball rolling.
> - `partition` — The index of the partition.
> - `offsets` — Offsets are included so that it's possible to record exactly which data has been read by a client.
> * `Write` — Emitted when a client successfully writes records to a topic. It is called `Write` rather than `Produce` for symmetry with `Read` (which also allows introduction of other produce-like APIs in the future). It will be possible to disable these events, because of the potential for high volume.
So if the Broker denies the produce (rather than the proxy), would there be an audit event emitted by the Proxy?
Hmm. This Write event is something that I think I dropped from the code I'm playing with locally. But I think the question stands, in a way. Perhaps we could re-word it: "To what extent should the proxy try to produce an audit log that covers the broker too?"
I think there are events, which are not authz related, which might be of interest from an audit PoV. Trying to capture those from a proxy would seem to be a lot of work, and I suspect there are some which are simply going to be invisible to a proxy because they're internal to the broker and don't result in anything visible that the proxy can use to decide what to log.
For this reason I preferred to argue that the broker has its own audit logging which should be used; really we're interested in producing a log which can be accurately correlated with the broker's (which involves correlating the connection and then correlating the requests and responses via the correlation id).
> The intent of offering this emitter is to provide a _simple_ way for users to set up basic alerting on security events, such as a sudden increase in the number of failed authentication attempts.
> A more detailed understanding would require consulting a log of security events obtained using one of the other emitters.
>
> ### Kafka emitter
I like the fact we are including this.
I'm struggling with this one.
It feels like it's opening up a can of worms, in terms of infinite writes and arbitrary bootstraps, credentials etc., but for little value.
My understanding is most downstream systems for audit logging are based around standard log shipping pipelines rather than arbitrary Kafka topics.
I wonder if we should be making the Emitter a plugin interface and allow people who need Kafka to write their own, or to allow them to support non-JSON formats.
> My understanding is most downstream systems for audit logging are based around standard log shipping pipelines rather than arbitrary Kafka topics.
I wasn't sure how accurate that was. Rather than cluttering this review with a chatbot back-and-forth I'll just link to it. But the summary is that, according to an AI, there would appear to be real value in emitting to Kafka.
> It feels like it's opening up a can of worms, in terms of infinite writes and arbitrary bootstraps, credentials etc., but for little value.
This will be a problem to be solved whether or not it's done via an Emitter API. Though with an API it's not necessarily our problem to solve. We should not assume that the events are necessarily being emitted to the same cluster as the one we're proxying. So I think we can achieve something useful even without solving this problem, albeit something with sharp edges.
One route to a solution might be to use an internal Producer with a UUID client id which we then detect when it connects via the proxy.
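That loop-avoidance idea might be sketched like this. This is an assumption, not a committed design: the internal audit Producer gets a unique `client.id`, and the proxy skips auditing any connection presenting it, which breaks the infinite-write loop:

```java
import java.util.UUID;

// Hypothetical sketch of the self-detection idea above. The audit emitter's
// Producer is configured with AUDIT_CLIENT_ID; when that same client.id shows
// up on a connection through the proxy, the proxy does not audit it, so the
// audit topic's own writes never generate further audit events.
public class AuditLoopGuard {
    static final String AUDIT_CLIENT_ID = "audit-emitter-" + UUID.randomUUID();

    static boolean shouldAudit(String connectionClientId) {
        return !AUDIT_CLIENT_ID.equals(connectionClientId);
    }

    public static void main(String[] args) {
        System.out.println(shouldAudit("my-app"));        // ordinary clients are audited
        System.out.println(shouldAudit(AUDIT_CLIENT_ID)); // the emitter's own traffic is not
    }
}
```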
SamBarker left a comment:
The scoped event model (proxy → session) is efficient to produce but requires stateful stream processing to interpret — a consumer must join session-scoped events against ProxyStartup (and potentially a VirtualClusterStartup, so that we could log SASL inspection vs termination) to reconstruct basic context like wall-clock time. Standard SIEM ingest pipelines perform lookup enrichment against static reference data; they cannot perform this kind of temporal join across a live event stream.
This has compliance implications. NIST SP 800-92 and PCI DSS Requirement 10.3 both require that each audit record independently contain the event's timestamp, origin, and identity context. An event whose timestamp can only be derived by joining against a prior event in the same stream does not satisfy this requirement without preprocessing.
An audit log that requires a stream processor to be interpretable also has a more fundamental problem: who audits the stream processor? The integrity of the audit trail becomes dependent on the correctness of the join logic, which is an unappealing property for a security primitive.
> Session-scoped events all have at least the following attributes:
> - `processAgeMicros` — The time of the event, measured as the number of microseconds since proxy startup.
> - `sessionId` — A UUID that uniquely identifies the session in time and space.
Is this the same sessionId as the KafkaSession or is this a different concept of sessionId?
I think that's worth making explicit, even if just because the KafkaSessionId is relatively new to the codebase.
> #### Session-scoped events
>
> Session-scoped events all have at least the following attributes:
> - `processAgeMicros` — The time of the event, measured as the number of microseconds since proxy startup.
Presumably you're using this to ensure event ordering and to avoid clock skew/injection issues?
Computing the actual event timestamp after the fact feels like quite an expensive choice, requiring stateful processing.
`System.currentTimeMillis()` has at least a couple of drawbacks:
- It's not very granular. A lot can happen between ticks of the clock.
- It's not guaranteed to be monotonic
My personal feeling was that these are sufficient justification to prefer something based on `System.nanoTime()`.
You make a good point about the need for stateful post-processing to recover a wall clock time. I didn't really feel that was a huge problem because:
- For the metrics emitter (which is basically a counter for each event type) the time gets dropped anyway
- For the logging emitter the log events would end up with a log4j-supplied timestamp anyway.
- For the Kafka emitter I expected that someone would have to write some kind of processing application anyway. Perhaps forcing that to be stateful from the off is a bit too much to ask.
I suppose we either consolidate on a single timestamp, or we include both.
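The stateful post-processing being discussed can be made concrete with a small sketch. `ProxyStartup`, `SessionEvent` and their fields are assumptions based on the attribute names in this proposal:

```java
import java.time.Instant;
import java.time.temporal.ChronoUnit;

// Hypothetical sketch: a ProxyStartup event carries the wall-clock time once;
// session-scoped events carry only processAgeMicros (derived from
// System.nanoTime(), so granular and monotonic). A consumer must hold the
// startup event in state to recover an absolute timestamp for later events.
record ProxyStartup(Instant wallClockTime) {}
record SessionEvent(long processAgeMicros) {}

public class TimestampRecovery {
    static Instant recover(ProxyStartup startup, SessionEvent event) {
        return startup.wallClockTime().plus(event.processAgeMicros(), ChronoUnit.MICROS);
    }

    public static void main(String[] args) {
        ProxyStartup startup = new ProxyStartup(Instant.parse("2024-01-01T00:00:00Z"));
        SessionEvent event = new SessionEvent(1_500_000L); // 1.5 s after startup
        System.out.println(recover(startup, event));
    }
}
```

Including both `processAgeMicros` and a wall-clock timestamp in each event would make this join unnecessary, at the cost of some redundancy.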
> * `ClientSaslAuthFailure` — Emitted when a client completes SASL authentication unsuccessfully
> - `attemptedAuthorizedId` — The authorized id the client attempted to use, if known.
> * `ClientSaslAuthSuccess` — Emitted when a client completes SASL authentication successfully
> - `authorizedId` — The authorised id
What about mTLS auth events?
I can see there is a distinction, in that security teams might want to know about each, but are they really different types of event?
The model I'm aiming for is that we use different events which each have a well-defined schema. It seems to me that the schemas for mTLS and SASL are different. TLS certificates are quite complex things and are often controlled by a whole different set of processes within an org than SASL credentials are. There's also the fact that both can be in use at the same time. For these reasons I think it's clearer to treat them as separate things.
> * `OperationAllowed` — Emitted when an `Authorizer` allows access to a resource.
> - `op` — The operation that was allowed (e.g. `READ`)
> - `resourceType` — The type of the resource (e.g. `Topic`)
> - `resourceName` — The name of the resource (e.g. `my-topic`)
> * `OperationDenied` — Emitted when an `Authorizer` denies access to a resource.
> - `op` — The operation that was denied (e.g. `READ`)
> - `resourceType` — The type of the resource (e.g. `Topic`)
> - `resourceName` — The name of the resource (e.g. `my-topic`)
The proxy can only authoritatively attest to denials. When the proxy's authorizer permits an operation, the request is forwarded to the broker, which may still deny it. OperationDenied is therefore a definitive statement about the operation's fate, but the symmetrical event is not OperationAllowed — it is OperationForwarded. A complete audit trail requires correlating proxy OperationForwarded events with broker-side authorization logs.
I see the point you're making and it seems reasonable. But I don't really like the terminology of "forwarding operations". We forward requests. Some requests represent (collections of) operations.
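A hedged sketch of the distinction under discussion (names and fields are illustrative, not part of the proposal): modelling an allow as a forward makes the provisional nature explicit in the type:

```java
// Hypothetical sketch: the proxy can attest definitively to a denial, but an
// allow only means the request was forwarded and the broker may still deny it,
// so a complete trail needs correlation with the broker's authorization log.
sealed interface AuthzEvent permits OperationDenied, OperationForwarded {}
record OperationDenied(String op, String resourceType, String resourceName) implements AuthzEvent {}
record OperationForwarded(String op, String resourceType, String resourceName) implements AuthzEvent {}

public class AuthzEventSketch {
    static String describe(AuthzEvent e) {
        if (e instanceof OperationDenied d) {
            return "definitive: proxy denied " + d.op();
        } else if (e instanceof OperationForwarded f) {
            return "provisional: " + f.op() + " forwarded to broker";
        }
        throw new IllegalStateException("unreachable: interface is sealed");
    }

    public static void main(String[] args) {
        System.out.println(describe(new OperationForwarded("READ", "Topic", "my-topic")));
    }
}
```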
> * `ClientSaslAuthFailure` — Emitted when a client completes SASL authentication unsuccessfully
> - `attemptedAuthorizedId` — The authorized id the client attempted to use, if known.
> * `ClientSaslAuthSuccess` — Emitted when a client completes SASL authentication successfully
> - `authorizedId` — The authorised id
This pair of events conflates two very distinct things, given the proxy works in two modes: SASL inspector and SASL terminator.
The audit event is either a witness (SASL inspection) or a decision (SASL termination); while those events look identical from a logging point of view, they would be interpreted quite differently by users of SIEM systems.
> This pair of events

I think you're referring to SASL success vs SASL failure?
Do we really need to export these distinctions outside the system, though? Once the user has chosen which mode they're going to use, I don't really see that the distinction matters. Or at least not to the extent that the thing on the receiving end needs to tell the difference (because its handling will be different). What I mean is: I assume the user knows they decided to use inspector and can make their own inferences about the consequences of SASL events, or they know they chose terminator and can make different inferences.
> I think you're referring to SASL success vs SASL failure?

Yes.
> I assume the user knows they decided to use inspector and can make their own inferences about the consequences of SASL events, or they know they chose terminator and can make different inferences.
I was assuming the people consuming the audit log / SIEM system were distinct from the people administering the proxy, and thus they don't know how the proxy is configured, so any understanding of what is being reported based on the deployment context is gone.
The guidance from the NIST and PCI DSS specs is that events should be self-contained, so I think the witness vs arbiter distinction is important to the semantic understanding of the event.
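One way to make the events self-contained, sketched under the assumption that a `proxyRole` attribute would be acceptable (this is not part of the current proposal):

```java
// Hypothetical sketch: embedding the proxy's SASL role in each event lets a
// SIEM consumer tell a witnessed authentication (inspection) from one the
// proxy decided itself (termination), without knowing how the proxy is
// configured. Record names mirror the proposal; proxyRole is an assumption.
enum SaslRole { INSPECTOR, TERMINATOR }
record ClientSaslAuthSuccess(SaslRole proxyRole, String authorizedId) {}
record ClientSaslAuthFailure(SaslRole proxyRole, String attemptedAuthorizedId) {}

public class SelfContainedSaslEvents {
    public static void main(String[] args) {
        // A terminator proxy decided this authentication itself:
        System.out.println(new ClientSaslAuthSuccess(SaslRole.TERMINATOR, "alice"));
        // An inspector proxy merely witnessed the broker's decision:
        System.out.println(new ClientSaslAuthFailure(SaslRole.INSPECTOR, "mallory"));
    }
}
```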
> Goals:
>
> * enable users to _easily_ collect a _complete_ log of security-related events
> * for the security events to be structured and amenable to automated post-processing
> * for the security events to be an API of the project, with the same compatibility guarantees as other APIs
>
> Non-goals:
>
> * collecting events which are *not* security-related.
> * create a replacement for a logging facade API (like the existing use of SLF4J already used by the proxy).
> * creating audit logs which are tamper-resistant (this could be a future extension)
At standup you asked for explicit feedback on goals. So here it is, as a suggested revision:

Goals:
* enable users to _easily_ collect a _complete_ log of security-related events
* for the security events to be structured and amenable to automated post-processing
* for the security events to be an API of the project, with the same compatibility guarantees as other APIs
* audit events must be self-contained and independently interpretable without joining against other events [1]
* the proxy logs its own decisions, not the outcomes of operations beyond its control
* enable correlation with broker audit logs via shared identifiers (`sessionId`, `correlationId`)
* filters can contribute proxy decision events to the audit stream

Non-goals:
* collecting events which are *not* security-related.
* create a replacement for a logging facade API (like the existing use of SLF4J already used by the proxy).
* creating audit logs which are tamper-resistant (this could be a future extension)
* capturing the broker's authorization decisions — those are the broker's to log
* witness events (proxy observations rather than proxy decisions) — deferred pending team consensus
* deeper integrations with specific SIEM systems

[1] The self-contained event goal has compliance implications worth noting: https://doi.org/10.6028/NIST.SP.800-92 and PCI DSS Requirement 10.3 both require that each audit record independently contain the event's timestamp, origin, and identity context.
> * `BrokerClose` — Emitted when a broker connection is closed (whether broker or proxy initiated).
>
> ### Log emitter
The proposal talks in terms of a singular emitter, but I think we need to support both the logging and metrics emitters at the same time, as they are for different audiences:
Logs for security / SIEM.
Metrics for SRE: a spike in denied events might well be interesting from a DDoS protection standpoint.
> The intent of offering this emitter is to provide a _simple_ way for users to set up basic alerting on security events, such as a sudden increase in the number of failed authentication attempts.
> A more detailed understanding would require consulting a log of security events obtained using one of the other emitters.
>
> ### Kafka emitter
I'm happy with the overall direction of the proposal. I think the idea to focus on a key set of auditable events (read, write etc.) rather than the full set of API keys is the right one. I do wonder if users might want the audit trail to include the allow/deny decisions of the broker too. This would be useful in the case where the user cannot change the Kafka broker (they are using a Kafka service of some kind), but we can always add this at a later point.