feat(adr): ADR-0010 (TrustyAI SDK) #53
Conversation
> - **CLI interface** providing command-line access to all provider functionality
>
> 3. **Distribution**:
>    - Installable via `pip install trustyai`
I expect this to be more like `pip install trustyai-sdk` or `pip install trustyai[sdk]`. Or is the proposal to replace the current `trustyai` python library that includes the core algorithms with the SDK?
Good question @danielezonca. The idea is to replace the TrustyAI "core" indeed.
Actually, I think we could do it the other way around, i.e. `pip install trustyai` for the SDK and `pip install trustyai[cli]` for the CLI, for instance, since the SDK doesn't need the CLI to work, but not the other way around.
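For reference, that split could be expressed with a standard setuptools extra; a minimal sketch only (the package layout and the dependency names are assumptions, not the final packaging):

```python
# Illustrative packaging sketch: the SDK installs by default, the CLI is an extra.
from setuptools import setup, find_packages

setup(
    name="trustyai",
    packages=find_packages(),
    install_requires=["pandas", "pydantic"],   # SDK-only dependencies (assumed)
    extras_require={"cli": ["click"]},         # pulled in by `pip install trustyai[cli]`
)
```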
> - Provides programmatic access to all AI safety capabilities
> - Can be used directly in Python applications or as a core foundation for services
>
> 4. **Broad target support**:
I would include in this list "distribution projects" like Kubeflow (see the Kubeflow SDK) or llama-stack.
We want to make sure it is easy to use the SDK to bring TrustyAI capabilities into similar distributions.
> **Provider Design Principles:**
>
> 1. **Provider-based design**: Each AI safety scope will be represented by a `Provider` interface that defines the capabilities for that domain.
This goal seems to overlap with the llama-stack Safety provider.
As far as I can see there are some overlaps, and the local vs. Kubernetes abstraction looks similar to llama-stack's local vs. remote concept.
The scope of APIs (providers) expected to be covered here seems larger than llama-stack's, but I would like to clarify the correlation here, e.g. how the TrustyAI EvaluationProvider compares to the llama-stack Eval API.
@danielezonca @evaline-ju Very good question (I'll also address @evaline-ju's comment here).
It's no secret that Llama Stack's design has been a big inspiration for the TrustyAI SDK proposal 🙂
Regarding the local vs. remote and local vs. Kubernetes distinction, I see a few key differences. The TrustyAI SDK specialises in two specific deployment targets: local and Kubernetes. These have been TrustyAI's infrastructure priorities. By targeting Kubernetes specifically (rather than a generic "remote" approach), I believe we can have deeper integration with Kubernetes and OpenShift than Llama Stack currently offers.
This specialised approach would allow us to provide common "core" methods and patterns for handling cluster resources, including general resource factories (utilities for translating parameters to Custom Resources and validating them), error handling frameworks, and other capabilities that can be reused across all SDK Providers.
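To make the "resource factory" idea concrete, here is a minimal sketch of what such a utility could look like (the function names and the CR fields below are illustrative assumptions, not the final design):

```python
# Illustrative only: a "resource factory" translating SDK parameters into a
# Custom Resource dict that a Kubernetes provider could validate and submit.
# The CR shape (apiVersion, kind, spec fields) is assumed for the example.
from typing import Any, Dict, List, Optional


def build_eval_job_cr(name: str, model: str, tasks: List[str],
                      limit: Optional[int] = None) -> Dict[str, Any]:
    """Build a hypothetical LMEvalJob-style Custom Resource from SDK parameters."""
    cr: Dict[str, Any] = {
        "apiVersion": "trustyai.opendatahub.io/v1alpha1",  # assumed group/version
        "kind": "LMEvalJob",
        "metadata": {"name": name},
        "spec": {"model": model, "taskList": {"taskNames": tasks}},
    }
    if limit is not None:
        cr["spec"]["limit"] = str(limit)
    return cr


def validate_eval_job_cr(cr: Dict[str, Any]) -> None:
    """Minimal client-side validation before handing the CR to the cluster."""
    if not cr["spec"]["taskList"]["taskNames"]:
        raise ValueError("at least one evaluation task is required")
```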
The TrustyAI SDK also addresses different concerns than Llama Stack. Rather than focusing on LLM operations like inference, TrustyAI provides a unified API layer specialised in AI safety capabilities. I see this as complementary to Llama Stack rather than a duplication.
For example, the TrustyAI SDK could simplify the addition of new safety-focused providers to Llama Stack. By using TrustyAI as a dependency, implementing Llama Stack Providers would become almost trivial, requiring only minimal glue code to convert Llama Stack requests into TrustyAI SDK parameters. We could even make SDK providers directly "pluggable" as Llama Stack Providers.
I'll add a concrete example to the proposal demonstrating how the current LMEval Llama Stack Provider could be simplified using a TrustyAI SDK LMEval Provider.
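In the meantime, here is a rough sketch of the shape I have in mind (class and method names on both the Llama Stack side and the SDK side are assumptions for illustration, not actual APIs):

```python
# Hypothetical sketch: a Llama Stack-facing provider as a thin wrapper around an
# SDK provider. Names and signatures are illustrative, not real APIs.
from typing import Any, Dict, List


class TrustyAILMEvalProvider:
    """Stand-in for a TrustyAI SDK evaluation provider."""

    def evaluate(self, model: str, tasks: List[str], **parameters: Any) -> Dict[str, Any]:
        # Local or Kubernetes execution would be handled inside the SDK.
        raise NotImplementedError("provided by the TrustyAI SDK")


class LlamaStackLMEvalAdapter:
    """Llama Stack-facing provider: only glue code, no evaluation logic."""

    def __init__(self) -> None:
        self._sdk = TrustyAILMEvalProvider()

    def run_eval(self, request: Dict[str, Any]) -> Dict[str, Any]:
        # Translate the incoming (Llama Stack-shaped) request into SDK parameters
        # and delegate; result translation would live here as well.
        return self._sdk.evaluate(
            model=request["model"],
            tasks=request.get("tasks", []),
            **request.get("parameters", {}),
        )
```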
> class FileDataset(BaseDataset):
>     """Dataset implementation for file-based data sources (CSV, JSON, Parquet, etc.)."""
Can you please clarify whether this option covers the PVC mounting scenario too?
@danielezonca I added a note on PVC.
IMHO, wouldn't a PVC just be a storage abstraction? The FileDataset would behave the same way on a local and on a Kubernetes provider. Happy to change this, though.
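To illustrate what I mean (the FileDataset constructor and paths below are just examples), the same dataset code would simply point at a path that happens to be a PVC mount when running on Kubernetes:

```python
# Illustrative only: the same file-based dataset, pointed either at a local path
# or at a path where a PVC is mounted inside the pod. FileDataset is assumed
# from the proposal's examples.
local_dataset = FileDataset(path="./data/outcomes.parquet")
pvc_dataset = FileDataset(path="/mnt/trustyai-data/outcomes.parquet")  # PVC mountPath
```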
A few comments/questions
> TBD
>
> ## 8. Alternatives Considered / Rejected
Perhaps similar to @danielezonca's "This goal seems to overlap with llama-stack Safety provider" comment, it might be helpful to spell out why this separate SDK is the best decision going forward, as opposed to leveraging something existing like llama-stack.
@evaline-ju Very good point. I've answered the two comments here
> #### 6.4.1.1. Dataset Abstraction
>
> The TrustyAI SDK implements a unified Dataset abstraction that provides a consistent interface for accessing data from various sources while using pandas DataFrame as the universal data format. This abstraction allows providers to work with data from databases, files, cloud storage, and other sources without requiring knowledge of the underlying storage implementation.
A question about TrustyAI datasets in general: are these mostly static after access, i.e. there aren't additional manipulations that would warrant a need to save or persist a manipulated dataset?
@evaline-ju great question!
In this proposal, Datasets are indeed treated as immutable data loaders backed by DataFrames (or even NumPy arrays, if we want to support multi-dimensional data) and serve simply as a common format for providers.
In this case, manipulations would happen in user code with pandas, and users would handle their own persistence needs.
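For example, the flow I have in mind would look roughly like this (FileDataset and its constructor are assumed from the proposal, not a final API):

```python
# Illustrative flow: the Dataset is only an immutable loader; any manipulation
# happens in user code, on the DataFrame, not on the Dataset itself.
dataset = FileDataset(path="./data/outcomes.csv")
df = dataset.load()  # pandas DataFrame, the common format for providers

# Further manipulation is plain pandas in user code
adults = df[df["age"] >= 18]

# Persistence, if needed, is the user's responsibility
adults.to_parquet("./data/outcomes_adults.parquet")
```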
> model: ModelReference
> tasks: List[str] = Field(description="Evaluation tasks to run")
> dataset: Optional[BaseDataset] = None
> parameters: Dict[str, Any] = Field(default_factory=dict)
Not specific to evaluation, but with various algorithm implementations I've seen args/kwargs parameters get fairly numerous and potentially nested. While the Dict is fairly flexible in Python, how will the nesting be accounted for in the CLI translation, i.e. the `--parameters "embeddings_model=openai/text-embedding-ada-002…"` portion?
@evaline-ju Very interesting question! 🙂
Personally, I'm partial to either:

1. **Nested data serialisation**, similar to cURL's approach, i.e.

   --parameters '{"embeddings_model": "openai/text-embedding-ada-002"}'
   --parameters @parameters.json

or

2. **Nested key serialisation**, similar to Helm's value setting, i.e.

   --parameter embeddings.model=openai/text-embedding-ada-002 --parameter embeddings.other.nested=...
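For the nested key option, the CLI-to-Dict translation could be a small helper along these lines (a sketch only; the function name and flag handling are not part of the proposal):

```python
# Illustrative only: turning repeated --parameter key.path=value flags into the
# nested Dict[str, Any] expected by the request models.
from typing import Any, Dict, List


def parse_nested_parameters(pairs: List[str]) -> Dict[str, Any]:
    """Expand dotted keys, e.g. ["embeddings.model=x"] -> {"embeddings": {"model": "x"}}."""
    result: Dict[str, Any] = {}
    for pair in pairs:
        key, _, value = pair.partition("=")
        node = result
        parts = key.split(".")
        for part in parts[:-1]:
            node = node.setdefault(part, {})
        node[parts[-1]] = value
    return result


# parse_nested_parameters(["embeddings.model=openai/text-embedding-ada-002"])
# -> {"embeddings": {"model": "openai/text-embedding-ada-002"}}
```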
> """Request model for evaluation operations."""
> model: ModelReference
> tasks: List[str] = Field(description="Evaluation tasks to run")
> dataset: Optional[BaseDataset] = None
I realize this is just an example implementation, but I'm curious about the expectations for batch cases, which I assume could be popular, whether that means multiple models or multiple datasets in a request. Will the expectation be to update an existing request, to implement new classes to accommodate these use cases, or to say that the user of trustyai is responsible for those cases?
> - Returns results directly from the local process
>
> 2. **Kubernetes Implementation:**
>    - Provider builds a Custom Resource (CR) from the input parameters
I see the "TrustyAI operator integration for Kubernetes deployments" in the dependency notes but am still a bit confused - will the dev here have to write the logic, or is this a matter of leveraging an existing operator that can do this?
dahlem left a comment:
Great work on this ADR 🚀
>         self._cached_data: Optional[pd.DataFrame] = None
>
>     @abstractmethod
>     def load(self, **kwargs) -> pd.DataFrame:
What assumptions are we making about the underlying data asset, and what are the implications of materialising it to pandas DataFrames, with respect to:
- memory and scale: do we expect that the data asset can be fully loaded into memory?
- parallelism: do we expect no further data manipulations since pandas is single-threaded?
- versioning/lineage: @evaline-ju asked whether the data is static. If it isn't, do we need to track provenance?
- serialization: if data is dynamic does it need to be serialized?
- schema validation: pandas has weak data schema support; it does not enforce a schema throughout the DataFrame lifecycle; type coercion is implicit and error-prone; missing data is loosely handled (NaNs allowed in any column, non-nullable columns not enforced); and there are no built-in validation/constraints
> fairness_provider = FairnessProvider(implementation="fairlearn")
>
> # Provider handles data source abstraction internally
> spd_score = fairness_provider.statistical_parity_difference(
Where does the logic sit if a dataset does not fit into memory? Do the metric providers need to stream through the data and update the metric incrementally under bounded memory constraints? If so, some metrics are not easily streamable, like (PR-)AUC, confusion matrices with arbitrary thresholds, ranking-based metrics, etc.
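For concreteness, a metric like SPD itself could be maintained with bounded memory from running counts, unlike the metrics listed above; a rough sketch (names and the SPD convention used are assumptions for illustration, not part of the ADR):

```python
# Illustrative only: a bounded-memory, incrementally updated SPD, taken here as
# P(y_hat=1 | unprivileged) - P(y_hat=1 | privileged).
from dataclasses import dataclass


@dataclass
class StreamingSPD:
    priv_total: int = 0
    priv_positive: int = 0
    unpriv_total: int = 0
    unpriv_positive: int = 0

    def update(self, privileged: bool, positive_outcome: bool) -> None:
        """Update running counts from a single record; O(1) memory."""
        if privileged:
            self.priv_total += 1
            self.priv_positive += int(positive_outcome)
        else:
            self.unpriv_total += 1
            self.unpriv_positive += int(positive_outcome)

    def value(self) -> float:
        """Current SPD estimate over everything seen so far."""
        if self.priv_total == 0 or self.unpriv_total == 0:
            return float("nan")
        return (self.unpriv_positive / self.unpriv_total
                - self.priv_positive / self.priv_total)


# Usage: stream chunks without materialising the full dataset, e.g.
# spd = StreamingSPD()
# for chunk in pd.read_csv("outcomes.csv", chunksize=10_000):
#     for priv, pos in zip(chunk["privileged"], chunk["outcome"]):
#         spd.update(bool(priv), bool(pos))
```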
> ```python
> # TrustyAI service specific metrics endpoints implementation
> @app.post("/v2/metrics/spd/calculate")
> async def calculate_spd(request: SPDRequest):
What is the pattern to go from ad-hoc metrics to runtime metrics? E.g., I'd like to monitor SPD continuously in deployment over configurable windows of data, using some form of FIFO buffer or discrete non-overlapping buckets (weekly, monthly, etc.). Does this proposal imply that the data needs to materialise first?
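For concreteness, a FIFO-window variant would only need to keep the window itself in memory rather than materialising the full dataset; a rough sketch (names are illustrative only, not part of the proposal):

```python
# Illustrative only: SPD monitored over a FIFO window of the most recent N
# records; only the window, not the full dataset, is held in memory.
from collections import deque


class WindowedSPDMonitor:
    def __init__(self, window_size: int = 10_000):
        # Each entry is (privileged: bool, positive_outcome: bool).
        self.window = deque(maxlen=window_size)

    def observe(self, privileged: bool, positive_outcome: bool) -> float:
        """Add one record, evicting the oldest if the window is full, and return current SPD."""
        self.window.append((privileged, positive_outcome))
        priv = [pos for p, pos in self.window if p]
        unpriv = [pos for p, pos in self.window if not p]
        if not priv or not unpriv:
            return float("nan")
        return sum(unpriv) / len(unpriv) - sum(priv) / len(priv)
```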
> **Local Execution:**
> ```bash
> # Local evaluation execution
> trustyai eval execute --provider lm-evaluation-harness --execution-mode local --model "hf/microsoft/DialoGPT-medium" --limit 10 --tasks "hellaswag,arc"
Can these model strings potentially come from the model registry / model catalogue? How might we enforce a secure supply chain?
Thank you @ruivieira for this ADR!! This is super thorough and clearly written.

As the n00b that I am, it's not obvious to me that our problem statement is fragmentation and a steep learning curve, and that therefore we lack a unified API. I am also concerned that this isn't just inspired by Llama Stack, but actually doing a lot of the same things.

Having said that, I do see the need/value for us to stand on our own feet and have our own SDK, for example for a) customers that don't use LS, or b) the community of AI experts/tinkerers that is looking to experiment or develop without the bloat of LS, etc.

However, given Red Hat's decision to adopt LS as the glue and basis of our GenAI product, it seems to me that this need/value for us to stand on our own feet should manifest itself a bit more loudly in the form of customers looking for trustworthy solutions that don't / won't use LS, or an active community raising issues in our public repos that point at fragmentation issues, etc. In other words it would arise as a "pull" instead of a "push".

Therefore, while I agree with this ADR in the sense of the end state, I would like to propose that we consider a different order of operations. Specifically:
@dmaniloff Thank you for the feedback! You raise excellent points about priorities and the relationship with Llama Stack (LS). Let me try to give my view on the fragmentation problem and how I think TrustyAI and LS can work well together.

You're absolutely right to question whether the fragmentation problem is clear! Here's my view of the current situation: TrustyAI Core (Java-based, with its own API patterns), the Metrics Service (a REST server with ad-hoc endpoints, where each metric has different request/response formats), the LM-Eval integration (deployed as Kubernetes Jobs with custom configuration), and Guardrails (separate configuration logic and deployment patterns). Someone wanting to implement AI safety needs to learn four different interfaces, deployment methods, and configuration approaches. The SDK wouldn't replace these components' APIs, but provide a unified way to use them together.

I agree that making LS interoperability easy is critical. Rather than building TrustyAI "around" LS, I see advantages in making our provider architecture LS-compatible. This would actually increase the reach of TrustyAI providers: direct SDK usage for users wanting specialised AI safety tools, LS integration for broader LLM workflows, and "mixed" scenarios where teams start with direct SDK usage and later integrate with LS infrastructure. For instance, a TrustyAI evaluation provider could execute locally, on Kubernetes, or as an LS provider, all through the same interface.

Llama Stack provides a good general-purpose API for LLM inference and workflows. TrustyAI can complement this by offering an API for AI safety that moves faster in this specific domain. TrustyAI can be the safety SDK that plugs into LS infrastructure when needed, but that is also used by the community at large (which might not need LS).

I agree with your point about the datasets abstraction, but this work needs to happen regardless. In the TrustyAI service, we'd still need to write data abstractions to send data from databases, CSV files, and HDF5 to different algorithms from different libraries like AIF360, fairlearn, and Deon. Whether we implement it in the SDK or in the TrustyAI service, we need a unified way to handle different storage backends.

Another advantage of the SDK is that it can be used as a library dependency or from Jupyter as "pure" Python, with no need for LS at all. This would be helpful for researchers and developers who want to work directly with AI safety tools without additional dependencies (such as an LS distro).

Looking at the effort: an LS-first approach means implementing LS out-of-tree providers, plus backends (K8s/local), plus a TrustyAI community LS distribution. An SDK-first approach means building providers directly in the SDK, plus creating an LS compatibility layer. In my view, the SDK-first approach seems more efficient because we develop each capability once with multiple deployment targets, can iterate faster on AI safety-specific features, and still achieve full LS compatibility.

Perhaps we can find a middle ground: develop the core SDK providers with LS compatibility built in from day one, and deploy these providers both standalone and as LS integrations. This way, we're not building separately from LS, but ensuring TrustyAI can work with LS deployments while also keeping the broader community reach.

A good test run of this would be LMEval and Guardrails: writing them as SDK providers (for LMEval, for instance, the current LS provider is a good example; it's almost an "SDK provider" already) and then turning the out-of-tree provider into a very thin wrapper around the SDK provider. What do you think about this approach?