Yeti is a framework for building agentic AI applications with support for open-source large language models, tool calling, and modular extensions.
Yeti leverages Mistral-Nemo, providing compatibility with the OpenAI API specification without requiring an API subscription. This approach enables:
- Seamless use of open-source models.
- Future capability to train, fine-tune, or update models.
- Flexible model swapping (similar to `LoRA` adapters, but for open-source models).
- Greater control and ownership over intelligence, avoiding proprietary paywalls.
Why Mistral-Nemo?
- Handles meaningful conversations effectively.
- Supports tool and function calling for agentic AI development.
- Fully open-source and powerful.
- Compatible with the OpenAI API, zero-shot and `ReAct`-based flows, and `LangGraph`'s tool-calling framework.
- Can run quantized versions in limited GPU environments.
Out-of-the-box support includes the following (a sample tool definition is sketched after this list):
- Fetching weather for a given city.
- Getting the current date and time.
- Fetching exchange rates (via private API).
- Searching and summarizing results from the internet.
- Text embeddings and vector database for overcoming context limits.
- Session and thread IDs for topic-based conversation classification.
- Integrated search backend for browsing the internet.
- Voice controls and conversational interaction (low priority).
- Image analysis (low priority).
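For illustration, here is a minimal sketch of how a tool such as the date/time helper could be declared with LangChain's `@tool` decorator so the model can call it. The function name and body are hypothetical, not the project's actual implementation:

```python
from datetime import datetime, timezone

from langchain_core.tools import tool


@tool
def get_current_datetime() -> str:
    """Return the current date and time in UTC (ISO 8601)."""
    # Hypothetical helper; Yeti's real tool may format or source this differently.
    return datetime.now(timezone.utc).isoformat()
```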
- Host OS: Runs `llama_cpp` inference.
- Docker: Runs the database, frontend, and `FastAPI` backend.
```bash
git clone https://github.com/deepyes02/yeti-ai
```

- Install `llama_cpp` (compile for your specific architecture; see documentation).
- Install Docker Desktop.
- Download the `Mistral-Nemo` quantized GGUF model from Hugging Face.
- Serve the model locally:

  ```bash
  llama-server -m ~/llms/mistral-nemo-15.gguf --jinja -c 4096  # Adjust context length based on available GPU
  ```

- Run the backend (`FastAPI` + `WebSocket`):

  ```bash
  uvicorn app.main:app --host 0.0.0.0 --port 8000
  ```

- Start `Docker` containers in the project root:

  ```bash
  docker compose up -d
  ```

- Ensure the model name is correctly configured in `load_model.py`.
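Once the model is being served, you can sanity-check the endpoint before starting the rest of the stack. This is a sketch only; it assumes the server is on the default port `8080` (as in the `ChatOpenAI` config below) and that the `openai` Python client is installed (it ships as a dependency of `langchain_openai`):

```python
from openai import OpenAI

# Point the OpenAI client at the local llama-server; any API key string works.
client = OpenAI(base_url="http://localhost:8080/v1", api_key="any_string_here")

resp = client.chat.completions.create(
    model="mistral-nemo",
    messages=[{"role": "user", "content": "Reply with a single word."}],
)
print(resp.choices[0].message.content)
```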
For testing, type checking, and script execution in scripts/, it is recommended to set up a virtual environment in the project root:
```bash
python -m venv env  # Python 3.11 recommended
source ./env/bin/activate
pip install -r requirements.txt
```

Mistral-Nemo is OpenAI API-compatible. Wrapping it in `LangGraph` works just like using OpenAI, except no real API key is required:
```python
from langchain_openai import ChatOpenAI
from pydantic import SecretStr


def load_model():
    model = ChatOpenAI(
        base_url="http://localhost:8080/v1",  # local llama-server endpoint
        model="mistral-nemo",
        api_key=SecretStr("any_string_here"),  # any placeholder string works
        temperature=0.9,
        top_p=0.95,
    )
    return model
```
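As a usage sketch (assumed names, not the project's exact wiring), the loaded model can be dropped into a LangGraph `ReAct` agent with an in-memory checkpointer, so that a `thread_id` groups follow-up messages into the same conversation. `get_current_datetime` here is the hypothetical tool sketched earlier:

```python
from langgraph.checkpoint.memory import MemorySaver
from langgraph.prebuilt import create_react_agent

model = load_model()
# Bind the tool and enable per-thread memory via a checkpointer.
agent = create_react_agent(model, [get_current_datetime], checkpointer=MemorySaver())

reply = agent.invoke(
    {"messages": [("user", "What time is it right now?")]},
    config={"configurable": {"thread_id": "demo-thread"}},
)
print(reply["messages"][-1].content)
```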
Service ports:

- Backend server: Port `8000`
- Frontend server: Port `3000` (see `docker-compose.yml`)
Visit: http://localhost:3000
- **DeepSeek** – Works, but limited by the lack of a quantized non-thinking model.
- **Qwen 3** – Has a “thinking mode” toggle, but not yet supported via `Ollama`. (Issue raised with `LangGraph`.)
- **Llama 3.2** – Handles tools but often produces incoherent results.
- **Granite 3.3 (8B)** – Promising IBM model, but tool calling is not yet functional (needs more testing).

