Chat Completions

warning

🚧 Cortex.cpp is currently under development. Our documentation outlines the intended behavior of Cortex, which may not yet be fully implemented in the codebase.

Cortex's Chat API is compatible with OpenAI's Chat Completions endpoint, making it a drop-in replacement for local inference.
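For example, an existing OpenAI client can be pointed at a local Cortex server simply by changing the base URL. The snippet below is a minimal sketch, assuming the official openai Python package, a Cortex server listening on the default port 3928 (as in the curl example further down), and a locally pulled mistral model:

from openai import OpenAI

# Point the standard OpenAI client at the local Cortex server.
# Assumptions: `pip install openai`, Cortex listening on localhost:3928,
# and a model named "mistral" already downloaded.
client = OpenAI(
    base_url="http://localhost:3928/v1",
    api_key="not-needed",  # local inference does not require a real API key
)

response = client.chat.completions.create(
    model="mistral",
    messages=[{"role": "user", "content": "Hello"}],
)
print(response.choices[0].message.content)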

For local inference, Cortex is multi-engine and supports the following model formats:

  • GGUF: A generalizable LLM format that runs across CPUs and GPUs. Cortex implements a GGUF runtime through llama.cpp.
  • TensorRT: A production-ready, enterprise-grade LLM format optimized for fast inference on NVIDIA GPUs. Cortex implements a TensorRT runtime through TensorRT-LLM.
  • ONNX: A cross-platform format for machine learning models. Cortex implements an ONNX runtime through ONNX Runtime.

Cortex routes requests to multiple APIs for remote inference while providing a single, easy-to-use, OpenAI-compatible endpoint.

Usage

CLI


# Streaming
cortex chat --model mistral

API


curl http://localhost:3928/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "mistral",
    "messages": [
      {
        "role": "user",
        "content": "Hello"
      }
    ],
    "stream": true,
    "max_tokens": 128,
    "stop": null,
    "frequency_penalty": 1,
    "presence_penalty": 1,
    "temperature": 1,
    "top_p": 1
  }'
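Because the request above sets "stream": true, the response arrives as a sequence of OpenAI-style chunks rather than a single JSON body. A minimal sketch of consuming that stream with the openai Python package (same assumptions as above: local server on port 3928, a pulled mistral model) could look like this:

from openai import OpenAI

client = OpenAI(base_url="http://localhost:3928/v1", api_key="not-needed")

# Request a streamed completion and print tokens as they arrive.
stream = client.chat.completions.create(
    model="mistral",
    messages=[{"role": "user", "content": "Hello"}],
    stream=True,
)
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:  # the final chunk may carry no content
        print(delta, end="", flush=True)
print()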

Capabilities

Multiple Local Engines

Cortex scales applications from prototype to production, running on CPU-only laptops with llama.cpp and on GPU-accelerated servers with TensorRT-LLM.

To configure each engine, refer to the cortex engines init command.

Learn more about our engine architecture:

Multiple Remote APIs

Cortex also acts as an aggregator, routing remote inference requests through a single OpenAI-compatible endpoint; a sketch of such a request follows the list below. Currently, Cortex supports:

  • OpenAI
  • Groq
  • Anthropic
  • MistralAI
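
Because remote providers sit behind the same OpenAI-compatible endpoint, a remote request can look just like a local one. The following is a hypothetical sketch: the model identifier gpt-4o and the assumption that Cortex routes by model name (with the provider's API key configured in Cortex) are illustrations, not documented behavior.

from openai import OpenAI

# Hypothetical: same local endpoint, remote model. The model id and the
# route-by-model-name behavior are assumptions for illustration only.
client = OpenAI(base_url="http://localhost:3928/v1", api_key="not-needed")

response = client.chat.completions.create(
    model="gpt-4o",  # assumed remote model id; requires the provider key registered with Cortex
    messages=[{"role": "user", "content": "Hello"}],
)
print(response.choices[0].message.content)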
note

Learn more about Chat Completions capabilities: