Chat Completions

warning

🚧 Cortex.cpp is currently under development. Our documentation outlines the intended behavior of Cortex, which may not yet be fully implemented in the codebase.

Cortex's Chat API is compatible with OpenAI's Chat Completions endpoint, making it a drop-in replacement for local inference.
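For example, an existing OpenAI client can be pointed at a local Cortex server simply by changing the base URL. The snippet below is a minimal sketch, assuming the official openai Python package, a Cortex server listening on the default port 3928 (as in the curl example further down), and a locally pulled mistral model:

from openai import OpenAI

# Point the standard OpenAI client at the local Cortex server.
# Assumptions: `pip install openai`, Cortex listening on localhost:3928,
# and a model named "mistral" already downloaded.
client = OpenAI(
    base_url="http://localhost:3928/v1",
    api_key="not-needed",  # local inference does not require a real API key
)

response = client.chat.completions.create(
    model="mistral",
    messages=[{"role": "user", "content": "Hello"}],
)
print(response.choices[0].message.content)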

For local inference, Cortex is multi-engine and supports the following model formats:

  • GGUF: A generalizable LLM format that runs across CPUs and GPUs. Cortex implements a GGUF runtime through llama.cpp.
  • TensorRT: A production-ready, enterprise-grade LLM format optimized for fast inference on NVIDIA GPUs. Cortex implements a TensorRT runtime through TensorRT-LLM.
  • ONNX: A cross-platform format for machine learning models. Cortex implements an ONNX runtime through ONNX Runtime.

Cortex routes requests to multiple APIs for remote inference while providing a single, easy-to-use, OpenAI-compatible endpoint.

Usage

CLI


# Streaming
cortex chat --model mistral

API


curl http://localhost:3928/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "mistral",
    "messages": [
      {
        "role": "user",
        "content": "Hello"
      }
    ],
    "stream": true,
    "max_tokens": 128,
    "stop": null,
    "frequency_penalty": 1,
    "presence_penalty": 1,
    "temperature": 1,
    "top_p": 1
  }'
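Because the request above sets "stream": true, the response arrives as a sequence of OpenAI-style chunks rather than a single JSON body. A minimal sketch of consuming that stream with the openai Python package (same assumptions as above: local server on port 3928, a pulled mistral model) could look like this:

from openai import OpenAI

client = OpenAI(base_url="http://localhost:3928/v1", api_key="not-needed")

# Request a streamed completion and print tokens as they arrive.
stream = client.chat.completions.create(
    model="mistral",
    messages=[{"role": "user", "content": "Hello"}],
    stream=True,
)
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:  # the final chunk may carry no content
        print(delta, end="", flush=True)
print()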

Capabilities

Multiple Local Engines

Cortex scales applications from prototype to production, running on CPU-only laptops with llama.cpp and on GPU-accelerated servers with TensorRT-LLM.

To configure each engine, refer to the cortex engines init command.

Learn more about our engine architecture:

Multiple Remote APIs

Cortex also acts as an aggregator, routing remote inference requests through a single OpenAI-compatible endpoint; a sketch of such a request follows the list below. Currently, Cortex supports:

  • OpenAI
  • Groq
  • Anthropic
  • MistralAI
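
Because remote providers sit behind the same OpenAI-compatible endpoint, a remote request can look just like a local one. The following is a hypothetical sketch: the model identifier gpt-4o and the assumption that Cortex routes by model name (with the provider's API key configured in Cortex) are illustrations, not documented behavior.

from openai import OpenAI

# Hypothetical: same local endpoint, remote model. The model id and the
# route-by-model-name behavior are assumptions for illustration only.
client = OpenAI(base_url="http://localhost:3928/v1", api_key="not-needed")

response = client.chat.completions.create(
    model="gpt-4o",  # assumed remote model id; requires the provider key registered with Cortex
    messages=[{"role": "user", "content": "Hello"}],
)
print(response.choices[0].message.content)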
note

Learn more about Chat Completions capabilities: