ONNX

warning

🚧 Cortex.cpp is currently under development. Our documentation outlines the intended behavior of Cortex, which may not yet be fully implemented in the codebase.

Introduction

Cortex.onnx is a C++ inference library for Windows built on onnxruntime-genai, using DirectML for hardware acceleration. DirectML is a high-performance DirectX 12 library for machine learning that provides GPU acceleration across a wide range of hardware and drivers, including AMD, Intel, NVIDIA, and Qualcomm GPUs. Cortex.onnx integrates onnxruntime-genai for inference tasks and occasionally contributes changes upstream.

info

The current valid combinations of executor and precision are:

  • FP32 CPU
  • FP32 CUDA
  • FP16 CUDA
  • FP16 DML
  • INT4 CPU
  • INT4 CUDA
  • INT4 DML

Usage


cortex engines onnx init

The command checks for, downloads, and installs the following dependencies for Windows:


- engine.dll
- D3D12Core.dll
- DirectML.dll
- onnxruntime.rel.dll
- onnxruntime-genai.dll
- MSVC runtime libraries:
  - msvcp140.dll
  - vcruntime140.dll
  - vcruntime140_1.dll

info

To include onnx in your own server implementation, follow the server example in examples/server.cc (see Code Structure below).
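
As a rough, hedged illustration of what such an integration involves, the sketch below loads the engine as a dynamic library and resolves a factory function. The exported symbol name get_engine is an assumption about how the engine library is exposed; treat examples/server.cc as the authoritative reference.

    // Minimal sketch of loading the onnx engine dynamically on Windows.
    // The exported factory symbol (get_engine) is an assumption; see
    // examples/server.cc for the real integration.
    #include <windows.h>
    #include <iostream>

    class EngineI;  // Abstract engine interface declared in enginei.h

    int main() {
      HMODULE lib = LoadLibraryA("engine.dll");
      if (!lib) {
        std::cerr << "Failed to load engine.dll" << std::endl;
        return 1;
      }
      using GetEngineFn = EngineI* (*)();
      auto get_engine =
          reinterpret_cast<GetEngineFn>(GetProcAddress(lib, "get_engine"));
      if (!get_engine) {
        std::cerr << "get_engine symbol not found" << std::endl;
        return 1;
      }
      EngineI* engine = get_engine();  // Ready for LoadModel, chat, etc.
      (void)engine;
      return 0;
    }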

Get ONNX Models

You can download precompiled ONNX models from the Cortex Hub on Hugging Face. These models include configurations, tokenizers, and dependencies tailored for optimal performance with the onnx engine.

Interface

onnx provides the following interfaces:

  • HandleChatCompletion: Processes chat completion tasks.

    void HandleChatCompletion(
        std::shared_ptr<Json::Value> json_body,
        std::function<void(Json::Value&&, Json::Value&&)>&& callback);

  • LoadModel: Loads a model based on the specifications.

    void LoadModel(
        std::shared_ptr<Json::Value> json_body,
        std::function<void(Json::Value&&, Json::Value&&)>&& callback);

  • UnloadModel: Unloads a model as specified.

    void UnloadModel(
        std::shared_ptr<Json::Value> json_body,
        std::function<void(Json::Value&&, Json::Value&&)>&& callback);

  • GetModelStatus: Retrieves the status of a model.

    void GetModelStatus(
        std::shared_ptr<Json::Value> json_body,
        std::function<void(Json::Value&&, Json::Value&&)>&& callback);

All the interfaces above take the following parameters:

  Parameter    Description
  -----------  -------------------------------------
  json_body    The request content, in JSON format.
  callback     A function that handles the response.
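
As a hedged illustration of this calling convention, a component holding an engine instance might invoke HandleChatCompletion as sketched below. The JSON field names (model, messages, stream) and the callback argument order (status first, payload second) are assumptions for the example, not the documented schema:

    // Hypothetical caller-side sketch. The JSON field names and the
    // callback argument order (status, response) are assumptions.
    #include <json/json.h>
    #include <functional>
    #include <iostream>
    #include <memory>

    void CallChatCompletion(EngineI* engine) {  // EngineI from enginei.h
      auto body = std::make_shared<Json::Value>();
      (*body)["model"] = "my-onnx-model";  // assumed field name
      (*body)["stream"] = false;           // assumed field name
      (*body)["messages"][0]["role"] = "user";
      (*body)["messages"][0]["content"] = "Hello!";

      engine->HandleChatCompletion(
          body, [](Json::Value&& status, Json::Value&& response) {
            std::cout << response.toStyledString() << std::endl;
          });
    }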

Architecture

Main Components

These are the main components that interact to provide an API for inference tasks using the onnxruntime-genai library:

  • cortex-cpp: Handles API requests and responses.
  • enginei: The abstract engine interface for inference (sketched below).
  • onnx: Exposes the engine's APIs through the engine interface so that other components can use its features.
  • onnx_engine: Exposes APIs for inference; it loads and unloads models and simplifies API calls to onnxruntime_genai.
  • onnxruntime_genai: A submodule of the onnxruntime_genai repository that provides the core inference functionality.
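
Based on the four interfaces listed earlier, enginei.h can be pictured roughly as the abstract class below. This is a reconstruction from the signatures above, not the verbatim header:

    // Sketch of the engine interface in enginei.h, reconstructed from
    // the four interface signatures shown earlier (not the verbatim file).
    #include <json/json.h>
    #include <functional>
    #include <memory>

    class EngineI {
     public:
      virtual ~EngineI() {}

      virtual void HandleChatCompletion(
          std::shared_ptr<Json::Value> json_body,
          std::function<void(Json::Value&&, Json::Value&&)>&& callback) = 0;

      virtual void LoadModel(
          std::shared_ptr<Json::Value> json_body,
          std::function<void(Json::Value&&, Json::Value&&)>&& callback) = 0;

      virtual void UnloadModel(
          std::shared_ptr<Json::Value> json_body,
          std::function<void(Json::Value&&, Json::Value&&)>&& callback) = 0;

      virtual void GetModelStatus(
          std::shared_ptr<Json::Value> json_body,
          std::function<void(Json::Value&&, Json::Value&&)>&& callback) = 0;
    };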

Communication Protocols

Load a Model

The following sequence describes the interaction between three components, cortex-js, cortex-cpp, and onnx, when a model is loaded with the onnx engine:

  1. HTTP Request from cortex-js to cortex-cpp:

    • cortex-js sends an HTTP request to cortex-cpp to load a model.
  2. Engine Loading in cortex-cpp:

    • Upon receiving the HTTP request, cortex-cpp initiates the loading of the engine.
  3. Model Loading from cortex-cpp to onnx:

    • cortex-cpp then requests onnx to load the model (see the code sketch after this list).
  4. Model Preparation in onnx:

    • onnx performs the following tasks:
      • Create Tokenizer: Initializes a tokenizer for the model.
      • Create ONNX Model: Sets up the ONNX model for inference.
      • Cache Chat Template: Caches the chat template for future use.
  5. Callback from onnx to cortex-cpp:

    • Once the model is loaded and ready, onnx sends a callback to cortex-cpp to indicate the completion of the model loading process.
  6. HTTP Response from cortex-cpp to cortex-js:

    • cortex-cpp sends an HTTP response back to cortex-js, indicating that the model has been successfully loaded and is ready for use.
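
In code, steps 3 through 5 correspond to a LoadModel call along these lines. This is a minimal sketch; the JSON field names are assumptions for illustration:

    // Hypothetical LoadModel invocation; field names are assumptions.
    // 'engine' is an EngineI* as in the earlier sketches.
    auto body = std::make_shared<Json::Value>();
    (*body)["model"] = "phi-3-mini-onnx";                 // assumed field
    (*body)["model_path"] = "C:/models/phi-3-mini-onnx";  // assumed field

    engine->LoadModel(body, [](Json::Value&& status, Json::Value&& result) {
      // Invoked once the tokenizer, ONNX model, and chat template are ready.
    });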

Stream Inference

The following sequence describes the interaction between cortex-js, cortex-cpp, and onnx when the onnx engine serves the chat completions endpoint with the stream inference option:

  1. HTTP Request from cortex-js to cortex-cpp:

    • cortex-js sends an HTTP request to cortex-cpp for chat completion.
  2. Request Chat Completion from cortex-cpp to onnx:

    • cortex-cpp forwards the request to onnx to process the chat completion.
  3. Chat Processing in onnx:

    • onnx performs the following tasks:
      • Apply Chat Template: Applies the chat template.
      • Encode: Encodes the input data.
      • Set Search Options: Configures search options for inference.
      • Create Generator: Creates a generator for token generation.
  4. Token Generation in onnx:

    • onnx executes the following steps in a loop to generate the response (see the onnxruntime-genai sketch after this list):
      • Compute Logits: Computes the logits.
      • Generate Next Token: Generates the next token.
      • Decode New Token: Decodes the newly generated token.
  5. Callback from onnx to cortex-cpp:

    • Once a token is generated, onnx sends a callback to cortex-cpp.
  6. HTTP Stream Response from cortex-cpp to cortex-js:

    • cortex-cpp streams the response back to cortex-js as the tokens are generated.
  7. Wait for Done in cortex-js:

    • cortex-js waits until the entire response is received and the process is completed.
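
Steps 3 and 4 map closely onto the onnxruntime-genai C++ API. The sketch below follows the upstream onnxruntime-genai examples; exact names can vary between versions, so treat it as an approximation of what the engine does rather than a copy of onnx_engine.cc:

    // Streaming generation sketch using the onnxruntime-genai C++ API,
    // modeled on upstream examples; names may differ across versions.
    #include "ort_genai.h"
    #include <iostream>

    void StreamChat(const char* model_dir, const char* templated_prompt) {
      auto model = OgaModel::Create(model_dir);
      auto tokenizer = OgaTokenizer::Create(*model);
      auto stream = OgaTokenizerStream::Create(*tokenizer);

      // Encode: turn the templated prompt into input token sequences.
      auto sequences = OgaSequences::Create();
      tokenizer->Encode(templated_prompt, *sequences);

      // Set Search Options and Create Generator.
      auto params = OgaGeneratorParams::Create(*model);
      params->SetSearchOption("max_length", 1024);
      params->SetInputSequences(*sequences);
      auto generator = OgaGenerator::Create(*model, *params);

      // Token generation loop: Compute Logits, Generate Next Token,
      // Decode New Token; each decoded piece would feed the callback.
      while (!generator->IsDone()) {
        generator->ComputeLogits();
        generator->GenerateNextToken();
        const int32_t* seq = generator->GetSequenceData(0);
        size_t len = generator->GetSequenceCount(0);
        std::cout << stream->Decode(seq[len - 1]) << std::flush;
      }
    }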

Non-stream Inference

The following sequence describes the interaction between cortex-js, cortex-cpp, and onnx when the onnx engine serves the chat completions endpoint with the non-stream inference option:

  1. HTTP Request from cortex-js to cortex-cpp:

    • cortex-js sends an HTTP request to cortex-cpp for chat completion.
  2. Request Chat Completion from cortex-cpp to onnx:

    • cortex-cpp forwards the request to onnx to process the chat completion.
  3. Chat Processing in onnx:

    • onnx performs the following tasks:
      • Apply Chat Template: Applies the chat template.
      • Encode: Encodes the input data.
      • Set Search Options: Configures search options for inference.
      • Create Generator: Creates a generator to process the request.
  4. Output Generation in onnx:

    • onnx executes the following steps to generate the response (see the sketch after this list):
      • Generate Output: Generates the output based on the processed data.
      • Decode Output: Decodes the generated output.
  5. Callback from onnx to cortex-cpp:

    • Once the output is generated and ready, onnx sends a callback to cortex-cpp to indicate the completion of the chat completion process.
  6. HTTP Response from cortex-cpp to cortex-js:

    • cortex-cpp sends an HTTP response back to cortex-js, providing the generated output.
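
The non-stream path collapses the token loop into a single generate-then-decode call, roughly as below. This is again modeled on upstream onnxruntime-genai examples; model, tokenizer, and sequences are set up as in the streaming sketch, and names are version-dependent:

    // Non-stream generation sketch (onnxruntime-genai C++ API, modeled on
    // upstream examples; model/tokenizer/sequences are set up as in the
    // streaming sketch above).
    auto params = OgaGeneratorParams::Create(*model);
    params->SetSearchOption("max_length", 1024);
    params->SetInputSequences(*sequences);

    // Generate Output: runs the whole decode loop internally.
    auto output = model->Generate(*params);

    // Decode Output: convert the full sequence back to text in one call.
    auto text = tokenizer->Decode(output->SequenceData(0),
                                  output->SequenceCount(0));
    std::cout << static_cast<const char*>(text) << std::endl;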

Code Structure


.
├── base                            # Engine interface definition
│   └── cortex-common               # Common interfaces used for all engines
│       └── enginei.h               # Abstract classes and interface methods for engines
├── examples                        # Server example to integrate the engine
│   └── server.cc                   # Example server demonstrating engine integration
├── onnxruntime-genai               # Submodule from the upstream onnxruntime-genai repository
│   └── (files from upstream onnxruntime-genai)
├── src                             # Source implementation of the onnx engine
│   ├── chat_completion_request.h   # OpenAI-compatible request handling
│   ├── onnx_engine.h               # Model loading and inference for the onnx engine
│   └── onnx_engine.cc
└── third-party                     # Dependencies of the onnx project
    └── (list of third-party dependencies)