ONNX

warning

🚧 Cortex.cpp is currently under development. Our documentation outlines the intended behavior of Cortex, which may not yet be fully implemented in the codebase.

Introduction

Cortex.onnx is a C++ inference library for Windows built on onnxruntime-genai, using DirectML for hardware acceleration. DirectML is a high-performance DirectX 12 library for machine learning that provides GPU acceleration across a wide range of hardware and drivers, including AMD, Intel, NVIDIA, and Qualcomm GPUs. Cortex.onnx integrates onnxruntime-genai for inference tasks and occasionally contributes changes upstream.

info

The current valid combinations of executor and precision are:

  • FP32 CPU
  • FP32 CUDA
  • FP16 CUDA
  • FP16 DML
  • INT4 CPU
  • INT4 CUDA
  • INT4 DML

Usage


cortex engines onnx init

The command checks for, downloads, and installs the following dependencies for Windows:


- engine.dll
- D3D12Core.dll
- DirectML.dll
- onnxruntime.rel.dll
- onnxruntime-genai.dll
- MSVC runtime libraries:
  - msvcp140.dll
  - vcruntime140.dll
  - vcruntime140_1.dll

info

To include onnx in your own server implementation, follow the server example in examples/server.cc (see Code Structure below).
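
As a rough, hedged illustration of what such an integration involves, the sketch below loads the engine as a dynamic library and resolves a factory function. The exported symbol name get_engine is an assumption about how the engine library is exposed; treat examples/server.cc as the authoritative reference.

    // Minimal sketch of loading the onnx engine dynamically on Windows.
    // The exported factory symbol (get_engine) is an assumption; see
    // examples/server.cc for the real integration.
    #include <windows.h>
    #include <iostream>

    class EngineI;  // Abstract engine interface declared in enginei.h

    int main() {
      HMODULE lib = LoadLibraryA("engine.dll");
      if (!lib) {
        std::cerr << "Failed to load engine.dll" << std::endl;
        return 1;
      }
      using GetEngineFn = EngineI* (*)();
      auto get_engine =
          reinterpret_cast<GetEngineFn>(GetProcAddress(lib, "get_engine"));
      if (!get_engine) {
        std::cerr << "get_engine symbol not found" << std::endl;
        return 1;
      }
      EngineI* engine = get_engine();  // Ready for LoadModel, chat, etc.
      (void)engine;
      return 0;
    }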

Get ONNX Models

You can download precompiled ONNX models from the Cortex Hub on Hugging Face. These models include configurations, tokenizers, and dependencies tailored for optimal performance with the onnx engine.

Interface

onnx provides the following interfaces:

  • HandleChatCompletion: Processes chat completion tasks.

    void HandleChatCompletion(
        std::shared_ptr<Json::Value> json_body,
        std::function<void(Json::Value&&, Json::Value&&)>&& callback);

  • LoadModel: Loads a model based on the specifications.

    void LoadModel(
        std::shared_ptr<Json::Value> json_body,
        std::function<void(Json::Value&&, Json::Value&&)>&& callback);

  • UnloadModel: Unloads a model as specified.

    void UnloadModel(
        std::shared_ptr<Json::Value> json_body,
        std::function<void(Json::Value&&, Json::Value&&)>&& callback);

  • GetModelStatus: Retrieves the status of a model.

    void GetModelStatus(
        std::shared_ptr<Json::Value> json_body,
        std::function<void(Json::Value&&, Json::Value&&)>&& callback);

All the interfaces above take the following parameters:

  Parameter    Description
  -----------  -------------------------------------
  json_body    The request content, in JSON format.
  callback     A function that handles the response.
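
As a hedged illustration of this calling convention, a component holding an engine instance might invoke HandleChatCompletion as sketched below. The JSON field names (model, messages, stream) and the callback argument order (status first, payload second) are assumptions for the example, not the documented schema:

    // Hypothetical caller-side sketch. The JSON field names and the
    // callback argument order (status, response) are assumptions.
    #include <json/json.h>
    #include <functional>
    #include <iostream>
    #include <memory>

    void CallChatCompletion(EngineI* engine) {  // EngineI from enginei.h
      auto body = std::make_shared<Json::Value>();
      (*body)["model"] = "my-onnx-model";  // assumed field name
      (*body)["stream"] = false;           // assumed field name
      (*body)["messages"][0]["role"] = "user";
      (*body)["messages"][0]["content"] = "Hello!";

      engine->HandleChatCompletion(
          body, [](Json::Value&& status, Json::Value&& response) {
            std::cout << response.toStyledString() << std::endl;
          });
    }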

Architecture

Main Components

These are the main components that interact to provide an API for inference tasks using the onnxruntime-genai library:

  • cortex-cpp: Handles API requests and responses.
  • enginei: The abstract engine interface for inference (sketched below).
  • onnx: Exposes the engine's APIs through the engine interface so that other components can use its features.
  • onnx_engine: Exposes APIs for inference; it loads and unloads models and simplifies API calls to onnxruntime_genai.
  • onnxruntime_genai: A submodule of the onnxruntime_genai repository that provides the core inference functionality.
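
Based on the four interfaces listed earlier, enginei.h can be pictured roughly as the abstract class below. This is a reconstruction from the signatures above, not the verbatim header:

    // Sketch of the engine interface in enginei.h, reconstructed from
    // the four interface signatures shown earlier (not the verbatim file).
    #include <json/json.h>
    #include <functional>
    #include <memory>

    class EngineI {
     public:
      virtual ~EngineI() {}

      virtual void HandleChatCompletion(
          std::shared_ptr<Json::Value> json_body,
          std::function<void(Json::Value&&, Json::Value&&)>&& callback) = 0;

      virtual void LoadModel(
          std::shared_ptr<Json::Value> json_body,
          std::function<void(Json::Value&&, Json::Value&&)>&& callback) = 0;

      virtual void UnloadModel(
          std::shared_ptr<Json::Value> json_body,
          std::function<void(Json::Value&&, Json::Value&&)>&& callback) = 0;

      virtual void GetModelStatus(
          std::shared_ptr<Json::Value> json_body,
          std::function<void(Json::Value&&, Json::Value&&)>&& callback) = 0;
    };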

Communication Protocols

Load a Model

The following sequence describes the interaction between three components, cortex-js, cortex-cpp, and onnx, when a model is loaded with the onnx engine:

  1. HTTP Request from cortex-js to cortex-cpp:

    • cortex-js sends an HTTP request to cortex-cpp to load a model.
  2. Engine Loading in cortex-cpp:

    • Upon receiving the HTTP request, cortex-cpp initiates the loading of the engine.
  3. Model Loading from cortex-cpp to onnx:

    • cortex-cpp then requests onnx to load the model (see the code sketch after this list).
  4. Model Preparation in onnx:

    • onnx performs the following tasks:
      • Create Tokenizer: Initializes a tokenizer for the model.
      • Create ONNX Model: Sets up the ONNX model for inference.
      • Cache Chat Template: Caches the chat template for future use.
  5. Callback from onnx to cortex-cpp:

    • Once the model is loaded and ready, onnx sends a callback to cortex-cpp to indicate the completion of the model loading process.
  6. HTTP Response from cortex-cpp to cortex-js:

    • cortex-cpp sends an HTTP response back to cortex-js, indicating that the model has been successfully loaded and is ready for use.
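
In code, steps 3 through 5 correspond to a LoadModel call along these lines. This is a minimal sketch; the JSON field names are assumptions for illustration:

    // Hypothetical LoadModel invocation; field names are assumptions.
    // 'engine' is an EngineI* as in the earlier sketches.
    auto body = std::make_shared<Json::Value>();
    (*body)["model"] = "phi-3-mini-onnx";                 // assumed field
    (*body)["model_path"] = "C:/models/phi-3-mini-onnx";  // assumed field

    engine->LoadModel(body, [](Json::Value&& status, Json::Value&& result) {
      // Invoked once the tokenizer, ONNX model, and chat template are ready.
    });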

Stream Inference

The following sequence describes the interaction between cortex-js, cortex-cpp, and onnx when the onnx engine serves the chat completions endpoint with the stream inference option:

  1. HTTP Request from cortex-js to cortex-cpp:

    • cortex-js sends an HTTP request to cortex-cpp for chat completion.
  2. Request Chat Completion from cortex-cpp to onnx:

    • cortex-cpp forwards the request to onnx to process the chat completion.
  3. Chat Processing in onnx:

    • onnx performs the following tasks:
      • Apply Chat Template: Applies the chat template.
      • Encode: Encodes the input data.
      • Set Search Options: Configures search options for inference.
      • Create Generator: Creates a generator for token generation.
  4. Token Generation in onnx:

    • onnx executes the following steps in a loop to generate the response (see the onnxruntime-genai sketch after this list):
      • Compute Logits: Computes the logits.
      • Generate Next Token: Generates the next token.
      • Decode New Token: Decodes the newly generated token.
  5. Callback from onnx to cortex-cpp:

    • Once a token is generated, onnx sends a callback to cortex-cpp.
  6. HTTP Stream Response from cortex-cpp to cortex-js:

    • cortex-cpp streams the response back to cortex-js as the tokens are generated.
  7. Wait for Done in cortex-js:

    • cortex-js waits until the entire response is received and the process is completed.
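
Steps 3 and 4 map closely onto the onnxruntime-genai C++ API. The sketch below follows the upstream onnxruntime-genai examples; exact names can vary between versions, so treat it as an approximation of what the engine does rather than a copy of onnx_engine.cc:

    // Streaming generation sketch using the onnxruntime-genai C++ API,
    // modeled on upstream examples; names may differ across versions.
    #include "ort_genai.h"
    #include <iostream>

    void StreamChat(const char* model_dir, const char* templated_prompt) {
      auto model = OgaModel::Create(model_dir);
      auto tokenizer = OgaTokenizer::Create(*model);
      auto stream = OgaTokenizerStream::Create(*tokenizer);

      // Encode: turn the templated prompt into input token sequences.
      auto sequences = OgaSequences::Create();
      tokenizer->Encode(templated_prompt, *sequences);

      // Set Search Options and Create Generator.
      auto params = OgaGeneratorParams::Create(*model);
      params->SetSearchOption("max_length", 1024);
      params->SetInputSequences(*sequences);
      auto generator = OgaGenerator::Create(*model, *params);

      // Token generation loop: Compute Logits, Generate Next Token,
      // Decode New Token; each decoded piece would feed the callback.
      while (!generator->IsDone()) {
        generator->ComputeLogits();
        generator->GenerateNextToken();
        const int32_t* seq = generator->GetSequenceData(0);
        size_t len = generator->GetSequenceCount(0);
        std::cout << stream->Decode(seq[len - 1]) << std::flush;
      }
    }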

Non-stream Inference

The following sequence describes the interaction between cortex-js, cortex-cpp, and onnx when the onnx engine serves the chat completions endpoint with the non-stream inference option:

  1. HTTP Request from cortex-js to cortex-cpp:

    • cortex-js sends an HTTP request to cortex-cpp for chat completion.
  2. Request Chat Completion from cortex-cpp to onnx:

    • cortex-cpp forwards the request to onnx to process the chat completion.
  3. Chat Processing in onnx:

    • onnx performs the following tasks:
      • Apply Chat Template: Applies the chat template.
      • Encode: Encodes the input data.
      • Set Search Options: Configures search options for inference.
      • Create Generator: Creates a generator to process the request.
  4. Output Generation in onnx:

    • onnx executes the following steps to generate the response (see the sketch after this list):
      • Generate Output: Generates the output based on the processed data.
      • Decode Output: Decodes the generated output.
  5. Callback from onnx to cortex-cpp:

    • Once the output is generated and ready, onnx sends a callback to cortex-cpp to indicate the completion of the chat completion process.
  6. HTTP Response from cortex-cpp to cortex-js:

    • cortex-cpp sends an HTTP response back to cortex-js, providing the generated output.
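
The non-stream path collapses the token loop into a single generate-then-decode call, roughly as below. This is again modeled on upstream onnxruntime-genai examples; model, tokenizer, and sequences are set up as in the streaming sketch, and names are version-dependent:

    // Non-stream generation sketch (onnxruntime-genai C++ API, modeled on
    // upstream examples; model/tokenizer/sequences are set up as in the
    // streaming sketch above).
    auto params = OgaGeneratorParams::Create(*model);
    params->SetSearchOption("max_length", 1024);
    params->SetInputSequences(*sequences);

    // Generate Output: runs the whole decode loop internally.
    auto output = model->Generate(*params);

    // Decode Output: convert the full sequence back to text in one call.
    auto text = tokenizer->Decode(output->SequenceData(0),
                                  output->SequenceCount(0));
    std::cout << static_cast<const char*>(text) << std::endl;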

Code Structure


.
├── base                            # Engine interface definition
│   └── cortex-common               # Common interfaces used for all engines
│       └── enginei.h               # Abstract classes and interface methods for engines
├── examples                        # Server example to integrate the engine
│   └── server.cc                   # Example server demonstrating engine integration
├── onnxruntime-genai               # Submodule from the upstream onnxruntime-genai repository
│   └── (files from upstream onnxruntime-genai)
├── src                             # Source implementation of the onnx engine
│   ├── chat_completion_request.h   # OpenAI-compatible request handling
│   ├── onnx_engine.h               # Model loading and inference for the onnx engine
│   └── onnx_engine.cc
└── third-party                     # Dependencies of the onnx project
    └── (list of third-party dependencies)