llama.cpp

warning

🚧 Cortex.cpp is currently under development. Our documentation outlines the intended behavior of Cortex, which may not yet be fully implemented in the codebase.

info

llamacpp was formerly called "Nitro".

Introduction​

llamacpp is a C++ inference library that any server can load at runtime. It submodules (and occasionally upstreams changes to) llama.cpp for GGUF inference; llama.cpp runs on both CPU and GPU and is optimized for inference workloads.

In addition to llama.cpp, llamacpp adds:

  • Model orchestration, such as model warm-up and running concurrent models.
warning

llamacpp is bundled by default in our products, Jan and Cortex.

Usage​


cortex engines llama.cpp init

The command will check, download, and install these dependencies:


- engine.dll
- CUDA 11.7:
  - cublas64_11.dll
  - cublasLt64_11.dll
  - cudart64_110.dll
- CUDA 12.2:
  - cublas64_12.dll
  - cublasLt64_12.dll
  - cudart64_12.dll
  - cudnn_ops_infer64_8.dll
  - cudnn64_8.dll
- CUDA 12.4:
  - cublas64_12.dll
  - cublasLt64_12.dll
  - cudart64_12.dll
  - nvrtc64_120_0.dll
- MSBuild libraries:
  - msvcp140.dll
  - vcruntime140.dll
  - vcruntime140_1.dll

info

To include llamacpp in your own server implementation, follow the steps here.
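
To give a picture of what that integration looks like, the sketch below shows a host server loading the engine as a shared library at runtime. The factory symbol name (get_engine) and the POSIX dlopen calls are assumptions for illustration only; the linked steps describe the actual loading mechanism.

// Minimal sketch of loading the llamacpp engine from a host server.
// Assumptions: the engine library exports a "get_engine" factory and the
// EngineI interface comes from cortex-common/enginei.h (see Code Structure).
#include <dlfcn.h>   // on Windows, use LoadLibraryA/GetProcAddress instead
#include <iostream>

#include "cortex-common/enginei.h"

int main() {
  // Load the engine shared library at runtime (engine.dll / libengine.so).
  void* handle = dlopen("./libengine.so", RTLD_LAZY);
  if (!handle) {
    std::cerr << "failed to load engine: " << dlerror() << '\n';
    return 1;
  }

  // Resolve a factory function that returns the engine interface.
  using EngineFactory = EngineI* (*)();
  auto create_engine =
      reinterpret_cast<EngineFactory>(dlsym(handle, "get_engine"));
  if (!create_engine) {
    std::cerr << "factory symbol not found\n";
    return 1;
  }

  EngineI* engine = create_engine();
  // The host server can now call LoadModel, HandleChatCompletion, and the
  // other interfaces documented below.
  (void)engine;
  return 0;
}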

Get GGUF Models​

You can download precompiled models from the Cortex Hub on Hugging Face. These models include configurations, tokenizers, and dependencies tailored for optimal performance with this engine.

Interface​

llamacpp exposes the following interfaces:

  • HandleChatCompletion: Processes chat completion tasks.

    void HandleChatCompletion(
        std::shared_ptr<Json::Value> jsonBody,
        std::function<void(Json::Value&&, Json::Value&&)>&& callback);

  • HandleEmbedding: Generates embeddings for the input data provided.

    void HandleEmbedding(
        std::shared_ptr<Json::Value> jsonBody,
        std::function<void(Json::Value&&, Json::Value&&)>&& callback);

  • LoadModel: Loads a model based on the specifications.

    void LoadModel(
        std::shared_ptr<Json::Value> jsonBody,
        std::function<void(Json::Value&&, Json::Value&&)>&& callback);

  • UnloadModel: Unloads a model as specified.

    void UnloadModel(
        std::shared_ptr<Json::Value> jsonBody,
        std::function<void(Json::Value&&, Json::Value&&)>&& callback);

  • GetModelStatus: Retrieves the status of a model.

    void GetModelStatus(
        std::shared_ptr<Json::Value> jsonBody,
        std::function<void(Json::Value&&, Json::Value&&)>&& callback);

All the interfaces above accept the following parameters:

| Parameter | Description |
| --------- | ----------- |
| jsonBody  | The request content in JSON format. |
| callback  | A function that handles the response. |
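
The snippet below sketches how a caller might drive two of these interfaces with jsoncpp. The JSON field names ("model_path", "messages") and the model path are illustrative assumptions; consult the engine's request schema for the exact keys.

// Hypothetical call sequence against the interfaces above (field names are
// assumptions, not the engine's confirmed schema).
#include <iostream>
#include <memory>

#include <json/json.h>

#include "cortex-common/enginei.h"

void RunChat(EngineI* engine) {
  // 1. Load a GGUF model.
  auto load_body = std::make_shared<Json::Value>();
  (*load_body)["model_path"] = "/models/example.gguf";  // illustrative path
  engine->LoadModel(load_body, [](Json::Value&& status, Json::Value&& result) {
    std::cout << "load status: " << status.toStyledString();
  });

  // 2. Send an OpenAI-style chat completion request.
  auto chat_body = std::make_shared<Json::Value>();
  Json::Value message;
  message["role"] = "user";
  message["content"] = "Hello!";
  (*chat_body)["messages"].append(message);
  engine->HandleChatCompletion(
      chat_body, [](Json::Value&& status, Json::Value&& result) {
        // The callback receives a status object plus the response payload.
        std::cout << result.toStyledString();
      });
}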

Architecture​

Main Components​

llamacpp is architected with several key components:

  • enginei: An engine interface definition that all engines implement; it handles endpoint logic and facilitates communication between cortex.cpp and the llama engine (a simplified sketch follows this list).
  • llama engine: Exposes APIs for embedding and inference. It loads and unloads models and simplifies API calls to llama.cpp.
  • llama.cpp: Submodule from the llama.cpp repository that provides the core functionality for embeddings and inferences.
  • llama server context: A wrapper that offers a more straightforward, user-friendly interface to the llama.cpp APIs.
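
Based on the methods listed in the Interface section, a simplified view of what enginei.h might declare looks like the following; the real header may contain additional methods and helpers.

// Simplified sketch of the engine interface (pure virtual methods only);
// the actual enginei.h may declare more.
#pragma once

#include <functional>
#include <memory>

#include <json/json.h>

class EngineI {
 public:
  virtual ~EngineI() = default;

  virtual void HandleChatCompletion(
      std::shared_ptr<Json::Value> jsonBody,
      std::function<void(Json::Value&&, Json::Value&&)>&& callback) = 0;

  virtual void HandleEmbedding(
      std::shared_ptr<Json::Value> jsonBody,
      std::function<void(Json::Value&&, Json::Value&&)>&& callback) = 0;

  virtual void LoadModel(
      std::shared_ptr<Json::Value> jsonBody,
      std::function<void(Json::Value&&, Json::Value&&)>&& callback) = 0;

  virtual void UnloadModel(
      std::shared_ptr<Json::Value> jsonBody,
      std::function<void(Json::Value&&, Json::Value&&)>&& callback) = 0;

  virtual void GetModelStatus(
      std::shared_ptr<Json::Value> jsonBody,
      std::function<void(Json::Value&&, Json::Value&&)>&& callback) = 0;
};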

Communication Protocols​

The diagram above illustrates how the llamacpp communication protocols work:

  • Streaming: Responses are processed and returned one token at a time.
  • RESTful: The response is processed as a whole. After the llama server context completes the entire request, a single result is returned to cortex.cpp (see the sketch after this list).
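
A hedged sketch of how the two paths might surface through the same callback is shown below; the status fields (is_stream, is_done) and the helper functions are illustrative assumptions, not the engine's actual schema.

#include <memory>

#include <json/json.h>

#include "cortex-common/enginei.h"

// Hypothetical helpers owned by the host server (names are illustrative).
void ForwardChunkToClient(const Json::Value& chunk);
void CloseClientStream();
void SendFullResponse(const Json::Value& full);

void Respond(EngineI* engine, std::shared_ptr<Json::Value> body) {
  engine->HandleChatCompletion(
      std::move(body), [](Json::Value&& status, Json::Value&& data) {
        if (status.get("is_stream", false).asBool()) {
          // Streaming: the callback fires once per generated token (chunk);
          // forward each piece to the client as soon as it arrives.
          ForwardChunkToClient(data);
          if (status.get("is_done", false).asBool()) CloseClientStream();
        } else {
          // RESTful: the callback fires exactly once, after the llama server
          // context has produced the complete response.
          SendFullResponse(data);
        }
      });
}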

Code Structure​


.
├── base                              # Engine interface definition
│   └── cortex-common                 # Common interfaces used for all engines
│       └── enginei.h                 # Defines abstract classes and interface methods for engines
├── examples                          # Server example to integrate engine
│   └── server.cc                     # Example server demonstrating engine integration
├── llama.cpp                         # Upstream llama.cpp repository
│   └── (files from upstream llama.cpp)
├── src                               # Source implementation for llama.cpp
│   ├── chat_completion_request.h     # OpenAI-compatible request handling
│   ├── llama_client_slot             # Manages the vector of slots for parallel processing
│   ├── llama_engine                  # Implementation of the llamacpp engine for model loading and inference
│   └── llama_server_context          # Context management for chat completion requests
│       ├── slot                      # Struct for slot management
│       ├── llama_context             # Struct for llama context management
│       ├── chat_completion           # Struct for chat completion management
│       └── embedding                 # Struct for embedding management
└── third-party                       # Dependencies of the llamacpp project
    └── (list of third-party dependencies)
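
The slot entries mentioned above hold per-request state so the engine can serve several requests in parallel. The struct below is purely illustrative, inferred from the layout; the real field names in llama_client_slot differ.

#include <string>
#include <vector>

// Illustrative per-request slot (not the actual llama_client_slot definition).
struct Slot {
  int id = -1;                  // index into the server context's slot vector
  bool available = true;        // true when the slot can accept a new request
  std::string prompt;           // prompt currently being processed
  std::string generated_text;   // text decoded so far for this request
};

// The llama server context keeps one slot per concurrent request.
std::vector<Slot> slots;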

Roadmap​

The future plans for llamacpp are focused on enhancing performance and expanding capabilities. Key areas of improvement include:

  • Performance Enhancements: Optimizing speed and reducing memory usage to ensure efficient processing of tasks.
  • Multimodal Model Compatibility: Expanding support to include a variety of multimodal models, enabling a broader range of applications and use cases.
info

To follow the latest developments of llamacpp, please see the GitHub Repository.