llama.cpp

warning

🚧 Cortex.cpp is currently under development. Our documentation outlines the intended behavior of Cortex, which may not yet be fully implemented in the codebase.

info

llamacpp was formerly called "Nitro".

Introduction​

llamacpp is a C++ inference library that any server can load at runtime. It submodules (and occasionally upstreams changes to) llama.cpp for GGUF inference; llama.cpp runs on both CPU and GPU and is optimized for inference workloads.

In addition to llama.cpp, llamacpp adds:

  • Model orchestration, such as model warm-up and running concurrent models.
warning

llamacpp is bundled by default in our products, Jan and Cortex.

Usage​


cortex engines llama.cpp init

The command will check, download, and install these dependencies:


- engine.dll
- CUDA 11.7:
  - cublas64_11.dll
  - cublasLt64_11.dll
  - cudart64_110.dll
- CUDA 12.2:
  - cublas64_12.dll
  - cublasLt64_12.dll
  - cudart64_12.dll
  - cudnn_ops_infer64_8.dll
  - cudnn64_8.dll
- CUDA 12.4:
  - cublas64_12.dll
  - cublasLt64_12.dll
  - cudart64_12.dll
  - nvrtc64_120_0.dll
- MSBuild libraries:
  - msvcp140.dll
  - vcruntime140.dll
  - vcruntime140_1.dll

info

To include llamacpp in your own server implementation, follow the steps here.
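
To give a picture of what that integration looks like, the sketch below shows a host server loading the engine as a shared library at runtime. The factory symbol name (get_engine) and the POSIX dlopen calls are assumptions for illustration only; the linked steps describe the actual loading mechanism.

// Minimal sketch of loading the llamacpp engine from a host server.
// Assumptions: the engine library exports a "get_engine" factory and the
// EngineI interface comes from cortex-common/enginei.h (see Code Structure).
#include <dlfcn.h>   // on Windows, use LoadLibraryA/GetProcAddress instead
#include <iostream>

#include "cortex-common/enginei.h"

int main() {
  // Load the engine shared library at runtime (engine.dll / libengine.so).
  void* handle = dlopen("./libengine.so", RTLD_LAZY);
  if (!handle) {
    std::cerr << "failed to load engine: " << dlerror() << '\n';
    return 1;
  }

  // Resolve a factory function that returns the engine interface.
  using EngineFactory = EngineI* (*)();
  auto create_engine =
      reinterpret_cast<EngineFactory>(dlsym(handle, "get_engine"));
  if (!create_engine) {
    std::cerr << "factory symbol not found\n";
    return 1;
  }

  EngineI* engine = create_engine();
  // The host server can now call LoadModel, HandleChatCompletion, and the
  // other interfaces documented below.
  (void)engine;
  return 0;
}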

Get GGUF Models​

You can download precompiled models from the Cortex Hub on Hugging Face. These models include configurations, tokenizers, and dependencies tailored for optimal performance with this engine.

Interface​

llamacpp exposes the following interfaces:

  • HandleChatCompletion: Processes chat completion tasks.

    void HandleChatCompletion(
        std::shared_ptr<Json::Value> jsonBody,
        std::function<void(Json::Value&&, Json::Value&&)>&& callback);

  • HandleEmbedding: Generates embeddings for the input data provided.

    void HandleEmbedding(
        std::shared_ptr<Json::Value> jsonBody,
        std::function<void(Json::Value&&, Json::Value&&)>&& callback);

  • LoadModel: Loads a model based on the specifications.

    void LoadModel(
        std::shared_ptr<Json::Value> jsonBody,
        std::function<void(Json::Value&&, Json::Value&&)>&& callback);

  • UnloadModel: Unloads a model as specified.

    void UnloadModel(
        std::shared_ptr<Json::Value> jsonBody,
        std::function<void(Json::Value&&, Json::Value&&)>&& callback);

  • GetModelStatus: Retrieves the status of a model.

    void GetModelStatus(
        std::shared_ptr<Json::Value> jsonBody,
        std::function<void(Json::Value&&, Json::Value&&)>&& callback);

All the interfaces above accept the following parameters:

| Parameter | Description |
| --------- | ----------- |
| jsonBody  | The request content in JSON format. |
| callback  | A function that handles the response. |
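
The snippet below sketches how a caller might drive two of these interfaces with jsoncpp. The JSON field names ("model_path", "messages") and the model path are illustrative assumptions; consult the engine's request schema for the exact keys.

// Hypothetical call sequence against the interfaces above (field names are
// assumptions, not the engine's confirmed schema).
#include <iostream>
#include <memory>

#include <json/json.h>

#include "cortex-common/enginei.h"

void RunChat(EngineI* engine) {
  // 1. Load a GGUF model.
  auto load_body = std::make_shared<Json::Value>();
  (*load_body)["model_path"] = "/models/example.gguf";  // illustrative path
  engine->LoadModel(load_body, [](Json::Value&& status, Json::Value&& result) {
    std::cout << "load status: " << status.toStyledString();
  });

  // 2. Send an OpenAI-style chat completion request.
  auto chat_body = std::make_shared<Json::Value>();
  Json::Value message;
  message["role"] = "user";
  message["content"] = "Hello!";
  (*chat_body)["messages"].append(message);
  engine->HandleChatCompletion(
      chat_body, [](Json::Value&& status, Json::Value&& result) {
        // The callback receives a status object plus the response payload.
        std::cout << result.toStyledString();
      });
}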

Architecture​

Main Components​

llamacpp is architected with several key components:

  • enginei: An engine interface definition that all engines implement; it handles endpoint logic and facilitates communication between cortex.cpp and the llama engine (a simplified sketch follows this list).
  • llama engine: Exposes APIs for embedding and inference. It loads and unloads models and simplifies API calls to llama.cpp.
  • llama.cpp: Submodule from the llama.cpp repository that provides the core functionality for embeddings and inferences.
  • llama server context: A wrapper that offers a more straightforward, user-friendly interface to the llama.cpp APIs.
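
Based on the methods listed in the Interface section, a simplified view of what enginei.h might declare looks like the following; the real header may contain additional methods and helpers.

// Simplified sketch of the engine interface (pure virtual methods only);
// the actual enginei.h may declare more.
#pragma once

#include <functional>
#include <memory>

#include <json/json.h>

class EngineI {
 public:
  virtual ~EngineI() = default;

  virtual void HandleChatCompletion(
      std::shared_ptr<Json::Value> jsonBody,
      std::function<void(Json::Value&&, Json::Value&&)>&& callback) = 0;

  virtual void HandleEmbedding(
      std::shared_ptr<Json::Value> jsonBody,
      std::function<void(Json::Value&&, Json::Value&&)>&& callback) = 0;

  virtual void LoadModel(
      std::shared_ptr<Json::Value> jsonBody,
      std::function<void(Json::Value&&, Json::Value&&)>&& callback) = 0;

  virtual void UnloadModel(
      std::shared_ptr<Json::Value> jsonBody,
      std::function<void(Json::Value&&, Json::Value&&)>&& callback) = 0;

  virtual void GetModelStatus(
      std::shared_ptr<Json::Value> jsonBody,
      std::function<void(Json::Value&&, Json::Value&&)>&& callback) = 0;
};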

Communication Protocols​

The diagram above illustrates how the llamacpp communication protocols work:

  • Streaming: Responses are processed and returned one token at a time.
  • RESTful: The response is processed as a whole. After the llama server context completes the entire request, a single result is returned to cortex.cpp (see the sketch after this list).
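
A hedged sketch of how the two paths might surface through the same callback is shown below; the status fields (is_stream, is_done) and the helper functions are illustrative assumptions, not the engine's actual schema.

#include <memory>

#include <json/json.h>

#include "cortex-common/enginei.h"

// Hypothetical helpers owned by the host server (names are illustrative).
void ForwardChunkToClient(const Json::Value& chunk);
void CloseClientStream();
void SendFullResponse(const Json::Value& full);

void Respond(EngineI* engine, std::shared_ptr<Json::Value> body) {
  engine->HandleChatCompletion(
      std::move(body), [](Json::Value&& status, Json::Value&& data) {
        if (status.get("is_stream", false).asBool()) {
          // Streaming: the callback fires once per generated token (chunk);
          // forward each piece to the client as soon as it arrives.
          ForwardChunkToClient(data);
          if (status.get("is_done", false).asBool()) CloseClientStream();
        } else {
          // RESTful: the callback fires exactly once, after the llama server
          // context has produced the complete response.
          SendFullResponse(data);
        }
      });
}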

Code Structure​


.
├── base                              # Engine interface definition
│   └── cortex-common                 # Common interfaces used for all engines
│       └── enginei.h                 # Defines abstract classes and interface methods for engines
├── examples                          # Server example to integrate engine
│   └── server.cc                     # Example server demonstrating engine integration
├── llama.cpp                         # Upstream llama.cpp repository
│   └── (files from upstream llama.cpp)
├── src                               # Source implementation for llama.cpp
│   ├── chat_completion_request.h     # OpenAI-compatible request handling
│   ├── llama_client_slot             # Manages the vector of slots for parallel processing
│   ├── llama_engine                  # Implementation of the llamacpp engine for model loading and inference
│   └── llama_server_context          # Context management for chat completion requests
│       ├── slot                      # Struct for slot management
│       ├── llama_context             # Struct for llama context management
│       ├── chat_completion           # Struct for chat completion management
│       └── embedding                 # Struct for embedding management
└── third-party                       # Dependencies of the llamacpp project
    └── (list of third-party dependencies)
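
The slot entries mentioned above hold per-request state so the engine can serve several requests in parallel. The struct below is purely illustrative, inferred from the layout; the real field names in llama_client_slot differ.

#include <string>
#include <vector>

// Illustrative per-request slot (not the actual llama_client_slot definition).
struct Slot {
  int id = -1;                  // index into the server context's slot vector
  bool available = true;        // true when the slot can accept a new request
  std::string prompt;           // prompt currently being processed
  std::string generated_text;   // text decoded so far for this request
};

// The llama server context keeps one slot per concurrent request.
std::vector<Slot> slots;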

Roadmap​

The future plans for llamacpp are focused on enhancing performance and expanding capabilities. Key areas of improvement include:

  • Performance Enhancements: Optimizing speed and reducing memory usage to ensure efficient processing of tasks.
  • Multimodal Model Compatibility: Expanding support to include a variety of multimodal models, enabling a broader range of applications and use cases.
info

To follow the latest developments of llamacpp, please see the GitHub Repository.