model.yaml
🚧 Cortex.cpp is currently under development. Our documentation outlines the intended behavior of Cortex, which may not yet be fully implemented in the codebase.
Cortex.cpp uses a `model.yaml` file to specify the configuration for running a model. Models can be downloaded from the Cortex Model Hub or Hugging Face repositories. Once downloaded, the model data is parsed and stored in two locations:

- `/cortexcpp/models/<model_id>/<model>.yml`: Contains the original model data.
- `/cortexcpp/models/<model>.yaml`: Manages model settings for Cortex.cpp.
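As an illustration, for a model whose `model_id` is `openhermes-2.5` (the model used in the examples below), the layout would look roughly like this. The exact filenames depend on the downloaded model; this sketch only substitutes the placeholders above:

```
/cortexcpp/models/openhermes-2.5/openhermes-2.5.yml   # original model data
/cortexcpp/models/openhermes-2.5.yaml                 # model settings managed by Cortex.cpp
```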
High-Level Structure
Here is an example of the `model.yaml` format:

```yaml
# Cortex Meta
name:
model:
version:

# Engine / Model Settings
engine:
ngl:
ctx_len:
prompt_template:

# Results Preferences
stop:
max_tokens:
stream:
```
The `model.yaml` file is composed of three high-level sections:
Cortex Meta

```yaml
# Cortex Meta
name: openhermes-2.5
model: openhermes-2.5:7B
version: 1
```
Cortex Meta consists of essential metadata that identifies the model within Cortex.cpp. The required parameters include:
Parameter | Description |
---|---|
name | The identifier name of the model, used as the `model_id`. |
model | Details specifying the variant of the model, including size or quantization. |
version | The version number of the model. |
Engine / Model Settings

```yaml
# Engine / Model Settings
engine: llamacpp
ngl: 33
ctx_len: 4096
prompt_template: "<|im_start|>system\n{system_message}<|im_end|>\n<|im_start|>user\n{prompt}<|im_end|>\n<|im_start|>assistant\n"
```
Engine/Model Settings include the options that control how Cortex.cpp runs the model. The required parameters include:
Parameter | Description |
---|---|
engine | Specifies the engine to be used for model execution. |
ngl | Number of model layers to offload to the GPU. |
ctx_len | Context length (maximum number of tokens). |
prompt_template | Template for formatting the prompt, including system messages and instructions. |
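As a concrete illustration, the `prompt_template` above expands into the following text once `{system_message}` and `{prompt}` are substituted (the values here are made up for the example):

```
<|im_start|>system
You are a helpful assistant.<|im_end|>
<|im_start|>user
Hello!<|im_end|>
<|im_start|>assistant
```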
Result Preferences

```yaml
# Results Preferences
stop:
  - </s>
max_tokens: 4096
stream: true
```
Result Preferences define how the results will be produced. The required parameters include:
Parameter | Description |
---|---|
max_tokens | Maximum number of tokens in the output. |
stream | Enables or disables streaming mode for the output (true or false). |
stop | Specifies the stopping condition for the model, which can be a word, a letter, or a specific text. |
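The `stop` parameter can list more than one stop sequence. A minimal, illustrative variation of the Results Preferences section is shown below; the extra `<|im_end|>` stop string is an assumption based on the ChatML prompt template used above, not a required setting:

```yaml
# Results Preferences
stop:
  - </s>
  - <|im_end|>
max_tokens: 4096
stream: true
```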
Model Formats

The `model.yaml` parameters vary for each supported model format.
GGUF

Example of `model.yaml` for the GGUF format:

```yaml
name: openhermes-2.5
model: openhermes-2.5:7B
version: 1

# Engine / Model Settings
engine: llamacpp
ngl: 33 # Infer from base config.json -> num_attention_heads
ctx_len: 4096 # Infer from base config.json -> max_position_embeddings
prompt_template: "<|im_start|>system\n{system_message}<|im_end|>\n<|im_start|>user\n{prompt}<|im_end|>\n<|im_start|>assistant\n"

# Results Preferences
stop:
  - </s>
top_p: 0.95
temperature: 0.7
frequency_penalty: 0
presence_penalty: 0
max_tokens: 4096 # Infer from base config.json -> max_position_embeddings
stream: true # true | false
```
Model Parameters
Parameter | Description | Required |
---|---|---|
top_p | The cumulative probability threshold for token sampling. | No |
temperature | Controls the randomness of predictions by scaling logits before applying softmax. | No |
frequency_penalty | Penalizes new tokens based on their existing frequency in the sequence so far. | No |
presence_penalty | Penalizes new tokens based on whether they appear in the sequence so far. | No |
max_tokens | Maximum number of tokens in the output. | Yes |
stream | Enables or disables streaming mode for the output (true or false). | Yes |
ngl | Number of model layers to offload to the GPU. | Yes |
ctx_len | Context length (maximum number of tokens). | Yes |
engine | Specifies the engine to be used for model execution. | Yes |
prompt_template | Template for formatting the prompt, including system messages and instructions. | Yes |
stop | Specifies the stopping condition for the model, which can be a word, a letter, or a specific text. | Yes |
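Putting the table together, a sketch of a GGUF `model.yaml` that keeps only the parameters marked as required above might look like this (values copied from the example; an illustration, not an official template):

```yaml
# Cortex Meta
name: openhermes-2.5
model: openhermes-2.5:7B
version: 1

# Engine / Model Settings
engine: llamacpp
ngl: 33
ctx_len: 4096
prompt_template: "<|im_start|>system\n{system_message}<|im_end|>\n<|im_start|>user\n{prompt}<|im_end|>\n<|im_start|>assistant\n"

# Results Preferences
stop:
  - </s>
max_tokens: 4096
stream: true
```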
ONNX

Example of `model.yaml` for the ONNX format:

```yaml
name: openhermes-2.5
model: openhermes
version: 1

# Engine / Model Settings
engine: onnx
prompt_template: "<|im_start|>system\n{system_message}<|im_end|>\n<|im_start|>user\n{prompt}<|im_end|>\n<|im_start|>assistant\n"

# Results Preferences
top_p: 1.0
temperature: 1.0
frequency_penalty: 0
presence_penalty: 0
max_tokens: 2048
stream: true # true | false
```
Model Parameters
Parameter | Description | Required |
---|---|---|
top_p | The cumulative probability threshold for token sampling. | No |
temperature | Controls the randomness of predictions by scaling logits before applying softmax. | No |
frequency_penalty | Penalizes new tokens based on their existing frequency in the sequence so far. | No |
presence_penalty | Penalizes new tokens based on whether they appear in the sequence so far. | No |
stop | Specifies the stopping condition for the model, which can be a word, a letter, or a specific text. | No |
max_tokens | Maximum number of tokens in the output. | Yes |
stream | Enables or disables streaming mode for the output (true or false). | Yes |
ngl | Number of model layers to offload to the GPU. | Yes |
ctx_len | Context length (maximum number of tokens). | Yes |
engine | Specifies the engine to be used for model execution. | Yes |
prompt_template | Template for formatting the prompt, including system messages and instructions. | Yes |
TensorRT-LLM

Example of `model.yaml` for the TensorRT-LLM format:

```yaml
name: Openhermes-2.5 7b Linux Ada
model: openhermes-2.5:7B-tensorrt-llm
version: 1

# Engine / Model Settings
engine: tensorrt-llm
os: linux
gpu_arch: ada
quantization_method: awq
precision: int4
tp: 1
trtllm_version: 0.9.0
ctx_len: 2048 # Infer from base config.json -> max_position_embeddings
text_model: false
prompt_template: "<|im_start|>system\n{system_message}<|im_end|>\n<|im_start|>user\n{prompt}<|im_end|>\n<|im_start|>assistant\n"

# Results Preferences
temperature: 0.7
max_tokens: 2048
stream: true # true | false
```
Model Parameters
Parameter | Description | Required |
---|---|---|
top_p | The cumulative probability threshold for token sampling. | No |
temperature | Controls the randomness of predictions by scaling logits before applying softmax. | No |
frequency_penalty | Penalizes new tokens based on their existing frequency in the sequence so far. | No |
presence_penalty | Penalizes new tokens based on whether they appear in the sequence so far. | No |
stop | Specifies the stopping condition for the model, which can be a word, a letter, or a specific text. | No |
max_tokens | Maximum number of tokens in the output. | Yes |
stream | Enables or disables streaming mode for the output (true or false). | Yes |
engine | Specifies the engine to be used for model execution. | Yes |
os | Operating system used. | Yes |
gpu_arch | GPU architecture used. | Yes |
quantization_method | Method used for quantization. | Yes |
precision | Precision level used. | Yes |
tp | Number of tensor parallelism partitions. | Yes |
trtllm_version | Version of TensorRT-LLM being used. | Yes |
ctx_len | Context length (maximum number of tokens). | Yes |
text_model | Indicates if the text model is being used (true or false). | Yes |
prompt_template | Template for formatting the prompt, including system messages and instructions. | Yes |
You can download all the supported model formats from the following: