model.yaml

warning

🚧 Cortex.cpp is currently under development. Our documentation outlines the intended behavior of Cortex, which may not yet be fully implemented in the codebase.

Cortex.cpp uses a model.yaml file to specify the configuration for running a model. Models can be downloaded from the Cortex Model Hub or Hugging Face repositories. Once downloaded, the model data is parsed and stored in two locations:

  • /cortexcpp/models/<model_id>/<model>.yml: Contains the original model data.
  • /cortexcpp/models/<model>.yaml: Manages model settings for Cortex.cpp.
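For illustration, the following Python sketch (a hypothetical helper, not a Cortex.cpp API) builds the two locations above for a given model; the data root is taken literally from the paths shown:


from pathlib import Path

# Hypothetical helper that mirrors the two storage locations described above.
def model_config_paths(data_root: str, model_id: str, model_file: str):
    models_dir = Path(data_root) / "models"
    original = models_dir / model_id / f"{model_file}.yml"  # original model data
    managed = models_dir / f"{model_file}.yaml"             # settings managed by Cortex.cpp
    return original, managed

original, managed = model_config_paths("/cortexcpp", "openhermes-2.5", "openhermes-2.5")
print(original)  # /cortexcpp/models/openhermes-2.5/openhermes-2.5.yml
print(managed)   # /cortexcpp/models/openhermes-2.5.yaml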

model.yaml High-Level Structure

Here is an example of the model.yaml format:


# Cortex Meta
name:
model:
version:
# Engine / Model Settings
engine:
ngl:
ctx_len:
prompt_template:
# Results Preferences
stop:
max_tokens:
stream:

The model.yaml is composed of three high-level sections:

Cortex Meta


# Cortex Meta
name: openhermes-2.5
model: openhermes-2.5:7B
version: 1

Cortex Meta consists of essential metadata that identifies the model within Cortex.cpp. The required parameters include:

| Parameter | Description |
|-----------|-------------|
| name | The identifier name of the model, used as the model_id. |
| model | Details specifying the variant of the model, including size or quantization. |
| version | The version number of the model. |
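As a quick illustration, a model.yaml can be read with any YAML parser to check that these fields are present. The sketch below uses PyYAML and a hypothetical file name; it is not part of Cortex.cpp:


import yaml  # pip install pyyaml

# Illustrative only: load a model.yaml and check the Cortex Meta fields.
with open("openhermes-2.5.yaml") as f:  # hypothetical path
    config = yaml.safe_load(f)

missing = [key for key in ("name", "model", "version") if key not in config]
if missing:
    raise ValueError(f"model.yaml is missing Cortex Meta fields: {missing}")

print(config["name"])     # used as the model_id, e.g. openhermes-2.5
print(config["model"])    # model variant, e.g. openhermes-2.5:7B
print(config["version"])  # e.g. 1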

Engine / Model Settings


# Engine / Model Settings
engine: llamacpp
ngl: 33
ctx_len: 4096
prompt_template: "<|im_start|>system\n{system_message}<|im_end|>\n<|im_start|>user\n{prompt}<|im_end|>\n<|im_start|>assistant\n"

Engine/Model Settings include the options that control how Cortex.cpp runs the model. The required parameters include:

| Parameter | Description |
|-----------|-------------|
| engine | Specifies the engine to be used for model execution. |
| ngl | Number of model layers to offload to the GPU. |
| ctx_len | Context length (maximum number of tokens). |
| prompt_template | Template for formatting the prompt, including system messages and instructions. |
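To make the prompt_template parameter concrete, the sketch below shows how the {system_message} and {prompt} placeholders are typically substituted before the text reaches the model. The render_prompt helper is illustrative, not a Cortex.cpp function:


# Illustrative sketch of how the placeholders in prompt_template are filled in.
prompt_template = (
    "<|im_start|>system\n{system_message}<|im_end|>\n"
    "<|im_start|>user\n{prompt}<|im_end|>\n"
    "<|im_start|>assistant\n"
)

def render_prompt(template: str, system_message: str, prompt: str) -> str:
    # str.replace is used instead of str.format so braces in user input stay safe.
    return (template
            .replace("{system_message}", system_message)
            .replace("{prompt}", prompt))

print(render_prompt(prompt_template, "You are a helpful assistant.", "What is Cortex.cpp?"))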

Results Preferences


# Results Preferences
stop:
- </s>
max_tokens: 4096
stream: true

Results Preferences define how the model's output is produced. The required parameters include:

| Parameter | Description |
|-----------|-------------|
| max_tokens | Maximum number of tokens in the output. |
| stream | Enables or disables streaming mode for the output (true or false). |
| stop | Specifies the stopping condition for the model, which can be a word, a letter, or a specific text. |
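The sketch below is a conceptual illustration (not Cortex.cpp internals) of how these three preferences interact during generation: tokens are emitted one by one when stream is true, output is capped at max_tokens, and generation ends as soon as a stop sequence appears:


def generate(tokens, stop, max_tokens, stream):
    # Conceptual only: iterate over already-decoded tokens to show the
    # effect of the three Results Preferences.
    output = ""
    for i, token in enumerate(tokens):
        if i >= max_tokens:              # max_tokens: hard cap on output length
            break
        output += token
        for s in stop:
            if output.endswith(s):       # stop: cut generation at the stop text
                return output[: -len(s)]
        if stream:                       # stream: emit tokens as they are produced
            print(token, end="", flush=True)
    return output

result = generate(["Hello", ",", " world", "</s>"], stop=["</s>"], max_tokens=4096, stream=True)
print()
print(result)  # "Hello, world"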

Model Formats

The model.yaml parameters vary across the supported model formats.

GGUF

Example of a model.yaml for the GGUF format:


name: openhermes-2.5
model: openhermes-2.5:7B
version: 1
# Engine / Model Settings
engine: llamacpp
ngl: 33 # Infer from base config.json -> num_attention_heads
ctx_len: 4096 # Infer from base config.json -> max_position_embeddings
prompt_template: "<|im_start|>system\n{system_message}<|im_end|>\n<|im_start|>user\n{prompt}<|im_end|>\n<|im_start|>assistant\n"
# Results Preferences
stop:
- </s>
top_p: 0.95
temperature: 0.7
frequency_penalty: 0
presence_penalty: 0
max_tokens: 4096 # Infer from base config.json -> max_position_embeddings
stream: true # true | false

Model Parameters

| Parameter | Description | Required |
|-----------|-------------|----------|
| top_p | The cumulative probability threshold for token sampling. | No |
| temperature | Controls the randomness of predictions by scaling logits before applying softmax. | No |
| frequency_penalty | Penalizes new tokens based on their existing frequency in the sequence so far. | No |
| presence_penalty | Penalizes new tokens based on whether they appear in the sequence so far. | No |
| max_tokens | Maximum number of tokens in the output. | Yes |
| stream | Enables or disables streaming mode for the output (true or false). | Yes |
| ngl | Number of model layers to offload to the GPU. | Yes |
| ctx_len | Context length (maximum number of tokens). | Yes |
| engine | Specifies the engine to be used for model execution. | Yes |
| prompt_template | Template for formatting the prompt, including system messages and instructions. | Yes |
| stop | Specifies the stopping condition for the model, which can be a word, a letter, or a specific text. | Yes |
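A hedged validation sketch for a GGUF model.yaml follows. The required list mirrors the table above plus the Cortex Meta fields, and the fallback values for the optional sampling parameters are simply copied from the example config, not documented Cortex.cpp defaults:


import yaml  # pip install pyyaml

REQUIRED = ["name", "model", "version", "engine", "ngl", "ctx_len",
            "prompt_template", "stop", "max_tokens", "stream"]
# Fallbacks copied from the example above; illustrative, not official defaults.
OPTIONAL_DEFAULTS = {"top_p": 0.95, "temperature": 0.7,
                     "frequency_penalty": 0, "presence_penalty": 0}

with open("openhermes-2.5.yaml") as f:  # hypothetical path
    config = yaml.safe_load(f)

missing = [key for key in REQUIRED if key not in config]
if missing:
    raise ValueError(f"missing required GGUF parameters: {missing}")

for key, value in OPTIONAL_DEFAULTS.items():
    config.setdefault(key, value)  # fill in missing optional sampling parameters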

ONNX

Example of a model.yaml for the ONNX format:


name: openhermes-2.5
model: openhermes
version: 1
# Engine / Model Settings
engine: onnx
prompt_template: "<|im_start|>system\n{system_message}<|im_end|>\n<|im_start|>user\n{prompt}<|im_end|>\n<|im_start|>assistant\n"
# Results Preferences
top_p: 1.0
temperature: 1.0
frequency_penalty: 0
presence_penalty: 0
max_tokens: 2048
stream: true # true | false

Model Parameters

| Parameter | Description | Required |
|-----------|-------------|----------|
| top_p | The cumulative probability threshold for token sampling. | No |
| temperature | Controls the randomness of predictions by scaling logits before applying softmax. | No |
| frequency_penalty | Penalizes new tokens based on their existing frequency in the sequence so far. | No |
| presence_penalty | Penalizes new tokens based on whether they appear in the sequence so far. | No |
| stop | Specifies the stopping condition for the model, which can be a word, a letter, or a specific text. | No |
| max_tokens | Maximum number of tokens in the output. | Yes |
| stream | Enables or disables streaming mode for the output (true or false). | Yes |
| ngl | Number of model layers to offload to the GPU. | Yes |
| ctx_len | Context length (maximum number of tokens). | Yes |
| engine | Specifies the engine to be used for model execution. | Yes |
| prompt_template | Template for formatting the prompt, including system messages and instructions. | Yes |
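To see how the settings shift between engines, a small comparison sketch is shown below; it simply diffs the keys of two model.yaml files, for example the GGUF and ONNX examples above. The file names are hypothetical:


import yaml  # pip install pyyaml

# Illustrative: compare which settings two model.yaml files define.
with open("openhermes-gguf.yaml") as f:  # hypothetical path
    gguf = yaml.safe_load(f)
with open("openhermes-onnx.yaml") as f:  # hypothetical path
    onnx = yaml.safe_load(f)

print("only in the GGUF config:", sorted(set(gguf) - set(onnx)))
print("only in the ONNX config:", sorted(set(onnx) - set(gguf)))
print("shared settings:", sorted(set(gguf) & set(onnx)))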

TensorRT-LLM

Example of a model.yaml for the TensorRT-LLM format:


name: Openhermes-2.5 7b Linux Ada
model: openhermes-2.5:7B-tensorrt-llm
version: 1
# Engine / Model Settings
engine: tensorrt-llm
os: linux
gpu_arch: ada
quantization_method: awq
precision: int4
tp: 1
trtllm_version: 0.9.0
ctx_len: 2048 # Infer from base config.json -> max_position_embeddings
text_model: false
prompt_template: "<|im_start|>system\n{system_message}<|im_end|>\n<|im_start|>user\n{prompt}<|im_end|>\n<|im_start|>assistant\n"
# Results Preferences
temperature: 0.7
max_tokens: 2048
stream: true # true | false

Model Parameters

| Parameter | Description | Required |
|-----------|-------------|----------|
| top_p | The cumulative probability threshold for token sampling. | No |
| temperature | Controls the randomness of predictions by scaling logits before applying softmax. | No |
| frequency_penalty | Penalizes new tokens based on their existing frequency in the sequence so far. | No |
| presence_penalty | Penalizes new tokens based on whether they appear in the sequence so far. | No |
| stop | Specifies the stopping condition for the model, which can be a word, a letter, or a specific text. | No |
| max_tokens | Maximum number of tokens in the output. | Yes |
| stream | Enables or disables streaming mode for the output (true or false). | Yes |
| engine | Specifies the engine to be used for model execution. | Yes |
| os | Operating system used. | Yes |
| gpu_arch | GPU architecture used. | Yes |
| quantization_method | Method used for quantization. | Yes |
| precision | Precision level used. | Yes |
| tp | Number of tensor parallelism partitions. | Yes |
| trtllm_version | Version of TensorRT-LLM being used. | Yes |
| ctx_len | Context length (maximum number of tokens). | Yes |
| text_model | Indicates if the text model is being used (true or false). | Yes |
| prompt_template | Template for formatting the prompt, including system messages and instructions. | Yes |
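Because a prebuilt TensorRT-LLM engine is tied to the operating system, GPU architecture, precision, and TensorRT-LLM version it was built for, it can be useful to sanity-check these fields against the host before loading. The sketch below is an illustration under those assumptions, not Cortex.cpp behaviour, and uses a hypothetical file name:


import platform
import yaml  # pip install pyyaml

with open("openhermes-tensorrt.yaml") as f:  # hypothetical path
    config = yaml.safe_load(f)

host_os = platform.system().lower()  # "linux", "windows", or "darwin"
if config["os"] != host_os:
    raise RuntimeError(f"engine built for {config['os']!r} but host is {host_os!r}")

print("GPU architecture:", config["gpu_arch"])         # e.g. ada
print("quantization:", config["quantization_method"],  # e.g. awq
      "at precision:", config["precision"])            # e.g. int4
print("tensor parallel partitions:", config["tp"])     # e.g. 1
print("built with TensorRT-LLM", config["trtllm_version"])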
info

You can download models in all of the supported formats from the following: