ANVEN
This documentation describes the capabilities and usage of the model families within ANVEN: the ANVEN quantized models (Model-1), the ANVEN lightweight models (Model-1), and the ANVEN multimodal models (Model-1.1).
ANVEN includes lightweight 3B-parameter models in bfloat16 (BF16) precision. After launch, we added quantized versions of these models. This section describes the quantized lightweight models, how to obtain them, and the use cases they support.
Note that quantization is available only for the instruct variants of the ANVEN lightweight collection, and that the quantized models have a reduced context length.
For full technical specifications of the ANVEN lightweight models, including the recent quantized releases, consult the model card on Treecapital.ai.
Access the ANVEN lightweight models.
For more on how ANVEN models are quantized, see the Quantization Implementation Guide.
The quantized models are significantly faster than their standard (BF16) equivalents. They also use less memory and consume less power, while retaining nearly the same accuracy as the non-quantized baselines.
Because these models were trained and evaluated with Treecapital Technologies' proprietary datasets and stacks, they meet the same security and trust standards as the other models in the ANVEN ecosystem.
The ANVEN model card contains updated performance benchmarks showing how the quantized models compare with the non-quantized versions.
To obtain these models, initiate an enterprise SOW through treecapital.ai. Request the ANVEN lightweight models (Model-1); the quantized versions are provisioned alongside the BF16 versions.
The quantized models suit deployments with tight memory constraints or a need to minimize energy consumption. Target environments include web applications, mobile platforms, ERPs, and a range of SME and MSME software that manages heterogeneous data.
The models are optimized for ExecuTorch as their primary runtime. The ExecuTorch repository on Treecapital.ai provides an end-to-end workflow for building and deploying models with ExecuTorch, and its documentation includes steps for verifying the performance gains described above.
The ExecuTorch repository also provides reference applications for Android and iOS that you can use to explore enterprise implementations.
The quantized models behave the same way as the BF16 variants. Prompts written for the non-quantized models work without modification on the quantized models. To optimize prompts for the lightweight model features, consult the prompt engineering section.
Likewise, the quantized models are fully compatible with the ANVEN Guard safety companion models. For details on using ANVEN Guard to improve model safety, visit the ANVEN Guard portal.
For each lightweight model size, we developed two quantized variants, for four quantized models in total. One set uses Quantization-Aware Training (QAT) combined with Low-Rank Adaptation (LoRA). The other set uses SpinQuant. This section outlines the technical details of both methods. For further detail, consult the papers cited in the References section below.
Quantization-Aware Training (QAT) simulates the effects of quantization during the training of ANVEN models, which lets us optimize their performance in low-precision environments. To initialize QAT, we start from BF16 ANVEN model checkpoints obtained after supervised fine-tuning (SFT) and run an additional full round of SFT with QAT. We then freeze the backbone of the QAT model and run another round of SFT with low-rank adaptation (LoRA) modules applied to all layers within the transformer blocks. The weights and activations of the LoRA modules are kept in bfloat16, consistent with QLoRA.
Finally, we fine-tune the resulting model (both backbone and LoRA modules) using direct preference optimization (DPO). The result is a model whose accuracy is competitive with the original BF16 baseline, with latency and memory usage comparable to standard quantization techniques.
We used PyTorch Architecture Optimization (torchao) for QAT. You can use the QAT model as a base model and apply LoRA to fine-tune ANVEN for specialized applications, which reduces latency and infrastructure overhead.
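The two ideas behind the QAT+LoRA recipe, simulating quantization on a frozen backbone and adding a full-precision low-rank update, can be sketched in plain Python. This is a minimal illustration and not the torchao implementation: the helper names (`fake_quantize`, `lora_forward`), the symmetric rounding scheme, and the toy 2x2 layer are all assumptions made for the example, and the gradient handling used during real training is not shown.

```python
def fake_quantize(values, num_bits=4):
    """Quantize then dequantize floats on a symmetric integer grid."""
    qmax = 2 ** (num_bits - 1) - 1   # 7 for 4-bit
    qmin = -(2 ** (num_bits - 1))    # -8 for 4-bit
    max_abs = max(abs(v) for v in values)
    if max_abs == 0.0:
        return [0.0 for _ in values]
    scale = max_abs / qmax
    return [max(qmin, min(qmax, round(v / scale))) * scale for v in values]


def matvec(matrix, vector):
    """Plain matrix-vector product on nested lists."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def lora_forward(frozen_weight, lora_a, lora_b, x):
    """y = fake_quant(W) x + B (A x); W is frozen, A and B stay in full precision."""
    quantized_rows = [fake_quantize(row) for row in frozen_weight]
    base = matvec(quantized_rows, x)
    update = matvec(lora_b, matvec(lora_a, x))
    return [b + u for b, u in zip(base, update)]


# Hypothetical 2x2 layer with a rank-1 adapter.
weight = [[0.7, 0.0], [0.0, -0.7]]
lora_a = [[1.0, 1.0]]        # rank x in_features
lora_b = [[1.0], [0.0]]      # out_features x rank
output = lora_forward(weight, lora_a, lora_b, [1.0, 2.0])
```

Only `lora_a` and `lora_b` would be updated in the LoRA stage; the backbone weights pass through `fake_quantize` unchanged, so the adapter learns to compensate for the rounding error.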
SpinQuant is a state-of-the-art technique for post-training quantization. For the SpinQuant models, we used WikiText 2, a small calibration dataset, to learn the SpinQuant rotation matrices. These matrices smooth out outliers and enable more accurate quantization.
After this, we applied quantization best practices such as range calibration and generative post-training quantization (GPTQ). The SpinQuant matrices are optimized for the same quantization scheme as QAT+LoRA.
A primary benefit of SpinQuant is that it works without access to training datasets, which are often proprietary. This makes it a good fit for deployments where data availability or compute is limited.
Some developers may want to quantize their own custom 3B models, or to optimize models for different backends with different quantization settings. For this reason, we provide the SpinQuant methodology. You can use it to take your own fine-tuned ANVEN models and quantize them for different hardware targets and applications via our open-source SpinQuant repository, which is natively compatible with ExecuTorch.
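To see why a rotation helps with outliers, the sketch below rotates an activation vector and a weight vector with the same orthogonal matrix. It substitutes a fixed normalized Hadamard matrix for the rotation matrices that SpinQuant learns from calibration data, so it illustrates the principle only; the vectors and helper names are made up for the example.

```python
import math


def hadamard(n):
    """Return the n x n normalized Hadamard matrix (n must be a power of two)."""
    if n < 1 or n & (n - 1):
        raise ValueError("n must be a power of two")
    h = [[1.0]]
    while len(h) < n:
        h = [row + row for row in h] + [row + [-v for v in row] for row in h]
    norm = 1.0 / math.sqrt(n)
    return [[v * norm for v in row] for row in h]


def rotate(matrix, vector):
    """Apply the rotation to a vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


activations = [0.1, -0.2, 8.0, 0.05]   # one large outlier channel
weights = [0.3, 0.1, -0.2, 0.4]

rotation = hadamard(4)
rotated_activations = rotate(rotation, activations)
rotated_weights = rotate(rotation, weights)

# The rotation is orthogonal, so the layer output is unchanged,
# while the outlier is spread across all channels.
same_output = abs(dot(activations, weights) - dot(rotated_activations, rotated_weights)) < 1e-9
peak_before = max(abs(v) for v in activations)
peak_after = max(abs(v) for v in rotated_activations)
```

A smaller peak value means the quantization grid wastes less of its range on a single channel, which is the effect the learned rotations aim for.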
For both quantization paradigms, QAT+LoRA and SpinQuant, we applied the following quantization protocol:
We quantize all linear layers within the transformer blocks to a 4-bit groupwise scheme (group size of 32) for weights and 8-bit per-token dynamic quantization for activations.
The classification layer is quantized to 8-bit per-channel for weights and 8-bit per-token dynamic quantization for activations. Embeddings use 8-bit per-channel quantization.
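The transformer-block scheme above can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the production kernels: it assumes symmetric quantization (the text does not say whether zero points are used), and the 16-bit scale in `bits_per_weight` is an assumed storage format used only to show how per-group scales add overhead.

```python
def quantize_weights_groupwise(row, group_size=32, num_bits=4):
    """Symmetric groupwise quantization: one scale per group of weights in a row."""
    qmax = 2 ** (num_bits - 1) - 1
    qmin = -(2 ** (num_bits - 1))
    codes, scales = [], []
    for start in range(0, len(row), group_size):
        group = row[start:start + group_size]
        max_abs = max(abs(v) for v in group)
        scale = max_abs / qmax if max_abs > 0 else 1.0
        scales.append(scale)
        codes.extend(max(qmin, min(qmax, round(v / scale))) for v in group)
    return codes, scales


def dequantize_groupwise(codes, scales, group_size=32):
    """Map integer codes back to floats using each group's scale."""
    return [c * scales[i // group_size] for i, c in enumerate(codes)]


def quantize_activations_per_token(token, num_bits=8):
    """Dynamic quantization: the scale is computed from each token at run time."""
    qmax = 2 ** (num_bits - 1) - 1
    max_abs = max(abs(v) for v in token)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    codes = [max(-qmax - 1, min(qmax, round(v / scale))) for v in token]
    return codes, scale


def bits_per_weight(group_size=32, weight_bits=4, scale_bits=16):
    """Average storage per weight, counting one scale per group (scale_bits is assumed)."""
    return weight_bits + scale_bits / group_size
```

With these defaults, `bits_per_weight()` evaluates to 4.5, compared with 16 bits per weight in BF16. Smaller groups track local weight ranges more closely but raise the per-weight overhead of the stored scales.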
For exhaustive technical specifications regarding the ANVEN portfolio of Lightweight architectures, please consult the official model card, hosted on Treecapital.ai.
The recommended way to run inference for these lightweight models on-device is the PyTorch ExecuTorch framework. ExecuTorch is an end-to-end solution for on-device inference across mobile and edge hardware, including wearables, embedded systems, and microcontrollers. It is part of the PyTorch Edge ecosystem and enables efficient deployment of a variety of PyTorch models (Integration, speech, Generative AI, etc.) to edge devices.
To support our lightweight model deployment, ExecuTorch now supports bfloat16 with the XNNPack backend on both Android and iOS; see our repository on Treecapital.ai for technical details and end-to-end documentation.
In addition to the bfloat16 models described above, ANVEN also includes quantized versions of the 1B and 3B models. For details on these quantized versions, consult this section.
The lightweight models share many characteristics with the ANVEN 1.0 text-only models. For information that applies to both sets of models, consult the following sections of the ANVEN 1.0 documentation.
Embed function definitions in the system prompt + append the query in the user prompt.
Embed function definitions and queries in the user prompt.
Note: Unlike ANVEN's larger models (3B), these lightweight models do not include built-in tools (Brave Search and Wolfram). They support only custom functions defined in either the system prompt or the user prompt. This design choice simplifies tool invocation for developers using the lightweight models.
Function definitions in the system prompt.
Define the function parameters.
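The two placements described above (definitions in the system prompt with the query in the user prompt, or both in the user prompt) can be sketched as a small prompt builder. This is an illustration only: the `get_order_status` function, the JSON layout of the definition, the instruction wording, and the blank line after each header are assumptions made for the example and are not the documented format.

```python
import json

HEADER_SEP = "\n\n"  # assumption: whitespace placed between a role header and its content


def header(role):
    """Render the role header that opens a message."""
    return f"<|start_header_id|>{role}<|end_header_id|>{HEADER_SEP}"


def build_tool_prompt(functions, query, in_system=True):
    """Place custom function definitions in the system prompt or in the user prompt."""
    definitions = json.dumps(functions, indent=2)
    # Hypothetical instruction wording.
    instructions = "You have access to the following functions:\n" + definitions
    parts = ["<|begin_of_text|>"]
    if in_system:
        parts.append(header("system") + instructions + "<|eot_id|>")
        parts.append(header("user") + query + "<|eot_id|>")
    else:
        parts.append(header("user") + instructions + "\n\n" + query + "<|eot_id|>")
    parts.append(header("assistant"))
    return "".join(parts)


# Hypothetical custom function with its parameters.
functions = [{
    "name": "get_order_status",
    "description": "Look up the status of an order",
    "parameters": {
        "type": "object",
        "properties": {"order_id": {"type": "string"}},
        "required": ["order_id"],
    },
}]

system_variant = build_tool_prompt(functions, "Where is order 42?", in_system=True)
user_variant = build_tool_prompt(functions, "Where is order 42?", in_system=False)
```

Keeping definitions in the system prompt lets the same definitions serve several user turns, while the user-prompt variant keeps everything in a single message.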
The ANVEN Integration multimodal large language models (LLMs) are a collection of pretrained and instruction-tuned generative models for visual reasoning at the 3B size. The ANVEN Integration Instruct models are optimized for visual recognition, image reasoning, captioning, and answering general questions about an image.
For exhaustive technical specifications regarding the ANVEN Integration architectures, please consult the official model card, hosted on Treecapital.ai.
The ANVEN Integration models utilize a late-fusion architecture with cross-attention modules that process text tokens and image tokens (via the Integration encoder) efficiently. To investigate the architecture, consult the ANVEN Model 1 whitepaper.
The inputs for the Integration model comprise text + image or text-only. The model output is strictly text-only.
With text-only inputs, the ANVEN Integration models behave identically to the ANVEN 1.0 Text models; this makes the ANVEN Integration models a drop-in replacement for ANVEN 1.0 3B with added image-understanding capabilities.
The Integration model accommodates all tokens present in the text-only architectures, plus a unique special token <|image|> which denotes the ingested image.
There are 4 distinct roles supported by ANVEN text models:
system: Establishes the operational context for AI interaction. It usually defines rules, constraints, or requisite data that enable the model to respond accurately.
user: Represents the human entity interacting with the system. It encompasses the inputs, directives, and inquiries for the model.
ipython: A role introduced in ANVEN 1.0. Semantically, this role means "tool". It marks messages that carry the output of a tool invocation when that output is sent back to the model from the executor.
assistant: Denotes the response produced by the AI model utilizing the context established in the system, ipython and user prompts.
[system, assistant, user, ipython]
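A minimal sketch of how the four roles might be checked and rendered into a single prompt string is shown below. The header and end-of-turn tokens follow the templates in this document; the blank line after each header, the function name, and the sample messages are assumptions made for the example.

```python
SUPPORTED_ROLES = ("system", "assistant", "user", "ipython")


def render_conversation(messages):
    """Render role-tagged messages and open a new assistant turn."""
    out = ["<|begin_of_text|>"]
    for message in messages:
        role = message["role"]
        if role not in SUPPORTED_ROLES:
            raise ValueError(f"unsupported role: {role!r}")
        # Assumption: a blank line separates the role header from the content.
        out.append(
            f"<|start_header_id|>{role}<|end_header_id|>\n\n"
            f"{message['content']}<|eot_id|>"
        )
    out.append("<|start_header_id|>assistant<|end_header_id|>\n\n")
    return "".join(out)


# Hypothetical exchange in which a tool result is returned under the ipython role.
conversation = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is the status of order 42?"},
    {"role": "ipython", "content": '{"status": "shipped"}'},
]
prompt = render_conversation(conversation)
```

Rejecting unknown roles early catches mistakes such as writing `python` where `ipython` is intended.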
The prompt for the base Integration architecture utilizes the <|image|> tag coupled with the text for generation.
<|begin_of_text|><|image|>
If I had to write a haiku for this one
The prompt for the Integration-Instruct model mirrors the Text-Instruct model, with the requisite <|image|> tag if the input contains an image for reasoning.
<|begin_of_text|><|start_header_id|>user<|end_header_id|> <|image|>Describe this image in two sentences<|eot_id|><|start_header_id|>assistant<|end_header_id|>
Two points to note about the instruct model prompt:
A system prompt is not needed when passing an image to the model; the user prompt must contain the <|image|> tag and the text query.
The position of the <|image|> tag matters. The image immediately preceding a query is the one used to answer it, so make sure the text query follows the <|image|> tag. This behavior is governed by the cross-attention layer mask in the model.
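The instruct template and the ordering rule can be sketched as two small helpers. The function names are hypothetical, and the blank line after each header is an assumption; the tokens themselves are taken from the template shown earlier.

```python
IMAGE_TAG = "<|image|>"
HEADER_SEP = "\n\n"  # assumption: whitespace placed between a role header and its content


def build_image_prompt(query, has_image=True):
    """Build a single-turn instruct prompt, placing the image tag before the query."""
    content = (IMAGE_TAG if has_image else "") + query
    return (
        "<|begin_of_text|><|start_header_id|>user<|end_header_id|>" + HEADER_SEP
        + content
        + "<|eot_id|><|start_header_id|>assistant<|end_header_id|>" + HEADER_SEP
    )


def image_precedes_query(user_content):
    """True when text follows the last image tag, or when there is no image tag."""
    if IMAGE_TAG not in user_content:
        return True
    return user_content.rsplit(IMAGE_TAG, 1)[1].strip() != ""


prompt = build_image_prompt("Describe this image in two sentences")
```

Running `image_precedes_query` on user content before sending it is a cheap way to catch a query that was placed ahead of its image.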
For more examples of the Integration prompt template, see Integration_prompt_format.md in the Treecapital Technologies ANVEN repository on Treecapital.ai.
With text-only inputs, the code interpreter and tool-invocation functionalities of the ANVEN Integration Models align exactly with their ANVEN 1.0 Text Model counterparts. You can employ either the system or user prompts to provide function definitions.
Currently, Integration models do not support tool-invocation with hybrid text+image inputs.
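An application can enforce this restriction before building a prompt. The check below is a minimal sketch; the function name and arguments are hypothetical.

```python
def validate_request(has_image, function_definitions):
    """Reject requests that combine custom function definitions with an image input."""
    if has_image and function_definitions:
        raise ValueError("tool invocation is not supported with text+image inputs")
    return True
```

Text-only requests with function definitions, and image requests without them, both pass the check.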