LLM Compressor with Alauda AI

This document describes how to use the LLM Compressor integration with the Alauda AI platform to perform model compression workflows. The integration provides two example workflows: a data-free workflow, which quantizes a model without calibration data, and a calibration-based workflow, which uses a calibration dataset.


Supported Model Compression Workflows

On the Alauda AI platform, you can use the Workbench feature to run LLM Compressor on models stored in your model repository. The following sections outline the typical steps for compressing a model.

Create a Workbench

Follow the instructions in Create Workbench to create a new Workbench instance. Note that model compression is currently supported only within JupyterLab.

Create a Model Repository and Upload Models

Refer to Upload Models Using Notebook for detailed steps on creating a model repository and uploading your model files. The example notebooks in this guide use the TinyLlama-1.1B-Chat-v1.0 model.
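
If the example model is not already available to you, a minimal sketch like the following fetches it into your Workbench before you upload it. It assumes the huggingface_hub package is installed and that the Workbench can reach the Hugging Face Hub; the local directory name is illustrative.

```python
# Minimal sketch: download the example model locally, then upload it to your
# model repository following Upload Models Using Notebook.
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="TinyLlama/TinyLlama-1.1B-Chat-v1.0",
    local_dir="TinyLlama-1.1B-Chat-v1.0",  # illustrative local directory
)
```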

(Optional) Prepare and Upload a Dataset

NOTE

If you plan to use the data-free compressor notebook, you can skip this step.

To use the calibration compressor notebook, you must prepare and upload a calibration dataset. Prepare your dataset using the same process described in Upload Models Using Notebook. The example calibration notebook uses the ultrachat_200k dataset.
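
As a minimal sketch of this preparation (assuming the datasets package and Hugging Face Hub access), the following downloads a small calibration subset of ultrachat_200k to your workspace, ready to be uploaded with the same process. The sample count is an illustrative assumption, not a requirement.

```python
# Minimal sketch: pull a small calibration subset of ultrachat_200k.
from datasets import load_dataset

NUM_CALIBRATION_SAMPLES = 512  # illustrative; adjust to your needs

ds = load_dataset("HuggingFaceH4/ultrachat_200k", split="train_sft")
ds = ds.shuffle(seed=42).select(range(NUM_CALIBRATION_SAMPLES))
ds.save_to_disk("ultrachat_200k-calibration")  # directory to upload
```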

Clone Models and Datasets in JupyterLab

In the JupyterLab terminal, use git clone to download the model repository (and dataset, if applicable) to your workspace. The data-free compressor notebook does not require a dataset.
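
If you prefer to script this step from a notebook cell rather than the terminal, a minimal sketch follows. Both repository URLs are placeholders; copy the real clone URLs from the repository pages in Alauda AI.

```python
# Minimal sketch: clone the model (and, for the calibration workflow, the
# dataset) into the workspace. Both URLs below are placeholders.
import subprocess

MODEL_REPO_URL = "https://<registry>/<namespace>/TinyLlama-1.1B-Chat-v1.0.git"
DATASET_REPO_URL = "https://<registry>/<namespace>/ultrachat_200k.git"

subprocess.run(["git", "clone", MODEL_REPO_URL], check=True)
# Skip this clone when using the data-free compressor notebook.
subprocess.run(["git", "clone", DATASET_REPO_URL], check=True)
```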

Create and Run Compression Notebooks

Download the appropriate example notebook for your use case: the calibration compressor notebook if you are using a dataset, or the data-free compressor notebook otherwise. Create a new notebook (for example, compressor.ipynb) in JupyterLab and paste the contents of the example notebook into it. Run the cells to perform model compression.
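
The example notebooks are the authoritative source, but for orientation, the core of a data-free compression cell looks roughly like this minimal sketch. It assumes a recent llmcompressor release (older releases import oneshot from llmcompressor.transformers) and uses an FP8 dynamic quantization recipe, which needs no calibration data; paths and the scheme are illustrative.

```python
# Minimal sketch of data-free compression with LLM Compressor.
from transformers import AutoModelForCausalLM, AutoTokenizer
from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier

MODEL_DIR = "TinyLlama-1.1B-Chat-v1.0"   # local clone from the previous step
SAVE_DIR = MODEL_DIR + "-FP8-Dynamic"

model = AutoModelForCausalLM.from_pretrained(MODEL_DIR, torch_dtype="auto")
tokenizer = AutoTokenizer.from_pretrained(MODEL_DIR)

# FP8 dynamic quantization of all Linear layers, keeping lm_head in full
# precision; no calibration dataset is required for this scheme.
recipe = QuantizationModifier(targets="Linear", scheme="FP8_DYNAMIC", ignore=["lm_head"])

oneshot(model=model, recipe=recipe)

# Save in compressed-tensors format so vLLM can load it directly.
model.save_pretrained(SAVE_DIR, save_compressed=True)
tokenizer.save_pretrained(SAVE_DIR)
```

The calibration workflow has the same shape but additionally passes the prepared dataset and a calibration-based recipe to oneshot.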

Upload the Compressed Model to the Repository

Once compression is complete, upload the compressed model back to the model repository using the steps outlined in Upload Models Using Notebook.
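
If your model repository is Git-based (as in the clone step above), a minimal sketch of this upload is to commit and push the compressed output from the workspace; all directory names below are illustrative assumptions.

```python
# Minimal sketch: copy the compressed output into the cloned repository,
# then commit and push.
import shutil
import subprocess

REPO_DIR = "TinyLlama-1.1B-Chat-v1.0"  # cloned model repository
shutil.copytree("TinyLlama-1.1B-Chat-v1.0-FP8-Dynamic", f"{REPO_DIR}/FP8-Dynamic")

subprocess.run(["git", "-C", REPO_DIR, "add", "."], check=True)
subprocess.run(["git", "-C", REPO_DIR, "commit", "-m", "Add FP8-compressed model"], check=True)
subprocess.run(["git", "-C", REPO_DIR, "push"], check=True)
```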

Deploy and Use the Compressed Model for Inference

Quantized and sparse models created with LLM Compressor are saved using the compressed-tensors library (an extension of Safetensors), with a compression format that matches the model's quantization or sparsity type. These formats are natively supported by vLLM, so the Alauda AI Inference Server can serve the compressed model through optimized inference kernels. Follow the instructions in Create Inference Service to complete this step.
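
For a quick local check before creating the inference service, a minimal sketch of loading the compressed checkpoint with vLLM follows; it assumes vLLM is installed in the Workbench, and the model path is the output directory from the compression step.

```python
# Minimal sketch: smoke-test the compressed checkpoint with vLLM.
from vllm import LLM, SamplingParams

llm = LLM(model="TinyLlama-1.1B-Chat-v1.0-FP8-Dynamic")  # illustrative path
params = SamplingParams(temperature=0.7, max_tokens=64)
outputs = llm.generate(["What is model compression?"], params)
print(outputs[0].outputs[0].text)
```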