MLC LLM brings effortless language model deployment
MLC LLM is a new open-source project that aims to enable the deployment of large language models on a variety of hardware platforms and applications. It includes a framework that optimizes model performance for each specific use case, and its mission is to allow anyone to develop, optimize, and deploy AI models natively on their own devices, without relying on server support. This article delves into MLC LLM and its capabilities.
At the core of MLC LLM lies a technique called machine learning compilation (MLC). MLC combines machine learning programming abstractions, learning-driven search, compilation, and an optimized library runtime for ease of deployment. The approach is designed to optimize model performance for each specific use case, which is vital when deploying large language models across a wide range of hardware platforms.
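To make the idea of retargetable compilation concrete, the sketch below uses Apache TVM's classic tensor-expression API to compile a trivial element-wise computation for a CPU back end. This is only an illustration of the general MLC workflow, not MLC LLM's actual pipeline, which builds on the newer TVM Unity/Relax stack.

```python
# Minimal sketch of retargetable compilation with Apache TVM's tensor-expression
# API. Illustrative only: MLC LLM's real pipeline uses TVM Unity/Relax.
import numpy as np
import tvm
from tvm import te

n = 1024
A = te.placeholder((n,), name="A", dtype="float32")
B = te.placeholder((n,), name="B", dtype="float32")
C = te.compute((n,), lambda i: A[i] + B[i], name="C")  # element-wise add

sched = te.create_schedule(C.op)
# The same program can be lowered to other back ends (e.g. "cuda", "metal",
# "vulkan") simply by changing the target string.
fadd = tvm.build(sched, [A, B, C], target="llvm")

dev = tvm.cpu(0)
a = tvm.nd.array(np.random.rand(n).astype("float32"), dev)
b = tvm.nd.array(np.random.rand(n).astype("float32"), dev)
c = tvm.nd.array(np.zeros(n, dtype="float32"), dev)
fadd(a, b, c)
np.testing.assert_allclose(c.numpy(), a.numpy() + b.numpy(), rtol=1e-5)
```

On top of this kind of retargetable lowering, learning-driven search explores candidate schedules (loop tiling, vectorization, threading) for each target automatically, rather than relying on hand-written kernels for every device.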
Support for heterogeneous hardware specifications
Deploying large language models across diverse hardware platforms and applications is a complex challenge, and this is where MLC LLM shines. The project has to support heterogeneous hardware specifications, including different models of CPUs, GPUs, and other co-processors and accelerators, while also coping with memory constraints and variation in OS environments.
Leveraging existing open-source projects
To achieve its goals, MLC LLM builds on Apache TVM Unity, a compiler stack for deep learning systems, and leverages tokenizers from Hugging Face and Google, as well as open-source LLMs such as Llama, Vicuna, and Dolly. The project includes both a C++ CLI tool and an iOS chat app that showcase how to integrate the compiled artifacts and the required pre/post-processing.
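As a rough illustration of the tokenization pre/post-processing mentioned above, the snippet below uses the Hugging Face transformers library. MLC LLM itself ships runtime bindings for Hugging Face and Google (SentencePiece) tokenizers; the gpt2 tokenizer here is just a stand-in for whichever tokenizer the deployed model actually requires.

```python
# Hedged illustration of tokenization pre/post-processing around a compiled model.
# The "gpt2" tokenizer is only a stand-in; a real deployment loads the tokenizer
# matching the compiled LLM (e.g. a Llama/Vicuna SentencePiece model).
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")

prompt = "What is machine learning compilation?"
token_ids = tokenizer.encode(prompt)   # pre-processing: text -> token ids fed to the model
text = tokenizer.decode(token_ids)     # post-processing: generated token ids -> text

print(token_ids)
print(text)
```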
MLC LLM can be deployed on a variety of hardware, including recent Apple Silicon, AMD GPUs, NVIDIA GPUs, and the Intel UHD Graphics 630 GPU. Performance varies significantly across supported hardware: some NVIDIA GPUs, the AMD Radeon RX 6800 (16 GB VRAM), and the 2021 MacBook Pro with the M1 Max score above 20 tokens/second, while the M1 iPad Pro reaches 10.6 tokens/second and the iPhone 14 Pro 7.2 tokens/second.
According to the project maintainers, MLC LLM makes it possible to run quick experiments, try out compiler optimizations, and eventually deploy to the desired targets with ease. The project has a companion effort focused on Web browsers, WebLLM. If you're interested in learning more about MLC, you can check out the official documentation, which walks through the key abstractions used to represent machine learning programs, automatic optimization techniques, and how to optimize for dependencies, memory, and performance.
Check out MLC LLM's GitHub page here.