MLC LLM brings effortless language model deployment

Emre Çitak
May 8, 2023

MLC LLM is a new open-source project that aims to enable the deployment of large language models on various hardware platforms and applications. The project includes a framework to optimize model performance for each specific use case, and its mission is to allow anyone to develop, optimize and deploy AI models natively on their devices, without relying on server support. This article will delve into MLC LLM and its capabilities.

At the core of MLC LLM lies a technique called machine learning compilation (MLC). MLC combines machine learning programming abstractions, learning-driven search, compilation, and an optimized library runtime for ease of deployment. The approach is designed to optimize model performance for each specific use case, which is vital when deploying large language models across a wide range of hardware platforms.
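To give a feel for what searching for a fast program variant means, here is a deliberately small sketch. It is not TVM's or MLC LLM's API, and it swaps the learning-driven search described above for a plain measure-and-pick loop: every candidate is timed and the fastest one wins. The kernel, the `block` parameter, and the candidate list are all hypothetical and exist only for illustration.

```python
import time


def blocked_sum_of_squares(data, block):
    """Toy kernel: sum of squares computed in chunks of `block` elements."""
    total = 0
    for start in range(0, len(data), block):
        chunk = data[start:start + block]
        total += sum(x * x for x in chunk)
    return total


def tune(kernel, data, candidates, repeats=3):
    """Time every candidate block size and return (fastest, all timings)."""
    timings = {}
    for block in candidates:
        best = float("inf")
        for _ in range(repeats):
            t0 = time.perf_counter()
            kernel(data, block)
            best = min(best, time.perf_counter() - t0)
        timings[block] = best  # keep the best of several runs to reduce noise
    fastest = min(timings, key=timings.get)
    return fastest, timings


data = list(range(10_000))
candidates = [16, 64, 256, 1024]  # hypothetical search space
best_block, timings = tune(blocked_sum_of_squares, data, candidates)
print("fastest block size on this machine:", best_block)
```

Every candidate computes the same result, so only speed differs, and the winner depends on the machine running the search. A real compiler searches a far larger space and uses a learned cost model to decide which candidates are worth measuring at all.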

Support for heterogeneous hardware specifications

Deploying large language models across different hardware platforms and applications is a complex challenge, and it is the problem MLC LLM sets out to solve. The project has to support heterogeneous hardware, including different models of CPUs, GPUs, and other co-processors and accelerators, while also coping with memory constraints and variation between OS environments.
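The memory constraint can be made concrete with some back-of-the-envelope arithmetic. The sketch below estimates how much memory a model's weights alone would need at a given number of bits per weight, and compares that with a device budget. The parameter count, bit widths, and 6 GiB budget are assumptions chosen for illustration, not figures from the project, and the estimate ignores activations and any runtime overhead.

```python
def weight_memory_gib(num_params, bits_per_weight):
    """Approximate memory needed for the weights alone, in GiB."""
    return num_params * bits_per_weight / 8 / 2**30


def fits(num_params, bits_per_weight, budget_gib):
    """True if the weights alone fit within the given memory budget."""
    return weight_memory_gib(num_params, bits_per_weight) <= budget_gib


params = 7_000_000_000  # hypothetical 7-billion-parameter model
budget_gib = 6.0        # hypothetical memory budget of a device

for bits in (16, 8, 4):
    size = weight_memory_gib(params, bits)
    verdict = "fits" if fits(params, bits, budget_gib) else "does not fit"
    print(f"{bits}-bit weights: {size:.2f} GiB, {verdict} in {budget_gib} GiB")
```

Under these assumptions, the same model can be out of reach for a device at one precision and within reach at another, which is one reason deployment has to be tailored to each target.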

MLC LLM aims to bring AI technologies to different devices (Image: MLC LLM)

Leveraging existing open-source projects

To achieve its goals, MLC LLM is based on Apache TVM Unity, a compiler stack for deep learning systems, and leverages tokenizers from Hugging Face and Google, as well as open-source LLMs such as Llama, Vicuna, Dolly, and others. The project includes both a C++ CLI tool and an iOS chat app showcasing how to integrate the compiled artifacts and the required pre/post-processing.

MLC LLM can be deployed on a range of hardware, including recent Apple Silicon, AMD GPUs, NVIDIA GPUs, and the Intel UHD Graphics 630 GPU. Performance varies significantly across supported hardware: some NVIDIA GPUs, the AMD RX 6800 with 16 GB of VRAM, and the 2021 MacBook Pro with M1 Max exceed 20 tokens/second, while the M1 iPad Pro reaches 10.6 tokens/second and the iPhone 14 Pro 7.2 tokens/second.
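To put those throughput figures in perspective, the short sketch below converts tokens per second into the time needed to generate a reply. The 200-token reply length is an assumption chosen for illustration, and 20 tokens/second is used as the lower bound for the faster group of devices.

```python
def seconds_to_generate(num_tokens, tokens_per_second):
    """Time in seconds to generate `num_tokens` at a steady rate."""
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    return num_tokens / tokens_per_second


# Rates quoted in the article; 20.0 is the lower bound for the fastest group.
rates = {
    "20 tokens/second threshold": 20.0,
    "M1 iPad Pro": 10.6,
    "iPhone 14 Pro": 7.2,
}
reply_tokens = 200  # assumed reply length, for illustration only

for device, rate in rates.items():
    print(f"{device}: {seconds_to_generate(reply_tokens, rate):.1f} s")
```

At these rates, a reply that takes about 10 seconds on the faster hardware takes roughly 19 seconds on the iPad and 28 seconds on the iPhone.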

According to the project maintainers, MLC LLM makes it possible to run quick experiments, try out compiler optimizations, and eventually deploy to the desired targets easily. The project also has a companion project, WebLLM, that focuses on web browsers. If you're interested in learning more about MLC, you can check out the official documentation, which guides you through the key abstractions used to represent machine learning programs, automatic optimization techniques, and how to optimize for dependencies, memory, and performance.

MLC LLM's source code is available on the project's GitHub page.

