Web LLM brings browser-based revolution to AI

Emre Çitak
Apr 20, 2023
Misc

A team of developers is working to bring language model chats directly to web browsers, operating entirely in-browser with WebGPU acceleration. The project acknowledges advancements in generative AI and language model development, thanks to open-source contributions like LLaMA, Alpaca, Vicuna, and Dolly. The goal is to create open-source language models and personal AI assistants integrated into web browsers, harnessing client-side computing power.

Web LLM is a groundbreaking innovation in AI and web development, enabling fine-tuned models to run natively within browser tabs without server support. This local processing addresses privacy and security concerns, giving users control over their personal information and reducing the risk of data leaks, especially when the model is embedded in Chrome extensions or web applications.

The creators have opened a demo site for Web LLM, but at the time of writing the demo is down.

Overcoming challenges through optimization

Significant challenges include the absence of GPU-accelerated Python frameworks in client-side environments and the need to optimize memory usage and compress weights so that large language models fit in the browser. The project aims to develop a workflow for efficient language model development and optimization, built on a Python-first approach and universal deployment.

The project employs machine learning compilation (MLC) with Apache TVM Unity, utilizing native dynamic shape support to optimize the language model's IRModule without padding. TensorIR programs are transformed and optimized for deployment across various environments, including JavaScript for web deployment.
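As a rough illustration of what a TensorIR program with native dynamic shape support looks like, here is a minimal TVMScript sketch of a matrix multiplication whose sequence length is a symbolic variable, so no padding is required. It is written against recent TVM Unity APIs and is not taken from the Web LLM codebase; exact syntax can differ between TVM versions, and the shapes are illustrative.

import tvm
from tvm.script import tir as T

@tvm.script.ir_module
class DynamicMatmul:
    @T.prim_func
    def matmul(a: T.handle, b: T.handle, c: T.handle):
        # n is a symbolic sequence length, so the kernel works for any prompt size.
        n = T.int32()
        A = T.match_buffer(a, (n, 128), "float32")
        B = T.match_buffer(b, (128, 128), "float32")
        C = T.match_buffer(c, (n, 128), "float32")
        for i, j, k in T.grid(n, 128, 128):
            with T.block("matmul"):
                vi, vj, vk = T.axis.remap("SSR", [i, j, k])
                with T.init():
                    C[vi, vj] = T.float32(0)
                C[vi, vj] = C[vi, vj] + A[vi, vk] * B[vk, vj]

A TensorIR schedule can then transform such an IRModule for different targets, including WebGPU, which is the kind of deployment step the paragraph above describes.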

Utilizing int4 quantization techniques, the project compresses model weights and employs static memory planning optimizations to reuse memory across multiple layers. A WebAssembly port of the SentencePiece tokenizer handles tokenization, and all optimizations are carried out in Python, except for the JavaScript code that connects the components.
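The article does not spell out the exact quantization scheme, but the general idea of int4 weight compression can be sketched in a few lines of NumPy: weights are grouped, scaled into the signed 4-bit range, and two values are packed per byte. The function names and group size below are illustrative, not Web LLM's actual implementation.

import numpy as np

def quantize_int4(w, group_size=32):
    # Symmetric group-wise quantization: each group gets one scale,
    # values are rounded into [-8, 7] and two 4-bit codes share one byte.
    w = w.reshape(-1, group_size)
    scale = np.abs(w).max(axis=1, keepdims=True) / 7.0
    q = np.clip(np.round(w / scale), -8, 7).astype(np.int8)
    u = (q + 8).astype(np.uint8)                # shift to 0..15 for packing
    packed = (u[:, 0::2] | (u[:, 1::2] << 4)).astype(np.uint8)
    return packed, scale

def dequantize_int4(packed, scale, group_size=32):
    lo = (packed & 0x0F).astype(np.int8) - 8    # even positions
    hi = (packed >> 4).astype(np.int8) - 8      # odd positions
    q = np.empty((packed.shape[0], group_size), dtype=np.int8)
    q[:, 0::2], q[:, 1::2] = lo, hi
    return (q * scale).astype(np.float32)

w = np.random.randn(4096, 32).astype(np.float32)
packed, scale = quantize_int4(w)
w_hat = dequantize_int4(packed, scale).reshape(w.shape)
print("packed size:", packed.nbytes, "vs fp32 size:", w.nbytes)

Packing this way shrinks fp32 weight storage by roughly a factor of eight, before accounting for the per-group scales, which is what makes multi-gigabyte models plausible inside a browser tab.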

Web LLM aims to utilize Python's power to natively integrate language models into browsers - Image courtesy of Web LLM

The role of open-source ecosystems

The open-source ecosystem, specifically TVM Unity, facilitates a Python-centric development experience for optimizing and deploying language models on the web. TVM Unity's dynamic shape support addresses the dynamic nature of language models without padding, and tensor expressions enable partial-tensor computations.
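One way to read "partial-tensor computations" is an expression that only touches the valid part of a larger buffer, such as the first n rows of a fixed-size cache. The hedged sketch below declares such a tensor expression and lowers it into a TensorIR function with te.create_prim_func; the names and shapes are illustrative, and the exact API surface varies across TVM releases.

import tvm
from tvm import te

# A buffer with a fixed maximum length, as a cache might have...
cache = te.placeholder((2048, 128), name="cache", dtype="float32")
n = te.var("n")  # ...but only the first n rows are valid at runtime.

# The expression reads just those n rows: a partial-tensor computation
# over a dynamically sized slice, with no padding involved.
out = te.compute((n, 128), lambda i, j: cache[i, j] * 2.0, name="out")

# Lower the tensor expression into a TensorIR PrimFunc the pipeline can optimize further.
mod = tvm.IRModule({"scale_prefix": te.create_prim_func([cache, out])})
print(mod)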

You may check out the project's GitHub page via the link here.

Comparing WebGPU and native GPU runtimes reveals performance limitations in Chrome's current WebGPU implementation. Workarounds such as special launch flags can improve execution speed, and forthcoming features such as the fp16 extension promise substantial further gains.

How to install Web LLM in Chrome?

WebGPU recently debuted in Chrome and is currently in beta. The developers conduct their experiments in Chrome Canary, and you can also try the latest Chrome 113. Chrome versions 112 and below are not supported, and using them will result in an error related to WebGPU device initialization and limits. According to the creators, the tests were performed on Windows and Mac and require a GPU with approximately 6.4 GB of memory.

For Mac users with Apple devices, follow these instructions to run the chatbot demo locally in your browser:

  1. Install Chrome Canary, a developer version of Chrome that enables WebGPU usage
  2. Launch Chrome Canary. We recommend launching it from the terminal with this command (or replace Chrome Canary with Chrome):
    /Applications/Google\ Chrome\ Canary.app/Contents/MacOS/Google\ Chrome\ Canary --enable-dawn-features=disable_robustness
  3. This command disables the robustness check in Chrome Canary that slows down generation. While not mandatory, we strongly advise using this command to start Chrome

Enter your inputs and click "Send" to begin. The chatbot will first fetch model parameters into the local cache. The initial download may take a few minutes, but subsequent refreshes and runs will be faster.


Comments

  1. Ty said on April 20, 2023 at 11:16 pm

    We should also integrate an option for distributed computing, like folding@home, dedicating a certain percentage of CPU/GPU power as a decentralized service. Decentralized supercomputing would significantly aid with computationally intensive AI tasks and/or building language models.

  2. Ty said on April 20, 2023 at 11:10 pm

    They should also integrate an option for distributed computing, like folding@home, dedicating a certain percentage of cpu/gpu power for a decentralized service, as decentralized supercomputing would significantly aid with CPU intensive tasks.
