Visual ChatGPT: Temporary solution until GPT-4's launch

Onur Demirkol
Mar 13, 2023
Updated • Mar 13, 2023

Microsoft researchers have launched Visual ChatGPT, which aims to gather ChatGPT's and visual foundation models' abilities together to offer a better service before GPT-4.

ChatGPT launched a while ago and started a new era in the generative AI industry. More AI tools were produced following the fame and success of the chatbot. Microsoft has taken important steps to improve generative AI tools, especially in the past few years. Unfortunately, ChatGPT is a text-based language model, and it doesn't have the same abilities as DALL-E 2 or Wombo Dream. However, it has changed with the launch of Visual ChatGPT.

What is Visual ChatGPT?

ChatGPT is a text-only chatbot that doesn't have the ability to generate images or videos, which is expected to change with GPT-4. However, Visual ChatGPT helps you generate, modify or crop an image. It mixes the powers of ChatGPT with other VFMs, such as Stable Diffusion, connecting ChatGPT and a series of Visual Foundation Models to enable sending and receiving images during chatting.

In other words, Visual ChatGPT helps users generate images out of text prompts. It lacked what other AI tools like Stable Diffusion had, and now, in a way, it is complete.

"Instead of training a new multi-modal ChatGPT from scratch, we build Visual ChatGPT directly based on ChatGPT and incorporate a variety of VFMs," says Microsoft.

Microsoft researchers have solved ChatGPT's text-only issue and added a new capability to it with Visual ChatGPT.
Credit: Microsoft

GPU memory usage?

The researchers have also given the GPU memory usage stats on the official GitHub page. It requires high GPU and computation power. Below you will find the GPU memory usage of each visual foundation model:

Foundation Model Memory Usage (MB)
ImageEditing 6667
ImageCaption 1755
T2I 6677
canny2image 5540
line2image 6679
hed2image 6679
scribble2image 6679
pose2image 6681
seg2image 5540
depth2image 6677
normal2image 3974
InstructPix2Pix 2795


As mentioned, ChatGPT was trained to give users text-based answers but lacked image or video creation. Visual ChatGPT's capabilities are as follows:

  • Send and Receive not only languages but also images.
  • Provide complex visual questions or visual editing instructions that require the collaboration of multiple AI models with multi-steps.
  • Provide feedback and ask for corrected results.

GPT-4 release date

Last week, the CTO of Microsoft Germany announced that GPT-4 would be released "next week." He gave the statement on March 9, meaning the new model could launch in the upcoming days. OpenAI will at least introduce it to the community if it doesn't launch.

GPT-4 will be a multi-modal LLM that has the ability to create images and videos from text prompt on top of GPT-3.5's text prompt abilities. For more information about Visual ChatGPT, you can check the official Github page


Tutorials & Tips

Previous Post: «
Next Post: «


There are no comments on this post yet, be the first one to share your thoughts!

Leave a Reply

Check the box to consent to your data being stored in line with the guidelines set out in our privacy policy

We love comments and welcome thoughtful and civilized discussion. Rudeness and personal attacks will not be tolerated. Please stay on-topic.
Please note that your comment may not appear immediately after you post it.