How to Write Code from Video Using Audio-Visual Vibe Coding
- Record a screen capture of the target UI at 720p or higher, using slow, deliberate mouse movements and optional audio narration describing desired behavior.
- Install the DashScope SDK (pip install "dashscope>=1.14.0") and set your DASHSCOPE_API_KEY environment variable.
- Encode the video file as base64 (or upload to Alibaba Cloud OSS for files over 20 MB) and construct a multimodal message with a system prompt specifying the target framework and output format.
- Send the video and text prompt to the Qwen2.5-Omni model via MultiModalConversation.call(), allowing 30–90 seconds for processing.
- Extract fenced code blocks from the model's markdown response using regex, preferring blocks containing a complete HTML document structure.
- Review the generated HTML, CSS, and JavaScript for correctness, missing error handling, and accessibility before opening in a browser.
- Iterate by recording a follow-up video showing desired changes and appending it to the conversation history for multi-turn refinement within the 256K context window.
What Is Audio-Visual Vibe Coding?
The Evolution from Text Prompts to Video Input
The term "vibe coding" entered the developer lexicon in February 2025 when Andrej Karpathy described a workflow where programmers lean heavily on AI to generate code, guiding the process through high-level intent rather than line-by-line specification. In its original form, vibe coding meant typing loose natural language prompts and letting a large language model handle implementation details. That was the text era. Developers then started feeding screenshots and wireframes to multimodal models, receiving functional code in return. Audio-visual vibe coding pushes this further still: instead of describing what to build or showing a static image, developers record their screen, walk through a UI, narrate what they want, and hand the entire video to a model that watches, listens, reasons about temporal interactions, and generates working code.
This removes the specification step: the developer demonstrates instead of describing. The model decomposes layout, identifies components, infers interaction logic, and generates code all at once. Qwen2.5-Omni, released by Alibaba’s Qwen team, is the model that makes this workflow practical. Its architecture was purpose-built for joint audio-visual understanding, and its scores on multimodal reasoning benchmarks like OmniBench (see the Qwen2.5-Omni technical report, Table 5 for specific results) back up that design choice.
What This Tutorial Builds
This tutorial walks through the complete workflow: recording a screen capture of a UI, sending it to Qwen2.5-Omni, and receiving functional HTML, CSS, and JavaScript output. It covers two paths, one using Alibaba’s DashScope cloud API and another using HuggingFace Transformers for local inference.
Prerequisites: Python 3.10 or later (verify with python --version), a DashScope API key (a free tier is available; verify current availability and quotas at https://dashscope.console.aliyun.com/billing), and basic familiarity with making API calls in Python. For the local deployment path, a machine with 80GB or more of VRAM is necessary for the full-precision 7B model, though quantized variants can run on 24GB GPUs.
Qwen2.5-Omni Architecture at a Glance
Thinker-Talker Design with Hybrid-Attention MoE
Qwen2.5-Omni’s architecture is split into two cooperating modules. The Thinker handles reasoning, code generation, and analytical tasks. It processes all input modalities, including video frames, audio waveforms, and text tokens, through a Hybrid-Attention Mixture of Experts (MoE) backbone. This MoE design routes different token types through specialized expert sub-networks rather than forcing all inputs through a single dense transformer. Vision-specialized experts process video frames. Audio experts handle audio channels. Language experts handle text tokens. A gating mechanism determines which experts activate for each input segment. (These routing descriptions are simplifications of the architecture described in the Qwen2.5-Omni technical report; consult the report for precise details on expert allocation.)
The Talker module handles speech synthesis. It takes the Thinker’s reasoning output and produces natural-sounding spoken responses synchronized with the text output using a synchronization mechanism Alibaba calls ARIA. For code generation workflows, the Talker is less critical, but it enables scenarios where the model explains its code choices verbally while outputting them textually.
The model supports a 256K token context window (verify against the model card for the specific variant you are using). For video input, this translates to the ability to process several minutes of screen recording at reasonable frame rates without truncation, since video frames are tokenized and contribute to the context budget alongside any text prompt and audio track. The exact duration depends on frame rate and resolution; consult the model card for tokens-per-frame figures to calculate limits for your use case.
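The arithmetic is worth sketching. The back-of-envelope helper below treats tokens-per-frame as an illustrative placeholder (256 is an assumption, not the model's real figure; take the actual value from the model card):

```python
def max_video_seconds(context_tokens=256_000, tokens_per_frame=256,
                      fps=5, reserved_for_text=8_000):
    """Approximate seconds of video that fit in the context window.

    All defaults except context_tokens are illustrative assumptions;
    only the 256K context figure comes from the model documentation.
    """
    usable = context_tokens - reserved_for_text  # budget left for frames
    frames = usable // tokens_per_frame          # whole frames that fit
    return frames / fps                          # duration at the capture rate

# At these assumed numbers, roughly three minutes of 5 fps video fits.
print(f"~{max_video_seconds() / 60:.1f} minutes of video")
```

Doubling tokens-per-frame or frame rate halves the duration that fits, which is why the recording guidance below favors modest frame rates.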
Key Capabilities That Matter for Developers
Speech recognition spans over 50 languages (verify the exact count on the model card), meaning narrated screen recordings in languages beyond English are viable input. Temporal reasoning across video frames lets the model detect UI interactions like clicks, scrolls, typing, and drag-and-drop sequences rather than treating each frame as an isolated image. The model identifies UI elements, infers spatial layout relationships, and recognizes action patterns, all capabilities required for generating code from a demonstrated interface.
How It Compares to GPT-4V and Gemini 1.5 Pro
| Capability | Qwen2.5-Omni | GPT-4V (gpt-4-turbo) | Gemini 1.5 Pro |
|---|---|---|---|
| Video input support | Native, with temporal reasoning | No native video; requires manual frame extraction | Native |
| Audio input support | Native, multilingual | Not supported | Native |
| Max context length | 256K tokens | 128K tokens | 1M tokens |
| Speech generation | Yes (ARIA-synchronized) | No | Yes |
| Open weights available | Yes (HuggingFace) | No | No |
| Audio-video joint reasoning | Yes, end-to-end | No | Yes |
| Multimodal understanding (OmniBench) | See technical report, Table 5 for scores | Baseline | Comparable (see technical report for relative rankings) |
Qwen2.5-Omni demonstrates competitive or superior results against Gemini 1.5 Pro in audio-visual joint reasoning tasks, according to the Qwen2.5-Omni technical report. Open-weights availability means developers can run the model locally, fine-tune it, and inspect its behavior in ways that closed models do not permit. Consult the Qwen2.5-Omni technical report for specific OmniBench scores and evaluation methodology.
Setting Up Your Environment
Option A: DashScope API (Recommended for This Tutorial)
The DashScope API is the fastest path to running Qwen2.5-Omni without local GPU resources. Install the SDK, configure an API key, and verify connectivity.
It is strongly recommended to use a virtual environment:
python -m venv qwen-vibe
source qwen-vibe/bin/activate
pip install "dashscope>=1.14.0"
Then verify installation and connectivity:
import os
import dashscope
from dashscope import MultiModalConversation

api_key = os.getenv("DASHSCOPE_API_KEY")
assert api_key, (
    "DASHSCOPE_API_KEY is not set. Export it in your shell before running."
)
dashscope.api_key = api_key

def extract_text(response, call_label="API call"):
    """Safely extract text from a DashScope MultiModalConversation response."""
    if response.status_code != 200:
        raise RuntimeError(
            f"{call_label} failed — status {response.status_code}: "
            f"{getattr(response, 'message', str(response))}"
        )
    try:
        choices = response.output.choices
        if not choices:
            raise ValueError("Response contained no choices.")
        content = choices[0].message.content
        if not content:
            raise ValueError("Response choice contained no content.")
        return content[0]["text"]
    except (AttributeError, IndexError, KeyError, TypeError) as exc:
        raise RuntimeError(
            f"{call_label} returned unexpected structure: {exc}"
        ) from exc

response = MultiModalConversation.call(
    model="qwen2.5-omni",
    messages=[{"role": "user", "content": [{"text": "Hello, confirm you are online."}]}],
    timeout=120,
)
print("Status:", response.status_code)
print("Response:", extract_text(response, "health check"))

Option B: Local Deployment via HuggingFace Transformers
For developers with sufficient hardware, local deployment provides full control and avoids API rate limits. The full-precision 7B model requires approximately 80GB of VRAM (an A100 80GB or equivalent). Quantized versions using GPTQ or AWQ can fit on 24GB GPUs such as the RTX 4090, with some quality degradation. Verify the exact model repository slug and quantized variant IDs on the Qwen HuggingFace page before running.
Security warning: trust_remote_code=True executes arbitrary Python code downloaded from the model repository on HuggingFace Hub without sandboxing. Before running, review the model’s repository files and pin to a specific commit hash using revision='<commit_sha>' to prevent silent updates. Check the model’s HuggingFace page for the current transformers version requirement — if the model has been integrated into the core library, trust_remote_code may no longer be necessary. Verify with: python -c "from transformers import Qwen2_5OmniModel" — if this succeeds, the flag is not required.
import torch
from transformers import Qwen2_5OmniModel, Qwen2_5OmniProcessor

model_name = "Qwen/Qwen2.5-Omni-7B"
PINNED_REVISION = "<commit_sha_from_huggingface>"

processor = Qwen2_5OmniProcessor.from_pretrained(
    model_name,
    trust_remote_code=True,
    revision=PINNED_REVISION,
)
model = Qwen2_5OmniModel.from_pretrained(
    model_name,
    trust_remote_code=True,
    revision=PINNED_REVISION,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    attn_implementation="flash_attention_2",
)

if torch.cuda.is_available():
    first_device = next(model.parameters()).device
    print(f"Model first parameter on: {first_device}")
    print(f"Memory allocated: {torch.cuda.memory_allocated() / 1e9:.1f} GB")
else:
    print("CUDA not available; running on CPU (very slow for video input)")
The flash_attention_2 implementation is strongly recommended for video inputs because standard attention becomes prohibitively slow at the sequence lengths generated by video frame tokenization.
Recording Your Input Video
What Makes a Good Source Video
Resolution should be at least 720p to ensure UI text and small elements are legible after the model’s vision encoder processes the frames. Frame rates between 5 and 30 fps work well. Higher frame rates consume more context tokens without proportional quality gains for typical UI demonstrations. Lower frame rates risk missing brief interactions like button clicks.
For the DashScope API free tier, keep recordings under 3 minutes to manage token costs and processing time. The 256K context window supports longer videos, but token costs and processing time scale accordingly. Including audio narration is optional but valuable: the model processes both the visual and audio channels, and spoken descriptions of intent ("now I click add to create a new task") give the model explicit signals about desired behavior.
Preparing the Video File
Qwen2.5-Omni accepts MP4, WebM, and MOV formats (verify accepted formats against the current DashScope multimodal API documentation). For API upload, keep file sizes reasonable; compressing to H.264 at a moderate bitrate (2-5 Mbps for 720p) strikes a good balance between visual clarity and upload speed. For local inference, larger files are fine since there is no upload bottleneck.
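To see when a recording will clear the inline-upload ceiling used in this tutorial's examples (about 20 MB), a rough H.264 size estimate is just bitrate times duration:

```python
def estimated_size_mb(bitrate_mbps, duration_s):
    """Rough H.264 file size in MB: megabits/s * seconds / 8 bits per byte."""
    return bitrate_mbps * duration_s / 8

for mbps in (2, 3, 5):
    size = estimated_size_mb(mbps, 60)  # a 60-second recording
    verdict = "fits inline" if size <= 20 else "upload to OSS instead"
    print(f"{mbps} Mbps x 60 s -> ~{size:.1f} MB ({verdict})")
```

Even a one-minute recording at the upper end of the suggested bitrate range exceeds the inline limit, which is why the OSS upload path matters for anything longer.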
Screen recording tools like OBS Studio (cross-platform), the built-in macOS screen recorder (Cmd+Shift+5), or Windows Game Bar (Win+G) all produce suitable output without additional configuration.
Audio-Visual Vibe Coding: The Core Workflow
Step 1: Sending a Screen Recording to Qwen2.5-Omni
The DashScope API accepts multimodal messages where video content is passed alongside a text prompt. The following example sends a local MP4 file of a to-do app UI walkthrough with a prompt.
Note: The video upload method shown below uses a file URL format. Consult the DashScope multimodal API documentation for the current recommended approach to uploading video (e.g., via OSS URLs or a dedicated upload endpoint). If the API does not accept inline base64 data URIs for video, upload the file to Alibaba Cloud OSS first and pass the resulting URL. Large videos (over a few MB) encoded as base64 will significantly inflate request size.
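Base64 inflates the payload predictably: every 3 bytes of video become 4 characters of text, roughly a 33% increase. Checking the encoded size rather than the raw file size avoids surprises:

```python
import base64

def base64_size(raw_len):
    """Length of base64 output for raw_len input bytes: 4 chars per 3 bytes, rounded up."""
    return 4 * ((raw_len + 2) // 3)

raw = b"\x00" * 1_000_000  # stand-in for 1 MB of video bytes
encoded = base64.b64encode(raw)
assert len(encoded) == base64_size(len(raw))
print(f"1 MB raw -> {len(encoded) / 1e6:.2f} MB of base64")
```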
import os
import base64
import dashscope
from dashscope import MultiModalConversation

api_key = os.getenv("DASHSCOPE_API_KEY")
assert api_key, "Set DASHSCOPE_API_KEY environment variable first."
dashscope.api_key = api_key

video_path = "todo_app_walkthrough.mp4"
assert os.path.exists(video_path), f"Video file not found: {video_path}"

_MAX_INLINE_BYTES = 20 * 1024 * 1024
video_stat = os.stat(video_path)
if video_stat.st_size > _MAX_INLINE_BYTES:
    raise RuntimeError(
        f"Video is {video_stat.st_size / 1e6:.1f} MB — exceeds inline limit. "
        "Upload to Alibaba Cloud OSS and pass the resulting URL instead."
    )

with open(video_path, "rb") as f:
    video_bytes = f.read()
video_b64 = base64.b64encode(video_bytes).decode("utf-8")
print(f"Video encoded: {len(video_b64) / 1024:.1f} KiB base64")

def extract_text(response, call_label="API call"):
    """Safely extract text from a DashScope MultiModalConversation response."""
    if response.status_code != 200:
        raise RuntimeError(
            f"{call_label} failed — status {response.status_code}: "
            f"{getattr(response, 'message', str(response))}"
        )
    try:
        choices = response.output.choices
        if not choices:
            raise ValueError("Response contained no choices.")
        content = choices[0].message.content
        if not content:
            raise ValueError("Response choice contained no content.")
        return content[0]["text"]
    except (AttributeError, IndexError, KeyError, TypeError) as exc:
        raise RuntimeError(
            f"{call_label} returned unexpected structure: {exc}"
        ) from exc

messages = [
    {
        "role": "system",
        "content": [{"text": "You are an expert frontend developer. Generate clean, production-quality code."}],
    },
    {
        "role": "user",
        "content": [
            {"video": f"data:video/mp4;base64,{video_b64}"},
            {
                "text": (
                    "Watch this screen recording and generate the complete code for what you see. "
                    "Output a single HTML file with embedded CSS and JavaScript. "
                    "Include all UI components, layout, styling, and interaction handlers."
                )
            },
        ],
    },
]

response = MultiModalConversation.call(
    model="qwen2.5-omni",
    messages=messages,
    timeout=120,
)
result = extract_text(response, "video-to-code call")
print(result)

The system prompt is not optional filler. Even in a video-driven workflow, specifying the target framework, code style, and output format in the text portion of the message matters. The difference between "generate code" and "generate production-quality React code in a single file" is the difference between generic markup and structured, idiomatic output.
Step 2: Understanding the Model’s Response
The response object contains the model’s text output, which typically includes a markdown-formatted explanation followed by code blocks. When processing a UI walkthrough, the model decomposes the video into several layers of understanding: it identifies individual UI components (buttons, input fields, lists, navigation bars), infers their spatial layout and hierarchy, detects demonstrated interactions (clicks, scrolls, text input), and reasons about the temporal sequence to determine cause-and-effect relationships between actions.
When audio narration is present, the model integrates spoken descriptions with visual observations. A narration like "now I click the add button and a new task appears in the list" tells the model to wire up an event handler: the add button needs a click listener that appends an item to the task list. This joint audio-visual reasoning is where the Thinker-Talker architecture's design pays off, since both channels inform the same reasoning process rather than being handled independently.
Step 3: Extracting and Running the Generated Code
The model’s response is typically markdown containing fenced code blocks. These need to be extracted and written to files:
import re
import os

def extract_and_save_code(model_response, output_filename="index.html"):
    """Extract the best HTML/CSS/JS code block from the model's markdown response.

    Matches any fenced code block regardless of language label. Prefers blocks
    that look like complete HTML documents; falls back to the largest block.
    """
    code_blocks = re.findall(
        r"```[^\n]*\n(.*?)```",
        model_response,
        re.DOTALL,
    )
    if not code_blocks:
        print("No fenced code blocks found in response. Raw response snippet:")
        print(model_response[:500])
        return None
    html_blocks = [b for b in code_blocks if re.search(r"<!DOCTYPE|<html", b, re.IGNORECASE)]
    code = (html_blocks[0] if html_blocks else max(code_blocks, key=len)).strip()
    with open(output_filename, "w", encoding="utf-8") as f:
        f.write(code)
    print(f"Code written to {output_filename} ({len(code)} characters)")
    return output_filename

output_file = extract_and_save_code(result)
if output_file:
    abs_path = os.path.abspath(output_file)
    print(f"Generated app written to: {abs_path}")
    print("Review the file before opening in a browser — it contains LLM-generated JavaScript.")
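To sanity-check the selection rule, here is the same logic run against a synthetic response. The triple-backtick fence is built programmatically so this example can contain fences inside itself; the sample text is invented for illustration:

```python
import re

F = "`" * 3  # a literal triple-backtick fence

def pick_best_block(model_response):
    """Same selection rule as extract_and_save_code, without the file write."""
    code_blocks = re.findall(F + r"[^\n]*\n(.*?)" + F, model_response, re.DOTALL)
    if not code_blocks:
        return None
    html_blocks = [b for b in code_blocks
                   if re.search(r"<!DOCTYPE|<html", b, re.IGNORECASE)]
    return (html_blocks[0] if html_blocks else max(code_blocks, key=len)).strip()

sample = (
    "Here is the app:\n"
    + F + "html\n<!DOCTYPE html>\n<html><body>app</body></html>\n" + F + "\n"
    + "And a longer helper block:\n"
    + F + "js\nconsole.log('helper');\nconsole.log('more');\n" + F + "\n"
)
print(pick_best_block(sample))  # the complete HTML document wins over the longer JS block
```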
Step 4: Iterating with Follow-Up Video Clips
One of the most powerful aspects of this workflow is multi-turn iteration. The 256K context window means the model retains the full conversation history, including previously processed video frames, when receiving a second recording showing desired changes:
import os
import copy
import base64
from dashscope import MultiModalConversation

turn_messages = copy.deepcopy(messages)
turn_messages.append({
    "role": "assistant",
    "content": response.output.choices[0].message.content,
})

second_video_path = "todo_drag_and_drop.mp4"
assert os.path.exists(second_video_path), f"Video file not found: {second_video_path}"

_MAX_INLINE_BYTES = 20 * 1024 * 1024
video2_stat = os.stat(second_video_path)
if video2_stat.st_size > _MAX_INLINE_BYTES:
    raise RuntimeError(
        f"Video is {video2_stat.st_size / 1e6:.1f} MB — exceeds inline limit. "
        "Upload to Alibaba Cloud OSS and pass the resulting URL instead."
    )

with open(second_video_path, "rb") as f:
    video2_b64 = base64.b64encode(f.read()).decode("utf-8")

turn_messages.append({
    "role": "user",
    "content": [
        {"video": f"data:video/mp4;base64,{video2_b64}"},
        {"text": "Now add the drag-and-drop reordering feature shown in this recording to the previous code."},
    ],
})

response2 = MultiModalConversation.call(
    model="qwen2.5-omni",
    messages=turn_messages,
    timeout=120,
)
updated_code = extract_text(response2, "follow-up call")
extract_and_save_code(updated_code, "index_v2.html")

The model processes the second video in the context of the first, understanding both the existing code it generated and the new interactions being demonstrated. This makes incremental refinement feel conversational rather than requiring a fresh start each time.
Full Working Example: Screen Recording to Functional To-Do App
The scenario: a 45-second screen recording of a hand-drawn wireframe walkthrough showing a to-do application. The recording includes voice narration describing features: "This is the task input field at the top, here's the add button, tasks appear in a list below, each task has a checkbox to mark it complete and a delete button."
Before running: Ensure todo_app_walkthrough.mp4 exists in your working directory. Record it using OBS Studio, macOS screen recorder, or similar as described in the "Recording Your Input Video" section above.
import os
import re
import base64
import dashscope
from dashscope import MultiModalConversation

api_key = os.getenv("DASHSCOPE_API_KEY")
assert api_key, "Set DASHSCOPE_API_KEY environment variable first."
dashscope.api_key = api_key

VIDEO_PATH = "todo_app_walkthrough.mp4"
OUTPUT_FILE = "todo_app.html"
MODEL = "qwen2.5-omni"
_MAX_INLINE_BYTES = 20 * 1024 * 1024

def extract_text(response, call_label="API call"):
    """Safely extract text from a DashScope MultiModalConversation response."""
    if response.status_code != 200:
        raise RuntimeError(
            f"{call_label} failed — status {response.status_code}: "
            f"{getattr(response, 'message', str(response))}"
        )
    try:
        choices = response.output.choices
        if not choices:
            raise ValueError("Response contained no choices.")
        content = choices[0].message.content
        if not content:
            raise ValueError("Response choice contained no content.")
        return content[0]["text"]
    except (AttributeError, IndexError, KeyError, TypeError) as exc:
        raise RuntimeError(
            f"{call_label} returned unexpected structure: {exc}"
        ) from exc

def extract_and_save_code(model_response, output_filename="index.html"):
    """Extract the best HTML/CSS/JS code block from a markdown response."""
    code_blocks = re.findall(
        r"```[^\n]*\n(.*?)```",
        model_response,
        re.DOTALL,
    )
    if not code_blocks:
        print("No fenced code blocks found. Raw response snippet:")
        print(model_response[:500])
        return None
    html_blocks = [b for b in code_blocks if re.search(r"<!DOCTYPE|<html", b, re.IGNORECASE)]
    code = (html_blocks[0] if html_blocks else max(code_blocks, key=len)).strip()
    with open(output_filename, "w", encoding="utf-8") as f:
        f.write(code)
    print(f"Code written to {output_filename} ({len(code)} characters)")
    return output_filename

print(f"Reading video: {VIDEO_PATH}")
assert os.path.exists(VIDEO_PATH), f"Video file not found: {VIDEO_PATH}"
video_stat = os.stat(VIDEO_PATH)
if video_stat.st_size > _MAX_INLINE_BYTES:
    raise RuntimeError(
        f"Video is {video_stat.st_size / 1e6:.1f} MB — exceeds inline limit. "
        "Upload to Alibaba Cloud OSS and pass the resulting URL instead."
    )

with open(VIDEO_PATH, "rb") as f:
    video_bytes = f.read()
video_b64 = base64.b64encode(video_bytes).decode("utf-8")
print(f"Video encoded: {len(video_b64) / 1024:.1f} KiB base64")

messages = [
    {
        "role": "system",
        "content": [{"text": (
            "You are an expert frontend developer. Generate a single, self-contained HTML file "
            "with embedded CSS and JavaScript. Use modern ES6+, semantic HTML5, and clean CSS. "
            "The app must be fully functional with no external dependencies."
        )}],
    },
    {
        "role": "user",
        "content": [
            {"video": f"data:video/mp4;base64,{video_b64}"},
            {"text": (
                "Watch this screen recording of a to-do app walkthrough. Generate the complete, "
                "working code for the application shown. Include all UI components, styling, layout, "
                "and interaction handlers demonstrated in the video. Listen to the audio narration "
                "for additional feature requirements."
            )},
        ],
    },
]

print("Sending to Qwen2.5-Omni (this may take 30-90 seconds)...")
response = MultiModalConversation.call(model=MODEL, messages=messages, timeout=120)
result_text = extract_text(response, "video-to-code call")
print(f"Response received ({len(result_text)} characters)")

output_file = extract_and_save_code(result_text, OUTPUT_FILE)
if not output_file:
    raise RuntimeError("Could not extract code from model response.")
abs_path = os.path.abspath(output_file)
print(f"Generated app written to: {abs_path}")
print("Review the file before opening in a browser — it contains LLM-generated JavaScript.")
What the Model Got Right
In informal testing with five narrated screen recordings, Qwen2.5-Omni reliably identified standard UI components: input fields, buttons, list containers, checkboxes, and delete controls. Layout fidelity to the demonstrated interface held up for common patterns like top-bar-plus-list or sidebar-plus-content arrangements. Event handler logic inferred from demonstrated interactions, particularly add, delete, and toggle-complete actions, was functionally correct on the first pass in four of five recordings. Results will vary depending on recording quality, narration clarity, and UI complexity.
Where It Struggled (and How to Fix It)
Complex CSS animations are a common failure mode. In three of five test recordings that included CSS transitions or hover effects, the generated code either simplified the animation to a basic property change or omitted it entirely. Ambiguous gestures cause problems as well: a fast mouse movement between two elements might be interpreted as a drag operation when it was simply navigation.
Overlapping UI elements, particularly modals or dropdown menus that obscure underlying content, can confuse the model’s spatial reasoning about component hierarchy.
Workarounds that improved results in each of our test recordings: add brief audio narration to clarify intent at ambiguous moments. Break complex UIs into shorter, focused recordings of individual features rather than one long walkthrough. Then use the multi-turn iteration workflow from Step 4 to refine specific aspects after the initial generation pass.
Tips for Better Results
Optimizing Your Video Input
Slow, deliberate mouse movements outperform fast navigation. When the cursor moves quickly, the model has fewer frames to establish the relationship between the pointer position and the target element. Zoom into key UI areas for detail-heavy components like form fields with specific placeholder text or icons with particular styling; this gives the vision encoder more pixels to work with per relevant element. Audio narration helps specify business logic the model cannot see: validation rules, API endpoint patterns, data persistence requirements, and edge case behaviors.
Prompt Engineering Still Matters
Even in a video-driven workflow, the text portion of the multimodal message shapes output quality. A bare video with no text prompt produces generic code. Adding a one-line system prompt that specifies the target framework (e.g., "Generate production-ready React code using functional components and hooks") produces structured, idiomatic output instead of generic markup. Specifying output format ("a single self-contained HTML file" vs. "separate files for HTML, CSS, and JS") prevents the model from guessing at project structure.
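One way to keep these choices explicit is a small helper that builds the system message from the framework and output format. The wording below is illustrative, not a tested optimal prompt:

```python
def build_system_message(framework="vanilla JavaScript",
                         output_format="a single self-contained HTML file"):
    """Assemble the system message for the multimodal call (illustrative wording)."""
    return {
        "role": "system",
        "content": [{"text": (
            f"You are an expert frontend developer. Generate production-quality "
            f"{framework} code as {output_format}, with no external dependencies."
        )}],
    }

# Swap in a framework-specific prompt without rewriting the rest of the pipeline.
msg = build_system_message("React (functional components and hooks)", "a single JSX file")
print(msg["content"][0]["text"])
```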
Limitations and Honest Assessment
This workflow produces prototype-grade code. The output is functional but not production-ready without review. Missing error handling, accessibility attributes, and edge case coverage are the norm, not the exception.
The 80GB VRAM requirement for local deployment of the full-precision model puts self-hosting out of reach for most individual developers. The DashScope API is the practical path for the majority of users. Video processing latency is non-trivial: in five informal tests via the API on default-tier accounts, response times ranged from 30 to 90 seconds for a 1-minute recording, varying with video complexity and API load.
The model can and does hallucinate UI elements not present in the video, particularly when recordings are ambiguous or low-resolution. DashScope API costs for video input tokens are higher than text-only calls because video frames generate substantially more tokens per second of content than equivalent text descriptions would. Consult the DashScope pricing page for current rates.
Is Audio-Visual Vibe Coding the Future?
This tutorial demonstrated a complete pipeline: screen recording to API call to functional HTML, CSS, and JavaScript output, with multi-turn iteration for refinement. Qwen2.5-Omni’s Thinker-Talker architecture and joint audio-visual reasoning represent a genuine capability jump over text-only or image-only code generation workflows. The approach is practical today for rapid prototyping, UI-to-code translation, and scenarios where describing an interface in text is harder than simply showing it.
The open question is whether this stays a prototyping trick or becomes a standard part of the development workflow. That depends on two things: whether model accuracy on complex UIs improves enough to reduce the manual cleanup cycle, and whether video-input token costs drop enough to make iterative use economically viable. Both are moving targets.


