CoDA: Agentic Systems for Collaborative Data Visualization

Zichen Chen; Jiefeng Chen; Sercan O. Arik; Misha Sra; Tomas Pfister; Jinsung Yoon

ICLR 2026

Specialized LLM agents collaborate through metadata analysis, task planning, code generation, and self-reflection to automate data visualization.

Zichen Chen¹* · Jiefeng Chen² · Sercan O. Arik³ · Misha Sra¹ · Tomas Pfister² · Jinsung Yoon²

¹ UC Santa Barbara ² Google Cloud AI Research ³ Google

* Work done during a research internship at Google Cloud AI Research.

OVERVIEW

Abstract

Deep research has revolutionized data analysis, yet data scientists still devote substantial time to manually crafting visualizations, highlighting the need for robust automation from natural language queries. However, current systems struggle with complex datasets containing multiple files and iterative refinement. Existing approaches, including simple single- or multi-agent systems, often oversimplify the task, focusing on initial query parsing while failing to robustly manage data complexity, code errors, or final visualization quality. In this paper, we reframe this challenge as a collaborative multi-agent problem. We introduce CoDA, a multi-agent system that employs specialized LLM agents for metadata analysis, task planning, code generation, and self-reflection. We formalize this pipeline, demonstrating how metadata-focused analysis bypasses token limits and quality-driven refinement ensures robustness. Extensive evaluations show CoDA achieves substantial gains in the overall score, outperforming competitive baselines by up to 41.5%. This work demonstrates that the future of visualization automation lies not in isolated code generation but in integrated, collaborative agentic workflows.

ARCHITECTURE

How CoDA Works

Four collaborative phases powered by specialized agents.

01 Understanding

Query intent parsing and data metadata extraction without raw data upload.

Query Analyzer · Data Processor
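A minimal sketch of what metadata-only extraction could look like for a CSV input. The helper `summarize_csv` and its output fields are hypothetical and are not CoDA's actual interface. The idea illustrated is that rows are streamed locally and only a compact schema summary (column names, inferred types, missing counts, a few sample values) would be passed on to an LLM prompt.

```python
import csv
import io

_WIDENING = ["int", "float", "str"]  # narrowest to widest


def _kind(value):
    """Classify one non-empty cell as int, float, or str."""
    for kind, cast in (("int", int), ("float", float)):
        try:
            cast(value)
            return kind
        except ValueError:
            continue
    return "str"


def summarize_csv(handle, max_samples=3):
    """Hypothetical helper: build a schema summary while streaming rows."""
    reader = csv.DictReader(handle)
    columns = {name: {"type": None, "missing": 0, "samples": []}
               for name in (reader.fieldnames or [])}
    rows = 0
    for row in reader:
        rows += 1
        for name, info in columns.items():
            value = (row.get(name) or "").strip()
            if not value:
                info["missing"] += 1
                continue
            kind = _kind(value)
            if info["type"] is None:
                info["type"] = kind
            else:
                # Keep the wider of the two types seen so far.
                info["type"] = max(info["type"], kind, key=_WIDENING.index)
            if len(info["samples"]) < max_samples:
                info["samples"].append(value)
    return {"rows": rows, "columns": columns}


# Toy data standing in for a real file; only `summary` would reach a prompt.
sample = io.StringIO(
    "browser,version,share\nChrome,120,35.5\nFirefox,121,4.2\nSafari,17,\n"
)
summary = summarize_csv(sample)
```

Because the summary size depends on the number of columns rather than the number of rows, this kind of approach keeps prompt length roughly constant as the dataset grows.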

02 Planning

Example code search, visual mappings, and design optimization.

VizMapping · Design Explorer · Search

03 Generation

Code generation with best practices and automated debugging.

Code Generator · Debug Agent

04 Self-Reflection

Quality evaluation with iterative feedback loops for refinement.

Visual Evaluator · Feedback Loop

CoDA framework pipeline diagram showing 4 collaborative phases: Understanding (metadata analysis), Planning (task decomposition), Generation (code writing with search agent), and Self-Reflection (visual evaluation with feedback loops), orchestrated by 8 specialized LLM agents — **Figure 1.** Overview of the CoDA framework. Natural language queries are decomposed into **Understanding**, **Planning**, **Generation**, and **Self-Reflection** phases with quality-driven feedback loops.
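The control flow in Figure 1 can be sketched as a short loop. This is an illustration under stated assumptions, not the authors' implementation: the four phases are passed in as plain callables, `run_pipeline` is a hypothetical name, and the default threshold and iteration cap are placeholders.

```python
def run_pipeline(query, understand, plan, generate, evaluate,
                 threshold=0.85, max_iterations=5):
    """Run Understanding once, then loop Planning -> Generation -> Self-Reflection.

    `threshold` and `max_iterations` are assumed defaults for this sketch.
    """
    context = understand(query)
    feedback = None
    artifact = None
    history = []
    for iteration in range(1, max_iterations + 1):
        design = plan(context, feedback)        # may use evaluator feedback
        artifact = generate(design, feedback)   # code generation + execution
        score, feedback = evaluate(artifact)    # quality score in [0, 1]
        history.append((iteration, score))
        if score >= threshold:
            break                               # quality reached: halt
    return artifact, history


# Stub agents that replay the two scores from a two-iteration run.
scores = iter([0.45, 0.92])
artifact, history = run_pipeline(
    "sunburst of browser share",
    understand=lambda q: {"query": q},
    plan=lambda ctx, fb: {"labels": "all" if fb is None else "hide small"},
    generate=lambda design, fb: f"script({design['labels']})",
    evaluate=lambda art: (next(scores), "labels overlap"),
)
```

With these stubs the loop stops after the second pass, since 0.92 clears the assumed threshold.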

TRACE

Agent Collaboration Trace

A browser-usage sunburst query walks through the full pipeline, including diagnostic feedback loops.

Iteration 1: Full-pipeline pass

Understanding

Query Agent + Data Agent

Decompose the user query into tasks and extract data schema & hierarchy without loading raw data.

Query Agent — Input

"Create a sunburst chart showing browser market share by version from the provided dataset."

Query Agent — Output

Task list: load data, parse hierarchy (browser → version), compute share %. Viz type: sunburst. Key columns: browser, version, share.

Data Agent — Output

Schema: 5 browsers × 22 versions, 2-level hierarchy. Total share sums to 100%. No missing values.

Planning

Design Agent + Search Agent

Design the sunburst layout and color scheme; retrieve relevant matplotlib code examples.

Design Agent — Output

Chart: nested sunburst. Palette: distinct hue per browser (Chrome=blue, Firefox=orange, Safari=grey, Edge=green, Opera=red). Labels: radial text on outer ring, percentage on inner ring.

Search Agent — Output

Retrieved 3 examples: nested pie with ax.pie(), stacked-donut sunburst, label rotation patterns. Similarity: 0.89, 0.85, 0.81.
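The page does not say how the Search Agent computes similarity. As one possible reading, the sketch below ranks candidate code examples by cosine similarity over bag-of-words vectors; the function names, the example descriptions, and the choice of metric are all assumptions made for illustration.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two texts using whitespace-token counts."""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[t] * vb[t] for t in va)
    norm = (math.sqrt(sum(c * c for c in va.values()))
            * math.sqrt(sum(c * c for c in vb.values())))
    return dot / norm if norm else 0.0


def top_k(query, examples, k=3):
    """Return the k example names most similar to the query, with scores."""
    ranked = sorted(examples, key=lambda name: cosine(query, examples[name]),
                    reverse=True)
    return [(name, round(cosine(query, examples[name]), 2)) for name in ranked[:k]]


# Hypothetical example descriptions, not the ones CoDA retrieved.
examples = {
    "nested_pie": "nested pie chart with ax.pie rings",
    "bar_chart": "grouped bar chart with error bars",
    "donut": "stacked donut sunburst chart",
}
best = top_k("sunburst chart with nested rings", examples, k=2)
```

A production retriever would more likely use learned embeddings, but the ranking step has the same shape.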

Generation

Code Agent + Debug Agent

Generate a 142-line Python script and execute it. Output image rendered successfully.

Code Agent — Output

142-line script: data loading, hierarchy parsing, nested ax.pie() rings with per-browser color maps, radial labels, percentage annotations.

Debug Agent — Output

Execution successful. No runtime errors. Image rendered.

Browser sunburst chart — iteration 1 output with label overlap issues

Self-Reflection

Eval Agent

Score below threshold — detected layout and labeling issues. Routing feedback to upstream agents.

Diagnosis

Issues: (1) outer-ring labels overlap at small slices, (2) inner-ring text collides with wedge borders, (3) color contrast too low for Safari segments.

Routing

Design Agent ← low aesthetics & layout | Code Agent ← label collision fix

Score: 45 / 100 — below θ_q

Feedback routed to Design Agent & Code Agent
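The routing step above can be sketched as a lookup from issue category to upstream agent. The category names, the `ROUTES` table, and the use of 85 as the threshold on the 100-point scale are assumptions for illustration.

```python
# Hypothetical mapping from diagnosed issue category to the agent that handles it.
ROUTES = {
    "aesthetics": "Design Agent",
    "layout": "Design Agent",
    "label_collision": "Code Agent",
}


def route_feedback(issues, score, threshold):
    """Group issue messages by target agent; route nothing once the score passes."""
    if score >= threshold:
        return {}
    routed = {}
    for category, message in issues:
        agent = ROUTES.get(category)
        if agent is None:
            continue  # unknown categories are left unrouted in this sketch
        routed.setdefault(agent, []).append(message)
    return routed


issues = [
    ("label_collision", "outer-ring labels overlap at small slices"),
    ("label_collision", "inner-ring text collides with wedge borders"),
    ("aesthetics", "color contrast too low for Safari segments"),
]
routed = route_feedback(issues, score=45, threshold=85)
```

Only the agents that appear as keys in `routed` would need to run again in the next iteration.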

Iteration 2: Targeted refinement (only re-triggered agents run)

Planning

Design Agent (re-triggered)

Revises label placement and color contrast based on Eval feedback.

Revised output

Hide text for slices < 3%, leader lines for 3–5%. Safari: light-grey → steel-blue. Inner ring: percentages repositioned outside wedge borders.
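The revised label rule is simple enough to state as a function. This is a sketch: `label_mode` is a hypothetical name, the 3 to 5% band is treated as inclusive at both ends, and the "inline" mode for larger slices is an assumption, since the trace only describes the two small-slice cases.

```python
def label_mode(share_pct):
    """Pick how a wedge is labeled from its share of the whole (in percent)."""
    if share_pct < 3:
        return "hidden"        # too small to label legibly
    if share_pct <= 5:
        return "leader_line"   # label placed outside, connected by a line
    return "inline"            # assumed default for larger wedges


# Illustrative shares, not values from the dataset.
modes = {share: label_mode(share) for share in (1.2, 3.0, 4.8, 5.0, 27.5)}
```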

Generation

Code Agent + Debug Agent (re-triggered)

Regenerate script with revised label logic (+18 lines). Execution successful, labels no longer overlap.

Code Agent — Changes

+18 lines: conditional label hiding, annotate() with leader lines, Safari color map updated, inner-ring text offset.

Debug Agent — Output

Execution successful. No runtime errors.

Browser sunburst chart — iteration 2, labels fixed after feedback loop

Self-Reflection

Eval Agent (converged)

All quality criteria met. Decision: HALT — no further refinement needed.

Evaluation

Correct chart type, accurate hierarchy, clean labels with leader lines, proper color contrast. All dimensions above θ_q = 0.85.

Score: 92 / 100 — HALT

EVALUATION

Experimental Results

Comprehensive evaluation across multiple benchmarks and human expert studies.

MatplotBench & Qwen Code Interpreter

gemini-2.5-pro

Method	MatplotBench EPR	MatplotBench VSR	MatplotBench OS	Qwen Code Interpreter EPR	Qwen Code Interpreter VSR	Qwen Code Interpreter OS
MatplotAgent	97.0	56.7	55.0	81.6	79.7	65.0
VisPath	75.0	37.3	38.0	86.5	94.3	81.6
CoML4VIS	76.0	69.7	53.0	87.1	90.9	79.1
CoDA (Ours)	99.0	79.8	79.5	93.3	95.4	89.0

DA-Code Benchmark (Overall Score %)

SWE-level

Method	Model	OS (%)
CoDA (Ours)	Gemini-2.5-Pro	39.0
DS-STAR	Gemini-2.5-Pro	20.5
DA-Agent	Gemini-2.5-Pro	19.2
DA-Agent	GPT-4o	17.0
DA-Agent	GPT-4	16.0
DA-Agent	Deepseek-Coder	11.0

Human Expert Evaluation on MatplotBench

3 experts · 200 charts

Method	Elo	Harmony	Balance	Color	Simplicity	Query Alignment
MatplotAgent	1506	3.65	3.65	3.53	4.31	3.63
VisPath	1484	2.71	2.71	2.65	2.92	2.78
CoML4VIS	1309	3.16	3.22	3.22	4.00	3.59
CoDA (Ours)	1701	4.82	4.73	4.96	4.94	4.86
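To give the Elo gap an intuitive reading, the sketch below applies the conventional Elo expected-score formula with a scale of 400. The page does not state how its Elo ratings were computed, so the formula and scale are assumptions, and the resulting probability is an interpretation rather than a reported result.

```python
def expected_score(rating_a, rating_b, scale=400.0):
    """Conventional Elo expectation that A is preferred over B."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / scale))


# Ratings taken from the table above.
p_coda_over_matplotagent = expected_score(1701, 1506)
```

Under these assumptions, a 195-point gap corresponds to CoDA being preferred in roughly three out of four pairwise comparisons against MatplotAgent.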

Line chart showing overall visualization quality score improving from ~60 to ~85 across 5 refinement iterations in CoDA's self-reflection loop — **Overall Score** vs. refinement iterations.

Bar chart comparing visualization quality metrics with and without the Search Agent, showing consistent improvement when Search Agent is enabled — Impact of the **Search Agent**.

Bar chart comparing visualization quality metrics with and without the Global TODO List, demonstrating improved coordination across agents — Impact of the **Global TODO List**.

Key Performance Highlights

CoDA sets a new state of the art across all benchmarks and in human evaluation.

Metric	CoDA result	Reference
MatplotBench OS	+24.5%	vs. 55.0% best baseline
Qwen OS	+7.4%	vs. 81.6% best baseline
Elo Rating	#1 Rank	vs. 1506 next best
DA-Code OS	~2x SOTA	vs. 20.5% DS-STAR
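The headline numbers can be re-derived from the result tables. The short check below treats each gain as an absolute difference in Overall Score; the helper name is hypothetical.

```python
# Overall Score (OS) values copied from the benchmark tables.
matplotbench_os = {"MatplotAgent": 55.0, "VisPath": 38.0, "CoML4VIS": 53.0, "CoDA": 79.5}
qwen_os = {"MatplotAgent": 65.0, "VisPath": 81.6, "CoML4VIS": 79.1, "CoDA": 89.0}


def gain_over_best_baseline(scores, ours="CoDA"):
    """Return (absolute gain, best baseline score) for one benchmark."""
    best = max(value for name, value in scores.items() if name != ours)
    return round(scores[ours] - best, 1), best


matplotbench_gain = gain_over_best_baseline(matplotbench_os)   # vs. MatplotAgent
qwen_gain = gain_over_best_baseline(qwen_os)                   # vs. VisPath
da_code_ratio = round(39.0 / 20.5, 2)                          # CoDA vs. DS-STAR
largest_gap = round(
    matplotbench_os["CoDA"] - min(v for n, v in matplotbench_os.items() if n != "CoDA"), 1
)
```

The largest MatplotBench gap (against VisPath) comes out at 41.5, which matches the "up to 41.5%" figure in the abstract and suggests that figure is also an absolute difference.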

GALLERY

Visualization Gallery

CoDA outputs compared side-by-side with ground truth across diverse benchmarks.

Side-by-side qualitative comparison of CoDA vs. baselines across 4 visualization tasks — **Figure 2.** Qualitative comparison across four diverse tasks. CoDA consistently produces complete, accurate visualizations scoring **90–95/100**, while baselines frequently produce broken outputs scoring **0–45/100**.

DA-Code 100 / 100

NBA team performance trends — multi-line chart with markers and legend

CoDA Output

Ground Truth

DA-Code 100 / 100

Steam game ratings — scatter plot with color-coded categories and trend line

CoDA Output

Ground Truth

MatplotBench 100 / 100

Hierarchical data distribution — sunburst chart with nested categories

CoDA Output

Ground Truth

Failure Recovery fail → success

Browser usage sunburst — initial iterations failed, self-reflection loop recovered at iteration 4

Iter 3 (Failed)

Successful output after self-reflection recovery

Iter 4 (Success)

CoDA's self-reflection loop detected layout and labeling errors across 3 iterations, then automatically recovered at iteration 4 — demonstrating the robustness of quality-driven refinement.

REFERENCE

Citation

BibTeX

@inproceedings{chen2026coda,
  title     = {{CoDA}: Agentic Systems for Collaborative Data Visualization},
  author    = {Chen, Zichen and Chen, Jiefeng and Ar{\i}k, Sercan {\"O}. and Sra, Misha and Pfister, Tomas and Yoon, Jinsung},
  booktitle = {International Conference on Learning Representations (ICLR)},
  year      = {2026}
}
