Fiction vs Friction: Challenges in Evaluating LLMs on Data Visualization Tasks

Abstract
Large language models (LLMs) are often marketed as all-purpose tools, capable of assisting users with a variety of tasks. This has led to the use of LLMs across domains and tasks, exposing some of their fundamental limitations. For data visualization tasks (where an LLM is asked to create a visualization or answer a question with a visualization), we call out challenges associated with query specification and the difficulty of verifying results. Differently phrased queries may share the same analytic goal, while similarly phrased queries may lead to dramatically different results. Add to this the plethora of visualization guidelines and design choices, and the complexity of evaluating LLMs on data visualization tasks grows quickly. While correct and credible answers take time to sort out, plausible-looking but limited, hallucinated, or otherwise incorrect model responses are instant and ubiquitous. We explore the challenges associated with this space, and call for consideration of combinations of techniques to spot-check model responses and surface errors.
Materials
PDF | BibTeX
Authors
Shani Spivak and Melanie Tory

Citation
Shani Spivak and Melanie Tory. Fiction vs Friction: Challenges in Evaluating LLMs on Data Visualization Tasks. Human-centered Evaluation and Auditing of Language Models (HEAL) Workshop, 2025.

Khoury Vis Lab — Northeastern University
* West Village H, Room 302, 440 Huntington Ave, Boston, MA 02115, USA
* 100 Fore Street, Portland, ME 04101, USA
* Carnegie Hall, 201, 5000 MacArthur Blvd, Oakland, CA 94613, USA