Charts remain a bit of a mystery to Large Language Models (LLMs).
An LLM, on its own, has no idea what “up” or “down” means on a chart, much less what it signifies. Today’s top-tier models do a reasonably good job of distinguishing a bar chart from a pie chart, and can pull some information from the accompanying text.
But they cannot interpret a chart in any computational way. Doing so requires reasoning across visual patterns, numerical data and text.
Researchers from the Massachusetts Institute of Technology and IBM have created ChartNet, a multimodal dataset designed to help models better understand, programmatically, the content of charts they find in the wild. Their approach also points to how specialized datasets can help models outperform larger rivals.
When used to fine-tune IBM’s 2-billion-parameter open source Granite Vision model, ChartNet boosted the model’s performance beyond that of OpenAI’s GPT-4o, with its estimated 200 billion parameters. The fine-tuned model showed superior results in data extraction, data summarization and some code reconstruction tasks.
A LLaVA 7B model fine-tuned on ChartNet also beat GPT-4o on querying charts (“Q&A”).
What Is Up?
For humans, charts are intuitive, at least when done well. They structure data visually so the human eye can easily read it as trends, distributions and relationships.
But AI needs to be visually grounded. A model does not see a graph as a graph, but rather as a grid of adjoining tokens.
The process starts with positional encoding, a mathematical signature attached to each token. This is what the model needs in order to understand that data points in, say, the top half of the grid correspond to the concept of “up.”
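As a rough illustration of the idea, the sketch below attaches a sinusoidal signature to each cell of a small grid. It is a minimal sketch, not the implementation used by ChartNet or by any particular model: the function names, the 8-value signature size and the choice of the sinusoidal scheme are all assumptions made for the example, and many vision models learn their position signatures during training instead.

```python
import math


def sinusoidal_encoding(position: int, dim: int) -> list:
    """Encode one integer position as `dim` sine/cosine values."""
    if dim % 2 != 0:
        raise ValueError("dim must be even")
    encoding = []
    for i in range(0, dim, 2):
        # Each sine/cosine pair uses a different wavelength.
        angle = position / (10000 ** (i / dim))
        encoding.append(math.sin(angle))
        encoding.append(math.cos(angle))
    return encoding


def patch_position_encoding(row: int, col: int, dim: int = 8) -> list:
    """Signature for one grid cell: first half encodes the row, second half the column."""
    half = dim // 2
    return sinusoidal_encoding(row, half) + sinusoidal_encoding(col, half)


def encode_grid(rows: int, cols: int, dim: int = 8) -> dict:
    """Map every (row, col) cell of the grid to its positional signature."""
    return {
        (r, c): patch_position_encoding(r, c, dim)
        for r in range(rows)
        for c in range(cols)
    }


# Row 0 is treated as the top of the image in this sketch.
grid = encode_grid(rows=2, cols=3)
```

Cells in the same row share the first half of their signature, and cells in the same column share the second half. That consistency is what gives a model something stable to associate with "higher" or "further right."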
ChartNet was designed to give models a way to programmatically understand all the different ways charts can represent data. It has 1.5 million synthetic tuples (or sets of data) that cover the variations within 24 chart types. About 96,000 of them have been verified by humans. The package also includes 30,000 real-world charts from reputable publishers.
Through transfer learning, a model can learn from these tuples the core mathematical and spatial relationships found in charts in general, knowledge it can then apply to other charts that come in from the outside.
This approach works much better than guessing at pixels.
A Mechanical Translator
Instead of purloining more charts from the wild, the researchers took the novel approach of synthetically generating chart data via an LLM.
To do this, they returned to the basics, calling on Python chart plotters to supply the “structured intermediate representation” of how charts are built.
The researchers created a code-guided automatic chart generation pipeline for building tuples, bundles of multimodal data that describe synthetically generated charts.
A ChartNet tuple includes a chart image, the Python plotting code that generated the chart, JSON tables of the data used in the chart, a text summary of the chart, and a set of QA pairs (practice exams).
The JSON table with the data is particularly important, because the LLM does not have to rely on a fuzzy image of a line or a pie slice to interpret the numbers.
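To make the shape of such a tuple concrete, here is a minimal sketch of a container holding the five components listed above. The class name, field names, file name and revenue figures are all hypothetical and are not taken from ChartNet's actual schema. The plotting code is stored as a string and not executed, so the sketch runs without any plotting library installed.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class ChartTuple:
    """Hypothetical container for the five components the article lists."""
    image_path: str      # where the rendered chart image would be stored
    plotting_code: str   # Python source that draws the chart
    data_table: list     # rows of the underlying data, JSON-serializable
    summary: str         # short text description of the chart
    qa_pairs: list = field(default_factory=list)  # (question, answer) pairs

    def to_json(self) -> str:
        return json.dumps(asdict(self), indent=2)


def build_example(rows):
    """Derive the code, summary and Q&A from the data table, not from pixels."""
    labels = [row["quarter"] for row in rows]
    values = [row["revenue"] for row in rows]
    best = max(rows, key=lambda row: row["revenue"])
    plotting_code = (
        "import matplotlib.pyplot as plt\n"
        f"plt.bar({labels!r}, {values!r})\n"
        "plt.savefig('chart.png')\n"
    )
    return ChartTuple(
        image_path="chart.png",
        plotting_code=plotting_code,
        data_table=rows,
        summary=f"Revenue by quarter; {best['quarter']} is highest at {best['revenue']}.",
        qa_pairs=[("Which quarter had the highest revenue?", best["quarter"])],
    )


# Made-up data, used only to exercise the sketch.
ROWS = [
    {"quarter": "Q1", "revenue": 10},
    {"quarter": "Q2", "revenue": 14},
    {"quarter": "Q3", "revenue": 9},
]
example = build_example(ROWS)
```

Because the answer to the practice question is computed from the table, it is exact by construction. That is the advantage of having the underlying data travel with the image.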
A model creates more than a dozen augmentations of each chart, in different formats and with varying data. A pipeline spanning multiple GPUs produced over a million annotated samples every 170 hours.
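The sketch below shows what producing a batch of variants from one seed chart could look like. In ChartNet a model generates the augmentations; this example swaps in a much simpler rule-based stand-in that cycles through chart types and jitters the values by up to 20 percent. The chart type list, the jitter range and the function name are assumptions chosen purely for illustration.

```python
import random

# Illustrative subset only; the article says ChartNet covers 24 chart types.
CHART_TYPES = ["bar", "line", "scatter", "pie"]


def augment(values, n_variants=13, seed=0):
    """Return n_variants variations of one chart: a new type plus jittered data."""
    rng = random.Random(seed)  # seeded so the output is reproducible
    variants = []
    for i in range(n_variants):
        chart_type = CHART_TYPES[i % len(CHART_TYPES)]
        # Scale each value by a random factor between 0.8 and 1.2.
        jittered = [round(v * rng.uniform(0.8, 1.2), 2) for v in values]
        variants.append({"chart_type": chart_type, "values": jittered})
    return variants


variants = augment([10, 14, 9])
```

Each variant would then be rendered and annotated like any other tuple, which is how a modest number of seed charts can fan out into a much larger training set.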
Specialized Datasets
In short, ChartNet shows that code-aligned multimodal supervision can unlock the mysteries of charts for LLMs.
In testing, ChartNet was used to fine-tune five different models, ranging from 1 billion to 7 billion parameters. They were challenged on four tasks: chart-to-code reconstruction, extracting data from the chart, chart summarization, and answering queries about the chart. The researchers found that ChartNet improved performance on these tasks across all five models.
Models trained on ChartNet also fared well in two chart benchmarks, ChartMimic for chart reconstruction and ChartCap for chart summarization.
Beyond charts themselves, this approach illustrates that the key to data visualization and document intelligence lies not in mindlessly increasing the size of a model, but in training it on highly structured, strategically aligned datasets.

