SPARTA: Scalable and Principled Benchmark of Tree-Structured Multi-hop QA over Text and Tables

About SPARTA

SPARTA is a new benchmark for tree-structured multi-hop question answering (QA) over text and tables, addressing critical shortcomings of existing datasets such as HybridQA and OTT-QA: shallow reasoning, annotation errors, and limited scale. By constructing a unified reference fact database that merges source tables with grounding tables derived from unstructured passages, our end-to-end framework automatically generates thousands of high-fidelity QA pairs, requiring only a quarter of the usual annotation effort, while supporting advanced operations such as aggregation, grouping, and deeply nested predicates. Provenance-based refinement and realistic-structure enforcement ensure executable, semantically sound queries that mirror real-world complexity, spanning domains such as the NBA, movies, and medicine. On SPARTA, state-of-the-art models drop by over 30 F1 points, exposing gaps in cross-modal reasoning and paving the way for more robust QA systems.

Figure 1: Representative examples from the SPARTA benchmark showing multi-hop reasoning across tables and text with varying query structures.

Why SPARTA?

In recent years, benchmarks like HybridQA and OTT-QA have pioneered Table-Text QA. However, their reliance on manual curation has led to fundamental limitations: shallow reasoning, high annotation noise, and toy-scale data environments.

As LLMs evolve to handle increasingly complex analytical tasks, we present SPARTA, a scalable and principled framework designed to advance cross-modal reasoning through automated, high-fidelity benchmark construction.

SPARTA redefines the standards of Table-Text QA with three core advancements:

  • Real-World Scale: We move beyond tiny web tables (averaging 15 rows) to realistic settings, featuring relational data with up to thousands of rows.
  • Tree-Structured Multi-Hop Reasoning: Unlike existing benchmarks limited to linear, shallow chains, SPARTA synthesizes complex queries requiring tree-structured reasoning, including advanced operations like aggregation and grouping.
  • High Fidelity & Reliability: By replacing error-prone manual labeling (with error rates of 21-30% in existing benchmarks) with provenance-based refinement, SPARTA delivers executable, natural-sounding, and noise-free benchmarks at 4x the construction efficiency.

The challenge is significant: even state-of-the-art LLMs experience a performance drop of over 30 F1 points on SPARTA compared to previous benchmarks, highlighting the critical need for more robust and scalable evaluation.

Benchmark Comparison

SPARTA addresses critical limitations of existing Table-Text QA benchmarks

| Benchmark        | Avg. Rows | Question Gen. | Grouping/Having | Deep Multi-hop (>3) | Star-shape Query | Annotation Error* |
|------------------|-----------|---------------|-----------------|---------------------|------------------|-------------------|
| TAT-QA           | 9.4       | Manual        | ✗               | ✗                   | ✗                | 30%               |
| FinQA            | 6.4       | Manual        | ✗               | ✗                   | ✗                | 27%               |
| MultiHierTT      | 10.8      | Manual        | ✗               | ✗                   | ✗                | 26%               |
| HybridQA         | 15.7      | Manual        | ✗               | ✗                   | ✗                | 21%               |
| OTT-QA           | 15.7      | Manual        | ✗               | ✗                   | ✗                | 21%               |
| SPARTA (NBA)     | 3,280.5   | Auto (LLM)    | ✓               | ✓                   | ✓                | 0%                |
| SPARTA (Movie)   | 10,054.0  | Auto (LLM)    | ✓               | ✓                   | ✓                | 0%                |
| SPARTA (Medical) | 200.0     | Auto (LLM)    | ✓               | ✓                   | ✓                | 0%                |

*Annotation error measured over 100 sampled queries.

Key Contributions

Reference Fact Database

Unifies heterogeneous evidence (tables + text) inside a single relational store, making all facts uniformly queryable via SQL.

Provenance-Based Refinement

Uses "why-not provenance" to identify and fix overly selective predicates, ensuring every generated query returns valid results.

Realistic-Structure Enforcement

Constrains generation to post-order traversal of query graphs, producing human-like SQL that mirrors how analysts actually write queries.
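To make the idea concrete, here is a minimal illustrative sketch (not SPARTA's actual implementation) of emitting SQL via a post-order walk of a query tree: every child subquery is fully rendered before the operator that consumes it, mirroring how an analyst builds a query inside-out. The node types and example tables (`games`, `players`) are invented for illustration.

```python
# Hypothetical query-tree nodes; leaves are table scans, internal nodes
# are filters/joins. SQL is produced by a post-order traversal.
from dataclasses import dataclass, field

@dataclass
class Node:
    op: str                      # "scan", "filter", or "join"
    table: str = ""              # used by "scan"
    predicate: str = ""          # used by "filter" / "join"
    children: list = field(default_factory=list)

def to_sql(node: Node) -> str:
    # Post-order: render all children first, then wrap them.
    child_sql = [to_sql(c) for c in node.children]
    if node.op == "scan":
        return node.table
    if node.op == "filter":
        return f"(SELECT * FROM {child_sql[0]} WHERE {node.predicate})"
    if node.op == "join":
        return f"({child_sql[0]} JOIN {child_sql[1]} ON {node.predicate})"
    raise ValueError(f"unknown op: {node.op}")

tree = Node("filter", predicate="pts > 30",
            children=[Node("join", predicate="g.player_id = p.id",
                           children=[Node("scan", table="games g"),
                                     Node("scan", table="players p")])])
print(to_sql(tree))
# → (SELECT * FROM (games g JOIN players p ON g.player_id = p.id) WHERE pts > 30)
```

Constraining generation to this traversal order rules out structurally unnatural SQL, such as a predicate referencing a subtree that has not been introduced yet.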

Quality-Assured Generation

Achieves 0% annotation error rate over 100 sampled queries with lightweight human validation, compared to 21-30% error rates in existing benchmarks.

Framework Overview

Figure 2: Overview of SPARTA pipeline: (1) Reference Fact Database Construction, (2) Query Generation, (3) Question Verbalisation.

1. Reference Fact Database Construction

SPARTA unifies all evidence—structured and unstructured—into a single relational store called the reference fact database. Textual facts are decomposed into atomic propositions and stored as tuples, making them directly addressable via SQL alongside structured data.
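The following sketch illustrates the idea with an in-memory SQLite store; the schema and example facts are invented for illustration, not SPARTA's actual database. Atomic propositions extracted from passages become (subject, relation, object) tuples that sit beside a source table, so a single SQL query can join across both modalities.

```python
# Illustrative reference fact database: structured rows and textual
# facts live side by side and are uniformly addressable via SQL.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE players (name TEXT, team TEXT)")  # source table
conn.execute("CREATE TABLE text_facts (subject TEXT, relation TEXT, object TEXT)")  # grounding table from passages
conn.execute("INSERT INTO players VALUES ('LeBron James', 'Lakers')")
conn.execute("INSERT INTO text_facts VALUES ('LeBron James', 'born_in', 'Akron')")

# One SQL query now joins table evidence with textual evidence.
row = conn.execute(
    """SELECT p.team, f.object
       FROM players p JOIN text_facts f ON p.name = f.subject
       WHERE f.relation = 'born_in'"""
).fetchone()
print(row)  # → ('Lakers', 'Akron')
```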

2. Query Generation with Provenance-Based Refinement

An LLM generates SQL queries with controlled hop counts. Two safeguards ensure quality: provenance-based refinement fixes empty-result queries using "why-not" analysis, and realistic-structure enforcement follows post-order traversal for human-like SQL.
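A minimal sketch of the "why-not" diagnosis step, under assumed logic rather than the paper's exact algorithm: when a conjunctive query returns nothing, each predicate is removed in turn, and a predicate whose removal makes results reappear is flagged as the blocking condition for the LLM to relax. The table and predicates below are invented for illustration.

```python
# Hypothetical why-not diagnosis: find the predicate that empties the result.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE games (player TEXT, pts INTEGER, season INTEGER)")
conn.executemany("INSERT INTO games VALUES (?, ?, ?)",
                 [("Curry", 45, 2016), ("Curry", 31, 2016), ("Durant", 28, 2017)])

predicates = ["pts > 30", "season = 2016", "player = 'Durant'"]  # jointly unsatisfiable

def run(preds):
    where = " AND ".join(preds) if preds else "1=1"
    return conn.execute(f"SELECT * FROM games WHERE {where}").fetchall()

def blocking_predicates(preds):
    if run(preds):
        return []  # query already returns results; nothing to fix
    # A predicate is "blocking" if dropping it alone restores answers.
    return [p for p in preds if run([q for q in preds if q != p])]

print(blocking_predicates(predicates))  # → ["player = 'Durant'"]
```

The flagged predicate, together with the rows it excluded, is then handed back to the LLM as a relaxation hint instead of regenerating the query from scratch.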

3. Question Verbalisation

Each validated SQL query is converted to fluent natural language using AST-ICL, a state-of-the-art SQL-to-text model. Lightweight human validation ensures correctness with 75% less annotation time than traditional methods.

Provenance-Based Refinement

Our novel technique that identifies and fixes overly selective predicates using database provenance

Figure 3: Provenance-based refinement process. When a query returns empty results, the system traces the failure using "why-not provenance", identifies the blocking predicate, and guides the LLM to relax overly restrictive conditions.

Experimental Results

State-of-the-art models experience dramatic performance drops on SPARTA:

  • ODYSSEY drops from 69.5 F1 on HybridQA to 35.6 F1 on SPARTA.
  • HProPro drops from 70.5 F1 on HybridQA to 40.4 F1 on SPARTA.

Figure 4: F1 scores across query tree configurations. Performance degrades as depth and breadth increase.

Figure 5: F1 scores across analytical operations. Models struggle with GROUP BY, ORDER BY, and aggregation.

Figure 6: Query naturalness evaluation comparing different generation methods. Our execution-guided approach with post-order traversal achieves the highest naturalness scores.

Case Study

Figure 7: Case study illustrating typical failure modes of current Table-Text QA models on SPARTA benchmark questions.

BibTeX

@inproceedings{park2026sparta,
  title={{SPARTA}: Scalable and Principled Benchmark of Tree-Structured Multi-hop {QA} over Text and Tables},
  author={Sungho Park and Jueun Kim and Wook-Shin Han},
  booktitle={The Fourteenth International Conference on Learning Representations},
  year={2026},
  url={https://openreview.net/forum?id=8KE9qvKhM4}
}