Vision-and-Language Training Helps Deploy Taxonomic Knowledge but Does Not Fundamentally Alter It

1Boston University, 2University of Amsterdam, 3Umeå Universitet, 4Vrije Universiteit Amsterdam,
5Toyota Technological Institute at Chicago

*,† indicate equal contribution
Overview of our experimental approach comparing VLMs and LMs on a text-only QA task derived from VQA.

We convert visual scenes into rich text descriptions to create our TaxonomiGQA dataset. Both the Vision-Language Model (VLM) and its base Language Model (LM) are then evaluated on this purely text-based QA task, which requires understanding of taxonomic relations. This allows for a controlled comparison of how vision training affects taxonomic understanding.

Abstract

Does vision-and-language (VL) training change the linguistic representations of language models in meaningful ways? Most results in the literature have shown inconsistent or marginal differences, both behaviorally and representationally. In this work, we start from the hypothesis that the domain in which VL training could have a significant effect is lexical-conceptual knowledge, in particular its taxonomic organization. Through comparing minimal pairs of text-only LMs and their VL-trained counterparts, we first show that the VL models often outperform their text-only counterparts on a text-only question-answering task that requires taxonomic understanding of concepts mentioned in the questions. Using an array of targeted behavioral and representational analyses, we show that the LMs and VLMs do not differ significantly in terms of their taxonomic knowledge itself, but they differ in how they represent questions that contain concepts in a taxonomic relation vs. a non-taxonomic relation. This implies that the taxonomic knowledge itself does not change substantially through additional VL training, but VL training does improve the deployment of this knowledge in the context of a specific task, even when the presentation of the task is purely linguistic.

Our Contributions

We investigate how vision-and-language training impacts the taxonomic knowledge of language models. To this end, we make the following contributions:

  • TaxonomiGQA: We introduce a new text-only question-answering dataset derived from GQA that isolates and tests taxonomic understanding. The dataset includes 148,020 examples covering 126 hypernym chains, and tests for robust hierarchical knowledge by pairing each positive example with multiple hard negatives.
  • Systematic comparison of minimal LM–VLM pairs: We evaluate seven minimal LM-VLM pairs, where each VLM is built on top of its LM counterpart. In six out of seven pairs, VLMs consistently outperform LMs on this task.
  • Distinguishing knowledge from deployment: Using behavioral and representational analyses, we find that while the underlying taxonomic knowledge between VLMs and LMs is not significantly different, VLMs are substantially better at deploying that knowledge in task contexts.
  • Effect of visual similarity and cohesion: We show preliminary evidence that the performance gains in VLMs can be explained by visual similarity and cohesion between concepts in the taxonomy, suggesting that visual grounding may be a factor underlying the improvement in VLMs. We furthermore find that visual similarity helps more in cases where category members are more similar to the superordinate category.

TaxonomiGQA Dataset

We apply a three-step transformation to GQA to create TaxonomiGQA:

  1. Scene Descriptions: Convert scene graphs into textual descriptions of the scene programmatically using hand-crafted templates.
  2. Hypernym Substitution: Replace objects in questions with hypernyms sampled from their WordNet-derived hierarchy.
  3. Negative Sampling: Create four hard negative samples per question by replacing the target term with a non-hypernym that is absent from the scene.
Our three-step pipeline for generating the TaxonomiGQA dataset.
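
For concreteness, here is a minimal sketch of the three steps. The data structures, templates, and sampling details below are hypothetical simplifications; the actual pipeline uses hand-crafted templates and additional question filtering (see Appendix C of the paper).

```python
# Minimal sketch of the three-step pipeline; data structures and templates are
# illustrative assumptions, not the released code.
import random

def describe_scene(scene_graph):
    """Step 1: render a GQA scene graph as templated text (toy templates)."""
    sentences = []
    for obj in scene_graph["objects"]:
        attrs = " ".join(obj.get("attributes", []))
        name = f"{attrs} {obj['name']}".strip()
        sentences.append(f"There is a {name}.")
        for relation, target in obj.get("relations", []):
            sentences.append(f"The {obj['name']} is {relation} the {target}.")
    return " ".join(sentences)

def substitute_hypernym(question, target, hypernym):
    """Step 2: replace the target object word with a hypernym sampled from its
    WordNet-derived chain (e.g., 'cat' -> 'animal')."""
    return question.replace(target, hypernym)

def sample_negatives(question, target, taxonomy, scene_objects, k=4):
    """Step 3: build k hard negatives by swapping in concepts that are neither
    hypernyms of the target nor present in the scene."""
    ancestors = set(taxonomy[target]) | {target}
    candidates = [c for c in taxonomy if c not in ancestors and c not in scene_objects]
    return [question.replace(target, neg) for neg in random.sample(candidates, k)]
```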

Dataset Summary

🔹 Total Instances: 148,020
🔹 Unique Scenes: 1,342
🔹 Positive Samples: 29,604
🔹 Negative Samples: 118,416 (4 per positive)
🔹 Hyponym-Hypernym Pairs: 276
🔹 Hypernym Chains: 126
🔹 Unique Hypernyms: 88
🔹 Top-Level Categories: 24

More details on question filtering and negative sampling can be found in Appendix C of the paper.

Experiments & Key Findings

We evaluated seven minimal pairs of LMs and their VLM counterparts, where each pair shares the same base language model.

VLMs Consistently Outperform LMs on Text-Only QA

VLMs achieved higher scores than LMs on all three evaluation metrics: Overall Accuracy, Conditional Accuracy, and Hierarchical Consistency. This held across nearly all model pairs (with the exception of Vicuna vs. Llava-1.5), showing that VL training yields improvements on _text-only_ QA that requires sensitivity to taxonomic knowledge.

VLM vs. LM performance on TaxonomiGQA

Performance of VLM-LM model pairs on TaxonomiGQA and TAXOMPS. Points above the diagonal indicate cases where the VLM outperforms the LM.
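
For reference, the sketch below gives one plausible reading of the three metrics; the field names and the exact conditioning (e.g., how negatives enter hierarchical consistency) are our assumptions, and the paper gives the precise definitions.

```python
# Hedged sketch of the three metrics; field names and the precise conditioning
# are assumptions (see the paper for the exact definitions).

def overall_accuracy(results):
    """Accuracy over all instances, positive and negative alike."""
    return sum(r["correct"] for r in results) / len(results)

def conditional_accuracy(results):
    """Accuracy on hypernym-substituted questions, restricted to cases where the
    model answered the original (unsubstituted) question correctly."""
    eligible = [r for r in results if r["substituted"] and r["original_correct"]]
    return sum(r["correct"] for r in eligible) / len(eligible)

def hierarchical_consistency(chains):
    """Fraction of hypernym chains for which every derived question (all
    substitutions along the chain, plus their negatives) is answered correctly."""
    return sum(all(r["correct"] for r in chain) for chain in chains) / len(chains)
```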

To explain these observations, we put forth two hypotheses:

H1: VLMs' taxonomic knowledge aligns better with the reference taxonomy

To test whether VL training alters the underlying taxonomic knowledge itself, we introduced TAXOMPS (taxonomic minimal pairs), a QA dataset constructed on top of our taxonomy, which directly asks whether one concept is a kind of another (e.g., “Is a cat an animal?”) along with four minimal pair questions (e.g., "Is a cat a vegetable?"). Both LMs and VLMs performed similarly well, indicating that their underlying taxonomic knowledge remains largely unchanged by VL training.
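
A sketch of how such minimal-pair questions can be generated from the taxonomy follows; the templates and negative sampling here are simplified assumptions rather than the paper's exact construction.

```python
# Sketch of TAXOMPS-style question generation; templates and sampling are
# simplified assumptions.
import random

def article(noun):
    return "an" if noun[0].lower() in "aeiou" else "a"

def taxomps_questions(hyponym, taxonomy, n_negatives=4):
    """For each hypernym of `hyponym`, ask the positive 'is-a' question plus
    n_negatives minimal-pair questions built from non-hypernyms."""
    ancestors = taxonomy[hyponym]                       # e.g. ["mammal", "animal", ...]
    non_ancestors = [c for c in taxonomy if c not in ancestors and c != hyponym]
    questions = []
    for h in ancestors:
        questions.append((f"Is {article(hyponym)} {hyponym} {article(h)} {h}?", "yes"))
        for neg in random.sample(non_ancestors, n_negatives):
            questions.append((f"Is {article(hyponym)} {hyponym} {article(neg)} {neg}?", "no"))
    return questions
```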

Performance of VLM vs LM on the TAXOMPS benchmark

Performance on TAXOMPS. Most model pairs perform similarly well, with points clustered on the diagonal.

We further confirmed this through Representational Similarity Analysis (RSA) over hierarchically sensitive representations extracted from each LM and its VLM counterpart (using the method developed by Park et al., 2024). We found no significant difference between these representations in any LM–VLM pair.
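
As an illustration, a generic RSA comparison between two sets of concept representations looks roughly like the sketch below; this is not the Park et al. (2024) procedure for extracting hierarchically sensitive representations, only the comparison step that follows it.

```python
# Generic RSA sketch: compare the representational geometry of two models over
# the same, identically ordered set of concepts. Not the exact paper pipeline.
import numpy as np
from scipy.spatial.distance import pdist
from scipy.stats import spearmanr

def rsa_similarity(lm_reps: np.ndarray, vlm_reps: np.ndarray) -> float:
    """lm_reps, vlm_reps: (n_concepts, dim) arrays with rows aligned by concept."""
    lm_rdm = pdist(lm_reps, metric="cosine")     # condensed pairwise dissimilarities
    vlm_rdm = pdist(vlm_reps, metric="cosine")
    rho, _ = spearmanr(lm_rdm, vlm_rdm)          # rank-correlate the two geometries
    return rho
```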

H2: VL Training improves the deployment of taxonomic knowledge in a specific task context

While VLMs' underlying knowledge was found to be similar to that of LMs, they show a clear advantage in how they deploy that knowledge in task contexts. Two key analyses, performed on the Qwen 2.5 pair (the pair showing the most salient difference in TaxonomiGQA performance), support this:

  • Contextualized Similarity Predicts Behavior: In Qwen2.5-VL, the contextualized similarity between hyponyms and their hypernyms was significantly more predictive of correct task performance (measured using odds ratios) than in its LM counterpart, Qwen2.5.
Odds ratios for contextual similarity predicting accuracy

Odds ratios from logistic regression: contextualized similarity is associated with accuracy more strongly in Qwen2.5-VL than its base LM.

  • More Distinct Question Representations: PCA of question embeddings showed that the VLM more clearly separates taxonomic from non-taxonomic substitutions. A linear SVM trained on the top 2 PCs had a lower error rate for the VLM (0.36) than for the LM (0.52), indicating clearer task-relevant distinctions (sketches of both analyses follow the figure below).
Contextualized Representational Analysis

Contextualized representational analysis on Qwen2.5-I (LM) and Qwen2.5-VL-I (VLM).
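
The sketches below illustrate both analyses under assumptions about the inputs: `sim` holds per-item contextualized hyponym–hypernym similarities, `correct` the binary task outcomes, and `Q` stacks question embeddings with `is_taxonomic` marking taxonomic vs. non-taxonomic substitutions. Variable names and the exact regression/probing setup are ours, not the paper's released code.

```python
# Hedged sketches of the two deployment analyses; names and setup are assumptions.
import numpy as np
import statsmodels.api as sm
from sklearn.decomposition import PCA
from sklearn.svm import LinearSVC

def similarity_odds_ratio(sim: np.ndarray, correct: np.ndarray) -> float:
    """Logistic regression of item correctness on contextualized similarity;
    exponentiating the slope gives an odds ratio like those plotted above."""
    fit = sm.Logit(correct, sm.add_constant(sim)).fit(disp=0)
    return float(np.exp(fit.params[1]))

def pc_separability_error(Q: np.ndarray, is_taxonomic: np.ndarray) -> float:
    """Project question embeddings onto the top 2 PCs and check how well a
    linear SVM separates taxonomic from non-taxonomic substitutions."""
    pcs = PCA(n_components=2).fit_transform(Q)
    svm = LinearSVC().fit(pcs, is_taxonomic)
    return 1.0 - svm.score(pcs, is_taxonomic)    # classification error on the 2-D projection
```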

Why Vision Might Help: The Role of Visual Similarity

To understand why VL training improves taxonomic reasoning in text-only settings, we explored whether visual similarity between related concepts helps models apply their knowledge more effectively.

  • Visual similarity predicts VLM accuracy much better than it does LM accuracy: For each hypernym-hyponym pair, we computed visual similarity using embeddings from the VLM’s vision encoder on independent images from the THINGS dataset. Higher similarity was associated with higher VLM conditional accuracy on corresponding taxonomic questions (b = 0.52, SE = 0.19, p < .01). No such effect was found for LMs (b = 0.23, SE = 0.17, p = 0.18).
  • The effect varies substantially with individual concepts' visual cohesion: The strength of the visual-similarity effect varied significantly across concepts. Concepts whose members are more visually cohesive (e.g., band, stick) showed stronger visual-similarity benefits, while broad or heterogeneous categories (e.g., animal, vertebrate) showed weaker effects (a sketch of the similarity measure follows the figure below).
Hypernym-specific random effects of image-similarity

Effect of visual similarity on VLM performance across concepts. Bar color reflects visual cohesion (darker = higher cohesion).
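
The sketch below shows one way to compute the per-pair visual similarity; the pooling choice and how a hypernym prototype is formed are our assumptions, while the use of the VLM's vision encoder on THINGS images follows the paper. The resulting similarity is then used as a predictor of conditional accuracy, with hypernym-specific (random) effects giving the concept-level variation shown above.

```python
# Sketch of the image-based similarity measure; prototype construction is an
# assumption (e.g., a hypernym prototype might pool images of its member concepts).
import numpy as np

def concept_prototype(image_embeddings: np.ndarray) -> np.ndarray:
    """Mean-pool vision-encoder embeddings of a concept's THINGS images."""
    return image_embeddings.mean(axis=0)

def visual_similarity(hyponym_images: np.ndarray, hypernym_images: np.ndarray) -> float:
    """Cosine similarity between the hyponym and hypernym visual prototypes."""
    a, b = concept_prototype(hyponym_images), concept_prototype(hypernym_images)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
```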

These results suggest that VL training may help models exploit visual regularities in concept hierarchies, especially when category members share similar visual features.

Conclusion & Future Work

By developing TaxonomiGQA, a QA dataset that requires sensitivity to taxonomic knowledge, we showed that VLMs outperformed their LM counterparts across all metrics, despite TaxonomiGQA being purely text-based. We further showed that vision-and-language training does not fundamentally change a model's taxonomic knowledge; instead, it improves how the model applies that knowledge when solving tasks.

Through a series of behavioral and representational analyses, we find that VLMs form stronger contextual links between related concepts and draw the distinction between taxonomic and non-taxonomic relations more sharply in downstream task contexts. This advantage appears linked to the visual similarity and cohesion among concepts in the taxonomic hierarchy.

Looking ahead: our analyses are correlational in nature. A key direction for future work is establishing causal links between VL training and task-specific deployment behavior, potentially by analyzing training data, probing internal objectives, or manipulating visual features during pretraining. There is also room to explore whether VLMs encode non-linear taxonomic distinctions that aren’t captured by the linear separability analyses used here.

BibTeX

@article{qin2025vision,
  title={Vision-and-Language Training Helps Deploy Taxonomic Knowledge but Does Not Fundamentally Alter It},
  author={Qin, Yulu and Varghese, Dheeraj and Lindstr{\"o}m, Adam Dahlgren and Donatelli, Lucia and Misra, Kanishka and Kim, Najoung},
  journal={arXiv preprint arXiv:2507.13328},
  year={2025}
}