ViperGPT: Visual Inference via Python Execution for Reasoning

Dídac Surís*, Sachit Menon*, Carl Vondrick
*Equal contribution
Columbia University

ViperGPT decomposes visual queries into interpretable steps.


Answering visual queries is a complex task that requires both visual processing and reasoning. End-to-end models, the dominant approach for this task, do not explicitly differentiate between the two, limiting interpretability and generalization. Learning modular programs presents a promising alternative, but has proven challenging due to the difficulty of learning both the programs and modules simultaneously. We introduce ViperGPT, a framework that leverages code-generation models to compose vision-and-language models into subroutines to produce a result for any query. ViperGPT utilizes a provided API to access the available modules, and composes them by generating Python code that is later executed. This simple approach requires no further training, and achieves state-of-the-art results across various complex visual tasks.

How does it work?

A brief explanation of ViperGPT.
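The pipeline can be sketched in a few lines of Python. The names `ImagePatch`, `find`, `exists`, and `execute_command` follow the spirit of the API described in the paper, but everything below is a standalone mock: the "detector" is a dictionary of hand-written boxes, and the query and detections are invented for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

@dataclass
class ImagePatch:
    """Minimal stand-in for the API ViperGPT exposes to the code model.
    In the real system, find() would call an object detector."""
    left: int = 0
    lower: int = 0
    right: int = 100
    upper: int = 100
    # Mocked detections: object name -> list of (left, lower, right, upper).
    objects: Dict[str, List[Tuple[int, int, int, int]]] = field(default_factory=dict)

    def find(self, name: str) -> List["ImagePatch"]:
        # Return one patch per stored "detection" of `name`.
        return [ImagePatch(*box) for box in self.objects.get(name, [])]

    def exists(self, name: str) -> bool:
        return len(self.find(name)) > 0

# A program in the style of what the code model might generate for the
# query "How many muffins are there?" — it composes API calls, and the
# Python interpreter executes the result step by step.
def execute_command(image: ImagePatch) -> str:
    muffins = image.find("muffin")
    return str(len(muffins))

image = ImagePatch(objects={"muffin": [(0, 0, 10, 10), (20, 0, 30, 10)]})
print(execute_command(image))  # prints "2"
```

Because the generated program is ordinary Python, each intermediate value (here, the list of muffin patches) can be inspected, which is what makes the reasoning interpretable.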

Logical Reasoning

ViperGPT can perform logical operations because it directly executes Python code.
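A toy sketch of what this means in practice: the scene and the `exists` helper below are mocked (a real system would query a vision model), but the boolean logic is the point — it is evaluated exactly by the interpreter.

```python
# Illustrative only: `scene` stands in for a (mocked) detector's output.
scene = {"umbrella": True, "raincoat": False}

def exists(name: str) -> bool:
    return scene.get(name, False)

# Generated-style code for "Is there an umbrella or a raincoat?":
# the logical OR is ordinary Python, not approximated by a network.
answer = "yes" if exists("umbrella") or exists("raincoat") else "no"
print(answer)  # prints "yes"
```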

Spatial Understanding

We show ViperGPT's spatial understanding.
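Spatial questions reduce to arithmetic on bounding-box coordinates. In this hypothetical sketch the detections are hard-coded `(name, left_x)` pairs rather than real detector output:

```python
# Mocked detections as (name, left_x) pairs; a real system would get
# bounding boxes from an object detector.
detections = [("cup", 40), ("plate", 10), ("fork", 75)]

# Generated-style code for "What is the leftmost object?": "leftmost"
# becomes an explicit comparison of x-coordinates.
leftmost = min(detections, key=lambda d: d[1])[0]
print(leftmost)  # prints "plate"
```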


External Knowledge

ViperGPT can access the knowledge of large language models.
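A sketch of how perception and world knowledge can be mixed in one program. `llm_query` mirrors the paper's hook into a language model, but here it is a canned lookup so the example runs offline; the landmark string stands in for a visual query result.

```python
# `llm_query` stands in for a call to a language model; mocked with a
# canned answer so the sketch is self-contained.
def llm_query(question: str) -> str:
    canned = {"What country is the Eiffel Tower in?": "France"}
    return canned.get(question, "unknown")

# Generated-style code: perception (mocked as a fixed string) supplies
# the landmark, and the LLM supplies the world knowledge.
landmark = "Eiffel Tower"  # pretend this came from a visual query
answer = llm_query(f"What country is the {landmark} in?")
print(answer)  # prints "France"
```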


Consistency

ViperGPT answers similar questions with consistent reasoning.


Arithmetic

ViperGPT can count and divide, all using Python.
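Counting and division are exact once they are Python operations rather than neural estimates. The detections below are invented stand-ins for what `find` would return:

```python
# Mocked detector output; real counts would come from find() calls.
cookies = ["cookie"] * 6
children = ["child"] * 3

# Generated-style code for "How many cookies does each child get?":
# counting is len() and sharing is integer division, both exact.
per_child = len(cookies) // len(children)
print(per_child)  # prints 2
```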


Attributes

Examples of ViperGPT reasoning about object attributes.
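Attribute reasoning is filtering: query each patch for a property, then keep the matches. In this hypothetical sketch, the per-patch attribute answers are hard-coded instead of coming from a visual question-answering module:

```python
# Mocked attribute answers; the real system would ask a VQA module
# something like "What color is this?" for each patch.
patches = [{"name": "car", "color": "red"}, {"name": "car", "color": "blue"}]

# Generated-style code for "How many red cars are there?": attribute
# filtering is a plain list comprehension over per-patch queries.
red_cars = [p for p in patches if p["color"] == "red"]
print(len(red_cars))  # prints 1
```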

Relational Reasoning

ViperGPT reasons explicitly about relations between objects.
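Relations like "next to" become explicit geometric predicates on detected patches. The box centers below are invented for illustration:

```python
# Mocked box centers; real coordinates come from detected patches.
centers = {"mug": (12, 30), "laptop": (14, 31), "window": (90, 5)}

def distance(a: str, b: str) -> float:
    (ax, ay), (bx, by) = centers[a], centers[b]
    return ((ax - bx) ** 2 + (ay - by) ** 2) ** 0.5

# Generated-style code for "What is next to the laptop?": the relation
# is an explicit nearest-neighbor search over box centers.
nearest = min((n for n in centers if n != "laptop"),
              key=lambda n: distance(n, "laptop"))
print(nearest)  # prints "mug"
```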


Negation

Negation is programmatic, not neural.
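A minimal illustration of the point: `not` is evaluated by the interpreter, so it cannot be fudged the way negation often is inside an end-to-end model. The scene and `exists` helper are mocked:

```python
# Mocked scene contents; the real check would call a detector.
scene = {"dog", "ball"}

def exists(name: str) -> bool:
    return name in scene

# Generated-style code for "Is there no cat in the image?": the `not`
# is executed exactly by Python.
answer = "yes" if not exists("cat") else "no"
print(answer)  # prints "yes"
```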

Temporal Reasoning

ViperGPT can navigate through videos to understand them.
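Treating a video as an ordered sequence of frames turns temporal questions into searches over time. Here each frame's detections are mocked as a dictionary; the real system would query vision models frame by frame:

```python
# A video as a list of per-frame mocked detections.
frames = [{"door": "closed"}, {"door": "closed"}, {"door": "open"}]

# Generated-style code for "When does the door open?": iterating over
# frames in order makes the temporal question a simple linear search.
opened_at = next(i for i, f in enumerate(frames) if f["door"] == "open")
print(opened_at)  # prints 2
```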

More Results

Related Work


BibTeX

@article{surismenon2023vipergpt,
    title={ViperGPT: Visual Inference via Python Execution for Reasoning},
    author={D\'idac Sur\'is and Sachit Menon and Carl Vondrick},
    journal={Proceedings of IEEE International Conference on Computer Vision (ICCV)},
    year={2023}
}