Research notes about AGI

Cross-attention is where many multi-modal systems reveal their actual architectural priorities.

Why I care about it

Papers often describe the model as a clean fusion of visual and textual streams, but the real behavior depends on where cross-attention is inserted, how often it appears, and whether it is symmetric.

Those choices determine whether the model is built for efficient grounding, rich fusion, or lightweight adaptation.
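To make the symmetry question concrete, here is a minimal sketch of single-head cross-attention in plain Python. It assumes identity projections (no learned query, key, or value weights) and tiny hand-written token vectors, so it illustrates the data flow rather than any particular model. The names cross_attend, text_tokens, and vision_tokens are mine.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attend(queries, keys_values):
    """Single-head cross-attention with identity projections (an assumption).

    Each query is replaced by a softmax-weighted average of the key/value
    vectors, so the output always has as many rows as `queries`.
    """
    dim = len(queries[0])
    scale = 1.0 / math.sqrt(dim)
    outputs = []
    for q in queries:
        # Scaled dot-product score between this query and every key.
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys_values]
        weights = softmax(scores)
        # Weighted average of the value vectors, one coordinate at a time.
        mixed = [sum(w * v[i] for w, v in zip(weights, keys_values))
                 for i in range(dim)]
        outputs.append(mixed)
    return outputs


# Toy inputs: 2 text tokens and 3 vision tokens, each of dimension 2.
text_tokens = [[1.0, 0.0], [0.0, 1.0]]
vision_tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

# One-directional fusion: text reads from vision; vision is left untouched.
text_updated = cross_attend(text_tokens, vision_tokens)

# A symmetric design would also run the reverse direction.
vision_updated = cross_attend(vision_tokens, text_tokens)
```

The one-directional call changes only the text stream. A symmetric design pays for a second attention pass and has to keep both updated streams around.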

A useful reading frame

When I look at a new architecture, I usually ask:

Is the language model attending to vision tokens, or is there an intermediate bridge module?
Does fusion happen early, late, or repeatedly through the stack?
Are visual tokens preserved at full granularity, pooled, or queried selectively?
Is cross-attention trainable end to end, or only through adapters?

The answers to these questions explain more than an architecture diagram alone.
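The early, late, or repeated question can be written down as a layer-placement rule. The sketch below is only a note-taking aid: the function cross_attention_layers, its placement labels, and the every parameter are hypothetical and do not come from any specific paper or library.

```python
def cross_attention_layers(num_layers, placement, every=4):
    """Return the indices of decoder layers that get a cross-attention block.

    placement is a hypothetical label:
      "early"       -> first layer only
      "late"        -> last layer only
      "interleaved" -> every `every`-th layer through the stack
    """
    if num_layers <= 0:
        raise ValueError("num_layers must be positive")
    if placement == "early":
        return [0]
    if placement == "late":
        return [num_layers - 1]
    if placement == "interleaved":
        if every <= 0:
            raise ValueError("every must be positive")
        return list(range(every - 1, num_layers, every))
    raise ValueError(f"unknown placement: {placement!r}")


# Repeated fusion in a 12-layer stack, one fusion block per 4 layers.
repeated = cross_attention_layers(12, "interleaved", every=4)
```

Recording the answer as a list of layer indices makes two architectures easy to compare at a glance, even when their diagrams look similar.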

Common tradeoff

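Heavy cross-attention, meaning many fusion layers over full-resolution visual tokens, improves fine-grained grounding, but its cost grows with the product of the text and visual token counts at every fusion layer, and it can make training unstable. Lightweight bridges reduce cost but may bottleneck information flow before reasoning even begins.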
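A rough way to see the tradeoff is to count query-key score computations. The sketch below compares text attending directly to all vision tokens against a bridge that first compresses vision into a fixed set of latent queries. The function names, the single-pass bridge, and the token counts are illustrative assumptions, and the count ignores projections, heads, and feed-forward layers.

```python
def score_count(num_queries, num_keys, num_layers):
    """Query-key score computations summed over layers (per head)."""
    return num_queries * num_keys * num_layers


def direct_cost(text_len, vision_len, fusion_layers):
    """Text attends to every vision token in each fusion layer."""
    return score_count(text_len, vision_len, fusion_layers)


def bridged_cost(text_len, vision_len, fusion_layers, num_latents):
    """A bridge compresses vision into `num_latents` tokens once (assumed
    to be a single attention pass), then text attends to the latents."""
    compress = score_count(num_latents, vision_len, 1)
    read = score_count(text_len, num_latents, fusion_layers)
    return compress + read


# Illustrative sizes only, not taken from any specific model.
heavy = direct_cost(text_len=128, vision_len=576, fusion_layers=8)
light = bridged_cost(text_len=128, vision_len=576, fusion_layers=8,
                     num_latents=64)
savings_ratio = heavy / light
```

With these made-up sizes the bridge needs several times fewer score computations, but every later layer sees only 64 latent tokens instead of 576 visual tokens, which is exactly where the bottleneck risk comes from.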

Conclusion

Understanding a multi-modal architecture usually starts with understanding how cross-attention is doing the actual work of alignment.
