Understanding Cross-Attention in Multi-Modal Architectures
A short framework for reading cross-attention design choices in modern vision-language model papers.
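As a concrete reference point for the design choices the framework covers, here is a minimal sketch of scaled dot-product cross-attention in which text tokens (queries) attend to visual tokens (keys/values). All names, shapes, and the random stand-in weights are illustrative assumptions, not taken from any particular paper.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(text_hidden, image_hidden, d_k=64, seed=0):
    """Illustrative single-head cross-attention: text tokens act as
    queries over image tokens (keys/values). Projection weights are
    random stand-ins for learned parameters."""
    rng = np.random.default_rng(seed)
    d_t = text_hidden.shape[-1]
    d_i = image_hidden.shape[-1]
    W_q = rng.standard_normal((d_t, d_k)) / np.sqrt(d_t)
    W_k = rng.standard_normal((d_i, d_k)) / np.sqrt(d_i)
    W_v = rng.standard_normal((d_i, d_k)) / np.sqrt(d_i)
    Q = text_hidden @ W_q              # (n_text, d_k)
    K = image_hidden @ W_k             # (n_img, d_k)
    V = image_hidden @ W_v             # (n_img, d_k)
    scores = Q @ K.T / np.sqrt(d_k)    # (n_text, n_img)
    attn = softmax(scores, axis=-1)    # each text token's distribution over image tokens
    return attn @ V                    # (n_text, d_k)
```

Most design choices in vision-language papers vary some part of this sketch: where the visual keys/values come from, how many layers use cross-attention, and whether the projections are shared across modalities.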