ICML 2024 Day 1 Highlights

Waleed Kadous
3 min read · Jul 23, 2024


ICML is one of the premier conferences in AI, perhaps only behind NeurIPS in prestige. It’s a great place to understand what’s happening at the cutting edge of AI. Here are my personal reflections after day one of the conference.

Video is coming, and it’s not just about cute memes

Progress on video is moving very, very quickly. Even obstacles you might have thought insurmountable, like the sheer amount of computation required, are being overcome. For example, what if instead of trying to generate every frame of video, you generated MPEG directly?

MPEG works by storing a keyframe every so often and then working out motion paths for parts of the image in between. The keyframes are just images, so you can use your favorite image encoder, and then all you have to do is train a motion encoder/decoder.
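To make that concrete, here is a minimal, hypothetical sketch (plain NumPy, not the paper's actual code) of the MPEG-style decomposition: full keyframes at a fixed interval, plus per-block motion vectors for the frames in between.

```python
import numpy as np

# Hypothetical sketch of an MPEG-style decomposition: a full keyframe every
# `keyframe_interval` frames, plus per-block motion vectors for the frames in
# between. Illustrative only -- not the paper's actual encoder/decoder.

BLOCK = 16    # motion is estimated for 16x16 pixel blocks
SEARCH = 4    # search +/- 4 pixels around each block's position

def motion_vectors(prev: np.ndarray, curr: np.ndarray) -> np.ndarray:
    """For each block of `curr`, find the offset into `prev` that matches best."""
    h, w = curr.shape  # grayscale frames for simplicity
    vectors = np.zeros((h // BLOCK, w // BLOCK, 2), dtype=np.int8)
    for by in range(0, h - BLOCK + 1, BLOCK):
        for bx in range(0, w - BLOCK + 1, BLOCK):
            block = curr[by:by + BLOCK, bx:bx + BLOCK].astype(np.float32)
            best_err, best_off = np.inf, (0, 0)
            for dy in range(-SEARCH, SEARCH + 1):
                for dx in range(-SEARCH, SEARCH + 1):
                    y, x = by + dy, bx + dx
                    if 0 <= y <= h - BLOCK and 0 <= x <= w - BLOCK:
                        err = np.abs(block - prev[y:y + BLOCK, x:x + BLOCK]).sum()
                        if err < best_err:
                            best_err, best_off = err, (dy, dx)
            vectors[by // BLOCK, bx // BLOCK] = best_off
    return vectors

def encode(frames: list, keyframe_interval: int = 8):
    """Split a clip into the two things a generative model could emit directly."""
    keyframes, motion = [], []
    for i, frame in enumerate(frames):
        if i % keyframe_interval == 0:
            keyframes.append(frame)   # compress with your favorite image model
        else:
            motion.append(motion_vectors(frames[i - 1], frame))
    return keyframes, motion
```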

That’s what this paper did.

It’s easy to think that video is only important for generating cute memes, but it’s being painted by its advocates as something more — a way for machines to understand the real world. In much the same way LLMs opened up language, cognition and thought to AI, there is a belief amongst researchers that video models will open up the “real world” to AI.

Perhaps the most interesting example of this was Google’s Genie, which claimed that video was a step towards AGI. To back this up, they trained a model that lets you create a “virtual world” (think platform game) from no more than a few sketches. In other words, the model had learned how things in a platformer work.

Diffusion models for image and video are being challenged by … LLMs?

The keen-eyed amongst you will notice that Genie’s architecture is not a diffusion model. In fact, it’s an LLM. The same is true of Google’s mind-blowing VideoPoet (you have to click on that link — the eye candy is glorious). Wait … what?

Yes, that’s right. The LLM representation has far more flexibility, so it seems that, somewhat ironically, we are moving to LLMs for video and audio. You do need to do a bit more: in particular, you not only have to encode images as tokens, but decode them as well. Both steps, however, are well understood.
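For intuition, here’s a hypothetical sketch of that pipeline: a VQ-style tokenizer maps image patches to discrete codes, a decoder-only transformer (the “LLM”) predicts the next visual token, and the same codebook maps tokens back to pixels. Class names and sizes are made up for illustration; this isn’t Genie’s or VideoPoet’s actual architecture.

```python
import torch
import torch.nn as nn

# Hypothetical sketch: turn images into sequences of discrete tokens so a
# standard decoder-only transformer (an "LLM") can model them autoregressively.
# Class names and sizes are illustrative, not any specific paper's architecture.

CODEBOOK_SIZE = 1024   # number of discrete visual tokens
PATCH = 16             # each 16x16 RGB patch becomes one token

class VQTokenizer(nn.Module):
    """Encode patches to their nearest codebook entry; decode tokens back to patches."""

    def __init__(self):
        super().__init__()
        self.codebook = nn.Embedding(CODEBOOK_SIZE, PATCH * PATCH * 3)

    def encode(self, patches: torch.Tensor) -> torch.Tensor:
        # patches: (n, PATCH*PATCH*3) -> token ids: (n,)
        return torch.cdist(patches, self.codebook.weight).argmin(dim=-1)

    def decode(self, tokens: torch.Tensor) -> torch.Tensor:
        # token ids: (n,) -> reconstructed patches: (n, PATCH*PATCH*3)
        return self.codebook(tokens)

class TinyVisualLM(nn.Module):
    """Decoder-only transformer doing next-token prediction over visual tokens."""

    def __init__(self, d_model: int = 256):
        super().__init__()
        self.embed = nn.Embedding(CODEBOOK_SIZE, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, num_layers=4)
        self.head = nn.Linear(d_model, CODEBOOK_SIZE)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        x = self.embed(tokens)  # (batch, seq, d_model)
        causal = nn.Transformer.generate_square_subsequent_mask(tokens.size(1))
        x = self.blocks(x, mask=causal)
        return self.head(x)     # logits over the next visual token
```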

State Space Model / Transformer hybrids

Transformer models have a problem: attention’s compute and memory grow as the square of the sequence length. People have proposed state space models — most notably Mamba — that instead compress what the model has seen so far into a fixed-size state.

Here’s a great explanation of State Space Models. TL;DR: remember RNNs? Yeah, they’re making a comeback.
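As a refresher, the heart of a (discretized) state space layer is just a linear recurrence over a fixed-size hidden state, which is why the RNN comparison keeps coming up. A minimal sketch, not any particular library’s implementation:

```python
import numpy as np

# Minimal sketch of a discretized state space recurrence (the idea behind
# S4/Mamba-style layers), not any library's actual implementation. The hidden
# state h_t is a fixed-size summary of everything seen so far, so memory does
# not grow with sequence length the way attention does.

def ssm_scan(x: np.ndarray, A: np.ndarray, B: np.ndarray, C: np.ndarray) -> np.ndarray:
    """x: (seq_len, d_in), A: (d_state, d_state), B: (d_state, d_in), C: (d_out, d_state)."""
    h = np.zeros(A.shape[0])
    outputs = []
    for x_t in x:
        h = A @ h + B @ x_t    # update the compressed state
        outputs.append(C @ h)  # read the output from the state
    return np.stack(outputs)
```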

Each one of these models has issues. Transformer models are huge and cumbersome, but can model complex dependencies and can regenerate parts of the input (something state space models can’t do). State space models are really fast, and can learn recurring patterns much more easily. So people are starting (probably still too early) to hybridize them, e.g. MambaFormer.

[Fun story: I had the chance to talk to Jongho at the poster session, and he mentioned that he discovered the MambaFormer architecture accidentally: he thought he was putting the Mamba block after the attention heads, but he actually put it before. Then he noticed it actually worked well …]
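To show roughly what such a hybrid looks like, here’s a toy block with a state space layer placed before self-attention, matching the ordering in the story above. It’s a simplified stand-in with made-up module names, not the actual MambaFormer code:

```python
import torch
import torch.nn as nn

# Toy hybrid block in the spirit of SSM/attention hybrids such as MambaFormer:
# a state space layer first, then standard self-attention. A simplified
# stand-in with made-up module names, not the actual MambaFormer code.

class SimpleSSM(nn.Module):
    """A bare-bones linear state space layer (no gating or selectivity)."""

    def __init__(self, d_model: int, d_state: int = 16):
        super().__init__()
        self.A = nn.Parameter(torch.randn(d_state, d_state) * 0.01)
        self.B = nn.Parameter(torch.randn(d_state, d_model) * 0.01)
        self.C = nn.Parameter(torch.randn(d_model, d_state) * 0.01)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq, d_model); h is a fixed-size state carried across time
        h = x.new_zeros(x.size(0), self.A.size(0))
        outputs = []
        for t in range(x.size(1)):
            h = h @ self.A.T + x[:, t] @ self.B.T
            outputs.append(h @ self.C.T)
        return torch.stack(outputs, dim=1)

class HybridBlock(nn.Module):
    """SSM layer placed *before* the attention layer, per the story above."""

    def __init__(self, d_model: int = 256, n_heads: int = 4):
        super().__init__()
        self.ssm = SimpleSSM(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = x + self.ssm(self.norm1(x))       # cheap, recurrent sequence mixing
        y = self.norm2(x)
        attn_out, _ = self.attn(y, y, y)      # global dependencies via attention
        return x + attn_out
```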

I’d say this is still too early, but it’s an indication of where things could go. I don’t think transformers are the “end state” architecture for AI.

Differential privacy is making its way into generative approaches

There’s a lot of work on how to train large models on potentially privacy-sensitive data. One of the most interesting approaches is adapting the well-understood framework of differential privacy (TL;DR version: the removal of a single training example shouldn’t change the model by much). For example, this work shows how to use self-supervised learning in such a way that you can train on private photos and still have a model that is safe to release publicly.
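To make the TL;DR concrete, the workhorse technique here is DP-SGD: clip each example’s gradient so no single example can move the model too far, then add calibrated noise before taking the step. A minimal sketch (real training would use a library like Opacus and track the cumulative privacy budget):

```python
import numpy as np

# Minimal sketch of a DP-SGD update (per-example clipping plus Gaussian noise),
# the workhorse technique behind most differentially private training.
# Illustrative only; real training would use a library such as Opacus and
# track the cumulative privacy budget.

def dp_sgd_step(params, per_example_grads, clip_norm=1.0, noise_mult=1.1, lr=0.1):
    """params: (n_params,); per_example_grads: (batch, n_params), one row per example."""
    # 1. Clip each example's gradient so no single example can dominate the update.
    norms = np.linalg.norm(per_example_grads, axis=1, keepdims=True)
    clipped = per_example_grads * np.minimum(1.0, clip_norm / (norms + 1e-12))

    # 2. Sum the clipped gradients and add noise scaled to the clipping norm.
    noisy_sum = clipped.sum(axis=0) + np.random.normal(
        scale=noise_mult * clip_norm, size=params.shape)

    # 3. Take an ordinary gradient step on the privatized, averaged gradient.
    return params - lr * noisy_sum / per_example_grads.shape[0]
```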

