Latent Space: The AI Engineer Podcast

[NeurIPS最佳论文] 用于自监督强化学习的千层网络 — 王凯文等,普林斯顿大学

[NeurIPS Best Paper] 1000 Layer Networks for Self-Supervised RL — Kevin Wang et al, Princeton

Episode Summary

From a Princeton undergraduate independent-work seminar to the NeurIPS 2025 Best Paper Award: Kevin Wang, Ishaan Javali, Michał Bortkiewicz, Tomasz Trzcinski, and Benjamin Eysenbach defied conventional wisdom and scaled reinforcement learning networks to 1,000 layers, unlocking performance RL had long treated as impossible. We caught the RL1000 team at NeurIPS to dig into: why deep networks succeeded in language and vision but went unconquered in RL for over a decade (the key is the objective, not the depth); how they found that self-supervised RL (contrastive learning over state, action, and future-state representations) escapes the limits of value-based methods; the core architectural tricks that made it work (residual connections, layer norm, and a shift from regression to classification); why scaling depth is more parameter-efficient than scaling width (linear versus quadratic growth); how JAX and GPU-accelerated environments make it fast to collect hundreds of millions of transitions (data abundance as the precondition for scaling); the "critical depth" phenomenon, where performance multiplies once you pass 15 million transitions and add the right architectural components; why this is not just "bigger networks" but a fundamental change to the RL objective (their code contains no "maximize reward" instruction: it is pure self-supervised representation learning); how distilling a deep teacher into a shallow student could enable deployment at scale (train frontier capability with 1,000 layers, distill into an efficient inference model); the outlook for robotics (goal-conditioned RL with no human supervision or demonstrations, replacing manual data collection with architectural scaling); and their central thesis: RL can finally scale the way language and vision do, not by piling compute onto value functions, but by borrowing the self-supervised representation learning recipe that made deep learning succeed.

Key discussion points:
- The self-supervised RL objective: instead of learning a value function (noisy, biased, prone to spurious correlations), learn representations that pull together states on the same trajectory and push apart states from different trajectories, recasting RL as a classification problem
- Why the first scaling attempts failed: naively deepening the network degraded performance; adding residual connections and layer norm made performance take off, revealing a "critical depth" phenomenon
- Depth versus width: depth grows parameters linearly, width quadratically; at equal performance, deep architectures are more parameter-efficient and more sample-efficient
- The JAX + GPU-accelerated environment breakthrough: collecting thousands of trajectories in parallel removes the data bottleneck; the benefits of deep networks appear past 15 million transitions
- The merging of RL and self-supervised learning: the algorithm never directly maximizes reward; it remains an actor-critic, goal-conditioned RL framework, but the burden of learning shifts to a classification task (cross-entropy loss / representation learning) rather than regressing temporal-difference errors
- The synergy between batch size and depth: traditional RL could not benefit from large batches because its networks were too small to extract the signal; with deep networks, batch size becomes a new axis of scaling

- The RL1000 team (Princeton)

"1000 Layer Networks for Self-Supervised RL: Scaling Depth Can Enable New Goal-Reaching Capabilities": https://openreview.net/forum?id=s0JVsx3bx1

Chapters:
00:00:00 Opening: the best paper award and the NeurIPS poster
00:01:11 Team introductions and Princeton origins
00:03:35 The deep learning anomaly: why RL stayed stuck at shallow networks
00:04:35 Self-supervised RL: a different path to scaling
00:05:13 The breakthrough moment: residual connections and critical depth
00:07:15 Architecture choices: borrowing from ResNet, preventing vanishing gradients
00:07:50 The paper's key clarification: not just bigger networks but a new objective
00:08:46 Blurring the boundary between RL and self-supervised learning
00:09:44 From TD errors to classification: why the objective scales
00:11:06 Architecture details: building on BRO and SimBa
00:12:05 Robotics: the promise of unsupervised goal-conditioned RL
00:13:15 Efficiency trade-offs: depth versus width parameter scaling
00:15:48 JAX and GPU-accelerated environments: the data infrastructure
00:18:05 World models and next-state classification
00:21:02 Future directions: distillation, vision-language models, and hierarchical planning
00:22:37 Unlocking batch-size gains through network capacity
00:24:10 Compute requirements: frontier performance on a single GPU
00:27:15 Closing: challenging the conventional wisdom on RL scaling

Bilingual Subtitles


Speaker 0

欢迎来到Latent Space。

So welcome to Latent Space.

Speaker 0

我们基本上是想为那些无法到场的人提供最佳的NeurIPS播客体验。

We are basically trying to provide the best optimal sort of podcast experience of NeurIPS for people who are not here.

Speaker 0

恭喜你的论文。

And congrats on your paper.

Speaker 0

感觉如何?

How does it feel?

Speaker 1

是的。

Yeah.

Speaker 1

这非常令人兴奋。

It was very exciting.

Speaker 1

是的。

Yeah.

Speaker 1

我们昨天有张贴海报,今天将进行口头报告。

We had a poster yesterday, and then today we'll have an oral talk.

Speaker 0

你当时是不是被人群围住了?

Were you just, like, mobbed?

Speaker 1

哦,哦,人特别多。

Oh, Oh, There was a lot of people.

Speaker 1

连续三个小时,一波又一波的人过来,我们根本忙不过来,但是

It's like three hours straight of, like, you know, like, waves of people that we could barely keep up with, but

Speaker 0

我从来没拿过最佳论文奖。

So I've never received the best paper.

Speaker 0

你是刚在网站上看到的吗?

Did you just find out on the website?

Speaker 0

就像,什么

Like, what

Speaker 1

哦,我有一天早上醒来,检查邮件,然后

Oh, I just like woke up one day and like checked my email and then

Speaker 0

啊,他们只是觉得他们只是

Ah, They just thought they just like

Speaker 1

他们说:‘哦,我刚看到你的邮件。’

They was like, oh, like, that's like, I just saw your email.

Speaker 1

哦,你被授予了最佳论文奖。

Oh, you were like, been awarded best paper.

Speaker 0

我只是想,也许从评审意见中你也能知道。

I'm like But maybe you know from the reviews as well.

Speaker 0

对吧?

Right?

Speaker 1

所以我觉得,是的。

So I think Yeah.

Speaker 1

我们知道评审意见显示我们表现不错,但表现好和拿到最佳论文还是有区别的。

We know from the reviews that we did well, there's but a difference between, like, doing well on the reviews and getting best paper.

Speaker 1

所以那部分,我们其实并不知道。

So that part, we didn't actually know.

Speaker 0

嗯。

Yeah.

Speaker 0

所以我稍微跳过了一点。

So I I I skipped a little bit.

Speaker 0

也许我们可以一个一个来,简单介绍一下你自己,以及你在团队中做了什么。

Maybe we can go sort of one by one and and sort of introduce, you know, who you are and what you did on on on on the team.

Speaker 1

我是凯文。

I'm Kevin.

Speaker 1

我来自普林斯顿大学,是本科生。

I was an undergrad from from Princeton.

Speaker 1

我刚毕业。

I just graduated.

Speaker 1

嗯,我想是我主导了这个项目,启动了它,然后我很高兴能和伊尚在通话中合作,还有本也一样。

And, yeah, I guess I led the project, like, started the project, and then well, I was very happy to collaborate with Ishaan in the call and Ben also.

Speaker 0

对。

Right.

Speaker 0

你们是在同一个研究小组吗?

And were you in, like, the same research group?

Speaker 0

你是怎么,呃,你的意思是?

Like, how do you how what's your Yes.

Speaker 0

我们是上下文。

We are context.

Speaker 1

所以,是的。

So so yeah.

Speaker 1

所以我们都是普林斯顿的。

So we're all from Princeton.

Speaker 0

是的。

Yeah.

Speaker 0

感谢艾伦帮你们安排的。

Well, thanks to Alan for booking you guys.

Speaker 1

这个项目实际上起源于一次IW研讨会。

So this project actually started from, like, an IW seminar.

Speaker 1

就是本教授的独立研究研讨会。

So like like, an independent work research seminar that Ben was teaching.

Speaker 1

这实际上是我早期接触机器学习研究的经历之一。

And this was, like, actually, like, like, one of my first experiences in, like, ML research.

Speaker 1

所以能获得这样的经验真的很有价值。

So it was really valuable to, like, get that experience.

Speaker 1

当时Ishaan也在这个研讨班里,研究相关的内容,因此我们在研讨班期间合作了很多。

And then Ishaan was also in that seminar and working on adjacent things, so we collaborated a lot during that seminar.

Speaker 1

后来,这个项目取得了一些非常酷的结果。

And then, yeah, the project turned out to have some pretty cool results.

Speaker 1

再后来,哈特也加入了这个项目,他也在做类似的工作,成为了很好的合作伙伴。

And then later on, also, like, Halt working on sort of similar things also joined it on the project and became, like, a good collaboration.

Speaker 0

是的。

Yeah.

Speaker 0

我不知道你们有没有人想谈谈,关于如何决定这个研究问题的其他方面。

And I I don't know if any of you guys wanna wanna chime in on, like, other elements of coming into, like, deciding on this problem.

Speaker 2

我的实验室主要研究深度强化学习。

So it's, like, probably my lab works on deep reinforcement learning.

Speaker 2

但历史上,‘深度’意味着两层、三层或四层。

But historically, deep meant, like, two or three or four layers.

Speaker 0

不是一千层?

Not 1,000?

Speaker 2

当凯文和Ishaan提到他们想尝试非常深的网络时,我有点怀疑这会不会成功。

When Kevin and Ishaan mentioned they wanted to try really deep networks, I was kinda skeptical it was gonna work.

Speaker 2

我以前试过。

I've tried this before.

Speaker 2

这行不通。

It doesn't work.

Speaker 2

其他论文也试过,但根本行不通。

Other papers have tried this before, and it doesn't even work.

Speaker 2

所以一开始我就非常非常怀疑。

So I was very, very skeptical starting out.

Speaker 2

我不知道当时有没有表达清楚,但这就是我当时的预设,因为

I don't know if I conveyed this at the time, but that was my prior going in because

Speaker 0

但你是否认为你的工作是筛选,或者,嘿,伙计们。

But do you do you view your job as, like, screening or, hey, guys.

Speaker 0

这可能行不通。

This is probably isn't gonna work.

Speaker 0

你应该试试别的想法。

You should try a different idea.

Speaker 0

你知道,还是你应该鼓励,即使它看起来

You know, or should you be encouraging even if it's

Speaker 2

很蠢?

dumb?

Speaker 2

这是在下注。

It's selecting bets.

Speaker 2

是的。

Yeah.

Speaker 2

而这是一个我愿意下的赌注。

And this was a bet I was willing to make.

Speaker 0

是什么让你愿意下这个注?

What what made you willing to make a bet?

Speaker 2

这看起来成本相对较低,而且Michał尤其在过去一年里开发了基础设施,使运行这些实验变得容易得多。

It seemed relatively low cost, and Michał, in particular, had spent the past year developing infrastructure to make it a lot easier to run some of these experiments.

Speaker 2

之前的先例表明,更深的网络能表现好得多。

And the precedent was deeper networks could do a whole lot better.

Speaker 2

就像深度学习革命在过去十年里所展现的那样。

Like, that's what the deep learning revolution has been over the last decade.

Speaker 0

是的。

Yeah.

Speaker 0

我知道。

I know.

Speaker 0

为什么我们不再让网络更深了?

Why do we stop making them deeper?

Speaker 2

而强化学习就像是一个例外,我们一直使用这些非常浅的网络。

And reinforcement learning was like this one anomaly where we continue to use these really shallow networks.

Speaker 2

在我们所研究的从零开始的场景中,这一点尤其明显。

And that's particularly true in the settings that we were looking at where you're starting from scratch.

Speaker 2

你从一无所有开始。

You're starting from nothing.

Speaker 0

你们还有其他想补充的观点吗?

Any other perspectives you guys wanna chime in with?

Speaker 1

我想我应该先概述一下我们的项目。

I guess maybe I should just go over, like, an overview of our project.

Speaker 0

好的。

Yes.

Speaker 0

好吧。

Okay.

Speaker 0

抱歉。

Sorry.

Speaker 1

是的。

Yes.

Speaker 1

所以,我对我们的项目的看法是,如果你看一下深度学习的领域,你会看到有自然语言处理、语言、视觉,然后是强化学习。

So the way that I kind of view our project is that if you look at the landscape of deep learning, you know, you have NLP, like language, vision, and then RL.

Speaker 1

而且,就像本之前提到的,在语言和视觉领域,我们已经逐渐形成了这种大规模网络的范式。

And, like, as Ben kind of alluded to, you know, like, in in language and vision, we've sort of converged to these, like, paradigms of scaling to massive networks.

Speaker 1

对吧?

Right?

Speaker 1

比如数百亿甚至万亿级别的参数,深度学习从这种趋势中获得了大量进展。

Like, hundreds of billions of parameters, trillions of parameters, and there's been, you know, a lot a lot gained from in deep learning from from that.

Speaker 1

对吧?

Right?

Speaker 1

但随后,在深度学习和深度强化学习的第三个分支中,情况似乎并非如此。

And then but then it seems like in the third sort of branch of deep learning in deep RL, that has not yet been the case.

Speaker 1

当我进入本的课程和研讨会,观察这些网络时,我非常惊讶。

Like, I was very surprised, like, coming into some like, you know, Ben's class and seminar when I was looking at the networks.

Speaker 1

为什么你们只是用一个简单的两层MLP来处理这些前沿的、最先进的强化学习算法呢?

Oh, why were you just using, like, a simple, like, two layer MLP for, like, these frontier sort of, you know, state of the art RL algorithms?

Speaker 1

所以我非常好奇。

And so I was very curious.

Speaker 1

我们能否设计出强化学习算法?

Like, can we design RL algorithms?

Speaker 1

我们能否为强化学习制定一套方法,使其能够以类似于语言和视觉领域的方式进行扩展?

Can we sort of put together a recipe for RL that can allow it to scale in potentially, you know, analogous ways that language and vision might scale.

Speaker 1

因此,我们知道传统的强化学习,比如基于价值的强化学习,并不能很好地扩展。

And so what we did is that we know that traditional RL, like, let's say, like, value value based RL doesn't really scale.

Speaker 1

对吧?

Right?

Speaker 1

这一点在文献中已经非常明确了。

This is pretty clear from the literature.

Speaker 1

于是我们尝试了一种不同的强化学习方法,称为自监督强化学习:在这种方法中,我们不是学习价值函数,而是学习状态、动作和未来状态的表示,使得同一轨迹上的表示彼此靠近,不同轨迹上的表示彼此远离。

So we tried a different approach of RL called self supervised RL where instead of learning, like, a value function, we're learning representations of states, actions, and future states such that the representations along the same trajectory are pushed together, the representations along different trajectories are pushed apart.
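
[Editor's note: a minimal JAX sketch of the kind of contrastive objective Kevin describes here. The encoder shapes and names are illustrative assumptions, not the authors' implementation, which stacks far deeper residual networks.]

```python
# Illustrative contrastive objective: pull (state, action) representations
# toward the future state on the SAME trajectory, push them away from
# future states of OTHER trajectories. Names and shapes are assumptions.
import jax
import jax.numpy as jnp

def encode_sa(params, states, actions):
    # Hypothetical state-action encoder (one hidden layer for brevity).
    x = jnp.concatenate([states, actions], axis=-1)
    return jax.nn.relu(x @ params["W_sa"]) @ params["V_sa"]

def encode_future(params, future_states):
    # Hypothetical future-state encoder.
    return jax.nn.relu(future_states @ params["W_g"]) @ params["V_g"]

def contrastive_loss(params, states, actions, future_states):
    phi = encode_sa(params, states, actions)       # (B, d)
    psi = encode_future(params, future_states)     # (B, d)
    logits = phi @ psi.T                           # (B, B) similarity matrix
    # Row i's positive is column i: a future state from the same trajectory;
    # the other columns act as negatives, turning RL into classification.
    labels = jnp.arange(logits.shape[0])
    log_probs = jax.nn.log_softmax(logits, axis=-1)
    return -jnp.mean(log_probs[labels, labels])
```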

Speaker 1

这是一种不同的强化学习方法,使我们能够以自监督的方式进行学习。

And this is just, like, a different approach to RL that allows us to learn in a self supervised manner.

Speaker 1

因此,我们可以在没有任何人工设计的奖励信号的情况下解决任务目标。

So we can solve tasks, reach goals, without any human crafted reward signal.

Speaker 1

我们知道自监督学习在深度学习的这些不同领域中是可扩展的,那么自监督强化学习是否也能以类似的方式扩展呢?

And so we know that self supervised learning is scalable in these different areas in deep deep learning, so can self supervised RL scale in similar ways?

Speaker 1

当我们最初尝试时,实际上并没有成功。

When we first tried it, it actually didn't work.

Speaker 1

我们让网络变得更深了。

Like, we've made the networks deeper.

Speaker 1

性能完全下降了。

The performance, like, totally degraded.

Speaker 1

但后来我也尝试了其他方法,当时我想到,在我们的文献中还有一些其他工作,比如我们尝试了残差连接,还有其他一些必须加入到这个方法中的架构组件。

But then we also try but then I separately, was like there's also some other work, like, in in our literature, like, we tried, like, residual connections, and then there's other a few other architectural components that we had to put into the recipe.

Speaker 1

然后突然有一天,我运行了这个实验,在某个环境中,当深度翻倍时并没有什么效果,但再次翻倍深度并结合这些新组件后,性能在这一环境中突然飙升。

And then all of a sudden, like, one day, like, I ran this experiment, and there was, like, this one environment in which there was, like, like, going from like like, doubling the depth didn't really do anything, but, like, doubling the depth again with these different components suddenly, like, skyrocketed performance in this one environment.

Speaker 2

要让这个方法生效非常不容易,因为通常我们会考虑进行超参数优化。

Getting this to work was very nontrivial in the sense that, like, usually, we wanted to think about doing hyperparameter opt optimization.

Speaker 2

我们尝试改变a,看看是否有所改善,再尝试改变b,看看是否有效。

We try changing a, see if it makes it better, try changing b, see whether it makes it better.

Speaker 2

如果我们只是增加深度,效果反而更差。

And if we just made the depth bigger, it makes it worse.

Speaker 2

如果我们只是添加残差连接,也没有改善效果。

If we just add residual connections, it didn't make it better.

Speaker 2

真正让这个方法奏效的是凯文和亚山发现的这些因素的组合。

And it was really this combination of factors that Kevin and Yashan figured out that really made this work.

Speaker 3

作为这之前的铺垫,我们也尝试了在不同维度上进行扩展。

And as a precursor to that, we also tried scaling along different dimensions.

Speaker 3

比如扩大批量大小,扩大网络的宽度,也就是隐藏层的规模。

So scaling the batch size, scaling the width of the network, so the hidden layers.

Speaker 3

效果一样?

Same effect.

Speaker 3

没错。

Yep.

Speaker 3

是的。

Yeah.

Speaker 3

差不多就像盲目增加深度一样。

Pretty much kinda similar to just scaling depth naively.

Speaker 3

是的。

Yeah.

Speaker 3

当我们开始引入残差连接、层归一化这些特定的架构选择时,性能才出现了显著提升,达到了某些关键深度,性能提升了好几倍。

And then once we started introducing residual connections, layer norm, these specific architectural choices, that's when we saw significant jumps in performance, like these critical depths at which performance multiplies by a pretty huge factor.
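
[Editor's note: a sketch in plain jax.numpy of the kind of residual block with layer norm being described. The pre-norm layout, sizes, and names are assumptions for illustration, not the paper's exact block.]

```python
# A residual MLP block: layer norm plus a skip connection, so gradients
# survive even when hundreds of these blocks are stacked.
import jax
import jax.numpy as jnp

def layer_norm(x, eps=1e-6):
    mean = jnp.mean(x, axis=-1, keepdims=True)
    var = jnp.var(x, axis=-1, keepdims=True)
    return (x - mean) / jnp.sqrt(var + eps)

def residual_block(params, x):
    # Pre-norm residual block: x + MLP(LayerNorm(x)).
    h = layer_norm(x)
    h = jax.nn.relu(h @ params["W1"] + params["b1"])
    h = h @ params["W2"] + params["b2"]
    return x + h  # the skip connection is what keeps gradients from vanishing

def deep_mlp(blocks, x):
    # Naively stacking plain layers degraded performance in their early runs;
    # with residual blocks, depth scales by simply appending more blocks.
    for params in blocks:
        x = residual_block(params, x)
    return x
```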

Speaker 3

正是在这一点上,我们真正注意到,与单纯扩大规模所带来的性能提升相比,我们解锁了显著的性能收益。

And that's where we really noticed, like, unlocking some significant performance gains as opposed to scaling just along width, which did yield some performance improvements.

Speaker 3

但当你观察随着规模扩大,网络参数数量的变化时,它大致呈二次方增长,而不是像增加深度那样。

But when you look at the number of parameters that your network has as you grow width, it's roughly quadratic as opposed to something like growing depth.

Speaker 3

因此,从我们进行的实验来看,这种方法在某种程度上更具参数效率,也更具样本效率。

So it's more in some sense, it's more parameter efficient, also more sample efficient from the experiments that we conducted.

Speaker 3

不错。

Nice.

Speaker 3

在某些方面,你实际上是在复制一些在现实世界中出现的现象

In some ways, you're sort of replicating stuff that is seen in

Speaker 0

但使用的是一个你可以研究的小型模型。

the wild, but on on a very small model that you can study.

Speaker 0

你会不会说这是

Is that would you would you say that's

Speaker 3

是的。

Yeah.

Speaker 3

我想补充一下凯文之前说的内容。

So I kind add to what Kevin said earlier.

Speaker 3

我们看到,通过扩大语言模型和图像生成模型的规模和深度,性能得到了巨大提升,这看起来非常直观。

We saw these huge performance improvements in language models, image generation models by making them larger, making them deeper, which seems very intuitive.

Speaker 3

是的。

Yeah.

Speaker 3

因此,我们的工作借鉴了基础研究,比如残差网络,它通过残差连接来避免梯度消失,这一点我们在论文的后续部分的一些消融实验中也展示了。

And so that's why our work, we we draw from, like, foundational research, right, like residual networks, which employ residual connections to avoid vanishing gradients, and that's something that we show in some of our ablations in our in our paper, like, further down.

Speaker 3

这可能在附录中,我们做了没有这些残差连接的实验。

It's probably in the appendices where we did experiments without these residual connections.

Speaker 3

因此,我们借鉴了其他领域已有的概念,并将其应用于强化学习的场景,证明了其有效性。

And so it's sort of borrowing these concepts that have existed in other fields and applying them to this setting with RL and showing that it works.

Speaker 0

在本恩离开之前,我让他来做最后的总结。

Before Ben has to has to go, I'll I'll leave the sort of last word to him.

Speaker 0

这项工作启发了哪些你接下来想推进的额外研究?

What additional work does this inspire that, like, that that you want to push on next?

Speaker 2

我想对论文澄清一点,然后直接回答这个问题。

I think there's one thing I'd clarify about the paper, and then I'll directly answer the question.

Speaker 0

好的。

Yes.

Speaker 2

我认为可能需要澄清的是,很多读到标题的人会想,哇。

I think the thing I might clarify about the paper is think a lot of people reading the title are like, wow.

Speaker 2

大网络。

Big networks.

Speaker 2

它们很棒。

They're great.

Speaker 2

我选大网络。

I'll take big networks.

Speaker 0

你现在解决了。

You solved it now.

Speaker 0

我们可以走了。

We can just go.

Speaker 2

是的。

Yeah.

Speaker 2

我们直接把大网络加到PPO里。

We just take big networks from add them to PPO.

Speaker 2

加到SAC里。

Add them to SAC.

Speaker 2

加到你最喜欢的强化学习算法里。

Add them to your favorite reinforcement learning algorithm.

Speaker 2

但我认为这实际上并不是主要结论。

But I think that's actually not the main conclusion.

Speaker 2

我认为主要结论是,使用大网络不仅需要这些架构技巧,而且如凯文之前提到的,还需要使用不同的目标函数。

I think the main conclusion is that using big networks not only requires these architectural tricks, but also, as Kevin mentioned before, it requires using a different objective.

Speaker 2

这个目标函数实际上并不使用奖励。

This objective doesn't actually use rewards in it.

Speaker 2

因此,标题中的另一个词——强化学习——也可能有点用词不当,因为我们并没有直接试图最大化奖励。

And so there's another word in the title, reinforcement learning, that also might be a little bit of a misnomer because we aren't directly trying to maximize rewards.

Speaker 2

我们的代码中没有一行是说‘最大化奖励’的。

Our code doesn't have a line of code saying maximize rewards here.

Speaker 2

那么,归根结底,这还是一个强化学习方法吗?

And so is, at the end of the day, this a reinforcement learning method?

Speaker 2

我不知道。

I don't know.

Speaker 2

它看起来更类似于机器学习其他领域的自监督方法。

It looks much more similar to the self supervised methods in other areas of machine learning.

Speaker 2

因此,我认为这种方法和这项工作实际上处于强化学习和自监督学习研究的一个有趣交汇点上。

And so I think that the method and the work really stands in some sort of interesting intersection of reinforcement learning and self supervised learning research.

Speaker 2

我们在海报的左下角放了一个小图,那是从一段关于如何构建智能系统、以及这将由无监督学习、监督学习还是强化学习实现的幻灯片中截取的截图。

And we had this little figure on the bottom left of the poster, which was a screenshot of a slide from a talk about how to build intelligent systems and whether that's gonna be done by unsupervised learning or supervised learning or reinforcement learning.

Speaker 2

我认为我们的论文真正表明,这些方法之间的界限非常模糊,而构建智能系统的关键可能在于综合利用它们的所有洞见。

And I think what our paper really suggests is that the boundary between these things is really blurry, and maybe the keys to building intelligent systems are gonna be leveraging insights from all of them.

Speaker 0

是的。

Yeah.

Speaker 0

这就像一层层的蛋糕。

It's a layer cake.

Speaker 2

没错。

Exactly.

Speaker 0

谢谢你的宝贵时间。

Well, thank you for your time.

Speaker 0

我知道你很快就要走了。

I know you I know you have to go soon.

Speaker 1

期待

Looking forward

Speaker 2

是啊。

to yeah.

Speaker 0

多伦,谢谢。

Doron, thanks.

Speaker 0

嗯。

Yeah.

Speaker 0

非常感谢你的到来。

Thank you so much for coming.

Speaker 0

我觉得那种模糊界限的见解很有趣。

I I think that that insight of, like, blurring things is interesting.

Speaker 0

我不确定你是否提到过,比如表征学习的抽象层。

I I don't know if you like, you you were talking about sort of, like, the abstraction layer of representation learning.

Speaker 0

我不知道这是否让你想到自监督学习和强化学习之间的融合。

I don't know if if that triggers anything in terms of, like, the mix between self supervised and reinforcement learning.

Speaker 0

这是你们发现的一个根本性问题,还是人们读论文时普遍不理解的地方?

Is that is that something fundamental that you've discovered or that we've that people don't understand when they when they read the paper?

Speaker 1

是的。

Yeah.

Speaker 1

我认为最好的解释方式是,我们知道标准的强化学习并不具备很好的可扩展性。

I think the best way that I would explain it is that we know that standard RL is not super scalable.

Speaker 1

那么,为什么这种不同的方法或不同的目标强化学习可以具备可扩展性呢?

And so, like, why can this different approach or different objective RL be scalable?

Speaker 1

我认为这是因为我们从根本上将学习的负担从类似Q学习或回归TD误差(我们知道这些方法非常不稳定、嘈杂且有偏差)转移到了一个本质上是分类的问题上。

I think it's because we're fundamentally shifting the burden of learning from something like q like q learning or, like, regressing to, like, TD errors, which we know is quite spurious and noisy and biased, to fundamentally, like, a classification problem.
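
[Editor's note: for contrast with the classification objective sketched earlier, this is the standard temporal-difference regression being moved away from. q_apply is a hypothetical Q-network; sketch only, under those assumptions.]

```python
# TD regression: fit Q(s, a) to a bootstrapped target built from the
# network's own (noisy, biased) estimate at the next state. This is the
# objective the team argues scales poorly with network depth.
import jax
import jax.numpy as jnp

def td_loss(q_params, q_apply, s, a, r, s_next, a_next, gamma=0.99):
    target = r + gamma * q_apply(q_params, s_next, a_next)
    td_error = q_apply(q_params, s, a) - jax.lax.stop_gradient(target)
    return jnp.mean(td_error ** 2)
```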

Speaker 1

我们试图分类的是,未来状态是否属于同一条轨迹,还是属于不同的轨迹。

We're trying to classify whether future state is along the same trajectory or along a different trajectory.

Speaker 1

而我们是通过表征学习来实现这一点的。

And we do this with representation learning.

Speaker 1

对吧?

Right?

Speaker 1

我们知道,在深度学习文献中,分类、交叉熵损失和表示学习都是可扩展的。

And we know that classification, cross entropy loss, and representation learning are scalable in the deep learning literature.

Speaker 1

对吧?

Right?

Speaker 1

如果我们想想语言以及其中的一些目标函数。

If we think about language and, like, some of the objectives there.

Speaker 1

从某种意义上说,我们正在模糊我们正在进行强化学习的界限。

So in some sense, we're kind of blurring the lines of whether we're doing reinforcement learning.

Speaker 1

它仍然是一个演员-评论家强化学习算法。

It's still an actor critic reinforcement learning algorithm.

Speaker 1

它就像一个目标条件的强化算法。

It's like a goal-conditioned reinforcement learning algorithm.

Speaker 1

但解决这个强化学习任务的学习负担,已经转移到了类似于语言和视觉领域中那些已被证明高度可扩展的目标上。

But the objective the burden of, like, learning of the of of solving that RL task shifts to something that's more similar to objectives that you might see in language and vision that we know have scaled so much.

Speaker 1

所以我认为是的。

And so I think yeah.

Speaker 1

我认为这是我们看到的一个基本洞见:通过这种不同的方式来处理强化学习,我们能够获得远超传统RL方法所能做到的成果,我们的网络规模也得到了显著扩展。

I think that's, like, one of the fundamental insights that we've seen: it seems like by approaching RL in this different way, we're able to get so much more out of it. We were able to scale our networks, like, significantly beyond what was, like, standard in RL.

Speaker 4

我可以插一句吗?

Can I jump in?

Speaker 4

我来再多补充一点关于架构的背景,因为我们使用了另一个目标函数,即InfoNCE,也就是对比损失。

I will just give a bit more context about the architecture because, yeah, we use another objective, the InfoNCE, or the contrastive loss.

Speaker 4

不过,我们的架构与之前的工作非常相似,比如BRO或SimBa、SimBa一代、SimBa二代等论文。

However, the architecture is quite similar to previous works, previous papers like BRO or SimBa, SimBa one, SimBa two.

Speaker 4

因此,我们也对这个架构做了一些调整。

So we also tweaked a bit of this architecture.

Speaker 4

但并不是说我们从零开始发明了轮子。

However, it's not that we like invented the wheel for the first time.

Speaker 4

真正让规模大幅提升、性能随规模增长的,是架构与目标函数的结合。

It's the merging between the architecture and the objective that makes the scale really, like, go up and and performance follow the the scale.

Speaker 0

我认为这一点我们可能需要更深入地思考。

I think that's something that we should probably mine deeper.

Speaker 0

你觉得,比如说,你已经在哪些领域或行业应用了它?在多种不同的网络或数据集上?

Do you think I guess, like, what domains, what industry like, you've applied it on on multiple different types of networks that are or datasets.

Speaker 0

你认为有没有哪个领域是特别容易上手的、低垂的果实?

Is there a particular affinity that you think, like, has is, like, sort of low hanging fruit?

Speaker 1

是的。

Yeah.

Speaker 1

实际上,如果你看看我们很多任务,特别是机器人任务。

So, actually, if you look at a lot of our tasks there, particularly sort of like robotics tasks.

Speaker 1

首先,我非常好奇这类工作会对机器人领域产生怎样的影响。

So this is first thing, I'd be very curious about how a work like this could impact, like, the robotics field.

Speaker 1

据我了解,现在的机器人技术有几种不同的方法。

Like, my understanding of robotics is that a lot of robotics are now there's kind of multi a few different approaches.

Speaker 1

一种方法是希望通过模仿学习来训练机器人。

Like, one approach is we wanna train robots using imitation learning.

Speaker 1

所以我们试图收集海量的数据。

So we try to collect, like, an insane amount of data.

Speaker 1

我们有大量的人类监督数据,试图扩展这些数据,并通过模仿学习进行训练。

We have a ton of, like, human supervision, and we try to scale up this data, and we're, like, learning with imitation learning.

Speaker 1

但另一方面,可能还有另一种方法,比如目标条件强化学习,我们可以训练机器人代理和强化学习代理,在完全不需要人类监督的情况下解决有意义的任务。

Like but on the other hand, potential like, perhaps there's another approach, which is, like, for example, like, goal-conditioned reinforcement learning, where we can actually train robotic agents and train RL agents to solve meaningful tasks with absolutely no human supervision.

Speaker 1

无需演示。

No demonstration.

Speaker 0

这更具可扩展性。

It's much more scalable.

Speaker 0

是的。

Yeah.

Speaker 0

所以,嗯,所以

So, yeah, so

Speaker 1

这可以作为一种替代方法。

this could serve as an alternate approach.

Speaker 1

也许与其扩展数据、扩展人工人类监督(这并不太容易扩展),不如想办法让目标条件强化学习变得可扩展,比如我们可以扩展架构,或者扩展

And perhaps instead of, like, scaling data, like, scaling manual, like, human supervision, which is, you know, not super scalable, if there are ways to sort of make goal-conditioned reinforcement learning scalable and, like, we can just scale the architecture or we can scale

Speaker 0

因为你专注于所有的目标。

Because you're focused on all your objectives.

Speaker 1

是的。

Yeah.

Speaker 1

对。

Right.

Speaker 1

对于某些不同的目标,我认为这可能会非常令人兴奋,看看它如何影响机器人技术这样的领域。

With with certain different objectives, I think that could be very exciting and see to see how that can affect a field like robotics, for example.

Speaker 0

是的。

Yeah.

Speaker 0

再深入探讨一下Igor提到的效率问题。

Double click on on just one more one thing on the efficiency, which you Igor was talking about.

Speaker 0

我预计越深,情况应该会呈平方级恶化。

I would expect the very deep the deeper it is, this should be quadratically worse.

Speaker 0

我不太熟悉现有的文献。

I am not familiar with, like, the the preexisting literature.

Speaker 0

我只是在慢慢理清一些直觉。

I'm just, like, sort of working out intuitions.

Speaker 0

嗯。

Mhmm.

Speaker 0

但总的来说,你发现了哪些权衡因素,我觉得你可能想提醒大家注意?

But, basically, what are the trade offs that you've found that I think you might wanna warn people about?

Speaker 0

因为你才是那个提到效率的人。

Because because you you are the guy who mentioned efficiency.

Speaker 3

嗨。

So Hi.

Speaker 3

当然。

Sure.

Speaker 3

当然。

Sure.

Speaker 3

当然。

Sure.

Speaker 3

是的。

Yeah.

Speaker 3

所以我指的是我们海报和论文中的一个图表,其中我们比较了模型在深度和宽度两个维度上扩展时的参数数量,是的。

So I was referring to, like, one of the figures on our poster also in our paper where we compare, like, the number of parameters that models have as we scale along the axis of depth and as we scale along the axis of width Yeah.

Speaker 3

从我们的基线架构来看,最基础的版本是宽度为256,即隐藏层有256个神经元,深度为4层,也就是4个隐藏层。

From our baseline architecture, the most baseline one would be, like, a width of 256, the hidden layers of 256 neurons, and then the depth is four layers or hidden layers.

Speaker 3

所以我当时想表达的是,当你在深度维度上扩展时,模型的参数数量大致呈线性增长。

And so the point I was making there is that when you scale along depth, your the number of parameters that your model has is gonna grow roughly linearly.

Speaker 3

而当你让网络的输出变宽时,下一个网络的输入也会随之增长。

Whereas with width, you're making your network outputs wider, and then the input to the next layer is also growing as well.

Speaker 3

因此,你的网络参数数量将大致呈二次方增长。

And so the the number of parameters your network's then gonna have grows approximately quadratically.
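
[Editor's note: the linear-versus-quadratic point in back-of-envelope form, counting only hidden-to-hidden weights; the width-256, depth-4 baseline is the one quoted in this exchange.]

```python
# Parameter count for an MLP with hidden width w and L hidden layers
# (hidden-to-hidden weights only, biases and input/output layers ignored).
def mlp_params(width, depth):
    return depth * width * width  # each hidden-to-hidden matrix is w x w

base = mlp_params(256, 4)         # baseline: width 256, depth 4
print(mlp_params(256, 8) / base)  # doubling depth -> 2.0x params (linear)
print(mlp_params(512, 4) / base)  # doubling width -> 4.0x params (quadratic)
```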

Speaker 3

因此,我们做了一个实验,研究在通过这两种不同方式扩展模型参数数量时,对于大致相同的参数数量,哪种方式能带来更好的性能。

And so one of the experiments we did was sort of examining, as we grow the number of parameters in our model by scaling along these two different choices, which one for the same, like, approximate number of parameters yields a better performance.

Speaker 3

而深度曲线大致是这样的。

And the depth curve kinda goes like this.

Speaker 3

它上升得非常快。

It jumps up pretty fast.

Speaker 3

这在我们的论文中随处可见。

That's, like, present throughout our paper.

Speaker 3

对于宽度来说,增长得稍微慢一些。

For width, it grows a little bit more slowly.

Speaker 3

因此,从中得出的结论是,如果你的资源比较有限,按深度扩展可能更好,因为这样可以用更少的参数、更小的模型、更少的可学习参数

And so the kinda takeaway from that is that if you are a bit more resource constrained, scaling along depth might be better because you get fewer parameters, a smaller model, a smaller number of learnable parameters

Speaker 3

这是昂贵的。

Which is expensive.

Speaker 3

没错。

Exactly.

Speaker 3

而且一般来说,参数越多,成本自然也越高。

And in general, of course, like, more parameters is also gonna be more expensive.

Speaker 3

所以,使用这些网络时,这也是一个需要考虑的额外因素。

So that's just, like, another consideration to think about when using these networks, I suppose.

Speaker 3

是的。

Yeah.

Speaker 0

还有其他类似的经验法则可以提取吗?

Any other sort of rules of thumbs like that that I can extract?

Speaker 0

这只是我能想到的最基础的一个。

This is just the most basic one that I could think of.

Speaker 0

是的。

Yeah.

Speaker 0

我不知道还有其他的吗?

I don't know if there's any others?

Speaker 1

是的。

Yeah.

Speaker 1

我想,回到你最初关于权衡的问题。

I guess, like, to your original question of, like, the trade offs.

Speaker 1

其中一个权衡或局限性是,显然,如果你让网络更大,运行时间会更长。

Like, one of the trade offs, one of the limitations that we see, is, like, obviously, if you make the networks bigger, it will take longer to run.

Speaker 1

对吧?

Right?

Speaker 1

所以,如果你在某个深度层次上把网络深度翻倍,前向传播的时间可能会增加一倍。

So if you, like, double the depth at some level of depth, it might take, like, twice as long to, like, make a forward pass through the network.

Speaker 1

对吧?

Right?

Speaker 1

然而,在我们的论文中,对于大多数环境,我们实际上能够在很少的层数内就达到近乎完美的性能,根本不需要达到上千层。

However, this is not so, like, within our paper, like, for most environments, we are able to, like, saturate like, get to, like, almost perfect performance within just you know, we don't even need to get to, like, a thousand layers.

Speaker 1

比如,可能只需要64层就足够了。

Like, maybe just 64 layers, for example, is sufficient.

Speaker 1

在这个范围内,网络的延迟实际上并不一定构成显著的瓶颈。

And in this regime, like, the latency of the network is not necessarily, like, a significant bottleneck.

Speaker 1

你可以想象,在很多任务中,尤其是在强化学习中,收集数据才是真正的瓶颈。

Like, you can imagine there's a lot of tasks in which, especially in RL, that, like, collecting data might be the bottleneck.

Speaker 1

对吧?

Right?

Speaker 1

通过我们的网络进行前向传播可能并不是瓶颈。

And making forward passes through our network may not be the bottleneck.

Speaker 1

因此,在我们的环境和研究中,我们专门使用了基于 JAX 的 GPU 加速环境 JAX GCRL。

And so in our environments, in our research, we specifically use the JaxGCRL environments, which are JAX based, GPU accelerated environments.

Speaker 1

因此,我们可以同时并行收集数千条环境轨迹。

So we can collect, like, thousands of, like, environment trajectories, like, in parallel at the same time.

Speaker 0

哦,这是内置的?

Oh, this is built in?

Speaker 1

对。

Right.

Speaker 1

这是内置功能,使我们能够同时在所有这些环境中收集上千条轨迹。

This is built in so that we can collect, you know, like like, a thousand trajectories at the same time along all these environments.

Speaker 1

这确保了我们有足够的数据来充分学习。

And so that makes sure that, like, we have enough data to, like, saturate the learning.
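
[Editor's note: a sketch of the kind of vectorized rollout collection JAX enables. env_reset, env_step, and the linear policy are hypothetical stand-ins, not the JaxGCRL API.]

```python
# GPU-parallel data collection: jax.vmap runs thousands of rollouts at once,
# and jax.jit compiles the whole loop onto the accelerator.
import jax
import jax.numpy as jnp

def env_reset(key):
    return jax.random.normal(key, (10,))       # toy initial state

def env_step(state, action):
    return state + 0.1 * jnp.tanh(action)      # toy dynamics

def rollout(key, policy_params, horizon=64):
    state = env_reset(key)
    def step(carry, _):
        action = jnp.tanh(carry @ policy_params)  # toy linear policy
        nxt = env_step(carry, action)
        return nxt, nxt
    _, trajectory = jax.lax.scan(step, state, None, length=horizon)
    return trajectory                           # (horizon, state_dim)

keys = jax.random.split(jax.random.PRNGKey(0), 4096)
policy_params = jnp.zeros((10, 10))
collect = jax.jit(jax.vmap(rollout, in_axes=(0, None)))
trajectories = collect(keys, policy_params)     # (4096, 64, 10)
```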

Speaker 1

哇。

Wow.

Speaker 1

是的。

Yeah.

Speaker 3

这就像他们所称的工作。

That's, like, work they've been called.

Speaker 0

好的。

Okay.

Speaker 0

而且我不知道你是否想深入探讨一下

And you I don't know if you wanna explore expound up

Speaker 4

关于JaxGCRL。

on that, on the JaxGCRL.

Speaker 0

也许我可以,而且你知道,大多数人对PyTorch比较熟悉,但对JAX可能不太了解。

Maybe I can And and, you know, most people are familiar with PyTorch, maybe less familiar with JAX.

Speaker 4

关于JAX。

With JAX.

Speaker 4

我认为JAX正在获得关注,尤其是在强化学习领域,因为对于在线强化学习来说,获取尽可能多的数据是最重要的。

I think JAX is getting the traction, especially in the RL field, because for online reinforcement learning, getting as much data as you can is the most important.

Speaker 0

一定有PyTorch的等价物。

There's gotta be a PyTorch equivalent.

Speaker 0

但无论如何,对于其他也在探索这种 rollout 的人,有什么建议吗?

But anyway, how are any tips for other people also exploring this kind of rollout?

Speaker 4

是的。

Yeah.

Speaker 4

是的。

Yeah.

Speaker 4

所以我认为我也可以推荐一下,对于目标条件强化学习,我用的是JaxGCRL,但也有一些多智能体JAX实现和其他方法。

So I think I can also recommend, like, for goal-conditioned RL, JaxGCRL, but there are also, like, multi agent JAX implementations and others.

Speaker 4

回到我们的论文,如果你看这些图表,会发现只有当我们跨越五千万次转换的门槛时,性能才出现巨大提升。

So going back to our paper, if you look at the plots, you only see this, like, huge performance increase when we cross, like, 50 million transitions.

Speaker 4

所以我认为这里的数据至关重要。

So I think the data is crucial like here.

Speaker 1

是的。

Yeah.

Speaker 1

我想进一步补充一下,我喜欢将这些成功经验与深度学习其他领域的成就进行类比。

I guess even to build on that, like I like drawing analogies to like successes in other areas of deep learning.

Speaker 1

比如,在大型语言模型中,我们之所以能够扩展到如此庞大的网络,是因为我们找到了一种范式,可以利用整个互联网规模的数据进行训练。

Like, for example, in large language models, the reason why we're able to scale to such large networks is that we found a paradigm in which we can leverage the entire Internet scale of data to learn.

Speaker 1

对吧?

Right?

Speaker 1

因此,传统上,强化学习中的数据很难获取。

And so data in RL traditionally has been hard to come by.

Speaker 1

但现在有了这些GPU加速的环境,我们只需几小时就能收集数亿条transition。

But now with these, like, GPU accelerated environments, we can collect hundreds of millions of transitions within just a few hours.

Speaker 1

因此,我认为这为我们提供了一个绝佳的测试平台,让我们能够找到扩展网络容量的方法,并获得类似的提升。

And so I think that this serves as, like, a really good test bed for us to be able to also find ways to scale up, like, network capacity and get similar kind of gains.

Speaker 0

我想我之前问过,你是不是说你在LLM的预训练上会采取不同的做法?

I think I asked about, are you saying that you would do pretraining differently in LLMs?

Speaker 0

那么,现在的目标有什么不同?

Like, what's the what's the difference objective now?

Speaker 0

嗯。

Yeah.

Speaker 0

很简单,你所提到的范式是下一个词,没错。

Very simply very simply, the the paradigm that you're referencing is next word Yeah.

Speaker 0

或者下一个标记。

Or next token.

Speaker 0

对吧?

Right?

Speaker 0

这非常稳健。

It's very robust.

Speaker 0

嗯。

Yeah.

Speaker 0

嗯。

Yeah.

Speaker 0

你如何改变这一点?

How do you change that?

Speaker 1

哦,我不是说我们在改变它。

Oh, I'm not saying that we're changing it.

Speaker 1

我想利用其中的见解,并将其应用到强化学习中。

I wanna leverage insights from that to apply to RL.

Speaker 0

我觉得他应该走另一条路。

I I feel like he should go the other way.

Speaker 1

你觉得他应该走另一条路。

You think he should go the other way.

Speaker 1

也许吧。

Maybe.

Speaker 1

我的意思是,那也会是一个非常有趣的研究方向。

I mean, that that would be a very interesting research direction too.

Speaker 1

但事实上,是的,就这一点而言,我一直在想,我们的目标函数在某些情况下并不是严格的下一个词预测,而更像是下一个状态预测。

But actually, yeah, even on that point, like, one of the things I was thinking about is that the way that our objective works is, in some sense, it's not exactly next word prediction, but it's kind of like next state prediction.

Speaker 1

对吧?

Right?

Speaker 1

你可以想象,你处于某个当前状态和某个当前动作,我们想要预测这个未来状态——这个特定的状态——是否属于同一条轨迹的未来状态,还是另一条轨迹的未来状态。

You imagine you're at some current state and you're at some current action, and we wanna predict whether or not this future state this this certain state is a future state along the same trajectory or a different trajectory.

Speaker 1

所以某种程度上,我们实际上是在做某种类似

And so in some sense, we are actually doing some sort of, like

Speaker 0

隐式世界模型。

Implicit world model.

Speaker 0

隐式的,比如,

Implicit, like,

Speaker 1

你知道,我在想

you know, like, in in I don't know

Speaker 0

如果这是个不好的词,那这就是。

if that's a bad word this this is.

Speaker 1

或者,在语言中,你用交叉熵损失来分类下一个词元。

Or, like, in language, you do a cross entropy loss to classify the next token.

Speaker 1

对吧?

Right?

Speaker 1

而在这里,我们只是对某个下一个状态是否属于同一轨迹进行二分类。

And here, we're just doing a binary classification of, like, whether or not some next state is along the same trajectory.

Speaker 0

是的。

Yeah.

Speaker 0

是的。

Yeah.

Speaker 0

这是一种分类。

It's a classification.

Speaker 0

是的。

Yeah.

Speaker 0

是的。

Yeah.

Speaker 0

是的。

Yeah.

Speaker 1

所以我确实看到这里有一些相似之处,或许我们应该深入探讨一下,到底是什么让深度学习能够扩展,以及我们如何利用这些洞察,如何提炼出这些核心原理,并将其应用到各个不同领域,无论是语言还是强化学习。

And so I do see that there are some, like, sort of parallels here that perhaps we should dig into deeper and see, like, what is the core to of what enables deep learning to scale, and then how can we, like, leverage that how how can distill those, like, insights and then apply those across, like, all different fields, whether it's language or reinforcement learning.

Speaker 0

是的。

Yeah.

Speaker 0

你明白我关于榜样作用的意思了吗?

Did you did you get my my meaning about the role model stuff?

Speaker 3

嗯。

Yeah.

Speaker 3

嗯。

Yeah.

Speaker 3

实际上,我好像昨天在海报展上听到Eysenbach教授谈到过这个,他正在向几个人解释:因为这是在进行表征学习,为给定的状态、动作和目标学习有意义的表征,从某种意义上说,你可以把它看作是在学习环境模型、学习世界模型,而无需进行任何高维复杂的下一帧预测之类的工作。

Actually, I think I might have heard Professor Eysenbach yesterday talking about this at a poster, and he's explaining to a couple people that because this is, like, doing representation learning and trying to learn these meaningful representations for a given state and action, for a given goal, in some sense, you can think of it almost like learning a model of the environment, learning a model of the world, but without having to do any sort of, like, next frame prediction or stuff like that that's a little bit more high dimensional and complex.

Speaker 3

嗯。

Yeah.

Speaker 0

嗯。

Yeah.

Speaker 0

我试图思考和推进的角度是,他们不是去学习下一个世界,而是生成多个可能的世界候选,并对它们进行分类,正如你所说,这正是我做事的方式。

I will think, like, the the angle that I'm trying to think about and push is instead of learn the next world, they're they're basically, like, generate a number of candidate possible worlds and classify them, to your point, which is exactly how I do things.

Speaker 0

比如说我在打扑克,试图判断你手里有什么牌。

Let's say I'm playing poker, and I'm trying to classify what hands you have.

Speaker 0

好吧,根据你的行为,手牌范围是变化的,我获得的信息越多,就越能确定你手里到底是什么牌,基于你展示或伪装的内容。

Well, there's a range of hands based on what you're doing, and the more information I get, the more I resolve to, oh, I know exactly what hand you have based on what you're showing, you know, or you're bluffing.

Speaker 0

但这是另一回事。

But that's a that's a different thing.

Speaker 0

但,你知道,我的意思是?

But, you know, you know I mean?

Speaker 0

我觉得,这正是表征的终极角度,嗯。

Like, I I feel like that is the ultimate sort of angle of representation Mhmm.

Speaker 0

也就是一个世界。

Which is a a world.

Speaker 0

但我不确定,这是否比视频生成领域那些更具体的全局模型太过模糊了。

But I don't know if that is too vague compared to the more concrete types of world models that, let's say, the video gen people are doing.

Speaker 0

然后,我想还有另一件事,我也在探索。

And then I I guess one one other thing, like, I I'm also exploring.

Speaker 0

你提到过,深度模型更慢或者更昂贵。

I you mentioned, like, the deep models being slower or more expensive.

Speaker 0

嗯。

Yeah.

Speaker 0

这确实是推理领域让模型变浅的一种趋势。

That that is a trend in the inference world of making models shallower.

Speaker 0

对吧?

Right?

Speaker 0

我在想,像‘深度教师,浅层学生’这样的简短说法,会不会是一个不错的部署范式。

And I wonder if this, like, short catchphrase I was thinking about, like, deep teacher, shallow student would be a good deployment paradigm.

Speaker 0

嗯。

Yeah.

Speaker 0

比如,你用大模型推动前沿能力,然后再进行蒸馏

Like, you push the frontier capabilities with the big model, and then you distill

Speaker 1

把它们蒸馏回去。

Distill it back.

Speaker 1

好的。

Alright.

Speaker 1

是的。

Yeah.

Speaker 1

实际上,这是个很好的观点。

Actually, this is a good point.

Speaker 1

比如,如果你访问我们的网站,这是我们列在最底部的未来研究方向之一。

Like, if you go out to our website, like, this is one of the future directions that we list at the very bottom.

Speaker 1

哦,原来如此。

Oh, okay.

Speaker 1

是的。

Yeah.

Speaker 1

我们非常希望看看是否能获得类似的性能。

We we we we would love to see if we could get similar performance.

Speaker 1

我们推动了,你知道的,我们在JaxGCRL上的目标条件强化学习中取得了显著的领先性能。

Like, we pushed the you know, like, we do achieve state of the art performance on goal-conditioned RL on JaxGCRL by a significant amount.

Speaker 1

因此,看到强化学习智能体训练能力的前沿被推向前所未有的高度,令人非常兴奋。

And so it was very exciting to see the the, like, the the sort of frontier of the ability to train RL agents sort of pushed.

Speaker 1

如果我们能以一种与标准网络一样高效的方式实现这一点,那就太棒了。

And if we can do that in a way that also sort of is just as efficient as a standard, you know, networks, that would be very cool.

Speaker 1

所以,你知道,有没有

So, you know, like, is there

Speaker 0

有的。

any Yeah.

Speaker 0

因为训练和推理时使用的东西不必相同。

Because training doesn't have to be the the same thing that you deploy at inference.

Speaker 0

对。

Right.

Speaker 0

你明白我的意思吗?

You know what mean?

Speaker 0

比如,所以

Like, so

Speaker 1

嗯。

Yeah.

Speaker 1

所以,是的。

So Yeah.

Speaker 1

所以,如果能通过蒸馏到更小的模型或剪枝模型,同时仍保持性能,那将是一个非常有趣的研究方向,我们

So if there's ways to, like, distill down to a smaller model or prune the model and still retain performance, that's a very interesting research direction that we

Speaker 0

应该填补。

should fill.
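
[Editor's note: a hedged sketch of the "deep teacher, shallow student" idea floated in this exchange. This is one plausible way to do it, listed by the team only as a future direction, not something from the paper.]

```python
# Distillation sketch: train a small student to match the representations
# produced by a frozen, very deep teacher network.
import jax
import jax.numpy as jnp

def student_apply(params, states):
    # Hypothetical two-layer student.
    h = jax.nn.relu(states @ params["W1"] + params["b1"])
    return h @ params["W2"] + params["b2"]

def distill_loss(student_params, states, teacher_repr):
    # Match the student's output to the teacher's precomputed representation.
    pred = student_apply(student_params, states)
    return jnp.mean((pred - jax.lax.stop_gradient(teacher_repr)) ** 2)

# Gradients flow only into the student; the deep teacher stays fixed.
distill_grad = jax.grad(distill_loss)
```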

Speaker 0

让我们谈谈其他未来方向。

Let's talk about other future directions.

Speaker 0

你个人还有哪些其他的热情所在?

What what what else is your personal passions?

Speaker 4

是的。

Yeah.

Speaker 4

目前我正在研究强化学习中的拼接方向。

So currently I'm pursuing direction of stitching in reinforcement learning.

Speaker 4

我们试图从更短的子行为中泛化强化学习,以便在测试时将它们拼接合并。

So we are trying to generalize reinforcement learning from shorter sub behaviors so that they are stitched, merged during the test time.

Speaker 4

是的,我认为这是我在博士期间要处理的最后几篇论文之一。

And yeah, I think this is one of my last papers that I will tackle during the PhD.

Speaker 1

我个人非常好奇,我们究竟能不能推动到极致?

Personally, I would I'm very curious of like, can we like, what's the real like, can we push?

Speaker 1

我很好奇,能否尽可能地推进前沿。

I'm I'm I'm curious about, like, advancing the frontier as much as possible.

Speaker 1

所以如果你仔细看我们的论文,我们专注于增加深度,但发现增加宽度也能提升性能,而且我们还发现,通过增加深度,实际上也解锁了增加批量大小的能力。

So if you actually look at our paper, we focus on scaling depth, but we notice that we see that scaling width actually also improves performance, and we also find that actually by scaling depth, we actually unlock the ability to scale along batch size as well.

Speaker 1

所以这真是。

So this is Ah.

Speaker 1

这是其中之一,嗯。

One of yeah.

Speaker 1

所以,好吧。

So so okay.

Speaker 1

这就像,我想是共线的,

It's like, I guess collinear,

Speaker 0

嗯,是的。

like yeah.

Speaker 1

对。

Right.

Speaker 1

所以,好吧。

So, like okay.

Speaker 1

我想说明一下背景,在传统强化学习中,比如基于价值的强化学习,增大批量大小效果并不明显。

I guess for context, like, in traditional RL, like, value based RL, scaling batch size is not super effective.

Speaker 1

但我们也看到,在深度学习的其他领域中,有研究指出,增大批量大小只有在网络容量足够大以充分利用增大的批量时才最有效。

But there's we also can see there's also other work in other areas of deep learning that show that scaling batch size is only most effective when there's, like, a large enough network capacity to take advantage of the scaled batch size.

Speaker 1

而我们实际上发现,嗯,也许吧。

And we actually find that, you know, perhaps yeah.

Speaker 1

因此,一个可能的假设是,为什么在传统强化学习中增大批量效果不佳,可能是因为我们一直使用的是这些小型网络,无法捕捉到这种效应。

So one hypothesis might be, like, perhaps the reason why scaling batches isn't that effective in traditional RLs because, like, we've been using these tiny networks that haven't been able to capture that.

Speaker 1

我们的一项实验是,由于我们实现了深度网络的顺序训练,这为我们提供了一个绝佳的测试平台来验证这一假设,我们发现,确实,随着网络容量的扩大,我们也解锁了批量缩放的这一不同维度。

And one of our experiments is that, like, because we enabled training of deep networks, this is a great test bed for, you know, like, testing this hypothesis, and we find that indeed, as we scale the network capacity, we also unlock this different dimension of scaling batch size.

Speaker 1

所以我本来想说的是,我非常好奇,对于那些拥有足够算力的人来说,能否拿一些这些环境,把批量大小、网络深度都扩展到最大能力,同时也要扩大批量大小。

And so all I was about to say is that I'm very curious for someone, like, with enough compute to, like, take some of these environments, scale up depth to the maximum capability, also scale along width, also scale along batch size.

Speaker 1

让我们像在语言模型中那样,同时在多个维度上进行扩展,我们能否也解锁不同的扩展维度?通过这样做,我们能获得哪些能力?又能把强化学习智能体的训练前沿推多远?

And let let's like, basically, like, in the same way that in language, we're we're scaling along so many different axes, can we unlock different dimensions of scaling as well, and what capabilities and how far can we push the frontier of training these RL agents from doing that?

Speaker 0

在我们转交之前,肖恩,当你说到‘足够算力’时,你指的是多大的算力预算?

Before we pass it, Sean, when you say enough compute, what kind of compute budget did you have?

Speaker 0

你是怎么知道的?我只是想看看你们到底有什么资源。

How does it you know, I just wanna see what you guys got.

Speaker 1

问题。

Question.

Speaker 1

所以我们想确保这一点是可行的,我们希望让它足够易用。

So we we wanted to make sure that this is we we wanted to make it such that, like, you know, it's quite accessible.

Speaker 1

好的一点是,我们的所有实验,即使是千层网络,也能在单张80GB的H100显卡上运行。

So I guess the nice thing is that all of our experiments, even the thousand layer networks, can be run on one single 80 gigabyte H100 GPU.

Speaker 1

所以这些成本都控制在了这个范围内。

So that's all those dollars.

Speaker 1

是的。

Yeah.

Speaker 1

对。

Right.

Speaker 1

对。

Right.

Speaker 1

对。

Right.

Speaker 1

所以我们所有的实验都在单张GPU上运行,但理论上,如果我们有一个分布式训练设置,能够全力投入计算资源,真正去推动前沿,那将会非常有趣,看看会发生什么。

So everything we run on one GPU, but in theory, if we had, you know, like a distributed training setup and, like, can just, like, blast compute through this and really wanted to push the frontier, it'd be very interesting to see how things go.

Speaker 3

没错。

Yep.

Speaker 3

酷。

Cool.

Speaker 3

我一直在积极学习尽可能多的关于视觉-语言-动作模型和NeurIPS上的世界模型的知识。

And I've actively been trying to learn as much as I can about vision language action models, world models at NeurIPS.

Speaker 3

我去了很多

I'm going to a lot of

Speaker 0

机器语言动作模型?

Machine language action models?

Speaker 3

那是视觉语言。

That's vision language.

Speaker 3

视觉语言。

Vision language.

Speaker 3

是的。

Yeah.

Speaker 3

是的。

Yeah.

Speaker 3

是的,对表示学习在机器人中的应用很好奇。

Yeah, curious about applications of representation learning in robotics.

Speaker 3

用于这些。

For these.

Speaker 3

是的。

Yeah.

Speaker 3

没错。

Exactly.

Speaker 3

针对机器人。

For robotics.

Speaker 3

正在积极尝试探索这一领域的更多内容。

Actively trying to explore more in that area.

Speaker 3

所以我在大量阅读文献,和很多人交流。

So just reading a lot of literature, talking to as many people.

Speaker 3

这就像

It's like

Speaker 0

我们刚刚发布了与General Intuition的对话。

We just released our episode with General Intuition.

Speaker 3

哦,好的。

Oh, okay.

Speaker 3

太棒了。

Awesome.

Speaker 0

如果你了解他们的历史,他们最初是一家游戏剪辑公司,现在基本上已经开发出了一个视觉-语言-动作模型,是的。

Where if if you know a bit about their history, they started as a gaming clipping company, and they basically have a vision language action model Yeah.

Speaker 0

我看到了一个预览版。

Which I I saw I saw a preview.

Speaker 0

它非常令人印象深刻。

It was it was very impressive.

Speaker 0

我不确定它在具身应用场景中的可迁移性如何,但其实也不需要。

I'm not sure exactly how transferable it is to embodied use cases, but it doesn't have to.

Speaker 0

比如,屏幕就挺好。

Like, screen is fine.

Speaker 0

你知道的?

You know?

Speaker 0

是啊。

Like, yeah.

Speaker 0

我不确定你有没有什么看法,嗯。

I I I don't know if you have any takes on Yeah.

Speaker 3

这确实是一个令人兴奋的研究方向。

So it's an exciting research direction, definitely.

Speaker 0

是的。

Yeah.

Speaker 0

我认为,将‘动作’作为输出内容这一概念,在业界其实并不流行。

I I I think the the the the the concepts of actions as as something that you are outputting is actually not that popular in industry.

Speaker 0

对吧?

Right?

Speaker 0

仅仅是因为过去三年里,文本完全占据了主导地位,而工具调用也不过是另一种形式的结构化文本。

Only because text has completely dominated the last three years, and tool calling and which is just another form of structured text.

Speaker 0

我觉得动作研究有点像是我不知道需要发生什么才能推动它进入下一阶段。

And and I I feel like the action research is is kind of like I don't know how I don't know what needs to happen in order to unlock the next phase in in that.

Speaker 0

我不确定你有没有看到什么有趣的东西,有的话说出来吧。

I don't know if you if you've seen anything interesting out here, shout shout it out.

Speaker 0

嗯。

Yeah.

Speaker 0

有很多

There's a lot

Speaker 3

关于利用预训练的视觉语言模型的精彩工作

of cool work on, like, leveraging pretrained VLMs

Speaker 0

然后你冻结它,再应用它。

and the You freeze it, and then you apply it.

Speaker 0

嗯。

Yeah.

Speaker 0

然后你把

And then you put

Speaker 3

所有东西都放在上面,比如专家来输出动作。

everything on top of that, like, sort of experts to output actions.

Speaker 3

还有,比如用于分层规划的系统,可能先输出一个更高层次的计划,这是一个较大的网络,推理时间较长,因此以较低频率输出计划,比如以某种块的形式,然后在此基础上,还有一个运行速度更快的二级系统。

Also, like, systems for doing, like, hierarchical planning, maybe outputting some higher level plan that and this is, a larger network that takes a long time to a little longer to do inference, and so it outputs its plans with less frequency, like some sort of chunk, and then from there, there's, like, some sort of second system that operates a bit more fast.

Speaker 3

我认为这个方向有很多有趣的研究,所以我非常期待。

I think there's quite a bit of interesting research in that direction, so that's what I'm I'm looking forward to.

Speaker 0

酷。

Cool.

Speaker 0

最后一个问题。

Final question.

Speaker 0

在海报展示期间,你被问过最难的问题,或者最难忘的相遇,有没有遇到什么名人?

Hardest question you were asked at the poster session, or just a favorite encounter, anyone famous that you met?

Speaker 1

其实我还没怎么有机会去参加这个会议。

So I actually haven't gotten a chance to go to the conference that much.

Speaker 1

我现在全职工作了。

I'm actually working full time now.

Speaker 1

哦,天哪。

So Oh, dang.

Speaker 1

是啊。

Yeah.

Speaker 1

到目前为止,我刚刚在会议开始前几分钟才拿到我的徽章。

So so far, I I I actually literally just got my badge, like, a few moments before my session.

Speaker 1

所以,我想我可能不是回答这个问题的最佳人选。

So I I guess I wouldn't be the best to answer that question.

Speaker 0

不。

No.

Speaker 0

不。

No.

Speaker 0

不。

No.

Speaker 0

因为人们会向你问一些问题。

Like, the because you like, people ask you stuff.

Speaker 0

对吧?

Right?

Speaker 1

哦,哦,哦。

Oh oh oh.

Speaker 3

我是我的密友。

I'm I'm my close friend.

Speaker 0

因为人们会问你或遇见你,然后你给人一种感觉,就是人们在说些什么,没错。

Because people asking you or meeting you and, like, you know, just give a vibe of, like, what people are saying and Yeah.

Speaker 1

我在想,人们觉得这相当令人耳目一新。

Was thinking about people were very I think it's sort of, like, a very eye opening.

Speaker 1

我认为普遍的看法是,人们觉得这篇论文非常令人耳目一新,因为它的目标相当简单。

I think the general impression is that people thought it was a very eye opening paper because, like, the objective is quite simple.

Speaker 1

这非常优雅,对我们来说,能够——我不太想说‘颠覆’,但至少挑战一下‘强化学习不可扩展’这种传统观点,并将其推至千层深度,同时看到性能持续提升,这很了不起。

It's quite elegant, and for us to be able to, like, you know, like I I don't wanna say, like, overturn, but, like, sort of challenge the conventional wisdom that, like, RL is not super scalable, and push it to such limits, like, a thousand layers deep, and see continued improvement in performance.

Speaker 1

我觉得我得到的普遍印象是,如果我们能沿着这个方向继续发展,确实可以在多个维度上扩展强化学习的能力,推动其前沿边界。

I think the general impression that I've gotten is that, you know, this this this could be like a really cool like, if if we can sort of build along this direction and that, like, we can really scale along to all these different dimensions and push the frontier of the ability for RL.

Speaker 1

我非常好奇这会如何发展。

I'm very curious to see how that plays out.

Speaker 0

好的。

Alright.

Speaker 0

非常感谢你抽空前来。

Well, thank you so much for dropping by.

Speaker 0

再次祝贺你的论文。

Congrats on the paper again.

Speaker 0

嗯。

Yeah.

Speaker 0

祝你未来的工作顺利。

And good luck in your future work.

Speaker 0

谢谢。

Thank you.

Speaker 0

谢谢你们邀请我们。

Thanks for having us.

Speaker 0

嗯。

Yeah.

Speaker 1

好的。

Yep.

Speaker 1

谢谢。

Thanks.
