Today I saw a tweet from George Hotz. Here it is:

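If you are not familiar with him, some background: tinygrad is the project he has been working on, a neural network framework similar to PyTorch. To summarize his view: in the future, a computer will boot directly into a neural network. A neural network depends on a neural network framework, so tinygrad acts like the hardware layer and the neural network is the software layer. Hence the neural network framework is effectively the operating system. This is really one way of decomposing the LLM OS idea. In addition, without a CPU there would be no memory problems, since a neural network has only data flow and no control flow.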

Finally, I asked Claude to comment on Hotz's tweet.

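This was my first time using an LLM to critique someone else's opinion, and I was honestly a bit amazed. The answer was well reasoned, well supported, and fairly convincing.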

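Although I am a fan of Hotz, and often watched his programming livestreams when I was in grad school, at this point in time my reaction to this tweet still comes down to two words.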

Too hype.

footnote: Hotz's idea is entirely achievable, provided we get an LLM with an unlimited context window and no hallucinations.

September 23, 2024

Deep Reinforcement Learning 101

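These are notes on my path into deep reinforcement learning. The material comes from Pieter Abbeel's Foundations of Deep RL series. The whole post aims for continuity between topics, so I will fill in detailed formulas and historical background. It will be updated continuously, and it also serves as a prototype for the Ilya30 write-up I said I would do. Because I draw on different sources, some concepts will overlap. The upside is that you get to see the same concept described from different angles, which makes for a more complete understanding.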

Foundations of Deep Reinforcement Learning with Pieter Abbeel

Lecture 1: MDPs, Exact Solution Methods, Max-ent RL

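Click here for the Lecture 1 slides. Lecture 1 in one sentence: reinforcement learning can be formulated as a Markov Decision Process (MDP). There are two basic methods for solving an MDP exactly, Value Iteration and Policy Iteration. In addition, adding an entropy term to the MDP objective makes the solution more robust.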

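MDP: To understand what an MDP is, you first need to understand Markov chains. Markov introduced the Markov chain in a paper published in 1906, in order to show that the i.i.d. assumption is not a prerequisite for the law of large numbers. If you want to learn more about Markov chains, I recommend this video.
- Markov chain: the premise is that the state at the next time step depends only on the state at the current time step; the dynamics are described by a transition matrix.
- first-order Markov chain & n-th-order Markov chain:
- Markov chain convergence: As mentioned above, the Markov chain was introduced to show that the law of large numbers still holds without the i.i.d. assumption. So how does a Markov chain converge?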
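To make the convergence question concrete, here is a minimal sketch that repeatedly multiplies a state distribution by a transition matrix until it stops changing. The two-state matrix `P` is a made-up example of mine, not taken from the lecture, and the helper names `step` and `stationary` are hypothetical.

```python
def step(dist, P):
    """One step of the chain: new_dist[j] = sum_i dist[i] * P[i][j]."""
    n = len(P)
    return [sum(dist[i] * P[i][j] for i in range(n)) for j in range(n)]


def stationary(P, dist, tol=1e-12, max_iters=10_000):
    """Iterate the chain until the distribution changes by less than tol."""
    for _ in range(max_iters):
        new = step(dist, P)
        if max(abs(a - b) for a, b in zip(new, dist)) < tol:
            return new
        dist = new
    return dist


# Hypothetical two-state transition matrix; each row sums to 1.
P = [[0.9, 0.1],
     [0.5, 0.5]]

# Two very different starting distributions.
pi_a = stationary(P, [1.0, 0.0])
pi_b = stationary(P, [0.0, 1.0])
print(pi_a, pi_b)
```

Both starting points should end up at the same distribution, which for this matrix is (5/6, 1/6), the solution of pi = pi P. That independence from the starting state is what convergence means here, at least for a well-behaved chain like this one.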

$$ \frac{n!}{k!(n-k)!} = \binom{n}{k} $$

Bellman equation:
Value Iteration:
Policy Iteration:
Maximum Entropy Formulation:

Here we use the grid world from the slides, together with code, to walk through the whole process.
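As a starting point, here is a minimal value iteration sketch on a 3x4 grid world. It is a simplification of mine rather than the exact setup in the slides: moves are deterministic (no action noise), the discount is assumed to be 0.9, the reward is received on entering a terminal cell, and the layout (one wall, a +1 goal, a -1 pit) is an assumption.

```python
GAMMA = 0.9                      # assumed discount factor
ROWS, COLS = 3, 4
WALL = {(1, 1)}                  # assumed layout
TERMINAL = {(0, 3): 1.0, (1, 3): -1.0}   # reward on entering the cell
ACTIONS = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}


def move(state, action):
    """Deterministic move; bumping into a wall or the border stays in place."""
    r, c = state
    dr, dc = ACTIONS[action]
    nxt = (r + dr, c + dc)
    if not (0 <= nxt[0] < ROWS and 0 <= nxt[1] < COLS) or nxt in WALL:
        return state
    return nxt


def q_value(V, s, a):
    """Bellman backup for one (state, action) pair."""
    nxt = move(s, a)
    if nxt in TERMINAL:
        return TERMINAL[nxt]     # episode ends, no future value
    return GAMMA * V[nxt]        # step reward is 0 elsewhere


def value_iteration(tol=1e-10):
    V = {(r, c): 0.0 for r in range(ROWS) for c in range(COLS)
         if (r, c) not in WALL and (r, c) not in TERMINAL}
    while True:
        delta = 0.0
        for s in V:
            best = max(q_value(V, s, a) for a in ACTIONS)
            delta = max(delta, abs(best - V[s]))
            V[s] = best          # in-place update
        if delta < tol:
            return V


V = value_iteration()
# Greedy policy extracted from the converged values.
policy = {s: max(ACTIONS, key=lambda a: q_value(V, s, a)) for s in V}
print(V[(0, 2)], policy[(0, 2)])
```

The cell next to the goal gets value 1.0, and each step further away multiplies the value by the discount, so the greedy policy simply follows increasing values toward the goal while steering clear of the pit.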

Summary:

Lecture 2: Deep Q-Learning

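Click here for the Lecture 2 slides. Lecture 2 in one sentence: in practice the state space of an MDP is usually very large, so value iteration and policy iteration are not suitable for solving complex MDPs directly. Q-Learning instead solves the MDP from sampled transitions. However, tabular Q-Learning still needs a table with one entry per state-action pair. DQN replaces that table with a neural network that approximates the Q function.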
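The sampling idea is easier to see in code. Below is a minimal tabular Q-learning sketch on a toy 5-state corridor. The environment, the learning rate, the discount, and the purely random behavior policy are all assumptions of mine for illustration, not something from the lecture.

```python
import random

N_STATES = 5                 # states 0..4; state 4 is terminal
GOAL = N_STATES - 1
LEFT, RIGHT = 0, 1
ALPHA, GAMMA = 0.5, 0.9      # assumed learning rate and discount


def env_step(state, action):
    """Toy corridor: reward 1 on reaching the goal, 0 otherwise."""
    nxt = max(state - 1, 0) if action == LEFT else state + 1
    reward = 1.0 if nxt == GOAL else 0.0
    return nxt, reward


def q_learning(episodes=500, seed=0):
    rng = random.Random(seed)
    Q = [[0.0, 0.0] for _ in range(N_STATES)]
    for _ in range(episodes):
        state = 0
        while state != GOAL:
            # Q-learning is off-policy, so a random behavior policy is enough.
            action = rng.choice((LEFT, RIGHT))
            nxt, reward = env_step(state, action)
            # Sampled Bellman target; Q[GOAL] stays 0, so no bootstrap there.
            target = reward + GAMMA * max(Q[nxt])
            Q[state][action] += ALPHA * (target - Q[state][action])
            state = nxt
    return Q


Q = q_learning()
greedy = [max((LEFT, RIGHT), key=lambda a: Q[s][a]) for s in range(GOAL)]
print(greedy)
```

Note that the agent never looks at the transition function directly. It only sees sampled `(state, action, reward, next state)` tuples, which is exactly what lets the method scale beyond exact solution methods. DQN keeps the same target but swaps the table `Q` for a neural network.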

Q-Learning:
DQN:

Lecture 3: Policy Gradients, Advantage Estimation

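Click here for the Lecture 3 slides. Lecture 3 in one sentence: Value Iteration on its own does not tell you which action to take next, and Q-learning cannot be solved efficiently in high-dimensional spaces. Compared with Value Iteration and Q-learning, Policy Gradients use a simpler representation. However, the Likelihood Ratio Gradient used by Policy Gradients is expressed in terms of whole paths, so Temporal Decomposition is needed to break it down into per-step states and actions. Baseline subtraction is then used to reduce the variance of the estimate, so that only better-than-baseline paths have their probability increased. Applying these formulas requires Value function estimation, and the Advantage Estimation part introduces some more advanced Value function estimation methods.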
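For reference, here is how I would write the chain described above as formulas. This is a sketch in my own notation (path $\tau$, policy $\pi_\theta$, action $u_t$, state $s_t$, horizon $H$, $m$ sampled paths, baseline $b$), which I am assuming matches the slides closely enough.

Likelihood ratio gradient over whole paths:

$$ \nabla_\theta U(\theta) = \mathbb{E}_{\tau \sim P(\tau;\theta)}\left[ \nabla_\theta \log P(\tau;\theta)\, R(\tau) \right] \approx \frac{1}{m} \sum_{i=1}^{m} \nabla_\theta \log P(\tau^{(i)};\theta)\, R(\tau^{(i)}) $$

Temporal decomposition (the dynamics do not depend on $\theta$, so only the policy terms survive):

$$ \nabla_\theta \log P(\tau;\theta) = \sum_{t=0}^{H} \nabla_\theta \log \pi_\theta(u_t \mid s_t) $$

With baseline subtraction and reward-to-go:

$$ \hat{g} = \frac{1}{m} \sum_{i=1}^{m} \sum_{t=0}^{H} \nabla_\theta \log \pi_\theta\big(u_t^{(i)} \mid s_t^{(i)}\big) \left( \sum_{k=t}^{H} R\big(s_k^{(i)}, u_k^{(i)}\big) - b\big(s_t^{(i)}\big) \right) $$

The term in the last parentheses is an advantage estimate, which is where value function estimation comes in.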

Policy Gradients:
Temporal Decomposition:
Baseline subtraction:
Value function estimation:
Advantage Estimation(A2C/A3C/GAE):

Lecture 4: TRPO, PPO

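Click here for the Lecture 4 slides. Lecture 4 in one sentence: in Lecture 3 we saw that Policy Gradients tell us which paths to make more likely. But how large should the step size be when updating toward them? TRPO answers this question. However, TRPO relies on second-order information, which makes it inconvenient to optimize. PPO improves on TRPO by reducing it to a first-order optimization problem.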
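To show what "first-order" buys us, here is a minimal sketch of the clipped surrogate objective from the PPO paper, which I am assuming is the variant meant here. The function name `ppo_clip_objective` is hypothetical, and `eps=0.2` is just a commonly used default.

```python
def ppo_clip_objective(ratios, advantages, eps=0.2):
    """Mean clipped surrogate: min(r * A, clip(r, 1 - eps, 1 + eps) * A).

    ratios[i] is pi_new(u|s) / pi_old(u|s) for sample i,
    advantages[i] is the advantage estimate for the same sample.
    """
    total = 0.0
    for r, adv in zip(ratios, advantages):
        clipped = min(max(r, 1.0 - eps), 1.0 + eps)
        # Taking the min makes the objective a pessimistic bound.
        total += min(r * adv, clipped * adv)
    return total / len(ratios)


# A large ratio on a good action gets no credit beyond 1 + eps ...
print(ppo_clip_objective([1.5], [1.0]))
# ... but a large ratio on a bad action is still fully penalized.
print(ppo_clip_objective([1.5], [-1.0]))
```

Clipping removes the incentive to move the policy far from the old one in a single update, which plays the role of TRPO's trust region while only needing ordinary gradient ascent on this scalar.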

Surrogate loss:
Step-sizing and Trust Region Policy Optimization (TRPO):
Proximal Policy Optimization (PPO):

Lecture 5: DDPG, SAC

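Click here for the Lecture 5 slides. Lecture 5 in one sentence: in some settings, on-policy methods have high sample complexity. DDPG addresses this by learning off-policy. SAC adds an entropy term to the objective, which improves convergence and helps prevent overfitting.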

Deep Deterministic Policy Gradient (DDPG):
Soft Actor Critic (SAC):

Lecture 6: Model-based RL

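Click here for the Lecture 6 slides. Lecture 6 in one sentence: the first five lectures were all about model-free RL. Model-free RL means the agent improves its policy directly from the experience it gathers by acting, without learning a model of the environment. Model-based RL first learns a model of the environment and then uses it to improve the policy.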

Model-based RL:
Robust Model-based RL: Model-Ensemble TRPO (ME-TRPO):
Adaptive Model-based RL: Model-based Meta-Policy Optimization (MB-MPO):

CS 285: Deep Reinforcement Learning

This part draws on CS 285 to cover some topics that the previous part did not mention.

UCL Course on RL

Projects

Where to Go Next?

September 3, 2024

Intellectual Compounding

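I have recently been rereading Naval's book. At the end of the web version, Naval recommends a series of readings, among them a Twitter thread that he marks as Must Read. I had never noticed it before, and reading it through left an impression on me. This post collects the original text of that thread. The original is here. (Twitter thread on “intellectual compounding” by @Zaoyang)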

1/ There's a concept known as financial compounding, but most people don't know about intellectual compounding. Buffett and Munger employed this to great effect and to accumulate mental models such that they can make large decisions quickly. Intuition is simply reading a lot.

2/ This allows people to convert typically slow thinking and bad fast thinking (bad intuition) into good intuition. In academic literature this is known as Type 1 and Type 2 thinking. Most people don't accumulate enough of knowledge the tree.

3/ One of the common patterns for a self made billionaires is their ability to self study, self reason and accumulate a set of their own mental model. Intellectual capital compounds at a hidden rate and most people use tangible badges and net worth as measures.

4/ Intellectual capital is filling out the decision tree and is a forward looking view and net worth is backward looking. Just like how the wrong way to value tech companies with network effects is revenue but instead by their retention rate and network effects.

5/ Most people value themselves on their individual badges, which is what society labels for you. But most of what is predictive in life, is how you make decisions. Being effective and investing your time in the right area at the right time is a skill, not purely luck.

6/ Allowing readings to fill your mind takes advantage of your diffuse/focused mode of thinking and type 1 and type 2 decision making. Meaning you can focus and use your creative brain (diffuse mode) to wander and make associations. This is the key toward STEM education.

7/ More and more, just like the last 20 years was focused on physical athletes, the next 20 years would be focused on mental atheletes. These ways of thinking and compounding, Buffett, Munger, and polymaths have already used to a large degree and have been confirmed by academia.

8/ As the world get faster and faster with AR/VR, AI, crypto. The ability to invest in your own intellectual capital is a crucial prerequisite to maintaining and succeeding in this world and also for your children.

9/ While politicians say it's education that's important, it's only partially true, it's the ability to assimilate knowledge trees and compound knowledge that leads to satisfaction, mental stimulation, and long term wealth.

10/ What politicians are doing now is simply aiming backwards, but how can you scale this to everyone? It turns out that charter school have been doing a grand experiment.

11/ Taking children from disadvantaged backgrounds and making them fit for college. This has worked and three of the results are the following.

12/ "Growth mindset" not "Fixed mindset" "Motivation mindset" not "Fixed mindset" "Student directed AND teacher directed education and projects"

13/ In turns out statistically those three items are the most predictive. What are they? I'm glad you asked.

14/ It turns out that if you just tell students that their mind is like a "muscle" and spend just 10 minutes explaining that concept they will improve their grades dramatically.

15/ It turns out this concept is for a person's mind and for each skill set. People can have verbal "fixed" mindsets, humor "fixed" mindsets, math "fixed" mindsets. Almost everything. So, you have to consciously unlearn this and apply it consciously even if you know it.

16/ Why is this? It turns out people are criticized by society and labeled. So even if you have a "growth" mindset for physical items, you don't have it for mental items. This applies not only for students but also for adults. You always hear "That's not me." It's a label.

17/ There's another concept called "motivational mindset" which teaches the person "what good looks like" which simply teaches the kid to follow "go do the extra problems" "go to office hours" "do more problems" Follow the process of the "motivated student" and it will work.

18/ It turns out these two concepts turn someone that's socio economically disadvantaged similar to the education status of someone who grew up upper middle class. This is not a panacea as a lot of people have so much stressors in their lives that they can't study.

19/ The last concept is student self directed project plus teacher lectures is the best. This surprised me, but conceptually, it gives the student agency and motivation. Most students have been so battered down by the system that they can't do this, but that's another story.

20/ These concepts have to be mind beliefs. Just like in Dune how they cite "fear is the mindkiller" These concepts have to be constantly applied to adults and children as they opposite tends to be pervasive and insidious.

21/ These are mindsets and as for strategies. People need to take concepts from Learning how to Learn by @barbaraoakley and @sejnowski and Art of Learning from Josh Waitzkin. They are are a manual for your brain.

22/ You thought just because you own a brain you knew how to operate it right? Why do you think the drop out rates for STEM is so high. Most people attribute it to pipeline or professor, but perhaps it's because people don't know how to learn difficult subjects.

23/ @LHTL_MOOC takes you from beginner to intermediate, and art of learning takes you from expert to being world class. Then you have Cal Newport's material, and those three resources are simply the best that I know of to hack your own brain

24/ Do you know anymore? Would love to know more resources and techniques.

25/ In conclusion, as the world become more and more technical and complex, most people don't have the mindset nor tactical skills. In short, people have to re-learn the manual to their own brain. Just because you have a computer, it doesn't mean you know everything about it.

August 5, 2024

Table of Contents