Learning to Optimize via Posterior Sampling
Day 38 of 紫式晦澀's one-article-per-day series
Preface
Today is day 36 of 2022, in week 5 of the year, and the first Saturday of February. Today we think through learning to optimize via posterior sampling.
Today's material comes mainly from the paper "Learning to Optimize via Posterior Sampling" (Russo and Van Roy, 2014):
Abstract: the major advantages of posterior sampling:
- Learning to optimize actions: balancing exploration and exploitation.
- Thompson Sampling (probability matching): offers significant advantages over the popular UCB approach and applies to problems with finite or infinite action spaces (a minimal sketch follows the abstract below).
- Theoretical contribution 1: a connection that converts regret bounds developed for UCB algorithms into Bayesian regret bounds for posterior sampling.
- Theoretical contribution 2: a Bayesian regret bound for posterior sampling that applies broadly and specializes to many model classes.
- Eluder dimension: a new notion that measures the degree of dependence among action rewards.
- The general bound matches the best available for linear models and is stronger than the best available for generalized linear models.
- The analysis provides insight into the performance advantages of posterior sampling, and simulations show performance surpassing recently proposed UCB algorithms.
This paper considers the use of a simple posterior sampling algorithm to balance between exploration and exploitation when learning to optimize actions such as in multiarmed bandit problems. The algorithm, also known as Thompson Sampling and as probability matching, offers significant advantages over the popular upper confidence bound (UCB) approach, and can be applied to problems with finite or infinite action spaces and complicated relationships among action rewards. We make two theoretical contributions. The first establishes a connection between posterior sampling and UCB algorithms. This result lets us convert regret bounds developed for UCB algorithms into Bayesian regret bounds for posterior sampling. Our second theoretical contribution is a Bayesian regret bound for posterior sampling that applies broadly and can be specialized to many model classes. This bound depends on a new notion we refer to as the eluder dimension, which measures the degree of dependence among action rewards. Compared to UCB algorithm Bayesian regret bounds for specific model classes, our general bound matches the best available for linear models and is stronger than the best available for generalized linear models. Further, our analysis provides insight into performance advantages of posterior sampling, which are highlighted through simulation results that demonstrate performance surpassing recently proposed UCB algorithms.
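To make the probability-matching idea concrete, here is a minimal sketch of Thompson sampling on a Bernoulli bandit. The Beta(1, 1) priors, the arm means, and the helper `thompson_sampling` are illustrative assumptions, not details taken from the paper:

```python
import numpy as np

def thompson_sampling(true_means, horizon, seed=0):
    """Thompson sampling for a Bernoulli bandit with Beta(1, 1) priors."""
    rng = np.random.default_rng(seed)
    k = len(true_means)
    alpha = np.ones(k)  # Beta posterior parameter: 1 + observed successes
    beta = np.ones(k)   # Beta posterior parameter: 1 + observed failures
    total_reward = 0.0
    for _ in range(horizon):
        # Probability matching: sample one mean from each arm's posterior,
        # then act greedily with respect to the sampled means.
        theta = rng.beta(alpha, beta)
        arm = int(np.argmax(theta))
        reward = float(rng.random() < true_means[arm])
        alpha[arm] += reward
        beta[arm] += 1.0 - reward
        total_reward += reward
    return total_reward

# Example: three arms; pulls should concentrate on the 0.7 arm over time.
print(thompson_sampling([0.3, 0.5, 0.7], horizon=2000))
```

Each arm is selected with exactly the posterior probability that it is the optimal arm, which is why the algorithm is also called probability matching.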
Paper structure:
- 1. Introduction.
- 2. Related literature.
- 2.1. Measures of performance.
- 2.2. Related results.
- 3. Problem formulation.
- 3.1. On regret and Bayesian regret.
- 3.2. On changing action sets.
- 4. Algorithms.
- 4.1. UCB algorithms.
- 4.2. Posterior sampling.
- 4.3. Potential advantage of posterior sampling.
- 5. Confidence bounds and regret decompositions.
- 5.1. UCB regret decomposition.
- 5.2. Posterior sampling regret decomposition.
- 6. From UCB to posterior sampling regret bounds.
- 6.1. Finitely many actions.
- 6.2. Linear and generalized linear models.
- 6.2.1. Linear models.
- 6.2.2. Generalized linear models.
- 6.3. Gaussian processes.
- 7. Bounds for general function classes.
- 7.1. Confidence bounds.
- 7.2. Bayesian regret bounds.
- 7.3. Relation to the VC dimension.
- 8. Simulation results.
- 9. Conclusion.
- Appendix A. Details regarding Lemma 1.
- Appendix B. Proof of confidence bound.
- B.1. Preliminaries: Martingale exponential inequalities.
- B.2. Proof of Lemma 3.
- B.3. Least squares bound—Proof of Proposition 6.
- B.4. Discretization error.
- Appendix C. Bounds on eluder dimension for common function classes.
- C.1. Finite action spaces.
- C.2. Linear case.
- C.3. Generalized linear models.
Trick 1. The martingale exponential inequality
Proof chain: Lemma 6 → Lemma 7 → Lemma 3 → Section 7.1 (Confidence bounds) → Section 7 (Bounds for general function classes).
Construction: a filtered sequence of random variables, centered random variables, the conditional cumulant generating function, and the exponential martingale:
- A sequence of random variables adapted to a filtration.
- Assume the cumulant generating function of each random variable in the sequence is finite.
- Define the conditional mean.
- Define the conditional cumulant generating function of the centered random variable.
- Define the exponential martingale (see the sketch after this list).
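A minimal sketch of this construction in standard notation (the symbols $Z_i$, $\mathcal{H}_i$, $\mu_i$, $\psi_i(\lambda)$, $M_n(\lambda)$ are conventional choices and may differ from the paper's exact notation):

```latex
% Setup: (Z_n) adapted to a filtration (H_n); assume E[exp(lambda Z_i)]
% is finite, so the conditional cumulant generating functions exist.
\begin{align*}
  \mu_i &:= \mathbb{E}\left[ Z_i \mid \mathcal{H}_{i-1} \right]
      && \text{conditional mean} \\
  \psi_i(\lambda) &:= \log \mathbb{E}\left[ e^{\lambda (Z_i - \mu_i)} \mid \mathcal{H}_{i-1} \right]
      && \text{conditional cumulant generating function} \\
  M_n(\lambda) &:= \exp\left\{ \sum_{i=1}^{n} \left( \lambda (Z_i - \mu_i) - \psi_i(\lambda) \right) \right\}
      && \text{exponential martingale}
\end{align*}
```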
Lemma 6: the exponential martingale is a martingale, and its expectation is 1:
- By the definition of the conditional cumulant generating function, the exponential martingale has expectation 1 at time 1.
- Using the filtration (the tower property), the conditional expectation of the exponential martingale at time $n$ given time $n-1$ is the exponential martingale at time $n-1$.
- Hence it satisfies the definition of a martingale.
Is this, in effect, manipulating a kind of potential function?
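A one-line sketch of the computation behind Lemma 6, in the notation above:

```latex
% M_{n-1}(lambda) and psi_n(lambda) are H_{n-1}-measurable, so both can be
% pulled out of the conditional expectation; the definition of psi_n then
% cancels the remaining factor.
\begin{align*}
  \mathbb{E}\left[ M_n(\lambda) \mid \mathcal{H}_{n-1} \right]
  &= M_{n-1}(\lambda)\, e^{-\psi_n(\lambda)}\,
     \mathbb{E}\left[ e^{\lambda (Z_n - \mu_n)} \mid \mathcal{H}_{n-1} \right]
   = M_{n-1}(\lambda).
\end{align*}
```

Taking total expectations and iterating down to $\mathbb{E}[M_1(\lambda)] = 1$ shows $\mathbb{E}[M_n(\lambda)] = 1$ for all $n$.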
Lemma 7: a high-probability upper bound on the cumulative sum:
- For any "temperature" $\lambda$, this is a martingale.
- Hence, by optional stopping, the stopped exponential martingale still has expectation 1 for any stopping time.
- Define a special stopping time: the first time the exponential martingale rises above a threshold $x$.
- By Markov's inequality, the probability that the exponential martingale exceeds the threshold is at most $1/x$.
- Equivalently, the probability that the threshold is crossed at least once among times $1, 2, \dots, n$ is at most $1/x$.
- By the monotone convergence theorem, the probability of crossing the threshold at least once at any time is at most $1/x$.
- By the same argument, the probability of exceeding $e^x$ is at most $e^{-x}$.
- Back in context: with high probability, the accumulated "sampling error × temperature" stays close to the accumulated conditional cumulant generating function (at that temperature).
- Hence the probability that this cumulative sampling-error deviation process ever crosses the threshold can be controlled (see the sketch after this list).
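Putting the steps together, a sketch of the resulting uniform-in-time bound in the notation above (the paper's exact statement may arrange the quantifiers and constants slightly differently):

```latex
% Optional stopping + Markov give P(sup_n M_n(lambda) >= e^x) <= e^{-x};
% taking logarithms inside the event yields a bound that holds
% simultaneously over all times n.
\[
  \mathbb{P}\!\left( \exists\, n \ge 1 :\;
      \sum_{i=1}^{n} \lambda \,(Z_i - \mu_i)
      \;\ge\; x + \sum_{i=1}^{n} \psi_i(\lambda) \right)
  \;\le\; e^{-x}.
\]
```

So with probability at least $1 - e^{-x}$, the accumulated sampling error (scaled by the temperature $\lambda$) never exceeds the accumulated conditional cumulant generating function by more than $x$, at any time.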
Postscript
2022.02.05. 紫蕊, in West Lafayette, Indiana, USA.
| Version | Date | Summary |
|---|---|---|
| 0.1 | 2022-02-05 | First reading: summary and paper structure |