MUR028 Foundations of Bregman Martingale Concentration Inequalities (Ver 0.2)

Table of Contents

Foundations of Bregman Martingale Concentration Inequalities (Ver 0.2)

紫式晦澀 Daily Article Series, Day 28

Preface

  1. Today is day 26 of 2022, week 4 of the year, the fourth Wednesday of January. Today I attempt to write about non-asymptotic martingale concentration inequalities, to deepen my understanding of concentration inequalities.

  2. Today's material comes mainly from Bregman Deviations of Generic Exponential Families. The papers of its author Odalric-Ambrym Maillard are my main material for learning non-asymptotic martingale inequalities, worth a sustained investment of time to build up my strength in probability.

Paper code: BEF

000 Abstract

Laplace method: We revisit the method of mixture technique, also known as the Laplace method, to study the concentration phenomenon in generic exponential families.

Controlling the Bregman divergence between the finite-sample estimate and the exponential-family parameter: Combining the properties of Bregman divergence associated with log-partition function of the family with the method of mixtures for super-martingales, we establish a generic bound controlling the Bregman divergence between the parameter of the family and a finite sample estimate of the parameter.

Bregman information gain yields a time-uniform result: Our bound is time-uniform and makes appear a quantity extending the classical information gain to exponential families, which we call the Bregman information gain.

Explicit confidence sets and Bregman information gain for classical families: For the practitioner, we instantiate this novel bound to several classical families, e.g., Gaussian, Bernoulli, Exponential and Chi-square yielding explicit forms of the confidence sets and the Bregman information gain.

Numerically competitive with the state of the art: We further numerically compare the resulting confidence bounds to state-of-the-art alternatives for time-uniform concentration and show that this novel method yields competitive results.

Application in a linear contextual multi-armed bandit problem: Finally, we highlight how our results can be applied in a linear contextual multi-armed bandit problem.

100 Introduction

1.0 Introduction (P.1-P.2)

BEF101 Concentration inequalities quantify estimation error:

  • Concentration inequalities: the toolbox of theoretical statistics, with core applications in machine learning
  • Reason: in machine learning, one often estimates a quantity solely from samples drawn from an unknown distribution, and wants to know the magnitude of the estimation error
  • Example: the mean of a measurable real-valued random variable, estimated by the empirical mean of i.i.d. samples.
  • Context:
  • a. Stéphane Boucheron, Gábor Lugosi, and Pascal Massart. Concentration inequalities: A nonasymptotic theory of independence. Oxford University Press, 2013.
  • b. Maxim Raginsky and Igal Sason. Concentration of measure inequalities in information theory, communications and coding. arXiv preprint arXiv:1212.4663, 2012.
  • c. Amir Dembo and Ofer Zeitouni. Large deviations techniques and applications. Applications of Mathematics, 38, 1998.

Concentration inequalities are a powerful set of methods in statistical theory with key applications in machine learning. This is due to the fact that in many machine learning applications, a learner often estimates some quantity solely based on samples from an unknown distribution, and would like to know the magnitude of the estimation error. The typical example is that of the mean $\mu$ of some measurable real-valued random variable $X$, estimated by its empirical mean built from a sample of $n$ independent and identically distributed (i.i.d.) observations. We refer the interested reader to the monographs of Boucheron et al. (2013); Raginsky and Sason (2012); Zeitouni and Dembo (1998) for standard results in this and related topics.
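To make the typical example concrete, here is a minimal sketch (my own illustration, not from the paper) comparing the empirical-mean error with the fixed-n two-sided Hoeffding bound $\sqrt{\log(2/\delta)/(2n)}$ for i.i.d. Bernoulli samples; the choices of n, delta, and p are arbitrary:

```python
import numpy as np

# Minimal illustration (not from the paper): for n i.i.d. Bernoulli(p) samples,
# compare |empirical mean - p| with the two-sided Hoeffding bound
# sqrt(log(2/delta) / (2n)), which holds with probability at least 1 - delta.
rng = np.random.default_rng(0)
n, delta, p = 1000, 0.05, 0.3
samples = rng.binomial(1, p, size=n)
error = abs(samples.mean() - p)
bound = np.sqrt(np.log(2 / delta) / (2 * n))
print(f"|empirical mean - p| = {error:.4f}, Hoeffding bound = {bound:.4f}")
```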

BEF102 Exponential families are a flexible parametric model for estimating vector parameters:

  • Need: in many situations one also wants to estimate a vector parameter, e.g., in linear regression.
  • Need: estimating the parameter of a distribution from a parametric family.
  • Exponential family: a very flexible parametric family of distributions, $p_{\theta}(x) \propto h(x) \exp (\langle\theta, F(x)\rangle)$
  • Instances: Gaussian (possibly multivariate, with or without known covariance), Poisson, Exponential, Gamma, Chi-square, Bernoulli and Multinomial (over a fixed alphabet) distributions.

In many situations, one may want to further estimate some vector parameter, as in e.g. linear regression (Abbasi-Yadkori et al., 2011). A closely related problem is to estimate the parameter of a distribution coming from a parametric family (Chowdhury et al., 2021). Exponential families are a flexible way to formalize such distributions, where a distribution on some set $\mathcal{X}$ is described by its density $p_{\theta}$ such that $p_{\theta}(x) \propto h(x) \exp (\langle\theta, F(x)\rangle)$, for some given feature function $F: \mathcal{X} \mapsto \mathbb{R}^{d}$ and base function $h$ (see details in Section 2). Most classical distributions fall into this category, including Gaussian (possibly multivariate, with or without known covariance), Poisson, Exponential, Gamma, Chi-square, Bernoulli and Multinomial (over a fixed alphabet) distributions, to name a few; see e.g. Amari (2016).
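As a concrete instance of this density template (a standard computation, not spelled out in the excerpt), the Bernoulli distribution with mean $p$ fits with $F(x) = x$, $h(x) = 1$, and natural parameter $\theta = \log\frac{p}{1-p}$:

$$p_{\theta}(x) = \exp\bigl(\theta x - \mathcal{L}(\theta)\bigr), \qquad \mathcal{L}(\theta) = \log\bigl(1 + e^{\theta}\bigr), \qquad x \in \{0, 1\},$$

and one recovers the mean as $\nabla \mathcal{L}(\theta) = e^{\theta}/(1 + e^{\theta}) = p$.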

BEF103 Sequential, active sampling strategies call for concentration at random stopping times:

  • Sequential, active sampling strategies: many modern machine learning problems involve sequential, active data-sampling strategies.
  • Instances: multi-armed bandits, reinforcement learning, active learning, federated learning.
  • Need: the decision to sample results from the interaction between the learning algorithm and the environment and depends on past observations, so one needs concentration results in which the sample size is a random stopping time.
  • Approach: time-uniform concentration inequalities yielding sequences of confidence sets, no longer results that hold only for a fixed sample size.

A number of problems in current-day machine learning involve sequential, active data-sampling strategies (Cesa-Bianchi and Lugosi, 2006). This includes multi-armed bandits, reinforcement learning, active learning and federated learning, to mention a few application domains. Since the decision to sample a novel observation results from the interaction between the learning algorithm and the environment, and depends on past observations, one needs to design concentration inequalities working with a random number of observations typically at a random stopping time (Durrett, 2019). A natural way to handle this difficulty is to derive time-uniform concentration inequalities, producing sequences of confidence sets valid uniformly over all number of observations with high probability, as initiated by Robbins (1970), as opposed to being valid for a single number of observation.

BEF104 Time peeling: super-martingales on a geometric time grid:

  • A technique from bandit theory combines super-martingales with union bounds over a geometric time grid, also known as time peeling.
  • For recent developments, see Howard et al. (2020).

A popular technique in bandit theory is to combine super-martingale techniques with union bound arguments over a geometric time grid, a technique known as time peeling - see Bubeck (2010), Cappé et al. (2013) for early uses in bandits, as well as Garivier (2013), or more recently Maillard (2019). See also Howard et al. (2020) for a recent, complementary survey of the history of this field.
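Schematically (my own one-line summary of the standard peeling argument, not a statement from the paper; $E_{n}$ denotes the deviation event at time $n$ and $\eta > 1$ is the grid parameter):

$$\mathbb{P}\bigl[\exists n \geq 1 : E_{n}\bigr] \;\leq\; \sum_{k=0}^{\infty} \mathbb{P}\bigl[\exists n \in [\eta^{k}, \eta^{k+1}) : E_{n}\bigr],$$

where each epoch term is then controlled by a super-martingale maximal inequality tuned to the scale of that epoch.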

BEF105 Development of the Laplace method: 1949, 2008, 2011, 2017, 2018, 2019, 2020, 2021:

  • Alternative technique: the method of mixtures.
  • Linear bandits: the improved linear bandit paper of Abbasi-Yadkori et al. (2011), a highly influential article.
  • Kaufmann and Koolen (2018): the method of mixtures for exponential families, instantiated for Gaussian (known variance) and Gamma (known shape).
  • Shafer and Vovk (2019): construct a different 'capital process' for bounded distributions, popularized further by Waudby-Smith and Ramdas (2020).
  • Kuchibhotla and Zheng (2021): combine Bentkus' concentration results with the peeling technique to obtain very sharp deviation results for bounded random variables.

The method of mixtures, initiated by Robbins and Pitman (1949) and popularized in Peña et al. (2008) is a powerful alternative to develop time-uniform confidence sets. It has been applied to sub-Gaussian families in Abbasi-Yadkori et al. (2011), leading to a variety of applications (see Chowdhury and Gopalan (2017); Durand et al. (2017); Kirschner and Krause (2018) among others). In Kaufmann and Koolen (2018), a generalization to handle exponential families is considered, with applications to the cases of Gaussian (with known variance) and Gamma distributions (with known shape). A fairly different ‘capital process’ construction technique has been recently developed for bounded distributions in Shafer and Vovk (2019), and popularized further in Waudby-Smith and Ramdas (2020). Very recently, Kuchibhotla and Zheng (2021) incorporate Bentkus' concentration results (Bentkus, 2004) with the peeling technique to prove deviation results for bounded random variables.

BEF106 The Laplace method's integral can be reduced to a Bayesian parameter update:

  • This work: the Laplace method for parametric exponential families, yielding deviation results in Bregman divergence.
  • Main difficulty of the Laplace method: it requires computing integrals, a practical obstacle for general distributions.
  • Bayesian interpretation: with a conjugate prior, the integral computation reduces to a parameter update.
  • Computationally tractable confidence sets: both sharp and computationally tractable, by carrying out the Laplace method for generic exponential families.

In this article, we revisit the method of mixtures for the case of parametric exponential families, expressing deviations in the natural (Bregman) divergence of the problem. Our motivation is the following: On the one hand, an a priori difficulty of applying the method of mixture is that it involves computation of integrals, which can be tedious for general distributions. On the other hand, the setting of parametric exponential families is known to be convenient when dealing with integration in a Bayesian setup, thanks to the notion of a conjugate prior that enables us to reduce computation of a tedious integral to a simple parameter update. While the case of Gaussian distributions has been extensively studied for several decades, in this work, we investigate to which extent one can obtain both sharp and computationally tractable confidence sets using the method of mixture beyond the Gaussian case, for generic exponential families.
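The unit-variance Gaussian case makes this point concrete (a standard computation, shown here as a sketch): mixing the exponential martingale $\exp(\lambda S_{n} - n\lambda^{2}/2)$, where $S_{n}$ is the sum of the observations, against a conjugate $\mathcal{N}(0, c^{2})$ prior on $\lambda$ collapses the integral into a closed form, i.e., into a simple update of the prior parameters:

$$\int_{\mathbb{R}} \exp\Bigl(\lambda S_{n} - \frac{n \lambda^{2}}{2}\Bigr)\, \frac{e^{-\lambda^{2} / (2 c^{2})}}{\sqrt{2 \pi c^{2}}}\, \mathrm{d}\lambda \;=\; \frac{1}{\sqrt{1 + n c^{2}}} \exp\!\left(\frac{c^{2} S_{n}^{2}}{2\,(1 + n c^{2})}\right).$$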

A brilliant, information-rich Introduction. For readers who know the relevant theory and techniques, it is an excellent document for learning the state of the art in the Laplace method and time-uniform concentration inequalities.

200 Methods

2.0 Exponential Families and Bregman Divergence (P.3)

BEF201 Exponential families:

  • Parametrized distributions: an open set serves as the parameter range, and the density takes a specific functional form.
  • Components of the density: the feature function, the base function, and the log-partition function.
  • Regularity conditions for optimization: (1) the log-partition function is finite; (2) the gradient of the log-partition function is one-to-one; (3) the Hessian of the log-partition function is invertible.
  • Expectation functional: within this parametric family, the probability measure and the expectation operator are taken.
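In symbols (my transcription of the standard definition, consistent with the density form quoted in the Introduction; integration is written against a generic base measure $\mathrm{d}x$):

$$p_{\theta}(x) = h(x)\, \exp\bigl(\langle \theta, F(x) \rangle - \mathcal{L}(\theta)\bigr), \qquad \mathcal{L}(\theta) = \log \int_{\mathcal{X}} h(x)\, e^{\langle \theta, F(x) \rangle}\, \mathrm{d}x, \qquad \theta \in \Theta,$$

with $\Theta$ an open set on which $\mathcal{L}$ is finite, $\nabla \mathcal{L}$ one-to-one, and $\nabla^{2} \mathcal{L}$ invertible; the expectation functional then satisfies $\mathbb{E}_{\theta}[F(X)] = \nabla \mathcal{L}(\theta)$.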

BEF202 The Bregman divergence of an exponential family:

  • KL divergence: the distance between two densities in an exponential family can be expressed as a Taylor second-order remainder.
  • Dual parameter: in this framework the expectation can be understood as the dual parameter, since in an exponential family the expectation equals the gradient of the log-partition function.
  • In this sense, the KL divergence between two densities of an exponential family equals the Bregman divergence between the two dual parameters.
  • In this new context, the log-partition function becomes the potential function, and the Bregman divergence is genuinely the Taylor second-order remainder of the potential function.
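Written out (standard identities for exponential families, stated here for reference): the Bregman divergence of the potential $\mathcal{L}$ is its Taylor second-order remainder, and it coincides with the KL divergence with the arguments swapped:

$$\mathcal{B}_{\mathcal{L}}(\theta', \theta) = \mathcal{L}(\theta') - \mathcal{L}(\theta) - \langle \theta' - \theta, \nabla \mathcal{L}(\theta) \rangle, \qquad \mathrm{KL}\bigl(p_{\theta} \,\|\, p_{\theta'}\bigr) = \mathcal{B}_{\mathcal{L}}(\theta', \theta).$$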

BEF203 Bregman divergence: good control of tails, convenient algebra:

  • Canonical Bregman divergence: the canonical Bregman divergence of an exponential family has two fundamental properties.
  • Property 1: its link to the log-moment generating function of the random variable makes the Bregman divergence well suited to controlling the tail behaviour of random variables.
  • Property 2: duality enables convenient algebraic manipulations.

Tail and Duality Properties The canonical Bregman divergence of an exponential family enjoys two fundamental properties that we recall below. The first one links it to the log-moment generating function (MGF) of the random variable $F(X)$, which makes Bregman divergences especially well suited to control the tail-behaviour of random variables appearing for instance in concentration inequalities. The second one highlights duality properties that enables convenient algebraic manipulations.
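The first property, in formula form (a standard exponential-family identity): the log-MGF of $F(X)$ under $p_{\theta}$ is exactly a shift of the log-partition function, hence a Bregman divergence plus a linear term:

$$\log \mathbb{E}_{\theta}\bigl[\exp \langle \lambda, F(X) \rangle\bigr] = \mathcal{L}(\theta + \lambda) - \mathcal{L}(\theta) = \mathcal{B}_{\mathcal{L}}(\theta + \lambda, \theta) + \langle \lambda, \nabla \mathcal{L}(\theta) \rangle.$$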

BEF204 Bregman divergence: shift, duality, interpolated gradient:

  • Shift: a shift in the Bregman divergence can be described by the log-moment generating function.
  • Duality: if the gradient of the potential function is one-to-one, Bregman duality holds: the Bregman divergence between two parameters equals the dual Bregman divergence between the potential gradients, with arguments swapped.
  • Interpolated gradient: shifts on the gradient side and interpolation between dual parameters can be expressed in terms of one another, via an interpolated gradient.
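The duality property, in formula form (the standard statement, assuming $\nabla \mathcal{L}$ is one-to-one): with the Legendre-Fenchel conjugate $\mathcal{L}^{\star}(y) = \sup_{u} \langle u, y \rangle - \mathcal{L}(u)$,

$$\mathcal{B}_{\mathcal{L}}(\theta', \theta) = \mathcal{B}_{\mathcal{L}^{\star}}\bigl(\nabla \mathcal{L}(\theta), \nabla \mathcal{L}(\theta')\bigr),$$

so divergences between primal parameters can be traded for divergences between dual (mean) parameters with the arguments swapped.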

3.0 Time-Uniform Bregman Concentration (P.4)

BEF276 Deviation control valid at all times:

    1. Control the deviation between the parameter and its estimate built from n samples.
    2. The deviation is measured with the canonical Bregman divergence.
    3. We want to control all sample sizes n simultaneously, i.e., to bound $\mathbb{P}_{\theta}\left[\exists n \in \mathbb{N}: n \mathcal{B}_{\mathcal{L}}(\theta, \theta_{n}) \geq \varepsilon\right]$ from above.
    4. Such control is very useful when samples are collected sequentially and actively, especially when the eventual sample size is unknown beforehand. (A small simulation after the quoted paragraph below illustrates why.)
    5. Classical examples include the multi-armed bandit model and model-based reinforcement learning.

In this section, we are interested in controlling the deviation between a parameter $\theta \in \Theta$ and its estimate $\theta_{n}$ built from $n$ observations from distribution $p_{\theta}$. We naturally measure the deviation error in terms of the canonical Bregman divergence of the family. Further, we would like to control the deviation not only for a single sample number $n$, but simultaneously for all $n$. Namely, we would like to control (from above) quantities of the form $\mathbb{P}_{\theta}\left[\exists n \in \mathbb{N}: n \mathcal{B}_{\mathcal{L}}\left(\theta, \theta_{n}\right) \geq \varepsilon\right]$. Such a control is very useful in the context when observations are gathered sequentially (either actively or otherwise), and especially when the number of observations $n$ is unknown beforehand. Classical examples of such settings include the multi-armed bandit problem (Auer et al., 2002), or model-based reinforcement learning (Jaksch et al., 2010).
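A quick simulation (my own illustration, not from the paper) of why fixed-n confidence intervals do not survive continuous monitoring: checking the fixed-n 95% Gaussian interval at every $n \leq N$ fails on far more than 5% of runs.

```python
import numpy as np

# Illustration (not from the paper): a fixed-n 95% Gaussian confidence
# interval, monitored at every n up to N, is violated on far more than
# 5% of runs -- hence the need for time-uniform confidence sequences.
rng = np.random.default_rng(1)
runs, N, z = 2000, 10_000, 1.96  # z: two-sided fixed-n critical value
ns = np.arange(1, N + 1)
violations = 0
for _ in range(runs):
    means = np.cumsum(rng.standard_normal(N)) / ns  # true mean is 0
    violations += np.any(np.abs(means) > z / np.sqrt(ns))
print(f"time-uniform failure rate: {violations / runs:.3f} (nominal per-n: 0.05)")
```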

3.1 A Generic Deviation Inequality (P.4-P.6)

BEF301 Bregman information gain:

  • Uses the Laplace method: the original version handles the Gaussian case.
  • Here, a generalized strategy for exponential families is introduced, together with the information gain about the parameter $\theta$ after observing n samples.
  • It is phrased in the language of Bregman divergence, hence the name Bregman information gain.
  • The information gain is a kind of log-likelihood ratio, with one term at the true parameter $\theta_0$ and one at the parameter estimate $\theta_{n,c}$.
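To connect the information gain with the Gaussian mixture computation sketched earlier (my reading of the unit-variance Gaussian case with prior scale $c$, not the paper's general definition; the notation $\gamma_{n,c}$ is mine): the normalization factor $\sqrt{1 + n c^{2}}$ of the mixture martingale contributes exactly a classical information-gain term to the bound,

$$\gamma_{n,c} = \tfrac{1}{2} \log\bigl(1 + n c^{2}\bigr),$$

which is the scalar analogue of the familiar $\frac{1}{2} \log \det(I + c^{2} V_{n})$ information gain from linear bandits.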

3.2 Specification to Classical Families (P.6-P.7)

3.3 Proof Sketch: Theorem 1 (P.7-P.8)

BEF501 The Laplace method, a three-step proof:

  • Step 1: construct an exponential martingale; bring in an exponential-family prior to select the temperature; obtain the working martingale.
  • Step 2: choose a specific exponential-family prior; form the dual Bregman divergence; choose the splitting point; obtain the estimator $\theta_{n,c}$.
  • Step 3: with these choices, write out the working martingale; apply Markov's inequality; arrive at the desired bound. (A runnable sketch of the three steps in the Gaussian case follows below.)

Judging from the figures, the Bregman bound effectively controls the deviations across 1000 repeated runs.
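Here is a minimal runnable sketch of the three steps in the unit-variance Gaussian case (my own reconstruction using the closed-form mixture shown earlier; the paper's construction covers generic exponential families, and I make no claim this matches its exact constants):

```python
import numpy as np

# Three-step Laplace method, unit-variance Gaussian case (sketch):
# 1) exponential martingale      M_n(lam) = exp(lam * S_n - n * lam**2 / 2)
# 2) mix over a N(0, c^2) prior  ->  closed form
#        Mix_n = exp(c^2 S_n^2 / (2 (1 + n c^2))) / sqrt(1 + n c^2)
# 3) Ville's inequality:  P[exists n: Mix_n >= 1/delta] <= delta,
#    which rearranges into a time-uniform width for |S_n / n|.
rng = np.random.default_rng(2)
runs, N, delta, c = 2000, 5000, 0.05, 1.0
ns = np.arange(1, N + 1)
width = np.sqrt(2 * (1 + ns * c**2) / (ns**2 * c**2)
                * (np.log(1 / delta) + 0.5 * np.log(1 + ns * c**2)))
violations = 0
for _ in range(runs):
    means = np.cumsum(rng.standard_normal(N)) / ns  # true mean is 0
    violations += np.any(np.abs(means) > width)
print(f"time-uniform failure rate: {violations / runs:.3f} (should be <= {delta})")
```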

300 Results

4.0 Numerical Experiments (P.8-P.11)

5.0 Linear Bandits Revisited (P.11-P.12)

400 Discussion

6.0 Conclusion (P.12-P.13)

900 Formal Results

A.0 Proof of the main result about Bregman deviations (P.15)

B.0 Properties of Bregman Divergence (P.20)

C.0 Specification to illustrative exponential families (P.21)

D.0 Empirical Comparison with existing time-uniform confidence sequences (P.25)

E.0 Tuning of the regularization parameter c (P.33)

Postscript

  1. So far I have interpreted the abstract and produced ten knowledge cards. I will keep accumulating, and by thinking fast and slow, digest this paper thoroughly. Onward and upward, let us encourage one another!

2022.01.26. 紫蕊, West Lafayette, Indiana, USA.

Version Date Summary
0.1 2022-01-26 First reading; produced ten knowledge cards
0.2 2022-02-07 Second reading; understood how the Bayesian martingale uses a conjugate prior to build the Bregman deviation inequality, and the duality framework inside it. Very cool! The concept of information gain is still hard to grasp directly; it needs to settle and be pondered slowly.

License

CC BY-NC-ND 4.0
