MUR029 Thinking about Horizontal Federated Learning

Table of Contents

Thinking about Horizontal Federated Learning

Day 29 of the 紫式晦澀 one-article-a-day series

Preface

  1. Today is day 27 of 2022, in week 4 of the year, the fourth Thursday of January. Today I want to think carefully through the story and details of horizontal federated learning. With the ICML deadline approaching, it is a good moment to strike while the iron is hot: the communication protocol we designed is structurally similar to horizontal federated learning, so the related stories and literature can be tied together.

  2. Today's material mainly comes from the sections on horizontal federated learning in Federated Machine Learning: Concept and Applications.

Mindset

Federated learning: privacy-preserving decentralized machine learning

  1. Overview of federated learning:
  • Federated learning was proposed by Google [36, 37, 41].
  • Google's main idea: build machine-learning models on datasets distributed across multiple devices while preventing data leakage.
  • Main research threads: statistical challenges [60, 77], security improvements [9, 23], personalization [13, 60].
  • On-device federated learning: distributed mobile-user interaction, high communication cost, unbalanced data distribution, and device reliability are all challenges for optimization.
  • Horizontally partitioned data: data is partitioned by user ID or device ID.
  • Privacy-preserving machine learning: federated learning protects privacy within a decentralized collaborative-learning framework.
  • Collaborative learning among organizations: federated learning = privacy-preserving decentralized collaborative machine learning.
  • [71] gives a preliminary overview of federated learning and federated transfer learning.
  • This article studies the security foundations and the related areas of multiagent theory and privacy-preserving data mining.

We, too, are doing decentralized collaborative exploration.

The concept of federated learning was proposed by Google recently [36, 37, 41]. Google’s main idea is to build machine-learning models based on datasets that are distributed across multiple devices while preventing data leakage. Recent improvements have been focusing on overcoming the statistical challenges [60, 77] and improving security [9, 23] in federated learning. There are also research efforts to make federated learning more personalizable [13, 60]. The above works all focus on on-device federated learning in which distributed mobile-user interactions are involved and communication cost in massive distribution, unbalanced data distribution, and device reliability are some of the major factors for optimization. In addition, data is partitioned by user IDs or device IDs, therefore, horizontally in the data space. This line of work is highly related to privacy-preserving machine learning, as reported in [58] because it also considers data privacy in a decentralized collaborative-learning setting. To extend the concept of federated learning to cover collaborative-learning scenarios among organizations, we extend the original “federated learning” to a general concept for all privacy-preserving decentralized collaborative machine-learning techniques. In [71], we have given a preliminary overview of the federated-learning and federated transfer-learning technique. In this article, we further survey the relevant security foundations and explore the relationship with several other related areas, such as multiagent theory and privacy-preserving data mining. In this section, we provide a more comprehensive definition of federated learning that considers data partitions, security, and applications. We also describe a workflow and system architecture for the federated-learning system.

Horizontal federated learning: different users, similar business

  1. Horizontal federated learning: different users, but the same business:
  • Horizontal federated learning = sample-based federated learning: the datasets share the same feature space but different sample spaces.
  • Example: two banks may have very different user groups, but their businesses are very similar, so the feature spaces are the same.
  • [58]: collaborative deep learning, where participants train independently and share only subsets of parameter updates.
  • [41]: in 2017, Google used horizontal federated learning for Android model updates, jointly training a centralized model.
  • A secure aggregation scheme protects the privacy of aggregated user updates within the federated-learning framework.
  • Homomorphic encryption: an encryption scheme that lets users compute directly on ciphertexts, such that decrypting the result yields the expected result of the computation on the plaintexts.

2.3.1 Horizontal Federated Learning. Horizontal federated learning, or sample-based federated learning, is introduced in the scenarios in which datasets share the same feature space but different space in samples (Figure 2(a)). For example, two regional banks may have very different user groups from their respective regions, and the intersection set of their users is very small. However, their business is very similar, so the feature spaces are the same. The authors of [58] proposed a collaborative deep-learning scheme in which participants train independently and share only subsets of updates of parameters. In 2017, Google proposed a horizontal federated-learning solution for Android phone model updates [41]. In that framework, a single user using an Android phone updates the model parameters locally and uploads the parameters to the Android cloud, thus jointly training the centralized model together with other data owners. A secure aggregation scheme to protect the privacy of aggregated user updates under their federated-learning framework is also introduced in [9]. The authors of [51] use additively homomorphic encryption for model parameter aggregation to provide security against the central server.
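The additively homomorphic encryption used in [51] for parameter aggregation can be illustrated with a toy Paillier scheme. This is only a sketch with tiny, insecure parameters chosen for readability; the point is the additive property, namely that multiplying two ciphertexts yields an encryption of the sum of the plaintexts.

```python
from math import gcd

def lcm(a, b):
    return a * b // gcd(a, b)

# Toy Paillier parameters: far too small to be secure, illustration only.
p, q = 61, 53
n = p * q
n2 = n * n
g = n + 1                      # standard simple choice of generator
lam = lcm(p - 1, q - 1)
mu = pow(lam, -1, n)           # valid because g = n + 1 and gcd(lam, n) = 1

def encrypt(m, r):
    # E(m) = g^m * r^n mod n^2, with r coprime to n
    return (pow(g, m, n2) * pow(r, n, n2)) % n2

def decrypt(c):
    # L(x) = (x - 1) // n;  m = L(c^lam mod n^2) * mu mod n
    return ((pow(c, lam, n2) - 1) // n) * mu % n

c1 = encrypt(12, 17)
c2 = encrypt(30, 23)
# Multiplying ciphertexts adds the underlying plaintexts:
assert decrypt((c1 * c2) % n2) == 12 + 30
```

In a federated setting this is what lets a server aggregate encrypted model updates without ever seeing an individual participant's plaintext gradient.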

Security of horizontal federated learning: an honest-but-curious server

  1. Security of horizontal federated learning:
  • Honest participants; an honest-but-curious server [9, 51].
  • Only the server can compromise the privacy of the data participants.
  • [29] additionally considers malicious users.
  • When training finishes, the server shares the model parameters with all data participants.

Security Definition. A horizontal federated learning system typically assumes honest participants and security against an honest-but-curious server [9, 51]. That is, only the server can compromise the privacy of data participants. Security proof has been provided in these works. Recently, another security model considering malicious users [29] was also proposed, posing additional privacy challenges. At the end of the training, the universal model and all of the model parameters are exposed to all participants.

  1. Horizontal, vertical, and transfer federated learning:
    Federated learning / Example:
    🤩 Horizontal federated learning (different users, same business): banks in two different cities
    😍 Vertical federated learning (same users, different business): a bank and an e-commerce company in the same city
    🤯 Federated transfer learning (different users, different business): an e-commerce company in the US, a bank in Asia

Techniques

Data: (feature, label, sample ID)

  1. Data: (feature, label, sample ID):
  • Data matrix: rows are samples, columns are features.
  • Notation: features are X, labels are Y, sample IDs are I.
  • Finance example: label = credit.
  • Marketing example: label = the user's purchase desire.
  • Education example: label = the student's degree.
  • Categories: horizontal federated learning, vertical federated learning, and federated transfer learning (both features and labels differ).

Let matrix $\mathcal{D}_{i}$ denote the data held by each data owner $i$. Each row of the matrix represents a sample, and each column represents a feature. At the same time, some datasets may also contain label data. We denote the feature space as $\mathcal{X}$, the label space as $\mathcal{Y}$, and we use $\mathcal{I}$ to denote the sample ID space. For example, in the financial field, labels may be users' credit; in the marketing field, labels may be the user’s purchase desire; in the education field, $y$ may be the degree of the students. The feature $\mathcal{X}$, label $\mathcal{Y}$, and sample IDs $\mathcal{I}$ constitute the complete training dataset $(\mathcal{I}, \mathcal{X}, \mathcal{Y})$. The feature and sample spaces of the data parties may not be identical, and we classify federated learning into horizontal federated learning, vertical federated learning, and federated transfer learning based on how data is distributed among various parties in the feature and sample ID space. Figure 2 shows the various federated learning frameworks for a two-party scenario.

  1. Horizontal federated learning expressed in this notation:
  • Same (features, labels) -> same business.
  • Different sample IDs -> different users.

We summarize horizontal federated learning as $\mathcal{X}_{i}=\mathcal{X}_{j},\ \mathcal{Y}_{i}=\mathcal{Y}_{j},\ \mathcal{I}_{i} \neq \mathcal{I}_{j},\ \forall \mathcal{D}_{i}, \mathcal{D}_{j}, i \neq j.$
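The condition above, together with its vertical and transfer counterparts, can be sketched as a small classification rule over feature spaces and sample-ID spaces. `fl_category` is a hypothetical helper written for this note, not something from the paper:

```python
def fl_category(features_i, ids_i, features_j, ids_j):
    """Classify a two-party setting by feature-space and sample-ID overlap.
    Sketch of the paper's taxonomy; boundary cases are simplified."""
    same_features = set(features_i) == set(features_j)
    shared_ids = set(ids_i) & set(ids_j)
    if same_features and not shared_ids:
        return "horizontal"       # X_i = X_j, Y_i = Y_j, I_i != I_j
    if not same_features and shared_ids:
        return "vertical"         # X_i != X_j, large overlap in I
    return "federated transfer"   # both feature and sample spaces differ

# Two regional banks: same features, disjoint customers -> horizontal
assert fl_category(["age", "income"], ["u1", "u2"],
                   ["age", "income"], ["u3", "u4"]) == "horizontal"
```

The same helper returns "vertical" for a bank and an e-commerce site sharing users, matching the table in the Mindset section.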

  1. Horizontal federated-learning workflow:
  • k participants with the same data structure collaboratively learn a machine-learning model with the help of a cloud server.
  • Assume the participants are honest and the server is honest but curious, so no information may leak from any participant to the server.
  • Training process:
    1. Participants compute local training gradients, mask the gradients with encryption [51], differential privacy [58], or secret sharing [9], and send the masked results to the server.
    2. The server computes the aggregated result.
    3. The server sends the aggregated result back to the participants.
    4. Participants update their models with the decrypted gradients.
  • Iterate until the loss function converges, completing the entire training process.
  • All participants share the final result.

In our case, we only transmit the Lasso model itself, not the regularization level obtained from training, so no privacy is leaked either.

2.4.1 Horizontal Federated Learning. A typical architecture for a horizontal federated-learning system is shown in Figure 3. In this system, $k$ participants with the same data structure collaboratively learn a machine-learning model with the help of a parameter or cloud server. A typical assumption is that the participants are honest whereas the server is honest but curious; therefore, no leakage of information from any participants to the server is allowed [51]. The training process of such a system usually contains the following four steps.

  • Step 1: Participants locally compute training gradients; mask a selection of gradients with encryption [51], differential privacy [58], or secret sharing [9] techniques; and send masked results to the server.
  • Step 2: The server performs secure aggregation without learning information about any participant.
  • Step 3: The server sends back the aggregated results to participants.
  • Step 4: Participants update their respective model with the decrypted gradients.

Iterations through the above steps continue until the loss function converges, thus completing the entire training process. This architecture is independent of specific machine-learning algorithms (logistic regression, DNN, etc.) and all participants will share the final model parameters.
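One round of the four steps can be simulated with pairwise additive masking, a simplified stand-in for the secret-sharing-style secure aggregation of [9]: each pair of participants agrees on a random mask that one adds and the other subtracts, so the masks cancel in the server's sum and the server learns only the aggregate. Integer arithmetic modulo Q stands in for quantized gradients; all names here are illustrative.

```python
import random

Q = 2**31 - 1  # modulus for exact mask cancellation (quantized gradients)

def pairwise_masks(k, dim, rng):
    # masks[i]: net mask participant i applies; for each pair (i, j),
    # i adds a shared random vector s and j subtracts it.
    masks = [[0] * dim for _ in range(k)]
    for i in range(k):
        for j in range(i + 1, k):
            s = [rng.randrange(Q) for _ in range(dim)]
            for d in range(dim):
                masks[i][d] = (masks[i][d] + s[d]) % Q
                masks[j][d] = (masks[j][d] - s[d]) % Q
    return masks

def train_round(gradients, rng):
    k, dim = len(gradients), len(gradients[0])
    masks = pairwise_masks(k, dim, rng)
    # Step 1: each participant uploads a masked gradient.
    uploads = [[(g[d] + masks[i][d]) % Q for d in range(dim)]
               for i, g in enumerate(gradients)]
    # Step 2: the server aggregates; pairwise masks cancel, so the server
    # sees only the sum, never an individual gradient.
    agg = [sum(u[d] for u in uploads) % Q for d in range(dim)]
    # Steps 3-4: the server broadcasts agg; participants update locally.
    return agg

grads = [[3, 1], [5, 2], [7, 4]]
assert train_round(grads, random.Random(0)) == [15, 7]
```

Repeating `train_round` until the loss converges mirrors the iteration described above; swapping the masking for homomorphic encryption or differential privacy changes Step 1 but not the overall shape.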

Usage

Federated learning has weaker rights of full data use than distributed machine learning

  1. Federated learning vs. distributed machine learning: full rights to use the data:
  • Distributed machine learning: distributed storage of training data, distributed execution of computing tasks, distributed delivery of model results.
  • Parameter server: accelerates training by storing data and allocating data and computing resources so the model trains more efficiently.
  • Horizontal federated learning: a node is a data owner and is not obliged to cooperate.
  • Federated learning faces a more complex learning environment.
  • Federated learning emphasizes data-privacy protection for the data owner throughout training.
  • Effective privacy-protection measures will help cope with an increasingly stringent data-security regulatory environment.
  • Federated learning must also handle non-IID data.

3.2 Federated Learning versus Distributed Machine Learning Horizontal federated learning at first sight is somewhat similar to distributed machine learning. Distributed machine learning covers many aspects, including distributed storage of training data, distributed operation of computing tasks, and distributed distribution of model results. A parameter server [30] is a typical element in distributed machine learning. As a tool to accelerate the training process, the parameter server stores data on distributed working nodes and allocates data and computing resources through a central scheduling node to train the model more efficiently. For horizontal federated learning, the working node represents the data owner. It has full autonomy for the local data; it can decide when and how to join the federated learning. In the parameter server, the central node always takes control; thus, federated learning is faced with a more complex learning environment. In addition, federated learning emphasizes the data-privacy protection of the data owner during the model training process. Effective measures to protect data privacy can better cope with the increasingly stringent data privacy and data security regulatory environment in the future. As in distributed machine-learning settings, federated learning will also need to address non-IID data. The authors of [77] showed that, with non-IID local data, performance can be greatly reduced for federated learning. The authors in response supplied a new method to address the issue similar to transfer learning.
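The non-IID setting studied in [77] can be reproduced with a classic label-sorted shard partition: sort samples by label and deal contiguous shards to clients, so each local dataset covers only a few classes. The helper name and parameters below are assumptions for illustration, not from the paper.

```python
import random

def noniid_partition(labels, k, shards_per_client=2, seed=0):
    """Pathological non-IID split: sort sample indices by label, cut them
    into k * shards_per_client contiguous shards, and deal shards to the
    k clients, so each client sees only a few distinct classes."""
    idx = sorted(range(len(labels)), key=lambda i: labels[i])
    shard = len(idx) // (k * shards_per_client)
    shards = [idx[s * shard:(s + 1) * shard]
              for s in range(k * shards_per_client)]
    rng = random.Random(seed)
    rng.shuffle(shards)
    return [sum(shards[c * shards_per_client:(c + 1) * shards_per_client], [])
            for c in range(k)]

labels = [i % 10 for i in range(1000)]       # 10 balanced classes
parts = noniid_partition(labels, k=5)
# With balanced classes and shard size equal to class size, each client
# ends up with at most 2 distinct labels instead of all 10.
assert all(len({labels[i] for i in p}) <= 2 for p in parts)
```

Comparing a model trained on such partitions against an IID split makes the performance gap reported in [77] easy to observe.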

Postscript

  1. I gained a lot from reading Federated Machine Learning: Concept and Applications! I feel I now understand many of the technical details of federated learning much better. Working this way, I can grasp the core concepts a problem needs and write down the points the reader actually cares about!

  2. Continuing to learn and organize information into knowledge by writing articles works far better than my old search-as-you-read approach! With fast typing, everything I read and note down becomes a reference database for later use, making time my friend and letting knowledge compound! Onward and upward, together!

2022.01.27. 紫蕊, West Lafayette, Indiana, USA.

Copyright

CC BY-NC-ND 4.0
