Paper Reading on Compressed Sensing with LDPC codes

2020.06.06

1. LDPC Codes for Compressed Sensing

1.1 Introduction

We present a mathematical connection between channel coding and compressed sensing. In particular, we link, on the one hand, channel coding linear programming decoding (CCLPD), which is a well-known relaxation of maximum-likelihood channel decoding for binary linear codes, and, on the other hand, compressed sensing linear programming decoding (CS-LPD), also known as basis pursuit, which is a widely used linear programming relaxation for the problem of finding the sparsest solution of an under-determined system of linear equations. More specifically, we establish a tight connection between CS-LPD based on a zero-one measurement matrix over the reals and CC-LPD of the binary linear channel code that is obtained by viewing this measurement matrix as a binary parity-check matrix. This connection allows the translation of performance guarantees from one setup to the other. The main message of this paper is that parity-check matrices of “good” channel codes can be used as provably “good” measurement matrices under basis pursuit. In particular, we provide the first deterministic construction of compressed sensing measurement matrices with an order-optimal number of rows using high-girth low-density parity-check (LDPC) codes constructed by Gallager.

我们提出了信道编码和压缩感知之间的数学联系。特别是，我们一方面链接了信道编码线性编程解码（CCLPD），这是众所周知的针对二进制线性编码的最大似然信道解码的松弛，另一方面是压缩感知线性编程解码（CS-LPD），也称为基追踪，它是一种广泛使用的线性规划松弛方法，用于发现线性方程组欠定的最稀疏解的问题。更具体地说，我们在基于实数上的01测量矩阵的CS-LPD与通过将此测量矩阵视为二进制奇偶校验矩阵而获得的二进制线性通道代码的CC-LPD之间建立了紧密连接。此连接允许将性能保证从一种情境转换为另一种情境上。本文的主要信息是，在基追踪下，“良好”信道编码的奇偶校验矩阵可被证明是“良好”基追踪的测量矩阵。特别是，我们使用Gallager构造的高围低密度奇偶校验（LDPC）码，提供了行数为顺序最优的压缩传感测量矩阵的确定性构造。

2. Improving Distributed Gradient Descent Using Reed–Solomon Codes

Today’s massively-sized datasets have made it necessary to often perform computations on them in a distributed manner. In principle, a computational task is divided into subtasks that are distributed over a cluster operated by a taskmaster. One issue faced in practice is the delay incurred due to the presence of slow machines, known as stragglers. Several schemes, including those based on replication, have been proposed in the literature to mitigate the effects of stragglers, and more recently, those inspired by coding theory have begun to gain traction. In this work, we consider a distributed gradient descent setting suitable for a wide class of machine learning problems. We adapt the framework of Tandon et al. [18] and present a deterministic scheme that, for a prescribed per-machine computational effort, recovers the gradient from the least number of machines f theoretically permissible, via an $O(f^2)$ decoding algorithm. We also provide a theoretical delay model that can be used to minimize the expected waiting time per computation by optimally choosing the parameters of the scheme. Finally, we supplement our theoretical findings with numerical results that demonstrate the efficacy of the method and its advantages over competing schemes.

当今庞大的数据集使得有必要经常对它们进行分布式计算。原则上，计算任务分为多个子任务，这些子任务分布在由任务主管操作的群集上。实践中面临的一个问题是由于存在缓慢的机器（称为落后者）而导致的延迟。文献中已经提出了几种方案，包括基于复制的方案，以减轻落后者的影响。最近，受到编码理论启发的方案也开始受到关注。在这项工作中，我们考虑了适用于各种机器学习问题的分布式梯度下降设置。我们采用了Tandon等人的框架，并提出了一种确定性方案，该方案针对规定的每台机器计算工作量，通过 $O(f^2)$ 解码算法从理论上允许的最少数量的机器f中恢复梯度。我们还提供了理论上的延迟模型，该模型可通过最佳选择方案的参数来使每次计算的预期等待时间最小化。最后，我们用数值结果补充了我们的理论发现，这些数值结果证明了该方法的有效性及其相对于竞争方案的优势。

数据集增大，高运算量，内存需求，单机训练困难，因此多机多核分布式训练；并行与分布式训练吸引了很多注意
任务并行处理，而非序列化
Master 需等待所有机器；大量延迟与异构延迟，训练变慢
解决方案：
- 忽略落后节点：学习算法的性能有负面影响
- 引入冗余度：设计方案划分任务，只使用部分机器恢复计算，不需考虑是哪些机器
- 编码理论：进展，移动通信，类似场景
本文：建立分布式机器学习与编码理论之间的桥梁
- 将精心设计的编码方案引入多机间高效划分任务
- 考虑基于梯度的方法实现的可加可分离的损失函数
- 提出基于RS码的确定性重建；高效的解码器将梯度从固定数量的返回机器上更新
相关工作
- Lee et al. 在矩阵乘法与数据改组的分布式任务中使用编码理论获得性能提升
- Dutta et al 提出方法加速分布式矩阵乘法，通过稀疏化单机计算的内积
- Li et al 提出编码的 MapReduce 框架，加速分布式训练中的数据改组
- Tandon et al 使用MDS 码改进落后者的分布式梯度下降
- 上述方法，解码均为线下，这在实际中可能不可行
贡献
- 给定机器数量下高效分布式梯度下降确定性方法
- 高效在线解码器
- 分析运算时间，最优编码参数；重尾延迟

准备工作

Master，n 个workers，学习参数 $\beta\in\mathbb{R}^p$ 通过最小化损失函数 $L(D;\beta)$
数据集 $D=\{(x_i,y_i\}_{i=1}^N$ ， $x_i\in\mathbb{R}^p$
$L(D;\beta)=\sum l(x_i,y_i;\beta)$
$\nabla_\beta L(D;\beta)=\sum \nabla l(x_i,y_i;\beta)$
数据集分为 $k$ 个，梯度也分为k个， $g_i = \sum_{x_i,y_i)\in D_i} \nabla l(x_i,y_i;\beta)$ , $\bar{g} = [g_1, g_2,\cdots,g_k]^T$
每个worker分配w个数据划分；每个划分可以被分给多个机器，冗余度就在这里
$s \leq ground(\frac{wn}{k}) - 1$
编码矩阵B
- B一行有w个非零元
- $B$ 任取 $n-s$ 行，向量组合可以张成 $1_{1\times k}$
- $W_i$ 发送的梯度： $c_i = \sum_{j=1}^k B_{i,j}g_j = B_i g$
解码： $AB=1$
- 解码矩阵 $C_n^f$ 行，随n涨得很快，保存不现实；在线解码，基于逆范德蒙矩阵，速度很快
Tradeoff：

构建码

基础 $M\in\{0,1\}^{n\times k}$ ，每行重 $w$ ，为B的mask
B的每一列为一个复杂域上的Reed-Solomon码的码字，M的对应列为支持
GC文章使用MDS码，无法适用于n不等于k时
M定义为列平衡矩阵，
- B的第j列选择为以M的第j列为支撑的RS码字

3. Gradient Coding

提出新的编码理论框架，用于减轻分布式学习中的落后者。精心地复制数据块，对梯度进行编码，可以为同步梯度下降提供故障容错能力。

分布式训练：数据分散到多个worker上，分别计算梯度，发送回master，更新模型
落后者 stragglers：延迟很严重；云上的便宜虚拟机器常见，如亚马逊 EC2，一些机器比常规性能慢5倍
无编码解决此问题：
- 每台worker复制两份数据，通过通信，可以选择返回一份梯度或者两份梯度之和
- 此种情况，需要了解哪个worker是落后者：反馈与协调
有编码：反馈与协调并不需要
- $W_1: \frac{1}{2}g_1+g_2,W_2:g_2-g_3,W_3:\frac{1}{2}g_1+g_3$
- $g_1+g_2+g_3=2(\frac{1}{2}g_1+g_2) - (g_2-g_3)$
考虑一般情况： $n$ 台机器，任意 $s$ 台落后者
- 每个数据分割需要被复制 $s+1$ 次
- 提出两种方案；考虑部分落后者情况
- 与忽略落后者的方案比较
落后者的影响
- 越便宜，性能偏差越大
- 使用低成本实例，搭配编码方案，可能是一种明智的选择
相关工作
- 阿喀琉斯之踵，异步方案：放弃快收敛、方便的分析与调试性能
- 同步小批量SGD、辅助worker
- 估计失败分组的loss函数，使用失败前的数据
- Lee and Dutta：类似使用编码理论，但本文关注恢复批梯度，而他们关注数据重排与矩阵相乘这两种应用
- Li 使用coding到mapreduce
- 之前工作并没有在梯度向量上编码

准备工作

$D = \{(x_1, x_2), \cdots, (x_d,y_d)\},(x,y)\in \mathbb{R}^p\times \mathbb{R}$
$\beta^* = \argmin_{\beta\in \mathbb{R}^p} \sum_{i=1}^d \mathcal{l}(\beta;x_i,y_i) +\lambda R(\beta)$
梯度下降法解优化问题， $h_R$ ：与 $R$ 有关的优化器
- $g = \sum_{i=1}^d \nabla l(\beta^{(t)};x_i,y_i), \beta^{(t+1)} = h_R(\beta{(t)},g)$
- $d$ 很大时，计算的瓶颈为梯度的计算
$d$ 个样本， $n$ 个worker， $k$ 个划分， $s$ 个落后者， $supp(x) = \{i|x_i\neq 0\}$
设置
- $AB=\mathbb{1}_{f\times k},A\in\mathbb{R}^{f\times n}, B\in\mathbb{R}^{n\times k}$ ， $f$ 表示非落后者数量
- $B$ 的每一行对应于一个worker， $\bar{g} = [g_1, g_2,\cdots,g_k]^T$ 表示梯度矩阵，每行为一个划分， $b_i\bar{g}$ 为 $W_i$ 传回的梯度； $supp(b_i)$ 就表示 $W_i$ 被分到的数据分组
- $a_i$ 为一种落后者场景
- $a_iB\bar{g} = [1,1,\cdots,1]\bar{g} = (\sum_{j=1}^k g_j)^T$
- $a_iB\bar{g} = \sum_{k\in supp(a_i)} a_i(k)(b_k\bar{g})$
目的是寻找方案构建 $(A,B)$ ，使得对任意 $s$ 落后者鲁棒
- 完全落后者
- 部分落后者

Full stragglers

$s$ 个落后者， $f=C_n^{n-s} = C_n^s$ ，表示非落后者的组合数
$B$ 需要满足展开条件
- 条件1： $B$ 任取 $n-s$ 行，向量组合可以张成 $1_{1\times k}$
- 条件非常充分
- 根据满足条件的 $B$ 与 $s$ ，可以得到 $A$ 使得 $AB=1_{f\times n}$
问题变成寻求满足条件的 $B$
- 全1矩阵肯定可以，但浪费，表示所有worker分到全部数据
- 希望 $B$ 尽可能稀疏
- 定理1： $B$ 的每行非零元数量相同，那么 $||b_i||_0 \geq \frac{k}{n}(s+1)$
- 假定 $k=n$ ， $B$ 为每行含有 $s+1$ 个非零元的方阵

小数重复方案

要求 $n$ 可以被 $s+1$ 整除，把workers分成 $s+1$ 个小组，每个小组 $n/(s+1)$ 个机器
每个机器上 $(s+1)$ 份数据，这样每个小组就有全部的数据