（中英文对照）World’s first practical hardware for zero-knowledge proofs acceleration

Preliminary results of Matter Labs’ FPGA research project

Today Matter Labs is releasing no fewer than THREE progress reports. We start with our first empirical test results for the FPGA implementation of multiexponentiation operation on the BLS12–381 curve are in! New to multiexponentiation, but feeling brave? Read on!

For newcomers, multiexponentiation is the most time consuming operation used in modern proof systems based on pairing friendly elliptic curves: it takes around 70% of prover time [1] in proof systems like Groth16 or PLONK. Numbers vary on different elliptic curves, proof systems and circuit sizes, but you get the idea.

Our first version achieved a speed of 2.8 million point/scalar pairs per second on a single FPGA board. As a comparison, an 8 physical core CPU can chunk through around 440 thousand point/scalar pairs per second. We emphasize physical cores because cloud providers also count virtual cpus; virtual cpus effectively double the number of physical cores for IO bound computations, but only physical cores matter for intensive arithmetic computations. Researchers at Matter Labs see further optimization opportunities with the potential to scale point/scalar pairs per second a further 2–2.5x. Unfortunately, the BLS12–381 curve is not yet available on Ethereum, but is scheduled for the August 2020 Berlin fork.

Our implementation targets Amazon F1 cloud FPGA servers, as we’re trying to minimize the necessary complexity to contribute. We want to make it possible for anyone to rent a cloud server and spin right up. This has placed some limits on performance: F1 instances have frequency and resource limitations. By targeting the F1 instance, our implementation will be cross-compatible with the majority of Xilinx FPGAs on the market today. Physical FPGAs owners should expect essentially free performance boost for their boards.

These multiexponentiation performance gains enable several new applications. For example, by accelerating multiexponentiation, it is now possible to implement “PLONK without FFTs,” proposed by Justin Drake [1]. Eliminating FFTs from the proof system workflow eliminates a requirement for elliptic curve main subgroup order r for simple radix-2 FFT: that r-1 has a large divisor form of 2^k.

Without such requirement one could, for instance, instantiate efficient constructions of timelock puzzles based on the Sloth++ construction by D. Boneh et at. [2], where validity of the delay function evaluation is proved using a zkSNARK (Sloth++ construction requires r = 3 mod 4). Timelock puzzles require that output of a delay function is unique. Timelock puzzles prevent one from obtaining a valid decryption key sooner than 1 minute after publication of some data. VDFs permit potentially multiple outputs, that nevertheless can not be computed faster than some bound.

We plan to extend our solution to support wider set of elliptic curves such as BN254 (available for zkSNARKs in Ethereum at the moment) and BLS12–377 (used in Zexe construction [3], a close cousin of BLS12–381). As usual, we will publish our work as open source and release an integration to our proof generation libraries.

1] J. Drake, “PLONK without FFTs”, https://www.youtube.com/watch?v=ffXgxvlCBvo

2] D. Boneh et at., “Verifiable Delay Functions”, https://eprint.iacr.org/2018/601.pdf

3] S. Bowe et al., “Zexe: Enabling Decentralized Private Computation”, https://eprint.iacr.org/2018/962.pdf

Matter Labs FPGA 研究项目初步成果

今天，Matter Labs 发布了不少于三份进度报告。我们从我们在 BLS12-381 曲线上多指数运算的 FPGA 实现的第一个经验测试结果开始！不熟悉乘幂，但感觉很勇敢？继续阅读！

对于新手来说，多幂运算是基于配对友好椭圆曲线的现代证明系统中使用的最耗时的操作：在 Groth16 或 PLONK 等证明系统中，它需要大约 70% 的证明者时间 [1]。数字在不同的椭圆曲线、证明系统和电路尺寸上有所不同，但你明白了。

我们的第一个版本在单个 FPGA 板上实现了每秒 280 万点/标量对的速度。作为比较，一个 8 个物理核心 CPU 每秒可以分块大约 44 万个点/标量对。我们强调物理核心，因为云提供商也计算虚拟 CPU；虚拟 CPU 有效地使 IO 绑定计算的物理内核数量增加了一倍，但只有物理内核对密集算术计算很重要。Matter Labs 的研究人员看到了进一步的优化机会，有可能将每秒的点/标量对再扩大 2-2.5 倍。不幸的是，BLS12-381 曲线在以太坊上尚不可用，但计划于2020 年 8 月柏林分叉。

我们的实施以 Amazon F1 云 FPGA 服务器为目标，因为我们正在努力将贡献所需的复杂性降至最低。我们希望让任何人都可以租用云服务器并立即启动。这对性能造成了一些限制：F1 实例具有频率和资源限制。通过以 F1 实例为目标，我们的实现将与当今市场上的大多数 Xilinx FPGA 交叉兼容。物理 FPGA 所有者应该期望他们的电路板基本上可以免费提升性能。

这些多幂运算性能提升支持多个新应用程序。例如，通过加速多幂运算，现在可以实现 Justin Drake [1] 提出的“没有 FFT 的 PLONK”。从证明系统工作流程中消除 FFT 消除了对简单基 2 FFT 的椭圆曲线主子群阶“r”的要求：“r-1”具有较大的除数形式“2^k”。

例如，如果没有这样的要求，可以根据 D. Boneh 等人的 Sloth++ 构造实例化时间锁谜题的有效构造。[2]，其中使用 zkSNARK 证明了延迟函数评估的有效性（Sloth++ 构造需要r = 3 mod 4）。时间锁谜题要求延迟函数的输出是唯一的。时间锁难题可防止人们在某些数据发布后 1 分钟内获得有效的解密密钥。VDF 允许潜在的多个输出，但计算速度不能超过某个界限。

我们计划扩展我们的解决方案以支持更广泛的椭圆曲线集，例如 BN254（目前可用于以太坊中的 zkSNARK）和 BLS12-377（用于 Zexe 构造 [3]，是 BLS12-381 的近亲）。像往常一样，我们将把我们的工作作为开源发布，并发布到我们的证明生成库的集成。

1] J. Drake，“没有 FFT 的 PLONK”，https://www.youtube.com/watch?v= ffXgxvlCBvo

2] D. Boneh 等人，“可验证的延迟函数”，https://eprint.iacr.org/2018/601.pdf

3] S. Bowe 等人，“Zexe：实现去中心化私有计算”，https ://eprint.iacr.org/2018/962.pdf