Hardware Acceleration
Leading Problem
- Proof generation is time-consuming, high-latency, and not efficient enough on traditional CPUs
- As L2 rollups and other ZKP applications grow, we need to consider hardware as a potential avenue for breakthrough performance improvements
Business Models
- zk-mining: plug-and-play FPGA accelerator cards (revenue through hardware sales)
- zk-Rollup: public, private, and on-premise cloud (revenue through hourly premiums)
- Software auxiliary: SDKs and APIs for dApps (revenue through licenses to program FPGAs in a developer-friendly way)
3 Potential Scenarios
- Permissioned networks, e.g. StarkNet
- Permissionless networks where provers compete on cost, e.g. Mina’s Snarketplace
- Permissionless networks where provers compete on latency, e.g. Polygon Hermez & Polygon Zero
Selection metrics
- Performance: latency, throughput, and power consumption
- Total cost of ownership (capital expenditures and maintenance costs)
- NRE cost (non-recurring engineering: onboarding difficulty)
- Long-term comparative advantage: competitor performance may double in a year at the same MSRP (e.g. Intel and Nvidia GPUs)
Structure of ZKP Hardware Acceleration
- The current popular structure combines a CPU with an FPGA/GPU to perform proof generation: the CPU tackles single-threaded pieces at a higher clock frequency and deals with non-determinism
- MSM and FFT can be parallelized, but arithmetization and commitments may be single-threaded (different from the “embarrassingly parallel” PoW mining structure); see the sketch below
- Most FPGA design tools are proprietary, so FPGAs have a higher barrier to entry than GPU acceleration
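To make the parallelism point concrete, here is a minimal Python sketch of splitting a multi-scalar multiplication (MSM) into independent chunks, the structure that FPGA/GPU accelerators exploit. It is not tied to any real curve library: integers under addition modulo a prime stand in for curve points, and a process pool stands in for hardware cores.

```python
# Toy sketch (not a real curve library): multi-scalar multiplication (MSM)
# split into independent chunks that could run on separate FPGA/GPU cores.
# Curve points are replaced by integers modulo a prime as a stand-in for the
# group law, purely to illustrate the parallel structure.
from concurrent.futures import ProcessPoolExecutor

P = 2**61 - 1  # placeholder modulus standing in for the group

def partial_msm(chunk):
    """Accumulate one chunk of scalar*point terms (toy group = integers mod P)."""
    acc = 0
    for scalar, point in chunk:
        acc = (acc + scalar * point) % P
    return acc

def msm(scalars, points, workers=4):
    pairs = list(zip(scalars, points))
    size = (len(pairs) + workers - 1) // workers
    chunks = [pairs[i:i + size] for i in range(0, len(pairs), size)]
    with ProcessPoolExecutor(max_workers=workers) as pool:
        partials = pool.map(partial_msm, chunks)   # independent, parallel work
    return sum(partials) % P                       # final reduction is a cheap serial step

if __name__ == "__main__":
    import random
    scalars = [random.randrange(P) for _ in range(1 << 12)]
    points = [random.randrange(P) for _ in range(1 << 12)]
    print(msm(scalars, points))
```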
Light Client
Leading Problem
- Running a full node is resource-draining: downloading and verifying the whole chain of blocks takes time and resources
- Not retail-friendly: a full node cannot run on mobile devices and perform quick actions like sending transactions and verifying balances
Technology Highlight
- Trust-minimized interaction with full nodes; light clients do not interact directly with the blockchain
- Much lighter on resources and storage but require higher network bandwidth
- Verification takes time linear in chain length: verify from the root to the target height (the starting point may be a trusted height, making the length effectively constant)
- Process overview (a minimal sketch follows this list):
- retrieve a chain of headers instead of full blocks
- verify the signatures of the intermediary signed headers and resolve disputes by iteratively checking with full nodes
- Superlight client: requires downloading only a logarithmic number of block headers while storing only a single block header between executions
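As a rough illustration of the process above, here is a minimal Python sketch of light-client header-chain verification, assuming a hypothetical header format with height, parent hash, and signature fields; sig_valid() is a placeholder for whatever signature/validator-set check the real protocol uses.

```python
# Minimal sketch of light-client header verification. The header format and
# sig_valid() are illustrative assumptions, not any particular chain's spec.
import hashlib
from dataclasses import dataclass

@dataclass
class Header:
    height: int
    parent_hash: bytes
    payload: bytes      # stand-in for the remaining header fields
    signature: bytes    # stand-in for the validators' signature(s)

def header_hash(h: Header) -> bytes:
    return hashlib.sha256(h.parent_hash + h.height.to_bytes(8, "big") + h.payload).digest()

def sig_valid(h: Header) -> bool:
    # Placeholder: a real client verifies the signature(s) against the
    # validator set committed to by earlier headers.
    return len(h.signature) > 0

def verify_header_chain(trusted: Header, headers: list[Header]) -> bool:
    """Walk from a trusted header to the target height, checking each link."""
    prev = trusted
    for h in headers:
        if h.height != prev.height + 1:
            return False
        if h.parent_hash != header_hash(prev):
            return False
        if not sig_valid(h):
            return False
        prev = h
    return True
```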
Commit-and-prove Schemes
Arithmetic Fields
Leading Problem
- Proof systems encode computation as a set of polynomial equations defined over a field. Operating on a different field will lead to gigantic circuits.
- For pairing based SNARKs, there is a limited selection of pairing-friendly curves such as BN254 and BLS12-381.
- FFTs require the field to have high 2-adicity (a large power-of-two factor in the multiplicative group order) in order to make the polynomial computations practical.
- Proof recursion requires encoding arithmetic statements over one field as equations over a different field.
- Bridge operators and rollups need interoperability with non-pairing-friendly and low-2-adicity curves, such as secp256k1 used by Bitcoin's and Ethereum's ECDSA (Ethereum provides precompiled contracts for BN254).
- Choosing the field is a tradeoff between speed and security.
Solutions
- Brute force: decompose elements into limbs and operate on the limbs; reduce and recompose the elements only when necessary (a sketch follows this list)
- 2-chains and cycles: use matching base fields to implement one curve inside the other (assumes pairing-based schemes)
- Non-pairing-based schemes can use non-pairing-friendly or hybrid cycles at a cost, such as using a PCS (polynomial commitment scheme) or the Pasta curves with linear time.
- Pasta Curves
- 2-chains of elliptic curves
- Lower the requirements on the field: a polynomial commitment scheme (PCS) need not involve elliptic curves at all and can be instantiated over arbitrary fields
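A toy Python sketch of the brute-force approach, decomposing elements of a non-native field into limbs and reducing only at the end; the parameters (four 64-bit limbs, secp256k1's modulus) are illustrative and not tied to any particular proof system.

```python
# Toy sketch of the "brute force" approach: represent a non-native field
# element as limbs, multiply limb-wise, and delay the modular reduction.
LIMB_BITS = 64
NUM_LIMBS = 4
P_FOREIGN = 2**256 - 2**32 - 977  # secp256k1 base field modulus (example)

def to_limbs(x):
    return [(x >> (LIMB_BITS * i)) & (2**LIMB_BITS - 1) for i in range(NUM_LIMBS)]

def from_limbs(limbs):
    return sum(l << (LIMB_BITS * i) for i, l in enumerate(limbs))

def mul_limbwise(a_limbs, b_limbs):
    """Schoolbook product: limb products are accumulated by weight, and the
    expensive reduction mod P_FOREIGN happens once, on the recomposed value."""
    acc = [0] * (2 * NUM_LIMBS)
    for i, a in enumerate(a_limbs):
        for j, b in enumerate(b_limbs):
            acc[i + j] += a * b            # in-circuit: a native mul + add
    return from_limbs(acc) % P_FOREIGN     # reduce/recompose only when necessary

a, b = 12345678901234567890, 98765432109876543210
assert mul_limbwise(to_limbs(a), to_limbs(b)) == (a * b) % P_FOREIGN
```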
Reference Reading
Efficient Signatures
Leading Problems
- Signature verification is a problem that arises in many ZK related projects, from Zcash to zk-rollups
- Different types of signatures depend on different field or curve arithmetic and need different speed-up methods
- Here we will discuss some different options for signature schemes, and strategies for efficient implementations
Specialized ZK signatures
- Given a ZKP system, a simple signature scheme can be constructed as follows: the signer generates a random secret key sk and hashes it to get the associated public key pk. They then prove, in zero knowledge, that they know some sk such that hash(sk) = pk (see the sketch after this list). Picnic is a variant of this idea, using a block cipher rather than a hash function
- Since the statement being proved involves just a single evaluation of a hash or block cipher, these schemes can be highly efficient, particularly if using arithmetic-friendly primitives such as LowMC or Poseidon
- The caveat here is that the proof must be generated by the user who owns the secret key. If the signature is part of a larger circuit, such as Zcash’s spend circuit, this may be undesirable, as a user cannot outsource proving to another entity without revealing their secret key to that entity
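A minimal Python sketch of the statement behind such a scheme, with SHA-256 standing in for an arithmetic-friendly primitive like Poseidon (or Picnic's block cipher); the proof system that would actually prove knowledge of sk is not shown.

```python
# Sketch of the *statement* behind a "ZK-native" signature: the signer proves
# in zero knowledge that they know sk with hash(sk) = pk. SHA-256 is only a
# stand-in for an arithmetic-friendly hash such as Poseidon.
import hashlib, os

def hash_fn(x: bytes) -> bytes:
    return hashlib.sha256(x).digest()

def keygen():
    sk = os.urandom(32)     # random secret key
    pk = hash_fn(sk)        # public key = hash of the secret key
    return sk, pk

def relation(sk: bytes, pk: bytes) -> bool:
    # This is the relation the zero-knowledge proof attests to; the message
    # being signed is bound in as a public input of the proof.
    return hash_fn(sk) == pk

sk, pk = keygen()
assert relation(sk, pk)     # what the circuit checks, with sk as the witness
```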
Native curves
- If we wish to support delegated proving, it may be best to use a conventional signature scheme such as ECDSA, and include its verification steps in our circuit. This way the user can generate just a signature, and a separate prover can prove that they witnessed a valid signature
- To make this efficient, we can choose an elliptic curve such that the base field of the curve matches the “native” field of our argument system. An example of this is the Baby Jubjub curve, whose base field matches the scalar field of alt_bn128. If we use an argument system such as Groth16 with alt_bn128 as our curve, Baby Jubjub arithmetic can be done “natively”
- However, using a curve like Baby Jubjub isn’t always an option. Sometimes we need compatibility with existing tools which support conventional curves. Polygon Zero, for example, must support Ethereum’s existing signature scheme, which is ECDSA over secp256k1. In these cases, we must simulate arithmetic in another field using bignum techniques
Bignum techniques
- For example, if we use standard multiplication algorithms for Ed25519 field multiplication and range-check each of the 5x5 limb products, we can reduce the cost in the following ways (a sketch follows this list):
- Since 2^255 = 19 mod p_ed25519, you could combine limb products with the same weight mod 255 (in bits). The combined products will have a few more bits but it should be cheaper overall
- If we split each intermediate product into bits and use binary adders, we can minimize the number of splits by adding intermediate products with the same weight “natively” first. For example, if two intermediate products have a weight of 2^51 (the base), we can split the sum into 52 bits, which is just over half the cost of doing two 51-bit splits. When we have an intermediate product with a weight of 2^255 or more, we can change its weight to 1 and multiply it by 19, since 2^255 = 19 mod p_ed25519. That could help to further reduce the number of splits
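A toy Python model of the limb-combination trick for the Ed25519 field (five 51-bit limbs, folding weights of 2^255 and above back down by multiplying by 19). It checks only the arithmetic identity; in a circuit each combined term would additionally be split/range-checked as described above.

```python
# Toy model of the Ed25519 limb-combination trick: 5 limbs of 51 bits, and
# since 2^255 = 19 (mod p), limb products whose weight is >= 2^255 are folded
# back by multiplying by 19.
P_ED25519 = 2**255 - 19
LIMB_BITS, NUM_LIMBS = 51, 5

def to_limbs(x):
    return [(x >> (LIMB_BITS * i)) & (2**LIMB_BITS - 1) for i in range(NUM_LIMBS)]

def mul_mod_p(a, b):
    al, bl = to_limbs(a), to_limbs(b)
    combined = [0] * NUM_LIMBS          # one accumulator per weight mod 255 bits
    for i in range(NUM_LIMBS):
        for j in range(NUM_LIMBS):
            prod = al[i] * bl[j]
            if i + j >= NUM_LIMBS:      # weight 2^(51(i+j)) >= 2^255: fold by 19
                combined[i + j - NUM_LIMBS] += 19 * prod
            else:
                combined[i + j] += prod
    total = sum(c << (LIMB_BITS * k) for k, c in enumerate(combined))
    return total % P_ED25519            # final reduction, outside the circuit model

a = 0x1234567890abcdef1234567890abcdef1234567890abcdef1234567890abcdef % P_ED25519
b = 0xfedcba0987654321fedcba0987654321fedcba0987654321fedcba0987654321 % P_ED25519
assert mul_mod_p(a, b) == (a * b) % P_ED25519
```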
CRT inspired methods
- Alternatively, there’s the Aztec approach of having the prover give a purported product, then checking the identity mod p_native (almost free) and then mod, say, 2^306 (limb products with weights of 306+ bits can be ignored); a sketch follows this list
- We’re going to publish a couple other methods soon, which should be cheaper than the ones above
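A rough Python sketch of the CRT-style check: the prover supplies a purported product r (and quotient q), and the verifier checks the identity a·b = q·p_foreign + r modulo the native field and modulo a power of two rather than over the integers. The moduli below are examples only.

```python
# Rough sketch of a CRT-style product check. If all values are range-checked
# so that |a*b - q*p_foreign - r| < p_native * 2^306, the two congruences
# force the identity to hold over the integers.
P_NATIVE = 2**251 + 17 * 2**192 + 1   # example native field modulus
P_FOREIGN = 2**255 - 19               # the simulated "foreign" field
M = 2**306                            # auxiliary power-of-two modulus

def prover_hint(a, b):
    q, r = divmod(a * b, P_FOREIGN)   # prover computes these out of circuit
    return q, r

def verifier_checks(a, b, q, r):
    lhs = a * b - q * P_FOREIGN - r
    return lhs % P_NATIVE == 0 and lhs % M == 0 and 0 <= r < P_FOREIGN

a, b = 3**100 % P_FOREIGN, 5**90 % P_FOREIGN
q, r = prover_hint(a, b)
assert verifier_checks(a, b, q, r)
```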
Other tricks
- These methods can be generalized to arbitrary low-degree formulas rather than simple multiplications. Checking a whole elliptic curve formula directly can be much more efficient
- Finally, we would expect the cost to come down significantly with a PLONK or STARK implementation that supports lookup-based range checks, instead of range checking via binary decomposition (a toy illustration follows this list)
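A toy illustration of the difference: range checking a 16-bit value by binary decomposition versus by a single lookup into a precomputed table. This only models the operation counts, not a real lookup argument.

```python
# Toy contrast between the two range-check strategies for a 16-bit value:
# binary decomposition costs roughly one step per bit, while a lookup argument
# replaces that with one membership check against a precomputed table.
BITS = 16

def range_check_binary(x):
    bits = [(x >> i) & 1 for i in range(BITS)]          # BITS decomposition steps
    assert sum(b << i for i, b in enumerate(bits)) == x
    return len(bits)                                    # ~16 steps

LOOKUP_TABLE = set(range(2**BITS))                      # committed once, reused

def range_check_lookup(x):
    assert x in LOOKUP_TABLE                            # one lookup per value
    return 1

print(range_check_binary(12345), range_check_lookup(12345))   # 16 vs 1
```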
Edge cases
- We can use incomplete formulae if we’re careful
Conventional optimizations
- In the scalar multiplication code, we would suggest looking into windowing (a toy sketch follows this list)
- If we’re checking an ECDSA or EdDSA signature, we can leverage the fact that one of the multiplications has a fixed base, enabling various preprocessing tricks
- Curve-specific: secp256k1 supports the GLV endomorphism, for example, which allows us to split a curve multiplication into two smaller ones, opening the door to multi-scalar multiplication methods such as Yao’s
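A toy Python sketch of fixed-window scalar multiplication, with integers under addition modulo a prime standing in for curve points; with a real curve library, add() would be point addition, and for a fixed base the window table can be precomputed once.

```python
# Toy sketch of fixed-window ("windowed") scalar multiplication.
GROUP_MOD = 2**127 - 1        # placeholder modulus, not a real curve order
W = 4                         # window width in bits

def add(p, q):                # stand-in for elliptic curve point addition
    return (p + q) % GROUP_MOD

def scalar_mul_windowed(k, base):
    # Precompute table[i] = i * base for i in [0, 2^W). For a fixed base
    # (as in ECDSA/EdDSA verification) this table can be reused across calls.
    table = [0]
    for _ in range(2**W - 1):
        table.append(add(table[-1], base))
    acc = 0
    nwindows = (k.bit_length() + W - 1) // W
    for i in reversed(range(nwindows)):
        for _ in range(W):
            acc = add(acc, acc)                              # W doublings
        acc = add(acc, table[(k >> (i * W)) & (2**W - 1)])   # one add per window
    return acc

k, g = 0xDEADBEEFCAFE, 123456789
assert scalar_mul_windowed(k, g) == (k * g) % GROUP_MOD
```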
Suggestions
- Use “ZK-native” signatures if possible (e.g. Picnic)
- Use “native” curves if possible (e.g. Baby Jubjub)
- For non-native curve arithmetic, e.g. Ed25519, combine limb products with the same weight mod 255 (in bits)
- Minimize the number of splits by adding intermediate products with the same weight first. For example, when we have two intermediate products of weight 2^51, we can split the sum into 52 bits
- If we have an intermediate product with weight of 2^255 or more, we can change its weight to 1 and multiply by 19, since 2^255 = 19 mod p_ed25519
- Use CRT-related methods like Aztec’s approach: have the prover give a purported product and check the identity mod p_native, so that large limb products can be ignored
- Using a PLONK or STARK implementation with lookup-based range checks can cut the cost significantly
- Look into windowing in the scalar multiplication code
- Leverage the fixed base in one of the multiplications when checking ECDSA or EdDSA signatures
- The GLV endomorphism can split a curve multiplication into two smaller ones on curves like secp256k1
Proof Aggregation
Leading Problem
When comparing zk-SNARKs, we are usually primarily interested in three different metrics: prover time, verifier time, and proof length - some may also want to add quantum security and setup trust assumptions to that list.
Within the context of blockchains, there is one further criterion which I would argue to be as important as those three, and we’ll give it the clunky name aggregation-friendliness.
A zk-SNARK with universal setup can be said to be aggregation-friendly if (informally) there exists some m so that it is easy to merge m proofs P_1, ..., P_m into one proof P_{1,...,m} such that P_{1,...,m} is not much bigger than any single P_i, and certainly much smaller than all of them together. Specifically, there should exist a deterministic polynomial-time algorithm 𝔄 (which we call the aggregator) and 𝔙_𝔄 (which we call the aggregation verifier) such that for any P_1, ..., P_m:
- The aggregation time |𝔄(P_1, ..., P_m)| is (quasi-)linear in m
- The proof size |P_{1,...,m}| is sublinear in m and quasilinear in max_i(|P_i|)
- |𝔙_𝔄(P_{1,...,m})| is quasilinear in |𝔙(argmax_i(|P_i|))|, where 𝔙 is the verifier of the given zk-SNARK
So why is aggregation-friendliness so important to us? In blockchains, especially for scalability purposes, SNARKs are used to prove the validity of a bunch of statements at once. Now, if we want to prove a block of n transactions of, say for simplicity, the same size m, with a single zk-SNARK, the prover time will typically be O(nm·log(nm)). Often both proof length and verifier time are also non-constant in n. Bigger n therefore leads to significantly bigger costs, and furthermore, this doesn’t really scale with hardware parallelization or multiple provers.
On the other hand, imagine if the same proof system were aggregation-friendly, and let us for now ignore network limitations. You could have n different provers, each proving one transaction in time O(m·log m), and then aggregating them in time sublinear in n, therefore generating a block proof much more quickly. Hopefully the resulting proof would also not be much bigger than the original one or much more difficult to verify.
To hammer the point home, assume we have two schemes, S_1 := (P_1, V_1) and S_2 := (P_2, V_2). P_1 is more efficient than P_2, with the respective proving times for a statement of length m being t_1 = 5m·log_2(m) and t_2 = 10m·(log_2 m)^2. However, while S_2 is aggregation-friendly, S_1 is not. There exists an aggregator A for S_2 which can aggregate n proofs of a length-m statement in time 4·(log_2 n)^2. In total, the proving time for P_1 will be 5nm·log_2(nm). On the other hand, we can run P_2 in parallel for each of the statements, and then run A to aggregate them into a single proof, for a total time of
10m·(log_2 m)^2 · 4·(log_2 n)^2 = 40m·(log_2 m)^2·(log_2 n)^2
Let’s look at how these total proving times t_1 and t_2 compare for some different values of n and m:
- For n = 2, m = 2, we have t_1 = 40, t_2 = 80
- For n = 8, m = 4, we have t_1 = 800, t_2 = 5760
- For n = 4096, m = 16, we have t_1 = 5242880, t_2 = 1474560 ≈ 0.28·t_1
Here we see that, although P_1 performs significantly better than P_2 + A for small n, it is a very different story for big n, which should make it clear why aggregation-friendliness is important.
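As a sanity check, the numbers above can be reproduced from the total-time formulas used in the text, t_1 = 5nm·log_2(nm) and t_2 = 40m·(log_2 m)^2·(log_2 n)^2:

```python
# Quick check of the comparison above, using the total-time formulas
# t1 = 5*n*m*log2(n*m) and t2 = 40*m*(log2 m)^2*(log2 n)^2 from the text.
from math import log2

def t1(n, m):
    return 5 * n * m * log2(n * m)

def t2(n, m):
    return 40 * m * log2(m)**2 * log2(n)**2

for n, m in [(2, 2), (8, 4), (4096, 16)]:
    print(n, m, t1(n, m), t2(n, m), t2(n, m) / t1(n, m))
# (2, 2)     -> t1 = 40,      t2 = 80
# (8, 4)     -> t1 = 800,     t2 = 5760
# (4096, 16) -> t1 = 5242880, t2 = 1474560 ~= 0.28 * t1
```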
This post will focus on some different approaches to zk-SNARK proof aggregation. At a high level, there are three axes which divide them into categories:
- Recursive/Batching: Recursive proofs are proofs of proofs. We refer to non-recursive aggregation methods as "batching"
- Sequential/Parallel: Can the scheme be executed in parallel?
- Algebraic/Arithmetization-based: This distinction will become clear once we look at some examples.
Popular zk-SNARK Aggregation Constructions
Halo
- Recursive
- Sequential
- Algebraic
Like so much of cryptography, zero-knowledge and otherwise, Halo is based on elliptic curves. Specifically, it uses a polynomial commitment scheme (PCS) which was previously used in another well-known zk-SNARK scheme, namely Bulletproofs: given an elliptic curve E (where we write E(𝔽) for the group defined by the points on E over the field 𝔽 in which we work) and random points P_i ∈ E(𝔽), one can commit to a polynomial p(x) = ∑_i a_i·x^i with the element ∑_i a_i·P_i, where · denotes scalar multiplication in the additive group E(𝔽). This commitment is also often written as ⟨a, P⟩ as if it were an inner product.
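To make the structure of this commitment concrete, here is a toy Python sketch of the Pedersen-style polynomial commitment and the additive homomorphism that Halo relies on later; integers under addition modulo a prime stand in for the elliptic curve group, so this captures the algebra only, not the security.

```python
# Toy illustration of the Pedersen-style polynomial commitment used above:
# commit(p) = sum_i a_i * P_i for random group elements P_i. The group here is
# integers under addition modulo a prime, standing in for an elliptic curve.
import random

Q = 2**89 - 1                     # placeholder group order
gens = [random.randrange(1, Q) for _ in range(8)]   # "random points" P_i

def commit(coeffs):
    return sum(a * P for a, P in zip(coeffs, gens)) % Q

p1 = [3, 1, 4, 1, 5]              # coefficients of one polynomial
p2 = [2, 7, 1, 8, 2]
c1, c2 = commit(p1), commit(p2)

# Additive homomorphism: commit(p1) + commit(p2) = commit(p1 + p2), the
# property used when opening a linear combination of many commitments at once.
p_sum = [a + b for a, b in zip(p1, p2)]
assert (c1 + c2) % Q == commit(p_sum)
```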
The "inner product argument" (IPA) is a protocol between the prover and the verifier where the prover tries to convince the verifier that he knows the polynomial corresponding to his commitment. Roughly speaking, he does this by iteratively chopping his tuples a and P in two equally long tuples and then putting them together so as to get two new tuples of half the length. Eventually, after a logarithmic number of rounds, the prover is left with a single equation, where, almost if and only if he has acted honestly throughout will he be able to satisfy the verifier’s final challenge.
The problem is that, although the number of rounds is nice and small, the verifier still needs to compute two "inner products" of "vectors" of the original length. One may think that this ruins the protocol: the verifier’s work is now linear rather than logarithmic. However, this is where the recursive proof composition (aggregation) comes in:
The aggregation technique largely relies on an ad hoc polynomial g_{u_1,...,u_k}(x), which is defined in terms of the verifier’s challenges u_i to the prover in the IPA. As it turns out, g_{u_1,...,u_k}(x) can be used to compute both of the "inner products"; we will focus on one of them: at the end of the IPA, the prover has a single elliptic curve point P, and by the construction of g_{u_1,...,u_k}(x), the commitment to it defined by the above scheme yields exactly P! The other "inner product" can also be defined as an evaluation of g_{u_1,...,u_k}(x). This does not help us directly, as evaluating the polynomial still takes linear time; however, the prover can now create such polynomials for many statements, commit to them, and then at the end prove that a linear combination of them corresponds to a desired value. This uses a property of the PCS which we have not mentioned thus far, namely that it is additively homomorphic (specifically, it is based on Pedersen commitments).
To recap, the verifier previously had to do a linear-time operation for each statement. By invoking a certain polynomial with some nice properties, the prover can defer the final opening proof until he has computed and committed to arbitrarily many such polynomials. Halo people often talk about the "inner" and the "outer" circuit to reflect the distinction between the point of view of the prover (who finishes most of each proof before computing an aggregated proof for all together) and the verifier (for whom the proof really boils down to the linear-time computation she has to perform).
For specifics, we refer to the original Halo paper. The Halo approach is clearly algebraic in nature: it depends on the nice homomorphic properties of Pedersen commitments, as well as a particular polynomial. Moreover, it is sequential: the prover iteratively adds commitments G_1, G_2, ... to his inner proof, before opening a linear combination of them at the end of the protocol. Finally, it is recursion-based: the prover proves the validity of the previous inner proof as part of the current one.
Halo 2 is used by several protocols, and its main selling point is the aggregation property. Notably, Mina protocol uses Halo to create a constant-sized blockchain, in the sense that the validity of the current state is proven simply by verifying the validity of the proof of the previous state together with the transactions in the last block. Finally, a proof is computed, representing everything that has happened up to that state.
Electric Coin Company, the developer of the scheme, recently incorporated Halo 2 as part of their upgrade to the Orchard shielded pools.
Halo 2 is also used by the zkEVM rollup project Scroll, in order to aggregate multiple blocks into one final proof and thereby amortize the L1 verification costs between them.
Plonky2
- Recursive
- Parallel
- Arithmetization-based