Hardware Acceleration
Leading Problem
- Proof generation is time-consuming, high-latency, and not efficient enough on traditional CPUs
- As L2 rollups and other ZKP applications grow, we need to consider hardware as a potential avenue for breakthrough performance improvements
Business Models
- zk-mining: plug-and-play FPGA accelerator cards (revenue through hardware sales)
- zk-Rollup: public, private, and on-premise cloud (revenue through hourly premiums)
- Software auxiliary: SDKs and APIs for dApps (revenue through licenses to program FPGAs in a developer-friendly way)
3 Potential Scenarios
- Permissioned networks, e.g. StarkNet
- Permissionless networks where provers compete on cost, e.g. Mina’s Snarketplace
- Permissionless networks where provers compete on latency, e.g. Polygon Hermez & Polygon Zero
Selection metrics
- Performance: latency, throughput, and power consumption
- Total cost of ownership (capital expenditures and maintenance costs)
- NRE cost (non-recurring engineering: onboarding difficulty)
- Long-term comparative advantage: competitor performance may double in a year at the same MSRP (e.g. Intel and Nvidia GPUs)
Structure of ZKP Hardware Acceleration
- The current popular structure combines a CPU with an FPGA/GPU to perform proof generation: the CPU tackles single-threaded pieces at a higher clock frequency and deals with non-determinism
- MSM and FFT can be parallelized, but arithmetization and commitments may be single-threaded (different from the “embarrassingly parallel” PoW mining structure); see the sketch below
- Most FPGA design tools are proprietary, so FPGAs have a higher barrier to entry than GPU acceleration
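To make the parallelism point concrete, here is a minimal Python sketch of splitting a multi-scalar multiplication (MSM) into independent chunks, the structure that FPGA/GPU accelerators exploit. It is not tied to any real curve library: integers under addition modulo a prime stand in for curve points, and a process pool stands in for hardware cores.

```python
# Toy sketch (not a real curve library): multi-scalar multiplication (MSM)
# split into independent chunks that could run on separate FPGA/GPU cores.
# Curve points are replaced by integers modulo a prime as a stand-in for the
# group law, purely to illustrate the parallel structure.
from concurrent.futures import ProcessPoolExecutor

P = 2**61 - 1  # placeholder modulus standing in for the group

def partial_msm(chunk):
    """Accumulate one chunk of scalar*point terms (toy group = integers mod P)."""
    acc = 0
    for scalar, point in chunk:
        acc = (acc + scalar * point) % P
    return acc

def msm(scalars, points, workers=4):
    pairs = list(zip(scalars, points))
    size = (len(pairs) + workers - 1) // workers
    chunks = [pairs[i:i + size] for i in range(0, len(pairs), size)]
    with ProcessPoolExecutor(max_workers=workers) as pool:
        partials = pool.map(partial_msm, chunks)   # independent, parallel work
    return sum(partials) % P                       # final reduction is a cheap serial step

if __name__ == "__main__":
    import random
    scalars = [random.randrange(P) for _ in range(1 << 12)]
    points = [random.randrange(P) for _ in range(1 << 12)]
    print(msm(scalars, points))
```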
Light Client
Leading Problem
- Running a full node is resource-draining: downloading and verifying the whole chain of blocks takes time and resources
- Not retail-friendly: a full node cannot run on mobile devices and perform quick actions like sending transactions and verifying balances
Technology Highlight
- Trust-minimized interaction with full nodes; light clients do not interact directly with the blockchain
- Much lighter on resources and storage but require higher network bandwidth
- Verification takes time linear in chain length: verify from the root to the target height (the starting point may be a trusted height, making the length effectively constant)
- Process overview (a minimal sketch follows this list):
- retrieve a chain of headers instead of full blocks
- verify the signatures of the intermediary signed headers and resolve disputes by iteratively checking with full nodes
- Superlight client: requires downloading only a logarithmic number of block headers while storing only a single block header between executions
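As a rough illustration of the process above, here is a minimal Python sketch of light-client header-chain verification, assuming a hypothetical header format with height, parent hash, and signature fields; sig_valid() is a placeholder for whatever signature/validator-set check the real protocol uses.

```python
# Minimal sketch of light-client header verification. The header format and
# sig_valid() are illustrative assumptions, not any particular chain's spec.
import hashlib
from dataclasses import dataclass

@dataclass
class Header:
    height: int
    parent_hash: bytes
    payload: bytes      # stand-in for the remaining header fields
    signature: bytes    # stand-in for the validators' signature(s)

def header_hash(h: Header) -> bytes:
    return hashlib.sha256(h.parent_hash + h.height.to_bytes(8, "big") + h.payload).digest()

def sig_valid(h: Header) -> bool:
    # Placeholder: a real client verifies the signature(s) against the
    # validator set committed to by earlier headers.
    return len(h.signature) > 0

def verify_header_chain(trusted: Header, headers: list[Header]) -> bool:
    """Walk from a trusted header to the target height, checking each link."""
    prev = trusted
    for h in headers:
        if h.height != prev.height + 1:
            return False
        if h.parent_hash != header_hash(prev):
            return False
        if not sig_valid(h):
            return False
        prev = h
    return True
```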
Commit-and-prove Schemes
Arithmetic Fields
Leading Problem
- Proof systems encode computation as a set of polynomial equations defined over a field. Operating on a different field will lead to gigantic circuits.
- For pairing based SNARKs, there is a limited selection of pairing-friendly curves such as BN254 and BLS12-381.
- FFTs require the field to have high 2-adicity (a large power-of-two factor in the multiplicative group order) in order to make the polynomial computations practical.
- Proof recursion requires encoding arithmetic statements over one field as equations over a different field.
- Bridge operators and rollups need interoperability with non-pairing-friendly and low-2-adicity curves, such as secp256k1 used by Bitcoin's and Ethereum's ECDSA (Ethereum provides precompiled contracts for BN254).
- Choosing the field is a tradeoff between speed and security.
Solutions
- Brute force: decompose elements into limbs and operate on the limbs; reduce and recompose the elements only when necessary (a sketch follows this list)
- 2-chains and cycles: use matching base fields to implement one curve inside the other (assumes pairing-based schemes)
- Non-pairing-based schemes can use non-pairing-friendly or hybrid cycles at a cost, such as using a PCS (polynomial commitment scheme) or the Pasta curves with linear time.
- Pasta Curves
- 2-chains of elliptic curves
- Lower the requirements on the field: a polynomial commitment scheme (PCS) need not involve elliptic curves at all and can be instantiated over arbitrary fields
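A toy Python sketch of the brute-force approach, decomposing elements of a non-native field into limbs and reducing only at the end; the parameters (four 64-bit limbs, secp256k1's modulus) are illustrative and not tied to any particular proof system.

```python
# Toy sketch of the "brute force" approach: represent a non-native field
# element as limbs, multiply limb-wise, and delay the modular reduction.
LIMB_BITS = 64
NUM_LIMBS = 4
P_FOREIGN = 2**256 - 2**32 - 977  # secp256k1 base field modulus (example)

def to_limbs(x):
    return [(x >> (LIMB_BITS * i)) & (2**LIMB_BITS - 1) for i in range(NUM_LIMBS)]

def from_limbs(limbs):
    return sum(l << (LIMB_BITS * i) for i, l in enumerate(limbs))

def mul_limbwise(a_limbs, b_limbs):
    """Schoolbook product: limb products are accumulated by weight, and the
    expensive reduction mod P_FOREIGN happens once, on the recomposed value."""
    acc = [0] * (2 * NUM_LIMBS)
    for i, a in enumerate(a_limbs):
        for j, b in enumerate(b_limbs):
            acc[i + j] += a * b            # in-circuit: a native mul + add
    return from_limbs(acc) % P_FOREIGN     # reduce/recompose only when necessary

a, b = 12345678901234567890, 98765432109876543210
assert mul_limbwise(to_limbs(a), to_limbs(b)) == (a * b) % P_FOREIGN
```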
Reference Reading
Efficient Signatures
Leading Problems
- Signature verification is a problem that arises in many ZK related projects, from Zcash to zk-rollups
- Different types of signatures depend on different field or curve arithmetic and need different speed-up methods
- Here we will discuss some different options for signature schemes, and strategies for efficient implementations
Specialized ZK signatures
- Given a ZKP system, a simple signature scheme can be constructed as follows: the signer generates a random secret key sk and hashes it to get the associated public key pk. They then prove, in zero knowledge, that they know some sk such that hash(sk) = pk (see the sketch after this list). Picnic is a variant of this idea, using a block cipher rather than a hash function
- Since the statement being proved involves just a single evaluation of a hash or block cipher, these schemes can be highly efficient, particularly if using arithmetic-friendly primitives such as LowMC or Poseidon
- The caveat here is that the proof must be generated by the user who owns the secret key. If the signature is part of a larger circuit, such as Zcash’s spend circuit, this may be undesirable, as a user cannot outsource proving to another entity without revealing their secret key to that entity
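A minimal Python sketch of the statement behind such a scheme, with SHA-256 standing in for an arithmetic-friendly primitive like Poseidon (or Picnic's block cipher); the proof system that would actually prove knowledge of sk is not shown.

```python
# Sketch of the *statement* behind a "ZK-native" signature: the signer proves
# in zero knowledge that they know sk with hash(sk) = pk. SHA-256 is only a
# stand-in for an arithmetic-friendly hash such as Poseidon.
import hashlib, os

def hash_fn(x: bytes) -> bytes:
    return hashlib.sha256(x).digest()

def keygen():
    sk = os.urandom(32)     # random secret key
    pk = hash_fn(sk)        # public key = hash of the secret key
    return sk, pk

def relation(sk: bytes, pk: bytes) -> bool:
    # This is the relation the zero-knowledge proof attests to; the message
    # being signed is bound in as a public input of the proof.
    return hash_fn(sk) == pk

sk, pk = keygen()
assert relation(sk, pk)     # what the circuit checks, with sk as the witness
```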
Native curves
- If we wish to support delegated proving, it may be best to use a conventional signature scheme such as ECDSA, and include its verification steps in our circuit. This way the user can generate just a signature, and a separate prover can prove that they witnessed a valid signature
- To make this efficient, we can choose an elliptic curve such that the base field of the curve matches the “native” field of our argument system. An example of this is the Baby Jubjub curve, whose base field matches the scalar field of alt_bn128. If we use an argument system such as Groth16 with alt_bn128 as our curve, Baby Jubjub arithmetic can be done “natively”
- However, using a curve like Baby Jubjub isn’t always an option. Sometimes we need compatibility with existing tools which support conventional curves. Polygon Zero, for example, must support Ethereum’s existing signature scheme, which is ECDSA over secp256k1. In these cases, we must simulate arithmetic in another field using bignum techniques
Bignum techniques
- For example, if we use standard multiplication algorithms for Ed25519 field multiplication and range-check each of the 5x5 limb products, we can reduce the cost in the following ways (a sketch follows this list):
- Since 2^255 = 19 mod p_ed25519, you could combine limb products with the same weight mod 255 (in bits). The combined products will have a few more bits but it should be cheaper overall
- If we split each intermediate product into bits and use binary adders, we can minimize the number of splits by adding intermediate products with the same weight “natively” first. For example, if two intermediate products have a weight of 2^51 (the base), we can split the sum into 52 bits, which is just over half the cost of doing two 51-bit splits. When we have an intermediate product with a weight of 2^255 or more, we can change its weight to 1 and multiply it by 19, since 2^255 = 19 mod p_ed25519. That could help to further reduce the number of splits
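A toy Python model of the limb-combination trick for the Ed25519 field (five 51-bit limbs, folding weights of 2^255 and above back down by multiplying by 19). It checks only the arithmetic identity; in a circuit each combined term would additionally be split/range-checked as described above.

```python
# Toy model of the Ed25519 limb-combination trick: 5 limbs of 51 bits, and
# since 2^255 = 19 (mod p), limb products whose weight is >= 2^255 are folded
# back by multiplying by 19.
P_ED25519 = 2**255 - 19
LIMB_BITS, NUM_LIMBS = 51, 5

def to_limbs(x):
    return [(x >> (LIMB_BITS * i)) & (2**LIMB_BITS - 1) for i in range(NUM_LIMBS)]

def mul_mod_p(a, b):
    al, bl = to_limbs(a), to_limbs(b)
    combined = [0] * NUM_LIMBS          # one accumulator per weight mod 255 bits
    for i in range(NUM_LIMBS):
        for j in range(NUM_LIMBS):
            prod = al[i] * bl[j]
            if i + j >= NUM_LIMBS:      # weight 2^(51(i+j)) >= 2^255: fold by 19
                combined[i + j - NUM_LIMBS] += 19 * prod
            else:
                combined[i + j] += prod
    total = sum(c << (LIMB_BITS * k) for k, c in enumerate(combined))
    return total % P_ED25519            # final reduction, outside the circuit model

a = 0x1234567890abcdef1234567890abcdef1234567890abcdef1234567890abcdef % P_ED25519
b = 0xfedcba0987654321fedcba0987654321fedcba0987654321fedcba0987654321 % P_ED25519
assert mul_mod_p(a, b) == (a * b) % P_ED25519
```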
CRT inspired methods
- Alternatively, there’s the Aztec approach of having the prover give a purported product, then checking the identity mod p_native (almost free) and then mod, say, 2^306 (limb products with weights of 306+ bits can be ignored); a sketch follows this list
- We’re going to publish a couple other methods soon, which should be cheaper than the ones above
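A rough Python sketch of the CRT-style check: the prover supplies a purported product r (and quotient q), and the verifier checks the identity a·b = q·p_foreign + r modulo the native field and modulo a power of two rather than over the integers. The moduli below are examples only.

```python
# Rough sketch of a CRT-style product check. If all values are range-checked
# so that |a*b - q*p_foreign - r| < p_native * 2^306, the two congruences
# force the identity to hold over the integers.
P_NATIVE = 2**251 + 17 * 2**192 + 1   # example native field modulus
P_FOREIGN = 2**255 - 19               # the simulated "foreign" field
M = 2**306                            # auxiliary power-of-two modulus

def prover_hint(a, b):
    q, r = divmod(a * b, P_FOREIGN)   # prover computes these out of circuit
    return q, r

def verifier_checks(a, b, q, r):
    lhs = a * b - q * P_FOREIGN - r
    return lhs % P_NATIVE == 0 and lhs % M == 0 and 0 <= r < P_FOREIGN

a, b = 3**100 % P_FOREIGN, 5**90 % P_FOREIGN
q, r = prover_hint(a, b)
assert verifier_checks(a, b, q, r)
```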
Other tricks
- These methods can be generalized to arbitrary low-degree formulas rather than simple multiplications. Checking a whole elliptic curve formula directly can be much more efficient
- Finally, we would expect the cost to come down significantly with a PLONK or STARK implementation that supports lookup-based range checks, instead of range checking via binary decomposition (a toy illustration follows this list)
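A toy illustration of the difference: range checking a 16-bit value by binary decomposition versus by a single lookup into a precomputed table. This only models the operation counts, not a real lookup argument.

```python
# Toy contrast between the two range-check strategies for a 16-bit value:
# binary decomposition costs roughly one step per bit, while a lookup argument
# replaces that with one membership check against a precomputed table.
BITS = 16

def range_check_binary(x):
    bits = [(x >> i) & 1 for i in range(BITS)]          # BITS decomposition steps
    assert sum(b << i for i, b in enumerate(bits)) == x
    return len(bits)                                    # ~16 steps

LOOKUP_TABLE = set(range(2**BITS))                      # committed once, reused

def range_check_lookup(x):
    assert x in LOOKUP_TABLE                            # one lookup per value
    return 1

print(range_check_binary(12345), range_check_lookup(12345))   # 16 vs 1
```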
Edge cases
- We can use incomplete formulae if we’re careful
Conventional optimizations
- In the scalar multiplication code, we would suggest looking into windowing (a toy sketch follows this list)
- If we’re checking an ECDSA or EdDSA signature, we can leverage the fact that one of the multiplications has a fixed base, enabling various preprocessing tricks
- Curve-specific: secp256k1 supports the GLV endomorphism, for example, which allows us to split a curve multiplication into two smaller ones, opening the door to multi-scalar multiplication methods such as Yao’s
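A toy Python sketch of fixed-window scalar multiplication, with integers under addition modulo a prime standing in for curve points; with a real curve library, add() would be point addition, and for a fixed base the window table can be precomputed once.

```python
# Toy sketch of fixed-window ("windowed") scalar multiplication.
GROUP_MOD = 2**127 - 1        # placeholder modulus, not a real curve order
W = 4                         # window width in bits

def add(p, q):                # stand-in for elliptic curve point addition
    return (p + q) % GROUP_MOD

def scalar_mul_windowed(k, base):
    # Precompute table[i] = i * base for i in [0, 2^W). For a fixed base
    # (as in ECDSA/EdDSA verification) this table can be reused across calls.
    table = [0]
    for _ in range(2**W - 1):
        table.append(add(table[-1], base))
    acc = 0
    nwindows = (k.bit_length() + W - 1) // W
    for i in reversed(range(nwindows)):
        for _ in range(W):
            acc = add(acc, acc)                              # W doublings
        acc = add(acc, table[(k >> (i * W)) & (2**W - 1)])   # one add per window
    return acc

k, g = 0xDEADBEEFCAFE, 123456789
assert scalar_mul_windowed(k, g) == (k * g) % GROUP_MOD
```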
Suggestions
- Use “ZK-native” signatures if possible (e.g. Picnic)
- Use “native” curves if possible (e.g. Baby Jubjub)
- For non-native curve arithmetic, e.g. Ed25519, combine limb products with the same weight mod 255 (in bits)
- Minimize the number of splits by adding intermediate products with the same weight first. For example, when we have two intermediate products of weight 2^51, we can split the sum into 52 bits
- If we have an intermediate product with weight of 2^255 or more, we can change its weight to 1 and multiply by 19, since 2^255 = 19 mod p_ed25519
- Use CRT-related methods like Aztec’s approach: have the prover give a purported product and check the identity mod p_native, so that large limb products can be ignored
- Using a PLONK or STARK implementation with lookup-based range checks can cut the cost significantly
- Look into windowing in the scalar multiplication code
- Leverage the fixed base in one of the multiplications when checking ECDSA or EdDSA signatures
- The GLV endomorphism can split a curve multiplication into two smaller ones on curves like secp256k1
Proof Aggregation
Leading Problem
When comparing zk-SNARKs, we are usually primarily interested in three different metrics: prover time, verifier time, and proof length - some may also want to add quantum security and setup trust assumptions to that list.
Within the context of blockchains, there is one further criterion which I would argue to be as important as those three, and we’ll give it the clunky name aggregation-friendliness.
A zk-SNARK with universal setup can be said to be aggregation-friendly if (informally) there exists some m so that it is easy to merge m proofs P_1, ..., P_m into one proof P_{1,...,m} such that P_{1,...,m} is not much bigger than any single P_i, and certainly much smaller than all of them together. Specifically, there should exist a deterministic polynomial-time algorithm 𝔄 (which we call the aggregator) and 𝔙_𝔄 (which we call the aggregation verifier) such that for any P_1, ..., P_m:
- The aggregation time |𝔄(P_1, ..., P_m)| is (quasi-)linear in m
- The proof size |P_{1,...,m}| is sublinear in m and quasilinear in max_i(|P_i|)
- |𝔙_𝔄(P_{1,...,m})| is quasilinear in |𝔙(argmax_i(|P_i|))|, where 𝔙 is the verifier of the given zk-SNARK
So why is aggregation-friendliness so important to us? In blockchains, especially for scalability purposes, SNARKs are used to prove the validity of a bunch of statements at once. Now, if we want to prove a block of n transactions of, say for simplicity, the same size m, with a single zk-SNARK, the prover time will typically be O(nm·log(nm)). Often both proof length and verifier time are also non-constant in n. Bigger n therefore leads to significantly bigger costs, and furthermore, this doesn’t really scale with hardware parallelization or multiple provers.
On the other hand, imagine if the same proof system were aggregation-friendly, and let us for now ignore network limitations. You could have n different provers, each proving one transaction in time O(m·log m), and then aggregating them in time sublinear in n, therefore generating a block proof much more quickly. Hopefully the resulting proof would also not be much bigger than the original one or much more difficult to verify.
To hammer the point home, assume we have two schemes, S_1 := (P_1, V_1) and S_2 := (P_2, V_2). P_1 is more efficient than P_2, with the respective proving times for a statement of length m being t_1 = 5m·log_2(m) and t_2 = 10m·(log_2 m)^2. However, while S_2 is aggregation-friendly, S_1 is not. There exists an aggregator A for S_2 which can aggregate n proofs of a length-m statement in time 4·(log_2 n)^2. In total, the proving time for P_1 will be 5nm·log_2(nm). On the other hand, we can run P_2 in parallel for each of the statements, and then run A to aggregate them into a single proof, for a total time of
10m·(log_2 m)^2 · 4·(log_2 n)^2 = 40m·(log_2 m)^2·(log_2 n)^2
Let’s look at how these total proving times t_1 and t_2 compare for some different values of n and m:
- For n = 2, m = 2, we have t_1 = 40, t_2 = 80
- For n = 8, m = 4, we have t_1 = 800, t_2 = 5760
- For n = 4096, m = 16, we have t_1 = 5242880, t_2 = 1474560 ≈ 0.28·t_1
Here we see that, although P_1 performs significantly better than P_2 + A for small n, it is a very different story for big n, which should make it clear why aggregation-friendliness is important.
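As a sanity check, the numbers above can be reproduced from the total-time formulas used in the text, t_1 = 5nm·log_2(nm) and t_2 = 40m·(log_2 m)^2·(log_2 n)^2:

```python
# Quick check of the comparison above, using the total-time formulas
# t1 = 5*n*m*log2(n*m) and t2 = 40*m*(log2 m)^2*(log2 n)^2 from the text.
from math import log2

def t1(n, m):
    return 5 * n * m * log2(n * m)

def t2(n, m):
    return 40 * m * log2(m)**2 * log2(n)**2

for n, m in [(2, 2), (8, 4), (4096, 16)]:
    print(n, m, t1(n, m), t2(n, m), t2(n, m) / t1(n, m))
# (2, 2)     -> t1 = 40,      t2 = 80
# (8, 4)     -> t1 = 800,     t2 = 5760
# (4096, 16) -> t1 = 5242880, t2 = 1474560 ~= 0.28 * t1
```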
This post will focus on some different approaches to zk-SNARK proof aggregation. At a high level, there are three axes which divide them into categories:
- Recursive/Batching: Recursive proofs are proofs of proofs. We refer to non-recursive aggregation methods as "batching"
- Sequential/Parallel: Can the scheme be executed in parallel?
- Algebraic/Arithmetization-based: This distinction will become clear once we look at some examples.
Popular zk-SNARK Aggregation Constructions
Halo
- Recursive
- Sequential
- Algebraic
Like so much of cryptography, zero-knowledge and otherwise, Halo is based on elliptic curves. Specifically, it uses a polynomial commitment scheme (PCS) which was previously used in another well-known zk-SNARK scheme, namely Bulletproofs: given an elliptic curve E (where we write E(𝔽) for the group defined by the points on E over the field 𝔽 in which we work) and random points P_i ∈ E(𝔽), one can commit to a polynomial p(x) = ∑_i a_i·x^i with the element ∑_i a_i·P_i, where · denotes scalar multiplication in the additive group E(𝔽). This commitment is also often written as ⟨a, P⟩ as if it were an inner product.
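To make the structure of this commitment concrete, here is a toy Python sketch of the Pedersen-style polynomial commitment and the additive homomorphism that Halo relies on later; integers under addition modulo a prime stand in for the elliptic curve group, so this captures the algebra only, not the security.

```python
# Toy illustration of the Pedersen-style polynomial commitment used above:
# commit(p) = sum_i a_i * P_i for random group elements P_i. The group here is
# integers under addition modulo a prime, standing in for an elliptic curve.
import random

Q = 2**89 - 1                     # placeholder group order
gens = [random.randrange(1, Q) for _ in range(8)]   # "random points" P_i

def commit(coeffs):
    return sum(a * P for a, P in zip(coeffs, gens)) % Q

p1 = [3, 1, 4, 1, 5]              # coefficients of one polynomial
p2 = [2, 7, 1, 8, 2]
c1, c2 = commit(p1), commit(p2)

# Additive homomorphism: commit(p1) + commit(p2) = commit(p1 + p2), the
# property used when opening a linear combination of many commitments at once.
p_sum = [a + b for a, b in zip(p1, p2)]
assert (c1 + c2) % Q == commit(p_sum)
```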
The "inner product argument" (IPA) is a protocol between the prover and the verifier where the prover tries to convince the verifier that he knows the polynomial corresponding to his commitment. Roughly speaking, he does this by iteratively chopping his tuples a and P in two equally long tuples and then putting them together so as to get two new tuples of half the length. Eventually, after a logarithmic number of rounds, the prover is left with a single equation, where, almost if and only if he has acted honestly throughout will he be able to satisfy the verifier’s final challenge.
The problem is that, although the number of rounds is nice and small, the verifier still needs to compute two "inner products" of "vectors" of the original length. One may think that this ruins the protocol: the verifier’s work is now linear rather than logarithmic. However, this is where the recursive proof composition (aggregation) comes in:
The aggregation technique largely relies on an ad hoc polynomial g_{u_1,...,u_k}(x), which is defined in terms of the verifier’s challenges u_i to the prover in the IPA. As it turns out, g_{u_1,...,u_k}(x) can be used to compute both of the "inner products"; we will focus on one of them: at the end of the IPA, the prover has a single elliptic curve point P, and by the construction of g_{u_1,...,u_k}(x), the commitment to it defined by the above scheme yields exactly P! The other "inner product" can also be defined as an evaluation of g_{u_1,...,u_k}(x). This does not help us directly, as evaluating the polynomial still takes linear time; however, the prover can now create such polynomials for many statements, commit to them, and then at the end prove that a linear combination of them corresponds to a desired value. This uses a property of the PCS which we have not mentioned thus far, namely that it is additively homomorphic (specifically, it is based on Pedersen commitments).
To recap, the verifier previously had to do a linear-time operation for each statement. By invoking a certain polynomial with some nice properties, the prover can defer the final opening proof until he has computed and committed to arbitrarily many such polynomials. Halo people often talk about the "inner" and the "outer" circuit to reflect the distinction between the point of view of the prover (who finishes most of each proof before computing an aggregated proof for all together) and the verifier (for whom the proof really boils down to the linear-time computation she has to perform).
For specifics, we refer to the original Halo paper. The Halo approach is clearly algebraic in nature: it depends on the nice homomorphic properties of Pedersen commitments, as well as a particular polynomial. Moreover, it is sequential: the prover iteratively adds commitments G_1, G_2, ... to his inner proof, before opening a linear combination of them at the end of the protocol. Finally, it is recursion-based: the prover proves the validity of the previous inner proof as part of the current one.
Halo 2 is used by several protocols, and its main selling point is the aggregation property. Notably, Mina protocol uses Halo to create a constant-sized blockchain, in the sense that the validity of the current state is proven simply by verifying the validity of the proof of the previous state together with the transactions in the last block. Finally, a proof is computed, representing everything that has happened up to that state.
Electric Coin Company, the developer of the scheme, recently incorporated Halo 2 as part of their upgrade to the Orchard shielded pools.
Halo 2 is also used by the zkEVM rollup project Scroll, in order to aggregate multiple blocks into one final proof and thereby amortize the L1 verification costs between them.
Plonky2
- Recursive
- Parallel
- Arithmetization-based