Open-sourcing zkml: trustless machine learning for all

We’re excited to announce the open-source release of zkml, our framework for producing zero-knowledge proofs of ML model execution. zkml builds on our earlier paper on scaling zero-knowledge proofs to ImageNet models, but contains many improvements in usability, functionality, and scalability. With these improvements, we can verify the execution of models that achieve 92.4% accuracy on ImageNet, a 13% improvement over our initial work! zkml can also prove an MNIST model with 99% accuracy in four seconds.

In this post, we’ll describe our vision for zkml and how to use it. In future posts, we’ll describe several applications of zkml in detail, including trustless audits, decentralized prompt marketplaces, and privacy-preserving face ID. We’ll also describe the technical challenges and details behind zkml. In the meantime, check out our open-source code!

Why do we need trustless machine learning?

Over the past few years, we’ve seen two inescapable trends: more of our world moving online, and ML/AI methods becoming increasingly powerful. These ML/AI technologies have enabled new forms of art and incredible productivity gains. However, they are increasingly concealed behind closed APIs.

Although these providers want to protect their trade secrets, we want assurances about their models: that the training data doesn’t contain copyrighted material, or that the model isn’t biased. We also want assurances that a specific model was actually executed in high-stakes settings, such as in medicine.

To provide these assurances, the model provider can take two steps: commit to a model trained on a hidden dataset, and provide audits of the hidden dataset after training. In the first step, the model provider releases a proof of training on a given dataset and a commitment to the weights at the end of the process. Importantly, the weights can be kept hidden! This assures any third party that the training happened honestly. The audit can then be done using zero-knowledge proofs over the hidden data.
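As a concrete illustration of the commitment step, here is a minimal sketch in plain Python. The hash-based scheme, the salt, and the flattened weight list are all assumptions for exposition; zkml’s actual commitment scheme is designed to be efficient inside a SNARK, not a plain SHA-256 hash.

import hashlib
import os
import struct

def commit(weights, salt):
    # Bind to the weights without revealing them: C = H(salt || weights).
    buf = salt + b"".join(struct.pack("<d", w) for w in weights)
    return hashlib.sha256(buf).hexdigest()

def verify_opening(commitment, weights, salt):
    # Anyone holding (weights, salt) can check that they match C.
    return commit(weights, salt) == commitment

salt = os.urandom(32)        # fresh randomness keeps the weights hidden
weights = [0.12, -0.7, 3.4]  # stand-in for real model parameters
C = commit(weights, salt)    # published at the end of training
assert verify_opening(C, weights, salt)  # opened only for, e.g., an auditor

The published value C binds the provider to one specific set of weights (they cannot swap models later), while revealing nothing about the weights until the provider chooses to open the commitment.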

We have been imagining a future where ML models can be executed trustlessly. As we’ll describe in future posts, trustless execution of ML models will enable a range of applications:

  1. Trustless audits of ML-powered applications, such as proving that no copyrighted images were used in a training dataset, as we described above.
  2. Verification that specific ML models were run by ML-as-a-service providers for regulated industries.
  3. Decentralized prompt marketplaces for generative AI, where creators can sell access to their prompts.
  4. Privacy-preserving biometric authentication, such as enabling smart contracts to use face ID.

ZK-SNARKs for trustless ML

In order to trustlessly execute ML models, we can turn to powerful tools from cryptography. We focus on ZK-SNARKs (zero-knowledge succinct non-interactive arguments of knowledge), which allow a prover to prove that an arbitrary computation was done correctly using a short proof. ZK-SNARKs also have the amazing property that the inputs and intermediate variables (e.g., activations) can be hidden!

In the context of ML, we can use a ZK-SNARK to prove that a model was executed correctly on a given input, while hiding the model weights, inputs, and outputs. We can further choose to selectively reveal any of the weights, inputs, or outputs depending on the application at hand.
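To make this concrete, the statement such a proof attests to can be written as an ordinary predicate over a public part and a hidden witness. The Python sketch below is our own simplification: commit_to and run_model are hypothetical stand-ins, and zkml really expresses this check as an arithmetic circuit rather than Python code.

import hashlib
import json

def commit_to(weights):
    # Stand-in commitment (see the sketch above); not zkml's real scheme.
    return hashlib.sha256(json.dumps(weights).encode()).hexdigest()

def run_model(weights, x):
    # Stand-in "model": a single dot product instead of a real network.
    return sum(w * xi for w, xi in zip(weights, x))

def relation(public, witness):
    # The verifier sees only `public`; the prover's `witness` stays hidden.
    return (commit_to(witness["weights"]) == public["weight_commitment"]
            and run_model(witness["weights"], witness["input"]) == public["output"])

weights, x = [1.0, -2.0, 0.5], [3.0, 1.0, 4.0]
public = {"weight_commitment": commit_to(weights),
          "output": run_model(weights, x)}  # output revealed; input hidden
witness = {"weights": weights, "input": x}
assert relation(public, witness)

Selective reveal then amounts to moving a value between the witness and the public statement: here the output is public while the input and weights remain hidden, and a ZK-SNARK lets the prover convince a verifier that relation(public, witness) holds without leaking the witness.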

With this powerful primitive, we can enable trustless audits and all the other applications we described above!

zkml: a first step towards trustless ML

As a first step towards trustless ML model execution for all, we’ve open-sourced zkml. As an example, consider proving the execution of an MNIST model by producing a ZK-SNARK. With zkml, we can run the following commands:

# Installs rust, skip if you already have rust installed
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh

git clone https://github.com/ddkang/zkml.git
cd zkml
rustup override set nightly
cargo build --release
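# Create the directory used to store the KZG parameters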
mkdir params_kzg

# This should take ~8s to run the first time and ~4s to run the second time
./target/release/time_circuit examples/mnist/model.msgpack examples/mnist/inp.msgpack kzg

On a regular laptop, proving takes as little as 4 seconds and consumes ~2GB of RAM. We’re also the first framework able to compute ZK-SNARKs at ImageNet scale. As a sneak preview, we can achieve non-trivial accuracy on ImageNet in under 4 minutes of proving, and 92.4% accuracy in under 45 minutes:

https://miro.medium.com/v2/resize:fit:875/0*zB0Vf52k09EEGXSP

We’ve increased accuracy by 13%, decreased proving cost by 6x, and decreased verification times by 500x compared to our initial work!

Our primary focus for zkml is high efficiency. Existing approaches are resource-intensive: they take days to prove small models, require many gigabytes of RAM, or produce large proofs. We’ll describe how zkml works under the hood in future posts.

We believe that efficiency is critical because it enables a future where anyone can execute ML trustlessly, and we’ll continue pushing towards that goal. Currently, models like GPT-4 and Stable Diffusion are out of reach, and we hope to change that soon!

Furthermore, zkml can enable trustless audits and all of the other applications we’ve mentioned! In addition to performance improvements, we’ve been working on new features, including proofs of training and trustless audits. We’ve also been adding support for models beyond vision models.
