Building an ETL process with Zero-Knowledge Proofs — Part 1

Vic Genin
8 min read · Jan 1, 2024


In the evolving landscape of decentralized data platforms, ensuring fair compensation for data providers while maintaining customer privacy is paramount. This article delves into the use of zk-SNARKs in the context of an ETL (Extract, Transform, Load) process, focusing on aggregating data consumption from various nodes. We’ll explore how to build a system that validates data providers’ payments and customer charges without compromising the privacy of the underlying consumption data.

The Challenge

Consider a decentralized platform where customers consume data from multiple nodes. As customers access various objects (like downloading files or streaming videos), these objects are served by different data provider nodes. The system needs to aggregate how much data each customer has consumed and from which nodes to calculate payments and receipts. However, given the computational and cost constraints of blockchains, this aggregation needs to be performed off-chain, typically using a map-reduce approach across multiple ETL nodes.

The issue is that some ETL nodes aggregating this data might be malicious, potentially skewing the data to benefit certain data nodes unfairly. Also, the actual consumption packets might contain sensitive user data, necessitating privacy preservation in the aggregation and validation process.

Enter SNARK: Ensuring Integrity without Compromising Privacy

SNARKs provide a way for ETL nodes to prove that their data aggregation is accurate without revealing the underlying data, which is crucial for protecting user privacy. The essence of a zk-SNARK is that it allows one party (the prover) to convince another (the verifier) that a specific computation was executed correctly, without revealing the private inputs to that computation.

In our scenario, ETL nodes perform the data aggregation and generate a corresponding SNARK proof. This proof, alongside the aggregated data (such as total consumption and payments due), is then posted on the blockchain.

Creating the Aggregation and Proof

Each ETL node performs a map-reduce operation off-chain, aggregating consumption data into a format where each customer Ai owes Xi dollars and each data provider Bj should receive Yj dollars.

Concurrently, it generates a SNARK proof asserting the correctness of this aggregation. This proof essentially says, “We’ve correctly calculated the payments and receipts without tampering with the data, and here’s a proof that you can verify without knowing the underlying data.”
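To make this concrete, here is a minimal sketch of the shape such an aggregation result could take; the struct and field names (and the use of integer cents for amounts) are illustrative assumptions, not the final implementation.

// Illustrative output of one ETL node's reduce step (names and types are assumptions).
struct AggregationResult {
    // Customer Ai owes Xi: (customer public key, amount owed, e.g. in cents)
    charges: Vec<(secp256k1::PublicKey, u64)>,
    // Provider Bj should receive Yj: (provider public key, amount due, e.g. in cents)
    payouts: Vec<(secp256k1::PublicKey, u64)>,
}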

The aggregations and their respective proofs are then uploaded to the blockchain. This step ensures transparency and allows network validators to access the proofs.

Verification by Validators

Validators or any party interested in ensuring fairness can now verify the zk-SNARK proofs. By executing the SNARK verification algorithm, they can ascertain the correctness of the aggregation without accessing the raw consumption data.

They don’t need access to individual consumption packets, thus preserving privacy. If a proof doesn’t check out, it indicates potential tampering or errors in the aggregation process.
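As a rough sketch, verification with Bellman's Groth16 API (the same functions imported later in this article) needs only the verifying key, the proof, and the public inputs, i.e. the aggregated totals encoded as field elements. Exact signatures vary between bellman versions, so treat this as illustrative rather than definitive.

use bellman::groth16::{prepare_verifying_key, verify_proof, Parameters, Proof};
use bellman::pairing::Engine;

// Returns true only if the proof is valid for exactly these public inputs.
fn check_aggregation<E: Engine>(
    params: &Parameters<E>,   // parameters produced by the (one-time) trusted setup
    proof: &Proof<E>,         // the proof posted on-chain by the ETL node
    public_inputs: &[E::Fr],  // the aggregated totals, as field elements
) -> bool {
    let pvk = prepare_verifying_key(&params.vk);
    verify_proof(&pvk, proof, public_inputs).unwrap_or(false)
}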

Example

Consider a scenario with 10 packets, managed by 2 ETL nodes. Let’s say:

  • Node 1 processes 4 packets, resulting in customers A1…A10 owing a total of $10, with Provider B1 receiving $8 and Provider B2 receiving $2.
  • Node 2 processes 6 packets, resulting in customers A11…A20 owing a total of $7, with Provider B2 receiving $5 and Provider B3 receiving $2.

Note that in each case the amount owed by customers matches the amount due to providers ($8 + $2 = $10 and $5 + $2 = $7); this is exactly the kind of invariant the proof can attest to. For each aggregation, a SNARK proof is generated and posted to the blockchain alongside the aggregated data.

The ETL Process with Rust and Bellman

Rust, in conjunction with the Bellman library, offers a general-purpose approach to implementing such proofs. This allows us to leverage Rust’s robustness and Bellman’s proving capabilities to build the ETL process for our decentralized data platform.

1. Input: Event Packet Structure

In our Rust implementation, we define an event packet with the following structure:

  • Object ID: The ID of the object being consumed.
  • Packet ID: The ID of the packet for a particular object being consumed.
  • Timestamp: The time at which the data was consumed.
  • Bytes Consumed: The amount of data (in bytes) consumed in the session.
  • Customer and Data Provider IDs: Both are public keys, which will be used to verify the signatures attached to each packet.
  • Signatures: They are cryptographic proofs that the customer and data provider have approved the particular data packet. These signatures would typically cover the content of the packet, such as the object ID, timestamp, and bytes consumed.

When designing systems that use ZK-SNARKs for verification, choosing a signature scheme that is efficient within the SNARK environment is crucial. ECDSA, while popular and widely used in many cryptographic applications, is not necessarily the most efficient for ZK-SNARKs due to its complex arithmetic. Instead, other signature schemes are often preferred for their simplicity and efficiency when proving and verifying within a SNARK. One such scheme is the EdDSA signature scheme, particularly when used with specific curves like Baby Jubjub. However, in this article, we will be using ECDSA with the secp256k1 curve, as it is widely utilized and you are more likely to encounter it in practice compared to more recent signature schemes.

When working with keys and signatures in Rust, the specific types you’ll use can depend on the cryptographic library you’ve chosen. Popular choices for cryptographic operations in Rust include libraries like ring, rust-crypto, secp256k1, rustls, zokrates, and others. Many of these libraries provide their own types for public keys and signatures, particularly suited to the elliptic curves and cryptographic algorithms they support.
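For illustration, here is a minimal sketch of how a packet digest could be signed and verified off-circuit with the secp256k1 crate (paired with sha2 for hashing). The digest construction is an assumption made for this example, and the exact method names (sign/verify versus sign_ecdsa/verify_ecdsa) and the location of the Signature type differ between crate versions.

use secp256k1::{Message, PublicKey, Secp256k1, SecretKey, Signature};
use sha2::{Digest, Sha256};

// Hash the packet fields that both signatures are supposed to cover.
fn packet_digest(object_id: u64, packet_id: u64, timestamp: u64, bytes_consumed: u64) -> [u8; 32] {
    let mut hasher = Sha256::new();
    hasher.update(object_id.to_be_bytes());
    hasher.update(packet_id.to_be_bytes());
    hasher.update(timestamp.to_be_bytes());
    hasher.update(bytes_consumed.to_be_bytes());
    hasher.finalize().into()
}

// The customer or provider signs the digest with their secret key.
fn sign_packet(secp: &Secp256k1<secp256k1::All>, digest: &[u8; 32], key: &SecretKey) -> Signature {
    let msg = Message::from_slice(digest).expect("digest is 32 bytes");
    secp.sign(&msg, key)
}

// Anyone holding the public key can check that the packet was approved.
fn verify_packet(secp: &Secp256k1<secp256k1::All>, digest: &[u8; 32], sig: &Signature, pk: &PublicKey) -> bool {
    let msg = Message::from_slice(digest).expect("digest is 32 bytes");
    secp.verify(&msg, sig, pk).is_ok()
}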

2. Processing: Aggregation Logic

The core logic involves aggregating this data over a specified period. The aggregation will be based on the customer ID and will summarize the total bytes consumed along with the start and end times of the aggregation period.
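Outside the circuit, in plain Rust, that reduce step could look like the sketch below; the record layout (compressed 33-byte customer keys, byte counts, timestamps) is an assumption made for illustration.

use std::collections::HashMap;

// Each record: (customer public key, bytes consumed, timestamp).
// Each aggregate: (total_bytes, start_time, end_time).
fn aggregate(records: &[([u8; 33], u64, u64)]) -> HashMap<[u8; 33], (u64, u64, u64)> {
    let mut out: HashMap<[u8; 33], (u64, u64, u64)> = HashMap::new();
    for &(customer, bytes, ts) in records {
        let entry = out.entry(customer).or_insert((0, ts, ts));
        entry.0 += bytes;           // total bytes consumed by this customer
        entry.1 = entry.1.min(ts);  // earliest timestamp seen: start of the period
        entry.2 = entry.2.max(ts);  // latest timestamp seen: end of the period
    }
    out
}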

3. Output: Aggregated Data and SNARK Proof

After processing, our Rust program with Bellman will output:

  • Aggregated Data: This includes the customer ID, total bytes consumed, and the start and end times of the aggregation period.
  • SNARK Proof: A cryptographic proof generated by Bellman, asserting that the aggregation is correct.

4. Storage: Blockchain Integration

The generated aggregated data and SNARK proof are then stored on the blockchain, ensuring transparency, integrity, and enabling validation without compromising individual consumption data privacy.
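As a preview of the proof-creation step (covered in detail in Part 2), the Groth16 flow in Bellman roughly looks as follows. This is a sketch assuming the bellman_ce flavour of the crate and some circuit type implementing the Circuit trait; in production the setup parameters would come from a trusted ceremony rather than being regenerated per run.

use bellman::groth16::{create_random_proof, generate_random_parameters, Parameters, Proof};
use bellman::pairing::Engine;
use bellman::Circuit;
use rand::thread_rng;

// Produce setup parameters and a proof for one aggregation run.
fn prove_aggregation<E: Engine, C: Circuit<E> + Clone>(circuit: C) -> (Parameters<E>, Proof<E>) {
    let rng = &mut thread_rng();
    // One-time, per-circuit-shape setup (a ceremony in production, not regenerated here).
    let params = generate_random_parameters::<E, _, _>(circuit.clone(), rng).unwrap();
    // Prove that this particular aggregation satisfies the circuit's constraints.
    let proof = create_random_proof(circuit, &params, rng).unwrap();
    (params, proof)
}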

Similarly to our Sudoku example, we start with the Circuit trait and implement its synthesize method.

extern crate bellman;
extern crate secp256k1;

use secp256k1::{PublicKey, Signature};
use bellman::pairing::Engine;
use bellman::{Circuit, ConstraintSystem, SynthesisError};
use bellman::groth16::{create_random_proof, generate_random_parameters, prepare_verifying_key, verify_proof};

// Define the structure of the event packet
struct EventPacket {
    object_id: u64,
    packet_id: u64,
    timestamp: u64,
    bytes_consumed: u64,
    customer_id: PublicKey,
    data_provider_id: PublicKey,
    customer_signature: Signature,
    data_provider_signature: Signature,
}

// Define the circuit for the aggregation
struct AggregationCircuit {
    // Inputs to the circuit
    event_packets: Vec<EventPacket>,
    // Outputs of the circuit
    aggregated_data: Vec<(u32, u64, u64, u64)>, // (customer index, total_bytes, start_time, end_time)
}

impl<E: Engine> Circuit<E> for AggregationCircuit {
    fn synthesize<CS: ConstraintSystem<E>>(self, cs: &mut CS) -> Result<(), SynthesisError> {
        // Logic to aggregate data and generate constraints for the SNARK proof
        // ...

        Ok(())
    }
}

Transition to Cryptographic Types

In previous discussions, we defined the EventPacket structure using typical Rust types such as u64 for identifiers and timestamps, along with specific types for signatures and public keys to mirror real-world application structures. However, when dealing with cryptographic operations, particularly in the context of elliptic curve cryptography and zero-knowledge proofs, a different approach is necessary. These operations require arithmetic in a finite field, usually a prime field, where every operation — be it addition or multiplication — is performed modulo a large prime number to ensure results remain within the finite field’s bounds.

Elliptic curve operations like point addition and multiplication necessitate the use of specially crafted types and functions that comprehend the curves’ unique structures and the underlying field arithmetic. This is where E::Fr comes into play in Bellman and similar libraries. It represents an element of the finite field, acting as the basic data type for all inputs, outputs, and intermediate values within the circuit. Unlike ordinary integers or strings, E::Fr and similar types are specifically designed to facilitate the operations and constraints inherent to cryptographic proofs.

The shift from conventional types like u64 or a PublicKey structure to E::Fr is essential in a ZK-SNARK circuit. Here, all variables and their operations are formulated within the circuit’s constraints, which are designed to work exclusively with field elements. These constraints are attuned to the prime field’s arithmetic, which is critical to ensuring the cryptographic security and efficiency of the system. Consequently, every piece of data, regardless of its original format, must be converted into this field representation to ensure compatibility with the cryptographic operations and constraints of the system. This is fundamental not just for the integrity of the cryptographic process but also for its security and efficiency.
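To illustrate what working in a prime field means in practice, here is a tiny example using the BLS12-381 scalar field Fr as exposed through Bellman's pairing re-export; it uses the older ff-style API (from_str, add_assign), and names differ slightly between library versions.

use bellman::pairing::bls12_381::Fr;
use bellman::pairing::ff::{Field, PrimeField};

fn field_demo() {
    // Unlike u64 arithmetic, these operations are performed modulo the field's prime r.
    let mut a = Fr::from_str("5").unwrap();
    let b = Fr::from_str("7").unwrap();
    a.add_assign(&b);
    assert_eq!(a, Fr::from_str("12").unwrap());
}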

We’ll wrap up Part 1 with the conversion logic between Rust primitives and cryptographic field types:

extern crate bellman;
extern crate secp256k1;

use secp256k1::{PublicKey, Signature};
use bellman::pairing::Engine; // Abstract engine for cryptographic operations
use bellman::pairing::ff::{PrimeField, PrimeFieldRepr}; // Prime field arithmetic and its byte representation

struct EventPacketCircuit<E: Engine> {
    packet_id: E::Fr,
    object_id: E::Fr,
    bytes_consumed: E::Fr,
    timestamp: E::Fr,
    customer_public_key_x: E::Fr, // x component of the pubkey
    customer_public_key_y: E::Fr, // y component of the pubkey
    customer_signature_r: E::Fr,  // r component of the signature
    customer_signature_s: E::Fr,  // s component of the signature
    provider_public_key_x: E::Fr,
    provider_public_key_y: E::Fr,
    provider_signature_r: E::Fr,
    provider_signature_s: E::Fr,
}

// Define the generic trait for conversion
trait ConvertToFr<E: Engine> {
    fn convert_to_fr(&self) -> Vec<E::Fr>;
}

// Implement the trait for u64
impl<E: Engine> ConvertToFr<E> for u64 {
    fn convert_to_fr(&self) -> Vec<E::Fr> {
        // The field's representation type implements From<u64>, so small integers convert directly.
        vec![E::Fr::from_repr((*self).into()).unwrap()]
    }
}

// Implement the trait for PublicKey
impl<E: Engine> ConvertToFr<E> for PublicKey {
    fn convert_to_fr(&self) -> Vec<E::Fr> {
        let serialized = self.serialize_uncompressed(); // 65 bytes: 0x04 || x || y

        // Extract x and y coordinates from the serialized public key (skip the first byte, which is 0x04)
        let x_bytes = &serialized[1..33]; // next 32 bytes are the x coordinate
        let y_bytes = &serialized[33..65]; // next 32 bytes are the y coordinate

        // Convert bytes to E::Fr by reading them as a big-endian representation.
        // Note: secp256k1 coordinates can exceed the circuit field's modulus, so a
        // real-world conversion would reduce or split them instead of unwrapping.
        let mut x_repr = <E::Fr as PrimeField>::Repr::default();
        let mut y_repr = <E::Fr as PrimeField>::Repr::default();
        x_repr.read_be(x_bytes).unwrap();
        y_repr.read_be(y_bytes).unwrap();
        let x_fr = E::Fr::from_repr(x_repr).unwrap();
        let y_fr = E::Fr::from_repr(y_repr).unwrap();

        vec![x_fr, y_fr] // Return a vector containing the x and y components as field elements
    }
}

// Implement the trait for Signature
impl<E: Engine> ConvertToFr<E> for Signature {
    fn convert_to_fr(&self) -> Vec<E::Fr> {
        // Decompose the signature into its r and s values (64 bytes: r || s)
        let compact = self.serialize_compact();
        let r_bytes = &compact[0..32];
        let s_bytes = &compact[32..64];

        // Convert r and s components to E::Fr
        // Note: r and s are big integers, and conversion might involve more steps in a real-world scenario
        let mut r_repr = <E::Fr as PrimeField>::Repr::default();
        let mut s_repr = <E::Fr as PrimeField>::Repr::default();
        r_repr.read_be(r_bytes).unwrap();
        s_repr.read_be(s_bytes).unwrap();
        let r_fr = E::Fr::from_repr(r_repr).unwrap();
        let s_fr = E::Fr::from_repr(s_repr).unwrap();

        vec![r_fr, s_fr] // Return a vector containing the r and s components as field elements
    }
}
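To tie the two pieces together, a hypothetical helper (not part of the article's final code) could populate EventPacketCircuit from a plain EventPacket using the ConvertToFr implementations above:

// Hypothetical glue: convert a plain EventPacket into the field-element form the circuit expects.
fn to_circuit<E: Engine>(p: &EventPacket) -> EventPacketCircuit<E> {
    let cust_pk = ConvertToFr::<E>::convert_to_fr(&p.customer_id);
    let prov_pk = ConvertToFr::<E>::convert_to_fr(&p.data_provider_id);
    let cust_sig = ConvertToFr::<E>::convert_to_fr(&p.customer_signature);
    let prov_sig = ConvertToFr::<E>::convert_to_fr(&p.data_provider_signature);

    EventPacketCircuit {
        packet_id: ConvertToFr::<E>::convert_to_fr(&p.packet_id)[0],
        object_id: ConvertToFr::<E>::convert_to_fr(&p.object_id)[0],
        bytes_consumed: ConvertToFr::<E>::convert_to_fr(&p.bytes_consumed)[0],
        timestamp: ConvertToFr::<E>::convert_to_fr(&p.timestamp)[0],
        customer_public_key_x: cust_pk[0],
        customer_public_key_y: cust_pk[1],
        customer_signature_r: cust_sig[0],
        customer_signature_s: cust_sig[1],
        provider_public_key_x: prov_pk[0],
        provider_public_key_y: prov_pk[1],
        provider_signature_r: prov_sig[0],
        provider_signature_s: prov_sig[1],
    }
}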

Conclusion

By adapting the conceptual workflow from the Sudoku example to ETL process verification, you establish a robust framework for ensuring the integrity and accuracy of data aggregation in a decentralized data platform. Implementing this framework using Rust and Bellman leverages the power of zero-knowledge proofs to validate data processing without revealing the underlying sensitive data. In the next article we’ll dig deeper into the implementation of the synthesize method and proof creation.
