Iceberg is an increasingly popular open-source format for representing and sharing large datasets. Vendia IceBlock enables Iceberg data to be shared reliably across different parties, clouds, and data lakes while preserving security, trust, and transparency in the underlying data. It extends Vendia’s existing solutions for trusted, auditable operational data sharing in multi-party environments to terabyte-scale datasets.
Vendia IceBlock enables real-time distributed data clean rooms, distributed data fabrics, heterogeneous data sharing, and other use cases while improving governance, privacy protection, and auditability of shared data at effectively unlimited scale.
Background
Apache Iceberg, often referred to simply as “Iceberg,” is a popular, open-source format for representing large datasets. Originally devised for analytical and BI use cases, it has also found favor in AI/ML, data sharing and monetization, and a host of other applications.
Iceberg is designed for efficient storage and querying of large datasets, enabling safe and incremental updates by applications. While Iceberg’s storage layer has been ported to multiple platforms, public-cloud object stores like Amazon S3 are the most popular.
Iceberg’s open nature appeals to both developers and companies as an alternative to proprietary, “internal” database and data lake formats. Iceberg’s rise occurred in parallel with a related movement toward making data storage “neutral territory,” usually on a public cloud provider’s object or blob store, such as Amazon’s S3 service.
With data lakes, ETL approaches, and AI/ML tools increasingly using this open “Iceberg-tables-in-a-cloud-blob-store” approach as their preferred storage tier solution, this combination has emerged as a standard approach for companies of all sizes across industries.
Enterprise blockchains are private, permissioned distributed ledgers that securely store immutable and verifiable records, synchronized across all participants. As companies increasingly share vast amounts of data internally and externally, ensuring the right dataset version reaches the intended destination across diverse environments—including multiple clouds, geographies, systems, and legal entities—presents a significant challenge.
Knowing which version of a dataset reached which destination isn’t just of theoretical importance: It’s also the basis for complying with regimes such as GDPR, CCPA/Prop 24, open banking, antitrust provisions, and more. Having an auditable trace in which all parties can definitively prove “who knew what and when” dramatically simplifies the challenges of sharing sensitive and secure data, enabling all parties to use data with confidence in its accuracy and fidelity.
Up until now, the benefits of distributed ledgers have been limited to small amounts of data or a single “file at a time” approach. Companies could either share small amounts of data with trust, control, and traceability or share large amounts of data without trust, using a data exchange format like Iceberg. In other words, they couldn’t have their cake and eat it too.
Vendia’s IceBlock-enabled platform delivers the best of both worlds, providing all the conventional benefits of a distributed ledger at terabyte scale and with Iceberg compatibility.
The Challenges of Sharing Large Datasets
Data sharing features in data lakes, such as Snowflake’s table-sharing capabilities, have revolutionized how data operators and companies approach large-scale data collaboration, simplifying the process of making vast quantities of data trivially accessible to business partners and across a company’s many divisions and functions.
But what happens when the two parties sharing data are on different clouds, data centers, or data lakes? Even when both parties share a common technology stack and set of vendors, remembering which subset of which version of which data went to which party or parties can be a frustrating and complicated exercise. It only grows worse when the data contains PII or PHI, carries sovereignty/locale restrictions, or is subject to other regulatory or legal constraints. Tracking these details at scale and in real time becomes a daunting task.
Vendia IceBlock now extends the easy data-sharing capabilities on Snowflake and similar platforms to a broader spectrum of use cases, enabling seamless data movement across clouds, data lakes, and geographic boundaries. This ensures data lineage, access control, and privacy compliance are maintained throughout its journey, and establishes an immutable and auditable record of all data-sharing activities—i.e., who shared what with whom. Vendia IceBlock brings the security and reliability traditionally associated with smaller data payloads to terabyte-scale datasets, all while preserving ease of use and optimal performance.
Importantly, Vendia IceBlock also:
- Retains the open, standards-based format of Iceberg datasets, including the ability to leverage all existing Iceberg-conformant tools, features, and services.
- Keeps Iceberg files standards-compliant and fully compatible with major cloud object stores, including Amazon S3 and Azure Blob Storage.
- Enables sharing large, terabyte-sized datasets across different lakes, clouds, geographic regions, and parties, as well as between on-prem and cloud environments.
- Adds the security, fidelity, immutability, and irrefutability of blockchains without adding an appreciable amount of cost or latency or reducing the availability or scalability of large dataset sharing solutions.
- Integrates with on-prem systems through JDBC/ODBC connectors or SFTP.
How Vendia IceBlock Works: Hierarchical File Digests
When a file is stored in a cloud object service, such as Amazon S3, how does the company storing the information know that the resulting file is correct? That it hasn’t been truncated or corrupted?
To ensure fidelity, services such as S3 compute a digest or hash, a short string that summarizes the entire file’s content. By comparing digest values, companies can efficiently verify file integrity without comparing the files themselves. For files stored in S3, this hash has traditionally used an algorithm known as “MD5” (exposed as the object’s ETag for single-part uploads that are unencrypted or use S3-managed encryption). The file owner can independently calculate the MD5 and then compare it to the hash generated by S3. A match verifies that the file was successfully delivered and that its integrity remains intact, both initially and over time.
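The comparison described above can be sketched in a few lines of Python. This is only an illustration, not Vendia's or Amazon's implementation: the helper names `md5_hex` and `matches_digest` are hypothetical, and the expected digest is assumed to have been obtained separately (for example, from an object's ETag, which S3 returns wrapped in double quotes and which is a plain MD5 only for some upload types).

```python
import hashlib


def md5_hex(path, chunk_size=1024 * 1024):
    """Return the hex MD5 digest of a file, reading it in fixed-size chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        # Stream the file so that large files never have to fit in memory.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_digest(path, expected_hex):
    """Compare a locally computed MD5 with a digest reported elsewhere.

    Surrounding double quotes and letter case are normalized first,
    since a reported value may arrive quoted (hypothetical input format).
    """
    expected = expected_hex.strip().strip('"').lower()
    return md5_hex(path) == expected
```

A caller would compute `matches_digest("local-copy.parquet", reported_digest)` after a transfer and treat a `False` result as a truncated or corrupted file; both the file name and `reported_digest` are placeholders here.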
Vendia IceBlock extends this capability from one file to a collection of files stored in Iceberg format. In Iceberg, files are already immutable: Instead of rewriting a file to update its content, a new file is created. This process also extends to directories. Instead of changing a directory when a file is updated (really, replaced), a new version of the directory is also created, and so on all the way up the hierarchy. This immutability is great news for placing Iceberg content in a ledger, because it avoids the problem of having to worry about different versions of a given file’s content.
To extend S3’s individual MD5 file hash approach, Vendia IceBlock also computes cumulative hashes at each directory level and then stores the final (root) hash in the data catalog recording Iceberg table information. In technical terms, it creates a Merkle tree of hashes isomorphic to the original Iceberg file collection’s structure. Because an update changes only the hashes along the path from the modified file up to the root, Iceberg’s support for incrementality (i.e., the ability to efficiently make updates to the data) naturally extends to IceBlock as well.
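To make the idea of a hierarchical digest concrete, here is a minimal sketch of a Merkle-style hash over a directory tree. It is a toy model under stated assumptions, not IceBlock's actual algorithm or on-disk format: the tree is a nested Python dict (bytes for a file, dict for a directory), SHA-256 is an arbitrary choice of digest, and the names `tree_digest`, `table_v1`, and `table_v2` are hypothetical.

```python
import hashlib


def tree_digest(node):
    """Return a hex digest for a file (bytes) or a directory (dict of name -> node).

    A directory's digest is computed from its children's names and digests,
    so the root digest summarizes every file beneath it.
    """
    if isinstance(node, bytes):
        # Leaf: hash the file content, tagged so files and directories never collide.
        return hashlib.sha256(b"file:" + node).hexdigest()
    h = hashlib.sha256(b"dir:")
    for name in sorted(node):  # sorted for a deterministic result
        h.update(name.encode("utf-8") + b"\0")
        h.update(tree_digest(node[name]).encode("ascii") + b"\n")
    return h.hexdigest()


# Version 1 of a toy dataset with two subdirectories.
table_v1 = {
    "region=eu": {"part-0": b"rows 0-99"},
    "region=us": {"part-0": b"rows 100-199"},
}

# Version 2 adds a file under "region=us"; existing files are never modified.
table_v2 = {
    "region=eu": table_v1["region=eu"],
    "region=us": {**table_v1["region=us"], "part-1": b"rows 200-299"},
}

root_v1 = tree_digest(table_v1)
root_v2 = tree_digest(table_v2)
print("root changed:", root_v1 != root_v2)
print(
    "untouched subtree unchanged:",
    tree_digest(table_v1["region=eu"]) == tree_digest(table_v2["region=eu"]),
)
```

In this sketch, adding one file changes the digest of its parent directory and of the root, while the untouched subtree keeps its digest. A recipient holding only the recorded root digest can recompute it over a received copy to detect any difference.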
IceBlock focuses on the file layout of the data and is designed to ensure that its original and subsequent versions can be shared with any number of parties, clouds, lakes, and geographies in a read-only manner. Like an MD5 hash, it also makes it possible to verify that data hasn’t drifted or been tampered with over time.
SaaS Data Sharing with Vendia
Creating data fabrics and sharing ecosystems, both within and across business boundaries, is one of the most important IT challenges today. With imperatives like GenAI, ML, and BI at the top of nearly every CEO’s priority list, getting the right data in and out of an organization has never been more urgent. The ease with which this data can be shared, without losing trust, fidelity, or control, is equally crucial.
Businesses are increasingly sharing all types of information (structured, semi-structured, and unstructured) in more ways (e.g., APIs, SQL, SFTP, and data lake sharing) at a time when regulations and privacy concerns are growing by the day. Yet the solution to this complexity isn’t a return to the past problems of data silos and data islands. Rather, it’s to retain and extend the ease of data sharing while bringing all the necessary trust, control, security, and governance capabilities into the process.
Vendia IceBlock extends Vendia’s existing ability to easily share operational and file data to effectively unlimited scale, allowing terabyte-sized datasets to be shared with blockchain-grade protections. Both materialized views and virtual sharing can be accomplished across “data canyons”—i.e., on-prem versus public cloud, different public clouds, multiple geographies, two or more parties or divisions, and more.
Vendia IceBlock extends support for governance, trust, and privacy-preserving data sharing to a new dimension of corporate data, unifying all forms of data sharing in a single, easy-to-use and API-driven platform. Use cases such as data monetization, real-time clean rooms, and TCO optimization between on-prem and public cloud have never been easier, or safer, to achieve.
To learn more about Vendia’s innovative approach to making data sharing both trustworthy and simple, visit our website at www.vendia.com.
Vendia IceBlock is a patent-pending technology of Vendia. Vendia and Vendia Iceberg are registered marks of the company.