TPC-DS at 100TB and 10TB Scale Now Available in Snowflake's Samples
We are happy to announce that a full 100 TB version of TPC-DS data, along with samples of all 99 of the benchmark's queries, is now available to all Snowflake customers for exploration and testing. We also provide a 10 TB version if you are interested in smaller-scale testing.
Figure: The STORE_SALES sub-schema from the TPC-DS Benchmark (Source: TPC Benchmark™ DS Specification)
You can find the tables in:
- Database: SNOWFLAKE_SAMPLE_DATA
- Schema: TPCDS_SF100TCL (100 TB version) or TPCDS_SF10TCL (10 TB version)
(Note that the raw data compresses in Snowflake to less than one-third of its original size.)
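If you want to try the data right away, here is a minimal sketch (it assumes you have a running virtual warehouse to execute queries; the table names come from the TPC-DS schema):

```sql
-- Point the session at the sample database and the 10TB schema
-- (use tpcds_sf100tcl for the 100TB version).
use database snowflake_sample_data;
use schema tpcds_sf10tcl;

-- A small dimension table returns quickly:
select count(*) from date_dim;

-- Fully qualified names also work from any context:
select count(*) from snowflake_sample_data.tpcds_sf10tcl.store_sales;
```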
Sample TPC-DS queries are available as a tutorial under the + menu in the Snowflake Worksheet UI:
Figure: Accessing sample TPC-DS queries in the Snowflake Worksheet UI
What is TPC-DS?
TPC-DS data has been used extensively by database and big data vendors for testing performance, scalability, and SQL compatibility across a range of data warehouse queries, from fast, interactive reports to complex analytics. It reflects a multi-dimensional data model of a retail enterprise selling through three channels (stores, web, and catalogs), with the data sliced across 17 dimensions including Customer, Store, Time, and Item. The bulk of the data is contained in the large fact tables, Store Sales, Catalog Sales, and Web Sales, which represent daily transactions spanning five years.
The 100 TB version of TPC-DS is the largest sample relational database we know of available on any platform for public testing and evaluation. For perspective, the STORE_SALES table alone contains over 280 billion rows, loaded from 42 terabytes of CSV files.
Full details of the TPC-DS schema and queries, including business descriptions of each query, can be found in the TPC Benchmark™ DS Specification. To test examples of different types of queries, consider:
| Type | Characteristics | Queries |
| --- | --- | --- |
| Interactive | 1-3 months of data scanned; simple star-join queries | 19, 42, 52, 55 |
| Reporting | 1 year of data scanned; simple star-join queries | 3, 7, 53, 89 |
| Analytic | Multiple years, customer patterns; customer extracts, star joins | 34, 59 |
| Complex | Fact-to-fact joins, windows, extensive subqueries | 23, 36, 64, 94 |
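To give a feel for the "Interactive" pattern, here is a simple star-join sketch in that spirit. It is an illustration only, not one of the 99 benchmark queries, though the column names follow the TPC-DS schema:

```sql
-- Illustrative star join: roughly two months of STORE_SALES,
-- restricted through DATE_DIM and aggregated by brand.
select i.i_brand,
       sum(ss.ss_ext_sales_price) as brand_revenue
from   snowflake_sample_data.tpcds_sf10tcl.store_sales ss
join   snowflake_sample_data.tpcds_sf10tcl.date_dim d
       on ss.ss_sold_date_sk = d.d_date_sk
join   snowflake_sample_data.tpcds_sf10tcl.item i
       on ss.ss_item_sk = i.i_item_sk
where  d.d_year = 2000
and    d.d_moy between 11 and 12
group  by i.i_brand
order  by brand_revenue desc
limit  10;
```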
- At 10 TB scale, the full set of 99 queries should complete in well under 2 hours on a Snowflake 2X-Large virtual warehouse.
- At 100 TB, we recommend using the largest virtual warehouse size available. For example, on a 3X-Large warehouse, you can expect all 99 queries to complete within 7 hours.
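If you need to set up a warehouse for these runs, a sketch like the following works, assuming your role has the CREATE WAREHOUSE privilege (the name TPCDS_WH is arbitrary):

```sql
-- Create a warehouse sized for the 10TB schema.
create warehouse if not exists tpcds_wh
  warehouse_size = '2X-LARGE'
  auto_suspend   = 300        -- suspend after 5 idle minutes
  auto_resume    = true;

-- Scale up before attempting the 100TB schema.
alter warehouse tpcds_wh set warehouse_size = '3X-LARGE';
```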
Note that if you plan to run identical queries multiple times or concurrently, you should disable result caching in Snowflake when running tests by adding the following to your script:
```sql
alter session set use_cached_result = false;
```
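When testing is finished, you can restore the default behavior; either form below works:

```sql
alter session set use_cached_result = true;
-- or simply remove the session-level override:
alter session unset use_cached_result;
```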
TPC-DS Benchmark Kit and Working with Date Ranges
While we provide samples of the 99 queries containing specific parameter values, the TPC-DS Benchmark Kit includes tools for generating random permutations of parameters for each query — which is what we use in our internal testing.
In all queries, the date ranges are supplied using predicates on the DATE_DIM table, as specified by the TPC-DS benchmark, rather than using date key restrictions directly on the large fact tables (a strategy some vendors have used to unrealistically simplify queries). If you want to create variations on these queries without using the benchmark kit, you can produce versions that scan different ranges by changing the year, month, and day restrictions in the WHERE clauses, as in the sketch below.
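For example, this sketch (an illustration, not one of the benchmark queries) shows the benchmark-style pattern; editing only the two DATE_DIM predicates scans a different slice of the fact table:

```sql
-- Restrict dates through DATE_DIM, not by filtering
-- ss_sold_date_sk directly on the fact table.
select d.d_moy,
       sum(ss.ss_ext_sales_price) as monthly_sales
from   snowflake_sample_data.tpcds_sf10tcl.store_sales ss
join   snowflake_sample_data.tpcds_sf10tcl.date_dim d
       on ss.ss_sold_date_sk = d.d_date_sk
where  d.d_year = 2001            -- change the year...
and    d.d_moy between 1 and 3    -- ...or the months to vary the scan
group  by d.d_moy
order  by d.d_moy;
```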
Conclusion
TPC-DS data (and other sample data sets) is made available to you through Snowflake's unique Data Sharing feature, which allows the contents of any database in Snowflake to be shared with other Snowflake customers without requiring copies of the data.
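If you are curious how the share surfaces in your account, you can inspect it (assuming your role has privileges to view shares):

```sql
show shares;   -- the sample data arrives as an inbound share
show databases like 'SNOWFLAKE_SAMPLE_DATA';
```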
We hope you enjoy working with this demanding and diverse workload, and invite you to compare your Snowflake results with other platforms.
And, be sure to keep an eye on this blog or follow us on Twitter (@snowflakedb) for all the news and happenings here at Snowflake.