Berkeley upc benchmark uts

8/4/2023

We changed the Hive configuration from Hive 0.10 on CDH4 to Hive 0.12 on HDP 2.0.6. As a result, direct comparisons between the current and previous Hive results should not be made: it is difficult to separate changes caused by modifications to Hive from changes in the underlying Hadoop distribution. We have also added Tez as a supported platform.

Introduction

Several analytic frameworks have been announced in the last year. Among them are inexpensive data-warehousing solutions based on traditional Massively Parallel Processor (MPP) architectures (Redshift), systems which impose MPP-like execution engines on top of Hadoop (Impala, HAWQ), and systems which optimize MapReduce to improve performance on analytical workloads (Shark, Stinger/Tez). In order to provide an environment for comparing these systems, we draw workloads and queries from "A Comparison of Approaches to Large-Scale Data Analysis" by Pavlo et al. The software we provide here is an implementation of these workloads that is entirely hosted on EC2 and can be reproduced from your computer. Please note that results obtained with this software are not directly comparable with results in the paper from Pavlo et al., because we use different data sets and have modified one of the queries (see FAQ).

We have used the software to provide quantitative and qualitative comparisons of five systems:

Redshift - a hosted MPP database based on the ParAccel data warehouse.
Hive - a Hadoop-based data warehousing system.
Shark - a Hive-compatible SQL engine which runs on top of the Spark computing framework.
Impala - a Hive-compatible* SQL engine with its own MPP-like execution engine.
Stinger/Tez - Tez is a next generation Hadoop execution engine currently in development (v0.2.0).

This remains a work in progress and will evolve to include additional frameworks and new capabilities.

This benchmark is not intended to provide a comprehensive overview of the tested platforms. We are aware that by choosing default configurations we have excluded many optimizations. The choice of a simple storage format, compressed SequenceFile, omits optimizations included in columnar formats such as ORCFile and Parquet. For now, we've targeted a simple comparison between these systems with the goal that the results are understandable and reproducible.

This benchmark measures response time on a handful of relational queries: scans, aggregations, joins, and UDFs, across different data sizes. Keep in mind that these systems have very different sets of capabilities. MapReduce-like systems (Shark/Hive) target flexible and large-scale computation, supporting complex User Defined Functions (UDFs), tolerating failures, and scaling to thousands of nodes. Traditional MPP databases are strictly SQL compliant and heavily optimized for relational queries. The workload here is simply one set of queries that most of these systems can complete.
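To make the four query classes (scan, aggregation, join, UDF) and the idea of measuring response time across data sizes more concrete, here is a minimal sketch. It is not the benchmark's own harness or queries. It runs against an in-memory SQLite database instead of any of the tested systems. The table names, column names, the `url_len` function, the helper names, and the choice to report the median of a few trials are all assumptions made for illustration.

```python
import sqlite3
import statistics
import time


def build_db(n_rows):
    """Create an in-memory database with two small, hypothetical tables."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE rankings (page_url TEXT, page_rank INTEGER)")
    conn.execute("CREATE TABLE visits (dest_url TEXT, ad_revenue REAL)")
    conn.executemany(
        "INSERT INTO rankings VALUES (?, ?)",
        [(f"url{i}", i % 100) for i in range(n_rows)],
    )
    # Each URL is visited exactly twice, with a revenue of 1.0 per visit.
    conn.executemany(
        "INSERT INTO visits VALUES (?, ?)",
        [(f"url{i % n_rows}", 1.0) for i in range(2 * n_rows)],
    )
    # A trivial user-defined function standing in for the UDF query class.
    conn.create_function("url_len", 1, lambda s: len(s))
    return conn


# One illustrative query per class; these are NOT the benchmark's queries.
QUERIES = {
    "scan": "SELECT page_url FROM rankings WHERE page_rank > 90",
    "aggregation": "SELECT dest_url, SUM(ad_revenue) FROM visits GROUP BY dest_url",
    "join": (
        "SELECT r.page_url, SUM(v.ad_revenue) FROM rankings r "
        "JOIN visits v ON r.page_url = v.dest_url GROUP BY r.page_url"
    ),
    "udf": "SELECT SUM(url_len(page_url)) FROM rankings",
}


def median_response_time(conn, sql, trials=3):
    """Run the query several times; return the median wall-clock seconds."""
    samples = []
    for _ in range(trials):
        start = time.perf_counter()
        conn.execute(sql).fetchall()  # fetch all rows so the query fully runs
        samples.append(time.perf_counter() - start)
    return statistics.median(samples)


def run_benchmark(sizes=(1_000, 10_000)):
    """Return {(query_name, n_rows): median_seconds} for each data size."""
    results = {}
    for n_rows in sizes:
        conn = build_db(n_rows)
        for name, sql in QUERIES.items():
            results[(name, n_rows)] = median_response_time(conn, sql)
        conn.close()
    return results


if __name__ == "__main__":
    for (name, n_rows), seconds in sorted(run_benchmark().items()):
        print(f"{name:12s} rows={n_rows:6d} median={seconds:.6f}s")
```

Timings from a sketch like this say nothing about the systems compared here. It only shows the shape of the measurement: a fixed set of queries, several data sizes, and one response-time figure per pair.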
