Manifold improvements in Presto


With SQL gaining traction in the Big Data landscape, many vendors have released SQL engines to the market. The success of these engines depends largely on their performance, which is why Facebook recently released numbers showing how much faster its Presto SQL-on-Hadoop engine has become. Presto is employed by Netflix, Dropbox, Airbnb, and AWS.
Created by Facebook for its Big Data warehouse, Presto is an open-source SQL-on-Hadoop engine designed to provide low-latency analysis of data sets of all sizes. Built with the objective of being faster than the native query framework, it is readily compared to Cloudera's Impala and Pivotal's HAWQ. In a recent announcement, Dain Sundstrom revealed the elements that play a crucial role in improving Presto's performance, along with figures showing how substantial these improvements are.
The areas emphasized are ORC (the Optimized Row Columnar format), the ORC reader, and three reader techniques: columnar reads, lazy reads, and predicate pushdown. Together these make the file format efficient enough to provide a fast way to store data on the Hadoop file system. He said the decision to create a new ORC reader for Presto was not easy: the team had worked with the vectorized reader in Hive 13 and seen great performance, but the decision was forced by the absence of mature code, the lack of support for structs, lists, and maps, and, generally, the lack of readers supporting lazy reads.
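To make the three techniques concrete, here is a toy sketch of a columnar scan, not Presto's actual reader code: the filter is evaluated against a single decoded column (predicate pushdown), and the remaining selected columns are decoded only for rows that pass (lazy reads). All names and numbers below are illustrative.

```python
class Column:
    """One column of a columnar table; decodes values on demand."""
    def __init__(self, values):
        self._encoded = [str(v) for v in values]  # pretend "encoding"
        self.decodes = 0                          # count decode work done

    def read(self, row):
        self.decodes += 1
        return self._encoded[row]

def scan(table, select, predicate_col, predicate):
    """Columnar scan with predicate pushdown and lazy reads."""
    n = len(table[predicate_col]._encoded)
    out = []
    for row in range(n):
        # Predicate pushdown: evaluate the filter on one column only.
        if predicate(table[predicate_col].read(row)):
            # Lazy read: decode the other columns only for matching rows.
            out.append({c: table[c].read(row) for c in select})
    return out

table = {
    "id":    Column(range(1000)),
    "name":  Column(f"user{i}" for i in range(1000)),
    "score": Column(i % 100 for i in range(1000)),
}

rows = scan(table, select=["id", "name"],
            predicate_col="score", predicate=lambda s: int(s) > 97)
print(len(rows))               # 20 rows match (scores 98 and 99)
print(table["score"].decodes)  # 1000: the filter column is fully decoded
print(table["name"].decodes)   # 20: lazy columns decoded only for hits
```

The decode counters show why a selective predicate is cheap here: only 20 of 1000 rows ever touch the non-filter columns, whereas a row-oriented reader would have decoded every cell of every row.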
To serve the same purpose, Facebook uses a fork of ORC called DWRF. It too fails to support predicate pushdown and columnar reads, though it does support lazy reads. To achieve the desired results a new reader was required, and the one they now have appears to work seamlessly with Presto's SQL-on-Hadoop engine. The recent report shows the new reader delivering a great deal of improvement over its predecessor and over the binary reader: a 2-4x gain in wall-clock time and CPU time over its Hive-based predecessor, with almost identical figures against the RCFile reader.
Will this update actually give you the required speed? As anyone thinking logically will say, it depends heavily on the query you execute. The results in these tests were heavily influenced by the fact that the queries were crafted to stress the reader as much as possible. A simple query such as SELECT * FROM table will see little to no performance upgrade; inner joins, left joins, and other such complex queries, on the other hand, will deliver their results almost instantaneously.
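The query-shape dependence can be sketched with a hypothetical cost model (the counts below are illustrative assumptions, not Facebook's benchmark numbers): with predicate pushdown, a scan fully decodes one filter column and decodes the remaining selected columns only for passing rows, so an unfiltered SELECT * gains nothing while a selective query skips most of the decode work.

```python
# Hypothetical table: 1,000 rows, 10 columns.
ROWS, COLS = 1_000, 10

def decode_cost(selected_cols, selectivity):
    """Cells a pushdown-capable columnar reader must decode:
    the filter column in full, plus the other selected columns
    for only the rows that pass the predicate."""
    passing = int(ROWS * selectivity)
    return ROWS + passing * (selected_cols - 1)

# SELECT * FROM t -- no predicate, so every cell is decoded anyway.
full_scan = ROWS * COLS

# SELECT a, b, score FROM t WHERE score > 97 -- roughly 2% selectivity.
selective = decode_cost(selected_cols=3, selectivity=0.02)

print(full_scan)  # 10000 cells decoded
print(selective)  # 1040 cells decoded: 1000 filter + 20 * 2 lazy
```

Under this toy model the selective query does about a tenth of the decode work of the full scan, which is why the crafted benchmark queries show dramatic gains while SELECT * barely moves.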
What the new reader does for the social network's SQL-on-Hadoop engine is clear CPU headroom for query execution, which translates into faster results or enhanced concurrency in your data warehouse.
