BeeScala 2016: Jacek Laskowski - Speak Spark SQL for better performance

This talk was recorded at BeeScala 2016 in Ljubljana, Slovenia. Follow along on Twitter @BeeScalaConf and visit the website, http://bee-scala.org, for more information.

Abstract: Spark SQL is now the de facto driving force behind Apache Spark 2.0’s success. It comes with enough cool features to keep you busy for a few days and makes Spark MLlib even more pleasant to use. In Spark 2.0, Spark SQL comes with Datasets, encoders, and logical and physical plans. They are the front ends to the lower-level components, the Catalyst optimizer and Tungsten, which are meant to make your queries faster. During this presentation you will find out how your structured queries end up as Datasets, what the difference is between Datasets, DataFrames and RDDs, and finally how Spark SQL’s Catalyst optimizer can make your queries faster when they are properly structured.