Join Dan Sullivan for an in-depth discussion in this video, Review of Scala for data science, part of Scala Essential Training for Data Science.
- [Instructor] Scala is a language well suited for data science. Several of its features are especially important. For example, Scala is a functional programming language, so we can use things like the map operator to apply computations to members of collections. It's object oriented, so we can organize our data and the methods for operating on that data into logical groups. Scala is also a scalable language, and it gives us access to a wide range of Java libraries, including JDBC.
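As a rough illustration of the two points above (mapping a function over a collection, and grouping data with the methods that operate on it), here is a minimal sketch. The `Reading` case class, its fields, the `MapSketch` object, and the sample values are hypothetical and chosen only for illustration; they are not taken from the course.

```scala
// Hypothetical record type: data and the method that operates on it live together.
final case class Reading(sensor: String, celsius: Double) {
  // Convert this reading to Fahrenheit.
  def fahrenheit: Double = celsius * 9.0 / 5.0 + 32.0
}

object MapSketch {
  // Made-up sample data for illustration.
  val readings: Vector[Reading] = Vector(
    Reading("a", 0.0),
    Reading("b", 100.0),
    Reading("c", 25.0)
  )

  // map applies the function to every element and returns a new Vector;
  // the original collection is left unchanged.
  val fahrenheit: Vector[Double] = readings.map(_.fahrenheit)

  def main(args: Array[String]): Unit =
    println(fahrenheit) // prints Vector(32.0, 212.0, 77.0)
}
```

Because `map` returns a new collection rather than modifying `readings`, the same input can be reused for other computations.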
Scala is especially known for efficient computation. It compiles to Java bytecode and runs on the JVM. This means we get to take advantage of all kinds of advances in Java compiler design, like the just-in-time compiler. Parallel collections are especially useful for taking advantage of multi-core processors on our desktops and our laptops. They're even more useful if we're running on servers that have more cores. Now, if you're working with big data, that is, data that's too big to process efficiently on a single server, look into Spark.
We can use RDDs and DataFrames as high-level abstractions for distributed parallel processing. Also, there are some data science tools available in Scala that we didn't have time to look at but are worth considering. Saddle is a package for data manipulation. Breeze is a package designed to support numeric and scientific processing. If you've done data science in Python, you may have come across NumPy and SciPy; Breeze is analogous to those. Also, JDBC is fairly bare bones when it comes to working with SQL.
A package called ScalikeJDBC adds additional support. Finally, if you're doing a lot of Scala development, you might want to look into an integrated development environment like IntelliJ or Eclipse. Scala is a language well suited to data science, and it will become increasingly important to data science practitioners.
Dan also focuses on using Scala with Spark, a distributed processing platform. He first describes how to work with Resilient Distributed Datasets (RDDs), a fundamental Spark data structure, and then explains how to use Scala with Spark DataFrames, a new class of data structure specially designed for analytic processing. He wraps up the course by providing a summary of the advantages of using Scala for data science.
- The advantages of Scala for data science
- Scala data types
- Scala arrays, vectors, and ranges
- Parallel processing in Scala
- Mapping functions over parallel collections
- When and when not to use parallel collections
- Using SQL in Scala
- Scala and Spark RDDs
- Scala and Spark DataFrames
- Creating DataFrames