In this video, learn how to create a job using Cloud Dataproc/Apache Spark.
Since all of my professional work is on huge-scale, cloud-based data pipelines, mostly for genomics, I'm fascinated by the services that are becoming available. Last week at GCP Next, Google announced Cloud Data Fusion, which is in beta as of this recording. As it says here, it's a fully managed, cloud-native, enterprise data integration service for quickly building and managing data pipelines. So it's similar to Dataprep in how it can be used, but it can be used for lots more than that. Also, in this architecture article, it's interesting to think about the differences in architecture compared with Dataprep, which is implementing Apache Beam on virtual machine clusters. Cloud Data Fusion runs, as it says, within one Compute Engine zone, and it's composed of several GCP technologies. Interestingly, it's abstracted at the level of the application using GKE, Cloud SQL, Cloud Storage, and Key Management Service. It's also interesting that rather than Beam, it's using Apache Spark. A full discussion of the differences between Beam and Spark is beyond the scope of what I'm going to talk about here, but I'll give you some references in the repository that I have on GitHub so that you can read more about some of the differences. Understanding the capabilities of Beam and Spark, and when one technology is better for one use case versus the other, is an active area of work for me, but it is interesting to see that the abstraction here is built at the Kubernetes level. So I fired up an instance, which did take a little bit of time, but it is beta, and this is the interface that you get. You can see that here we're in the Control Center, where you can create, manage, operate, and monitor datasets and applications. The next component is the Pipeline Studio, where you build pipelines by connecting nodes in a logical flow, so this is kind of similar to Dataprep's flow. The difference that I see out of the box, even in the screenshot, is that there are external data sources such as Excel and files.
Then there's a Wrangler tool, which allows you to connect to a variety of data sources and cleanse data using point-and-click interactions. So again, this is similar to the Recipe idea of Dataprep, though the thinking here is more transformations and more different data sources. And here we have Metadata, which enables data discovery through search and data governance through lineage. So this Data Fusion tool looks more similar to the typical type of integration tools that I've worked with, mostly through third-party vendors such as Informatica, or even, way back when, SQL Server Integration Services. It's a more complete ETL kind of tool, if you will, and I'm really looking forward to exploring working with it as it goes from beta to GA. And you can see the last component here is the Hub, which allows administrators to distribute reusable pipelines, applications, plugins, and solutions to all Cloud Data Fusion users in their organizations. There is a quick start in the Google documentation that, again, I'll leave you to work through if this is an area of interest. At the very minimum, you're going to want to compare Cloud Dataprep with Cloud Data Fusion if you want to work at the level of abstraction of a graphical user interface. To summarize, there have been really a large number of services over the past 12 months in data warehousing, which is the reason I created a section on it. BigQuery, of course, is your core engine, and you have SQL and now machine learning queries available. You of course have batch and stream insert with BigQuery. Files can be in BigQuery storage or in Google Cloud Storage, and again, we've talked quite a lot about bucket configuration and lifecycle, which are really important in warehousing scenarios because of cost.
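Wrangler's point-and-click cleansing steps boil down to ordinary row transformations: trim whitespace, normalize casing, cast types. As a rough sketch of that idea in plain Python (this is not the actual Wrangler API; the column names, sample data, and rules here are invented for illustration):

```python
import csv
import io

# Hypothetical messy CSV, the kind of raw input you would point Wrangler at.
RAW = """id, name , city
1,  Alice ,SEATTLE
2,Bob,  austin
3,  Carol  ,Denver
"""

def cleanse(reader):
    """Trim whitespace, title-case the city, and cast id to int."""
    rows = []
    for row in reader:
        rows.append({
            "id": int(row["id"].strip()),
            "name": row["name"].strip(),
            "city": row["city"].strip().title(),
        })
    return rows

reader = csv.DictReader(io.StringIO(RAW))
# The header row is messy too, so strip the field names before reading rows.
reader.fieldnames = [f.strip() for f in reader.fieldnames]
clean = cleanse(reader)
print(clean[0])  # {'id': 1, 'name': 'Alice', 'city': 'Seattle'}
```

In Wrangler each of those steps would be a recorded directive you apply by clicking on a column, rather than code you write by hand.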
Data movement is very interesting. Previous to the last time I updated this course, it was basically third-party vendor solutions such as Informatica or Matillion, which are still useful tools, or writing your own scripts. But Google now has Dataprep, which uses Dataflow (Apache Beam on GCE), and the brand-new Data Fusion, which uses Dataproc (Spark on GKE). For data visualization, you have Data Studio or partners such as Tableau. Now, just for fun, I actually built a little flow in Data Fusion. I really just wanted to try the product out myself, so I was pulling on some of the components and building sort of a typical, simple flow to see what it would look like. Here we've got a component that works with a bucket; each of these components has references, which makes them easy to use. I then connected it to a Transform, which is a Formatter, then to an Analytics component, which is a Deduplicator, just really looking at what capabilities are available right out of the box in the beta, and then to a Sink with BigQuery. So that's the typical use case of loading BigQuery. Then, looking at the ability to Configure was really interesting, in that you can set the amount of Resources, and you can set Spark, and notice it changes to Spark Streaming. Because I've been doing a lot of big data pipeline work in Spark, in Java or Scala or even some library in Python, it's really interesting to me to see this at this level of abstraction, and I will be exploring this as this product matures.
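Conceptually, the flow above (bucket source, Formatter transform, Deduplicator, BigQuery sink) is just a chain of stages, each consuming the previous stage's records. Here is a minimal pure-Python sketch of that shape; the stage names mirror the Data Fusion components, but the functions, record fields, and in-memory "sink" are all invented stand-ins for the real plugins:

```python
def source():
    """Stand-in for the GCS bucket source: yields raw records."""
    yield {"user": " ada ", "event": "login"}
    yield {"user": "ada", "event": "login"}     # duplicate after formatting
    yield {"user": "grace", "event": "logout"}

def formatter(records):
    """Stand-in for the Formatter transform: normalize fields."""
    for r in records:
        yield {"user": r["user"].strip().lower(), "event": r["event"]}

def deduplicator(records, key=("user", "event")):
    """Stand-in for the Deduplicator: keep the first record per key."""
    seen = set()
    for r in records:
        k = tuple(r[f] for f in key)
        if k not in seen:
            seen.add(k)
            yield r

def bigquery_sink(records):
    """Stand-in for the BigQuery sink: collect the 'loaded' rows."""
    return list(records)

loaded = bigquery_sink(deduplicator(formatter(source())))
print(loaded)  # two rows survive: the duplicate login is dropped
```

In Data Fusion you wire these stages together visually in Pipeline Studio, and at runtime the same chain executes as a Spark job on Dataproc rather than in a single Python process.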