Understand the need to use the GCP SDK to create a client that programmatically loads data into a Cloud Spanner database.
- [Instructor] So, at this point in this course, we have been looking at how to work with Spanner by setting up instances and objects such as databases and tables, and working with schema, which is super important. However, we haven't gotten to a key aspect of working with databases yet, which is working with data. That was by design. Now, of course, as I've mentioned, at the time of this recording, Spanner is in a very early beta, so I'm not faulting Google for which parts of the product are available, I'm just reporting the news, basically.
So, at this point, to load data you actually have to use a client library, a client SDK. I anticipate this will change and that Google will add support for testing data inserts, updates, and deletes through their Cloud Console and through gcloud. They just haven't gotten to it quite yet. So, in order for us to learn more about Spanner, we actually have to switch gears, which is okay, and learn a little bit more about the SDK. Now, if you're totally new to Google Cloud, again, you may wanna take a look at my other Google Cloud Platform Essentials course on LinkedIn Learning to understand a little bit more about client SDKs, because I'm presenting this just in the context of connecting to Spanner, with a bit of assumed knowledge.
So, again, to get data into Spanner, at this point, we have one option: we have to use the client SDK for GCP. The supported programming languages are Go, Java, Python, and Node.js. And I've actually included a screenshot here showing that, at the time of this recording, if you tried to use a DML statement, a data manipulation language statement in SQL, such as an insert, update, or delete, in the console, you'd actually get a warning saying that that's not supported. And, of course, gcloud doesn't work either.
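To make that concrete, here's a minimal sketch using the Python client library (google-cloud-spanner); the project, instance, database, and table names are hypothetical, just to show the shape of connecting through the SDK and committing an insert:

```python
# Minimal sketch with the Python client library (pip install google-cloud-spanner).
# Project, instance, database, and table names below are hypothetical.
from google.cloud import spanner

client = spanner.Client(project="my-project")
instance = client.instance("my-spanner-instance")
database = instance.database("my-database")

# A batch groups mutations and commits them as one transaction when the block exits.
with database.batch() as batch:
    batch.insert(
        table="Singers",
        columns=("SingerId", "FirstName", "LastName"),
        values=[(1, "Marc", "Richards")],
    )
```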
So, we'll be doing that shortly, but before we do, as we've been learning, Spanner, being a globally distributed, transactionally consistent database, has some underlying requirements in terms of schema setup, and these relate to loading data. Google has called out three bad practices that you wanna avoid when you load data, and here they are. The first one is writing each row one at a time. They say that each write to Cloud Spanner carries overhead, whether the write is big or small.
Writing one row at a time lowers throughput because it requires a high ratio of overhead per write. Again, this is because of the globally distributed, transactionally consistent nature of Spanner. You're going around the earth, basically, or potentially. Now, if you think back to when we were reviewing best practices for table design, if you have related tables on which you're doing join queries, the recommended practice is interleaving, which means indicating in the schema design that related rows should be placed in the same physical location. There's a quick schema sketch of that below.
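As a refresher, this is roughly what an interleaved schema might look like, issued through the Python client's update_ddl call; the Singers and Albums tables are purely illustrative:

```python
# Sketch only: interleave Albums rows under their parent Singers rows so
# related data is stored together. Table and column names are illustrative.
ddl = [
    """CREATE TABLE Singers (
           SingerId   INT64 NOT NULL,
           FirstName  STRING(1024),
           LastName   STRING(1024)
       ) PRIMARY KEY (SingerId)""",
    """CREATE TABLE Albums (
           SingerId   INT64 NOT NULL,
           AlbumId    INT64 NOT NULL,
           AlbumTitle STRING(1024)
       ) PRIMARY KEY (SingerId, AlbumId),
       INTERLEAVE IN PARENT Singers ON DELETE CASCADE""",
]
operation = database.update_ddl(ddl)
operation.result()  # block until the schema change completes
```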
You can see now why this is so important: if the data is close together, you're gonna have quicker retrieval. The write and acknowledgment process requires network round trips, and performing those round trips for each individual row adds up. So that's bad practice one, writing one row at a time; you want to write groups of rows instead, and I'll show a quick sketch of that after we've gone through all three practices. Bad practice two is packaging N random rows into a commit with N mutations. Google goes on to say that writing random rows in a commit, where each mutation inserts a single row, may be even worse than writing one row at a time.
Multiple machines would be involved, because each random row could belong to a different machine. In the worst-case scenario, each write will involve every machine in your Cloud Spanner instance, and you can clearly see how that would be inefficient. As Google mentions, write throughput is lowered when more machines are involved in a write. And then the third practice they recommend against is adding all rows sequentially in primary key order. We talked about this in an earlier movie, when we covered recommended primary key column selection. They go on to say that sequentially adding rows means your writes get added to the end of your key range.
This causes hot-spotting, which is a concentration of operations on a single node. It lowers the write throughput to the capacity of a single node instead of benefiting from load-balanced nodes. We've already reviewed choosing a primary key, so these are the practices to avoid.
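Putting those recommendations together, here's a hedged sketch that batches a group of rows into one commit and uses non-sequential keys (random UUID strings on a hypothetical Orders table), so writes spread across the key range instead of piling onto the end of it:

```python
import uuid

from google.cloud import spanner

client = spanner.Client(project="my-project")  # hypothetical project
database = client.instance("my-spanner-instance").database("my-database")

# Rows to load; OrderId values are random UUIDs rather than sequential
# integers, so inserts spread across the key range instead of hot-spotting.
rows = [
    (str(uuid.uuid4()), "Alice", 42.50),
    (str(uuid.uuid4()), "Bob", 17.25),
    (str(uuid.uuid4()), "Carol", 99.99),
]

# One commit for the whole group of rows, rather than one commit per row.
with database.batch() as batch:
    batch.insert(
        table="Orders",  # hypothetical table
        columns=("OrderId", "CustomerName", "Total"),
        values=rows,
    )
```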