Join Chander Dhall for an in-depth discussion in this video Add typed documents from file, part of Cosmos DB: Import, Manipulate, Index and Query.
- [Instructor] One of the big things when it comes to Cosmos DB is adding data, and it's important to understand that we're going to have to take data from a lot of different places and then push it over to the cloud instance. So in this case, if you were to navigate to jsonstudio.com/resources, you can see there are some good data sets here. As you can see, these are real-world data sets. One of them is a data set of projects funded by the World Bank, and you also have US zip codes and some stock listings data, and then email data from Enron, and the one we are most interested in is going to be the startup company information, which is 14.8 megabytes when it's compressed.
So once you hit Click, it downloads, and then you click here, it says Show In Folder. Next, we'll right-click and say Extract All. And we can see the data is companies.json. For now, we'll copy the file as it is and go to Visual Studio. Once we are in Visual Studio, we can paste the file as it is. And when we refresh this, I can see the file.
And if you don't see the file for some reason, you should have Show All Files checked. And then I can do a right-click, Include In Project, and then double-click the file. So when you double-click the file, you can see we have the data right in front of us. Now, notice, this isn't an array, but it has a lot of JSON objects. Line number one is a JSON object, line number two is another JSON object, and so on for the rest of the lines. If we were to hit End on the keyboard, we can see there's no comma separation between the lines, so the only separation is \n, the newline character.
And if I were to go down a little bit on line number two, I can see there isn't any comma, even at the end of line number two. So press Home, and we can navigate this data. So as you can see, _id happens to be on the root level, but inside _id, we have $oid, which holds the actual identifier string, which is the actual ID. Name happens to be at the root level, and then we have permalink. If we were to navigate a little further, we can see a lot of other properties, including crunchbase_url, homepage_url, and then we have blog_url, and then we have blog_feed_url.
We can see Twitter user name and then category code. Now, what's interesting is when we have so much data, how do we even find what our partition key is going to be? So in this case, let's look at some of this data. So for example, category code: with Ctrl+C you can copy it, and then with Ctrl+F you can search for it. As you can see, the category code happens to be web, software, social, web again, social again, web again, and you can see that there are just a few kinds of categories that are repeated.
Now, that could be a good way to partition your data, because what you don't want your partition to have is unique values. You want partition to happen where you have common values. So for example, all the software data needs to live in the software partition. All the web data needs to be in the web partition. And that's one way to look at the partition key. So now that we have this data, what we're going to do is add typed documents from companies.json, and then import them over to our Document DB cloud instance or the Cosmos DB instance in our cloud.
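Searching by hand shows only a sample of the values. As a rough sketch, the same check can be done in code by tallying how often each value of a candidate property appears across the line-delimited file. The class and method names here (`PartitionKeySurvey`, `CountValues`) are hypothetical and not from the video; only `JObject.Parse` from Newtonsoft.Json is assumed.

```csharp
using System.Collections.Generic;
using System.IO;
using Newtonsoft.Json.Linq;

// Hypothetical helper, not part of the course project.
public static class PartitionKeySurvey
{
    // Counts how often each value of `propertyName` occurs in a reader
    // that holds one JSON object per line (as companies.json does).
    public static Dictionary<string, int> CountValues(TextReader reader, string propertyName)
    {
        var counts = new Dictionary<string, int>();
        string line;
        while ((line = reader.ReadLine()) != null)
        {
            if (string.IsNullOrWhiteSpace(line)) continue;

            JObject json = JObject.Parse(line);

            // Missing or null values are grouped under a placeholder key.
            string value = (string)json[propertyName] ?? "(none)";

            counts.TryGetValue(value, out int current);
            counts[value] = current + 1;
        }
        return counts;
    }
}
```

Calling `PartitionKeySurvey.CountValues(new StreamReader("companies.json"), "category_code")` would return one entry per distinct category, which makes it easier to judge how the documents would be spread across partitions.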
So we'll go ahead and delete some of these methods that we don't need now. And we'll create a new method, which again, will be a static method, and it's called AddTypedDocuments, and in this case it's really companies, but for now, we'll call it Documents, 'cause at the end of the day a company is a document. And then we'll pass in the database ID, and the collection ID. That's all we need for now. And then we'll use a stream reader.
Again, we need to use a using statement, so we can control the lifetime of the stream reader, and in order to resolve StreamReader, we might use control dot. Once we have this, we're going to say file equals new StreamReader, and then pass in the file. Now, again, this is hard coding, but it doesn't really matter for now, because we're more focused on Cosmos DB than doing it the right way. Ideally you would pass it in your function. Now we're going to have to do a while loop, and why is that? Well, the reason is simple.
Because at one point of time, we want one line, and we want to read one line as it is, which is going to be in string format. And if it is not null, that means that line exists, then what do we do? First thing we could do is we could actually write that line in console, so we know what's going on. Just for debugging purposes. And then the next thing would be to convert this string to some kind of JSON object.
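The loop described here can be sketched as follows. This is a minimal version under some assumptions: the helper name `ForEachLine` and the callback parameter are hypothetical, introduced so the per-line work (parsing and uploading, covered next) can be plugged in; the video writes the loop inline inside its own method with a hard-coded path.

```csharp
using System;
using System.IO;

// Hypothetical helper, not part of the course project.
public static class FileImport
{
    // Reads the file one line at a time and hands each line to `handleLine`.
    // Returns the number of lines processed.
    public static int ForEachLine(string path, Action<string> handleLine)
    {
        int count = 0;

        // The using statement disposes the StreamReader when the block ends.
        using (var file = new StreamReader(path))
        {
            string line;
            // ReadLine returns null once the end of the file is reached.
            while ((line = file.ReadLine()) != null)
            {
                Console.WriteLine(line); // debugging output only
                handleLine(line);
                count++;
            }
        }
        return count;
    }
}
```

Reading line by line works here because each line of the file is a complete JSON object, so the whole file never has to be held in memory at once.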
For that, I'll use JObject, which is part of Newtonsoft.Json, and I could say json equals json dot ToObject, and convert it into some kind of object that I have in C#. Now, I really don't have an object at this point of time, so I won't get IntelliSense here. So what do we do next? We're going to go ahead and create a company entity and say Add, Class, and I'm going to call it Company. And Company is a class that will have not all of the properties.
We just want to create some properties to show you that this works. So the first property would be an ID, because Document DB requires you to have an ID property. And the next thing would be a string property called Name. Now, keep in mind, in JSON, we cannot have N as capital, because it's camel casing. We are going to have to change that. But before that, let's create some of these properties.
And the next property was HomepageURL, and we also had CrunchBaseURL. And then we had CategoryCode, which was more like our partition key. And in order to make sure that Newtonsoft understands our properties, we're going to use JsonProperty and give it the same names as we had in the file.
So we have homepage_url, and then we have crunchbase_url, with the underscore. And then we also have the final property, which is category_code, except the C is lowercase. And that's it. So now we're here and we're able to convert this particular object. Before we do the conversion, though, we need to make sure that we have already parsed the object, which in this case was a string, and we have that in the variable line.
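Put together, the class described above might look like the sketch below. The JSON names `name`, `homepage_url`, `crunchbase_url`, and `category_code` come from the file; the exact C# property names and casing are assumptions, since the video only says them aloud.

```csharp
using Newtonsoft.Json;

// A deliberately small subset of the properties found in companies.json.
public class Company
{
    // Lowercase "id" is the document identifier property.
    [JsonProperty("id")]
    public string Id { get; set; }

    [JsonProperty("name")]
    public string Name { get; set; }

    [JsonProperty("homepage_url")]
    public string HomepageUrl { get; set; }

    [JsonProperty("crunchbase_url")]
    public string CrunchBaseUrl { get; set; }

    // Candidate partition key discussed earlier.
    [JsonProperty("category_code")]
    public string CategoryCode { get; set; }
}
```

Any property in the file that has no matching member on `Company` is skipped during deserialization with Newtonsoft's default settings, which is why a subset is enough here.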
And now we have the JSON object. And then the next thing we can do is convert this over to our C# company object from the JSON object. As you can see, this will work. However, as you remember, if we were to go back and look into the JSON, we had _id, and then inside it we had $oid. Now, this is the data that we have as it is, but when we go to Document DB, this extra level of nesting is unnecessary.
So how do we take this particular ID and move it to company dot ID? So here's a cool trick I can show you. What you could do is you could say company dot ID equals, and in this case, you could say JSON dot SelectToken, and pick the token. And in this case it will be underscore ID dot dollar OID. And the response of this needs to be converted over to string, and that'll work. So now this particular object, which happens to be underscore ID and has the property dollar OID, this will be flattened to company dot ID.
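The flattening trick can be sketched like this. The `CompanyMapper.FromLine` helper is a hypothetical name, and the `Company` class is trimmed to three properties to keep the example short; the `SelectToken("_id.$oid")` path is the one described above.

```csharp
using Newtonsoft.Json;
using Newtonsoft.Json.Linq;

// Trimmed copy of the Company class, just enough for this example.
public class Company
{
    [JsonProperty("id")]
    public string Id { get; set; }

    [JsonProperty("name")]
    public string Name { get; set; }

    [JsonProperty("category_code")]
    public string CategoryCode { get; set; }
}

// Hypothetical helper, not part of the course project.
public static class CompanyMapper
{
    public static Company FromLine(string line)
    {
        // Parse the raw line first, then map it onto the typed object.
        JObject json = JObject.Parse(line);
        Company company = json.ToObject<Company>();

        // Flatten the nested { "_id": { "$oid": "..." } } into company.Id.
        company.Id = (string)json.SelectToken("_id.$oid");
        return company;
    }
}
```

For a made-up line such as `{"_id":{"$oid":"abc123"},"name":"Example Co","category_code":"web"}`, `FromLine` yields a `Company` whose `Id` is `"abc123"`. If a line has no `_id.$oid`, `SelectToken` returns null and `Id` stays null.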
Now keep in mind, the company object here in C# versus the companies.json object are very different. This company object in C# is actually just a subset of the big company object inside companies.json. But again, I'll show you another way in just a minute on how to use the same data and import it as it is. But this is just the way to show you that you could do it using existing C# objects in case you prefer this approach.
So now, what's the next thing? Once we have this data, we want to make sure that we send this data over to Document DB. So what I could do is CreateTypedDocument, and this means I'm going to create the typed document inside my Azure instance of Cosmos DB or Document DB. Same thing. And then I'm going to create this as more of an async method, so I'll have to use await.
So once you come to line number 37, do control dot, and let Visual Studio create the method for you. So once you are here, we want to make sure that this method is async, and then it returns the document. So we'll say task of document, which is a Cosmos DB document, and leave everything as it is.
First thing we need is the collection, so we'll say var collection equals await client dot ReadDocumentCollectionAsync, and we've done this quite a few times now, so this should be simple. Read the document collection in an async fashion, use the same UriFactory dot CreateDocumentCollectionUri, and we already have the database ID and we have the collection ID to pass in.
So this gives us the collection, and the next thing would be to create a document instance, document, document equals await client dot CreateDocumentAsync, 'cause we're creating this document inside Azure Cosmos DB. And what does it really need? It needs the self link or the ID for this particular resource, and that will be at collection dot resource dot SelfLink.
And next, it needs the document, and what's the document, in our case? We can pass company as it is, and it gets the job done. Once we have this, we're going to say return document. And our method is ready. So as you can see, the database ID is cast, and you can have a collection name that already exists in Cosmos DB. Currently, we have company and test, and we just choose to use company. One thing to keep in mind, though, is companies.json, if you go to right-click on this and go to Properties, this needs to be at least Copy If Newer, if not Copy Always.
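As a sketch, the finished method could look roughly like the following, using the DocumentDB .NET SDK (`Microsoft.Azure.Documents.Client`). Two things are assumptions rather than what the video shows: the `DocumentClient` is passed in as a parameter (the course project appears to use a shared client), and the document is typed as `object` so the block stands alone without the `Company` class.

```csharp
using System.Threading.Tasks;
using Microsoft.Azure.Documents;
using Microsoft.Azure.Documents.Client;

// Hypothetical wrapper class; the video adds the method to its existing program class.
public static class CosmosImport
{
    public static async Task<Document> CreateTypedDocument(
        DocumentClient client, string databaseId, string collectionId, object company)
    {
        // Look up the target collection by database ID and collection ID.
        ResourceResponse<DocumentCollection> collection =
            await client.ReadDocumentCollectionAsync(
                UriFactory.CreateDocumentCollectionUri(databaseId, collectionId));

        // Create the document in that collection, addressed by its self link.
        ResourceResponse<Document> response =
            await client.CreateDocumentAsync(collection.Resource.SelfLink, company);

        return response.Resource;
    }
}
```

Because the collection is re-read on every call, importing a large file this way costs two requests per document. Reading the collection once outside the loop and reusing its self link would halve that, at the price of a slightly different method signature.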
Once this is done, let's do Ctrl+F5, and this should work. So as you can see, there's a lot of data, and we're printing every single thing onto the console. And this will take a while, because we have a lot of records. It's about 15 megabytes' worth of records, and we're making a request one by one. So I'm just going to go ahead and kill this, but you will still be able to see the data in your Azure Cosmos DB instance.
So if I were to go back to the portal, and hit refresh, and go back to Document Explorer, you can see the company has a lot of data now, and you can see the category code is repeated, which makes it a really good partition. If I click on one of these items, you can see that we have about six different properties. Now, if you notice, all these properties were added by us. But then you have the ID property added by Document DB.
And let's take a different category code, you can see News, and you can see the category code is News here, and then this is the ID created by Document DB. But all this data happens to be in your Cosmos DB instance, and you can keep loading more if you want to have a look at the entire data. And any time you add more data, you should hit Refresh, and it will show you the changes.