From the course: Effective Serialization with Python

Picking a serialization format - Python Tutorial

Picking a serialization format

There are many serialization formats out there, from familiar names such as JSON and XML to less familiar ones such as Cap'n Proto. In this video, we'll discuss some of the parameters you need to consider when picking a serialization format.

The first thing you should consider is how mature the format is. I love this equation by Martin Wiener: maturity is blood plus sweat divided by complexity. These days most projects are hosted on GitHub, which makes it easy to check parameters such as how old the project is, how many contributors it has, how many open bugs it has, etc. In general, ask around. Don't get tempted by shiny new things; go for old and boring technologies.

Another criterion is how many programming languages support the format. This might not be interesting if you write everything in the same language and mostly communicate internally. However, for external APIs, this might be very important. JSON, for example, is supported by many programming languages. It's one of the reasons why it's so popular in APIs.

Another criterion is the types supported by the format. JSON does not have a datetime or timestamp type. When passing time information, you will need to convert the time to a string or a number. Protocol Buffers does have a timestamp type, and if you use it, you avoid the need to convert your datetime objects to strings or numbers.

The schema is also an important factor. A schema defines what your messages look like. For example, it can say that a log message has a time field, which is a timestamp, a level field, which is an integer, and a message field, which is a string. Schemas make sure your data is correct and that services agree on what is being sent and received. On the flip side, schemas make it harder to change the data format. Ask anyone who's done a SQL schema migration how painful it was. In general, I prefer formats with a schema over schemaless ones. Schemas help me validate the data and detect errors.
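
To make the points about types and schemas concrete, here is a minimal sketch using only the Python standard library. The `LogMessage` class and the `to_json` and `from_json` helpers are hypothetical names made up for this illustration; the fields follow the log message example above (a time, an integer level, and a string message). The type checks are a hand-rolled stand-in for what a real schema-based format would enforce for you, and the time is converted to an ISO 8601 string because JSON has no timestamp type.

```python
import json
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass
class LogMessage:
    """Hypothetical log message: a time, an integer level, and a string message."""

    time: datetime
    level: int
    message: str

    def __post_init__(self):
        # Hand-rolled checks standing in for real schema validation.
        if not isinstance(self.time, datetime):
            raise TypeError("time must be a datetime")
        # bool is a subclass of int, so reject it explicitly.
        if not isinstance(self.level, int) or isinstance(self.level, bool):
            raise TypeError("level must be an integer")
        if not isinstance(self.message, str):
            raise TypeError("message must be a string")


def to_json(msg):
    # JSON has no timestamp type, so the time is sent as an ISO 8601 string.
    return json.dumps(
        {"time": msg.time.isoformat(), "level": msg.level, "message": msg.message}
    )


def from_json(data):
    obj = json.loads(data)
    # The receiver has to know that "time" is a string that encodes a datetime.
    return LogMessage(
        time=datetime.fromisoformat(obj["time"]),
        level=obj["level"],
        message=obj["message"],
    )


if __name__ == "__main__":
    msg = LogMessage(datetime.now(timezone.utc), 20, "service started")
    encoded = to_json(msg)
    print(encoded)
    print(from_json(encoded) == msg)
```

Note that both sides have to agree, outside of the format itself, that the time field is an ISO 8601 string. That agreement is exactly what a schema would write down.
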
Schemas also make me think about how the data should be structured. In the long run, it pays off in spades.

Performance might be a consideration. Make sure to have performance requirements before you select the format. Modern computers are very fast, and most formats will be good enough for your needs. By performance, I mean both CPU, how much time it takes to serialize and deserialize the data, and size, how many bytes are sent or stored. Both CPU time and bandwidth (the size) cost money, and picking the right serialization format can save you a lot of money. Make sure to run benchmarks against a couple of formats on real data before picking one.

The last criterion I'll mention here is the standard library. Installing third-party packages carries risk and operational complexity. If you use a format that's already in the standard library, such as JSON, you don't need to install anything, and you know it's well debugged.

There are many other criteria you might want to consider, such as security, streaming, and others. Please, please don't invent your own format. There are many formats out there. We don't need another one.

The main point is that you should be conscious about how you pick a format. Don't reach for the first thing that comes to mind or is considered cool today. So which one should you pick? I can't tell you. It really depends on your use case. What I can say is that most companies I consult with use JSON for external APIs and Protocol Buffers for internal communication between services. But don't be lazy. Do your homework.
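
As one way to do that homework, the advice above about benchmarking a couple of formats can be sketched with the standard library alone. This is only an illustration under stated assumptions: `make_records`, `json_dumps`, and `benchmark` are hypothetical helpers, the generated records stand in for your real data, and `pickle` is used as a second format simply because it ships with Python, not as a recommendation (it should never be used to load untrusted data). The sketch measures both parts of performance mentioned above: CPU time and encoded size.

```python
import json
import pickle
import timeit


def make_records(n):
    # Placeholder data; replace with a sample of your real messages.
    return [{"id": i, "level": 20, "message": f"event {i}"} for i in range(n)]


def json_dumps(data):
    # Encode to bytes so sizes are comparable with pickle's output.
    return json.dumps(data).encode("utf-8")


def benchmark(name, dumps, loads, data, repeat=20):
    encoded = dumps(data)
    # Sanity check: the format must round-trip the data before we time it.
    assert loads(encoded) == data
    encode_sec = timeit.timeit(lambda: dumps(data), number=repeat) / repeat
    decode_sec = timeit.timeit(lambda: loads(encoded), number=repeat) / repeat
    return {
        "format": name,
        "bytes": len(encoded),
        "encode_sec": encode_sec,
        "decode_sec": decode_sec,
    }


if __name__ == "__main__":
    records = make_records(1_000)
    results = [
        benchmark("json", json_dumps, json.loads, records),
        benchmark("pickle", pickle.dumps, pickle.loads, records),
    ]
    for result in results:
        print(result)
```

The numbers depend heavily on the shape of the data and on the machine, which is why running this on real messages matters more than any published comparison.
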
