Learn about three types of data and four elements that affect data quality.
- [Instructor] While big data may not be a trend, it's important for marketers to understand the concept. For instance, how do you tell a good dataset from one that's biased? What type or types of data are you producing? Which elements affect the quality of your data and, ultimately, your results? We produce 2.5 quintillion bytes of data each day, and while that number may seem overwhelming, we can actually break the data itself into three distinct types:
Structured data, unstructured data, and semi-structured data. Once we understand what each is and does, big data becomes a lot easier to digest. Let's look at structured data. Structured data is highly organized and labeled, and it fits well in a spreadsheet. Imagine an Excel file with thousands of labeled columns and millions of rows. It may be a large file, but each cell in the sheet has an identifiable format.
And that's what makes it structured. Unstructured data is really the opposite of that. It doesn't have a predictable organization, so it's harder to classify. Examples include the words in texts, blog posts, or emails, as well as images and video. Data scientists developed algorithms that understand the meaning in a sentence or a paragraph; that's called natural language processing, or NLP. And they created computer vision: algorithms that find patterns in images, which enable machines to see what we see and identify a cat as a cat, a chair as a chair, and so on.
Semi-structured data is really a combination of the two. A good example is the type of data you'd find on Twitter. The number of followers or number of tweets is structured data; the content or images you share are unstructured. Four factors can affect the quality of your data: volume, velocity, variety, and veracity, the four Vs. Volume refers to the size of a dataset. With cloud computing, organizations can now store and process vast amounts of data safely and cheaply.
Velocity is the speed of data, that is, how quickly it changes over a certain period of time. A good way to visualize this is to think about how many new tweets appear when you refresh a hashtag. Variety refers to how diverse or how biased a dataset is and whether it contains structured data, unstructured data, or some combination. And veracity points to the overall quality of the data and to what extent it can be trusted. So how does all of this work with AI? Well, it's really quite simple.
More data equals better statistical predictions equals better results. Which is why you need to ensure the data you're using is high volume, high quality, and unbiased.
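To make the three data types more concrete, here is a minimal Python sketch of the Twitter example mentioned earlier. The record, its field names, and the rule for separating the fields are all assumptions made up for illustration; this is not a real Twitter schema or API, just one simple way to picture a semi-structured record.

```python
# Hypothetical tweet-like record. The field names are illustrative only
# and do not correspond to any real API schema.
tweet = {
    "follower_count": 1520,                           # structured: a labeled number
    "tweet_count": 87,                                # structured: a labeled number
    "text": "Loving the new campaign dashboard!",     # unstructured: free text
    "image_files": ["launch.png"],                    # unstructured: media
}


def split_record(record):
    """Split a record into structured and unstructured parts.

    Assumption for this sketch: numeric values count as structured,
    while free text and media count as unstructured.
    """
    structured, unstructured = {}, {}
    for key, value in record.items():
        if isinstance(value, (int, float)) and not isinstance(value, bool):
            structured[key] = value
        else:
            unstructured[key] = value
    return structured, unstructured


structured, unstructured = split_record(tweet)
print("structured:", structured)
print("unstructured:", unstructured)
```

The counts would drop straight into a spreadsheet column, while the text and image would need something like NLP or computer vision before a machine could work with them. A record that carries both kinds of fields is what makes it semi-structured.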