From the course: Cisco CCNP Collaboration 350-801 (CLCOR) Cert Prep: 1 Cisco Collaboration Technologies

Voice digitization

- [Instructor] In this video we want to consider how the spoken voice, this continuously varying analog waveform, can be converted into ones and zeros. In other words, how we can digitize our voice. If we want to send our voice over a network as data, as ones and zeros, we need to digitize that varying analog waveform; we need to represent it as a series of ones and zeros, where ones can be the presence of voltage and zeros can be the absence of voltage. And the way we can do that is by taking samples of that analog waveform. In other words, measuring, at regular intervals, what the volume of that waveform is at that moment. It's like we're playing a game of connect the dots. We observe the amplitude at all these different times and represent each dot with a series of ones and zeros. When those samples get to the far end, the far-end equipment can, as I said, connect the dots and rebuild approximately the original waveform. Now, that is assuming that we take enough samples. If I take this many samples that we see on screen, this is called oversampling, because we're taking more than we absolutely need. If I don't take enough, that's going to create a different signal altogether, an aliased signal. Notice in this example that the waveform reproduced by connecting the dots looks nothing like the original waveform, so we want to make sure to take enough samples. And that raises the question, how many samples should we take? That goes back to the late 1920s, when a man named Harry Nyquist was working at Bell Labs. He wrote a paper around 1928 that was really the foundation for the answer to that question. Now, a lot of people say that he developed the actual theory himself back in the twenties. He actually laid the groundwork for it, and other people extrapolated upon his work later. We now have what's called the Nyquist Theorem, and the Nyquist Theorem, named for Harry Nyquist, says that if we want to reproduce a waveform without creating one of these aliased signals, then the number of samples we need to take per second is double the highest frequency we want to capture. Now, the human ear can hear in the range of approximately 20 Hertz at the low end to about 20,000 Hertz at the high end, and that varies with age and how loud you listen to your music. But with a regular telephone call we're not trying to create some sort of Dolby Surround sound audio; we just want to communicate human speech. And over 90% of human speech intelligibility happens under 4,000 Hertz, 4,000 cycles per second in other words. According to Mr. Nyquist, if we double that, 4,000 times two, that gives us 8,000 samples per second that we need to be taking. And when we play those 8,000 samples back in rapid succession, it's going to sound approximately like the original voice. Let's compare that to going to the movies. Oftentimes when you're sitting in a theater watching a movie up on the silver screen, it looks like smooth motion, but really it's not. If you were to take a look at the film reel itself up in the projection booth, you would see that it's just a series of still images. They're being projected, sometimes at 24 frames per second, and because you're seeing those perfectly still images in rapid succession, it appears to be motion. Same thing here: when we play these audio samples back in rapid succession, it appears to be smooth voice.
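To make that sampling-rate rule concrete, here is a minimal Python sketch, not from the course itself, that computes the Nyquist rate and shows how an undersampled pure tone folds into an aliased frequency:

```python
# A minimal sketch of the Nyquist rule: to capture frequencies up to
# f_max without aliasing, we need to sample at at least 2 * f_max.

def nyquist_rate(f_max_hz: float) -> float:
    """Minimum sampling rate needed to capture content up to f_max_hz."""
    return 2 * f_max_hz

def apparent_frequency(tone_hz: float, sample_rate_hz: float) -> float:
    """Frequency a pure tone appears to have after sampling (spectral folding)."""
    folded = tone_hz % sample_rate_hz
    return min(folded, sample_rate_hz - folded)

print(nyquist_rate(4000))              # 8000 -- samples per second for telephone speech
print(apparent_frequency(3000, 8000))  # 3000 -- sampled fast enough, the tone survives
print(apparent_frequency(3000, 4000))  # 1000 -- undersampled, we get an aliased tone
```

A 3,000 Hertz tone sampled at 8,000 samples per second comes back as 3,000 Hertz, but the same tone sampled at only 4,000 samples per second shows up as a 1,000 Hertz alias, which is the connect-the-dots mismatch described above.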
So here's what we're going to do. We're going to take 8,000 samples per second, and we're going to measure the volume, the amplitude, of each sample. And we're going to represent them using something called PAM, P-A-M, Pulse Amplitude Modulation. A lot of people visualize this incorrectly; they imagine that as soon as we take a sample, it's immediately digitized into a series of ones and zeros. That's not the case. Here we see that we've taken these samples, but these samples are still in an analog format. They're all at one frequency, and they're at different amplitudes to represent the volume of the waveform that we were sampling. So the next step, after PAM, after pulse amplitude modulation, is pulse code modulation. That's where we assign a number to each of those amplitudes, and notice that some are below the line, some are negative; we need to represent the polarity as well. And there's a challenge with doing this: if we want to conserve bandwidth on our network, and we typically do, we don't want to use more bits than are absolutely necessary. So the question is, how many different amplitude values do we want to represent? If we were to use a linear scale, we might have something like this. Notice we've got our amplitudes with those bars on screen, and there's a range for the number one, a range for the number two, and a range for the number three. If the amplitude of one of those samples falls in the range for one or two or three, we give it that value; we say you are a one or a two or a three. And you'll notice there's a delta, there's some error, in each one of those; none of them lines up perfectly with its number. That's going to cause what is called quantization noise. It's like a background hiss that you're going to hear on the phone. So here's what we're going to do: instead of using a linear scale, we're going to use a logarithmic scale. Back in high school I used to have this logarithmic graph paper; instead of being linear, it would go up in powers of 10. We're going to use something similar to that to digitize our voice. We're going to break that Y axis into different segments, and within each segment we're going to have a series of step values that we can measure. Notice the steps are closer together for segment zero and a little bit further apart for segment one. I didn't draw the entire thing on screen, but we're more accurate at lower volumes. And that's good for two reasons. Number one, most samples occur at lower volumes, and number two, the high volumes are so loud anyway that they tend to drown out that background hiss. So we want to be more accurate at the lower volumes. And using this logarithmic scale, we can represent one of those samples with just eight bits. Here's how it works. We've got one bit to say if the amplitude is above the line or below the line, is it positive or negative, what is the polarity. Then we've got three bits to represent which segment the amplitude falls within, and four bits for which step it's closest to within that segment. That's one polarity bit, three segment bits, and four step bits, for a grand total of eight bits. And Mr. Nyquist told us we should be taking 8,000 samples per second. What's the total amount of bandwidth used there? Eight bits times 8,000 samples per second means we need a bandwidth of 64,000 bits per second. Eight times 8,000 is 64,000. And that number, 64k, turns up a lot in the telephony world.
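Here is a simplified sketch of that eight-bit sample layout: one polarity bit, three segment bits, and four step bits. The segment boundaries below are invented for illustration (their widths grow by powers of two, which gives the logarithmic behavior) and are not the exact break points of the real G.711 mu-law or A-law companding tables:

```python
# Simplified illustration of packing one voice sample into 8 bits:
# 1 polarity bit, 3 segment bits, 4 step bits. The segment edges here are
# made up for the example; real G.711 companding tables differ in detail.

SEGMENT_EDGES = [0, 16, 32, 64, 128, 256, 512, 1024, 2048]  # widths grow by powers of two

def encode_sample(amplitude: int) -> int:
    """Pack a linear amplitude in the range -2047..2047 into one 8-bit code."""
    polarity = 1 if amplitude < 0 else 0   # 1 bit: above or below the line
    magnitude = min(abs(amplitude), 2047)

    # 3 bits: which logarithmic segment does the magnitude fall within?
    segment = 7
    for i in range(8):
        if magnitude < SEGMENT_EDGES[i + 1]:
            segment = i
            break

    # 4 bits: which of the 16 equally spaced steps inside that segment?
    width = (SEGMENT_EDGES[segment + 1] - SEGMENT_EDGES[segment]) // 16
    step = (magnitude - SEGMENT_EDGES[segment]) // width

    return (polarity << 7) | (segment << 4) | step

print(format(encode_sample(-300), "08b"))  # 11010010 -- negative, segment 5, step 2
print(8 * 8000)                            # 64000 bits per second -- 8 bits x 8,000 samples
```

Whatever the exact companding table, the last line is the takeaway: eight bits per sample at 8,000 samples per second gives the classic 64 kbps voice channel.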
In fact, there are different ways of encoding voice. We call those different approaches codecs, which is short for coder-decoders. Here is a sample of some of the codecs you might come across: we might have G.711, G.729, or iLBC. And you'll notice that G.711 has a bandwidth requirement for the payload, not including any extra overhead, no IP addresses or MAC addresses, just the payload, of 64 kilobits per second. That comes from those eight bits per sample times Mr. Nyquist's 8,000 samples per second; that's where we get the 64k from. But once we add on the header information for Layers 2, 3, and 4 over an Ethernet network, we're going to be taking up about 87.2 kilobits per second of bandwidth for a G.711 phone call. Another codec that is quite popular is G.729. Notice that its payload bandwidth is only eight kilobits per second, and once you add on all the headers over Ethernet, that's going to take up about 31.2 kilobits per second by default. And I say by default because we can change that a little bit by adjusting the sample size, how many bytes of audio go into each packet, but that's what we have by default. And another codec that's oftentimes replacing G.729, because it sounds a little better, is iLBC; that stands for Internet Low Bitrate Codec. There are a couple of variants: one variant uses 13.3 kilobits per second, or for a little extra quality we could go to 15.2 kilobits per second. And on screen you see the corresponding Ethernet bandwidth required. So with G.711 we're not doing any compression. We are doing compression, however, with G.729 and iLBC. And compression might be appropriate when we're going over a slower-speed wide area network, or WAN, link, because it's over those links that we really want to conserve our bandwidth and not use more than we absolutely need to. Now that we've taken a look at traditional telephony networks, and we've seen how we can take the analog waveforms coming out of our mouths and convert those into ones and zeros, in our next video let's take the 30,000-foot view of the Cisco collaboration devices that are out there. And then after that we're going to have subsequent videos that discuss those individual devices. But for now, join me in our next video as we take a look at some components of a unified communications network.
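As a footnote on those per-call figures, here is a rough sketch of where the on-the-wire numbers come from. The 20-millisecond packetization interval and the per-layer header sizes used here (12-byte RTP, 8-byte UDP, 20-byte IPv4, 18-byte Ethernet including the FCS) are common working assumptions rather than values quoted in the video:

```python
# Rough per-call bandwidth math, assuming 20 ms of audio per packet and
# RTP/UDP/IPv4/Ethernet headers of 12 + 8 + 20 + 18 bytes. These header
# sizes are typical working assumptions, not figures from the course.

PACKETS_PER_SECOND = 50          # one packet every 20 ms
HEADER_BYTES = 12 + 8 + 20 + 18  # RTP + UDP + IPv4 + Ethernet (with FCS)

def ethernet_bandwidth_kbps(payload_kbps: float) -> float:
    """Approximate per-call Ethernet bandwidth for a given codec payload rate."""
    payload_bytes_per_packet = (payload_kbps * 1000 / 8) / PACKETS_PER_SECOND
    frame_bytes = payload_bytes_per_packet + HEADER_BYTES
    return frame_bytes * 8 * PACKETS_PER_SECOND / 1000

print(ethernet_bandwidth_kbps(64))  # G.711 -> about 87.2 kbps on the wire
print(ethernet_bandwidth_kbps(8))   # G.729 -> about 31.2 kbps on the wire
```

With those assumptions the fixed headers add about 23.2 kbps per call, which is where the roughly 87.2 and 31.2 kilobit-per-second Ethernet figures come from.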
