In this video, learn how to enhance our code so that it accepts text input and returns audio of that text being spoken.
- [Instructor] So now that I've successfully authenticated, I can use this access token to call the speech endpoint and do text-to-speech. I will say that what you're looking at here is demo code. You don't want to hit the access token endpoint all the time. The access token is valid for 30 minutes to one hour, depending on who issues it. It's their choice, really, so you want to store it and renew it when you need to. Managing these access tokens is technically out of scope for this course, so I won't be covering that here.
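The "store it and renew it" idea can be sketched roughly as follows. This is only an illustration, not the course's code: `TokenCache`, `getToken`, and the `fetchToken` callback are hypothetical names, and the lifetime you pass in is an assumption you would set a little below whatever validity your token issuer actually gives you.

```typescript
// Hypothetical helper: remembers a token until an assumed lifetime has passed.
class TokenCache {
  private token: string | undefined;
  private expiresAt = 0;

  constructor(
    private readonly lifetimeMs: number,
    // The clock is injectable so the cache can be exercised without waiting.
    private readonly now: () => number = () => Date.now()
  ) {}

  // Returns the stored token, or undefined if it is missing or expired.
  get(): string | undefined {
    return this.token !== undefined && this.now() < this.expiresAt
      ? this.token
      : undefined;
  }

  // Stores a freshly issued token and restarts the expiry timer.
  set(token: string): string {
    this.token = token;
    this.expiresAt = this.now() + this.lifetimeMs;
    return token;
  }
}

// Only calls the (hypothetical) fetchToken callback when the cache is empty
// or stale, so the access token endpoint is not hit on every request.
async function getToken(
  cache: TokenCache,
  fetchToken: () => Promise<string>
): Promise<string> {
  return cache.get() ?? cache.set(await fetchToken());
}
```

A caller would create one cache, for example `new TokenCache(25 * 60 * 1000)` if the issuer's tokens last at least 30 minutes, and route every speech call through `getToken`.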
Another thing I'll mention is that I'm using a simple POST request/response mechanism. The Speech API is also quite sophisticated: it can support streams. I will not be covering streams in this course. I'll leave that for a more advanced course, but here at least you get an idea of the possibilities. With that out of the way, let's go ahead and actually call the speech API with this access token so we can do text-to-speech.
First I need to create a payload, and this payload is in an XML format. So I'm basically saying xml:gender is Female, then the name, et cetera, et cetera. You specify the options, so this is the language we want, this is the dialect, and so on, and you specify whatever input text you want converted to speech. There are a lot of other possibilities in this payload. We'll leave those for a more advanced course, but things like which voice you want, and everything else you would expect from a text-to-speech engine, can go in here, or you can pre-fill some of the text, and so on and so forth.
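A minimal sketch of such a payload builder might look like this. The function names, the `VoiceOptions` shape, and the specific voice name are assumptions for illustration; the transcript only says that the payload carries a language, a gender, a name, and the input text.

```typescript
// Escape characters that would otherwise break the XML payload.
function escapeXml(text: string): string {
  return text
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;")
    .replace(/"/g, "&quot;")
    .replace(/'/g, "&apos;");
}

// Hypothetical option bag; the field names are not from the course.
interface VoiceOptions {
  lang: string;               // language and dialect, e.g. "en-US"
  gender: "Female" | "Male";
  name: string;               // service-specific voice name
}

// Builds a very simple SSML document around the text to be spoken.
function buildSsml(text: string, voice: VoiceOptions): string {
  return (
    `<speak version='1.0' xml:lang='${voice.lang}'>` +
    `<voice xml:lang='${voice.lang}' xml:gender='${voice.gender}' ` +
    `name='${escapeXml(voice.name)}'>` +
    `${escapeXml(text)}</voice></speak>`
  );
}

// Example values only; check the service's voice list for valid names.
const payload = buildSsml("Hello & welcome", {
  lang: "en-US",
  gender: "Female",
  name: "Microsoft Server Speech Text to Speech Voice (en-US, ZiraRUS)",
});
```

Escaping the input text matters as soon as it comes from a user, because a stray `&` or `<` would make the XML invalid.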
So there are a lot of other possibilities here. This is a very, very simple payload. With this payload, as you can probably guess, I'm going to do a request.post, but before I do that, I need to create the request options in which I specify the headers and the body, so let's go ahead and create the request options next. requestOptions will be of type request.CoreOptions, and let's go ahead and fill in this object. I'm going to need to provide some headers, starting with X-Microsoft-OutputFormat. It supports a fixed set of formats, so you can't just specify whatever you want.
I'll just go with a very low-resolution format because I'm using the free SKU, and I'm using a simple POST to submit to the speech API, so I want to keep the file size small. Content-Type shall be application/ssml+xml, to match what we are sending up there in the payload. Host shall be speech.platform.bing.com.
Content-Length will be the length of the payload that we're sending. Authorization is where the access token comes into play: Bearer, space (the space is very important), accessToken. User-Agent can be really anything, but since I'm using Node.js, I'll simply write NodeJS. And obviously we need to send the actual body of what we want spoken, and that'll be the payload.
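The headers just described can be sketched as a small function. `buildSpeechHeaders` is a hypothetical name, and the output format string in the usage line is an assumption, since the transcript only says "a very low-resolution format". The sketch also computes Content-Length in bytes rather than characters, which only differs when the text contains non-ASCII characters.

```typescript
// Hypothetical helper that assembles the headers named in the walkthrough.
function buildSpeechHeaders(
  accessToken: string,
  payload: string,
  outputFormat: string
): Record<string, string> {
  return {
    "X-Microsoft-OutputFormat": outputFormat,
    "Content-Type": "application/ssml+xml",
    "Host": "speech.platform.bing.com",
    // Byte length, not string length: "é" is one character but two bytes.
    "Content-Length": String(Buffer.byteLength(payload, "utf8")),
    // The space after "Bearer" is required.
    "Authorization": "Bearer " + accessToken,
    "User-Agent": "NodeJS",
  };
}

// Example call; the format name is an assumed low-bitrate value.
const exampleHeaders = buildSpeechHeaders(
  "example-token",
  "<speak version='1.0' xml:lang='en-US'></speak>",
  "riff-8khz-8bit-mono-mulaw"
);
```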
Next, with the request options created, I just need to do a request.post, and we are going to post to https://speech.platform.bing.com/synthesize. Okay, so that's important. I pass in the request options that we created, and this should return the actual audio file in the response object.
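The course makes this call with the `request` library. The sketch below deliberately swaps that for an injected `post` function so it stays self-contained; `HttpPost`, `HttpResponse`, and `synthesize` are hypothetical names, and only the URL comes from the walkthrough.

```typescript
// Minimal shape of the response this sketch relies on.
interface HttpResponse {
  ok: boolean;
  status: number;
  arrayBuffer(): Promise<ArrayBuffer>;
}

// Any function that can POST a string body and resolve to a response.
type HttpPost = (
  url: string,
  init: { method: "POST"; headers: Record<string, string>; body: string }
) => Promise<HttpResponse>;

const SYNTHESIZE_URL = "https://speech.platform.bing.com/synthesize";

// Posts the SSML payload and resolves to the raw audio bytes.
async function synthesize(
  post: HttpPost,
  headers: Record<string, string>,
  ssml: string
): Promise<Uint8Array> {
  const response = await post(SYNTHESIZE_URL, {
    method: "POST",
    headers,
    body: ssml,
  });
  if (!response.ok) {
    throw new Error(`Speech endpoint returned HTTP ${response.status}`);
  }
  return new Uint8Array(await response.arrayBuffer());
}
```

Passing the transport in as a parameter is a design choice for the sketch: it lets the call be exercised with a fake, and a thin wrapper around whichever HTTP client you prefer can be supplied in real use.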
Now, the audio file that I get back is going to show up as a buffer, so I need to write it out to a file so I can play it. I'm going to say var convertedAudio equals Buffer.from(response.body), and I'll simply go ahead and write out this file using the sync method, just to keep it simple. I mean, yes, if this was an enterprise-class application, I wouldn't use sync, I would use observables.
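The write-to-file step can be sketched like this. `saveAudio` is a hypothetical helper, and it assumes the response body is already raw bytes; with the `request` library that generally means setting `encoding: null` in the options so the body arrives as a Buffer rather than a decoded string.

```typescript
import * as fs from "fs";

// Writes raw audio bytes to disk synchronously and returns the byte count.
function saveAudio(body: Uint8Array, filePath: string): number {
  const convertedAudio = Buffer.from(body);
  // When the data is a Buffer, Node writes the bytes as-is and ignores
  // any encoding argument, so none is passed here.
  fs.writeFileSync(filePath, convertedAudio);
  return convertedAudio.length;
}
```

A call such as `saveAudio(audioBytes, "output.wav")` would mirror the demo. The synchronous write keeps the example short, at the cost of blocking the event loop while the file is written.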
There are a lot of other patterns you can use. Those are out of scope for this course, but I definitely encourage you to write clean code. So, output.wav, response.body, and the encoding shall be binary. That finishes out the actual code. Let me do a quick walkthrough of what we're doing here one more time. So first, we authenticate using this.
That provides us with an access token. With the access token, we craft a payload, we submit that payload to the speech endpoint, and whatever results we get back, we write out to a file called output.wav. And I would like to point out that I'm using a very low-resolution audio format because this REST endpoint, used with just a simple POST, doesn't support streaming.
There are some file size limits on that, so I want to keep this demo simple. Yes, the speech API does support streaming, but we're not covering that here. This is just an introductory course. So you want to make sure that you request a small file with a low-end audio format. All that's left to do is to see this code running.