---
title: "Google Cloud Speech-to-Text API"
author: "Mark Edmondson"
date: "`r Sys.Date()`"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Google Cloud Speech-to-Text API}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

The Google Cloud Speech-to-Text API enables you to convert audio to text by applying neural network models in an easy-to-use API. The API recognizes over 80 languages and variants, to support your global user base. You can transcribe the text of users dictating to an application’s microphone or enable command-and-control through voice, among many other use cases.

Read more [on the Google Cloud Speech-to-Text Website](https://cloud.google.com/speech-to-text)

The Cloud Speech API provides audio transcription. It is accessible via the `gl_speech` function.

Arguments include:

* `audio_source` - a local file in the correct format, or a Google Cloud Storage URI. This can also be a `Wave` class object from the package `tuneR`
* `encoding` - the format of the sound file - `LINEAR16` is the common `.wav` format, other formats include `FLAC` and `OGG_OPUS`
* `sampleRateHertz` - the sample rate your file was recorded at.
* `languageCode` - the language spoken, as a [`BCP-47` language tag](https://datatracker.ietf.org/doc/html/bcp47)
* `speechContexts` - keywords you can supply to give the transcription some context.

### Returned structure

The API returns a list of two data.frame tibbles - `transcript` and `timings`.

Access them from the returned object via `$transcript` and `$timings`

```r
result <- gl_speech(test_audio, languageCode = "en-GB")

result$transcript
# A tibble: 1 x 2
#  transcript                                                                                                         confidence
#
#1 to administer medicine to animals is frequently a very difficult matter and yet sometimes it's necessary to do so 0.9711006

result$timings
#   startTime endTime       word
#1         0s  0.100s         to
#2     0.100s  0.700s administer
#3     0.700s  0.700s   medicine
#4     0.700s  1.200s         to
# etc...
```

### Demo for Google Cloud Speech-to-Text API

A test audio file is installed with the package which reads:

> "To administer medicine to animals is frequently a very difficult matter, and yet sometimes it's necessary to do so"

The file is sourced from the University of Southampton's speech detection (`http://www-mobile.ecs.soton.ac.uk/`) group and is fairly difficult for computers to parse, as we see below:

```r
library(googleLanguageR)

## get the sample source file
test_audio <- system.file("woman1_wb.wav", package = "googleLanguageR")

## it's not perfect but... :)
gl_speech(test_audio)$transcript

## get alternative transcriptions
gl_speech(test_audio, maxAlternatives = 2L)$transcript

gl_speech(test_audio, languageCode = "en-GB")$transcript

## help it out with context for "frequently"
gl_speech(test_audio,
          languageCode = "en-GB",
          speechContexts = list(phrases = list("is frequently a very difficult")))$transcript
```

### Word transcripts

The API [supports timestamps](https://cloud.google.com/speech/reference/rest/v1/speech/recognize#WordInfo) on when words are recognised. These are returned in a second data.frame with three columns: `startTime`, `endTime` and `word`.

```r
str(result$timings)
#'data.frame':	152 obs. of  3 variables:
# $ startTime: chr  "0s" "0.100s" "0.500s" "0.700s" ...
# $ endTime  : chr  "0.100s" "0.500s" "0.700s" "0.900s" ...
# $ word     : chr  "a" "Dream" "Within" "A" ...

result$timings
#  startTime endTime   word
#1        0s  0.100s      a
#2    0.100s  0.500s  Dream
#3    0.500s  0.700s Within
#4    0.700s  0.900s      A
#5    0.900s      1s  Dream
```

## Custom configurations

You can also send in other arguments which help shape the output, such as speaker diarization (labelling different speakers). To use such custom configurations, create a [`RecognitionConfig`](https://cloud.google.com/speech-to-text/docs/reference/rest/v1p1beta1/RecognitionConfig) object. This can be done via R lists, which are converted to JSON via `library(jsonlite)`. An example is shown below:

```r
## Use a custom configuration
my_config <- list(encoding = "LINEAR16",
                  diarizationConfig = list(
                    enableSpeakerDiarization = TRUE,
                    minSpeakerCount = 2,
                    maxSpeakerCount = 3
                  ))

# languageCode is required, so will be added if not in your custom config
gl_speech(my_audio, languageCode = "en-US", customConfig = my_config)
```

## Asynchronous calls

For speech files longer than 60 seconds, or if you don't want your results straight away, set `asynch = TRUE` in the call to the API.

This will return an object of class `"gl_speech_op"`, which should be passed to the `gl_speech_op()` function to check the status of the task. If the task is finished, it will return an object of the same form as the non-asynchronous case.

```r
async <- gl_speech(test_audio, asynch = TRUE)
async
## Send to gl_speech_op() for status
## 4625920921526393240

result <- gl_speech_op(async)
```
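
### Working with word timings

The `startTime` and `endTime` columns of the `timings` data.frame are character strings such as `"0.100s"`, so they need converting before you can do arithmetic on them. The sketch below is one possible way to do that in base R. `parse_seconds()` is a hypothetical helper, not part of `googleLanguageR`, and the `timings` data.frame is a hand-built stand-in copied from the example output in this vignette so that no API call is needed.

```r
## Hypothetical helper (not part of googleLanguageR):
## strip the trailing "s" from a duration string and convert to numeric seconds
parse_seconds <- function(x) {
  as.numeric(sub("s$", "", x))
}

## Stand-in for result$timings, copied from the example output in this vignette
timings <- data.frame(
  startTime = c("0s", "0.100s", "0.500s", "0.700s", "0.900s"),
  endTime   = c("0.100s", "0.500s", "0.700s", "0.900s", "1s"),
  word      = c("a", "Dream", "Within", "A", "Dream"),
  stringsAsFactors = FALSE
)

## Add numeric start/end columns and the duration of each word in seconds
timings$start_sec <- parse_seconds(timings$startTime)
timings$end_sec   <- parse_seconds(timings$endTime)
timings$duration  <- timings$end_sec - timings$start_sec

timings[, c("word", "start_sec", "end_sec", "duration")]
```

With a real response you would replace the stand-in with `result$timings`. The helper assumes every value ends in a single `s` suffix, as in the examples shown here.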