I used OpenAI’s new technology to transcribe audio directly on my laptop

OpenAI, the company behind the DALL-E image and meme generation program and the powerful GPT-3 auto-completion engine, has released a new open-source neural network for transcribing audio into written text (via TechCrunch). It’s called Whisper, and the company says it “approaches human-level robustness and accuracy on English speech recognition” and that it can also automatically recognize, transcribe, and translate other languages like Spanish, Italian, and Japanese.

As someone who constantly records and transcribes interviews, I was immediately excited by this news. I thought I would be able to write my own app to securely transcribe audio right on my computer. While cloud-based services like Otter.ai and Trint work for most things and are relatively secure, there are a few interviews where I, or my sources, would feel more comfortable if the audio file stayed off the internet.

Using it turned out to be even easier than I had imagined. I already had Python and various developer tools set up on my computer, so installing Whisper was as easy as running a single Terminal command. Within 15 minutes, I was able to use Whisper to transcribe a test audio clip I had recorded. For someone relatively tech-savvy who didn’t already have Python, FFmpeg, Xcode, and Homebrew set up, it would probably take closer to an hour or two. There’s already someone working on making the process much simpler and more user-friendly, though, which we’ll talk about in a second.
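For readers curious what that workflow looks like in practice, here is a minimal sketch rather than the exact steps I ran. It assumes Whisper was installed with pip (the project README documents `pip install -U openai-whisper`) and that FFmpeg is on the PATH. The helper names and the file name `interview.mp3` are hypothetical; `--model` and `--output_dir` are options of the `whisper` command-line tool.

```python
import shutil
import subprocess


def build_whisper_command(audio_path, model="base", output_dir="."):
    """Return the argument list for one local transcription run.

    Assumption: the `whisper` command-line tool is installed. The model
    name must be one Whisper ships (tiny, base, small, medium, large).
    """
    return [
        "whisper",
        str(audio_path),
        "--model", model,
        "--output_dir", str(output_dir),
    ]


def transcribe_locally(audio_path, model="base", output_dir="."):
    """Run Whisper on this machine; the audio is never uploaded anywhere."""
    if shutil.which("whisper") is None:
        raise RuntimeError("The whisper command was not found on PATH.")
    # check=True raises CalledProcessError if Whisper exits with an error.
    subprocess.run(build_whisper_command(audio_path, model, output_dir), check=True)
```

Calling `transcribe_locally("interview.mp3")` would write the transcript files into the current directory. Larger models are generally more accurate but slower, which matters on a laptop.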

Command-line apps obviously aren’t for everyone, but for something that does relatively complex work, Whisper is very easy to use.

While OpenAI definitely saw this use case as a possibility, it’s pretty clear that the company is primarily targeting researchers and developers with this release. In the blog post announcing Whisper, the team said their code could “serve as a basis for building useful applications and for further research into robust speech processing” and that they hope “the high accuracy and ease of use of Whisper will allow developers to add voice interfaces to a much wider set of applications.” That approach is still notable, however: the company has limited access to its most popular machine learning projects like DALL-E or GPT-3, citing a desire to “learn more about real-world usage and continue to iterate on our security systems.”

Image showing a text file with the lyrics transcribed from the song

The text files produced by Whisper are also not the easiest to read if you use them to write an article.

It’s also true that installing Whisper isn’t exactly a user-friendly process for most people. However, journalist Peter Sterne has teamed up with GitHub developer advocate Christina Warren to try to fix that, announcing that they are creating a “free, secure, and easy-to-use transcription app for journalists” based on Whisper’s machine learning model. I spoke to Sterne, and he said he decided the program, dubbed Stage Whisper, should exist after running a few interviews through Whisper and determining that it was “the best transcript I’ve ever used, with the exception of human transcribers.”

I compared a Whisper-generated transcript to what Otter.ai and Trint produced for the same file, and I’d say it was relatively comparable. There were enough errors in all of them that I would never copy and paste quotes into an article without double-checking the audio (which is, of course, best practice anyway, regardless of which service you are using). But Whisper’s version would absolutely do the trick for me; I can search it for the sections I need and then recheck them manually. In theory, Stage Whisper should work exactly the same since it will use the same model, just with a GUI surrounding it.
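That search-then-recheck routine is simple enough to sketch in a few lines of Python. This is an illustration under stated assumptions, not part of Whisper or Stage Whisper: the function name `find_mentions` is hypothetical, and it assumes the transcript is plain text with one short passage per line.

```python
def find_mentions(transcript_text, keyword, context=1):
    """Return (line_number, excerpt) pairs for lines containing keyword.

    The match is case-insensitive. `context` is how many neighboring
    lines to include on each side, to make it easier to find the same
    spot in the audio when rechecking a quote.
    """
    lines = transcript_text.splitlines()
    needle = keyword.casefold()
    hits = []
    for index, line in enumerate(lines):
        if needle in line.casefold():
            start = max(0, index - context)
            end = min(len(lines), index + context + 1)
            # Line numbers are 1-based, matching what a text editor shows.
            hits.append((index + 1, "\n".join(lines[start:end])))
    return hits
```

Each hit is only a pointer to where to listen; the quote itself still has to be checked against the recording.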

Sterne admitted that technology from Apple and Google could make Stage Whisper obsolete within a few years. The Pixel’s voice recorder app has been able to do offline transcriptions for years, a version of that feature is starting to roll out to other Android devices, and Apple has built offline dictation into iOS (although there is currently no good way to transcribe audio files with it). “But we can’t wait that long,” Sterne said. “Journalists like us need good automatic transcription apps today.” He hopes to have a simplified version of the Whisper-based app ready in two weeks.

To be clear, Whisper probably won’t make cloud-based services like Otter.ai and Trint totally obsolete, no matter how easy it is to use. For one thing, OpenAI’s model lacks one of the biggest features of traditional transcription services: being able to label who said what. Sterne said Stage Whisper probably wouldn’t support this feature: “we’re not developing our own machine learning model.”

The cloud is just someone else’s computer – which probably means it’s a bit faster

And while you get the benefits of local processing, you also get the downsides. The main one is that your laptop is almost certainly much less powerful than the computers used by a professional transcription service. For example, I fed audio from a 24-minute interview into Whisper, running on my M1 MacBook Pro; it took about 52 minutes to transcribe the entire file. (Yes, I made sure it was using the Apple Silicon version of Python instead of the Intel one.) Otter spat out a transcript in less than eight minutes.
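To put those timings on a common scale, here is a small worked calculation of the real-time factor, meaning processing time divided by audio length. It uses only the figures above; the function name is hypothetical, and because Otter took "less than eight minutes," its number is an upper bound rather than a measurement.

```python
def real_time_factor(processing_minutes, audio_minutes):
    """Processing time divided by audio length; below 1.0 is faster than real time."""
    if audio_minutes <= 0:
        raise ValueError("audio_minutes must be positive")
    return processing_minutes / audio_minutes


AUDIO_MINUTES = 24  # length of the interview

whisper_rtf = real_time_factor(52, AUDIO_MINUTES)     # local run on the laptop
otter_rtf_bound = real_time_factor(8, AUDIO_MINUTES)  # upper bound for Otter

print(f"Whisper on the laptop: {whisper_rtf:.2f}x the audio length")
print(f"Otter: under {otter_rtf_bound:.2f}x the audio length")
print(f"Whisper took at least {52 / 8:.1f} times as long as Otter")
```

In other words, the local run took a bit more than twice the length of the recording, while the cloud service finished in under a third of it.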

OpenAI’s technology has one big advantage, however: price. Cloud-based subscription services will almost certainly cost you money if you use them professionally (Otter has a free tier, but upcoming changes will make it less useful for people who transcribe things frequently), and the transcription features built into platforms like Microsoft Word or the Pixel require you to pay for separate software or hardware. Stage Whisper, like Whisper itself, is free and can run on the computer you already own.

Again, OpenAI has higher hopes for Whisper than being the basis of a secure transcription app, and I’m very excited about what researchers will end up doing with it or what they will learn by examining the machine learning model, which was trained on “680,000 hours of multilingual and multitask supervised data collected from the web.” But the fact that it also has real practical use today makes it all the more exciting.
