One day I was working next to a dear friend of mine when I couldn't help but notice what her work consisted of. She had to painstakingly type an interview word by word. She was doing tons of interviews at the time and was responsible for transcribing them. I was even more surprised to learn that the transcriptions were just an intermediate artifact.
She needed them to timestamp parts of the conversations and use them as a guide to develop a script for the video editors to assemble the final piece.
That seemed like an awful lot of work for something that you threw away... So I came to the rescue!
My spidey sense told me that there had to be some kind of service that did speech-to-text well. It was 2021, after all!
I searched around a bit and found two good candidates (I'm sure there are tons nowadays):
- Mozilla's DeepSpeech model seemed perfect. Free, open source and you could run it on your own. Unfortunately it didn't work that well on Spanish audio, which was 99% of my friend's source material.
- Amazon's Transcribe. This seemed less ideal, ran on demand in the cloud, pretty expensive, and completely closed source, so no idea of what happened inside the magic black box. But hey, it worked!
After some tests with the latter and some cash well spent, it seemed to work perfectly fine for our use case, so we went with it. ❤️
Making it accessible
We already had the magic stuff that did the transcribing, but I still had to make it usable for my friend. Since she isn't a nerd herself, running a script or creating an AWS account was out of the picture. Also, for some reason, I didn't feel like keeping a public-facing app connected to a highly expensive Amazon service either (I do love sleeping), so no web apps. I ended up deciding to go for a "non-existent" interface.
I looked into the kind of tools my friend used daily to see if we could hack something together. She was already using OneDrive extensively, and luckily, if you gave Zapier some of your money, you could use some of their fancy connections between OneDrive and AWS. A good starting point!
We tried this and, little by little, iterated on the process until we ended up with the final workflow:
- Drop the .mp3 file from an interview into a specific OneDrive folder.
- Receive an email stating that your transcription is starting soon.
- Wait 15 min (you had to wait for Zapier to process workflows again and for Amazon to process the files).
- Receive your fresh new transcription with timestamps in your inbox.
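To give an idea of what that last step involves, here is a minimal sketch of how a timestamped transcription could be built from the JSON that Amazon Transcribe returns. It is not the original Transcritter code: the function names `format_timestamp` and `transcript_lines` are made up for illustration, and splitting lines at sentence-ending punctuation is just one assumption about what a useful layout looks like.

```python
def format_timestamp(seconds: float) -> str:
    """Turn a number of seconds into HH:MM:SS."""
    total = int(seconds)
    hours, rem = divmod(total, 3600)
    minutes, secs = divmod(rem, 60)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}"


def transcript_lines(transcribe_output: dict) -> list[str]:
    """Group the words of a Transcribe result into timestamped sentences.

    Each line starts with the time of its first word. A line ends at
    ".", "?" or "!" (an assumption made for this sketch).
    """
    lines = []
    words = []
    start = None

    def flush():
        nonlocal words, start
        if words:
            lines.append(f"[{format_timestamp(start)}] {' '.join(words)}")
        words, start = [], None

    for item in transcribe_output["results"]["items"]:
        content = item["alternatives"][0]["content"]
        if item["type"] == "pronunciation":
            if start is None:
                start = float(item["start_time"])
            words.append(content)
        else:
            # Punctuation items carry no timestamps; glue them to the previous word.
            if words:
                words[-1] += content
            if content in {".", "?", "!"}:
                flush()

    flush()  # keep any trailing words that had no closing punctuation
    return lines


# A tiny hand-written sample in the shape of a Transcribe result.
sample = {
    "results": {
        "items": [
            {"type": "pronunciation", "start_time": "0.5", "end_time": "0.8",
             "alternatives": [{"content": "Hola"}]},
            {"type": "punctuation", "alternatives": [{"content": ","}]},
            {"type": "pronunciation", "start_time": "0.9", "end_time": "1.2",
             "alternatives": [{"content": "buenos"}]},
            {"type": "pronunciation", "start_time": "1.2", "end_time": "1.6",
             "alternatives": [{"content": "días"}]},
            {"type": "punctuation", "alternatives": [{"content": "."}]},
            {"type": "pronunciation", "start_time": "65.3", "end_time": "65.9",
             "alternatives": [{"content": "Empezamos"}]},
            {"type": "pronunciation", "start_time": "66.0", "end_time": "66.2",
             "alternatives": [{"content": "ya"}]},
        ]
    }
}

if __name__ == "__main__":
    print("\n".join(transcript_lines(sample)))
```

The resulting lines are plain text, so they can go straight into the body of an email.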
Behind the scenes this is how this bad boy looked:
I used some beautiful S3 event triggers that ran some functions to transform data as needed.
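For the curious, one of those triggered functions might look roughly like the sketch below. It assumes the functions were AWS Lambda functions written in Python and that a new .mp3 landing in a bucket kicked off a Transcribe job, which is a guess at the shape and not the original code. The bucket name `transcritter-output`, the helper names, and the `es-ES` language code are all placeholders for illustration.

```python
import re
from urllib.parse import unquote_plus

# Hypothetical bucket where Transcribe would write its JSON result.
OUTPUT_BUCKET = "transcritter-output"


def uploaded_object(event: dict) -> tuple[str, str]:
    """Extract the bucket and key from an S3 event notification."""
    record = event["Records"][0]
    bucket = record["s3"]["bucket"]["name"]
    # Keys arrive URL-encoded in S3 events (spaces become "+").
    key = unquote_plus(record["s3"]["object"]["key"])
    return bucket, key


def transcription_request(bucket: str, key: str, output_bucket: str,
                          language_code: str = "es-ES") -> dict:
    """Build the arguments for Transcribe's StartTranscriptionJob call."""
    # Job names only allow letters, digits, ".", "_" and "-", up to 200 chars.
    # They must also be unique per account, which this sketch does not enforce.
    job_name = re.sub(r"[^0-9a-zA-Z._-]", "-", key)[:200]
    return {
        "TranscriptionJobName": job_name,
        "LanguageCode": language_code,
        "MediaFormat": "mp3",
        "Media": {"MediaFileUri": f"s3://{bucket}/{key}"},
        "OutputBucketName": output_bucket,
    }


def handler(event, context):
    """Lambda entry point: start a transcription for the uploaded file."""
    import boto3  # imported here so the helpers above work without it

    bucket, key = uploaded_object(event)
    request = transcription_request(bucket, key, OUTPUT_BUCKET)
    boto3.client("transcribe").start_transcription_job(**request)
    return request["TranscriptionJobName"]
```

Keeping the event parsing and the request building as plain functions means they can be checked locally without touching AWS (or the credit card).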
And this way, Transcritter was born! 👾
This system kept running for a couple of years, costing a handful of bucks a month and saving my friend tons of work hours. You could see the happy look on her face every time she got an email. Seeing her mood shift when she gained all those hours back in her life was priceless ❤️
The tools were a bit expensive, ~30 euros/month for the whole stack, but it proved to be well worth it for the sake of my friend's sanity. I even thought of turning it into a product for others to use, but I got sidetracked by other endeavors and never got to it.
This is by far the most satisfying piece of software I've built to date: a simple design with a bunch of pieces duct-taped together, no fancy code, no fancy infra and no fancy interfaces, just pure and raw value.
The closer you are to your users, the closer you are to seeing whether what you build has a true impact on their lives, and nothing can substitute for that.