Review of “Automated Transcription” by Bokhove and Downey

By A. V. Howland

Introduction

As a freelance transcriptionist, I find nothing more delightful than a job that comes in already half-complete. It’s easy to tell when a previous worker has used voice-to-text software to do the job for them: the malapropisms and the lack of division between speakers are a dead giveaway. I honestly can’t blame them, though. Transcription often pays poorly despite the tedious, demanding nature of the work and the skills required to do it well. For me as a transcriptionist, automation can lighten the workload significantly.

Voice-to-text AI can also be very useful in academic research and educational contexts, saving time and money for researchers, and giving hard-of-hearing/deaf students, ESL learners, and visual learners a needed leg up in their lessons.

Christian Bokhove and Christopher Downey advocate for the use of automated captioning services in their article “Automated generation of ‘good enough’ transcripts as the first step to transcription of audio-recorded data”, published in Methodological Innovations’ May-August 2018 edition. Through the authors’ experiments and their investigation into academic literature about transcription, they conclude that automated captioning services are well worth using, especially as a rudimentary tool to jumpstart the transcription process, with editors to clean up the rough draft after its creation.


Article summary

The state of transcription

Bokhove and Downey begin by laying out the current state of transcription in academia and describing automated captioning services.

Automated captioning services (ACS) are common technology but have had little integration into academic research. ACS could be extremely useful for transcribing a rough draft of audio recordings. Since there are many transcription styles and most transcriptions need several passes of editing and standardizing to meet certain project-specific guidelines, creating a draft using ACS would save time and money for researchers (1-2).

Some academics doubt the usefulness of transcribed interviews due to their debatable value and the time-consuming nature of the work. ACS can reduce the time and labour of creating a transcript manually. Bokhove and Downey also posit that ACS lacks the human bias of individual transcribers’ interpretations of audio (2-3). (I would like to point out that bias can exist in technologies because they were created and designed by humans, who have bias.) Transcription for research is always a trade-off between quality and resources, but using ACS could help bridge that gap by providing free “good enough” first drafts of transcripts that editors can then adjust (2-3).

Methods and accuracy of captions

Currently, researchers can obtain captions through professional captioning companies, by transcribing captions themselves as subtitles to a video, or through automated captioning services. Auto-captioning services are the least accurate of the three, as their technology is still developing. The accuracy required of a transcript also depends on the transcript’s purpose (3-4). Accuracy varies widely among automated captioning services and voice-to-text programs, from incomprehensible to readily comprehensible (4). Any automated transcript will require editing, but the authors believe that the trade-off between the time saved and the accuracy lost is worthwhile. One way to improve accuracy is for the researcher to read the text aloud into voice recognition software that has been trained on their voice, which is how live TV captioning and court reporting are done. Still, while voice recognition software takes less time than human transcription, it has trouble transcribing multiple speakers’ voices and is much more time-consuming than letting an ACS program run (4).

Academic writing about transcription

There is not much academic writing about using automated captioning services or voice recognition software to assist transcription. Most of the writing that exists pertains to the use of transcription in educational contexts and has shown that captions of any kind make information accessible to deaf/hard-of-hearing students, increase understanding for special needs and ESL students, and help with student note-taking (4).

The direction of Bokhove and Downey’s experiment was previously unexplored, likely because easily accessible automated transcription tools are a relatively new technology.

This small pool of academic research points to a gap that needs to be filled by more research and inquiry into automated transcription services and their potential uses. More academic research is needed to improve content accessibility, especially because academic research gives legitimacy to the topic of transcription and the solution of automated transcription tools.

Authors’ experiment and methodology

The authors wanted to create a proof-of-concept for their idea of using ACS to create a “rough draft” transcript. They used three different sources of audio (single-person interviews, a group meeting, and a classroom lecture) to test how ACS handles different scenarios and qualities of audio, including variables like background noise and multiple speakers. Each audio source had its own accurate, manually produced transcript, against which the authors compared the transcript created by the ACS (5-6). Bokhove and Downey uploaded each audio source to YouTube and used YouTube’s automatic captions feature to get their transcripts (6). Then they downloaded the caption text file for each source, removed the timestamps, and compared each YouTube transcript to its professional manual transcript using Turnitin (7-8).
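For readers who want to try a similar comparison themselves, here is a minimal Python sketch of the last two steps: removing timestamps from a caption file and comparing the result to a manual transcript. It is not the authors’ procedure. They used Turnitin for the comparison, and this sketch swaps in Python’s standard `difflib` module instead. The SRT-style caption layout and the function names `strip_captions`, `normalize`, and `similarity` are my own assumptions for illustration.

```python
import re
from difflib import SequenceMatcher

# Assumption: captions use an SRT-style layout, where each cue has a number
# line, a timestamp line such as "00:00:01,000 --> 00:00:04,000", and text.
TIMESTAMP = re.compile(
    r"^\d{2}:\d{2}:\d{2}[,.]\d{3}\s*-->\s*\d{2}:\d{2}:\d{2}[,.]\d{3}"
)


def strip_captions(caption_text: str) -> str:
    """Drop cue numbers, timestamp lines, and blank lines; return plain text."""
    kept = []
    for line in caption_text.splitlines():
        line = line.strip()
        # Note: a caption line made up only of digits is also dropped here.
        if not line or line.isdigit() or TIMESTAMP.match(line):
            continue
        kept.append(line)
    return " ".join(kept)


def normalize(text: str) -> list[str]:
    """Lowercase the text and split it into words, ignoring punctuation."""
    return re.findall(r"[a-z0-9']+", text.lower())


def similarity(auto_text: str, manual_text: str) -> float:
    """Word-level similarity ratio between 0.0 and 1.0 (difflib, not Turnitin)."""
    matcher = SequenceMatcher(None, normalize(auto_text), normalize(manual_text))
    return matcher.ratio()


if __name__ == "__main__":
    captions = (
        "1\n00:00:01,000 --> 00:00:03,000\nthank you for joining us today\n\n"
        "2\n00:00:03,000 --> 00:00:06,000\nwe will talk about transcription\n"
    )
    manual = "Thank you for joining us today. We will talk about transcription."
    auto = strip_captions(captions)
    print(auto)
    print(similarity(auto, manual))
```

Comparing word lists rather than raw characters keeps capitalization and punctuation, which automatic captions usually lack, from counting as errors. A ratio from `difflib` is not equivalent to a Turnitin match percentage, so the numbers should not be compared directly with those in the article.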

Findings

Through their experiment, Bokhove and Downey found that free online automated captioning tools produced first-draft transcripts that were reasonable for use in academic research. Since manual transcription takes 4-5 hours for every hour of audio, using ACS would free up that time for other value-adding tasks, like editing, formatting, and proofreading. Depending on the audio quality, around 66-90% of the audio was accurately transcribed (10). The authors discovered that most automated caption errors were relatively small and easy to fix (8). ACS struggled most with jargon, numbers, slang, and certain word sounds (9). Overall, researchers could save a noteworthy amount of time by starting with an automated transcription instead of doing everything manually.
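To get a feel for what these figures could mean for a given project, the arithmetic can be sketched in a few lines of Python. This is a rough illustration only. Treating the 66-90% figure as a per-word accuracy rate is my own simplifying assumption, and the function names and the sample word count are hypothetical.

```python
def manual_hours(audio_hours: float, low: float = 4.0, high: float = 5.0) -> tuple[float, float]:
    """Range of manual transcription time, at 4-5 hours per hour of audio."""
    return audio_hours * low, audio_hours * high


def words_to_correct(
    total_words: int, acc_low: float = 0.66, acc_high: float = 0.90
) -> tuple[int, int]:
    """Best-case and worst-case word counts an editor might need to fix.

    Assumption: the 66-90% accuracy range is applied per word.
    """
    best = round(total_words * (1 - acc_high))
    worst = round(total_words * (1 - acc_low))
    return best, worst


if __name__ == "__main__":
    # Hypothetical project: 2 hours of audio, about 18,000 spoken words.
    print(manual_hours(2))
    print(words_to_correct(18_000))
```

The sketch deliberately stops short of estimating editing time, since the figures above do not include a per-word correction rate.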

Author recommendations

Bokhove and Downey recommend that researchers consider using an automated transcription process for projects that require long transcripts (10). They say that any of the available free transcription software can be a viable option to produce first draft transcripts, noting that those drafts should be looked over by human editors to correct any transcription mistakes made by the ACS (11).

My opinion

While I cannot speak to the more technical aspects of the article, such as different software, I find the authors’ recommendation to be a useful one. I can speak to how long transcription can take, and correcting a rough draft is much easier than transcribing every word. Using a program to roughly transcribe an interview or recording saves time and makes editing and formatting the transcript easier.

The authors investigate using ACS for transcription in the context of academic research, but the potential uses go far beyond that situation. Transcription has a multitude of uses, especially in the field of technical communication. For example, transcription is crucial for creating accessible audio content and for accommodating different learning styles in education. It would be a good idea to look further into how ACS-based transcription could be relevant to your work.

Conclusion

The authors’ recommendation is a valuable one: researchers have much to gain from integrating ACS into their research practices. Many automated transcription programs are free and simple to use, and they save large amounts of time.

To summarize, transcript drafts created by automated captioning tools and edited by humans can be used for research (10). Though the drafts have a wide range of accuracy, they can save time and money for researchers (4, 2), and thus the authors recommend that researchers at least consider implementing this process before embarking on large research projects. If researchers started using ACS to create a first draft, I could, as a transcriptionist, redirect my efforts to editing and fine-tuning the draft for quality rather than spending them on the time-consuming process of manual transcription.

Beyond its uses for academic research, automated transcription plays a much larger role in technical communication as a whole. Transcription and captioning allow information to be accessible to everyone. This is important not only for disabled people to access information, but also for content accessibility affected by platforms, devices, languages, and the environment. Since information is increasingly presented in video, audio, and other non-text formats, captions are necessary to make the non-text information accessible. In addition, online content is constantly growing in the digital age, necessitating transcription for captions. For example, there are e-learning, online conferences, online meetings, TV streaming services, and internet videos. In this pandemic, with online information and media becoming so prevalent, transcription and captioning are more vital than ever for information accessibility.

Bokhove and Downey’s investigation into automated transcription for research purposes opens up a new area of exploration not only for academia but also in the technical communication field.

Consider the technical communication field. How accessible is the content you create? How could transcription change the content you produce, and how will transcription affect it in the future? Start adding closed captions to videos and transcripts to audio posts. Get ahead of the transcription curve and make your content easily accessible with ACS.

About the author

A. V. Howland (they/them/their) uses their writing skills to help all kinds of projects reach their full potential. Their studies at York University have combined with practical work experiences as a library page and IT assistant to sharpen their versatile writing skills and eye for detail. Alongside an editing team, they edited The Game by Joel Lavigne (published April 2021) in their Publishing Practicum class, demonstrating their collaborative editorial skills while meeting close deadlines. A. V. is working on their Bachelor of Arts in English Professional Writing and looks forward to graduating to fully join the professional world. You can contact them on LinkedIn or find more of their work at their website, koi-caper-trcz.squarespace.com.

STC Ontario

Society for Technical Communication