Computer Voice Dictation for Mental Health Professionals

The development of computerized speech dictation has been a long-standing goal of mental health professionals, who spend an extraordinary amount of time and money dictating patient records.

Traditionally, psychiatrists and other mental health professionals dictate their notes or reports into a miniature tape recorder or over the phone. A highly trained transcriptionist then transcribes the recording into written form, generally on a word processor. The document is printed and returned to the professional for review and editing (e.g., to fill in words the transcriptionist could not make out). It is then given to a secretary or the original transcriptionist for revisions, and again returned to the clinician for approval. This process, which can take anywhere from three days to three weeks, is time-consuming for all involved and increases the chances that vital information will be missing from the patient record.

Computerized speech dictation systems seem like a sensible way to cut time and costs. After the clinician dictates his or her notes or a report, the system recognizes the spoken words and generates a document immediately (or perhaps later on, when the clinician isn't using the computer for other tasks). The clinician (or his or her secretary) could then review it for accuracy and, once corrected, the document could be printed and immediately inserted into the patient's record. The entire process of dictation, recognition, correction and printout could be completed in as little as a few minutes! Now such a computerized system, one in which the patient record is entirely electronic and therefore more accessible as well as more secure, is one step closer to reality.

For nearly two decades, discrete speech dictation systems have been available for computers. Three of the most popular manufacturers of these systems (Dragon Systems, IBM and Kurzweil) offer packages that many mental health professionals, especially psychiatrists, have tried. Often, though, professionals are disappointed by these systems' low dictation rate (the number of words per minute the system can successfully recognize) and relatively low recognition accuracy (the share of spoken words the software identifies correctly). There are important reasons why the discrete speech systems developed to date have been disappointing.

Discrete (or discontinuous) speech requires the user to pause between words, because this is how the software recognizes the beginning and end of each new word. Pausing after every spoken word is awkward and decreases the dictation rate (at best, 50 to 70 words per minute). Learning to dictate in this manner takes a lot of practice, which is a major reason why discrete speech systems are usually rejected soon after they are tried. Even clinicians who successfully adapt their speech to a discrete speech system rarely make full use of its potential because of its inherent awkwardness.
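To make the effect of dictation rate concrete, the short sketch below estimates how long it takes to dictate a report at the 50 to 70 words per minute quoted above for discrete speech, and at the 150 to 170 words per minute quoted later in this article for natural speech. The 1,000-word report length and the function name are assumptions chosen only for illustration.

```python
def minutes_to_dictate(word_count: int, words_per_minute: float) -> float:
    """Return the minutes needed to dictate word_count words at a steady rate."""
    if words_per_minute <= 0:
        raise ValueError("words_per_minute must be positive")
    return word_count / words_per_minute


# Hypothetical report length, chosen only for illustration.
REPORT_WORDS = 1_000

# Rates quoted in the article: discrete speech vs. natural (continuous) speech.
rates = {
    "discrete, slow (50 wpm)": 50,
    "discrete, best case (70 wpm)": 70,
    "continuous, low (150 wpm)": 150,
    "continuous, high (170 wpm)": 170,
}

for label, wpm in rates.items():
    print(f"{label}: {minutes_to_dictate(REPORT_WORDS, wpm):.1f} minutes")
```

Under these assumptions, the same report takes roughly three times as long to dictate with pauses as it does at a natural speaking rate, before any correction time is counted.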

Recognition accuracy also falters under discrete speech for a number of reasons. First, the word must be in the system's known vocabulary and the speaker must pronounce the word correctly. Discrete speech systems do not handle accents or unknown words very well. Second, speaking with pauses between words can lead the speaker to falter while trying to dictate. Third, most software developers do not offer mental health-specific vocabularies. (Dragon Systems is the notable exception.) As a result, most diagnoses, medications, and psychiatric and psychological terms currently used in our field are not recognized by the system.

Even as discrete speech recognition systems have become more commonplace despite their limitations (IBM will package this type of voice recognition system with the next release of its operating system, OS/2, later this year), researchers have turned to the development of continuous speech recognition software. While the theory of continuous speech recognition has been around for a while, serious research and development have occurred only in the past decade or so. The delay was mainly due to insufficient computing power to make continuous speech on the desktop a reality. The years of research and development have, however, finally begun to pay off; within the next year, a number of continuous speech recognition systems will become available.

Continuous speech systems offer much greater accuracy and ease of use because of their ability to recognize speech at its natural rate and intonation. They achieve this by recognizing distinct speech sounds, called phonemes, rather than relying on the speaker to pause between words. Algorithms based on hidden Markov models, a probabilistic technique, are combined with an English-language lexicon to create an acoustic model. It is from this acoustic model that distinct words are accurately extracted from natural speech.
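The idea of recovering phonemes with a hidden Markov model and then matching them against a lexicon can be illustrated with a toy example. The sketch below uses the Viterbi algorithm, a standard way to find the most likely sequence of hidden states, which is an assumption here since the article does not say which decoding method any product uses. The three phonemes, the frame labels "A", "B" and "C", every probability, and the one-word lexicon are invented for illustration and are not taken from a real system.

```python
import math
from itertools import groupby


def viterbi(observations, states, start_p, trans_p, emit_p):
    """Return (log probability, most likely hidden-state path) for the observations."""
    def log(p):
        return math.log(p) if p > 0 else float("-inf")

    # best[t][s] = (log prob of the best path ending in state s at time t, previous state)
    best = [{s: (log(start_p[s]) + log(emit_p[s][observations[0]]), None) for s in states}]
    for obs in observations[1:]:
        column = {}
        for s in states:
            column[s] = max(
                (best[-1][p][0] + log(trans_p[p][s]) + log(emit_p[s][obs]), p)
                for p in states
            )
        best.append(column)

    # Trace the best path backwards from the most likely final state.
    last = max(states, key=lambda s: best[-1][s][0])
    path = [last]
    for column in reversed(best[1:]):
        path.append(column[path[-1]][1])
    path.reverse()
    return best[-1][last][0], path


# Toy left-to-right model for the phonemes of "cat" (all values are made up).
states = ["k", "ae", "t"]
start_p = {"k": 1.0, "ae": 0.0, "t": 0.0}
trans_p = {
    "k": {"k": 0.5, "ae": 0.5, "t": 0.0},
    "ae": {"k": 0.0, "ae": 0.6, "t": 0.4},
    "t": {"k": 0.0, "ae": 0.0, "t": 1.0},
}
emit_p = {
    "k": {"A": 0.8, "B": 0.1, "C": 0.1},
    "ae": {"A": 0.1, "B": 0.8, "C": 0.1},
    "t": {"A": 0.1, "B": 0.1, "C": 0.8},
}
lexicon = {("k", "ae", "t"): "cat"}  # hypothetical one-word lexicon

frames = ["A", "A", "B", "B", "C"]  # stand-ins for quantized acoustic frames
log_prob, path = viterbi(frames, states, start_p, trans_p, emit_p)

# Collapse runs of the same phoneme, then look the sequence up in the lexicon.
phonemes = tuple(key for key, _ in groupby(path))
word = lexicon.get(phonemes, "<unknown>")
print(path, "->", word)
```

Because the speaker may hold each sound for a different number of frames, the model allows a phoneme to repeat; no pause between words is needed to find where one sound ends and the next begins. A real acoustic model would work with continuous audio features and a far larger lexicon.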

Accuracy is further enhanced and refined through the use of a language model. The language model is based upon the ways in which professionals communicate through unique words and word combinations. For instance, Philips Dictation Systems, in conjunction with Voice Input Technologies, a division of CMHC Systems, has developed a system that uses a mental health language model to accurately recognize psychiatric and psychological dictation. This model is built around samples of commonly used documents such as psychiatric evaluations, prescriptions, psychological evaluations or intake evaluations. Built from more than 4 million words of such text, the system can achieve accuracy rates of up to 95% and a dictation rate of up to 150 to 170 words per minute, about the normal speaking rate of most people.
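One simple way to see how a language model built from professional documents can help is a bigram model, which scores a phrase by how often each word follows the previous one in the training text. The sketch below is only an illustration of that general idea: the article does not describe how the Philips model is built, and the three training sentences, the candidate transcriptions and the function names are all hypothetical.

```python
import math
from collections import Counter


def train_bigram_model(sentences):
    """Count single words and adjacent word pairs in the training sentences."""
    unigrams, bigrams = Counter(), Counter()
    for sentence in sentences:
        tokens = ["<s>"] + sentence.lower().split()
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))
    return unigrams, bigrams


def log_probability(sentence, unigrams, bigrams):
    """Score a sentence with add-one smoothed bigram probabilities."""
    tokens = ["<s>"] + sentence.lower().split()
    vocab_size = len(unigrams)
    total = 0.0
    for prev, word in zip(tokens, tokens[1:]):
        total += math.log((bigrams[(prev, word)] + 1) / (unigrams[prev] + vocab_size))
    return total


# Hypothetical stand-ins for the clinical documents used to build the model.
training_text = [
    "patient meets criteria for major depressive disorder",
    "patient was prescribed sertraline for major depressive disorder",
    "patient denies suicidal ideation",
]
unigrams, bigrams = train_bigram_model(training_text)

# Two transcriptions that might sound alike to the acoustic model.
candidates = [
    "patient meets criteria for major depressive disorder",
    "patient meets criteria for major depressive this order",
]
best = max(candidates, key=lambda s: log_probability(s, unigrams, bigrams))
print(best)
```

The candidate whose word pairs appear in the training text scores higher, so the language model steers the recognizer toward phrasing that clinicians actually use. A model trained on millions of words works on the same principle at a much larger scale.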

While some discrete dictation systems claim to be speaker-independent, all systems on the market today benefit from being trained to a particular speaker's voice. Because of the way discrete systems work, however, their recognition of nonnative English speakers is less than optimal. A continuous speech system, on the other hand, uses a short initial training session to attune itself to the speaker's pronunciation of phonemes within the English vocabulary. Because continuous speech systems recognize words from their phonemes rather than as whole units, they are much more accurate with thick or unusual accents.

There will likely be two continuous speech recognition systems available by year's end, with many more to follow in 1997. As mentioned, one will be marketed by Voice Input Technologies specifically for mental health professionals. Because the technology is new, the system is relatively expensive: $15,000 will buy the personal computer hardware and software. Still, this cost must be compared with the price of a transcription service, not with the cost of a typical computer. Another company actively working on a continuous speech system is "fonix" Corp., which promises a product by year's end that is marketed toward integration with other, existing software (such as a word processor). Other companies working on this technology include IBM, Dragon Systems and Kurzweil; Microsoft is also working on continuous speech for the masses. Most of these companies will have offerings in 1997 that will be refined by 1998. These systems likely will be marketed and designed for general consumer use in mainstream word processing. They probably will not have vocabularies and contexts built-in for specific professions.

Mental health professionals who are considering the purchase of a dictation system may be better off waiting until continuous speech systems are available commercially, given their superior technology, increased accuracy and natural language features. Someday, virtually all interactions with computers and technology will be voice-driven and feature natural language recognition and processing. In the meantime, technology has finally delivered a truly useful system with the potential to change the way therapists dictate and store patient records. The future of human interaction with computers is speech; the keyboard's days are numbered.