HTK Speech Recognition Toolkit

HTK History

1. Development and Releases of HTK

The first version of the HTK Hidden Markov Model Toolkit was developed at the Speech Vision and Robotics Group of the Cambridge University Engineering Department (CUED) in 1989 by Steve Young. HTK was (and is) a set of C library modules and tools that was initially used for speech recognition research (using continuous density HMMs) within the Speech Vision and Robotics Group at Cambridge. It soon became apparent that the software was of general interest and the early versions (up to version 1.2) were distributed in source form at media cost.

In early 1992, Phil Woodland joined Steve as co-developer of HTK, and versions of HTK (from V1.3) were sold by the University (via the University company Lynxvale) at a cost of 450 pounds for a source site license for academic sites and 950 pounds for companies. At this time bug reports and code development were handled by Steve and Phil. Since this maintenance load was becoming large as the number of users grew, an agreement was reached with Entropic Research Laboratories (ERL) to take over distribution and maintenance of HTK from the start of 1993.

The first ERL HTK release was version 1.4D and included a license manager, and the ability to purchase support. At this time the cost of the software was significantly increased. Steve and Phil continued to work on HTK updates. Various bug-fixes and integration with other ERL products (ESPS/xwaves) were added by ERL, primarily by Bill Byrne. Version 1.5 was released in October 1993.

In 1995, the Entropic Cambridge Research Laboratory (ECRL) was established as a joint venture between ERL and Cambridge University. Steve and Phil were the technical directors at ECRL. HTK V2.0 was developed jointly at the University and at ECRL and released in October 1995. It represented a major redesign of many of the library modules and tools and added further refinements. Major contributors to HTK 2.0 at ECRL were Julian Odell, Valtcho Valtchev and Dave Ollason. After 1996, there was a slower rate of development of the core HTK product at Entropic as the company developed an API (HAPI, later distributed with HTK) and commercial quality large vocabulary decoders. Entropic gradually changed its main business focus away from a supplier of research tools such as HTK.

By 1999, the current version of HTK was V2.2 and all rights to HTK rested with Entropic. At this time Entropic's major business focus was voice-enabling the Web, and Microsoft purchased Entropic in November 1999. A final Entropic release of HTK, V2.2_ref, was then produced which incorporated bug fixes and removed the need for a license manager, enabling site-wide use of HTK for all of Entropic's HTK licensees.

Over the years, HTK has been in use by hundreds of sites worldwide, and has a loyal following in the speech research community (and beyond). Microsoft therefore decided to make the core HTK toolkit available again and licensed the software back to CUED so that it could distribute and develop the software.

From September 2000, HTK became available in source form at no cost from a CUED Web site (htk.eng.cam.ac.uk) with the intention of further developing it as a speech recognition research platform. The initial release of HTK 3 was based on the final Entropic release, but contained a few minor bug fixes. However, the intention is to develop HTK further and provide infrastructure support for investigating state-of-the-art speech recognition, and other sites are encouraged to make available additions to the core HTK functions.

2. Major Features of Various HTK Releases

This section gives an overview of which features in HTK were added when. It is necessarily very brief.

Version 1.0: Initial CUED-internal release. Small amount of documentation. Initial definition of libraries and tools. Support for diagonal and full covariance Gaussian mixture HMMs.
Version 1.1: First released version. Improved reference manual (48 pages). Cached computation of output probabilities.
Version 1.2: Added automatic parameter coercion and byte swapping; pruning added to HERest and HVite; support for tied output distributions.
Version 1.3: Arbitrary HMM parameter tying; multiple data streams; MFCC analysis; qualifiers for delta coefs; logical to physical HMM mapping via HMM lists; extensive tracing options added; HHEd created. User, reference and programmer manuals created.
Version 1.4: Sub-word based word recognition supported; faster/smaller word-pair grammars in HParse; tee models for inter-word silence; acceleration coefs; robust state clustering; variance floor macro; X-windows version of HGraf and HSLab added; error codes improved and documented. Support for all features of 1992 CUED Resource Management evaluation system. V1.4D added support for ESPS FEA files and included a license manager.
Version 1.5: Master model files (MMFs) and master label files (MLFs) added, essential for large-scale systems; forced alignment; parameter file compression; cepstral mean normalisation; addition of the RM recipe. Total documentation ran to 286 pages.
Version 2.0: Major redesign of many library modules and tools. Documentation via the HTK Book. Support for discrete density HMMs. Complete rewrite of recognition tools using new lattice-based grammar format (HNet/HRec modules). Support for cross-word triphones; lattice and N-best recognition output; and back-off bigram language models. Decision-tree state clustering. Redesigned speech and audio input (HWave/HParm) to support coercion from waveform and real-time audio input. Configuration files.
Version 2.1: HParm partially re-designed and an energy based silence detector included. HNet optimised. Pronunciation probabilities in HVite. Automatic byte swapping for all binary file formats. Support for Microsoft WAV format.
Version 2.2: HEAdapt included for MLLR (mean and variance) and MAP adaptation. HVite also supports adaptation.
Version 3.0: Code based on 2.2 release with minor bug fixes and C++ compatibility. The major changes are the new licensing and distribution arrangements.
Version 3.1: Perceptual Linear Prediction (PLP) frontend implemented; support for Vocal Tract Length Normalisation (VTLN) and cluster-based cepstral mean and variance normalisation added.
Version 3.2: HLM language modelling toolkit integrated. HLRescore lattice post-processing tool added. Support for global feature space transforms. 2-model re-estimation in HERest.
Version 3.3: Adaptation code rewritten and extended; supports MLLR, Constrained MLLR and variance transforms. In addition, Speaker Adaptive Training with Constrained MLLR added; HERest replaces HEAdapt as the tool to generate linear transforms.
Version 3.4: Discriminative training, both MPE and MMI, using HMMIRest added. Code for estimating Semi-Tied and HLDA transforms added to HERest. A large vocabulary decoder (HDecode) that supports trigram decoding with cross-word triphone models added as an extension to HTK V3.4. HDecode is distributed under a more restrictive license than the main code-base.

3. Other HTK-Related Software

Entropic produced and sold various products that were related to HTK. These included HAPI (HTK API) which was bundled with HTK in later versions. Other Entropic-produced software included Graphvite (a graphical grammar builder/tester) and Transcriber (a large vocabulary recognition engine and toolkit). None of these are included in the HTK 3 release.

At various points in time other HTK software has been produced at CUED and released on a restricted basis. This has included the Lattice Toolkit and a large vocabulary decoder called JRlx written by Julian Odell. None of this software is part of the HTK 3 release although HLRescore and HDecode support much of the functionality.

Phil Woodland
September 2000
Gunnar Evermann
December 2002
Mark Gales
December 2006