HTK History
1. Development and Releases of HTK
The first version of the HTK Hidden Markov Model Toolkit was developed
at the Speech Vision and Robotics Group of the Cambridge University
Engineering Department (CUED) in 1989 by Steve Young. HTK was (and is)
a set of C library modules and tools that were initially used for
speech recognition research (using continuous density HMMs) within the
Speech Vision and Robotics Group at Cambridge. It soon became apparent
that the software was of general interest, and the early versions (up
to version 1.2) were distributed in source form at media cost.
In early 1992, Phil Woodland joined Steve as co-developer of HTK, and
versions of HTK from V1.3 onwards were sold by the University (via the
University company Lynxvale) at a cost of 450 pounds for a source site
license for academic sites and 950 pounds for companies. At this time,
bug reports and code development were handled by Steve and Phil. Since
this maintenance load was becoming large as the number of users grew,
an agreement was reached with Entropic Research Laboratories (ERL) to
take over distribution and maintenance of HTK from the start of 1993.
The first ERL HTK release was version 1.4D, which included a license
manager and the ability to purchase support. At this time the cost of
the software was significantly increased. Steve and Phil continued to
work on HTK updates. Various bug-fixes and integration with other ERL
products (ESPS/xwaves) were added by ERL, primarily by Bill Byrne.
Version 1.5 was released in October 1993.
In 1995, the Entropic Cambridge Research Laboratory (ECRL) was
established as a joint venture between ERL and Cambridge
University. Steve and Phil were the technical directors at ECRL. HTK
V2.0 was developed jointly at the University and at ECRL and released
in October 1995. It represented a major redesign of many of the
library modules and tools and added further refinements. Major
contributors to HTK 2.0 at ECRL were Julian Odell, Valtcho Valtchev
and Dave Ollason. After 1996, there was a slower rate of development
of the core HTK product at Entropic as the company developed an API
(HAPI: later distributed with HTK) and commercial-quality large
vocabulary decoders. Entropic gradually shifted its main business
focus away from supplying research tools such as HTK.
By 1999, the current version of HTK was V2.2 and all rights to HTK
rested with Entropic. At this time Entropic's major business focus was
voice-enabling the Web, and Microsoft purchased Entropic in November
1999. A final Entropic release of HTK, V2.2_ref, was then produced
which incorporated bug fixes and removed the need for a license
manager, enabling site-wide use of HTK for all of Entropic's HTK
licensees.
Over the years, HTK has been in use by hundreds of sites worldwide,
and has a loyal following in the speech research community (and
beyond). Microsoft therefore decided to make the core HTK toolkit
available again and licensed the software back to CUED so that it
could distribute and develop the software.
From September 2000, HTK became available in source form at no cost
from a CUED Web site (htk.eng.cam.ac.uk) with the intention of further
developing it as a speech recognition research platform. The initial
release of HTK 3 was based on the final Entropic release but
contained a few minor bug fixes. However, the intention is to develop
HTK further and provide infrastructure support for investigating
state-of-the-art speech recognition, and other sites are encouraged to
make their own additions to the core HTK functions available.
2. Major Features of Various HTK Releases
This section gives an overview of which features were added to HTK in
each release. It is necessarily very brief.
- Version 1.0: Initial CUED-internal release. Small amount of
documentation. Initial definition of libraries and
tools. Support for diagonal and full covariance
Gaussian mixture HMMs.
- Version 1.1: First released version. Improved reference manual (48 pages).
Cached computation of output probabilities.
- Version 1.2: Added automatic parameter coercion and byte swapping; pruning
added to HERest and HVite; support for tied output distributions.
- Version 1.3: Arbitrary HMM parameter tying; multiple data streams;
MFCC analysis; qualifiers for delta coefs; logical-to-physical
HMM mapping via HMM lists; extensive tracing options added;
HHEd created. User, reference and programmer manuals created.
- Version 1.4: Sub-word based word recognition supported; faster/smaller
word-pair grammars in HParse; tee models for inter-word
silence; acceleration coefs; robust state clustering;
variance floor macro; X-windows version of HGraf and HSLab
added; error codes improved and documented. Support for
all features of the 1992 CUED Resource Management evaluation system.
V1.4D added support for ESPS FEA files and included a license
manager.
- Version 1.5: Master model files (MMFs) and master label files (MLFs) added,
essential for large-scale systems; forced alignment; parameter
file compression; cepstral mean normalisation; addition of the
RM recipe. Total documentation ran to 286 pages.
- Version 2.0: Major redesign of many libraries and tools. Documentation via the
HTK Book. Support for discrete density HMMs. Complete rewrite of
recognition tools using new lattice-based grammar format (HNet/HRec modules).
Support for cross-word triphones; lattice and N-best recognition output;
and back-off bigram language models. Decision-tree state clustering.
Redesigned speech and audio input (HWave/HParm) to
support coercion from waveform and real-time audio
input. Configuration files.
- Version 2.1: HParm partially re-designed and an energy based silence detector included.
HNet optimised. Pronunciation probabilities in HVite. Automatic byte
swapping for all binary file formats. Support for Microsoft WAV format.
- Version 2.2: HEAdapt included for MLLR (mean and variance) and MAP adaptation. HVite
also supports adaptation.
- Version 3.0: Code based on 2.2 release with minor bug fixes and C++ compatibility.
The major changes are the new licensing and distribution arrangements.
- Version 3.1: Perceptual Linear Prediction (PLP) frontend
implemented; support for Vocal Tract Length Normalisation (VTLN)
and cluster-based cepstral mean and variance normalisation added.
- Version 3.2: HLM language modelling toolkit integrated.
HLRescore lattice post-processing tool added. Support for global
feature space transforms. 2-model re-estimation in HERest.
- Version 3.3: Adaptation code rewritten and extended; supports MLLR, Constrained
MLLR and variance transforms. In addition, Speaker Adaptive Training with Constrained MLLR
added; HERest replaces HEAdapt as the tool to generate linear transforms.
- Version 3.4: Discriminative training (both MPE and MMI) using HMMIRest added.
Code for estimating Semi-Tied and HLDA transforms added to HERest. A large vocabulary
decoder (HDecode) that supports trigram decoding with cross-word triphone models added
as an extension to HTK V3.4. HDecode is distributed under a more restrictive license
than the main code-base.
3. Other HTK-Related Software
Entropic produced and sold various products that were related to
HTK. These included HAPI (HTK API) which was bundled with HTK in later
versions. Other Entropic-produced software included Graphvite (a
graphical grammar builder/tester) and Transcriber (a large vocabulary
recognition engine and toolkit). None of these are included in the HTK
3 release.
At various points in time other HTK software has been produced at CUED
and released on a restricted basis. This has included the Lattice
Toolkit and a large vocabulary decoder called JRlx written by Julian
Odell. None of this software is part of the HTK 3 release although
HLRescore and HDecode support much of the functionality.
Phil Woodland
September 2000
Gunnar Evermann
December 2002
Mark Gales
December 2006