For example, if each leaf node of the decision trees forms a regression
class, the number of regression classes reaches the tens of thousands.  In
the current HTK implementation, the outer product of an observation vector
is first stored in bDiagMat of the RegAccs of all regression classes, as
follows:

   for (b=1;b<=bclass->numClasses;b++) {
      ...
      /* now update the outerproduct observation */
      if (svec != NULL) {
        nblock = (int)(ra->bDiagMat[0]);
        for (bl=1, cnt=1; bl<=nblock;bl++){
          bsize = TriMatSize(ra->bDiagMat[bl]);
          m = ra->bDiagMat[bl];
          for (i=1, cnti=cnt; i<=bsize; i++,cnti++) { /* Fill the outer product */
            for (j=1,cntj=cnt; j<=i; j++,cntj++)
              m[i][j] = svec[cnti]*svec[cntj];
          }
          cnt +=bsize;
        }
      }
   }

The outer product is then accumulated into bTriMat of the RegAccs of each
regression class, but only if that class's bVector[1]>0, as follows:

      if ((ra->bTriMat != NULL) && (ra->bVector[1]>0)) {
         acc = ra->bVector;
         nblock = (int)(ra->bDiagMat[0]);
         for (bl=1,cnti=1;bl<=nblock;bl++) {
           m = ra->bDiagMat[bl];
           bsize = TriMatSize(m);
           for (i=1;i<=bsize;i++,cnti++) { /* Fill the accumulate stores */
             tm = ra->bTriMat[cnti];
             for (j=1; j<=bsize; j++)
               for (k=1; k<=j; k++)
                 tm[j][k] += m[j][k] * acc[cnti];
           }
         }
         ZeroDVector(ra->bVector);
      }

As you can see, this is computationally expensive.  If the number of
regression classes is large, calculating and storing the outer products
takes a long time.  But bVector[1] is 0 for most regression classes, so
most of the outer-product computations are wasted.
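
One possible way to avoid the wasted work is to check the occupancy before
computing the outer product, rather than after.  The following is only a
minimal sketch of that idea and is not HTK code: ToyRegAcc, DIM, TRI,
outer_product_lower and accumulate_guarded are hypothetical names, a single
occ value stands in for bVector[1], one packed lower triangle stands in for
bDiagMat/bTriMat, and indices are 0-based instead of HTK's 1-based vectors.

```c
#include <assert.h>
#include <stddef.h>

#define DIM 3                      /* toy feature dimension (assumption) */
#define TRI (DIM * (DIM + 1) / 2)  /* entries in a packed lower triangle */

/* Hypothetical stand-in for HTK's RegAcc; NOT the real structure. */
typedef struct {
    double occ;       /* plays the role of bVector[1] */
    double acc[TRI];  /* plays the role of one bTriMat accumulator */
} ToyRegAcc;

/* Store the lower triangle of svec * svec^T, row by row, into outer[]. */
static void outer_product_lower(const double *svec, double *outer)
{
    size_t idx = 0;
    for (size_t i = 0; i < DIM; i++)
        for (size_t j = 0; j <= i; j++)
            outer[idx++] = svec[i] * svec[j];
}

/* Accumulate one observation into every class that has occupancy.
 * Classes with occ <= 0 are skipped BEFORE the outer product is formed.
 * Returns how many outer products were actually computed. */
static size_t accumulate_guarded(ToyRegAcc *ra, size_t num_classes,
                                 const double *svec)
{
    size_t computed = 0;

    assert(ra != NULL && svec != NULL);

    for (size_t b = 0; b < num_classes; b++) {
        double outer[TRI];

        if (ra[b].occ <= 0.0)
            continue;                  /* nothing to accumulate: skip */

        outer_product_lower(svec, outer);
        computed++;

        for (size_t k = 0; k < TRI; k++)
            ra[b].acc[k] += outer[k] * ra[b].occ;

        ra[b].occ = 0.0;               /* mirrors ZeroDVector(ra->bVector) */
    }
    return computed;
}
```

With, say, four classes of which only one has non-zero occupancy, the
function forms a single outer product instead of four.  Whether the same
reordering is safe inside HTK itself depends on details of the RegAcc
handling that are not shown in the excerpts above.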