kaldi 解码流程

前言

记录一下梳理大致kaldi解码流程

chain的模型(语音模型)

训练输出final.mdl 里面包含了两个模型一个是TransitionModel 另一个是nnet3(具体的chain)的模型

TransitionModel trans_model;
nnet3::AmNnetSimple am_nnet;
{
      bool binary;
      Input ki(nnet3_rxfilename, &binary);
      trans_model.Read(ki.Stream(), binary);
      am_nnet.Read(ki.Stream(), binary);
      SetBatchnormTestMode(true, &(am_nnet.GetNnet()));
      SetDropoutTestMode(true, &(am_nnet.GetNnet()));
      nnet3::CollapseModel(nnet3::CollapseModelConfig(), &(am_nnet.GetNnet()));
 }

上述代码可以看出加载了两种模型

TransitionModel

hmm-info final.mdl

通过hmm-info可以看到 final.mal中TransitionModel 的一些基本

hmm-info ./final.mdl 
number of phones 479 
number of pdfs 8024
number of transition-ids 60134
number of transition-states 30067

就是一些索引的范围

// List of the possible mappings TransitionModel can do:
//   (phone, HMM-state, forward-pdf-id, self-loop-pdf-id) -> transition-state
//                   (transition-state, transition-index) -> transition-id
//  Reverse mappings:
//                        transition-id -> transition-state
//                        transition-id -> transition-index
//                     transition-state -> phone
//                     transition-state -> HMM-state
//                     transition-state -> forward-pdf-id
//                     transition-state -> self-loop-pdf-id

可在transition-model.h找到上述注释，所以可以理解为TransitionModel里面是个map可以实现 phone transition-id HMM-state forward-pdf-id 相互映射(转化)

nnet3

nnet3-info final.mdl

会输出nnet3的模型结构可以看到最后输出的维度是和TransitionModel中的pdfs一样的可以理解nnet3的输出是每种pdfs的可能性（概率）

语言模型(HCLG.fst)

他是由4个模型通过。。。。而来的

G: 可以理解为语法模型但是实际上输入输出都一样都是词索引
L: 可以理解为词法模型是音到词的词法模型输入音索引输出词索引
C: 上下文状态到音索引的模型
H: HMM的释义到上下文状态的模型

fst::Fst<fst::StdArc> *decode_fst = ReadFstKaldiGeneric(fst_rxfilename);

特征提取器

将pcm语音数据转成nnet3所用的特征矩阵

OnlineNnet2FeaturePipeline feature_pipeline(feature_info);

像个管道一样输入音频输出特征

解码的大致过程

audio(pcm data)=>特征提取器=>语音模型=>语言模型=>词序列

特征提取

feature_pipeline.AcceptWaveform(samp_freq, wave_part);

具体实现 file:online-nnet2-feature-pipeline.cc

void OnlineNnet2FeaturePipeline::AcceptWaveform(
    BaseFloat sampling_rate,
    const VectorBase<BaseFloat> &waveform) {
  base_feature_->AcceptWaveform(sampling_rate, waveform);
  if (pitch_)
    pitch_->AcceptWaveform(sampling_rate, waveform);
}

主要有两部分 base_feature_ 跟 pitch(根据参数决定是否启用) base_feature_ 常见三种 OnlineNnet2FeaturePipeline构造时实现的

if (info_.feature_type == "mfcc") {
    base_feature_ = new OnlineMfcc(info_.mfcc_opts);
  } else if (info_.feature_type == "plp") {
    base_feature_ = new OnlinePlp(info_.plp_opts);
  } else if (info_.feature_type == "fbank") {
    base_feature_ = new OnlineFbank(info_.fbank_opts);
  } else {
    KALDI_ERR << "Code error: invalid feature type " << info_.feature_type;
  }

具体实现可以去看看src/feat/feature-{name}.c ps:ivector的特征就不说了

解码

decoder.AdvanceDecoding();

其实把语音语言模型的计算都包装了在一起

关键在两个类 decodable-online-looped.h 中的DecodableAmNnetLoopedOnline 用于语音模型的计算以及保存输出

lattice-faster-decoder.h 中的LatticeFasterDecoderTpl<FST, Token> 对于chain的模型 LatticeFasterDecoderTpl中是用了DecodableAmNnetLoopedOnline

decodable->LogLikelihood(frame, arc.ilabel)

decodable是DecodableAmNnetLoopedOnline的实例 LogLikelihood就是获取nnet3模型输出的对应帧数据的计算结果第一个参数代表帧索引第二个参数是TransitionId的索引

BaseFloat DecodableAmNnetLoopedOnline::LogLikelihood(int32 subsampled_frame,
                                                    int32 index) {
  subsampled_frame += frame_offset_;
  EnsureFrameIsComputed(subsampled_frame);
  return current_log_post_(
      subsampled_frame - current_log_post_subsampled_offset_,
      trans_model_.TransitionIdToPdfFast(index));
}

current_log_post_ 是个二维矩阵因为nnet3 输出的 pdfs的概率所以要先将TransitionId转成pdf的索引具体就是靠TransitionModel

其中EnsureFrameIsComputed(subsampled_frame); 是检查这帧数据是否被nnet3计算过的如果没有就计算，然后再取结果，计算过程在下面的函数中实现的

void DecodableNnetLoopedOnlineBase::AdvanceChunk()

配合HCLG.fst，由此计算出Lattice 具体实现可沿着

void LatticeFasterDecoderTpl<FST, Token>::AdvanceDecoding(DecodableInterface *decodable,
                                                int32 max_num_frames)

看进去

后续

通过Lattice找出最优解，然后转成词索引数组，接着配合词典得到句子