Commit 5b0ac742 authored by 翟艳秋(20软)

1. [add] Add an icon to the window;
2. [modified] Add a recommended narration word count to the ASR-based output, and normalize start/end times to two decimal places;
3. [modified] Adjust where the temporary audio files from the audio synthesis step are stored;
4. [modified] Enable automatic line wrapping in the output tables
parent 945a2b39
Subproject commit 081f7807a2ce0e12b98e6f0a0da0e650133f2d9e
# DeepSpeech2 Speech Recognition
![License](https://img.shields.io/badge/license-Apache%202-red.svg)
![python version](https://img.shields.io/badge/python-3.7+-orange.svg)
![support os](https://img.shields.io/badge/os-linux-yellow.svg)
![GitHub Repo stars](https://img.shields.io/github/stars/yeyupiaoling/PaddlePaddle-DeepSpeech?style=social)
This project is developed on top of PaddlePaddle's [DeepSpeech](https://github.com/PaddlePaddle/DeepSpeech) project, with substantial modifications to make it easy to train on custom Chinese datasets and convenient to test and use. DeepSpeech2 is an end-to-end automatic speech recognition (ASR) engine implemented with PaddlePaddle; the underlying paper is [《Baidu's Deep Speech 2 paper》](http://proceedings.mlr.press/v48/amodei16.pdf). The project also supports a variety of data augmentation methods to suit different usage scenarios. Training and inference are supported on Windows and Linux, and inference is supported on development boards such as Nvidia Jetson. This branch is the new version; to use the old version, see the [release/1.0 branch](https://github.com/yeyupiaoling/PaddlePaddle-DeepSpeech/tree/release/1.0).
Environment used by this project:
- Python 3.7
- PaddlePaddle 2.2.0
- Windows or Ubuntu
## Changelog
- 2021.11.26: Fixed a beam search decoding bug.
- 2021.11.09: Added a script for preparing the WenetSpeech dataset.
- 2021.09.05: Added GUI-based recognition deployment.
- 2021.09.04: Released pre-trained models for three public datasets.
- 2021.08.30: Support converting Chinese numerals to Arabic numerals; see the [inference docs](docs/infer.md).
- 2021.08.29: Completed the training and inference code and improved the related documentation.
- 2021.08.07: Support exporting an inference model and running inference with it; long-audio recognition implemented with the webrtcvad tool.
- 2021.08.06: Migrated most of the project code to the new APIs introduced after PaddlePaddle 2.0.
## Model Downloads
| Dataset | Conv layers | RNN layers | RNN size | Test CER | Download |
| :---: | :---: | :---: | :---: | :---: | :---: |
| aishell (179 hours) | 2 | 3 | 1024 | 0.084532 | [Download](https://download.csdn.net/download/qq_33200967/21773253) |
| free_st_chinese_mandarin_corpus (109 hours) | 2 | 3 | 1024 | 0.170260 | [Download](https://download.csdn.net/download/qq_33200967/21866900) |
| thchs_30 (34 hours) | 2 | 3 | 1024 | 0.026838 | [Download](https://download.csdn.net/download/qq_33200967/21774247) |
| Large dataset (1600+ hours of real data + 1300+ hours of synthesized data) | 2 | 3 | 1024 | in training | [in training]() |
**Note:** The downloads above are training checkpoints. To use them for inference you still need to [export the model](docs/export_model.md); the decoding method used is beam search.
> Questions are welcome, please open an [issue](https://github.com/yeyupiaoling/PaddlePaddle-DeepSpeech/issues) to discuss.
## Documentation
- [Quick installation](docs/install.md)
- [Data preparation](docs/dataset.md)
- [WenetSpeech dataset](docs/wenetspeech.md)
- [Synthesizing speech data](docs/generate_audio.md)
- [Data augmentation](docs/augment.md)
- [Training the model](docs/train.md)
- [Beam search decoding](docs/beam_search.md)
- [Running evaluation](docs/eval.md)
- [Exporting the model](docs/export_model.md)
- Inference
  - [Local model](docs/infer.md)
  - [Long-audio model](docs/infer.md)
  - [Web deployment](docs/infer.md)
- [Nvidia Jetson deployment](docs/nvidia-jetson.md)
## Quick Inference
- Download one of the models above or train your own, then [export the model](docs/export_model.md). Use `infer_path.py` to recognize an audio file, passing its path with the `--wav_path` argument. See [model deployment](docs/infer.md) for details.
```shell script
python infer_path.py --wav_path=./dataset/test.wav
```
Output:
```
----------- Configuration Arguments -----------
alpha: 1.2
beam_size: 10
beta: 0.35
cutoff_prob: 1.0
cutoff_top_n: 40
decoding_method: ctc_greedy
enable_mkldnn: False
is_long_audio: False
lang_model_path: ./lm/zh_giga.no_cna_cmn.prune01244.klm
mean_std_path: ./dataset/mean_std.npz
model_dir: ./models/infer/
to_an: True
use_gpu: True
use_tensorrt: False
vocab_path: ./dataset/zh_vocab.txt
wav_path: ./dataset/test.wav
------------------------------------------------
消耗时间:132, 识别结果: 近几年不但我用书给女儿儿压岁也劝说亲朋不要给女儿压岁钱而改送压岁书, 得分: 94
```
- Long-audio prediction
```shell script
python infer_path.py --wav_path=./dataset/test_vad.wav --is_long_audio=True
```
- Web deployment
![Recording test page](docs/images/infer_server.jpg)
- GUI deployment
![GUI](docs/images/infer_gui.jpg)
## Related Projects
- Voiceprint recognition based on PaddlePaddle: [VoiceprintRecognition-PaddlePaddle](https://github.com/yeyupiaoling/VoiceprintRecognition-PaddlePaddle)
- Speech recognition based on PaddlePaddle dynamic graphs: [PPASR](https://github.com/yeyupiaoling/PPASR)
- Speech recognition based on PyTorch: [MASR](https://github.com/yeyupiaoling/MASR)
[
{
"type": "noise",
"aug_type": "audio",
"params": {
"min_snr_dB": 10,
"max_snr_dB": 50,
"noise_manifest_path": "dataset/manifest.noise"
},
"prob": 0.5
},
{
"type": "speed",
"aug_type": "audio",
"params": {
"min_speed_rate": 0.9,
"max_speed_rate": 1.1,
"num_rates": 3
},
"prob": 1.0
},
{
"type": "shift",
"aug_type": "audio",
"params": {
"min_shift_ms": -5,
"max_shift_ms": 5
},
"prob": 1.0
},
{
"type": "volume",
"aug_type": "audio",
"params": {
"min_gain_dBFS": -15,
"max_gain_dBFS": 15
},
"prob": 1.0
},
{
"type": "specaug",
"aug_type": "feature",
"params": {
"W": 0,
"warp_mode": "PIL",
"F": 10,
"n_freq_masks": 2,
"T": 50,
"n_time_masks": 2,
"p": 1.0,
"adaptive_number_ratio": 0,
"adaptive_size_ratio": 0,
"max_n_time_masks": 20,
"replace_with_zero": true
},
"prob": 1.0
}
]
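A minimal sketch of how a configuration like the JSON above is consumed. The file name and the module path of `AugmentationPipeline` are assumptions based on this repository's layout (the class itself appears later in this commit):

```python
# Sketch only: load the augmentation config and build the pipeline.
# The module path below is assumed from this repository's layout.
from data_utils.augmentor.augmentation import AugmentationPipeline

with open('conf/augmentation.json', 'r', encoding='utf-8') as f:
    augmentation_config = f.read()

# "aug_type": "audio" entries are applied to AudioSegment objects via
# transform_audio(); "aug_type": "feature" entries (specaug) are applied to
# the spectrogram via transform_feature().
pipeline = AugmentationPipeline(augmentation_config, random_seed=0)
```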
import argparse
import functools
import json
import os
import wave
from collections import Counter
from zhconv import convert
import numpy as np
import soundfile
from tqdm import tqdm
from data_utils.normalizer import FeatureNormalizer
from utils.utility import add_arguments, print_arguments, read_manifest, change_rate
parser = argparse.ArgumentParser(description=__doc__)
add_arg = functools.partial(add_arguments, argparser=parser)
add_arg('annotation_path', str, 'dataset/annotation/', '标注文件的路径,如果annotation_path包含了test.txt,就全部使用test.txt的数据作为测试数据')
add_arg('manifest_prefix', str, 'dataset/', '训练数据清单,包括音频路径和标注信息')
add_arg('is_change_frame_rate', bool, True, '是否统一改变音频为16000Hz,这会消耗大量的时间')
add_arg('max_test_manifest', int, 10000, '最大的测试数据数量')
add_arg('count_threshold', int, 2, '字符计数的截断阈值,0为不做限制')
add_arg('vocab_path', str, 'dataset/zh_vocab.txt', '生成的数据字典文件')
add_arg('num_workers', int, 8, '读取数据的线程数量')
add_arg('manifest_paths', str, 'dataset/manifest.train', '数据列表路径')
add_arg('num_samples', int, 1000000, '用于计算均值和标准值的音频数量,当为-1时使用全部数据')
add_arg('output_path', str, './dataset/mean_std.npz', '保存均值和标准值的numpy文件路径,后缀 (.npz)')
args = parser.parse_args()
# 创建数据列表
def create_manifest(annotation_path, manifest_path_prefix):
data_list = []
test_list = []
durations_all = []
duration_0_10 = 0
duration_10_20 = 0
duration_20 = 0
# 获取全部的标注文件
for annotation_text in os.listdir(annotation_path):
durations = []
print('正在创建%s的数据列表,请等待 ...' % annotation_text)
annotation_text_path = os.path.join(annotation_path, annotation_text)
# 读取标注文件
with open(annotation_text_path, 'r', encoding='utf-8') as f:
lines = f.readlines()
for line in tqdm(lines):
audio_path = line.split('\t')[0]
try:
# 过滤非法的字符
text = is_ustr(line.split('\t')[1].replace('\n', '').replace('\r', ''))
# 保证全部都是简体
text = convert(text, 'zh-cn')
# 重新调整音频格式并保存
if args.is_change_frame_rate:
change_rate(audio_path)
# 获取音频的长度
f_wave = wave.open(audio_path, "rb")
duration = f_wave.getnframes() / f_wave.getframerate()
if duration <= 10:
duration_0_10 += 1
elif 10 < duration <= 20:
duration_10_20 += 1
else:
duration_20 += 1
durations.append(duration)
d = json.dumps(
{
'audio_filepath': audio_path.replace('\\', '/'),
'duration': duration,
'text': text
},
ensure_ascii=False)
if annotation_text == 'test.txt':
test_list.append(d)
else:
data_list.append(d)
except Exception as e:
print(e)
continue
durations_all.append(sum(durations))
print("%s数据一共[%d]小时!" % (annotation_text, int(sum(durations) / 3600)))
print("0-10秒的数量:%d,10-20秒的数量:%d,大于20秒的数量:%d" % (duration_0_10, duration_10_20, duration_20))
# 将音频的路径,长度和标签写入到数据列表中
f_train = open(os.path.join(manifest_path_prefix, 'manifest.train'), 'w', encoding='utf-8')
f_test = open(os.path.join(manifest_path_prefix, 'manifest.test'), 'w', encoding='utf-8')
for line in test_list:
f_test.write(line + '\n')
interval = 500
if len(data_list) / 500 > args.max_test_manifest:
interval = len(data_list) // args.max_test_manifest
for i, line in enumerate(data_list):
if i % interval == 0 and i != 0:
if len(test_list) == 0:
f_test.write(line + '\n')
else:
f_train.write(line + '\n')
else:
f_train.write(line + '\n')
f_train.close()
f_test.close()
print("创建数量列表完成,全部数据一共[%d]小时!" % int(sum(durations_all) / 3600))
# 过滤非文字的字符
def is_ustr(in_str):
out_str = ''
for i in range(len(in_str)):
if is_uchar(in_str[i]):
out_str = out_str + in_str[i]
else:
out_str = out_str + ' '
return ''.join(out_str.split())
# Treat only CJK characters as text; digits, Latin letters and punctuation are filtered out
def is_uchar(uchar):
    return u'\u4e00' <= uchar <= u'\u9fa5'
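A quick, illustrative check of the filtering behaviour of these two helpers:

```python
# Only CJK characters survive; digits, Latin letters and punctuation are
# replaced by spaces in is_ustr() and the spaces are then stripped out.
print(is_ustr('你好, world 123!'))  # -> '你好'
```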
# 生成噪声的数据列表
def create_noise(path='dataset/audio/noise', min_duration=30):
if not os.path.exists(path):
print('噪声音频文件为空,已跳过!')
return
json_lines = []
print('正在创建噪声数据列表,路径:%s,请等待 ...' % path)
for file in tqdm(os.listdir(path)):
audio_path = os.path.join(path, file)
try:
# 噪声的标签可以标记为空
text = ""
# 重新调整音频格式并保存
if args.is_change_frame_rate:
change_rate(audio_path)
f_wave = wave.open(audio_path, "rb")
duration = f_wave.getnframes() / f_wave.getframerate()
# 拼接音频
if duration < min_duration:
wav = soundfile.read(audio_path)[0]
data = wav
for i in range(int(min_duration / duration) + 1):
data = np.hstack([data, wav])
soundfile.write(audio_path, data, samplerate=16000)
f_wave = wave.open(audio_path, "rb")
duration = f_wave.getnframes() / f_wave.getframerate()
json_lines.append(
json.dumps(
{
'audio_filepath': audio_path.replace('\\', '/'),
'duration': duration,
'text': text
},
ensure_ascii=False))
except Exception as e:
continue
with open(os.path.join(args.manifest_prefix, 'manifest.noise'), 'w', encoding='utf-8') as f_noise:
for json_line in json_lines:
f_noise.write(json_line + '\n')
# 获取全部字符
def count_manifest(counter, manifest_path):
manifest_jsons = read_manifest(manifest_path)
for line_json in manifest_jsons:
for char in line_json['text']:
counter.update(char)
# 计算数据集的均值和标准值
def compute_mean_std(manifest_path, num_samples, output_path):
# 随机取指定的数量计算平均值归一化
normalizer = FeatureNormalizer(mean_std_filepath=None,
manifest_path=manifest_path,
num_samples=num_samples,
num_workers=args.num_workers)
# 将计算的结果保存的文件中
normalizer.write_to_file(output_path)
print('计算的均值和标准值已保存在 %s!' % output_path)
def main():
print_arguments(args)
print('开始生成数据列表...')
create_manifest(annotation_path=args.annotation_path,
manifest_path_prefix=args.manifest_prefix)
print('='*70)
print('开始生成噪声数据列表...')
create_noise(path='dataset/audio/noise')
print('='*70)
print('开始生成数据字典...')
counter = Counter()
# 获取全部数据列表中的标签字符
count_manifest(counter, args.manifest_paths)
# 为每一个字符都生成一个ID
count_sorted = sorted(counter.items(), key=lambda x: x[1], reverse=True)
with open(args.vocab_path, 'w', encoding='utf-8') as fout:
fout.write('<blank>\t-1\n')
for char, count in count_sorted:
            # count_sorted is in descending order, so once a character's count falls
            # below count_threshold all remaining characters do too and can be skipped
            if count < args.count_threshold: break
fout.write('%s\t%d\n' % (char, count))
    print('数据词汇表已生成完成,保存于:%s' % args.vocab_path)
print('='*70)
print('开始抽取%s条数据计算均值和标准值...' % args.num_samples)
compute_mean_std(args.manifest_paths, args.num_samples, args.output_path)
print('='*70)
if __name__ == '__main__':
main()
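For reference, a small sketch of the train/test split performed in `create_manifest` above: when no `test.txt` annotation file is present, every `interval`-th sample goes to `manifest.test`, so roughly `1/interval` of the data is held out, capped by `--max_test_manifest`. The dataset size below is made up:

```python
# Illustrative only, with a made-up dataset size.
data_size = 120000
max_test_manifest = 10000
interval = 500
if data_size / 500 > max_test_manifest:
    interval = data_size // max_test_manifest
print(interval, data_size // interval)  # 500 240 -> about 0.2% of samples go to manifest.test
```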
from data_utils.featurizer.speech_featurizer import SpeechFeaturizer
from data_utils.normalizer import FeatureNormalizer
from data_utils.speech import SpeechSegment
class AudioInferProcess(object):
"""
识别程序所使用的是对音频预处理的工具
:param vocab_filepath: 词汇表文件路径
:type vocab_filepath: str
:param mean_std_filepath: 平均值和标准差的文件路径
:type mean_std_filepath: str
:param stride_ms: 生成帧的跨步大小(以毫秒为单位)
:type stride_ms: float
:param window_ms: 用于生成帧的窗口大小(毫秒)
:type window_ms: float
:param use_dB_normalization: 提取特征前是否将音频归一化至-20 dB
:type use_dB_normalization: bool
"""
def __init__(self,
vocab_filepath,
mean_std_filepath,
stride_ms=10.0,
window_ms=20.0,
use_dB_normalization=True):
self._normalizer = FeatureNormalizer(mean_std_filepath)
self._speech_featurizer = SpeechFeaturizer(vocab_filepath=vocab_filepath,
stride_ms=stride_ms,
window_ms=window_ms,
use_dB_normalization=use_dB_normalization)
def process_utterance(self, audio_file):
"""对语音数据加载、预处理
:param audio_file: 音频文件的文件路径或文件对象
:type audio_file: str | file
:return: 预处理的音频数据
:rtype: 2darray
"""
speech_segment = SpeechSegment.from_file(audio_file, "")
specgram, _ = self._speech_featurizer.featurize(speech_segment, False)
specgram = self._normalizer.apply(specgram)
return specgram
@property
def vocab_size(self):
"""返回词汇表大小
:return: 词汇表大小
:rtype: int
"""
return self._speech_featurizer.vocab_size
@property
def vocab_list(self):
"""返回词汇表列表
:return: 词汇表列表
:rtype: list
"""
return self._speech_featurizer.vocab_list
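A minimal usage sketch of `AudioInferProcess`, assuming the default vocabulary, mean/std and test audio files referenced elsewhere in this commit already exist:

```python
# Sketch only: the paths follow the defaults used by infer_path.py in this repo.
processor = AudioInferProcess(vocab_filepath='./dataset/zh_vocab.txt',
                              mean_std_filepath='./dataset/mean_std.npz')
specgram = processor.process_utterance('./dataset/test.wav')
print(specgram.shape)        # (freq_bins, frames), normalized with the dataset mean/std
print(processor.vocab_size)  # number of tokens the acoustic model predicts over
```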
"""Contains the data augmentation pipeline."""
import json
import os
import random
import sys
from datetime import datetime
from data_utils.augmentor.volume_perturb import VolumePerturbAugmentor
from data_utils.augmentor.shift_perturb import ShiftPerturbAugmentor
from data_utils.augmentor.speed_perturb import SpeedPerturbAugmentor
from data_utils.augmentor.noise_perturb import NoisePerturbAugmentor
from data_utils.augmentor.spec_augment import SpecAugmentor
from data_utils.augmentor.resample import ResampleAugmentor
class AugmentationPipeline(object):
"""Build a pre-processing pipeline with various augmentation models.Such a
data augmentation pipeline is oftern leveraged to augment the training
samples to make the model invariant to certain types of perturbations in the
real world, improving model's generalization ability.
The pipeline is built according the the augmentation configuration in json
string, e.g.
.. code-block::
[
{
"type": "noise",
"params": {
"min_snr_dB": 10,
"max_snr_dB": 50,
"noise_manifest_path": "dataset/manifest.noise"
},
"prob": 0.5
},
{
"type": "speed",
"params": {
"min_speed_rate": 0.9,
"max_speed_rate": 1.1,
"num_rates": 3
},
"prob": 1.0
},
{
"type": "shift",
"params": {
"min_shift_ms": -5,
"max_shift_ms": 5
},
"prob": 1.0
},
{
"type": "volume",
"params": {
"min_gain_dBFS": -15,
"max_gain_dBFS": 15
},
"prob": 1.0
},
{
"type": "specaug",
"params": {
"W": 0,
"warp_mode": "PIL",
"F": 10,
"n_freq_masks": 2,
"T": 50,
"n_time_masks": 2,
"p": 1.0,
"adaptive_number_ratio": 0,
"adaptive_size_ratio": 0,
"max_n_time_masks": 20,
"replace_with_zero": true
},
"prob": 1.0
}
]
    This example configuration inserts several augmentation models into the
    pipeline, such as NoisePerturbAugmentor, SpeedPerturbAugmentor and
    VolumePerturbAugmentor. "prob" indicates the probability that the
    corresponding augmentor takes effect. If "prob" is zero, the augmentor
    never takes effect.
:param augmentation_config: Augmentation configuration in json string.
:type augmentation_config: str
:param random_seed: Random seed.
:type random_seed: int
    :raises ValueError: If the augmentation json config is in an incorrect format.
"""
def __init__(self, augmentation_config, random_seed=0):
self._rng = random.Random(random_seed)
self._augmentors, self._rates = self._parse_pipeline_from(augmentation_config, aug_type='audio')
self._spec_augmentors, self._spec_rates = self._parse_pipeline_from(augmentation_config, aug_type='feature')
def transform_audio(self, audio_segment):
"""Run the pre-processing pipeline for data augmentation.
Note that this is an in-place transformation.
:param audio_segment: Audio segment to process.
        :type audio_segment: AudioSegment|SpeechSegment
"""
for augmentor, rate in zip(self._augmentors, self._rates):
if self._rng.uniform(0., 1.) < rate:
augmentor.transform_audio(audio_segment)
def transform_feature(self, spec_segment):
"""spectrogram augmentation.
Args:
spec_segment (np.ndarray): audio feature, (D, T).
"""
for augmentor, rate in zip(self._spec_augmentors, self._spec_rates):
if self._rng.uniform(0., 1.) < rate:
spec_segment = augmentor.transform_feature(spec_segment)
return spec_segment
def _parse_pipeline_from(self, config_json, aug_type):
"""Parse the config json to build a augmentation pipelien."""
try:
configs = []
configs_temp = json.loads(config_json)
for config in configs_temp:
if config['aug_type'] != aug_type: continue
if config['type'] == 'noise' and not os.path.exists(config['params']['noise_manifest_path']):
print('%s不存在,已经忽略噪声增强操作!' % config['params']['noise_manifest_path'], file=sys.stderr)
continue
print('[%s] 数据增强配置:%s' % (datetime.now(), config))
configs.append(config)
augmentors = [self._get_augmentor(config["type"], config["params"]) for config in configs]
rates = [config["prob"] for config in configs]
except Exception as e:
raise ValueError("Failed to parse the augmentation config json: %s" % str(e))
return augmentors, rates
def _get_augmentor(self, augmentor_type, params):
"""Return an augmentation model by the type name, and pass in params."""
if augmentor_type == "volume":
return VolumePerturbAugmentor(self._rng, **params)
elif augmentor_type == "shift":
return ShiftPerturbAugmentor(self._rng, **params)
elif augmentor_type == "speed":
return SpeedPerturbAugmentor(self._rng, **params)
elif augmentor_type == "resample":
return ResampleAugmentor(self._rng, **params)
elif augmentor_type == "noise":
return NoisePerturbAugmentor(self._rng, **params)
elif augmentor_type == "specaug":
return SpecAugmentor(self._rng, **params)
else:
raise ValueError("Unknown augmentor type [%s]." % augmentor_type)
"""Contains the abstract base class for augmentation models."""
from abc import ABCMeta, abstractmethod
class AugmentorBase(object):
"""Abstract base class for augmentation model (augmentor) class.
All augmentor classes should inherit from this class, and implement the
following abstract methods.
"""
__metaclass__ = ABCMeta
@abstractmethod
def __init__(self):
pass
@abstractmethod
def transform_audio(self, audio_segment):
"""Adds various effects to the input audio segment. Such effects
will augment the training data to make the model invariant to certain
types of perturbations in the real world, improving model's
generalization ability.
Note that this is an in-place transformation.
:param audio_segment: Audio segment to add effects to.
        :type audio_segment: AudioSegment|SpeechSegment
"""
pass
"""Contains the noise perturb augmentation model."""
from data_utils.augmentor.base import AugmentorBase
from data_utils.utility import read_manifest
from data_utils.audio import AudioSegment
class NoisePerturbAugmentor(AugmentorBase):
"""用于添加背景噪声的增强模型
:param rng: Random generator object.
:type rng: random.Random
:param min_snr_dB: Minimal signal noise ratio, in decibels.
:type min_snr_dB: float
:param max_snr_dB: Maximal signal noise ratio, in decibels.
:type max_snr_dB: float
:param noise_manifest_path: Manifest path for noise audio data.
:type noise_manifest_path: str
"""
def __init__(self, rng, min_snr_dB, max_snr_dB, noise_manifest_path):
self._min_snr_dB = min_snr_dB
self._max_snr_dB = max_snr_dB
self._rng = rng
self._noise_manifest = read_manifest(manifest_path=noise_manifest_path)
def transform_audio(self, audio_segment):
"""Add background noise audio.
Note that this is an in-place transformation.
:param audio_segment: Audio segment to add effects to.
        :type audio_segment: AudioSegment|SpeechSegment
"""
noise_json = self._rng.sample(self._noise_manifest, 1)[0]
if noise_json['duration'] >= audio_segment.duration:
diff_duration = noise_json['duration'] - audio_segment.duration
start = self._rng.uniform(0, diff_duration)
end = start + audio_segment.duration
noise_segment = AudioSegment.slice_from_file(noise_json['audio_filepath'], start=start, end=end)
snr_dB = self._rng.uniform(self._min_snr_dB, self._max_snr_dB)
audio_segment.add_noise(noise_segment, snr_dB, allow_downsampling=True, rng=self._rng)
"""Contain the resample augmentation model."""
from data_utils.augmentor.base import AugmentorBase
class ResampleAugmentor(AugmentorBase):
"""重采样的增强模型
See more info here:
https://ccrma.stanford.edu/~jos/resample/index.html
:param rng: Random generator object.
:type rng: random.Random
:param new_sample_rate: New sample rate in Hz.
:type new_sample_rate: int
"""
def __init__(self, rng, new_sample_rate):
self._new_sample_rate = new_sample_rate
self._rng = rng
def transform_audio(self, audio_segment):
"""Resamples the input audio to a target sample rate.
Note that this is an in-place transformation.
:param audio: Audio segment to add effects to.
:type audio: AudioSegment|SpeechSegment
"""
audio_segment.resample(self._new_sample_rate)
"""Contains the volume perturb augmentation model."""
from data_utils.augmentor.base import AugmentorBase
class ShiftPerturbAugmentor(AugmentorBase):
"""添加随机位移扰动的增强模型
:param rng: Random generator object.
:type rng: random.Random
:param min_shift_ms: Minimal shift in milliseconds.
:type min_shift_ms: float
:param max_shift_ms: Maximal shift in milliseconds.
:type max_shift_ms: float
"""
def __init__(self, rng, min_shift_ms, max_shift_ms):
self._min_shift_ms = min_shift_ms
self._max_shift_ms = max_shift_ms
self._rng = rng
def transform_audio(self, audio_segment):
"""Shift audio.
Note that this is an in-place transformation.
:param audio_segment: Audio segment to add effects to.
        :type audio_segment: AudioSegment|SpeechSegment
"""
shift_ms = self._rng.uniform(self._min_shift_ms, self._max_shift_ms)
audio_segment.shift(shift_ms)
import random
import numpy as np
from PIL import Image
from PIL.Image import BICUBIC
from data_utils.augmentor.base import AugmentorBase
class SpecAugmentor(AugmentorBase):
"""Augmentation model for Time warping, Frequency masking, Time masking.
SpecAugment: A Simple Data Augmentation Method for Automatic Speech Recognition
https://arxiv.org/abs/1904.08779
SpecAugment on Large Scale Datasets
https://arxiv.org/abs/1912.05533
"""
def __init__(self,
rng,
F,
T,
n_freq_masks,
n_time_masks,
p=1.0,
W=40,
adaptive_number_ratio=0,
adaptive_size_ratio=0,
max_n_time_masks=20,
replace_with_zero=True,
warp_mode='PIL'):
"""SpecAugment class.
Args:
rng (random.Random): random generator object.
F (int): parameter for frequency masking
T (int): parameter for time masking
n_freq_masks (int): number of frequency masks
n_time_masks (int): number of time masks
p (float): parameter for upperbound of the time mask
W (int): parameter for time warping
adaptive_number_ratio (float): adaptive multiplicity ratio for time masking
adaptive_size_ratio (float): adaptive size ratio for time masking
max_n_time_masks (int): maximum number of time masking
replace_with_zero (bool): pad zero on mask if true else use mean
warp_mode (str): "PIL" (default, fast, not differentiable)
or "sparse_image_warp" (slow, differentiable)
"""
super().__init__()
self._rng = rng
self.inplace = True
self.replace_with_zero = replace_with_zero
self.mode = warp_mode
self.W = W
self.F = F
self.T = T
self.n_freq_masks = n_freq_masks
self.n_time_masks = n_time_masks
self.p = p
# adaptive SpecAugment
self.adaptive_number_ratio = adaptive_number_ratio
self.adaptive_size_ratio = adaptive_size_ratio
self.max_n_time_masks = max_n_time_masks
if adaptive_number_ratio > 0:
self.n_time_masks = 0
if adaptive_size_ratio > 0:
self.T = 0
self._freq_mask = None
self._time_mask = None
@property
def freq_mask(self):
return self._freq_mask
@property
def time_mask(self):
return self._time_mask
def __repr__(self):
return f"specaug: F-{self.F}, T-{self.T}, F-n-{self.n_freq_masks}, T-n-{self.n_time_masks}"
def time_warp(self, x, mode='PIL'):
"""time warp for spec augment
move random center frame by the random width ~ uniform(-window, window)
Args:
x (np.ndarray): spectrogram (time, freq)
mode (str): PIL or sparse_image_warp
Raises:
NotImplementedError: [description]
NotImplementedError: [description]
Returns:
np.ndarray: time warped spectrogram (time, freq)
"""
window = max_time_warp = self.W
if window == 0:
return x
if mode == "PIL":
t = x.shape[0]
if t - window <= window:
return x
# NOTE: randrange(a, b) emits a, a + 1, ..., b - 1
center = random.randrange(window, t - window)
warped = random.randrange(center - window, center +
window) + 1 # 1 ... t - 1
left = Image.fromarray(x[:center]).resize((x.shape[1], warped),
BICUBIC)
right = Image.fromarray(x[center:]).resize((x.shape[1], t - warped),
BICUBIC)
if self.inplace:
x[:warped] = left
x[warped:] = right
return x
return np.concatenate((left, right), 0)
elif mode == "sparse_image_warp":
raise NotImplementedError('sparse_image_warp')
else:
raise NotImplementedError(
"unknown resize mode: " + mode +
", choose one from (PIL, sparse_image_warp).")
def mask_freq(self, x, replace_with_zero=False):
"""freq mask
Args:
x (np.ndarray): spectrogram (time, freq)
replace_with_zero (bool, optional): Defaults to False.
Returns:
np.ndarray: freq mask spectrogram (time, freq)
"""
n_bins = x.shape[1]
for i in range(0, self.n_freq_masks):
f = int(self._rng.uniform(a=0, b=self.F))
f_0 = int(self._rng.uniform(a=0, b=n_bins - f))
assert f_0 <= f_0 + f
if replace_with_zero:
x[:, f_0:f_0 + f] = 0
else:
x[:, f_0:f_0 + f] = x.mean()
self._freq_mask = (f_0, f_0 + f)
return x
def mask_time(self, x, replace_with_zero=False):
"""time mask
Args:
x (np.ndarray): spectrogram (time, freq)
replace_with_zero (bool, optional): Defaults to False.
Returns:
np.ndarray: time mask spectrogram (time, freq)
"""
n_frames = x.shape[0]
if self.adaptive_number_ratio > 0:
n_masks = int(n_frames * self.adaptive_number_ratio)
n_masks = min(n_masks, self.max_n_time_masks)
else:
n_masks = self.n_time_masks
if self.adaptive_size_ratio > 0:
T = self.adaptive_size_ratio * n_frames
else:
T = self.T
for i in range(n_masks):
t = int(self._rng.uniform(a=0, b=T))
t = min(t, int(n_frames * self.p))
t_0 = int(self._rng.uniform(a=0, b=n_frames - t))
assert t_0 <= t_0 + t
if replace_with_zero:
x[t_0:t_0 + t, :] = 0
else:
x[t_0:t_0 + t, :] = x.mean()
self._time_mask = (t_0, t_0 + t)
return x
def __call__(self, x, train=True):
if not train:
return x
return self.transform_feature(x)
def transform_feature(self, x: np.ndarray):
"""
Args:
x (np.ndarray): `[T, F]`
Returns:
x (np.ndarray): `[T, F]`
"""
assert isinstance(x, np.ndarray)
assert x.ndim == 2
x = self.time_warp(x, self.mode)
x = self.mask_freq(x, self.replace_with_zero)
x = self.mask_time(x, self.replace_with_zero)
return x
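A small sketch of applying this augmentor to a random spectrogram, using the same parameters as the `specaug` entry in the augmentation configuration shown earlier (with `W=0`, time warping is a no-op):

```python
import random
import numpy as np

# Sketch only: parameters mirror the "specaug" entry of the JSON config above.
aug = SpecAugmentor(rng=random.Random(0), F=10, T=50, n_freq_masks=2, n_time_masks=2,
                    p=1.0, W=0, replace_with_zero=True)
spec = np.random.rand(200, 161)          # (time, freq) feature, values in [0, 1)
out = aug.transform_feature(spec.copy())
print(aug.freq_mask, aug.time_mask)      # the last frequency/time mask that was applied
print((out == 0).any())                  # True once a non-empty mask has been zeroed out
```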
"""Contain the speech perturbation augmentation model."""
import numpy as np
from data_utils.augmentor.base import AugmentorBase
class SpeedPerturbAugmentor(AugmentorBase):
"""添加速度扰动的增强模型
See reference paper here:
http://www.danielpovey.com/files/2015_interspeech_augmentation.pdf
:param rng: Random generator object.
:type rng: random.Random
:param min_speed_rate: Lower bound of new speed rate to sample and should
not be smaller than 0.9.
:type min_speed_rate: float
:param max_speed_rate: Upper bound of new speed rate to sample and should
not be larger than 1.1.
:type max_speed_rate: float
"""
def __init__(self, rng, min_speed_rate=0.9, max_speed_rate=1.1, num_rates=3):
if min_speed_rate < 0.9:
raise ValueError("Sampling speed below 0.9 can cause unnatural effects")
if max_speed_rate > 1.1:
raise ValueError("Sampling speed above 1.1 can cause unnatural effects")
self._min_speed_rate = min_speed_rate
self._max_speed_rate = max_speed_rate
self._rng = rng
self._num_rates = num_rates
if num_rates > 0:
self._rates = np.linspace(self._min_speed_rate, self._max_speed_rate, self._num_rates, endpoint=True)
def transform_audio(self, audio_segment):
"""Sample a new speed rate from the given range and
changes the speed of the given audio clip.
Note that this is an in-place transformation.
:param audio_segment: Audio segment to add effects to.
:type audio_segment: AudioSegment|SpeechSegment
"""
        if self._num_rates <= 0:
speed_rate = self._rng.uniform(self._min_speed_rate, self._max_speed_rate)
else:
speed_rate = self._rng.choice(self._rates)
if speed_rate == 1.0: return
audio_segment.change_speed(speed_rate)
"""Contains the volume perturb augmentation model."""
from data_utils.augmentor.base import AugmentorBase
class VolumePerturbAugmentor(AugmentorBase):
"""添加随机体积扰动的增强模型
This is used for multi-loudness training of PCEN. See
https://arxiv.org/pdf/1607.05666v1.pdf
for more details.
:param rng: Random generator object.
:type rng: random.Random
:param min_gain_dBFS: Minimal gain in dBFS.
:type min_gain_dBFS: float
:param max_gain_dBFS: Maximal gain in dBFS.
:type max_gain_dBFS: float
"""
def __init__(self, rng, min_gain_dBFS, max_gain_dBFS):
self._min_gain_dBFS = min_gain_dBFS
self._max_gain_dBFS = max_gain_dBFS
self._rng = rng
def transform_audio(self, audio_segment):
"""Change audio loadness.
Note that this is an in-place transformation.
:param audio_segment: Audio segment to add effects to.
        :type audio_segment: AudioSegment|SpeechSegment
"""
gain = self._rng.uniform(self._min_gain_dBFS, self._max_gain_dBFS)
audio_segment.gain_db(gain)
"""Contains the audio featurizer class."""
import numpy as np
from data_utils.audio import AudioSegment
class AudioFeaturizer(object):
"""音频特征器,用于从AudioSegment或SpeechSegment内容中提取特性。
Currently, it supports feature types of linear spectrogram and mfcc.
:param stride_ms: Striding size (in milliseconds) for generating frames.
:type stride_ms: float
:param window_ms: Window size (in milliseconds) for generating frames.
:type window_ms: float
:param target_sample_rate: Audio are resampled (if upsampling or
downsampling is allowed) to this before
extracting spectrogram features.
:type target_sample_rate: int
:param use_dB_normalization: Whether to normalize the audio to a certain
decibels before extracting the features.
:type use_dB_normalization: bool
:param target_dB: Target audio decibels for normalization.
:type target_dB: float
"""
def __init__(self,
stride_ms=10.0,
window_ms=20.0,
target_sample_rate=16000,
use_dB_normalization=True,
target_dB=-20):
self._stride_ms = stride_ms
self._window_ms = window_ms
self._target_sample_rate = target_sample_rate
self._use_dB_normalization = use_dB_normalization
self._target_dB = target_dB
def featurize(self, audio_segment, allow_downsampling=True, allow_upsampling=True):
"""从AudioSegment或SpeechSegment中提取音频特征
:param audio_segment: Audio/speech segment to extract features from.
:type audio_segment: AudioSegment|SpeechSegment
:param allow_downsampling: Whether to allow audio downsampling before featurizing.
:type allow_downsampling: bool
:param allow_upsampling: Whether to allow audio upsampling before featurizing.
:type allow_upsampling: bool
:return: Spectrogram audio feature in 2darray.
:rtype: ndarray
:raises ValueError: If audio sample rate is not supported.
"""
# upsampling or downsampling
if ((audio_segment.sample_rate > self._target_sample_rate and
allow_downsampling) or
(audio_segment.sample_rate < self._target_sample_rate and
allow_upsampling)):
audio_segment.resample(self._target_sample_rate)
if audio_segment.sample_rate != self._target_sample_rate:
raise ValueError("Audio sample rate is not supported. "
"Turn allow_downsampling or allow up_sampling on.")
# decibel normalization
if self._use_dB_normalization:
audio_segment.normalize(target_db=self._target_dB)
# extract spectrogram
return self._compute_linear_specgram(audio_segment.samples, audio_segment.sample_rate,
stride_ms=self._stride_ms, window_ms=self._window_ms)
# 用快速傅里叶变换计算线性谱图
@staticmethod
def _compute_linear_specgram(samples,
sample_rate,
stride_ms=10.0,
window_ms=20.0,
eps=1e-14):
stride_size = int(0.001 * sample_rate * stride_ms)
window_size = int(0.001 * sample_rate * window_ms)
truncate_size = (len(samples) - window_size) % stride_size
samples = samples[:len(samples) - truncate_size]
nshape = (window_size, (len(samples) - window_size) // stride_size + 1)
nstrides = (samples.strides[0], samples.strides[0] * stride_size)
windows = np.lib.stride_tricks.as_strided(samples, shape=nshape, strides=nstrides)
assert np.all(windows[:, 1] == samples[stride_size:(stride_size + window_size)])
# 快速傅里叶变换
weighting = np.hanning(window_size)[:, None]
fft = np.fft.rfft(windows * weighting, n=None, axis=0)
fft = np.absolute(fft)
fft = fft ** 2
scale = np.sum(weighting ** 2) * sample_rate
fft[1:-1, :] *= (2.0 / scale)
fft[(0, -1), :] /= scale
freqs = float(sample_rate) / window_size * np.arange(fft.shape[0])
ind = np.where(freqs <= (sample_rate / 2))[0][-1] + 1
return np.log(fft[:ind, :] + eps)
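A rough sanity check of the spectrogram shape produced by this featurizer, calling the private helper directly purely for illustration (the 440 Hz tone is arbitrary):

```python
import numpy as np

# Sketch only: one second of a 440 Hz tone at 16 kHz.
sample_rate = 16000
t = np.arange(sample_rate) / sample_rate
samples = np.sin(2 * np.pi * 440.0 * t).astype('float32')
spec = AudioFeaturizer._compute_linear_specgram(samples, sample_rate,
                                                stride_ms=10.0, window_ms=20.0)
# window of 320 samples, stride of 160 samples -> 161 frequency bins x 99 frames
print(spec.shape)  # (161, 99)
```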
"""Contains the speech featurizer class."""
from data_utils.featurizer.audio_featurizer import AudioFeaturizer
from data_utils.featurizer.text_featurizer import TextFeaturizer
class SpeechFeaturizer(object):
"""Speech featurizer, for extracting features from both audio and transcript
contents of SpeechSegment.
Currently, for audio parts, it supports feature types of linear
spectrogram and mfcc; for transcript parts, it only supports char-level
tokenizing and conversion into a list of token indices. Note that the
token indexing order follows the given vocabulary file.
:param vocab_filepath: Filepath to load vocabulary for token indices
conversion.
:type vocab_filepath: str
:param stride_ms: Striding size (in milliseconds) for generating frames.
:type stride_ms: float
:param window_ms: Window size (in milliseconds) for generating frames.
:type window_ms: float
:param target_sample_rate: Speech are resampled (if upsampling or
downsampling is allowed) to this before
extracting spectrogram features.
:type target_sample_rate: int
:param use_dB_normalization: Whether to normalize the audio to a certain
decibels before extracting the features.
:type use_dB_normalization: bool
:param target_dB: Target audio decibels for normalization.
:type target_dB: float
"""
def __init__(self,
vocab_filepath,
stride_ms=10.0,
window_ms=20.0,
target_sample_rate=16000,
use_dB_normalization=True,
target_dB=-20):
self._audio_featurizer = AudioFeaturizer(stride_ms=stride_ms,
window_ms=window_ms,
target_sample_rate=target_sample_rate,
use_dB_normalization=use_dB_normalization,
target_dB=target_dB)
self._text_featurizer = TextFeaturizer(vocab_filepath)
def featurize(self, speech_segment, keep_transcription_text):
"""提取语音片段的特征
1. For audio parts, extract the audio features.
2. For transcript parts, keep the original text or convert text string
to a list of token indices in char-level.
:param audio_segment: Speech segment to extract features from.
:type audio_segment: SpeechSegment
:return: A tuple of 1) spectrogram audio feature in 2darray, 2) list of
char-level token indices.
:rtype: tuple
"""
audio_feature = self._audio_featurizer.featurize(speech_segment)
if keep_transcription_text:
return audio_feature, speech_segment.transcript
text_ids = self._text_featurizer.featurize(speech_segment.transcript)
return audio_feature, text_ids
@property
def vocab_size(self):
"""返回词汇表大小
:return: Vocabulary size.
:rtype: int
"""
return self._text_featurizer.vocab_size
@property
def vocab_list(self):
"""返回词汇表的list
:return: Vocabulary in list.
:rtype: list
"""
return self._text_featurizer.vocab_list
class TextFeaturizer(object):
"""文本特征器,用于处理或从文本中提取特征。支持字符级的令牌化和转换为令牌索引列表
:param vocab_filepath: 令牌索引转换词汇表的文件路径
:type vocab_filepath: str
"""
def __init__(self, vocab_filepath):
self._vocab_dict, self._vocab_list = self._load_vocabulary_from_file(
vocab_filepath)
def featurize(self, text):
"""将文本字符串转换为字符级的令牌索引列表
:param text: 文本
:type text: str
:return:字符级令牌索引列表
:rtype: list
"""
tokens = self._char_tokenize(text)
token_indices = []
for token in tokens:
# 跳过词汇表不存在的字符
if token not in self._vocab_list:continue
token_indices.append(self._vocab_dict[token])
return token_indices
@property
def vocab_size(self):
"""返回词汇表大小
:return: Vocabulary size.
:rtype: int
"""
return len(self._vocab_list)
@property
def vocab_list(self):
"""返回词汇表的列表
:return: Vocabulary in list.
:rtype: list
"""
return self._vocab_list
def _char_tokenize(self, text):
"""Character tokenizer."""
return list(text.strip())
def _load_vocabulary_from_file(self, vocab_filepath):
"""Load vocabulary from file."""
vocab_lines = []
with open(vocab_filepath, 'r', encoding='utf-8') as file:
vocab_lines.extend(file.readlines())
vocab_list = [line.split('\t')[0].replace('\n', '') for line in vocab_lines]
vocab_dict = dict(
[(token, id) for (id, token) in enumerate(vocab_list)])
return vocab_dict, vocab_list
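A small round-trip sketch with a made-up two-character vocabulary; the file format matches what `create_data.py` writes (token and count separated by a tab, `<blank>` first), so token indices follow line order:

```python
# Sketch only: tiny hypothetical vocabulary file.
with open('vocab_demo.txt', 'w', encoding='utf-8') as f:
    f.write('<blank>\t-1\n你\t100\n好\t90\n')

featurizer = TextFeaturizer('vocab_demo.txt')
print(featurizer.vocab_list)           # ['<blank>', '你', '好']
print(featurizer.featurize('你好吗'))   # [1, 2]; '吗' is not in the vocabulary and is skipped
```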
"""特征归一化"""
import math
import numpy as np
import random
from tqdm import tqdm
from paddle.io import Dataset, DataLoader
from data_utils.utility import read_manifest
from data_utils.audio import AudioSegment
from data_utils.featurizer.audio_featurizer import AudioFeaturizer
class FeatureNormalizer(object):
"""音频特征归一化类
如果mean_std_filepath不是None,则normalizer将直接从文件初始化。否则,使用manifest_path应该给特征mean和stddev计算
:param mean_std_filepath: 均值和标准值的文件路径
:type mean_std_filepath: None|str
:param manifest_path: 用于计算均值和标准值的数据列表,一般是训练的数据列表
:type meanifest_path: None|str
:param featurize_func:函数提取特征。它应该是可调用的``featurize_func(audio_segment)``
:type featurize_func: None|callable
:param num_samples: 用于计算均值和标准值的音频数量
:type num_samples: int
:param random_seed: 随机种子
:type random_seed: int
:raises ValueError: 如果mean_std_filepath和manifest_path(或mean_std_filepath和featurize_func)都为None
"""
def __init__(self,
mean_std_filepath,
manifest_path=None,
num_workers=4,
num_samples=5000,
random_seed=0):
if not mean_std_filepath:
if not manifest_path:
raise ValueError("如果mean_std_filepath是None,那么meanifest_path和featurize_func不应该是None")
self._rng = random.Random(random_seed)
self._compute_mean_std(manifest_path, num_samples, num_workers)
else:
self._read_mean_std_from_file(mean_std_filepath)
def apply(self, features, eps=1e-20):
"""使用均值和标准值计算音频特征的归一化值
:param features: 需要归一化的音频
:type features: ndarray
:param eps: 添加到标准值以提供数值稳定性
:type eps: float
:return: 已经归一化的数据
:rtype: ndarray
"""
return (features - self._mean) / (self._std + eps)
def write_to_file(self, filepath):
"""将计算得到的均值和标准值写入到文件中
:param filepath: 均值和标准值写入的文件路径
:type filepath: str
"""
np.savez(filepath, mean=self._mean, std=self._std)
def _read_mean_std_from_file(self, filepath):
"""从文件中加载均值和标准值"""
npzfile = np.load(filepath)
self._mean = npzfile["mean"]
self._std = npzfile["std"]
def _compute_mean_std(self, manifest_path, num_samples, num_workers):
"""从随机抽样的实例中计算均值和标准值"""
manifest = read_manifest(manifest_path)
if num_samples < 0 or num_samples > len(manifest):
sampled_manifest = manifest
else:
sampled_manifest = self._rng.sample(manifest, num_samples)
dataset = NormalizerDataset(sampled_manifest)
test_loader = DataLoader(dataset=dataset, batch_size=64, collate_fn=collate_fn, num_workers=num_workers)
# 求总和
std, means = None, None
number = 0
for std1, means1, number1 in tqdm(test_loader()):
number += number1
if means is None:
means = means1
else:
means += means1
if std is None:
std = std1
else:
std += std1
# 求总和的均值和标准值
for i in range(len(means)):
means[i] /= number
std[i] = std[i] / number - means[i] * means[i]
if std[i] < 1.0e-20:
std[i] = 1.0e-20
std[i] = math.sqrt(std[i])
self._mean = means.reshape([-1, 1])
self._std = std.reshape([-1, 1])
class NormalizerDataset(Dataset):
def __init__(self, sampled_manifest):
super(NormalizerDataset, self).__init__()
self.audio_featurizer = AudioFeaturizer()
self.sampled_manifest = sampled_manifest
def __getitem__(self, idx):
instance = self.sampled_manifest[idx]
# 获取音频特征
audio = AudioSegment.from_file(instance["audio_filepath"])
feature = self.audio_featurizer.featurize(audio)
return feature, 0
def __len__(self):
return len(self.sampled_manifest)
def collate_fn(features):
std, means = None, None
number = 0
for feature, _ in features:
number += feature.shape[1]
sums = np.sum(feature, axis=1)
if means is None:
means = sums
else:
means += sums
square_sums = np.sum(np.square(feature), axis=1)
if std is None:
std = square_sums
else:
std += square_sums
return std, means, number
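For clarity, a short standalone check (not part of the pipeline) that the streaming sums accumulated in `collate_fn` and `_compute_mean_std` reproduce the usual per-bin statistics:

```python
import numpy as np

# Sketch only: one (freq_bins, frames) feature, as produced by AudioFeaturizer.
feature = np.random.rand(161, 200)
number = feature.shape[1]
means = np.sum(feature, axis=1) / number                          # E[x] per bin
var = np.sum(np.square(feature), axis=1) / number - means ** 2    # E[x^2] - E[x]^2
std = np.sqrt(np.maximum(var, 1e-20))
assert np.allclose(means, feature.mean(axis=1))
assert np.allclose(std, feature.std(axis=1))
```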
"""Contains the speech segment class."""
import numpy as np
from data_utils.audio import AudioSegment
class SpeechSegment(AudioSegment):
"""语音片段抽象是音频片段的一个子类,附加文字记录。
:param samples: Audio samples [num_samples x num_channels].
:type samples: ndarray.float32
:param sample_rate: 训练数据的采样率
:type sample_rate: int
:param transcript: 音频文件对应的文本
    :type transcript: str
:raises TypeError: If the sample data type is not float or int.
"""
def __init__(self, samples, sample_rate, transcript):
AudioSegment.__init__(self, samples, sample_rate)
self._transcript = transcript
def __eq__(self, other):
"""Return whether two objects are equal.
"""
if not AudioSegment.__eq__(self, other):
return False
if self._transcript != other._transcript:
return False
return True
def __ne__(self, other):
"""Return whether two objects are unequal."""
return not self.__eq__(other)
@classmethod
def from_file(cls, filepath, transcript):
"""从音频文件和相应的文本创建语音片段
:param filepath: 音频文件路径
:type filepath: str|file
:param transcript: 音频文件对应的文本
        :type transcript: str
:return: Speech segment instance.
:rtype: SpeechSegment
"""
audio = AudioSegment.from_file(filepath)
return cls(audio.samples, audio.sample_rate, transcript)
@classmethod
def from_bytes(cls, bytes, transcript):
"""从字节串和相应的文本创建语音片段
:param bytes: 包含音频样本的字节字符串
:type bytes: str
:param transcript: 音频文件对应的文本
        :type transcript: str
:return: Speech segment instance.
        :rtype: SpeechSegment
"""
audio = AudioSegment.from_bytes(bytes)
return cls(audio.samples, audio.sample_rate, transcript)
@classmethod
def concatenate(cls, *segments):
"""将任意数量的语音片段连接在一起,音频和文本都将被连接
:param *segments: 要连接的输入语音片段
:type *segments: tuple of SpeechSegment
:return: 返回SpeechSegment实例
:rtype: SpeechSegment
:raises ValueError: 不能用不同的抽样率连接片段
:raises TypeError: 只有相同类型SpeechSegment实例的语音片段可以连接
"""
if len(segments) == 0:
raise ValueError("音频片段为空")
sample_rate = segments[0]._sample_rate
transcripts = ""
for seg in segments:
if sample_rate != seg._sample_rate:
raise ValueError("不能用不同的抽样率连接片段")
if type(seg) is not cls:
raise TypeError("只有相同类型SpeechSegment实例的语音片段可以连接")
transcripts += seg._transcript
samples = np.concatenate([seg.samples for seg in segments])
return cls(samples, sample_rate, transcripts)
@classmethod
def slice_from_file(cls, filepath, transcript, start=None, end=None):
"""只加载一小部分SpeechSegment,而不需要将整个文件加载到内存中,这是非常浪费的。
:param filepath:文件路径或文件对象到音频文件
:type filepath: str|file
:param start: 开始时间,单位为秒。如果start是负的,则它从末尾开始计算。如果没有提供,这个函数将从最开始读取。
:type start: float
:param end: 结束时间,单位为秒。如果end是负的,则它从末尾开始计算。如果没有提供,默认的行为是读取到文件的末尾。
:type end: float
:param transcript: 音频文件对应的文本,如果没有提供,默认值是一个空字符串。
        :type transcript: str
:return: SpeechSegment实例
:rtype: SpeechSegment
"""
audio = AudioSegment.slice_from_file(filepath, start, end)
return cls(audio.samples, audio.sample_rate, transcript)
@classmethod
def make_silence(cls, duration, sample_rate):
"""创建指定安静音频长度和采样率的SpeechSegment实例,音频文件对应的文本将为空字符串。
:param duration: 安静音频的时间,单位秒
:type duration: float
:param sample_rate: 音频采样率
:type sample_rate: float
:return: 安静音频SpeechSegment实例
:rtype: SpeechSegment
"""
audio = AudioSegment.make_silence(duration, sample_rate)
return cls(audio.samples, audio.sample_rate, "")
@property
def transcript(self):
"""返回音频文件对应的文本
:return: 音频文件对应的文本
:rtype: str
"""
return self._transcript
"""数据工具函数"""
import json
def read_manifest(manifest_path, max_duration=float('inf'), min_duration=0.0):
"""解析数据列表
持续时间在[min_duration, max_duration]之外的实例将被过滤。
:param manifest_path: 数据列表的路径
:type manifest_path: str
:param max_duration: 过滤的最长音频长度
:type max_duration: float
:param min_duration: 过滤的最短音频长度
:type min_duration: float
:return: 数据列表,JSON格式
:rtype: list
:raises IOError: If failed to parse the manifest.
"""
manifest = []
for json_line in open(manifest_path, 'r', encoding='utf-8'):
try:
json_data = json.loads(json_line)
except Exception as e:
raise IOError("Error reading manifest: %s" % str(e))
if max_duration >= json_data["duration"] >= min_duration:
manifest.append(json_data)
return manifest
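A minimal example of the manifest format this function expects, one JSON object per line, matching what `create_data.py` writes (the audio path below is made up):

```python
import json

# Sketch only: write a one-line manifest and read it back.
line = json.dumps({'audio_filepath': 'dataset/audio/demo.wav',
                   'duration': 3.2,
                   'text': '你好'}, ensure_ascii=False)
with open('manifest.demo', 'w', encoding='utf-8') as f:
    f.write(line + '\n')

print(read_manifest('manifest.demo', max_duration=20.0, min_duration=0.5))
```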