Commit 5b0ac742 authored by 翟艳秋(20软)

1. [add] Add an icon to the window;
2. [modified] Add a recommended narration word count to the ASR-based output, and normalize start/end times to two decimal places;
3. [modified] Adjust where the temporary audio files from the audio synthesis step are stored;
4. [modified] Enable automatic line wrapping in the output tables
parent 945a2b39
Subproject commit 081f7807a2ce0e12b98e6f0a0da0e650133f2d9e
# DeepSpeech2 Speech Recognition
![License](https://img.shields.io/badge/license-Apache%202-red.svg)
![python version](https://img.shields.io/badge/python-3.7+-orange.svg)
![support os](https://img.shields.io/badge/os-linux-yellow.svg)
![GitHub Repo stars](https://img.shields.io/github/stars/yeyupiaoling/PaddlePaddle-DeepSpeech?style=social)
This project is developed on top of PaddlePaddle's [DeepSpeech](https://github.com/PaddlePaddle/DeepSpeech) project, with substantial modifications to make it easy to train on custom Chinese datasets and convenient to test and use. DeepSpeech2 is an end-to-end automatic speech recognition (ASR) engine implemented with PaddlePaddle; the underlying paper is [《Baidu's Deep Speech 2 paper》](http://proceedings.mlr.press/v48/amodei16.pdf). The project also supports a variety of data augmentation methods to suit different usage scenarios. Training and inference are supported on Windows and Linux, and inference is supported on development boards such as Nvidia Jetson. This branch is the new version; to use the old version, see the [release/1.0 branch](https://github.com/yeyupiaoling/PaddlePaddle-DeepSpeech/tree/release/1.0).
Environment used by this project:
- Python 3.7
- PaddlePaddle 2.2.0
- Windows or Ubuntu
## Changelog
- 2021.11.26: Fixed a beam search decoding bug.
- 2021.11.09: Added a script for preparing the WenetSpeech dataset.
- 2021.09.05: Added GUI-based recognition deployment.
- 2021.09.04: Released pre-trained models for three public datasets.
- 2021.08.30: Support converting Chinese numerals to Arabic numerals; see the [inference docs](docs/infer.md).
- 2021.08.29: Completed the training and inference code and improved the related documentation.
- 2021.08.07: Support exporting an inference model and running inference with it; long-audio recognition implemented with the webrtcvad tool.
- 2021.08.06: Migrated most of the project code to the new APIs introduced after PaddlePaddle 2.0.
## Model Downloads
| Dataset | Conv layers | RNN layers | RNN size | Test CER | Download |
| :---: | :---: | :---: | :---: | :---: | :---: |
| aishell (179 hours) | 2 | 3 | 1024 | 0.084532 | [Download](https://download.csdn.net/download/qq_33200967/21773253) |
| free_st_chinese_mandarin_corpus (109 hours) | 2 | 3 | 1024 | 0.170260 | [Download](https://download.csdn.net/download/qq_33200967/21866900) |
| thchs_30 (34 hours) | 2 | 3 | 1024 | 0.026838 | [Download](https://download.csdn.net/download/qq_33200967/21774247) |
| Large dataset (1600+ hours of real data + 1300+ hours of synthesized data) | 2 | 3 | 1024 | in training | [in training]() |
**Note:** The downloads above are training checkpoints. To use them for inference you still need to [export the model](docs/export_model.md); the decoding method used is beam search.
> Questions are welcome, please open an [issue](https://github.com/yeyupiaoling/PaddlePaddle-DeepSpeech/issues) to discuss.
## Documentation
- [Quick installation](docs/install.md)
- [Data preparation](docs/dataset.md)
- [WenetSpeech dataset](docs/wenetspeech.md)
- [Synthesizing speech data](docs/generate_audio.md)
- [Data augmentation](docs/augment.md)
- [Training the model](docs/train.md)
- [Beam search decoding](docs/beam_search.md)
- [Running evaluation](docs/eval.md)
- [Exporting the model](docs/export_model.md)
- Inference
  - [Local model](docs/infer.md)
  - [Long-audio model](docs/infer.md)
  - [Web deployment](docs/infer.md)
- [Nvidia Jetson deployment](docs/nvidia-jetson.md)
## Quick Inference
- Download one of the models above or train your own, then [export the model](docs/export_model.md). Use `infer_path.py` to recognize an audio file, passing its path with the `--wav_path` argument. See [model deployment](docs/infer.md) for details.
```shell script
python infer_path.py --wav_path=./dataset/test.wav
```
Output:
```
----------- Configuration Arguments -----------
alpha: 1.2
beam_size: 10
beta: 0.35
cutoff_prob: 1.0
cutoff_top_n: 40
decoding_method: ctc_greedy
enable_mkldnn: False
is_long_audio: False
lang_model_path: ./lm/zh_giga.no_cna_cmn.prune01244.klm
mean_std_path: ./dataset/mean_std.npz
model_dir: ./models/infer/
to_an: True
use_gpu: True
use_tensorrt: False
vocab_path: ./dataset/zh_vocab.txt
wav_path: ./dataset/test.wav
------------------------------------------------
消耗时间:132, 识别结果: 近几年不但我用书给女儿儿压岁也劝说亲朋不要给女儿压岁钱而改送压岁书, 得分: 94
```
- Long-audio prediction
```shell script
python infer_path.py --wav_path=./dataset/test_vad.wav --is_long_audio=True
```
- Web deployment
![Recording test page](docs/images/infer_server.jpg)
- GUI deployment
![GUI](docs/images/infer_gui.jpg)
## Related Projects
- Voiceprint recognition based on PaddlePaddle: [VoiceprintRecognition-PaddlePaddle](https://github.com/yeyupiaoling/VoiceprintRecognition-PaddlePaddle)
- Speech recognition based on PaddlePaddle dynamic graphs: [PPASR](https://github.com/yeyupiaoling/PPASR)
- Speech recognition based on PyTorch: [MASR](https://github.com/yeyupiaoling/MASR)
[
{
"type": "noise",
"aug_type": "audio",
"params": {
"min_snr_dB": 10,
"max_snr_dB": 50,
"noise_manifest_path": "dataset/manifest.noise"
},
"prob": 0.5
},
{
"type": "speed",
"aug_type": "audio",
"params": {
"min_speed_rate": 0.9,
"max_speed_rate": 1.1,
"num_rates": 3
},
"prob": 1.0
},
{
"type": "shift",
"aug_type": "audio",
"params": {
"min_shift_ms": -5,
"max_shift_ms": 5
},
"prob": 1.0
},
{
"type": "volume",
"aug_type": "audio",
"params": {
"min_gain_dBFS": -15,
"max_gain_dBFS": 15
},
"prob": 1.0
},
{
"type": "specaug",
"aug_type": "feature",
"params": {
"W": 0,
"warp_mode": "PIL",
"F": 10,
"n_freq_masks": 2,
"T": 50,
"n_time_masks": 2,
"p": 1.0,
"adaptive_number_ratio": 0,
"adaptive_size_ratio": 0,
"max_n_time_masks": 20,
"replace_with_zero": true
},
"prob": 1.0
}
]
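A minimal sketch of how a configuration like the JSON above is consumed. The file name and the module path of `AugmentationPipeline` are assumptions based on this repository's layout (the class itself appears later in this commit):

```python
# Sketch only: load the augmentation config and build the pipeline.
# The module path below is assumed from this repository's layout.
from data_utils.augmentor.augmentation import AugmentationPipeline

with open('conf/augmentation.json', 'r', encoding='utf-8') as f:
    augmentation_config = f.read()

# "aug_type": "audio" entries are applied to AudioSegment objects via
# transform_audio(); "aug_type": "feature" entries (specaug) are applied to
# the spectrogram via transform_feature().
pipeline = AugmentationPipeline(augmentation_config, random_seed=0)
```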
import argparse
import functools
import json
import os
import wave
from collections import Counter
from zhconv import convert
import numpy as np
import soundfile
from tqdm import tqdm
from data_utils.normalizer import FeatureNormalizer
from utils.utility import add_arguments, print_arguments, read_manifest, change_rate
parser = argparse.ArgumentParser(description=__doc__)
add_arg = functools.partial(add_arguments, argparser=parser)
add_arg('annotation_path', str, 'dataset/annotation/', '标注文件的路径,如果annotation_path包含了test.txt,就全部使用test.txt的数据作为测试数据')
add_arg('manifest_prefix', str, 'dataset/', '训练数据清单,包括音频路径和标注信息')
add_arg('is_change_frame_rate', bool, True, '是否统一改变音频为16000Hz,这会消耗大量的时间')
add_arg('max_test_manifest', int, 10000, '最大的测试数据数量')
add_arg('count_threshold', int, 2, '字符计数的截断阈值,0为不做限制')
add_arg('vocab_path', str, 'dataset/zh_vocab.txt', '生成的数据字典文件')
add_arg('num_workers', int, 8, '读取数据的线程数量')
add_arg('manifest_paths', str, 'dataset/manifest.train', '数据列表路径')
add_arg('num_samples', int, 1000000, '用于计算均值和标准值的音频数量,当为-1时使用全部数据')
add_arg('output_path', str, './dataset/mean_std.npz', '保存均值和标准值的numpy文件路径,后缀 (.npz)')
args = parser.parse_args()
# 创建数据列表
def create_manifest(annotation_path, manifest_path_prefix):
data_list = []
test_list = []
durations_all = []
duration_0_10 = 0
duration_10_20 = 0
duration_20 = 0
# 获取全部的标注文件
for annotation_text in os.listdir(annotation_path):
durations = []
print('正在创建%s的数据列表,请等待 ...' % annotation_text)
annotation_text_path = os.path.join(annotation_path, annotation_text)
# 读取标注文件
with open(annotation_text_path, 'r', encoding='utf-8') as f:
lines = f.readlines()
for line in tqdm(lines):
audio_path = line.split('\t')[0]
try:
# 过滤非法的字符
text = is_ustr(line.split('\t')[1].replace('\n', '').replace('\r', ''))
# 保证全部都是简体
text = convert(text, 'zh-cn')
# 重新调整音频格式并保存
if args.is_change_frame_rate:
change_rate(audio_path)
# 获取音频的长度
f_wave = wave.open(audio_path, "rb")
duration = f_wave.getnframes() / f_wave.getframerate()
if duration <= 10:
duration_0_10 += 1
elif 10 < duration <= 20:
duration_10_20 += 1
else:
duration_20 += 1
durations.append(duration)
d = json.dumps(
{
'audio_filepath': audio_path.replace('\\', '/'),
'duration': duration,
'text': text
},
ensure_ascii=False)
if annotation_text == 'test.txt':
test_list.append(d)
else:
data_list.append(d)
except Exception as e:
print(e)
continue
durations_all.append(sum(durations))
print("%s数据一共[%d]小时!" % (annotation_text, int(sum(durations) / 3600)))
print("0-10秒的数量:%d,10-20秒的数量:%d,大于20秒的数量:%d" % (duration_0_10, duration_10_20, duration_20))
# 将音频的路径,长度和标签写入到数据列表中
f_train = open(os.path.join(manifest_path_prefix, 'manifest.train'), 'w', encoding='utf-8')
f_test = open(os.path.join(manifest_path_prefix, 'manifest.test'), 'w', encoding='utf-8')
for line in test_list:
f_test.write(line + '\n')
interval = 500
if len(data_list) / 500 > args.max_test_manifest:
interval = len(data_list) // args.max_test_manifest
for i, line in enumerate(data_list):
if i % interval == 0 and i != 0:
if len(test_list) == 0:
f_test.write(line + '\n')
else:
f_train.write(line + '\n')
else:
f_train.write(line + '\n')
f_train.close()
f_test.close()
print("创建数量列表完成,全部数据一共[%d]小时!" % int(sum(durations_all) / 3600))
# 过滤非文字的字符
def is_ustr(in_str):
out_str = ''
for i in range(len(in_str)):
if is_uchar(in_str[i]):
out_str = out_str + in_str[i]
else:
out_str = out_str + ' '
return ''.join(out_str.split())
# Treat only CJK characters as text; digits, Latin letters and punctuation are filtered out
def is_uchar(uchar):
    return u'\u4e00' <= uchar <= u'\u9fa5'
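A quick, illustrative check of the filtering behaviour of these two helpers:

```python
# Only CJK characters survive; digits, Latin letters and punctuation are
# replaced by spaces in is_ustr() and the spaces are then stripped out.
print(is_ustr('你好, world 123!'))  # -> '你好'
```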
# 生成噪声的数据列表
def create_noise(path='dataset/audio/noise', min_duration=30):
if not os.path.exists(path):
print('噪声音频文件为空,已跳过!')
return
json_lines = []
print('正在创建噪声数据列表,路径:%s,请等待 ...' % path)
for file in tqdm(os.listdir(path)):
audio_path = os.path.join(path, file)
try:
# 噪声的标签可以标记为空
text = ""
# 重新调整音频格式并保存
if args.is_change_frame_rate:
change_rate(audio_path)
f_wave = wave.open(audio_path, "rb")
duration = f_wave.getnframes() / f_wave.getframerate()
# 拼接音频
if duration < min_duration:
wav = soundfile.read(audio_path)[0]
data = wav
for i in range(int(min_duration / duration) + 1):
data = np.hstack([data, wav])
soundfile.write(audio_path, data, samplerate=16000)
f_wave = wave.open(audio_path, "rb")
duration = f_wave.getnframes() / f_wave.getframerate()
json_lines.append(
json.dumps(
{
'audio_filepath': audio_path.replace('\\', '/'),
'duration': duration,
'text': text
},
ensure_ascii=False))
except Exception as e:
continue
with open(os.path.join(args.manifest_prefix, 'manifest.noise'), 'w', encoding='utf-8') as f_noise:
for json_line in json_lines:
f_noise.write(json_line + '\n')
# 获取全部字符
def count_manifest(counter, manifest_path):
manifest_jsons = read_manifest(manifest_path)
for line_json in manifest_jsons:
for char in line_json['text']:
counter.update(char)
# 计算数据集的均值和标准值
def compute_mean_std(manifest_path, num_samples, output_path):
# 随机取指定的数量计算平均值归一化
normalizer = FeatureNormalizer(mean_std_filepath=None,
manifest_path=manifest_path,
num_samples=num_samples,
num_workers=args.num_workers)
# 将计算的结果保存的文件中
normalizer.write_to_file(output_path)
print('计算的均值和标准值已保存在 %s!' % output_path)
def main():
print_arguments(args)
print('开始生成数据列表...')
create_manifest(annotation_path=args.annotation_path,
manifest_path_prefix=args.manifest_prefix)
print('='*70)
print('开始生成噪声数据列表...')
create_noise(path='dataset/audio/noise')
print('='*70)
print('开始生成数据字典...')
counter = Counter()
# 获取全部数据列表中的标签字符
count_manifest(counter, args.manifest_paths)
# 为每一个字符都生成一个ID
count_sorted = sorted(counter.items(), key=lambda x: x[1], reverse=True)
with open(args.vocab_path, 'w', encoding='utf-8') as fout:
fout.write('<blank>\t-1\n')
for char, count in count_sorted:
            # count_sorted is in descending order, so once a character's count falls
            # below count_threshold all remaining characters do too and can be skipped
            if count < args.count_threshold: break
fout.write('%s\t%d\n' % (char, count))
    print('数据词汇表已生成完成,保存于:%s' % args.vocab_path)
print('='*70)
print('开始抽取%s条数据计算均值和标准值...' % args.num_samples)
compute_mean_std(args.manifest_paths, args.num_samples, args.output_path)
print('='*70)
if __name__ == '__main__':
main()
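For reference, a small sketch of the train/test split performed in `create_manifest` above: when no `test.txt` annotation file is present, every `interval`-th sample goes to `manifest.test`, so roughly `1/interval` of the data is held out, capped by `--max_test_manifest`. The dataset size below is made up:

```python
# Illustrative only, with a made-up dataset size.
data_size = 120000
max_test_manifest = 10000
interval = 500
if data_size / 500 > max_test_manifest:
    interval = data_size // max_test_manifest
print(interval, data_size // interval)  # 500 240 -> about 0.2% of samples go to manifest.test
```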
from data_utils.featurizer.speech_featurizer import SpeechFeaturizer
from data_utils.normalizer import FeatureNormalizer
from data_utils.speech import SpeechSegment
class AudioInferProcess(object):
"""
识别程序所使用的是对音频预处理的工具
:param vocab_filepath: 词汇表文件路径
:type vocab_filepath: str
:param mean_std_filepath: 平均值和标准差的文件路径
:type mean_std_filepath: str
:param stride_ms: 生成帧的跨步大小(以毫秒为单位)
:type stride_ms: float
:param window_ms: 用于生成帧的窗口大小(毫秒)
:type window_ms: float
:param use_dB_normalization: 提取特征前是否将音频归一化至-20 dB
:type use_dB_normalization: bool
"""
def __init__(self,
vocab_filepath,
mean_std_filepath,
stride_ms=10.0,
window_ms=20.0,
use_dB_normalization=True):
self._normalizer = FeatureNormalizer(mean_std_filepath)
self._speech_featurizer = SpeechFeaturizer(vocab_filepath=vocab_filepath,
stride_ms=stride_ms,
window_ms=window_ms,
use_dB_normalization=use_dB_normalization)
def process_utterance(self, audio_file):
"""对语音数据加载、预处理
:param audio_file: 音频文件的文件路径或文件对象
:type audio_file: str | file
:return: 预处理的音频数据
:rtype: 2darray
"""
speech_segment = SpeechSegment.from_file(audio_file, "")
specgram, _ = self._speech_featurizer.featurize(speech_segment, False)
specgram = self._normalizer.apply(specgram)
return specgram
@property
def vocab_size(self):
"""返回词汇表大小
:return: 词汇表大小
:rtype: int
"""
return self._speech_featurizer.vocab_size
@property
def vocab_list(self):
"""返回词汇表列表
:return: 词汇表列表
:rtype: list
"""
return self._speech_featurizer.vocab_list
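A minimal usage sketch of `AudioInferProcess`, assuming the default vocabulary, mean/std and test audio files referenced elsewhere in this commit already exist:

```python
# Sketch only: the paths follow the defaults used by infer_path.py in this repo.
processor = AudioInferProcess(vocab_filepath='./dataset/zh_vocab.txt',
                              mean_std_filepath='./dataset/mean_std.npz')
specgram = processor.process_utterance('./dataset/test.wav')
print(specgram.shape)        # (freq_bins, frames), normalized with the dataset mean/std
print(processor.vocab_size)  # number of tokens the acoustic model predicts over
```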
"""Contains the data augmentation pipeline."""
import json
import os
import random
import sys
from datetime import datetime
from data_utils.augmentor.volume_perturb import VolumePerturbAugmentor
from data_utils.augmentor.shift_perturb import ShiftPerturbAugmentor
from data_utils.augmentor.speed_perturb import SpeedPerturbAugmentor
from data_utils.augmentor.noise_perturb import NoisePerturbAugmentor
from data_utils.augmentor.spec_augment import SpecAugmentor
from data_utils.augmentor.resample import ResampleAugmentor
class AugmentationPipeline(object):
"""Build a pre-processing pipeline with various augmentation models.Such a
data augmentation pipeline is oftern leveraged to augment the training
samples to make the model invariant to certain types of perturbations in the
real world, improving model's generalization ability.
The pipeline is built according the the augmentation configuration in json
string, e.g.
.. code-block::
[
{
"type": "noise",
"params": {
"min_snr_dB": 10,
"max_snr_dB": 50,
"noise_manifest_path": "dataset/manifest.noise"
},
"prob": 0.5
},
{
"type": "speed",
"params": {
"min_speed_rate": 0.9,
"max_speed_rate": 1.1,
"num_rates": 3
},
"prob": 1.0
},
{
"type": "shift",
"params": {
"min_shift_ms": -5,
"max_shift_ms": 5
},
"prob": 1.0
},
{
"type": "volume",
"params": {
"min_gain_dBFS": -15,
"max_gain_dBFS": 15
},
"prob": 1.0
},
{
"type": "specaug",
"params": {
"W": 0,
"warp_mode": "PIL",
"F": 10,
"n_freq_masks": 2,
"T": 50,
"n_time_masks": 2,
"p": 1.0,
"adaptive_number_ratio": 0,
"adaptive_size_ratio": 0,
"max_n_time_masks": 20,
"replace_with_zero": true
},
"prob": 1.0
}
]
    This example configuration inserts several augmentation models into the
    pipeline, such as NoisePerturbAugmentor, SpeedPerturbAugmentor and
    VolumePerturbAugmentor. "prob" indicates the probability that the
    corresponding augmentor takes effect. If "prob" is zero, the augmentor
    never takes effect.
:param augmentation_config: Augmentation configuration in json string.
:type augmentation_config: str
:param random_seed: Random seed.
:type random_seed: int
    :raises ValueError: If the augmentation json config is in an incorrect format.
"""
def __init__(self, augmentation_config, random_seed=0):
self._rng = random.Random(random_seed)
self._augmentors, self._rates = self._parse_pipeline_from(augmentation_config, aug_type='audio')
self._spec_augmentors, self._spec_rates = self._parse_pipeline_from(augmentation_config, aug_type='feature')
def transform_audio(self, audio_segment):
"""Run the pre-processing pipeline for data augmentation.
Note that this is an in-place transformation.
:param audio_segment: Audio segment to process.
        :type audio_segment: AudioSegment|SpeechSegment
"""
for augmentor, rate in zip(self._augmentors, self._rates):
if self._rng.uniform(0., 1.) < rate:
augmentor.transform_audio(audio_segment)
def transform_feature(self, spec_segment):
"""spectrogram augmentation.
Args:
spec_segment (np.ndarray): audio feature, (D, T).
"""
for augmentor, rate in zip(self._spec_augmentors, self._spec_rates):
if self._rng.uniform(0., 1.) < rate:
spec_segment = augmentor.transform_feature(spec_segment)
return spec_segment
def _parse_pipeline_from(self, config_json, aug_type):
"""Parse the config json to build a augmentation pipelien."""
try:
configs = []
configs_temp = json.loads(config_json)
for config in configs_temp:
if config['aug_type'] != aug_type: continue
if config['type'] == 'noise' and not os.path.exists(config['params']['noise_manifest_path']):
print('%s不存在,已经忽略噪声增强操作!' % config['params']['noise_manifest_path'], file=sys.stderr)
continue
print('[%s] 数据增强配置:%s' % (datetime.now(), config))
configs.append(config)
augmentors = [self._get_augmentor(config["type"], config["params"]) for config in configs]
rates = [config["prob"] for config in configs]
except Exception as e:
raise ValueError("Failed to parse the augmentation config json: %s" % str(e))
return augmentors, rates
def _get_augmentor(self, augmentor_type, params):
"""Return an augmentation model by the type name, and pass in params."""
if augmentor_type == "volume":
return VolumePerturbAugmentor(self._rng, **params)
elif augmentor_type == "shift":
return ShiftPerturbAugmentor(self._rng, **params)
elif augmentor_type == "speed":
return SpeedPerturbAugmentor(self._rng, **params)
elif augmentor_type == "resample":
return ResampleAugmentor(self._rng, **params)
elif augmentor_type == "noise":
return NoisePerturbAugmentor(self._rng, **params)
elif augmentor_type == "specaug":
return SpecAugmentor(self._rng, **params)
else:
raise ValueError("Unknown augmentor type [%s]." % augmentor_type)
"""Contains the abstract base class for augmentation models."""
from abc import ABCMeta, abstractmethod
class AugmentorBase(object):
"""Abstract base class for augmentation model (augmentor) class.
All augmentor classes should inherit from this class, and implement the
following abstract methods.
"""
__metaclass__ = ABCMeta
@abstractmethod
def __init__(self):
pass
@abstractmethod
def transform_audio(self, audio_segment):
"""Adds various effects to the input audio segment. Such effects
will augment the training data to make the model invariant to certain
types of perturbations in the real world, improving model's
generalization ability.
Note that this is an in-place transformation.
:param audio_segment: Audio segment to add effects to.
        :type audio_segment: AudioSegment|SpeechSegment
"""
pass
"""Contains the noise perturb augmentation model."""
from data_utils.augmentor.base import AugmentorBase
from data_utils.utility import read_manifest
from data_utils.audio import AudioSegment
class NoisePerturbAugmentor(AugmentorBase):
"""用于添加背景噪声的增强模型
:param rng: Random generator object.
:type rng: random.Random
:param min_snr_dB: Minimal signal noise ratio, in decibels.
:type min_snr_dB: float
:param max_snr_dB: Maximal signal noise ratio, in decibels.
:type max_snr_dB: float
:param noise_manifest_path: Manifest path for noise audio data.
:type noise_manifest_path: str
"""
def __init__(self, rng, min_snr_dB, max_snr_dB, noise_manifest_path):
self._min_snr_dB = min_snr_dB
self._max_snr_dB = max_snr_dB
self._rng = rng
self._noise_manifest = read_manifest(manifest_path=noise_manifest_path)
def transform_audio(self, audio_segment):
"""Add background noise audio.
Note that this is an in-place transformation.
:param audio_segment: Audio segment to add effects to.
        :type audio_segment: AudioSegment|SpeechSegment
"""
noise_json = self._rng.sample(self._noise_manifest, 1)[0]
if noise_json['duration'] >= audio_segment.duration:
diff_duration = noise_json['duration'] - audio_segment.duration
start = self._rng.uniform(0, diff_duration)
end = start + audio_segment.duration
noise_segment = AudioSegment.slice_from_file(noise_json['audio_filepath'], start=start, end=end)
snr_dB = self._rng.uniform(self._min_snr_dB, self._max_snr_dB)
audio_segment.add_noise(noise_segment, snr_dB, allow_downsampling=True, rng=self._rng)
"""Contain the resample augmentation model."""
from data_utils.augmentor.base import AugmentorBase
class ResampleAugmentor(AugmentorBase):
"""重采样的增强模型
See more info here:
https://ccrma.stanford.edu/~jos/resample/index.html
:param rng: Random generator object.
:type rng: random.Random
:param new_sample_rate: New sample rate in Hz.
:type new_sample_rate: int
"""
def __init__(self, rng, new_sample_rate):
self._new_sample_rate = new_sample_rate
self._rng = rng
def transform_audio(self, audio_segment):
"""Resamples the input audio to a target sample rate.
Note that this is an in-place transformation.
:param audio: Audio segment to add effects to.
:type audio: AudioSegment|SpeechSegment
"""
audio_segment.resample(self._new_sample_rate)
"""Contains the volume perturb augmentation model."""
from data_utils.augmentor.base import AugmentorBase
class ShiftPerturbAugmentor(AugmentorBase):
"""添加随机位移扰动的增强模型
:param rng: Random generator object.
:type rng: random.Random
:param min_shift_ms: Minimal shift in milliseconds.
:type min_shift_ms: float
:param max_shift_ms: Maximal shift in milliseconds.
:type max_shift_ms: float
"""
def __init__(self, rng, min_shift_ms, max_shift_ms):
self._min_shift_ms = min_shift_ms
self._max_shift_ms = max_shift_ms
self._rng = rng
def transform_audio(self, audio_segment):
"""Shift audio.
Note that this is an in-place transformation.
:param audio_segment: Audio segment to add effects to.
        :type audio_segment: AudioSegment|SpeechSegment
"""
shift_ms = self._rng.uniform(self._min_shift_ms, self._max_shift_ms)
audio_segment.shift(shift_ms)
import random
import numpy as np
from PIL import Image
from PIL.Image import BICUBIC
from data_utils.augmentor.base import AugmentorBase
class SpecAugmentor(AugmentorBase):
"""Augmentation model for Time warping, Frequency masking, Time masking.
SpecAugment: A Simple Data Augmentation Method for Automatic Speech Recognition
https://arxiv.org/abs/1904.08779
SpecAugment on Large Scale Datasets
https://arxiv.org/abs/1912.05533
"""
def __init__(self,
rng,
F,
T,
n_freq_masks,
n_time_masks,
p=1.0,
W=40,
adaptive_number_ratio=0,
adaptive_size_ratio=0,
max_n_time_masks=20,
replace_with_zero=True,
warp_mode='PIL'):
"""SpecAugment class.
Args:
rng (random.Random): random generator object.
F (int): parameter for frequency masking
T (int): parameter for time masking
n_freq_masks (int): number of frequency masks
n_time_masks (int): number of time masks
p (float): parameter for upperbound of the time mask
W (int): parameter for time warping
adaptive_number_ratio (float): adaptive multiplicity ratio for time masking
adaptive_size_ratio (float): adaptive size ratio for time masking
max_n_time_masks (int): maximum number of time masking
replace_with_zero (bool): pad zero on mask if true else use mean
warp_mode (str): "PIL" (default, fast, not differentiable)
or "sparse_image_warp" (slow, differentiable)
"""
super().__init__()
self._rng = rng
self.inplace = True
self.replace_with_zero = replace_with_zero
self.mode = warp_mode
self.W = W
self.F = F
self.T = T
self.n_freq_masks = n_freq_masks
self.n_time_masks = n_time_masks
self.p = p
# adaptive SpecAugment
self.adaptive_number_ratio = adaptive_number_ratio
self.adaptive_size_ratio = adaptive_size_ratio
self.max_n_time_masks = max_n_time_masks
if adaptive_number_ratio > 0:
self.n_time_masks = 0
if adaptive_size_ratio > 0:
self.T = 0
self._freq_mask = None
self._time_mask = None
@property
def freq_mask(self):
return self._freq_mask
@property
def time_mask(self):
return self._time_mask
def __repr__(self):
return f"specaug: F-{self.F}, T-{self.T}, F-n-{self.n_freq_masks}, T-n-{self.n_time_masks}"
def time_warp(self, x, mode='PIL'):
"""time warp for spec augment
move random center frame by the random width ~ uniform(-window, window)
Args:
x (np.ndarray): spectrogram (time, freq)
mode (str): PIL or sparse_image_warp
Raises:
NotImplementedError: [description]
NotImplementedError: [description]
Returns:
np.ndarray: time warped spectrogram (time, freq)
"""
window = max_time_warp = self.W
if window == 0:
return x
if mode == "PIL":
t = x.shape[0]
if t - window <= window:
return x
# NOTE: randrange(a, b) emits a, a + 1, ..., b - 1
center = random.randrange(window, t - window)
warped = random.randrange(center - window, center +
window) + 1 # 1 ... t - 1
left = Image.fromarray(x[:center]).resize((x.shape[1], warped),
BICUBIC)
right = Image.fromarray(x[center:]).resize((x.shape[1], t - warped),
BICUBIC)
if self.inplace:
x[:warped] = left
x[warped:] = right
return x
return np.concatenate((left, right), 0)
elif mode == "sparse_image_warp":
raise NotImplementedError('sparse_image_warp')
else:
raise NotImplementedError(
"unknown resize mode: " + mode +
", choose one from (PIL, sparse_image_warp).")
def mask_freq(self, x, replace_with_zero=False):
"""freq mask
Args:
x (np.ndarray): spectrogram (time, freq)
replace_with_zero (bool, optional): Defaults to False.
Returns:
np.ndarray: freq mask spectrogram (time, freq)
"""
n_bins = x.shape[1]
for i in range(0, self.n_freq_masks):
f = int(self._rng.uniform(a=0, b=self.F))
f_0 = int(self._rng.uniform(a=0, b=n_bins - f))
assert f_0 <= f_0 + f
if replace_with_zero:
x[:, f_0:f_0 + f] = 0
else:
x[:, f_0:f_0 + f] = x.mean()
self._freq_mask = (f_0, f_0 + f)
return x
def mask_time(self, x, replace_with_zero=False):
"""time mask
Args:
x (np.ndarray): spectrogram (time, freq)
replace_with_zero (bool, optional): Defaults to False.
Returns:
np.ndarray: time mask spectrogram (time, freq)
"""
n_frames = x.shape[0]
if self.adaptive_number_ratio > 0:
n_masks = int(n_frames * self.adaptive_number_ratio)
n_masks = min(n_masks, self.max_n_time_masks)
else:
n_masks = self.n_time_masks
if self.adaptive_size_ratio > 0:
T = self.adaptive_size_ratio * n_frames
else:
T = self.T
for i in range(n_masks):
t = int(self._rng.uniform(a=0, b=T))
t = min(t, int(n_frames * self.p))
t_0 = int(self._rng.uniform(a=0, b=n_frames - t))
assert t_0 <= t_0 + t
if replace_with_zero:
x[t_0:t_0 + t, :] = 0
else:
x[t_0:t_0 + t, :] = x.mean()
self._time_mask = (t_0, t_0 + t)
return x
def __call__(self, x, train=True):
if not train:
return x
return self.transform_feature(x)
def transform_feature(self, x: np.ndarray):
"""
Args:
x (np.ndarray): `[T, F]`
Returns:
x (np.ndarray): `[T, F]`
"""
assert isinstance(x, np.ndarray)
assert x.ndim == 2
x = self.time_warp(x, self.mode)
x = self.mask_freq(x, self.replace_with_zero)
x = self.mask_time(x, self.replace_with_zero)
return x
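A small sketch of applying this augmentor to a random spectrogram, using the same parameters as the `specaug` entry in the augmentation configuration shown earlier (with `W=0`, time warping is a no-op):

```python
import random
import numpy as np

# Sketch only: parameters mirror the "specaug" entry of the JSON config above.
aug = SpecAugmentor(rng=random.Random(0), F=10, T=50, n_freq_masks=2, n_time_masks=2,
                    p=1.0, W=0, replace_with_zero=True)
spec = np.random.rand(200, 161)          # (time, freq) feature, values in [0, 1)
out = aug.transform_feature(spec.copy())
print(aug.freq_mask, aug.time_mask)      # the last frequency/time mask that was applied
print((out == 0).any())                  # True once a non-empty mask has been zeroed out
```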
"""Contain the speech perturbation augmentation model."""
import numpy as np
from data_utils.augmentor.base import AugmentorBase
class SpeedPerturbAugmentor(AugmentorBase):
"""添加速度扰动的增强模型
See reference paper here:
http://www.danielpovey.com/files/2015_interspeech_augmentation.pdf
:param rng: Random generator object.
:type rng: random.Random
:param min_speed_rate: Lower bound of new speed rate to sample and should
not be smaller than 0.9.
:type min_speed_rate: float
:param max_speed_rate: Upper bound of new speed rate to sample and should
not be larger than 1.1.
:type max_speed_rate: float
"""
def __init__(self, rng, min_speed_rate=0.9, max_speed_rate=1.1, num_rates=3):
if min_speed_rate < 0.9:
raise ValueError("Sampling speed below 0.9 can cause unnatural effects")
if max_speed_rate > 1.1:
raise ValueError("Sampling speed above 1.1 can cause unnatural effects")
self._min_speed_rate = min_speed_rate
self._max_speed_rate = max_speed_rate
self._rng = rng
self._num_rates = num_rates
if num_rates > 0:
self._rates = np.linspace(self._min_speed_rate, self._max_speed_rate, self._num_rates, endpoint=True)
def transform_audio(self, audio_segment):
"""Sample a new speed rate from the given range and
changes the speed of the given audio clip.
Note that this is an in-place transformation.
:param audio_segment: Audio segment to add effects to.
:type audio_segment: AudioSegment|SpeechSegment
"""
        if self._num_rates <= 0:
speed_rate = self._rng.uniform(self._min_speed_rate, self._max_speed_rate)
else:
speed_rate = self._rng.choice(self._rates)
if speed_rate == 1.0: return
audio_segment.change_speed(speed_rate)
"""Contains the volume perturb augmentation model."""
from data_utils.augmentor.base import AugmentorBase
class VolumePerturbAugmentor(AugmentorBase):
"""添加随机体积扰动的增强模型
This is used for multi-loudness training of PCEN. See
https://arxiv.org/pdf/1607.05666v1.pdf
for more details.
:param rng: Random generator object.
:type rng: random.Random
:param min_gain_dBFS: Minimal gain in dBFS.
:type min_gain_dBFS: float
:param max_gain_dBFS: Maximal gain in dBFS.
:type max_gain_dBFS: float
"""
def __init__(self, rng, min_gain_dBFS, max_gain_dBFS):
self._min_gain_dBFS = min_gain_dBFS
self._max_gain_dBFS = max_gain_dBFS
self._rng = rng
def transform_audio(self, audio_segment):
"""Change audio loadness.
Note that this is an in-place transformation.
:param audio_segment: Audio segment to add effects to.
        :type audio_segment: AudioSegment|SpeechSegment
"""
gain = self._rng.uniform(self._min_gain_dBFS, self._max_gain_dBFS)
audio_segment.gain_db(gain)
"""Contains the audio featurizer class."""
import numpy as np
from data_utils.audio import AudioSegment
class AudioFeaturizer(object):
"""音频特征器,用于从AudioSegment或SpeechSegment内容中提取特性。
Currently, it supports feature types of linear spectrogram and mfcc.
:param stride_ms: Striding size (in milliseconds) for generating frames.
:type stride_ms: float
:param window_ms: Window size (in milliseconds) for generating frames.
:type window_ms: float
:param target_sample_rate: Audio are resampled (if upsampling or
downsampling is allowed) to this before
extracting spectrogram features.
:type target_sample_rate: int
:param use_dB_normalization: Whether to normalize the audio to a certain
decibels before extracting the features.
:type use_dB_normalization: bool
:param target_dB: Target audio decibels for normalization.
:type target_dB: float
"""
def __init__(self,
stride_ms=10.0,
window_ms=20.0,
target_sample_rate=16000,
use_dB_normalization=True,
target_dB=-20):
self._stride_ms = stride_ms
self._window_ms = window_ms
self._target_sample_rate = target_sample_rate
self._use_dB_normalization = use_dB_normalization
self._target_dB = target_dB
def featurize(self, audio_segment, allow_downsampling=True, allow_upsampling=True):
"""从AudioSegment或SpeechSegment中提取音频特征
:param audio_segment: Audio/speech segment to extract features from.
:type audio_segment: AudioSegment|SpeechSegment
:param allow_downsampling: Whether to allow audio downsampling before featurizing.
:type allow_downsampling: bool
:param allow_upsampling: Whether to allow audio upsampling before featurizing.
:type allow_upsampling: bool
:return: Spectrogram audio feature in 2darray.
:rtype: ndarray
:raises ValueError: If audio sample rate is not supported.
"""
# upsampling or downsampling
if ((audio_segment.sample_rate > self._target_sample_rate and
allow_downsampling) or
(audio_segment.sample_rate < self._target_sample_rate and
allow_upsampling)):
audio_segment.resample(self._target_sample_rate)
if audio_segment.sample_rate != self._target_sample_rate:
raise ValueError("Audio sample rate is not supported. "
"Turn allow_downsampling or allow up_sampling on.")
# decibel normalization
if self._use_dB_normalization:
audio_segment.normalize(target_db=self._target_dB)
# extract spectrogram
return self._compute_linear_specgram(audio_segment.samples, audio_segment.sample_rate,
stride_ms=self._stride_ms, window_ms=self._window_ms)
# 用快速傅里叶变换计算线性谱图
@staticmethod
def _compute_linear_specgram(samples,
sample_rate,
stride_ms=10.0,
window_ms=20.0,
eps=1e-14):
stride_size = int(0.001 * sample_rate * stride_ms)
window_size = int(0.001 * sample_rate * window_ms)
truncate_size = (len(samples) - window_size) % stride_size
samples = samples[:len(samples) - truncate_size]
nshape = (window_size, (len(samples) - window_size) // stride_size + 1)
nstrides = (samples.strides[0], samples.strides[0] * stride_size)
windows = np.lib.stride_tricks.as_strided(samples, shape=nshape, strides=nstrides)
assert np.all(windows[:, 1] == samples[stride_size:(stride_size + window_size)])
# 快速傅里叶变换
weighting = np.hanning(window_size)[:, None]
fft = np.fft.rfft(windows * weighting, n=None, axis=0)
fft = np.absolute(fft)
fft = fft ** 2
scale = np.sum(weighting ** 2) * sample_rate
fft[1:-1, :] *= (2.0 / scale)
fft[(0, -1), :] /= scale
freqs = float(sample_rate) / window_size * np.arange(fft.shape[0])
ind = np.where(freqs <= (sample_rate / 2))[0][-1] + 1
return np.log(fft[:ind, :] + eps)
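A rough sanity check of the spectrogram shape produced by this featurizer, calling the private helper directly purely for illustration (the 440 Hz tone is arbitrary):

```python
import numpy as np

# Sketch only: one second of a 440 Hz tone at 16 kHz.
sample_rate = 16000
t = np.arange(sample_rate) / sample_rate
samples = np.sin(2 * np.pi * 440.0 * t).astype('float32')
spec = AudioFeaturizer._compute_linear_specgram(samples, sample_rate,
                                                stride_ms=10.0, window_ms=20.0)
# window of 320 samples, stride of 160 samples -> 161 frequency bins x 99 frames
print(spec.shape)  # (161, 99)
```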
"""Contains the speech featurizer class."""
from data_utils.featurizer.audio_featurizer import AudioFeaturizer
from data_utils.featurizer.text_featurizer import TextFeaturizer
class SpeechFeaturizer(object):
"""Speech featurizer, for extracting features from both audio and transcript
contents of SpeechSegment.
Currently, for audio parts, it supports feature types of linear
spectrogram and mfcc; for transcript parts, it only supports char-level
tokenizing and conversion into a list of token indices. Note that the
token indexing order follows the given vocabulary file.
:param vocab_filepath: Filepath to load vocabulary for token indices
conversion.
:type vocab_filepath: str
:param stride_ms: Striding size (in milliseconds) for generating frames.
:type stride_ms: float
:param window_ms: Window size (in milliseconds) for generating frames.
:type window_ms: float
:param target_sample_rate: Speech are resampled (if upsampling or
downsampling is allowed) to this before
extracting spectrogram features.
:type target_sample_rate: int
:param use_dB_normalization: Whether to normalize the audio to a certain
decibels before extracting the features.
:type use_dB_normalization: bool
:param target_dB: Target audio decibels for normalization.
:type target_dB: float
"""
def __init__(self,
vocab_filepath,
stride_ms=10.0,
window_ms=20.0,
target_sample_rate=16000,
use_dB_normalization=True,
target_dB=-20):
self._audio_featurizer = AudioFeaturizer(stride_ms=stride_ms,
window_ms=window_ms,
target_sample_rate=target_sample_rate,
use_dB_normalization=use_dB_normalization,
target_dB=target_dB)
self._text_featurizer = TextFeaturizer(vocab_filepath)
def featurize(self, speech_segment, keep_transcription_text):
"""提取语音片段的特征
1. For audio parts, extract the audio features.
2. For transcript parts, keep the original text or convert text string
to a list of token indices in char-level.
:param audio_segment: Speech segment to extract features from.
:type audio_segment: SpeechSegment
:return: A tuple of 1) spectrogram audio feature in 2darray, 2) list of
char-level token indices.
:rtype: tuple
"""
audio_feature = self._audio_featurizer.featurize(speech_segment)
if keep_transcription_text:
return audio_feature, speech_segment.transcript
text_ids = self._text_featurizer.featurize(speech_segment.transcript)
return audio_feature, text_ids
@property
def vocab_size(self):
"""返回词汇表大小
:return: Vocabulary size.
:rtype: int
"""
return self._text_featurizer.vocab_size
@property
def vocab_list(self):
"""返回词汇表的list
:return: Vocabulary in list.
:rtype: list
"""
return self._text_featurizer.vocab_list
class TextFeaturizer(object):
"""文本特征器,用于处理或从文本中提取特征。支持字符级的令牌化和转换为令牌索引列表
:param vocab_filepath: 令牌索引转换词汇表的文件路径
:type vocab_filepath: str
"""
def __init__(self, vocab_filepath):
self._vocab_dict, self._vocab_list = self._load_vocabulary_from_file(
vocab_filepath)
def featurize(self, text):
"""将文本字符串转换为字符级的令牌索引列表
:param text: 文本
:type text: str
:return:字符级令牌索引列表
:rtype: list
"""
tokens = self._char_tokenize(text)
token_indices = []
for token in tokens:
# 跳过词汇表不存在的字符
if token not in self._vocab_list:continue
token_indices.append(self._vocab_dict[token])
return token_indices
@property
def vocab_size(self):
"""返回词汇表大小
:return: Vocabulary size.
:rtype: int
"""
return len(self._vocab_list)
@property
def vocab_list(self):
"""返回词汇表的列表
:return: Vocabulary in list.
:rtype: list
"""
return self._vocab_list
def _char_tokenize(self, text):
"""Character tokenizer."""
return list(text.strip())
def _load_vocabulary_from_file(self, vocab_filepath):
"""Load vocabulary from file."""
vocab_lines = []
with open(vocab_filepath, 'r', encoding='utf-8') as file:
vocab_lines.extend(file.readlines())
vocab_list = [line.split('\t')[0].replace('\n', '') for line in vocab_lines]
vocab_dict = dict(
[(token, id) for (id, token) in enumerate(vocab_list)])
return vocab_dict, vocab_list
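A small round-trip sketch with a made-up two-character vocabulary; the file format matches what `create_data.py` writes (token and count separated by a tab, `<blank>` first), so token indices follow line order:

```python
# Sketch only: tiny hypothetical vocabulary file.
with open('vocab_demo.txt', 'w', encoding='utf-8') as f:
    f.write('<blank>\t-1\n你\t100\n好\t90\n')

featurizer = TextFeaturizer('vocab_demo.txt')
print(featurizer.vocab_list)           # ['<blank>', '你', '好']
print(featurizer.featurize('你好吗'))   # [1, 2]; '吗' is not in the vocabulary and is skipped
```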
"""特征归一化"""
import math
import numpy as np
import random
from tqdm import tqdm
from paddle.io import Dataset, DataLoader
from data_utils.utility import read_manifest
from data_utils.audio import AudioSegment
from data_utils.featurizer.audio_featurizer import AudioFeaturizer
class FeatureNormalizer(object):
"""音频特征归一化类
如果mean_std_filepath不是None,则normalizer将直接从文件初始化。否则,使用manifest_path应该给特征mean和stddev计算
:param mean_std_filepath: 均值和标准值的文件路径
:type mean_std_filepath: None|str
:param manifest_path: 用于计算均值和标准值的数据列表,一般是训练的数据列表
:type meanifest_path: None|str
:param featurize_func:函数提取特征。它应该是可调用的``featurize_func(audio_segment)``
:type featurize_func: None|callable
:param num_samples: 用于计算均值和标准值的音频数量
:type num_samples: int
:param random_seed: 随机种子
:type random_seed: int
:raises ValueError: 如果mean_std_filepath和manifest_path(或mean_std_filepath和featurize_func)都为None
"""
def __init__(self,
mean_std_filepath,
manifest_path=None,
num_workers=4,
num_samples=5000,
random_seed=0):
if not mean_std_filepath:
if not manifest_path:
raise ValueError("如果mean_std_filepath是None,那么meanifest_path和featurize_func不应该是None")
self._rng = random.Random(random_seed)
self._compute_mean_std(manifest_path, num_samples, num_workers)
else:
self._read_mean_std_from_file(mean_std_filepath)
def apply(self, features, eps=1e-20):
"""使用均值和标准值计算音频特征的归一化值
:param features: 需要归一化的音频
:type features: ndarray
:param eps: 添加到标准值以提供数值稳定性
:type eps: float
:return: 已经归一化的数据
:rtype: ndarray
"""
return (features - self._mean) / (self._std + eps)
def write_to_file(self, filepath):
"""将计算得到的均值和标准值写入到文件中
:param filepath: 均值和标准值写入的文件路径
:type filepath: str
"""
np.savez(filepath, mean=self._mean, std=self._std)
def _read_mean_std_from_file(self, filepath):
"""从文件中加载均值和标准值"""
npzfile = np.load(filepath)
self._mean = npzfile["mean"]
self._std = npzfile["std"]
def _compute_mean_std(self, manifest_path, num_samples, num_workers):
"""从随机抽样的实例中计算均值和标准值"""
manifest = read_manifest(manifest_path)
if num_samples < 0 or num_samples > len(manifest):
sampled_manifest = manifest
else:
sampled_manifest = self._rng.sample(manifest, num_samples)
dataset = NormalizerDataset(sampled_manifest)
test_loader = DataLoader(dataset=dataset, batch_size=64, collate_fn=collate_fn, num_workers=num_workers)
# 求总和
std, means = None, None
number = 0
for std1, means1, number1 in tqdm(test_loader()):
number += number1
if means is None:
means = means1
else:
means += means1
if std is None:
std = std1
else:
std += std1
# 求总和的均值和标准值
for i in range(len(means)):
means[i] /= number
std[i] = std[i] / number - means[i] * means[i]
if std[i] < 1.0e-20:
std[i] = 1.0e-20
std[i] = math.sqrt(std[i])
self._mean = means.reshape([-1, 1])
self._std = std.reshape([-1, 1])
class NormalizerDataset(Dataset):
def __init__(self, sampled_manifest):
super(NormalizerDataset, self).__init__()
self.audio_featurizer = AudioFeaturizer()
self.sampled_manifest = sampled_manifest
def __getitem__(self, idx):
instance = self.sampled_manifest[idx]
# 获取音频特征
audio = AudioSegment.from_file(instance["audio_filepath"])
feature = self.audio_featurizer.featurize(audio)
return feature, 0
def __len__(self):
return len(self.sampled_manifest)
def collate_fn(features):
std, means = None, None
number = 0
for feature, _ in features:
number += feature.shape[1]
sums = np.sum(feature, axis=1)
if means is None:
means = sums
else:
means += sums
square_sums = np.sum(np.square(feature), axis=1)
if std is None:
std = square_sums
else:
std += square_sums
return std, means, number
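For clarity, a short standalone check (not part of the pipeline) that the streaming sums accumulated in `collate_fn` and `_compute_mean_std` reproduce the usual per-bin statistics:

```python
import numpy as np

# Sketch only: one (freq_bins, frames) feature, as produced by AudioFeaturizer.
feature = np.random.rand(161, 200)
number = feature.shape[1]
means = np.sum(feature, axis=1) / number                          # E[x] per bin
var = np.sum(np.square(feature), axis=1) / number - means ** 2    # E[x^2] - E[x]^2
std = np.sqrt(np.maximum(var, 1e-20))
assert np.allclose(means, feature.mean(axis=1))
assert np.allclose(std, feature.std(axis=1))
```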
"""Contains the speech segment class."""
import numpy as np
from data_utils.audio import AudioSegment
class SpeechSegment(AudioSegment):
"""语音片段抽象是音频片段的一个子类,附加文字记录。
:param samples: Audio samples [num_samples x num_channels].
:type samples: ndarray.float32
:param sample_rate: 训练数据的采样率
:type sample_rate: int
:param transcript: 音频文件对应的文本
    :type transcript: str
:raises TypeError: If the sample data type is not float or int.
"""
def __init__(self, samples, sample_rate, transcript):
AudioSegment.__init__(self, samples, sample_rate)
self._transcript = transcript
def __eq__(self, other):
"""Return whether two objects are equal.
"""
if not AudioSegment.__eq__(self, other):
return False
if self._transcript != other._transcript:
return False
return True
def __ne__(self, other):
"""Return whether two objects are unequal."""
return not self.__eq__(other)
@classmethod
def from_file(cls, filepath, transcript):
"""从音频文件和相应的文本创建语音片段
:param filepath: 音频文件路径
:type filepath: str|file
:param transcript: 音频文件对应的文本
        :type transcript: str
:return: Speech segment instance.
:rtype: SpeechSegment
"""
audio = AudioSegment.from_file(filepath)
return cls(audio.samples, audio.sample_rate, transcript)
@classmethod
def from_bytes(cls, bytes, transcript):
"""从字节串和相应的文本创建语音片段
:param bytes: 包含音频样本的字节字符串
:type bytes: str
:param transcript: 音频文件对应的文本
        :type transcript: str
:return: Speech segment instance.
        :rtype: SpeechSegment
"""
audio = AudioSegment.from_bytes(bytes)
return cls(audio.samples, audio.sample_rate, transcript)
@classmethod
def concatenate(cls, *segments):
"""将任意数量的语音片段连接在一起,音频和文本都将被连接
:param *segments: 要连接的输入语音片段
:type *segments: tuple of SpeechSegment
:return: 返回SpeechSegment实例
:rtype: SpeechSegment
:raises ValueError: 不能用不同的抽样率连接片段
:raises TypeError: 只有相同类型SpeechSegment实例的语音片段可以连接
"""
if len(segments) == 0:
raise ValueError("音频片段为空")
sample_rate = segments[0]._sample_rate
transcripts = ""
for seg in segments:
if sample_rate != seg._sample_rate:
raise ValueError("不能用不同的抽样率连接片段")
if type(seg) is not cls:
raise TypeError("只有相同类型SpeechSegment实例的语音片段可以连接")
transcripts += seg._transcript
samples = np.concatenate([seg.samples for seg in segments])
return cls(samples, sample_rate, transcripts)
@classmethod
def slice_from_file(cls, filepath, transcript, start=None, end=None):
"""只加载一小部分SpeechSegment,而不需要将整个文件加载到内存中,这是非常浪费的。
:param filepath:文件路径或文件对象到音频文件
:type filepath: str|file
:param start: 开始时间,单位为秒。如果start是负的,则它从末尾开始计算。如果没有提供,这个函数将从最开始读取。
:type start: float
:param end: 结束时间,单位为秒。如果end是负的,则它从末尾开始计算。如果没有提供,默认的行为是读取到文件的末尾。
:type end: float
:param transcript: 音频文件对应的文本,如果没有提供,默认值是一个空字符串。
        :type transcript: str
:return: SpeechSegment实例
:rtype: SpeechSegment
"""
audio = AudioSegment.slice_from_file(filepath, start, end)
return cls(audio.samples, audio.sample_rate, transcript)
@classmethod
def make_silence(cls, duration, sample_rate):
"""创建指定安静音频长度和采样率的SpeechSegment实例,音频文件对应的文本将为空字符串。
:param duration: 安静音频的时间,单位秒
:type duration: float
:param sample_rate: 音频采样率
:type sample_rate: float
:return: 安静音频SpeechSegment实例
:rtype: SpeechSegment
"""
audio = AudioSegment.make_silence(duration, sample_rate)
return cls(audio.samples, audio.sample_rate, "")
@property
def transcript(self):
"""返回音频文件对应的文本
:return: 音频文件对应的文本
:rtype: str
"""
return self._transcript
"""数据工具函数"""
import json
def read_manifest(manifest_path, max_duration=float('inf'), min_duration=0.0):
"""解析数据列表
持续时间在[min_duration, max_duration]之外的实例将被过滤。
:param manifest_path: 数据列表的路径
:type manifest_path: str
:param max_duration: 过滤的最长音频长度
:type max_duration: float
:param min_duration: 过滤的最短音频长度
:type min_duration: float
:return: 数据列表,JSON格式
:rtype: list
:raises IOError: If failed to parse the manifest.
"""
manifest = []
for json_line in open(manifest_path, 'r', encoding='utf-8'):
try:
json_data = json.loads(json_line)
except Exception as e:
raise IOError("Error reading manifest: %s" % str(e))
if max_duration >= json_data["duration"] >= min_duration:
manifest.append(json_data)
return manifest
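A minimal example of the manifest format this function expects, one JSON object per line, matching what `create_data.py` writes (the audio path below is made up):

```python
import json

# Sketch only: write a one-line manifest and read it back.
line = json.dumps({'audio_filepath': 'dataset/audio/demo.wav',
                   'duration': 3.2,
                   'text': '你好'}, ensure_ascii=False)
with open('manifest.demo', 'w', encoding='utf-8') as f:
    f.write(line + '\n')

print(read_manifest('manifest.demo', max_duration=20.0, min_duration=0.5))
```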