Commit 3c1e08b4 authored by 西极

init

parent 797bf0f3
# 2023Thesis: see the Readme files in the two code folders for details
MIT License
Copyright (c) 2021 Jinglin Liu
Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.
# DiffSinger: Singing Voice Synthesis via Shallow Diffusion Mechanism
[![arXiv](https://img.shields.io/badge/arXiv-Paper-<COLOR>.svg)](https://arxiv.org/abs/2105.02446)
[![GitHub Stars](https://img.shields.io/github/stars/MoonInTheRiver/DiffSinger?style=social)](https://github.com/MoonInTheRiver/DiffSinger)
[![downloads](https://img.shields.io/github/downloads/MoonInTheRiver/DiffSinger/total.svg)](https://github.com/MoonInTheRiver/DiffSinger/releases)
[![Hugging Face](https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-blue?label=TTSDemo)](https://huggingface.co/spaces/NATSpeech/DiffSpeech)
[![Hugging Face](https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-blue?label=SVSDemo)](https://huggingface.co/spaces/Silentlin/DiffSinger)
This repository is the official PyTorch implementation of our AAAI-2022 [paper](https://arxiv.org/abs/2105.02446), in which we propose DiffSinger (for Singing-Voice-Synthesis) and DiffSpeech (for Text-to-Speech).
:tada: :tada: :tada: **Updates**:
- Sep.11, 2022: :electric_plug: [DiffSinger-PN](docs/README-SVS-opencpop-pndm.md). Added the plug-in [PNDM](https://arxiv.org/abs/2202.09778) (ICLR 2022, from our laboratory) to accelerate DiffSinger sampling at no extra training cost.
- Jul.27, 2022: Update documents for [SVS](docs/README-SVS.md). Add easy inference [A](docs/README-SVS-opencpop-cascade.md#4-inference-from-raw-inputs) & [B](docs/README-SVS-opencpop-e2e.md#4-inference-from-raw-inputs); Add Interactive SVS running on [HuggingFace🤗 SVS](https://huggingface.co/spaces/Silentlin/DiffSinger).
- Mar.2, 2022: Support MIDI-B-version SVS.
- Mar.1, 2022: [NeuralSVB](https://github.com/MoonInTheRiver/NeuralSVB), for singing voice beautifying, has been released.
- Feb.13, 2022: [NATSpeech](https://github.com/NATSpeech/NATSpeech), the improved code framework containing the implementations of DiffSpeech and our NeurIPS-2021 work [PortaSpeech](https://openreview.net/forum?id=xmJsuh8xlq), has been released.
- Jan.29, 2022: support MIDI-A-version SVS.
- Jan.13, 2022: support SVS, release PopCS dataset.
- Dec.19, 2021: support TTS. [HuggingFace🤗 TTS](https://huggingface.co/spaces/NATSpeech/DiffSpeech)
:rocket: **News**:
- Feb.24, 2022: Our new work, NeuralSVB, was accepted by ACL-2022 [![arXiv](https://img.shields.io/badge/arXiv-Paper-<COLOR>.svg)](https://arxiv.org/abs/2202.13277). [Demo Page](https://neuralsvb.github.io).
- Dec.01, 2021: DiffSinger was accepted by AAAI-2022.
- Sep.29, 2021: Our recent work `PortaSpeech: Portable and High-Quality Generative Text-to-Speech` was accepted by NeurIPS-2021 [![arXiv](https://img.shields.io/badge/arXiv-Paper-<COLOR>.svg)](https://arxiv.org/abs/2109.15166).
- May.06, 2021: We submitted DiffSinger to arXiv [![arXiv](https://img.shields.io/badge/arXiv-Paper-<COLOR>.svg)](https://arxiv.org/abs/2105.02446).
## Environments
1. If you want to use an Anaconda environment:
```sh
conda create -n your_env_name python=3.8
source activate your_env_name
pip install -r requirements_2080.txt   # for GPU 2080Ti, CUDA 10.2
# or: pip install -r requirements_3090.txt   # for GPU 3090, CUDA 11.4
```
2. Or, if you want to use a Python virtual environment:
```sh
## Install Python 3.8 first.
python -m venv venv
source venv/bin/activate
# install requirements.
pip install -U pip
pip install Cython numpy==1.19.1
pip install torch==1.9.0
pip install -r requirements.txt
```
## Documents
- [Run DiffSpeech (TTS version)](docs/README-TTS.md).
- [Run DiffSinger (SVS version)](docs/README-SVS.md).
## Overview
| Mel Pipeline | Dataset | Pitch Input | F0 Prediction | Acceleration Method | Vocoder |
| ------------------------------------------------------------------------------------------- | ---------------------------------------------------------| ----------------- | ------------- | --------------------------- | ----------------------------- |
| [DiffSpeech (Text->F0, Text+F0->Mel, Mel->Wav)](docs/README-TTS.md) | [LJSpeech](https://keithito.com/LJ-Speech-Dataset/) | None | Explicit | Shallow Diffusion | NSF-HiFiGAN |
| [DiffSinger (Lyric+F0->Mel, Mel->Wav)](docs/README-SVS-popcs.md) | [PopCS](https://github.com/MoonInTheRiver/DiffSinger) | Ground-Truth F0 | None | Shallow Diffusion | NSF-HiFiGAN |
| [DiffSinger (Lyric+MIDI->F0, Lyric+F0->Mel, Mel->Wav)](docs/README-SVS-opencpop-cascade.md) | [OpenCpop](https://wenet.org.cn/opencpop/) | MIDI | Explicit | Shallow Diffusion | NSF-HiFiGAN |
| [FFT-Singer (Lyric+MIDI->F0, Lyric+F0->Mel, Mel->Wav)](docs/README-SVS-opencpop-cascade.md) | [OpenCpop](https://wenet.org.cn/opencpop/) | MIDI | Explicit | N/A (no diffusion) | NSF-HiFiGAN |
| [DiffSinger (Lyric+MIDI->Mel, Mel->Wav)](docs/README-SVS-opencpop-e2e.md) | [OpenCpop](https://wenet.org.cn/opencpop/) | MIDI | Implicit | None | Pitch-Extractor + NSF-HiFiGAN |
| [DiffSinger+PNDM (Lyric+MIDI->Mel, Mel->Wav)](docs/README-SVS-opencpop-pndm.md) | [OpenCpop](https://wenet.org.cn/opencpop/) | MIDI | Implicit | PLMS | Pitch-Extractor + NSF-HiFiGAN |
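For intuition, the "Shallow Diffusion" entries above refer to the paper's key trick: instead of running the reverse (denoising) process from pure Gaussian noise at step T, it starts from the auxiliary decoder's coarse mel-spectrogram diffused forward to a shallow step k < T. A minimal PyTorch sketch of that boundary (hypothetical function names, not the code in this repository):

```python
import torch

def q_sample(x0, t, alphas_cumprod):
    # Closed-form forward diffusion: sample x_t ~ q(x_t | x_0).
    a = alphas_cumprod[t] ** 0.5           # sqrt(alpha_bar_t)
    s = (1.0 - alphas_cumprod[t]) ** 0.5   # sqrt(1 - alpha_bar_t)
    return a * x0 + s * torch.randn_like(x0)

def shallow_diffusion_infer(denoise_step, aux_mel, k, alphas_cumprod):
    # Start the reverse process at shallow step k from the auxiliary
    # (FastSpeech-style) decoder's mel, not at step T from pure noise.
    x = q_sample(aux_mel, k, alphas_cumprod)
    for t in reversed(range(k)):
        x = denoise_step(x, t)  # one reverse step, sampling p(x_{t-1} | x_t)
    return x
```

Since k is much smaller than T, inference takes correspondingly fewer denoising steps, and the diffusion decoder only has to refine the auxiliary mel rather than generate it from scratch.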
## Tensorboard
```sh
tensorboard --logdir_spec exp_name
```
<table style="width:100%">
<tr>
<td><img src="resources/tfb.png" alt="Tensorboard" height="250"></td>
</tr>
</table>
## Citation
    @article{liu2021diffsinger,
      title={Diffsinger: Singing voice synthesis via shallow diffusion mechanism},
      author={Liu, Jinglin and Li, Chengxi and Ren, Yi and Chen, Feiyang and Liu, Peng and Zhao, Zhou},
      journal={arXiv preprint arXiv:2105.02446},
      volume={2},
      year={2021}
    }
## Acknowledgements
* lucidrains' [denoising-diffusion-pytorch](https://github.com/lucidrains/denoising-diffusion-pytorch)
* Official [PyTorch Lightning](https://github.com/PyTorchLightning/pytorch-lightning)
* kan-bayashi's [ParallelWaveGAN](https://github.com/kan-bayashi/ParallelWaveGAN)
* jik876's [HifiGAN](https://github.com/jik876/hifi-gan)
* Official [espnet](https://github.com/espnet/espnet)
* lmnt-com's [DiffWave](https://github.com/lmnt-com/diffwave)
* keonlee9420's [Implementation](https://github.com/keonlee9420/DiffSinger).
Especially thanks to:
* Team OpenVPI's ongoing maintenance of [DiffSinger](https://github.com/openvpi/DiffSinger).
* Your re-creation and sharing.
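# ---- configs/config_base.yaml (file name inferred from the base_config references below) ----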
# task
binary_data_dir: ''
work_dir: '' # experiment directory.
infer: false # whether to run inference instead of training
seed: 1234
debug: false
save_codes:
- configs
- modules
- tasks
- utils
- usr
#############
# dataset
#############
ds_workers: 1
test_num: 100
valid_num: 100
endless_ds: false
sort_by_len: true
#########
# train and eval
#########
load_ckpt: ''
save_ckpt: true
save_best: false
num_ckpt_keep: 3
clip_grad_norm: 0
accumulate_grad_batches: 1
log_interval: 100
num_sanity_val_steps: 5 # steps of validation at the beginning
check_val_every_n_epoch: 10
val_check_interval: 2000
max_epochs: 1000
max_updates: 160000
max_tokens: 31250
max_sentences: 100000
max_eval_tokens: -1
max_eval_sentences: -1
test_input_dir: ''
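# ---- configs/singing/base.yaml (file name inferred from base_config cross-references) ----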
base_config:
- configs/tts/base.yaml
- configs/tts/base_zh.yaml
datasets: []
test_prefixes: []
test_num: 0
valid_num: 0
pre_align_cls: data_gen.singing.pre_align.SingingPreAlign
binarizer_cls: data_gen.singing.binarize.SingingBinarizer
pre_align_args:
use_tone: false # for ZH
forced_align: mfa
use_sox: true
hop_size: 128 # Hop size.
fft_size: 512 # FFT size.
win_size: 512 # Window size.
max_frames: 8000
fmin: 50 # Minimum freq in mel basis calculation.
fmax: 11025 # Maximum frequency in mel basis calculation.
pitch_type: frame
hidden_size: 256
mel_loss: "ssim:0.5|l1:0.5"
lambda_f0: 0.0
lambda_uv: 0.0
lambda_energy: 0.0
lambda_ph_dur: 0.0
lambda_sent_dur: 0.0
lambda_word_dur: 0.0
predictor_grad: 0.0
use_spk_embed: true
use_spk_id: false
max_tokens: 20000
max_updates: 400000
num_spk: 100
save_f0: true
use_gt_dur: true
use_gt_f0: true
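# ---- configs/singing/fs2.yaml (file name inferred) ----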
base_config:
- configs/tts/fs2.yaml
- configs/singing/base.yaml
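The three lines above are a complete config file: it only chains its bases, inheriting configs/tts/fs2.yaml and configs/singing/base.yaml. A rough sketch of how such cascading `base_config` includes could be resolved, assuming PyYAML (illustrative only; the repository's actual hparams loader may differ):

```python
import yaml

def load_config(path, visited=None):
    # Recursively merge a config file onto its base_config includes.
    visited = set() if visited is None else visited
    if path in visited:                # guard against circular includes
        return {}
    visited.add(path)
    with open(path) as f:
        cfg = yaml.safe_load(f) or {}
    bases = cfg.pop('base_config', [])
    bases = [bases] if isinstance(bases, str) else bases
    merged = {}
    for base in bases:                 # later bases override earlier ones
        merged.update(load_config(base, visited))
    merged.update(cfg)                 # this file's own keys win
    return merged

# e.g. load_config('configs/singing/fs2.yaml') pulls in configs/tts/fs2.yaml,
# then configs/singing/base.yaml, then applies this file's own overrides.
```

# ---- configs/tts/base.yaml (file name inferred from base_config cross-references) ----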
# task
base_config: configs/config_base.yaml
task_cls: ''
#############
# dataset
#############
raw_data_dir: ''
processed_data_dir: ''
binary_data_dir: ''
dict_dir: ''
pre_align_cls: ''
binarizer_cls: data_gen.tts.base_binarizer.BaseBinarizer
pre_align_args:
use_tone: true # for ZH
forced_align: mfa
use_sox: false
txt_processor: en
allow_no_txt: false
denoise: false
binarization_args:
shuffle: false
with_txt: true
with_wav: false
with_align: true
with_spk_embed: true
with_f0: true
with_f0cwt: true
loud_norm: false
endless_ds: true
reset_phone_dict: true
test_num: 100
valid_num: 100
max_frames: 1550
max_input_tokens: 1550
audio_num_mel_bins: 80
audio_sample_rate: 22050
hop_size: 256 # For 22050Hz, 256 samples ≈ 11.6 ms (12.5 ms would be 0.0125 * 22050 ≈ 275).
win_size: 1024 # For 22050Hz, 1024 samples ≈ 46.4 ms (50 ms would be 0.05 * 22050 ≈ 1100). If None, win_size = fft_size.
fmin: 80 # Set this to 55 if your speaker is male; for female voices, 95 should help remove noise. (Tune per dataset. Pitch ranges: male ~[65, 260] Hz, female ~[100, 525] Hz.)
fmax: 7600 # To be increased/reduced depending on data.
fft_size: 1024 # Extra window size is filled with 0 paddings to match this parameter
min_level_db: -100
num_spk: 1
mel_vmin: -6
mel_vmax: 1.5
ds_workers: 4
#########
# model
#########
dropout: 0.1
enc_layers: 4
dec_layers: 4
hidden_size: 384
num_heads: 2
prenet_dropout: 0.5
prenet_hidden_size: 256
stop_token_weight: 5.0
enc_ffn_kernel_size: 9
dec_ffn_kernel_size: 9
ffn_act: gelu
ffn_padding: 'SAME'
###########
# optimization
###########
lr: 2.0
warmup_updates: 8000
optimizer_adam_beta1: 0.9
optimizer_adam_beta2: 0.98
weight_decay: 0
clip_grad_norm: 1
###########
# train and eval
###########
max_tokens: 30000
max_sentences: 100000
max_eval_sentences: 1
max_eval_tokens: 60000
train_set_name: 'train'
valid_set_name: 'valid'
test_set_name: 'test'
vocoder: pwg
vocoder_ckpt: ''
profile_infer: false
out_wav_norm: false
save_gt: false
save_f0: false
gen_dir_name: ''
use_denoise: false
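# ---- configs/tts/base_zh.yaml (file name inferred from base_config cross-references) ----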
pre_align_args:
txt_processor: zh_g2pM
binarizer_cls: data_gen.tts.binarizer_zh.ZhBinarizer
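# ---- configs/tts/fs2.yaml (file name inferred from base_config cross-references) ----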
base_config: configs/tts/base.yaml
task_cls: tasks.tts.fs2.FastSpeech2Task
# model
hidden_size: 256
dropout: 0.1
encoder_type: fft # fft|tacotron|tacotron2|conformer
encoder_K: 8 # for tacotron encoder
decoder_type: fft # fft|rnn|conv|conformer
use_pos_embed: true
# duration
predictor_hidden: -1
predictor_kernel: 5
predictor_layers: 2
dur_predictor_kernel: 3
dur_predictor_layers: 2
predictor_dropout: 0.5
# pitch and energy
use_pitch_embed: true
pitch_type: ph # frame|ph|cwt
use_uv: true
cwt_hidden_size: 128
cwt_layers: 2
cwt_loss: l1
cwt_add_f0_loss: false
cwt_std_scale: 0.8
pitch_ar: false
#pitch_embed_type: 0q
pitch_loss: 'l1' # l1|l2|ssim
pitch_norm: log
use_energy_embed: false
# reference encoder and speaker embedding
use_spk_id: false
use_split_spk_id: false
use_spk_embed: false
use_var_enc: false
lambda_commit: 0.25
ref_norm_layer: bn
pitch_enc_hidden_stride_kernel:
- 0,2,5 # conv_hidden_size, conv_stride, conv_kernel_size. conv_hidden_size=0: use hidden_size
- 0,2,5
- 0,2,5
dur_enc_hidden_stride_kernel:
- 0,2,3 # conv_hidden_size, conv_stride, conv_kernel_size. conv_hidden_size=0: use hidden_size
- 0,2,3
- 0,1,3
# mel
mel_loss: l1:0.5|ssim:0.5 # l1|l2|gdl|ssim, or a weighted combination like l1:0.5|ssim:0.5 (see the parsing sketch at the end of this file)
# loss lambda
lambda_f0: 1.0
lambda_uv: 1.0
lambda_energy: 0.1
lambda_ph_dur: 1.0
lambda_sent_dur: 1.0
lambda_word_dur: 1.0
predictor_grad: 0.1
# train and eval
pretrain_fs_ckpt: ''
warmup_updates: 2000
max_tokens: 32000
max_sentences: 100000
max_eval_sentences: 1
max_updates: 120000
num_valid_plots: 5
num_test_samples: 0
test_ids: []
use_gt_dur: false
use_gt_f0: false
# exp
dur_loss: mse # mse|huber|mol
norm_type: gn
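As a side note on the `mel_loss` spec above: the string encodes one or more weighted loss terms. A hypothetical parser showing the assumed `name:weight|name:weight` format (not the repository's actual code):

```python
def parse_mel_loss(spec: str) -> dict:
    # "l1:0.5|ssim:0.5" -> {"l1": 0.5, "ssim": 0.5}; "l1" -> {"l1": 1.0}
    losses = {}
    for term in spec.split('|'):
        name, _, weight = term.partition(':')
        losses[name] = float(weight) if weight else 1.0
    return losses

assert parse_mel_loss('l1:0.5|ssim:0.5') == {'l1': 0.5, 'ssim': 0.5}
assert parse_mel_loss('l1') == {'l1': 1.0}
```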
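# ---- configs/tts/hifigan.yaml (file name inferred from base_config cross-references) ----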
base_config: configs/tts/pwg.yaml
task_cls: tasks.vocoder.hifigan.HifiGanTask
resblock: "1"
adam_b1: 0.8
adam_b2: 0.99
upsample_rates: [ 8,8,2,2 ] # Product must equal hop_size (8*8*2*2 = 256).
upsample_kernel_sizes: [ 16,16,4,4 ]
upsample_initial_channel: 128
resblock_kernel_sizes: [ 3,7,11 ]
resblock_dilation_sizes: [ [ 1,3,5 ], [ 1,3,5 ], [ 1,3,5 ] ]
lambda_mel: 45.0
max_samples: 8192
max_sentences: 16
generator_params:
lr: 0.0002 # Generator's learning rate.
aux_context_window: 0 # Context window size for auxiliary feature.
discriminator_optimizer_params:
lr: 0.0002 # Discriminator's learning rate.
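# ---- configs/tts/lj/base_mel2wav.yaml (file name inferred from base_config cross-references) ----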
raw_data_dir: 'data/raw/LJSpeech-1.1'
processed_data_dir: 'data/processed/ljspeech'
binary_data_dir: 'data/binary/ljspeech_wav'
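# ---- configs/tts/lj/base_text2mel.yaml (file name inferred from base_config cross-references) ----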
raw_data_dir: 'data/raw/LJSpeech-1.1'
processed_data_dir: 'data/processed/ljspeech'
binary_data_dir: 'data/binary/ljspeech'
pre_align_cls: data_gen.tts.lj.pre_align.LJPreAlign
pitch_type: cwt
mel_loss: l1
num_test_samples: 20
test_ids: [ 68, 70, 74, 87, 110, 172, 190, 215, 231, 294,
316, 324, 402, 422, 485, 500, 505, 508, 509, 519 ]
use_energy_embed: false
test_num: 523
valid_num: 348
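# ---- configs/tts/lj/fs2.yaml (file name inferred) ----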
base_config:
- configs/tts/fs2.yaml
- configs/tts/lj/base_text2mel.yaml
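# ---- configs/tts/lj/hifigan.yaml (file name inferred) ----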
base_config:
- configs/tts/hifigan.yaml
- configs/tts/lj/base_mel2wav.yaml
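# ---- configs/tts/lj/pwg.yaml (file name inferred) ----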
base_config:
- configs/tts/pwg.yaml
- configs/tts/lj/base_mel2wav.yaml
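# ---- configs/tts/pwg.yaml (file name inferred from base_config cross-references) ----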
base_config: configs/tts/base.yaml
task_cls: tasks.vocoder.pwg.PwgTask
binarization_args:
with_wav: true
with_spk_embed: false
with_align: false
test_input_dir: ''
###########
# train and eval
###########
max_samples: 25600
max_sentences: 5
max_eval_sentences: 1
max_updates: 1000000
val_check_interval: 2000
###########################################################
# FEATURE EXTRACTION SETTING #
###########################################################
sampling_rate: 22050 # Sampling rate.
fft_size: 1024 # FFT size.
hop_size: 256 # Hop size.
win_length: null # Window length.
# If set to null, it will be the same as fft_size.
window: "hann" # Window function.
num_mels: 80 # Number of mel basis.
fmin: 80 # Minimum freq in mel basis calculation.
fmax: 7600 # Maximum frequency in mel basis calculation.
format: "hdf5" # Feature file format. "npy" or "hdf5" is supported.
###########################################################
# GENERATOR NETWORK ARCHITECTURE SETTING #
###########################################################
generator_params:
in_channels: 1 # Number of input channels.
out_channels: 1 # Number of output channels.
kernel_size: 3 # Kernel size of dilated convolution.
layers: 30 # Number of residual block layers.
stacks: 3 # Number of stacks i.e., dilation cycles.
residual_channels: 64 # Number of channels in residual conv.
gate_channels: 128 # Number of channels in gated conv.
skip_channels: 64 # Number of channels in skip conv.
aux_channels: 80 # Number of channels for auxiliary feature conv.
# Must be the same as num_mels.
aux_context_window: 2 # Context window size for auxiliary feature.
# If set to 2, previous 2 and future 2 frames will be considered.
dropout: 0.0 # Dropout rate. 0.0 means no dropout applied.
use_weight_norm: true # Whether to use weight norm.
# If set to true, it will be applied to all of the conv layers.
upsample_net: "ConvInUpsampleNetwork" # Upsampling network architecture.
upsample_params: # Upsampling network parameters.
upsample_scales: [4, 4, 4, 4] # Upsampling scales. Product of these must equal the hop size.
use_pitch_embed: false
###########################################################
# DISCRIMINATOR NETWORK ARCHITECTURE SETTING #
###########################################################
discriminator_params:
in_channels: 1 # Number of input channels.
out_channels: 1 # Number of output channels.
kernel_size: 3 # Kernel size of conv layers.
layers: 10 # Number of conv layers.
conv_channels: 64 # Number of channels in conv layers.
bias: true # Whether to use bias parameter in conv.
use_weight_norm: true # Whether to use weight norm.
# If set to true, it will be applied to all of the conv layers.
nonlinear_activation: "LeakyReLU" # Nonlinear function after each conv.
nonlinear_activation_params: # Nonlinear function parameters
negative_slope: 0.2 # Alpha in LeakyReLU.
###########################################################
# STFT LOSS SETTING #
###########################################################
stft_loss_params:
fft_sizes: [1024, 2048, 512] # List of FFT sizes for STFT-based loss.
hop_sizes: [120, 240, 50] # List of hop sizes for STFT-based loss.
win_lengths: [600, 1200, 240] # List of window lengths for STFT-based loss.
window: "hann_window" # Window function for STFT-based loss (see the sketch after this file).
use_mel_loss: false
###########################################################
# ADVERSARIAL LOSS SETTING #
###########################################################
lambda_adv: 4.0 # Loss balancing coefficient.
###########################################################
# OPTIMIZER & SCHEDULER SETTING #
###########################################################
generator_optimizer_params:
lr: 0.0001 # Generator's learning rate.
eps: 1.0e-6 # Generator's epsilon.
weight_decay: 0.0 # Generator's weight decay coefficient.
generator_scheduler_params:
step_size: 200000 # Generator's scheduler step size.
gamma: 0.5 # Generator's scheduler gamma.
# At each step size, lr will be multiplied by this parameter.
generator_grad_norm: 10 # Generator's gradient norm.
discriminator_optimizer_params:
lr: 0.00005 # Discriminator's learning rate.
eps: 1.0e-6 # Discriminator's epsilon.
weight_decay: 0.0 # Discriminator's weight decay coefficient.
discriminator_scheduler_params:
step_size: 200000 # Discriminator's scheduler step size.
gamma: 0.5 # Discriminator's scheduler gamma.
# At each step size, lr will be multiplied by this parameter.
discriminator_grad_norm: 1 # Discriminator's gradient norm.
disc_start_steps: 40000 # Number of steps to start to train discriminator.
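The `stft_loss_params` block above configures a multi-resolution STFT loss in the ParallelWaveGAN style: at each FFT resolution, a spectral-convergence term plus an L1 log-magnitude term. A hedged PyTorch sketch under those assumptions (an illustration, not this repository's exact implementation):

```python
import torch
import torch.nn.functional as F

def stft_mag(x, fft_size, hop_size, win_length):
    # Magnitude spectrogram, clamped to avoid log(0).
    window = torch.hann_window(win_length, device=x.device)
    spec = torch.stft(x, fft_size, hop_size, win_length, window,
                      return_complex=True)
    return spec.abs().clamp(min=1e-7)

def multi_res_stft_loss(pred, target,
                        fft_sizes=(1024, 2048, 512),
                        hop_sizes=(120, 240, 50),
                        win_lengths=(600, 1200, 240)):
    loss = 0.0
    for n_fft, hop, win in zip(fft_sizes, hop_sizes, win_lengths):
        p = stft_mag(pred, n_fft, hop, win)
        t = stft_mag(target, n_fft, hop, win)
        sc = torch.norm(t - p, p='fro') / torch.norm(t, p='fro')  # spectral convergence
        mag = F.l1_loss(torch.log(p), torch.log(t))               # log STFT magnitude
        loss = loss + sc + mag
    return loss / len(fft_sizes)
```

# ---- phoneme dictionary (ARPAbet phone set, identity mapping; the original file name is not shown in this diff) ----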
! !
, ,
. .
; ;
<BOS> <BOS>
<EOS> <EOS>
? ?
AA0 AA0
AA1 AA1
AA2 AA2
AE0 AE0
AE1 AE1
AE2 AE2
AH0 AH0
AH1 AH1
AH2 AH2
AO0 AO0
AO1 AO1
AO2 AO2
AW0 AW0
AW1 AW1
AW2 AW2
AY0 AY0
AY1 AY1
AY2 AY2
B B
CH CH
D D
DH DH
EH0 EH0
EH1 EH1
EH2 EH2
ER0 ER0
ER1 ER1
ER2 ER2
EY0 EY0
EY1 EY1
EY2 EY2
F F
G G
HH HH
IH0 IH0
IH1 IH1
IH2 IH2
IY0 IY0
IY1 IY1
IY2 IY2
JH JH
K K
L L
M M
N N
NG NG
OW0 OW0
OW1 OW1
OW2 OW2
OY0 OY0
OY1 OY1
OY2 OY2
P P
R R
S S
SH SH
T T
TH TH
UH0 UH0
UH1 UH1
UH2 UH2
UW0 UW0
UW1 UW1
UW2 UW2
V V
W W
Y Y
Z Z
ZH ZH
| |