オムライスの備忘録 (Omurice's Memorandum)

Notes on mathematics, statistics, machine learning, and programming.

[Speech Analysis] Algorithms #Summary

Algorithms in Speech Analysis

This post lists deep-learning-based algorithms used in speech analysis.

DNN

WaveNet / 2016

  • WaveNet: A Generative Model for Raw Audio

RNN

Deep Speech / 2014

  • Deep Speech: Scaling up end-to-end speech recognition

CNN

Wav2letter / 2016

  • Wav2Letter: an End-to-End ConvNet-based Speech Recognition System

wav2vec / 2019

  • wav2vec: Unsupervised Pre-training for Speech Recognition

wav2vec 2.0 / 2020

  • wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations

wav2vec-U / 2021

  • Unsupervised Speech Recognition

Attention

ESPnet / 2018

The name covers both the algorithms and their implementation.

  • ESPnet: End-to-End Speech Processing Toolkit

ReazonSpeech / 2023

A Japanese speech recognition model built by training ESPnet on the project's own corpus.

The name covers both the corpus and the model.

Whisper / 2022

WhisperX / 2023

  • WhisperX: Time-Accurate Speech Transcription of Long-Form Audio

Squeezeformer / 2022

  • Squeezeformer: An Efficient Transformer for Automatic Speech Recognition

BERT-CTC / 2022

  • BERT Meets CTC: New Formulation of End-to-End Speech Recognition with Pre-trained Masked Language Model

BECTRA / 2022

  • BECTRA: Transducer-based End-to-End ASR with BERT-Enhanced Encoder

ACE-VC / 2023

  • ACE-VC: Adaptive and Controllable Voice Conversion using Explicitly Disentangled Self-supervised Speech Representations

JEIT / 2023

  • JEIT: Joint End-to-End Model and Internal Language Model Training for Speech Recognition

Google USM / 2023

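A speech recognition model announced by Google, reported to exceed Whisper in accuracy.

Trained on 10 million hours of unlabeled speech covering more than 300 languages, plus 210,000 hours of labeled speech.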

AVFormer / 2023

  • AVFormer: Injecting Vision into Frozen Speech Models for Zero-Shot AV-ASR

Tricks and Techniques

Diffusion Model

UML / 2023

  • UML: A Universal Monolingual Output Layer for Multilingual ASR

GAN

Wave-U-Net Discriminator / 2023

  • Wave-U-Net Discriminator: Fast and Lightweight Discriminator for Generative Adversarial Network-Based Speech Synthesis

Implementations and Tools

References