
TEXT-INDEPENDENT SPEAKER VERIFICATION WITH ADVERSARIAL LEARNING ON SHORT UTTERANCES

Kai Liu, Huan Zhou
Artificial Intelligence Application Research Center, Huawei Technologies, Shenzhen, PRC

ABSTRACT
A text-independent speaker verification system suffers severe performance degradation under short utterance conditions. To address the problem, in this paper we propose an adversarially learned embedding mapping model that directly maps a short embedding to an enhanced embedding with increased discriminability. In particular, a Wasserstein GAN with a bunch of loss criteria is investigated. These loss functions have distinct optimization objectives, and some of them are less favoured in the speaker verification research area. Different from most prior studies, our main objective in this study is to investigate the effectiveness of those loss criteria by conducting numerous ablation studies. Experiments on the Voxceleb dataset showed that some criteria are beneficial to the verification performance while some have trivial effects. Lastly, a Wasserstein GAN with chosen loss criteria, without fine-tuning, achieves meaningful advancements over the baseline, with 4% relative improvement in EER and 7% in minDCF in the challenging scenario of short 2-second utterances.

  1. INTRODUCTION
Text-independent Speaker Verification (SV) aims to automatically verify the identity of a speaker, given an enrolled speaker record and some test speech signal (with no special constraint on phonetic content). The most important step in the SV pipeline is to map speech of arbitrary duration into a speaker representation of fixed dimension. It is desirable for such a speaker representation to be compact, discriminative, and robust to extrinsic and intrinsic variations.
Several types of speaker representations have been developed over the past decades. The well-known i-vector [3] has been the state-of-the-art speaker representation, usually associated with a simple cosine-scoring strategy or the more powerful probabilistic linear discriminant analysis (PLDA) [12, 4] as the verifier. With the advent of deep neural networks (DNNs), a variety of DNN frameworks and loss functions have been developed to learn deep speaker representations, known as embeddings. By training these networks with either the cross-entropy loss or some form of contrastive loss on a large amount of data, the resulting embeddings are speaker-discriminative. Compared to the i-vector, embeddings such as the x-vector [2] and the GhostVLAD-aggregated embedding [18] (or G-vector for short) are promising, demonstrating competitive performance for long speech and a distinct advantage for short speech. Furthermore, the recently developed G-vector shows considerable gains over the x-vector under noisy test conditions, which makes it more favourable for a practical SV system.


However, the performance of an SV system usually degrades in real scenarios, due to prevalent mismatches between development and test conditions, such as channel, domain or duration mismatch [11, 5, 18]. For instance, it has been observed [5] that on the NIST-SRE 2010 test set (female part), the EER of an i-vector/PLDA system degrades from 2.48% to 24.78% when the verification trial is shortened from full duration to 5 seconds.
Numerous research studies have been proposed to mitigate the short-duration effect. An early family of research aimed to modify different aspects of the i-vector based SV system, e.g., feature extraction techniques, intermediate parameter estimation, speaker model generation, and score normalization techniques, as summarized in [11]. Recently, more novel deep learning technologies have been explored. For instance, insufficient phonetic information is compensated by a teacher-student learning framework [17], and the scoring scheme is calibrated by transfer learning [13]. Another research strategy is to design duration-robust speaker embeddings to deal with utterances of arbitrary duration. By applying different neural network architectures and alternative loss functions, the discriminability of embeddings is further enhanced. For example, an Inception Net with triplet loss is deployed in [20], an Inception-ResNet with joint softmax and center loss in [8], and a ResCNN with a novel speaker identity subspace loss in [14].


Generative Adversarial Networks (GANs) [6] are one of the most popular deep learning algorithms developed recently. GANs have the potential to generate realistic instances and provide a solution to problems that require a generative solution, most notably in various image-to-image translation tasks.


In this study, we aim to investigate the short-duration issue present in a practical SV system. Contrary to most of the techniques mentioned above, our proposed approach works directly on the speaker embeddings. In particular, given short and long embedding pairs extracted from the same speaker and session, we propose to use adversarial learning with a Wasserstein GAN to learn a new embedding with enhanced discriminability. To test our approach, the G-vector is chosen as the embedding benchmark in our experiments due to its promising performance on short speech. This poses a greater challenge to our study than to prior studies that benchmarked against i-vectors.
The remainder of this paper is organized as follows: Section 2 briefly introduces the works related to our method. Section 3 details our proposed Wasserstein GAN based approach. Section 4 presents experimental results and discussions. Finally, our conclusions are given in Section 5.

  2. RELATED WORKS

2.1. Wasserstein GAN

GANs [6] are deep generative models comprised of two networks, a generator and a discriminator. The discriminator D tries to learn the difference between a real sample y and a fake sample g generated from noise η, while the generator G tries to fool the discriminator. That is, the following minimax loss function is optimized through alternating optimization until equilibrium is reached:

$$\min_{G}\max_{D}\; \mathbb{E}_{y\sim P_y}[\log D(y)] + \mathbb{E}_{\eta}[\log(1 - D(G(\eta)))] \tag{1}$$
However, training a GAN model is difficult due to the well-known vanishing or exploding gradients issue. This issue has been addressed by the Wasserstein GAN (WGAN) [1], where the discriminator is designed to find a good f_w and a new loss function is configured to measure the Wasserstein distance:

$$W(P_y, P_g) = \max_{w\in\mathcal{W}}\; \mathbb{E}_{y\sim P_y}[f_w(y)] - \mathbb{E}_{g\sim P_g}[f_w(g)] \tag{2}$$
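Below is a minimal PyTorch sketch, ours rather than the paper's code, of a critic realizing Eq. (2); the layer sizes are assumptions, and the Lipschitz constraint on f_w is enforced separately by weight clipping (see Section 4.1).

```python
import torch.nn as nn

# A minimal critic sketch for Eq. (2): f_w outputs an unbounded score, and
# the gap between its mean score on real samples y and fake samples g
# approximates the Wasserstein distance. Layer sizes are illustrative.
class Critic(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, 256), nn.LeakyReLU(0.2),
                                 nn.Linear(256, 1))   # no sigmoid: raw score

    def forward(self, v):
        return self.net(v)

def critic_loss(f_w, y, g):
    # Minimizing this maximizes E[f_w(y)] - E[f_w(g)] over the critic.
    return -(f_w(y).mean() - f_w(g.detach()).mean())
```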
2.2. Deployments of GANs in SV

Motivated by their remarkable success in image-to-image translation, GANs have been actively deployed in the SV research community, mainly to handle domain-mismatch issues, e.g. by transforming i-vectors [15] and x-vectors [19]. In contrast, there are few works that use GANs to handle the short-duration issue. To the authors' best knowledge, the only published work proposes compensating the i-vectors via a conditional GAN [7]. However, only limited performance improvements were observed: the proposed system alone failed to outperform the baseline system, and only a score-wise fusion based system showed better performance than the baseline.


In the authors' opinion, training a GAN is non-trivial, and the reason behind such results might be an oversight of the effects of the loss functions of the conditional GAN. As such, in this study we investigate the problem and seek to reveal some guidelines on choosing beneficial loss functions to make the model perform better.

  3. PROPOSED APPROACH

The architecture of our proposed approach is illustrated in Fig. 1. Here x and y are D-dimensional G-vectors corresponding to the short and long utterance embeddings from the same speaker session, and z is the speaker identity label. Given x, y and z, the proposed system is trained to learn a D-dimensional embedding g, with the expectation that the g-based SV system can outperform the one based on x.

[Fig. 1: Architecture of the proposed system, comprising the embedding generator G_f, speaker label predictor G_c, distance calculator G_d and Wasserstein discriminator D_w.]
Overall, the proposed architecture can be decomposed into four core components: an embedding generator G_f, a speaker label predictor G_c, a distance calculator G_d and a Wasserstein discriminator D_w. All components are jointly trained in order to generate enhanced embeddings with carefully handcrafted optimization objectives, as described below.
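To make the components concrete, here is a hypothetical PyTorch skeleton; the layer sizes and the embedding dimension D = 512 are our assumptions (the paper does not state them), while the speaker count comes from the training setup in Section 4.1.

```python
import torch.nn as nn

# Hypothetical skeleton of the components in Fig. 1 (sizes are assumptions).
D, NUM_SPK = 512, 1057   # assumed G-vector dimension; 1,057 training speakers

G_f = nn.Sequential(nn.Linear(D, 512), nn.ReLU(),
                    nn.Linear(512, D))          # embedding generator
G_c = nn.Linear(D, NUM_SPK)                     # speaker label predictor
D_w = nn.Sequential(nn.Linear(2 * D, 256), nn.LeakyReLU(0.2),
                    nn.Linear(256, 1))          # critic on concatenated [x, v]
# G_d, the distance calculator, is parameter-free (cosine distance, Sec. 3.2).
```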

3.1. Proposed Discriminator-Related Loss Functions

As aforementioned, the primary task of the proposed approach is to learn embeddings with enhanced discriminability. Letting P denote a data distribution, we propose to achieve this by mapping P_g from the initial P_x to the target P_y via adversarial learning of the WGAN. To this end, several loss criteria with different optimization objectives are investigated in the discriminative model.
Following the conventional definition of the min-max function, the loss function of the WGAN is:

$$\min_{G_f}\max_{w\in\mathcal{W}}\; \mathbb{E}_{y\sim P_y}[f_w(y)] - \mathbb{E}_{x\sim P_x}[f_w(G_f(x))] \tag{3}$$
Inspired by the idea of the conditional GAN [10], in this study we investigate a novel loss function that optimizes the Wasserstein distance between joint data distributions. That is, the data to be discriminated is controlled by concatenating the short embedding x with the conventional discriminator input. The corresponding min-max function is updated as:

$$\min_{G_f}\max_{w\in\mathcal{W}}\; \mathbb{E}_{(x,y)\sim P_{xy}}[f_w(x,y)] - \mathbb{E}_{x\sim P_x}[f_w(x, G_f(x))] \tag{4}$$
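A sketch of this conditional critic loss, reusing the D_w skeleton above (our construction, not the paper's code):

```python
import torch

# Eq. (4) in code: the short embedding x is concatenated with the critic
# input, so the joint samples (x, y) and (x, g) are discriminated.
def conditional_critic_loss(D_w, x, y, g):
    real = D_w(torch.cat([x, y], dim=1)).mean()
    fake = D_w(torch.cat([x, g.detach()], dim=1)).mean()
    return -(real - fake)   # minimized by the critic optimizer
```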
In addition, to seek more discriminability, the Fréchet Inception Distance (FID) [9], a popular metric for calculating the distance between feature vectors of real and generated images, is also explored herein. Assuming P_y and P_g are normal distributions with means μ_y, μ_g and covariance matrices C_y, C_g, the FID loss can be calculated as:

$$L_{FID} = \lVert \mu_y - \mu_g \rVert^2 + \mathrm{Tr}\!\left(C_y + C_g - 2\,(C_y C_g)^{1/2}\right) \tag{5}$$
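A NumPy sketch of Eq. (5) follows; this version is for illustration and is not differentiable, so using FID as a training loss would require a differentiable matrix square root.

```python
import numpy as np
from scipy.linalg import sqrtm

# FID between target embeddings y and generated embeddings g, computed from
# their Gaussian statistics as in Eq. (5). y, g: (num_samples, D) arrays.
def fid(y, g):
    mu_y, mu_g = y.mean(axis=0), g.mean(axis=0)
    C_y = np.cov(y, rowvar=False)
    C_g = np.cov(g, rowvar=False)
    covmean = sqrtm(C_y @ C_g)
    if np.iscomplexobj(covmean):       # sqrtm can return tiny imaginary
        covmean = covmean.real         # parts from numerical noise
    return np.sum((mu_y - mu_g) ** 2) + np.trace(C_y + C_g - 2.0 * covmean)
```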
3.2. Proposed Generator-Related Loss Functions

In order to guide the GAN training toward feature discriminability, four loss criteria are investigated herein as extra training guides.
To verify the speaker label, the widely adopted multiclass cross-entropy (CE) loss is investigated, formulated as:

$$L_{CE} = -\frac{1}{N}\sum_{i=1}^{N} \log \frac{e^{W_{z_i}^{\top} g_i + b_{z_i}}}{\sum_{j=1}^{c} e^{W_j^{\top} g_i + b_j}} \tag{6}$$
where N is the batch size and c is the number of classes; g_i denotes the i-th generated embedding sample and z_i the corresponding label index; W ∈ R^{D×c} and b ∈ R^c denote the weight matrix and bias of the projection layer.
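Assuming the projection layer is realized by the label predictor G_c sketched earlier, Eq. (6) reduces to the built-in cross-entropy:

```python
import torch.nn.functional as F

# Eq. (6) in code: the projection (W, b) is the speaker label predictor G_c
# above; F.cross_entropy fuses log-softmax with the negative log-likelihood
# and averages over the batch of size N.
def ce_loss(g, z):
    return F.cross_entropy(G_c(g), z)   # z: LongTensor of speaker indices
```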
To explicitly penalize the class-related classification error, a triplet loss is deployed as well, where a baseline (anchor) input is compared to a positive (same-class) input and a negative (different-class) input. Let Γ be the set of all possible embedding triplets γ = (g_a, g_p, g_n) in the training set; the loss is defined as:

$$L_{triplet} = \sum_{\gamma\in\Gamma} \max\!\left(\lVert g_a - g_p \rVert_2^2 - \lVert g_a - g_n \rVert_2^2 + \Psi,\; 0\right) \tag{7}$$
where g_a is an anchor input, g_p is a positive input from the same class, g_n is a negative input from a different class, and Ψ ∈ R+ is the safety margin between positive and negative pairs.
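A direct sketch of Eq. (7) for a batch of triplets (the margin value 0.3 is our assumption):

```python
import torch.nn.functional as F

# Eq. (7): pull the anchor toward the positive and push it from the negative,
# counting only violations within the safety margin psi.
def triplet_loss(g_a, g_p, g_n, psi=0.3):
    d_pos = (g_a - g_p).pow(2).sum(dim=1)   # squared distance to positive
    d_neg = (g_a - g_n).pow(2).sum(dim=1)   # squared distance to negative
    return F.relu(d_pos - d_neg + psi).sum()
```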
Apart from the above, to minimize intra-class variation, the center loss [16] is also adopted. It can be formulated as:

$$L_{center} = \frac{1}{2}\sum_{i=1}^{m} \lVert g_i - c_{y_i} \rVert_2^2 \tag{8}$$
where c_{y_i} denotes the class center of the deep features belonging to the y_i-th class, g_i is the i-th deep feature, and m is the size of the mini-batch.
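One common way to realize Eq. (8), with learnable per-class centers, is sketched below (our construction):

```python
import torch
import torch.nn as nn

# Eq. (8): one learnable center per class; features are pulled toward the
# center of their own class, shrinking intra-class variance.
class CenterLoss(nn.Module):
    def __init__(self, num_classes, dim):
        super().__init__()
        self.centers = nn.Parameter(torch.randn(num_classes, dim))

    def forward(self, g, labels):
        return 0.5 * (g - self.centers[labels]).pow(2).sum()
```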
To better guide the training process, the similarity between the enhanced embedding and its target is explicitly considered. It is measured by the cosine distance and evaluated via a dot product as follows:

$$L_{cos} = 1 - \bar{g}^{\top}\bar{y} \tag{9}$$
where ḡ and ȳ are the normalized versions of the embeddings g and y, respectively.
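In code (a sketch; averaging over the batch is our assumption):

```python
import torch.nn.functional as F

# Eq. (9): L2-normalize g and y so their dot product is the cosine
# similarity between the enhanced embedding and its long-utterance target.
def cosine_loss(g, y):
    g_bar = F.normalize(g, dim=1)
    y_bar = F.normalize(y, dim=1)
    return (1.0 - (g_bar * y_bar).sum(dim=1)).mean()
```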
In all, we propose to train the generator G_f with the total loss defined as:

$$L_{G_f} = L_{adv}^{G} + \alpha L_{CE} + \beta L_{triplet} + \gamma L_{center} + \varepsilon L_{cos} \tag{10}$$
and the discriminator D_w with:

$$L_{D_w} = L_{adv}^{D} + \lambda L_{FID} \tag{11}$$

where L_adv^G and L_adv^D denote the generator-side and discriminator-side terms of the adversarial objective in Eq. (4), and α, β, γ, ε, λ are the loss weights.
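As a sketch of how Eqs. (10) and (11) might be assembled from the pieces above; the mapping of the weights α, β, γ, ε to the generator terms and λ to the FID term follows the hyper-parameter list in Section 4.3 and is our assumption, not released code.

```python
import torch

# Hypothetical assembly of the total objectives, reusing the sketches above.
center_loss = CenterLoss(NUM_SPK, D)

def generator_total(x, y, z, g_a, g_p, g_n,
                    alpha=1.0, beta=1.0, gamma=1.0, eps=1.0):
    # (g_a, g_p, g_n): a triplet drawn from enhanced/target embeddings.
    g = G_f(x)
    l_adv = -D_w(torch.cat([x, g], dim=1)).mean()   # fool the critic
    return (l_adv + alpha * ce_loss(g, z) + beta * triplet_loss(g_a, g_p, g_n)
            + gamma * center_loss(g, z) + eps * cosine_loss(g, y))

def discriminator_total(x, y, lam=1.0):
    g = G_f(x).detach()
    # A differentiable FID variant would be added here, weighted by lam.
    return conditional_critic_loss(D_w, x, y, g)
```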
After the training of the WGAN, the generative model G_f is retained. At the SV test stage, the short embedding x of any given short test utterance can easily be mapped to its enhanced version g by directly applying the feed-forward model G_f to x.
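In code, the test-stage mapping is a single forward pass (x_test is a hypothetical batch of short embeddings):

```python
import torch

# Test stage: the retained generator maps short embeddings to enhanced ones.
G_f.eval()
with torch.no_grad():
    g = G_f(x_test)   # x_test: (batch, D) tensor of short G-vectors
```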

  4. EXPERIMENTS AND RESULTS

This section details our experimental setup and the results of our investigation into the effectiveness of the proposed loss criteria.

4.1. Experimental Setup

We use a subset of Voxceleb2 to train our proposed system, comprising 1,057 speakers with a total of 164,716 utterances. Those utterances are randomly cut to 2 seconds to serve as short utterances. Similarly, a subset of Voxceleb1 with 40 speakers is sampled, and a total of 13,265 utterance pairs are used for testing.
A VGG-ResNet34s network is used to extract G-vectors as our baseline system. Regarding the GAN training, the learning rate for both G_f and D_w is 0.0001; Adam optimization is adopted; weight clipping is employed for D_w with thresholds from -0.01 to 0.01; and the batch size is set to 128.
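Putting the stated settings together, a hypothetical training loop might look as follows; the data loader, the triplet sampling, and the five critic updates per generator update (the WGAN default) are our assumptions.

```python
import torch

# Training-loop sketch: Adam at lr 1e-4 for both networks, weight clipping
# of D_w to [-0.01, 0.01], batch size 128, reusing the loss sketches above.
opt_G = torch.optim.Adam(G_f.parameters(), lr=1e-4)
opt_D = torch.optim.Adam(D_w.parameters(), lr=1e-4)

for x, y, z, (g_a, g_p, g_n) in loader:     # batches of 128 embedding pairs
    for _ in range(5):                      # critic updates per generator step
        opt_D.zero_grad()
        discriminator_total(x, y).backward()
        opt_D.step()
        for p in D_w.parameters():          # weight clipping keeps D_w
            p.data.clamp_(-0.01, 0.01)      # approximately Lipschitz
    opt_G.zero_grad()
    generator_total(x, y, z, g_a, g_p, g_n).backward()
    opt_G.step()
```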

4.2. Ablation Studies on Various Loss Functions
To verify the importance of the proposed loss criteria, a bunch of ablation studies are conducted by choosing different combinations of them. The overall results are listed in Tab. 1, where Lc and Lt denote L_center and L_triplet, respectively. Triplet (a) means that triplet inputs are sampled from both y and g, and (b) means that they are sampled from g only.

[Table 1: Ablation results for systems v1-v8 trained with different loss combinations.]
In our study, a total of 8 systems (v1-v8), combining different loss criteria with the Wasserstein GAN, are evaluated. Their corresponding detection error trade-off (DET) curves are plotted in Fig. 2.

[Fig. 2: DET curves of systems v1-v8.]
From the above experimental results, the following conclusions can be drawn:

• FID loss has a positive effect (v1 vs. v2);
• Conditional WGAN outperforms WGAN (v3 vs. v4);
• Triplet loss is preferred (v7 vs. v2);
• Triplet (a) greatly outperforms triplet (b) (v3 vs. v8);
• Softmax has a positive effect (v3 vs. v5);
• Center loss has a negative effect (v6 vs. v7);
• Cosine loss has a significant positive effect (v6 vs. v8).

The above findings are very interesting, with a twofold outcome. Firstly, they demonstrate that the additional training functions (e.g. traditional softmax, cosine loss and triplet loss) all contribute positively to the performance, which verifies our earlier statement that extra training guides might be helpful for feature discriminability. Secondly, some loss criteria that are less favoured for a typical SV system (e.g. FID loss and conditional WGAN loss) are surprisingly helpful; these are unusual findings and might be worthy of further investigation.

4.3. Comparison with the Baseline System

Finally, we compare the performance of our best system (v3) with that of the G-vector baseline system. The comparison is measured in terms of equal error rate (EER) and minDCF; the results are reported in Tab. 2.

[Table 2: EER and minDCF of the proposed system (v3) versus the G-vector baseline for 1-second and 2-second enroll-test utterances.]
From the table, we can see that our proposed system also generalizes well and behaves consistently over the baseline system across different short durations. In detail, for verification with 2-second enroll-test utterances, our proposed system shows a 4.2% relative EER improvement and a 7.2% relative minDCF improvement. For shorter utterances with a duration of 1 second, it shows a comparable EER improvement (3.8%).
It is worth noting that, due to time constraints, the FID loss function has not been added to our final system; besides, there was no fine-tuning of the hyper-parameters, the loss weights α, β, γ, λ, ε or the triplet margin Ψ. This means there is still a lot of room for improvement in our system.

  5. CONCLUSIONS

In this paper, we have successfully applied the WGAN to learn enhanced embeddings for speaker verification with short utterances. Our main contributions are twofold: we proposed a WGAN-based kernel system and, on top of it, validated the effectiveness of a bunch of loss criteria for GAN training. Our final proposed system outperforms the baseline system in the challenging short-utterance speaker verification scenario. In all, our experiments show both decent advancement and a potential direction for our further research.

  6. REFERENCES
[1] Martín Arjovsky, Soumith Chintala, and Léon Bottou. Wasserstein GAN. ArXiv, 2017. URL: https://arxiv.org/pdf/1701.07875.pdf.
[2] D. Snyder, D. Garcia-Romero, D. Povey, and S. Khudanpur. Deep neural network embeddings for text-independent speaker verification. In Interspeech, page 999, 2017.
[3] N. Dehak, P. J. Kenny, R. Dehak, P. Dumouchel, and P. Ouellet. Front-end factor analysis for speaker verification. IEEE Transactions on Audio, Speech, and Language Processing, 19(4):788–798, May 2011.
[4] Daniel Garcia-Romero and Carol Espy-Wilson. Analysis of i-vector length normalization in speaker recognition systems. In Interspeech, pages 249–252, 2011.
[5] Gautam Bhattacharya, Jahangir Alam, and Patrick Kenny. Deep speaker embeddings for short duration speaker verification. In Interspeech, Stockholm, pages 1517–1521, 2017.
[6] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Advances in Neural Information Processing Systems 27, pages 2672–2680, 2014.
[7] Jiacen Zhang, Nakamasa Inoue, and Koichi Shinoda. I-vector transformation using conditional generative adversarial networks for short utterance speaker verification. In Interspeech, Hyderabad, pages 3613–3617, 2018.
[8] Na Li, Deyi Tuo, Dan Su, Zhifeng Li, and Dong Yu. Deep discriminative embeddings for duration robust speaker verification. In Interspeech, pages 2262–2266, 2018.
[9] Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. GANs trained by a two time-scale update rule converge to a local Nash equilibrium. In NIPS, pages 6626–6637, 2017.
[10] Mehdi Mirza and Simon Osindero. Conditional generative adversarial nets. ArXiv, 2014. URL: https://arxiv.org/pdf/1411.1784.pdf.
[11] A. Poddar, M. Sahidullah, and G. Saha. Performance comparison of speaker recognition systems in presence of duration variability. In 2015 Annual IEEE India Conference (INDICON), pages 1–6, Dec 2015.
[12] S. J. D. Prince and J. H. Elder. Probabilistic linear discriminant analysis for inferences about identity. In 2007 IEEE 11th International Conference on Computer Vision, pages 1–8, Oct 2007.
[13] Qingyang Hong, Lin Li, Lihong Wan, Jun Zhang, and Feng Tong. Transfer learning for speaker verification on short utterances. In Interspeech, pages 1848–1852, 2016.
[14] Ruifang Ji, Xinyuan Cai, and Bo Xu. An end-to-end text-independent speaker identification system on short utterances. In Interspeech, pages 3628–3632, 2018.
[15] Qing Wang, Wei Rao, Sining Sun, Lei Xie, Eng Siong Chng, and Haizhou Li. Unsupervised domain adaptation via domain adversarial training for speaker recognition. In ICASSP, pages 4889–4893, 2018.
[16] Yandong Wen, Kaipeng Zhang, Zhifeng Li, and Yu Qiao. A discriminative feature learning approach for deep face recognition. In ECCV, pages 499–515, 2016.
[17] Jee-weon Jung, Hee-Soo Heo, Hye-jin Shim, and Ha-Jin Yu. Short utterance compensation in speaker verification via cosine-based teacher-student learning of speaker embeddings. ArXiv, 2018. URL: https://arxiv.org/pdf/1810.10884.pdf.
[18] Weidi Xie, Arsha Nagrani, Joon Son Chung, and Andrew Zisserman. Utterance-level aggregation for speaker recognition in the wild. In ICASSP, pages 5791–5795, 2019.
[19] Youzhi Tu, Man-Wai Mak, and Jen-Tzung Chien. Variational domain adversarial learning for speaker verification. In Interspeech, pages 4315–4319, 2019.
[20] Chunlei Zhang and Kazuhito Koishida. End-to-end text-independent speaker verification with triplet loss on short utterances. In Interspeech, pages 1487–1491, 2017.
