TY - CHAP
U1 - Conference publication
A1 - Stadelmann, Thilo
A1 - Glinski-Haefeli, Sebastian
A1 - Gerber, Patrick
A1 - Dürr, Oliver
T1 - Capturing suprasegmental features of a voice with RNNs for improved speaker clustering
T2 - 8th IAPR TC3 Workshop on Artificial Neural Networks in Pattern Recognition (ANNPR), 19-21 September 2018, Siena, Italy
N2 - Deep neural networks have become a veritable alternative to classic speaker recognition and clustering methods in recent years. However, while the speech signal clearly is a time series, and despite the body of literature on the benefits of prosodic (suprasegmental) features, identifying voices has usually not been approached with sequence learning methods. Only recently has a recurrent neural network (RNN) been successfully applied to this task, while the use of convolutional neural networks (CNNs) (which, unlike RNNs, cannot capture arbitrary time dependencies) still prevails. In this paper, we show the effectiveness of RNNs for speaker recognition by improving state-of-the-art speaker clustering performance and robustness on the classic TIMIT benchmark. We provide arguments why RNNs are superior by experimentally showing a "sweet spot" of the segment length for successfully capturing prosodic information that has been theoretically predicted in previous work.
KW - Speaker clustering
KW - Speaker recognition
KW - Recurrent neural network
Y1 - 2018
UN - https://nbn-resolving.org/urn:nbn:de:bsz:kon4-opus4-22870
SN - 978-3-319-99978-4
SN - 978-3-319-99977-7
DO - https://doi.org/10.1007/978-3-319-99978-4_26
SP - 333
EP - 345
PB - Springer
CY - Cham
ET - Accepted version
ER -