Friday, May 3, 2019 · 10:15 a.m. · 13m 00s
Abstract: A recent trend in audio and speech processing is to train neural networks directly on raw waveforms for various classification tasks. While this approach has been shown to perform well, there is limited understanding of what kind of information the neural networks learn from the waveforms. Such insight is interesting not only for advancing these techniques but also for better understanding the characteristics of audio and speech signals. In this talk, taking inspiration from the vision community, I will present a gradient-based visualization method that can provide insight into which spectral characteristics of a given input have the highest impact on the prediction score. I will demonstrate the potential of the proposed approach on two classification tasks: phoneme recognition and speaker identification.
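The abstract does not spell out the specific method; as a rough illustration only, the sketch below shows a generic gradient-based saliency computation in PyTorch, assuming a hypothetical classifier `model` that maps a raw-waveform tensor to class logits (the speaker's actual approach may differ).

    # Minimal gradient-saliency sketch (illustrative, not the presented method).
    # Assumes `model` maps a waveform tensor of shape (1, num_samples) to logits.
    import torch

    def waveform_saliency(model, waveform, target_class):
        """Return |d score / d input sample| for a single waveform."""
        model.eval()
        x = waveform.clone().detach().requires_grad_(True)   # (1, num_samples)
        score = model(x)[0, target_class]                     # scalar prediction score
        score.backward()                                      # gradient w.r.t. input samples
        return x.grad.abs().squeeze(0)                        # per-sample sensitivity

The per-sample gradient can then be inspected in the frequency domain, for example via a short-time Fourier transform, to see which spectral components most influence the prediction.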