Power Cepstrum Calculation with Convolutional Neural Networks
A convolutional neural network model that computes the power cepstrum of its input signal is proposed. To do so, the network internally computes the discrete-time short-term Fourier transform, obtaining the spectrogram of the signal as an intermediate step. The weights of the network can either be set analytically or learned with gradient descent, and the training behaviour is analysed. The model as originally proposed cannot be trained end to end, but two parts of it can: the subnetwork that computes the spectrogram, and a variant of the cepstrum, equivalent to the autocovariance, that retains much of the cepstrum's usefulness. For the cases in which training succeeds, the learned weights are analysed. The main conclusions are, on the one hand, that the power cepstrum can be computed with a neural network and, on the other, that the trainable models can serve as the initial layers of a deep learning system. In those layers, the weights are initialised with the discrete Fourier transform (DFT) coefficients and then trained to adapt to specific classification or regression problems.
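As a point of reference for what the network is meant to reproduce, the power cepstrum of a windowed frame can be sketched directly with NumPy as the inverse DFT of the log power spectrum. This snippet is illustrative and not taken from the paper; the `eps` constant guarding against `log(0)` is an assumption.

```python
import numpy as np

def power_cepstrum(frame, eps=1e-10):
    """Power cepstrum of one frame: inverse DFT of the log power spectrum.

    `eps` is a small regularising constant (an assumption, not from the
    paper) that avoids taking log(0) for zero-valued spectral bins.
    """
    spectrum = np.fft.rfft(frame)                 # DFT of the frame
    log_power = np.log(np.abs(spectrum) ** 2 + eps)  # log power spectrum
    return np.fft.irfft(log_power)                # back to the quefrency domain

# Hypothetical usage: a 1 kHz tone sampled at 16 kHz, Hann-windowed
fs = 16000
t = np.arange(512) / fs
frame = np.sin(2 * np.pi * 1000 * t) * np.hanning(512)
ceps = power_cepstrum(frame)
```

A classic sanity check for this computation is an echo: a signal mixed with a delayed copy of itself produces a cepstral peak at the quefrency equal to the delay, which is the property the cepstrum's pitch- and echo-detection applications rely on.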
Copyright (c) 2019 Mario Alejandro García, Eduardo Atilio Destéfanis
This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.