Transformer-based Automatic Music Mood Classification Using Multi-modal Framework

Authors

DOI:

https://doi.org/10.24215/16666038.23.e02

Keywords:

BERT, bidirectional GRU, music, self-attention, transformer

Abstract

Mood is a psychological state of feeling related to internal emotions and to affect, which is how emotions are expressed outwardly. Studies show that music affects our mood, and that we also tend to choose music according to our current mood. Audio-based techniques can achieve promising results, but lyrics carry information about a song's mood that may not be present in the audio. A multi-modal system that combines textual and acoustic features can therefore provide enhanced accuracy. Sequential networks such as long short-term memory (LSTM) and gated recurrent unit (GRU) networks are widely used in state-of-the-art natural language processing (NLP) models. A transformer model uses self-attention to compute representations of its inputs and outputs; unlike recurrent neural networks (RNNs), which process their input sequentially, transformers can parallelize over input positions during training. In this work, we propose a multi-modal music mood classification system based on transformers and compare its performance with that of a bi-directional GRU (Bi-GRU)-based system with and without attention. The performance is also compared against other state-of-the-art approaches. The proposed transformer-based model achieved higher accuracy than the Bi-GRU-based multi-modal system with single-layer attention, reaching a maximum accuracy of 77.94%.
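To make the multi-modal idea concrete, the sketch below fuses a BERT lyric embedding with a simple mean-pooled MFCC acoustic representation and feeds the concatenated features to a linear mood classifier. This is only an illustration of feature-level fusion under assumed details (the class and variable names, a 20-coefficient MFCC front end, four assumed mood classes such as valence-arousal quadrants, and the placeholder file "song.wav" are all hypothetical); it is not the authors' exact architecture, which also covers Bi-GRU and attention-based variants.

```python
# Minimal sketch of a multi-modal (lyrics + audio) mood classifier.
# All names and hyperparameters are illustrative assumptions, not the
# architecture described in the paper.
import torch
import torch.nn as nn
import librosa
from transformers import AutoTokenizer, AutoModel

NUM_MOODS = 4   # assumed four mood classes (e.g., valence-arousal quadrants)
N_MFCC = 20     # assumed number of MFCC coefficients

class MultiModalMoodClassifier(nn.Module):
    def __init__(self):
        super().__init__()
        # Textual branch: pre-trained BERT encoder over the lyrics.
        self.bert = AutoModel.from_pretrained("bert-base-uncased")
        # Acoustic branch: small MLP over song-level (mean-pooled) MFCCs.
        self.audio_mlp = nn.Sequential(nn.Linear(N_MFCC, 128), nn.ReLU())
        # Fusion: concatenate the [CLS] text embedding with the audio embedding.
        self.classifier = nn.Linear(self.bert.config.hidden_size + 128, NUM_MOODS)

    def forward(self, input_ids, attention_mask, audio_feats):
        text_emb = self.bert(input_ids=input_ids,
                             attention_mask=attention_mask).last_hidden_state[:, 0]
        audio_emb = self.audio_mlp(audio_feats)
        return self.classifier(torch.cat([text_emb, audio_emb], dim=-1))

def extract_audio_features(wav_path):
    """Mean-pooled MFCCs as a simple song-level acoustic representation."""
    y, sr = librosa.load(wav_path, sr=22050)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=N_MFCC)  # (N_MFCC, frames)
    return torch.tensor(mfcc.mean(axis=1), dtype=torch.float32)

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = MultiModalMoodClassifier()
enc = tokenizer(["sample lyric line ..."], return_tensors="pt",
                padding=True, truncation=True)
audio = extract_audio_features("song.wav").unsqueeze(0)  # placeholder file
logits = model(enc["input_ids"], enc["attention_mask"], audio)  # (1, NUM_MOODS)
```

In a sketch like this, the transformer-based lyric encoder can be swapped for a Bi-GRU over GloVe embeddings (with or without an attention layer) to obtain the baseline systems the abstract compares against.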

Published

2023-04-03

How to Cite

A. S, S., & Rajan, R. (2023). Transformer-based Automatic Music Mood Classification Using Multi-modal Framework. Journal of Computer Science and Technology, 23(1), e02. https://doi.org/10.24215/16666038.23.e02

Issue

Section

Original Articles