On the Use of the Novel Zorro Activation Functions in Long Short-Term Memory Neural Networks

Authors

  • Victor Adrian Jimenez, Grupo de Investigación en Tecnologías Informáticas Avanzadas, Universidad Tecnológica Nacional
  • Matías Roodschild, Grupo de Investigación en Tecnologías Informáticas Avanzadas, Universidad Tecnológica Nacional
  • Jorge Gotay-Sardiñas, Grupo de Investigación en Tecnologías Informáticas Avanzadas, Universidad Tecnológica Nacional
  • Adrián Will, Grupo de Investigación en Tecnologías Informáticas Avanzadas, Universidad Tecnológica Nacional

DOI:

https://doi.org/10.24215/16666038.26.e03

Keywords:

Long Short-Term Memory (LSTM), Activation Functions, Gating Functions

Abstract

Activation functions are fundamental components of modern neural networks, including Large Language Models (LLMs). Nonlinear activations regulate the flow of information in most models and determine how data is processed. However, training or even fine-tuning very large and complex models, such as those behind ChatGPT and DeepSeek, remains out of reach for many researchers. For this reason, traditional architectures such as Long Short-Term Memory (LSTM) remain relevant for small and moderate-scale applications. In this context, the choice of activation functions can significantly influence a network's ability to learn complex patterns under limited computing resources. In this work, we analyze the behavior of a novel family of activation functions, called Zorro, which can serve as gating functions in LSTM architectures. We propose replacing the traditional activation and gating functions in LSTMs with Zorro functions to improve model performance and convergence speed. Unlike conventional approaches, our method assigns a different function to each gate or activation, and the same methodology can be applied to any gated architecture. We evaluate the modified LSTM models on widely used small-scale benchmark datasets: Japanese Vowels and Human Activity Recognition for classification, and Chickenpox and Turbofan Degradation for regression. The results show that our method improves model accuracy by up to 10% and reduces training time by up to 15%.
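
To illustrate what assigning a different function to each gate or activation can look like in practice, here is a minimal sketch of one LSTM time step in plain Python, where every gate takes its own activation from a dictionary. This is not the authors' implementation. The Zorro functions are not defined on this page, so `hard_sigmoid` is only a hypothetical stand-in for a replacement gating function, and the names `lstm_step`, `affine`, `params`, and `acts` are assumptions made for this example.

```python
import math
from typing import Callable, Dict, List, Tuple

Vector = List[float]
Matrix = List[List[float]]
Activation = Callable[[float], float]
GateParams = Tuple[Matrix, Matrix, Vector]  # (input weights W, recurrent weights U, bias b)


def sigmoid(x: float) -> float:
    """Standard logistic gate used in a conventional LSTM."""
    return 1.0 / (1.0 + math.exp(-x))


def hard_sigmoid(x: float) -> float:
    """Piecewise-linear stand-in for a replacement gate. NOT a Zorro function."""
    return min(1.0, max(0.0, 0.25 * x + 0.5))


def affine(W: Matrix, x: Vector, U: Matrix, h: Vector, b: Vector) -> Vector:
    """Compute W @ x + U @ h + b, one output per hidden unit."""
    return [
        sum(w * xi for w, xi in zip(w_row, x))
        + sum(u * hi for u, hi in zip(u_row, h))
        + bi
        for w_row, u_row, bi in zip(W, U, b)
    ]


def lstm_step(
    x: Vector,
    h_prev: Vector,
    c_prev: Vector,
    params: Dict[str, GateParams],
    acts: Dict[str, Activation],
) -> Tuple[Vector, Vector]:
    """One LSTM step in which each gate or activation uses its own function."""
    z: Dict[str, Vector] = {}
    for name in ("input", "forget", "output", "candidate"):
        W, U, b = params[name]
        z[name] = [acts[name](v) for v in affine(W, x, U, h_prev, b)]
    # Cell state: keep part of the old state, add the gated candidate.
    c = [
        f * cp + i * g
        for f, cp, i, g in zip(z["forget"], c_prev, z["input"], z["candidate"])
    ]
    # Hidden state: output gate applied to the squashed cell state.
    h = [o * acts["cell_out"](ci) for o, ci in zip(z["output"], c)]
    return h, c


# Conventional choice: sigmoid on every gate, tanh on candidate and cell output.
baseline_acts: Dict[str, Activation] = {
    "input": sigmoid,
    "forget": sigmoid,
    "output": sigmoid,
    "candidate": math.tanh,
    "cell_out": math.tanh,
}

# Per-gate choice: swap only the forget gate, leave the others unchanged.
mixed_acts = dict(baseline_acts, forget=hard_sigmoid)
```

Because the activations are passed in as a dictionary keyed by gate, changing one gate does not touch the others, and the same pattern would carry over to other gated cells such as a GRU.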

Published

2026-04-10

Section

Original Articles

How to Cite

[1]
V. A. Jimenez, M. Roodschild, J. Gotay-Sardiñas, and A. Will, “On the Use of the Novel Zorro Activation Functions in Long Short-Term Memory Neural Networks”, JCS&T, vol. 26, no. 1, p. e03, Apr. 2026, doi: 10.24215/16666038.26.e03.
