Beyond Specialization: Assessing the Capabilities of MLLMs in Age and Gender Estimation

Authors

Maksim Kuprashevich, Grigorii Alekseenko and Irina Tolstykh, Layer Team, Uzbekistan

Abstract

Multimodal Large Language Models (MLLMs) have recently gained immense popularity. Powerful commercial models such as ChatGPT and Gemini, as well as open-source ones such as LLaVA, are essentially general-purpose models applied to a wide variety of tasks, including those in computer vision. These networks possess such strong general knowledge and reasoning abilities that they have proven capable of handling even tasks for which they were not specifically trained. We compared the capabilities of some of the most powerful MLLMs to date, including ShareGPT4V, ChatGPT-4V/4o, and LLaVA-NeXT, against the state-of-the-art specialized model MiVOLO384 on the task of age and gender estimation. In our study, we found that the fine-tuned open-source ShareGPT4V model is capable of outperforming the specialized model in age and gender estimation, while the proprietary ChatGPT-4o surpasses both in age estimation but performs less confidently in gender recognition. These results offer interesting insights into the strengths and weaknesses of the participating models and suggest that, with targeted fine-tuning, general-purpose MLLMs can match or even surpass specialized models in certain domains. Although such fine-tuned models may require more computational resources, they offer substantial benefits for tasks where compute is not a limiting factor and top accuracy is essential, such as data annotation.

Keywords

MLLM, VLM, Human Attribute Recognition, Age Estimation, Gender Estimation, Large Model Generalization