
When it comes to landing high-paying jobs, men have a decided edge over women, if open-source artificial intelligence (AI) models are doing the hiring.
Those are the findings of a new paper, titled “Who Gets the Callback? Generative AI and Gender Bias,” that examines gender bias among AI models at a time when more recruiters and corporate HR departments are leaning on the technology.
“We find that most models reproduce stereotypical gender associations and systematically recommend equally qualified women for lower-wage roles,” concluded co-authors Rochana Chaturvedi, a PhD candidate at the University of Illinois, and Sugat Chaturvedi, assistant professor at Ahmedabad University in India. They analyzed several mid-sized open-source large language models (LLMs) for signs of gender bias in hiring recommendations.
The researchers examined a dataset of 332,044 English-language job ads from India’s National Career Services online job portal, and sized up the hiring tendencies of Llama-3-8B-Instruct, Qwen2.5-7B-Instruct, Llama-3.1-8B-Instruct, Granite-3.1-8B-it, Ministral-8B-Instruct-2410, and Gemma-2-9B-it.
They found that when the models were supplied with job descriptions and asked to choose between two equally qualified candidates, one male and one female, the callback rate for females was significantly lower than for males, especially for higher-paying jobs. “These biases stem from entrenched gender patterns in the training data as well as from an agreeableness bias induced during the reinforcement learning from human feedback stage,” the co-authors wrote.
The rate at which the different models recommended a female candidate varied widely, from 1.4% for Ministral to 87.3% for Gemma, they found. The most balanced model was Llama-3.1, with a female callback rate of 41%.
Meta Platforms Inc.’s Llama-3.1 was also the most likely to refuse to consider gender. It avoided picking a candidate by gender in 6% of cases, compared to 1.5% or less for other models.
As for pay, the difference between sexes was stark.
“We find that the wage gap is lowest for Granite and Llama-3.1 (≈ 9 log points for both), followed by Qwen (≈ 14 log points), with women being recommended for lower wage jobs than men,” the paper summarized. “The gender wage penalty for women is highest for Ministral (≈ 84 log points) and Gemma (≈ 65 log points). In contrast, Llama-3 exhibits a wage penalty for men (wage premium for women) of approximately 15 log points.”
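For readers unfamiliar with the unit, a gap quoted in log points can be translated into an approximate percentage difference. The sketch below is illustrative and is not taken from the paper. It assumes the standard reading in which a gap of x log points means the two groups' average log wages differ by x/100, so that the lower wage is exp(-x/100) times the higher one. The function name and the dictionary are hypothetical. The figures are the rounded values quoted above.

```python
import math


def log_points_to_percent_lower(log_points: float) -> float:
    """Convert a wage gap in log points to the percent by which the
    lower wage falls short of the higher wage.

    Assumption (not from the paper): a gap of x log points means
    ln(w_high) - ln(w_low) = x / 100, so w_low / w_high = exp(-x / 100).
    """
    return (1.0 - math.exp(-log_points / 100.0)) * 100.0


# Rounded wage penalties for women, in log points, as quoted in the article.
# Llama-3 is left out because its gap runs the other way (a penalty for men).
quoted_penalties_for_women = {
    "Granite": 9,
    "Llama-3.1": 9,
    "Qwen": 14,
    "Gemma": 65,
    "Ministral": 84,
}

if __name__ == "__main__":
    for model, gap in quoted_penalties_for_women.items():
        pct = log_points_to_percent_lower(gap)
        print(f"{model}: {gap} log points is roughly {pct:.1f}% lower pay")
```

For small gaps the log-point figure and the percentage are close, which is why economists often treat them as interchangeable. For the largest gaps reported here the two diverge noticeably, so the conversion matters when comparing models.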
The results of the study underscore the challenges AI hiring models pose as more companies rely on them to make job recommendations, industry experts say. “AI technologies are increasingly being implemented in organizations to enhance HRM (human resource management) across a range of activities and departments to support operational performance and value creation,” the National Library of Medicine concluded in another paper on the role of HR in the AI age.