With an excess of confidence, Large Language Models (LLMs) can lie. Can we teach them not to be so sneaky? 

A group of researchers at MIT’s Computer Science and Artificial Intelligence Laboratory (CSAIL) has created a training methodology for LLMs, called RLCR (Reinforcement Learning with Calibration Rewards), that computes the certainty, in the form of a confidence score, for each LLM answer.

The LLMs can use these scores to better assess their own accuracy, and even to improve their answers.

Today, LLMs can be brazenly confident of the answers they provide, to the point where their certainty can fool a user into believing that the material presented to them is undoubtedly truthful. This assuredness can hide a deeper ambiguity as to the answer’s actual fidelity to the truth, a calibration error of confidence, as it were. 

The act of reasoning about this uncertainty can itself help a model improve its performance, especially if the model is a smaller one, the researchers state.

Lucky Guesses 

Reasoning LLMs are typically trained through reinforcement learning (RL), with a reward that independently verifies each answer. That assessment is binary: an answer is deemed accurate or not.

“While simple and effective for improving accuracy, this reward comes with a critical limitation: it rewards models equally whether they are confidently correct or merely guessing,” the researchers write in an arXiv preprint.

An answer that has a 51% chance of being accurate gets the same reward function boost as an answer that is 99% known to be accurate.

This means an RL model gets the same reward whether it works through an extensive reasoning process or simply guesses. The model has no way to express uncertainty in the process, the researchers state (hence the cocky attitude about dubious answers).

Even worse, this RL process can actually degrade calibration over time. “The models become more capable and more overconfident at the same time,” said Isha Puri, an MIT PhD student and co-lead author, in a CSAIL post about the work. Better to be right than honest, an RL model might conclude.

This behavior can be particularly problematic in disciplines that require precision, such as law and medicine. 

Be Right, But Be Honest If You Guess

CSAIL’s RLCR reflects a finer gradation of uncertainty back to the LLM. In doing so, it can incentivize accuracy and calibration, the researchers argue. 

For each task, the model states a confidence score along with its answer. RLCR then computes a Brier score, which compares that confidence with the binary state of accuracy (is the answer accurate or not?). The reward gains points when the two are close and loses points when they are wildly disparate.

Both confidently wrong answers and unnecessarily uncertain correct ones are penalized, for instance.
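To make the arithmetic concrete, here is a minimal sketch of a calibration-aware reward. It assumes the reward is the binary correctness score minus the squared gap between the stated confidence and the 0/1 outcome (the Brier penalty); the article does not spell out the exact formula, and the function names are illustrative, not taken from the researchers' code.

```python
def binary_reward(is_correct: bool) -> float:
    """Correctness-only reward: 1 for a right answer, 0 for a wrong one."""
    return 1.0 if is_correct else 0.0


def brier_penalty(confidence: float, is_correct: bool) -> float:
    """Squared gap between the stated confidence and the 0/1 outcome."""
    outcome = 1.0 if is_correct else 0.0
    return (confidence - outcome) ** 2


def calibrated_reward(confidence: float, is_correct: bool) -> float:
    """Assumed combination: correctness reward minus the Brier penalty."""
    if not 0.0 <= confidence <= 1.0:
        raise ValueError("confidence must lie in [0, 1]")
    return binary_reward(is_correct) - brier_penalty(confidence, is_correct)


if __name__ == "__main__":
    # Compare a confident and a hesitant answer, both right and wrong.
    cases = [(0.99, True), (0.51, True), (0.51, False), (0.99, False)]
    for confidence, is_correct in cases:
        reward = calibrated_reward(confidence, is_correct)
        print(f"confidence={confidence:.2f} correct={is_correct} reward={reward:.4f}")
```

Under this assumed formula, a correct answer given at 99% confidence scores about 0.9999, while the same correct answer given at 51% scores about 0.7599, so the two no longer earn identical rewards. A wrong answer given at 99% confidence fares worst, at about -0.9801.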

The researchers tested RLCR using problems where the answers are already known, such as math problems.  

When used by a 7-billion-parameter model in benchmarked tests, RLCR slashed calibration errors by up to 90 percent, and even helped the LLM improve its accuracy in some cases. The improvements held both on tasks that the LLM was trained on and on novel tasks.
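The article does not say which metric sits behind the calibration figures. One common choice is expected calibration error (ECE), which groups answers by stated confidence and checks how far each group's average confidence sits from its actual hit rate. The sketch below is an illustrative implementation of that general idea, with a hypothetical function name and an assumed ten equal-width bins, not the researchers' evaluation code.

```python
def expected_calibration_error(confidences, outcomes, n_bins=10):
    """Weighted average gap between stated confidence and observed accuracy."""
    if len(confidences) != len(outcomes) or not confidences:
        raise ValueError("need equally sized, non-empty inputs")

    # Sort each (confidence, outcome) pair into an equal-width confidence bin.
    bins = [[] for _ in range(n_bins)]
    for confidence, outcome in zip(confidences, outcomes):
        index = min(int(confidence * n_bins), n_bins - 1)
        bins[index].append((confidence, outcome))

    total = len(confidences)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_confidence = sum(c for c, _ in members) / len(members)
        accuracy = sum(o for _, o in members) / len(members)
        # Each bin contributes in proportion to how many answers it holds.
        ece += (len(members) / total) * abs(avg_confidence - accuracy)
    return ece
```

For example, four answers all given at 90% confidence, of which three are right, have an accuracy of 75% and therefore an ECE of about 0.15. A model whose stated confidence matches its hit rate in every bin scores 0.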

The models can actually use the resulting confidence scores to improve inference. The scores are especially useful when the LLM produces multiple candidate answers and needs to choose between them.
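One plausible way to choose between candidates is confidence-weighted voting: add up the confidence attached to each distinct answer and keep the answer with the largest total. The sketch below assumes that scheme for illustration; the article does not describe the exact selection rule the researchers used, and the function name is hypothetical.

```python
from collections import defaultdict


def pick_by_confidence(candidates: list[tuple[str, float]]) -> str:
    """Return the answer whose candidates carry the most total confidence."""
    if not candidates:
        raise ValueError("need at least one candidate")

    # Sum the stated confidence for each distinct answer string.
    totals: dict[str, float] = defaultdict(float)
    for answer, confidence in candidates:
        totals[answer] += confidence

    return max(totals, key=totals.get)


if __name__ == "__main__":
    samples = [("12", 0.9), ("15", 0.3), ("15", 0.4)]
    print(pick_by_confidence(samples))
```

In this example a plain majority vote would pick "15", which appears twice, but the weighted vote picks "12" because its single candidate carries more confidence (0.9) than the two "15" candidates combined (0.7). The approach only helps if the confidence scores are well calibrated in the first place.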

“These results suggest a path toward reasoning systems that are not only accurate, but reliably reason about and communicate uncertainty,” the researchers wrote in their paper. 

This is not the only research aimed at helping LLMs assess their own accuracy internally. The Quiet-STaR research project generates rationales at each token to improve prediction. Conformal prediction, a statistical framework, attaches statistical guarantees to a model's answers.

With so many concerns around LLM hallucinations, the frontier labs may want to investigate building internal metrics to communicate the uncertainty around their answers. 

The researchers are presenting the paper at the International Conference on Learning Representations now taking place.