
Handling Ambiguous Rationales in Natural Language Reasoning

Updated: 2025/02/24 07:24
Keywords: Natural Language Reasoning, NLP, ML
Language models (LMs) have become remarkably proficient in natural language reasoning tasks by leveraging the rationales they generate to enhance their reasoning capabilities. These rationales serve a dual purpose: they not only help improve reasoning performance but also justify model decisions. The problem? Perfect rationales are often impossible to obtain.
Our research, How Ambiguous Are the Rationales for Natural Language Reasoning? A Simple Approach to Handling Rationale Uncertainty, explores this issue in depth. The study introduces AURA (Ambiguous Rationale Utilization for Robust Answering), a novel two-stage reasoning framework that enhances model robustness in the face of ambiguous rationales. Let’s dive into the details of this study and its implications for natural language reasoning.

Research Background and Problem Statement

Language models (LMs) have achieved significant progress in solving complex reasoning tasks that require commonsense knowledge, as well as in handling challenging multiple-choice questions. In particular, recent advancements have enabled models to leverage generated rationales to further enhance reasoning performance. Nevertheless, obtaining perfect rationales from models (or even from humans) is practically impossible. While human annotation can improve the quality of rationales, it is extremely costly and does not guarantee perfect conditions.
More importantly, it is nearly impossible to learn consistent patterns from rationales, which are generated from an enormous number of different statements. Rationales inherently contain various normative concepts that justify different ways of thinking or acting; in other words, the same question may admit different explanations. This diversity causes models to experience uncertainty, making it challenging to learn rationales effectively.
Therefore, our study analyzes the impact of ambiguous rationales on the performance of natural language reasoning models and proposes a method to effectively handle ambiguity.

Methodology

Natural Language Reasoning with Rationales

To begin, we introduce the concept of Natural Language Reasoning with Rationales. In this study, we focus on Multiple Choice Question Answering (MCQA) tasks, where a model is given a question $q$ and a set of answer choices $A=\{a_i\}$.
First, the model computes a plausibility score $\rho(q, a_i)$ for each answer choice and predicts the optimal answer $\hat{a}$ by selecting the choice with the highest score:
$$\hat{a} = \arg\max_{a_i} \rho(q, a_i)$$
When rationales are provided, the model extends the plausibility score by incorporating this additional information. Specifically, given a rationale $r_i$ for an answer choice $a_i$, the model learns an extended plausibility score as follows:
$$\rho(q, A, a_i, r_i) = \sum_{j=1}^{|a_i|} \log P(t_i^{j} \mid t_i^{j-1}, \ldots, t_i^{2}, t_i^{1}, q, A, r_i)$$
where $t_i^{j}$ represents the $j$-th token of the answer choice $a_i$, and $P(t_i^{j} \mid t_i^{j-1}, \ldots, t_i^{2}, t_i^{1}, q, A, r_i)$ denotes the probability of that token appearing given the provided context.
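To make this concrete, here is a minimal sketch of how such an extended plausibility score could be computed with an off-the-shelf causal language model. The model name (`gpt2`) and the prompt template are illustrative assumptions rather than the setup used in the paper; only the token-level log-probability sum mirrors the formula above.

```python
# Sketch: sum of answer-token log-probabilities conditioned on the question,
# the choice set, and the rationale. Model and prompt format are assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

def plausibility_score(question: str, choices: list[str], answer: str, rationale: str) -> float:
    """Approximate rho(q, A, a_i, r_i): sum_j log P(t_i^j | t_i^{<j}, q, A, r_i)."""
    context = (
        f"Question: {question}\n"
        f"Choices: {' / '.join(choices)}\n"
        f"Rationale: {rationale}\n"
        f"Answer:"
    )
    context_ids = tokenizer(context, return_tensors="pt").input_ids
    answer_ids = tokenizer(" " + answer, return_tensors="pt").input_ids
    input_ids = torch.cat([context_ids, answer_ids], dim=1)

    with torch.no_grad():
        logits = model(input_ids).logits  # (1, seq_len, vocab)
    log_probs = torch.log_softmax(logits, dim=-1)

    offset = context_ids.shape[1]
    score = 0.0
    for j in range(answer_ids.shape[1]):
        token_id = answer_ids[0, j]
        # Logits at position (offset + j - 1) predict the token at (offset + j).
        score += log_probs[0, offset + j - 1, token_id].item()
    return score
```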
Subsequently, the probabilities for all answer choices $a_i$ are obtained by normalizing the scores with the softmax function:
$$P(a_i \mid q, A, R) = \frac{e^{\rho_i}}{\sum_{j=1}^{|A|} e^{\rho_j}}$$
Finally, the model is trained to maximize the probability of selecting the correct answer $a^*$ using the cross-entropy loss:
$$L = -\sum_{a_i \in A} Q(a_i \mid q, A) \log P(a_i \mid q, A, R)$$
where $Q(a_i \mid q, A)$ is 1 if $a_i = a^*$ and 0 otherwise.
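As a quick illustration of the normalization and the training objective, the following snippet applies a softmax over hypothetical per-choice scores and computes the cross-entropy loss against the gold answer index; the numbers are made up for demonstration.

```python
# Softmax over per-choice plausibility scores and cross-entropy against the
# gold answer. Scores are placeholder values, not results from the paper.
import torch
import torch.nn.functional as F

rho = torch.tensor([[-4.2, -1.3, -3.7, -5.1]])  # rho(q, A, a_i, r_i) per choice (hypothetical)
gold = torch.tensor([1])                         # index of the correct answer a*

probs = F.softmax(rho, dim=-1)                   # P(a_i | q, A, R)
loss = F.cross_entropy(rho, gold)                # -log P(a* | q, A, R), since Q is one-hot
print(probs, loss.item())
```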

Ambiguous Rationales

Next, we introduce our approach for quantifying the ambiguity of rationales in natural language reasoning. We define ambiguity using the concept of entropy from Information Theory, which measures the amount of information or uncertainty inherent in possible outcomes.
(1) Defining Ambiguity Using Entropy
Entropy is a metric that quantifies the uncertainty or self-information of an event $x$. It is defined as:
$$H = -\sum_x p(x) \log p(x)$$
where $p(x)$ represents the probability of event $x$ occurring, and $-\log p(x)$ denotes the informativeness of the event.
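For intuition, the toy snippet below evaluates this definition on an arbitrary three-outcome distribution; the probabilities are illustrative, not from the paper.

```python
# Toy Shannon entropy: H = -sum_x p(x) log p(x) over a made-up distribution.
import math

p = [0.7, 0.2, 0.1]
H = -sum(px * math.log(px) for px in p)
print(H)  # ~0.80 nats: low entropy, the outcome is fairly predictable
```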
(2) Quantifying Ambiguity of Rationales Using Entropy
In the context of natural language reasoning, we use entropy to quantify the ambiguity of rationales by comparing a model's prior and posterior beliefs about the plausibility of an answer choice $a_i$.
Prior belief: the plausibility score of answer choice $a_i$ as judged by the model before fine-tuning ($\tilde{\theta}$), based on its internal knowledge:
$$\log P(a_i \mid x_i, \tilde{\theta})$$
Posterior belief: the plausibility score of answer choice $a_i$ after fine-tuning ($\hat{\theta}_{\text{MLE}}$):
$$P(a_i \mid x_i, \hat{\theta}_{\text{MLE}})$$
We aim to measure how uncertain the posterior beliefs remain about answers to which the prior beliefs assign high informativeness:
$$H(x) = -P(a_i \mid x_i, \hat{\theta}_{\text{MLE}}) \log P(a_i \mid x_i, \tilde{\theta})$$
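The following sketch implements this ambiguity score for a single example, assuming the prior and posterior probabilities have already been obtained from the pre-trained and fine-tuned models; the probability values shown are placeholders.

```python
# Rationale ambiguity: the posterior probability (fine-tuned model) weights
# the informativeness (-log probability) assigned by the pre-trained model.
import math

def ambiguity(posterior_prob: float, prior_prob: float, eps: float = 1e-12) -> float:
    """H(x) = -P(a_i | x_i, theta_MLE) * log P(a_i | x_i, theta_tilde)."""
    return -posterior_prob * math.log(max(prior_prob, eps))

# Example: the fine-tuned model is fairly confident (0.6) in an answer the
# pre-trained model found surprising (0.05) -> relatively high ambiguity.
print(ambiguity(posterior_prob=0.6, prior_prob=0.05))  # ~1.80
```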
(3) Identifying Ambiguous Rationales
To determine whether a rationale is ambiguous, we calculate the average entropy across the entire dataset, $\tau = \mathcal{E}[H(x)]$. We then classify rationales as ambiguous or unambiguous based on the threshold $\tau$:
$$R(x) = \begin{cases} r_{\text{unambiguous}}, & \text{if } H(x_i) < \tau \\ r_{\text{ambiguous}}, & \text{if } H(x_i) \geq \tau \end{cases}$$
By leveraging this entropy-based approach, we systematically identify ambiguous rationales, allowing the model to adjust its reasoning strategy accordingly.
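A minimal sketch of this thresholding step, with a placeholder list standing in for the per-example entropies computed as above:

```python
# Threshold the per-example ambiguity scores at their mean, tau = E[H(x)].
# `entropies` is a hypothetical placeholder for the real H(x_i) values.
entropies = [0.4, 1.9, 0.7, 2.3, 0.2]          # H(x_i) per training example
tau = sum(entropies) / len(entropies)          # tau = E[H(x)]

labels = ["ambiguous" if h >= tau else "unambiguous" for h in entropies]
ambiguous_indices = [i for i, h in enumerate(entropies) if h >= tau]
print(tau, labels, ambiguous_indices)
```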

AURA: Reasoning with Ambiguous Rationales

We now introduce AURA, our proposed simple yet novel two-stage reasoning framework that enhances model robustness in the face of ambiguous rationales.
AURA operates through a two-stage reasoning system, where the model first learns from the entire dataset and then refines its reasoning using only ambiguous rationales.
STEP 1 (Reasoning 1)
A pre-trained model ($\tilde{\theta}$) is fine-tuned on the entire dataset, producing a fine-tuned model ($\hat{\theta}_{\text{MLE}}$). The rationale entropy is then computed using both the pre-trained model ($\tilde{\theta}$) and the fine-tuned model ($\hat{\theta}_{\text{MLE}}$).
STEP 2 (Reasoning 2)
The pre-trained model ($\tilde{\theta}$) is further trained only on ambiguous rationales (i.e., rationales with high entropy values).
From an ensemble learning perspective, AURA effectively trains two models sequentially, allowing the system to refine its reasoning on underlearned, ambiguous rationales. This process can be represented as:
$$\{P(y \mid x^*, \theta^{(t)})\}_{t=1}^{T=2} \rightarrow \{P(y \mid W^{(t)})\}_{t=1}^{T=2}, \quad W^{(t)} \sim P(W \mid x^{t}, D)$$
where $t=1$ corresponds to Reasoning 1 and $t=2$ corresponds to Reasoning 2.
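The sketch below summarizes the two-stage control flow under these definitions. The `finetune` and `rationale_entropy` callables are hypothetical placeholders for whatever training and scoring code is used, and how the two reasoners are combined at inference is left abstract; only the control flow follows the description above.

```python
# Sketch of AURA's two-stage procedure: Reasoning 1 on all data, entropy-based
# selection of ambiguous examples, Reasoning 2 on the ambiguous subset only.
from typing import Any, Callable, Sequence, Tuple

def aura_two_stage(
    pretrained_model: Any,
    dataset: Sequence[Any],
    finetune: Callable[[Any, Sequence[Any]], Any],
    rationale_entropy: Callable[[Any, Any, Any], float],
) -> Tuple[Any, Any]:
    # Reasoning 1: fine-tune the pre-trained model on the entire dataset.
    model_mle = finetune(pretrained_model, dataset)

    # Score each example's rationale ambiguity H(x) using the prior
    # (pre-trained) and posterior (fine-tuned) models; threshold at the mean.
    entropies = [rationale_entropy(pretrained_model, model_mle, x) for x in dataset]
    tau = sum(entropies) / len(entropies)
    ambiguous = [x for x, h in zip(dataset, entropies) if h >= tau]

    # Reasoning 2: train a second reasoner, again from the pre-trained
    # weights, on the ambiguous subset only.
    model_ambiguous = finetune(pretrained_model, ambiguous)

    # The two reasoners form a simple sequential ensemble.
    return model_mle, model_ambiguous
```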

Experiments and Analysis

In this section, we describe our observations regarding the superiority of AURA, which effectively addresses aleatoric uncertainty in rationales for reasoning tasks.
We conducted experiments on four commonsense question-answering datasets: CSQA, StrategyQA, OpenBookQA (OBQA), and QASC. We compared AURA against baseline models, including those without rationales, self-rationalization models, and pipeline rationalization approaches, and evaluated performance in both In-Distribution (ID) and Out-of-Distribution (OoD) settings. Additionally, we varied the training ratios to assess AURA's performance in low-resource settings, confirming its robustness even with limited training data.
AURA outperformed all baseline methods across multiple benchmarks, achieving the highest accuracy in both ID and OoD settings. AURA also maintained strong performance in low-resource settings, highlighting its effectiveness in learning from limited data.
We also conducted further analysis to answer the following two research questions.
Do high-quality rationales contribute more to overall performance than a good reasoning mechanism?
Results showed that AURA led to greater performance gains than simply using higher-quality rationales, such as switching from machine-generated to human-annotated rationales. In OoD settings, better rationales did not guarantee improved performance, whereas a stronger reasoning approach consistently led to better results, suggesting that robust reasoning can compensate for unreliable rationales.
Do high-quality rationales have a greater influence during training or during inference?
Training with human rationales improved accuracy by around 4.29% to 4.53%, but using them during inference resulted in a much larger boost of 22.2% to 22.44%. This indicates that high-quality rationales are far more beneficial when applied at the inference stage rather than during training.

Conclusion and Limitations

In this study, we analyzed the impact of ambiguous rationales on reasoning performance and proposed AURA, a reasoning mechanism designed to effectively handle rationale uncertainty. Through extensive experiments, we demonstrated that AURA achieves robust and superior performance, even in adversarial rationale quality scenarios and low-resource settings.
Despite its effectiveness, AURA has some limitations. Due to computational constraints, our analysis primarily focused on medium-sized language models rather than larger-scale models. Additionally, we concentrated on commonsense reasoning tasks rather than those requiring domain-specific knowledge, as our goal was to explore the benefits of pre-trained language models' (PLMs) prior knowledge. Future work could extend AURA to larger models and specialized reasoning domains to further validate its applicability.

References

Hazel H. Kim. 2025. How Ambiguous Are the Rationales for Natural Language Reasoning? A Simple Approach to Handling Rationale Uncertainty. In Proceedings of the 31st International Conference on Computational Linguistics, pages 10047–10053, Abu Dhabi, UAE. Association for Computational Linguistics.
About the author: Hyunji Kim (Hazel Kim)
She received her master's degree in Artificial Intelligence from Yonsei University and worked as a CT AI Researcher. She is currently a PhD student at the University of Oxford. Her research interests include natural language processing, learning with limited data, and the uncertainty and controllability of language models.