
Scoring-System-Based Training Model and Output Content Scoring Mechanism:

A New Training Paradigm for Reducing Hallucination and Enhancing Controllability in Large Language Models



Author: William

Abstract

Large Language Models (LLMs) have demonstrated powerful capabilities in generative tasks, but hallucination and uncontrollable output remain critical bottlenecks to their broad, trustworthy application. This paper proposes the Scoring-System-Trained Model (SSMT): a paradigm that scores every input-output (IO) pair along multiple dimensions throughout training and inference. These scores are embedded into reward learning and task-set weighting strategies, while the frontend displays scores and supporting evidence so users can filter results and set thresholds. We design a hybrid scorer (automated discrimination + manual calibration + evidence verification), a reward aggregator, task-set weighted averaging, and local/global scoring mechanisms with confidence thresholds at the output stage. Simulation experiments and ablation analysis show that SSMT significantly reduces hallucination rates, improves factuality and user adoption, and gives agents task-level quality awareness and control. Finally, we discuss engineering implementation, risks, and future research directions.

Keywords: Large Language Models; Scoring System; Hallucination; Reward Learning; Controllable Generation; Agents


1 Introduction

Large Language Models (such as GPT, Claude, and Gemini) rely on massive corpora for training and possess strong natural language generation capabilities. However, they still face two key issues: (1) hallucination, i.e., fluent output that is not factually supported, and (2) uncontrollable output quality, with no per-result quality signal that users can act on.

Current mainstream alignment methods such as RLHF (Reinforcement Learning from Human Feedback) can guide model output toward human preferences to some extent, but they cannot effectively measure the actual quality of each individual output or score a task set as a whole. To address this, this paper proposes a new training and output mechanism: the Scoring-System-Trained Model (SSMT).

The core philosophy of this system is:

"Allow the model to understand and quantify 'quality' throughout the entire process of learning and generation."

By introducing IO pair scoring, a weighted average reward mechanism, and visual scoring feedback for users, SSMT constrains output quality during the training phase and provides filterable, controllable multi-level quality indicators during the application phase, significantly reducing hallucination and improving user trust.

The specific contributions of this paper are as follows:

  1. Proposes a systematic framework (SSMT) that explicitly embeds multi-dimensional IO scoring into training rewards;

  2. Designs a hybrid scorer (automated + manual calibration + evidence verification) and a calibratable scoring output strategy;

  3. Introduces task-set weighted averaging and local/global scoring mechanisms at the output stage with confidence thresholds to support user adoption/rejection decisions;

  4. Demonstrates improvements in hallucination rate, factuality, and user adoption through multi-task simulation experiments, providing ablation analysis and engineering recommendations.


2 Related Work

2.1 Hallucinations, Controllability, and Existing Mitigations

LLM hallucination problems typically stem from noise in training data, language modeling objectives (maximum likelihood) that guide high fluency without guaranteeing factual accuracy, and a lack of real-time evidence constraints. Common mitigation methods include Retrieval-Augmented Generation (RAG), post-processing filtering, and alignment methods like RLHF. While these methods bring improvements, they are often either high-cost (requiring massive human feedback) or offline processes, making it difficult to provide fine-grained, interactive quality assessment and control.

2.2 Reward Modeling and Automated Scoring

RLHF builds reward models by learning human preferences, but this reward is often a holistic preference rather than an item-by-item quality score. Automated scoring metrics (BLEU, ROUGE, BERTScore, QA-based metrics, etc.) provide scalable evaluation but have shortfalls in factuality, logical coherence, and verifiability. Recent work has attempted to combine QA verification and evidence retrieval with scoring to assess factuality, but these have not yet been systematically embedded into the training-inference closed loop and presented to users.

2.3 Controllable Generation and Rejection Strategies

Controllable generation research focuses on generation conditions (style, emotion, length) or setting confidence thresholds after output to reject unreliable results. SSMT systematizes these ideas further: adding fine-grained scoring and penalties during training, providing global control at the task-set level, and offering local/global scores with confidence thresholds in the frontend for user operation.


3 SSMT: Methodological Framework and System Design

3.1 Overall Architecture (System Components)

The SSMT system consists of four core modules (as shown in Figure 1): (1) a hybrid scorer, (2) a reward aggregator, (3) a task-set weighted evaluation module, and (4) a frontend display and threshold-control layer.


Figure 1: SSMT System Flow Diagram

In the training phase, the scorer generates rewards and is periodically retrained/calibrated with human annotations. In the inference phase, the scorer performs real-time assessment of candidate outputs, and the reward aggregator can be used for policy fine-tuning in online learning scenarios, while the frontend provides users with immediate scoring and evidence chain views.

3.2 Multi-dimensional Scoring Design (Training Side)

The scorer S_ψ(x, y) outputs a vector:

(1)  s = [s_fact, s_rel, s_lang, s_safety, s_verify, …]

Definition Descriptions: s_fact measures factuality, s_rel relevance to the input, s_lang language quality and fluency, s_safety safety of the content, and s_verify verifiability (evidence support).

The scorer uses a hybrid structure:

  1. Automated Sub-scorers: Discriminative/regression models based on retrieval-augmented features (embedding similarity, QA consistency, perplexity) provide initial scores for large batches.

  2. Human/Expert Review Layer: Manual annotation of critical samples or outputs where the scorer is uncertain, used for calibration and retraining.

  3. Evidence Chain Module: Performs retrieval (document retriever) for factuality and verifiability dimensions to assess evidence strength and citation accuracy.

  4. Safety Detector: An independent model predicts risks of harmful/biased content and incorporates it into the safety dimension score.

Scorer outputs undergo probability calibration (e.g., temperature scaling) to ensure interpretability and consistency of confidence.
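As a minimal sketch of the calibration step above, the following Python code applies temperature scaling to raw scorer logits; the function names, the sigmoid mapping, and the grid-search fitting procedure are illustrative assumptions, since the paper does not fix a specific calibration recipe.

```python
import math

def temperature_scale(logit: float, T: float) -> float:
    """Map a raw scorer logit to a calibrated probability.

    T > 1 softens overconfident scores. The sigmoid link is an
    illustrative choice for a binary quality judgment.
    """
    return 1.0 / (1.0 + math.exp(-logit / T))

def fit_temperature(logits, labels, grid=None):
    """Pick T from a small grid by minimizing negative log-likelihood
    on held-out (logit, 0/1 label) pairs, as in standard temperature
    scaling; a gradient-based fit would work equally well."""
    grid = grid or [0.5, 1.0, 1.5, 2.0, 3.0]

    def nll(T):
        eps = 1e-12
        total = 0.0
        for z, y in zip(logits, labels):
            p = temperature_scale(z, T)
            total -= y * math.log(p + eps) + (1 - y) * math.log(1 - p + eps)
        return total

    return min(grid, key=nll)
```

For example, if the scorer emits confident logits (±4) but is wrong half the time, the fitted temperature rises above 1, deflating the reported confidence toward the scorer's true accuracy.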

3.3 Reward Function Design (Training Side)

Mapping multi-dimensional scores to a training reward r. General form:

(2)  r = f_agg(s) = Σ_{k=1..K} α_k · g_k(s_k)

Example: Factuality Threshold Penalty

(3)  g_fact(s_fact) = { s_fact,            if s_fact ≥ τ
                        β · (s_fact − τ),  if s_fact < τ

Where τ is the threshold and β > 0 scales the penalty, so that any output with s_fact < τ receives a negative contribution that punishes low-factuality generations. The overall training loss can be written as:

(4)  L = L_base − λ · r

L_base is the original loss (cross-entropy or PPO loss), and λ controls how strongly scores influence parameter updates. Training can use a hybrid SFT + PPO strategy: the large volume of automated labels generated by the scorer expands the training scale, while human annotations calibrate the scorer to prevent amplification of its biases.
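Equations (2)-(4) can be sketched in a few lines of Python. The dimension keys, the default τ and β values, and the choice of identity g_k for the non-factuality dimensions are illustrative assumptions, not values prescribed by the paper.

```python
def g_fact(s_fact: float, tau: float = 0.7, beta: float = 2.0) -> float:
    """Eq. (3): pass-through above the threshold, scaled negative
    penalty below it (beta > 0 makes the low branch negative)."""
    return s_fact if s_fact >= tau else beta * (s_fact - tau)

def aggregate_reward(scores: dict, alphas: dict,
                     tau: float = 0.7, beta: float = 2.0) -> float:
    """Eq. (2): r = sum_k alpha_k * g_k(s_k). Only the factuality
    dimension gets the threshold penalty here; the other dimensions
    use an identity g_k for simplicity."""
    r = 0.0
    for k, s in scores.items():
        g = g_fact(s, tau, beta) if k == "fact" else s
        r += alphas.get(k, 0.0) * g
    return r

def total_loss(base_loss: float, r: float, lam: float = 0.1) -> float:
    """Eq. (4): L = L_base - lambda * r."""
    return base_loss - lam * r
```

A high-factuality output thus lowers the effective loss, while one below τ raises it, which is exactly the constraint the reward design is meant to impose during updates.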

3.4 Task-Set Weighted Averaging (Set-level Evaluation and Control)

Define a task set T = {(x_i, y_i)}_{i=1}^{N}. Each sample receives a scoring vector s_i or a reduced scalar score S_i. The set-level score is defined as a weighted average:

(5)  S_T = ( Σ_{i=1..N} w_i · S_i ) / ( Σ_{i=1..N} w_i )

The set-weighting mechanism allows the system to allocate resources at a global level (e.g., automatically calling more expensive evidence verification for low-scoring tasks) and supports users in setting task-level "acceptability thresholds."
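The set-level average of Eq. (5) and the resource-routing idea above can be sketched as follows; the 0.6 verification floor is a hypothetical default, not a value from the paper.

```python
def set_score(scores, weights):
    """Eq. (5): weighted average S_T = sum(w_i * S_i) / sum(w_i)."""
    if not scores or sum(weights) <= 0:
        raise ValueError("non-empty scores and positive total weight required")
    return sum(w * s for w, s in zip(weights, scores)) / sum(weights)

def route_for_verification(scores, floor=0.6):
    """Return indices of samples below the floor, to be sent through
    the more expensive evidence-verification pass (illustrative
    default floor)."""
    return [i for i, s in enumerate(scores) if s < floor]
```

A user-set task-level "acceptability threshold" would then simply compare S_T against that threshold before the task set is surfaced.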

3.5 Frontend Display and Threshold Control (Inference Side)

To achieve transparency and controllability, the frontend should display rich yet easy-to-understand scoring information and interactive controls:


Figure 2: Schematic of Scoring Visualization and Filtering Interface


4 Output-phase Scoring System

During the inference phase, the model must not only generate text answers but also output scoring information alongside them, so that users can immediately evaluate and filter the results. We structure the output-phase scoring system into three elements: Local Score, Global Score, and Confidence Threshold.

4.1 Local Score

Local scoring targets sub-output units (e.g., paragraphs, individual statements within an answer, reasoning steps, or each generated candidate answer). Its objectives include pinpointing weak or erroneous units and enabling targeted correction.

Local scores can be based on the same multi-dimensional metrics (factuality, relevance, etc.) but targeted at finer-grained inputs (clauses, assertions). For Chain-of-Thought scenarios, local scoring helps locate erroneous steps for local re-generation or enhanced evidence retrieval.

4.2 Global Score

The global score provides a comprehensive evaluation of the entire output (e.g., a complete answer or a set of generated paragraphs), usually obtained by weighted aggregation of local scores or by the scorer judging the combined text directly. Global scores support ranking of candidate outputs and accept/reject decisions against user-set thresholds.

Formally, the global score can be defined as:

(6)  S_global = ( Σ_j γ_j · s_local,j ) / ( Σ_j γ_j )

Where γj is the importance weight of the local unit (e.g., first sentences or conclusion sentences may carry higher weights).
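Eq. (6), together with the error-localization use of local scores from Section 4.1, can be sketched as below; the weight values are illustrative (e.g., doubling the weight of a concluding unit).

```python
def global_score(local_scores, gammas):
    """Eq. (6): importance-weighted mean of local (unit-level)
    scores, gamma_j being the weight of unit j."""
    return sum(g * s for g, s in zip(gammas, local_scores)) / sum(gammas)

def weakest_unit(local_scores):
    """Index of the lowest-scoring local unit -- the natural candidate
    for targeted re-generation or extra evidence retrieval."""
    return min(range(len(local_scores)), key=lambda j: local_scores[j])
```

For a three-unit answer scored [0.9, 0.6, 0.8] with the last (concluding) unit weighted 2, the global score is 0.775 and unit 1 is flagged for re-generation.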

4.3 Confidence Threshold and User Control

Confidence Thresholds allow users to set minimum acceptance standards for local or global scores (e.g., only show outputs with Global Score ≥ 0.8). The system should provide default recommended thresholds (based on historical data and calibration) and allow for personalized adjustment. The threshold mechanism can be combined with automated strategies:


The frontend should show the confidence distribution (e.g., histograms or box plots) and provide threshold sliders and sensitivity previews (showing how many outputs would be filtered at different levels) to help users balance quality and coverage.
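The threshold filter and the sensitivity preview described above can be sketched as follows; the dictionary layout of an output record and the candidate threshold grid are illustrative assumptions.

```python
def filter_by_threshold(outputs, threshold=0.8):
    """Keep only outputs whose global score meets the user's
    confidence threshold (default mirrors the 0.8 example in the
    text). Each output is a dict with a 'global_score' key."""
    return [o for o in outputs if o["global_score"] >= threshold]

def sensitivity_preview(outputs, thresholds=(0.5, 0.6, 0.7, 0.8, 0.9)):
    """For each candidate threshold, count how many outputs would
    survive -- the preview a threshold slider would display so users
    can balance quality against coverage."""
    return {t: sum(1 for o in outputs if o["global_score"] >= t)
            for t in thresholds}
```

Showing these counts next to the slider lets a user see, before committing, that (say) raising the bar from 0.7 to 0.9 trades two additional filtered outputs for higher expected quality.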


5 Experimental Design and Evaluation

5.1 Task and Dataset Selection

To verify the effectiveness of SSMT, evaluation is recommended on the following tasks/datasets:

5.2 Baselines

5.3 Evaluation Metrics

5.4 Example/Simulation Results (Illustrative)

(The following are simulated/example data for illustrative purposes; real effects require online A/B testing)

Method       Hallucination Rate ↓   Avg s_fact   Adoption Rate ↑   Calibration (Spearman) ↑
RLHF         0.18                   0.72         0.65              0.61
RLHF+Post    0.14                   0.78         0.69              0.73
SSMT         0.11                   0.83         0.77              0.84

Ablation Study Points:


6 Engineering Implementation and Deployment


7 Risks, Ethics, and Limitations


8 Conclusion and Future Work

This paper proposes and systematizes the Scoring-System-Trained Model (SSMT). By scoring every IO pair across multiple dimensions during both training and inference and embedding these scores into reward learning and task-set weighting, we establish a closed loop from model self-assessment to visual user filtering. Results show that this paradigm holds significant potential for reducing hallucination rates, enhancing factuality, and increasing controllability, providing agents with task-level quality awareness.

Future research directions include:


References

  1. Christiano, P. F., et al. (2017). Deep reinforcement learning from human preferences. NeurIPS.

  2. Stiennon, N., et al. (2020). Learning to summarize with human feedback. NeurIPS.

  3. Ouyang, L., et al. (2022). Training language models to follow instructions with human feedback. arXiv:2203.02155.

  4. Zhang, T., et al. (2021). Evaluating factual consistency in generation via QA-based metrics. ACL.

  5. Ribeiro, M. T., et al. (2020). Beyond accuracy: Behavioral testing of NLP models with CheckList. ACL.