A Scoring-System-Based Training Paradigm for Reducing Hallucination and Enhancing Controllability in Large Language Models
Large Language Models (LLMs) have demonstrated powerful capabilities in generative tasks, but hallucination and output uncontrollability remain critical bottlenecks hindering their broad and trustworthy application. This paper proposes the Scoring-System-Trained Model (SSMT): a paradigm that performs multi-dimensional scoring on every input-output (IO) pair throughout training and inference. These scores are embedded into reward learning and task-set weighting strategies, while the frontend displays scores and evidence to users to support filtering and threshold control. We design a hybrid scorer (automated discrimination + manual calibration + evidence verification), a reward aggregator, task-set weighted averaging, and local/global scoring mechanisms with confidence thresholds for the output stage. Simulation experiments and ablation analysis show that SSMT substantially reduces hallucination rates, improves factuality and user adoption, and provides agents with task-level quality awareness and control. Finally, we discuss engineering implementation, risks, and future research directions.
Keywords: Large Language Models; Scoring System; Hallucination; Reward Learning; Controllable Generation; Agents
Large Language Models (such as GPT, Claude, Gemini, etc.) rely on massive corpora for training and possess strong natural language generation capabilities. However, they still face two key issues:
Hallucination — The model generates content that is grammatically correct but factually incorrect;
Output Uncontrollability — Users find it difficult to judge and filter the credibility and quality of the generated results.
Current mainstream alignment methods (such as RLHF, Reinforcement Learning from Human Feedback) can guide model output to align with human preferences to some extent, but they cannot effectively measure the actual quality of each output or establish a scoring system for a task as a whole. To this end, this paper proposes a new training and output mechanism: the Scoring-System-Trained Model (SSMT).
The core philosophy of this system is:
"Allow the model to understand and quantify 'quality' throughout the entire process of learning and generation."
By introducing IO pair scoring, a weighted average reward mechanism, and visual scoring feedback for users, SSMT constrains output quality during the training phase and provides filterable, controllable multi-level quality indicators during the application phase, significantly reducing hallucination and improving user trust.
The specific contributions of this paper are as follows:
Proposes a systematic framework (SSMT) that explicitly embeds multi-dimensional IO scoring into training rewards;
Designs a hybrid scorer (automated + manual calibration + evidence verification) and a calibratable scoring output strategy;
Introduces task-set weighted averaging and local/global scoring mechanisms at the output stage with confidence thresholds to support user adoption/rejection decisions;
Demonstrates improvements in hallucination rate, factuality, and user adoption through multi-task simulation experiments, providing ablation analysis and engineering recommendations.
LLM hallucination problems typically stem from noise in training data, language modeling objectives (maximum likelihood) that guide high fluency without guaranteeing factual accuracy, and a lack of real-time evidence constraints. Common mitigation methods include Retrieval-Augmented Generation (RAG), post-processing filtering, and alignment methods like RLHF. While these methods bring improvements, they are often either high-cost (requiring massive human feedback) or offline processes, making it difficult to provide fine-grained, interactive quality assessment and control.
RLHF builds reward models by learning human preferences, but this reward is often a holistic preference rather than an item-by-item quality score. Automated scoring metrics (BLEU, ROUGE, BERTScore, QA-based metrics, etc.) provide scalable evaluation but have shortfalls in factuality, logical coherence, and verifiability. Recent work has attempted to combine QA verification and evidence retrieval with scoring to assess factuality, but these have not yet been systematically embedded into the training-inference closed loop and presented to users.
Controllable generation research focuses on generation conditions (style, emotion, length) or setting confidence thresholds after output to reject unreliable results. SSMT systematizes these ideas further: adding fine-grained scoring and penalties during training, providing global control at the task-set level, and offering local/global scores with confidence thresholds in the frontend for user operation.
The SSMT system consists of the following four core modules (as shown in Figure 1):
Generator (πθ): The base LLM that produces candidate outputs y for each input x;
Scorer (Sψ): A hybrid module that assigns each IO pair a multi-dimensional score vector s = [s_fact, s_rel, ...];
Reward Aggregator: Maps the scoring vector s to a scalar reward r = f(s) used in training updates;
User Frontend: Displays multi-dimensional scores, evidence snippets, and task-level overviews during the inference phase, providing threshold filtering, adoption/rejection, and feedback functions.
System Flow (Schematic):
┌────────────────────┐
│  Input Task Set T  │
└─────────┬──────────┘
          │
   ┌──────▼──────┐
   │ Generator πθ│
   └──────┬──────┘
          │ Output y
   ┌──────▼──────┐
   │  Scorer Sψ  │ → Multi-dim scores s = [s_fact, s_rel, ...]
   └──────┬──────┘
          │
   ┌──────▼──────┐
   │ Reward Aggr │ → Reward r = f(s)
   └──────┬──────┘
          │
   ┌──────▼──────┐
   │ Model Update│
   └──────┬──────┘
          │
   ┌──────▼──────┐
   │ User UI Out │
   └─────────────┘
Figure 1: SSMT System Flow Diagram
In the training phase, the scorer generates rewards and is periodically retrained/calibrated with human annotations. In the inference phase, the scorer performs real-time assessment of candidate outputs, and the reward aggregator can be used for policy fine-tuning in online learning scenarios, while the frontend provides users with immediate scoring and evidence chain views.
The scorer Sψ assigns each IO pair a multi-dimensional score vector s = [s_fact, s_rel, ...].
Dimension descriptions: s_fact (factuality) measures factual accuracy against retrieved evidence; s_rel (relevance) measures how well the output addresses the input; further dimensions cover logical coherence, verifiability (evidence strength and citation accuracy), and safety.
The scorer uses a hybrid structure:
Automated Sub-scorers: Discriminative/regression models based on retrieval-augmented features (embedding similarity, QA consistency, perplexity) provide initial scores for large batches.
Human/Expert Review Layer: Manual annotation of critical samples or outputs where the scorer is uncertain, used for calibration and retraining.
Evidence Chain Module: Performs retrieval (document retriever) for factuality and verifiability dimensions to assess evidence strength and citation accuracy.
Safety Detector: An independent model predicts risks of harmful/biased content and incorporates it into the safety dimension score.
Scorer outputs undergo probability calibration (e.g., temperature scaling) to ensure interpretability and consistency of confidence.
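As a concrete illustration, combining the automated sub-scores with optional manual calibration and temperature scaling might look like the sketch below; the logistic form, function names, and temperature value are illustrative assumptions, not a prescribed implementation.

```python
import math

def temperature_scale(logit, T):
    """Calibrate a raw sub-scorer logit via temperature scaling: sigmoid(logit / T)."""
    return 1.0 / (1.0 + math.exp(-logit / T))

def hybrid_score(auto_logits, human_score=None, temperature=1.5):
    """Combine automated sub-scorer logits into calibrated per-dimension scores.

    auto_logits: dict mapping dimension name -> raw logit from an automated
                 sub-scorer (embedding similarity, QA consistency, ...).
    human_score: optional dict of manually calibrated scores that override
                 the automated ones for critical or uncertain samples.
    """
    scores = {dim: temperature_scale(z, temperature)
              for dim, z in auto_logits.items()}
    if human_score:
        scores.update(human_score)  # manual calibration takes priority
    return scores

scores = hybrid_score({"fact": 2.0, "rel": 0.5}, human_score={"safety": 1.0})
# scores["fact"] ≈ 0.79 (sigmoid(2.0 / 1.5)); "safety" comes from the human layer
```

In practice the temperature would be fitted on a held-out set of human-annotated pairs rather than fixed by hand.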
The reward aggregator maps the multi-dimensional score vector s to a scalar training reward r = f(s). A simple form is a weighted sum, r = Σ_i w_i · s_i, optionally extended with dimension-specific penalty terms.
Example: Factuality Threshold Penalty
r = Σ_i w_i · s_i − λ · 1[s_fact < τ]
Where λ > 0 is the penalty strength, τ is the factuality threshold, and 1[·] is the indicator function: any output whose factuality score falls below τ incurs a fixed penalty on top of the weighted sum.
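A weighted-sum aggregation with a factuality threshold penalty can be sketched as follows; the weight values, threshold, and penalty constant are illustrative assumptions.

```python
def aggregate_reward(scores, weights, fact_threshold=0.7, penalty=0.5):
    """Map a multi-dimensional score dict to a scalar training reward.

    Reward = weighted sum of dimension scores, minus a fixed penalty
    when the factuality score falls below the threshold.
    """
    r = sum(weights[d] * scores[d] for d in weights)
    if scores.get("fact", 1.0) < fact_threshold:
        r -= penalty  # factuality threshold penalty
    return r

weights = {"fact": 0.5, "rel": 0.3, "safety": 0.2}
good = aggregate_reward({"fact": 0.9, "rel": 0.8, "safety": 1.0}, weights)
bad = aggregate_reward({"fact": 0.4, "rel": 0.9, "safety": 1.0}, weights)
# good = 0.45 + 0.24 + 0.20 = 0.89; bad = 0.20 + 0.27 + 0.20 - 0.5 = 0.17
```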
Define a task set T = {t_1, ..., t_m} with per-task average score s̄_j and task weight ω_j (set, e.g., by task difficulty or business priority). The task-set score is then the weighted average S_T = Σ_j ω_j · s̄_j / Σ_j ω_j.
The set-weighting mechanism allows the system to allocate resources at a global level (e.g., automatically calling more expensive evidence verification for low-scoring tasks) and supports users in setting task-level "acceptability thresholds."
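The task-set weighted average is a one-liner in practice; the weights below (doubling a hypothetical high-risk task) are purely illustrative.

```python
def task_set_score(task_scores, task_weights):
    """Weighted average of per-task scores: S_T = sum(w_j * s_j) / sum(w_j)."""
    total_w = sum(task_weights)
    return sum(w * s for w, s in zip(task_weights, task_scores)) / total_w

# Three tasks; a hypothetical high-risk (e.g. medical QA) task gets double weight.
S_T = task_set_score([0.9, 0.6, 0.8], [1.0, 2.0, 1.0])
# (0.9 + 1.2 + 0.8) / 4 = 0.725
```

A low S_T could then trigger the global-level actions described above, such as routing the whole task set to more expensive evidence verification.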
To achieve transparency and controllability, the frontend should display rich yet easy-to-understand scoring information and interactive controls:
Multi-dimensional Score Bars / Radar Charts for each output (collapsible);
Task Average Score, Distribution Histograms, and Low/High Score Examples;
Quick Filter Controls: "Only show results with score ≥ τ" toggles and adjustable threshold sliders;
Evidence Viewer: Click to expand snippets of retrieved supporting evidence and source links (if available);
Action Buttons: Adopt / Discard / Submit Feedback (for training backflow);
Uncertainty Prompts: Visual alerts for low-confidence scores or samples with high scorer uncertainty.
Interface Example (ASCII):
Question: How to reduce model hallucinations?
┌──────────────────────────────────────────────────────────────────────────┐
│ Ans 1: By introducing RAG models...     │ Factuality 0.92 │ Safety 1.00 │
│ Ans 2: Models should randomly sample... │ Factuality 0.45 │ Safety 0.80 │
│ Ans 3: Add an output filtering layer... │ Factuality 0.83 │ Safety 0.95 │
└──────────────────────────────────────────────────────────────────────────┘
Average Score: 0.86 (High Quality)
Figure 2: Schematic of Scoring Visualization and Filtering Interface
During the inference phase, the model must not only generate text answers but also output scoring information alongside them, so that users can immediately evaluate and filter the results. We structure the output-phase scoring system into three elements: Local Score, Global Score, and Confidence Threshold.
Local scoring targets sub-output units (e.g., paragraphs, individual statements in an answer, reasoning steps, or each candidate answer generated). Objectives include:
Supporting users in identifying unreliable or verification-required segments at a microscopic level;
Supporting the model in step-by-step checking and correction during multi-step reasoning/Chain-of-Thought.
Local scores can be based on the same multi-dimensional metrics (factuality, relevance, etc.) but targeted at finer-grained inputs (clauses, assertions). For Chain-of-Thought scenarios, local scoring helps locate erroneous steps for local re-generation or enhanced evidence retrieval.
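Local scoring over sub-units can be sketched as below; the sentence splitter and the per-sentence scorer stub are illustrative assumptions standing in for real sub-scorers.

```python
import re

def split_units(answer):
    """Split an answer into sentence-level units for local scoring."""
    return [u.strip() for u in re.split(r"(?<=[.!?])\s+", answer) if u.strip()]

def local_scores(answer, score_unit):
    """Score each sub-unit; score_unit is any per-unit scorer (a stub here)."""
    return [(u, score_unit(u)) for u in split_units(answer)]

def stub(u):
    # Toy heuristic standing in for a real factuality sub-scorer.
    return 0.4 if "always" in u else 0.9

for unit, s in local_scores("RAG helps. It always works.", stub):
    print(f"{s:.1f}  {unit}")
```

In a Chain-of-Thought setting, the same loop would run over reasoning steps instead of sentences, and a low-scoring step would trigger local re-generation or extra evidence retrieval.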
The global score provides a comprehensive evaluation of the entire output (e.g., a complete answer or a set of generated paragraphs), usually through a weighted aggregation of local scores or a direct judgment of the combined text by the scorer. Global scores are used to:
Quickly display overall quality and credibility to the user;
Compare against task-set thresholds to decide whether to trigger auto-regeneration or manual review.
Formally, the global score can be defined as a weighted aggregation of local scores:
S_global = Σ_k β_k · s_k^(local) / Σ_k β_k
Where s_k^(local) is the local score of the k-th sub-unit and β_k is its aggregation weight (e.g., proportional to unit length or importance); alternatively, the scorer can judge the combined text directly.
Confidence Thresholds allow users to set minimum acceptance standards for local or global scores (e.g., only display outputs whose global score exceeds a chosen threshold τ).
If an output's score falls below the set threshold, the system can:
Automatically trigger re-generation or call a stronger evidence retrieval process;
Label the output as "To be verified" and hand it over for manual review;
Hide the item in multi-candidate views, showing only those above the threshold.
The frontend should show the confidence distribution (e.g., histograms or box plots) and provide threshold sliders and sensitivity previews (showing how many outputs would be filtered at different levels) to help users balance quality and coverage.
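The threshold filtering and the sensitivity preview described above can be sketched as follows; the dict-based output representation and the "to_be_verified" status label are illustrative assumptions.

```python
def apply_threshold(outputs, tau):
    """Partition outputs by a global-score confidence threshold tau."""
    shown = [o for o in outputs if o["global_score"] >= tau]
    held = [o for o in outputs if o["global_score"] < tau]
    for o in held:
        o["status"] = "to_be_verified"  # e.g. re-generate or route to review
    return shown, held

def sensitivity_preview(outputs, taus):
    """For each candidate threshold, count how many outputs would survive."""
    return {t: sum(o["global_score"] >= t for o in outputs) for t in taus}

outs = [{"global_score": s} for s in (0.92, 0.45, 0.83)]
shown, held = apply_threshold(outs, 0.8)          # 2 shown, 1 held back
preview = sensitivity_preview(outs, [0.5, 0.8, 0.9])
# preview = {0.5: 2, 0.8: 2, 0.9: 1}
```

The preview dict is exactly what a threshold slider would render, letting users see the quality/coverage trade-off before committing to a cutoff.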
To verify the effectiveness of SSMT, evaluation is recommended on the following tasks/datasets:
Open-Domain QA (OpenQA): HotpotQA, NaturalQuestions;
Text Summarization: CNN/DailyMail, XSum (balancing creativity vs. factuality);
Multi-step Reasoning / Logical Reasoning: StrategyQA, LogiQA;
Industry Subsets (Optional): Medical QA, Legal QA (to assess high-risk scenarios).
Baseline A (RLHF): Traditional RLHF model trained on human preferences;
Baseline B (RLHF + Post-filter): RLHF model followed by an automated scoring filter, without embedding scores in training;
Proposed (SSMT): Scores embedded in training + task-set weighting + output-phase scoring and threshold control.
Hallucination Rate: Proportion of errors verified manually or via QA-based auto-validation.
Factuality Score (Avg. s_fact ↑): Average factuality score across all outputs, cross-checked against human judgment.
User Adoption Rate: The proportion of outputs accepted by real or simulated users.
Scoring Calibration: Spearman/Kendall correlation coefficients between scorer outputs and human judgment.
Interaction Cost and Latency: Average delay and computational overhead introduced by scoring and evidence retrieval.
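The calibration metric, Spearman correlation between scorer outputs and human judgments, can be computed as the Pearson correlation of ranks; the five-sample data below is a made-up illustration.

```python
def rank(values):
    """Average 1-based ranks, handling ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average position of the tied group, 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman(x, y):
    """Spearman correlation = Pearson correlation of the rank vectors."""
    rx, ry = rank(x), rank(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx) ** 0.5
    vy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (vx * vy)

# Scorer outputs vs. human judgments on five samples (perfectly monotone):
rho = spearman([0.9, 0.7, 0.4, 0.8, 0.2], [5, 4, 2, 4.5, 1])
# rho = 1.0
```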
(The following are simulated/example data for illustrative purposes; real effects require online A/B testing)
| Method | Hallucination Rate ↓ | Avg. Factuality ↑ | Adoption Rate ↑ | Calibration (Spearman) ↑ |
|---|---|---|---|---|
| RLHF | 0.18 | 0.72 | 0.65 | 0.61 |
| RLHF+Post | 0.14 | 0.78 | 0.69 | 0.73 |
| SSMT | 0.11 | 0.83 | 0.77 | 0.84 |
Ablation Study Points:
Removing the verifiability dimension (no evidence-chain scoring);
Canceling human calibration (fully automated scoring only);
Using fixed (non-adaptive) weights instead of adaptive task-set weighting.
Layered Design (Offline Training / Online Inference): High-performance scorers (including human calibration) are used in the training phase, while lightweight scorers, evidence caching, and hierarchical retrieval are used in the online phase to reduce latency.
Progressive Rollout: Start by displaying scores as "suggestions" or "labels" without immediately affecting generation; once stable, enable automated threshold filtering or re-generation.
Feedback Loop: User adoption/rejection and manual review data flow back to fine-tune scorers and strategies (Continuous Learning).
A/B Testing and Monitoring: Perform online A/B testing across different weight and threshold configurations, and continuously monitor hallucination rate, adoption rate, and latency.
Adversarial Robustness: Introduce adversarial training and anomaly detection to prevent models from generating text that "cheats the scorer" (e.g., creating statements that look high-scoring but are false).
Explainability and Compliance: Record score sources (auto vs. human), evidence links, and calibration processes to facilitate audits and compliance reviews.
Scorer Bias: If scorer training data is unbalanced or biased toward certain viewpoints, training rewards may amplify this bias. Diverse annotation and fairness testing are required.
Over-conservatism: High weights on factuality might result in overly conservative generation, sacrificing creativity. Weights and thresholds should be adjusted per scenario.
User Over-trust: Users may over-rely on scores (especially high ones). The frontend must clearly state the score source, confidence levels, and uncertainties.
Computational Cost: Item-by-item scoring and online evidence retrieval increase resource overhead. Engineering optimizations (caching, lightweight models, async retrieval) are needed for scalability.
Legal and Privacy: Evidence retrieval may touch copyrighted or private data; deployment must comply with laws and platform policies.
This paper proposes and systematizes the Scoring-System-Trained Model (SSMT). By scoring every IO pair across multiple dimensions during both training and inference and embedding these scores into reward learning and task-set weighting, we establish a closed loop from model self-assessment to visual user filtering. Results show that this paradigm holds significant potential for reducing hallucination rates, enhancing factuality, and increasing controllability, providing agents with task-level quality awareness.
Future research directions include:
Automated Learning of Weights: Learning the reward weights w_i and task-set weights ω_j automatically rather than hand-tuning them;
Multimodal Scorers: Extending the scoring system to text-image-audio multimodal scenarios;
Long-term Online Learning and Memory: Accumulating user feedback as model memory to improve continuous performance;
Adversarial Robustness: Further research into adversarial training to prevent the model from "gaming" the scorer;
Real-world Production Testing: Large-scale A/B testing with real user groups to evaluate long-term behavior and commercial viability.
Christiano, P. F., et al. (2017). Deep reinforcement learning from human preferences. NeurIPS.
Stiennon, N., et al. (2020). Learning to summarize with human feedback. NeurIPS.
Ouyang, L., et al. (2022). Training language models to follow instructions with human feedback. arXiv:2203.02155.
Zhang, T., et al. (2021). Evaluating factual consistency in generation via QA-based metrics. ACL.
Ribeiro, M. T., et al. (2020). Beyond accuracy: Behavioral testing of NLP models with CheckList. ACL.