A Scoring-System-Based Training Paradigm for Reducing Hallucination and Enhancing Controllability in Large Language Models
Large Language Models (LLMs) have demonstrated powerful capabilities in generative tasks, but hallucination and output uncontrollability remain critical bottlenecks hindering their broad and trustworthy application. This paper proposes the Scoring-System-Trained Model (SSMT): a paradigm that performs multi-dimensional scoring on every input-output (IO) pair throughout training and inference. These scores are embedded into reward learning and task-set weighting strategies, while the frontend displays scores and evidence to users to support filtering and threshold control. We design a hybrid scorer (automated discrimination + manual calibration + evidence verification), a reward aggregator, task-set weighted averaging, and local/global scoring mechanisms with confidence thresholds for the output stage. Simulation experiments and ablation analysis show that SSMT substantially reduces hallucination rates, improves factuality and user adoption, and provides agents with task-level quality awareness and control. Finally, we discuss engineering implementation, risks, and future research directions.
Keywords: Large Language Models; Scoring System; Hallucination; Reward Learning; Controllable Generation; Agents
Large Language Models (such as GPT, Claude, Gemini, etc.) rely on massive corpora for training and possess strong natural language generation capabilities. However, they still face two key issues:
Hallucination — The model generates content that is grammatically correct but factually incorrect;
Output Uncontrollability — Users find it difficult to judge and filter the credibility and quality of the generated results.
Current mainstream alignment methods (such as RLHF, Reinforcement Learning from Human Feedback) can guide model output to align with human preferences to some extent, but they cannot effectively measure the actual quality of each output or establish a scoring system for a task as a whole. To this end, this paper proposes a new training and output mechanism: the Scoring-System-Trained Model (SSMT).
The core philosophy of this system is:
"Allow the model to understand and quantify 'quality' throughout the entire process of learning and generation."
By introducing IO pair scoring, a weighted average reward mechanism, and visual scoring feedback for users, SSMT constrains output quality during the training phase and provides filterable, controllable multi-level quality indicators during the application phase, significantly reducing hallucination and improving user trust.
The specific contributions of this paper are as follows:
Proposes a systematic framework (SSMT) that explicitly embeds multi-dimensional IO scoring into training rewards;
Designs a hybrid scorer (automated + manual calibration + evidence verification) and a calibratable scoring output strategy;
Introduces task-set weighted averaging and local/global scoring mechanisms at the output stage with confidence thresholds to support user adoption/rejection decisions;
Demonstrates improvements in hallucination rate, factuality, and user adoption through multi-task simulation experiments, providing ablation analysis and engineering recommendations.
LLM hallucination problems typically stem from noise in training data, language modeling objectives (maximum likelihood) that guide high fluency without guaranteeing factual accuracy, and a lack of real-time evidence constraints. Common mitigation methods include Retrieval-Augmented Generation (RAG), post-processing filtering, and alignment methods like RLHF. While these methods bring improvements, they are often either high-cost (requiring massive human feedback) or offline processes, making it difficult to provide fine-grained, interactive quality assessment and control.
RLHF builds reward models by learning human preferences, but this reward is often a holistic preference rather than an item-by-item quality score. Automated scoring metrics (BLEU, ROUGE, BERTScore, QA-based metrics, etc.) provide scalable evaluation but have shortfalls in factuality, logical coherence, and verifiability. Recent work has attempted to combine QA verification and evidence retrieval with scoring to assess factuality, but these have not yet been systematically embedded into the training-inference closed loop and presented to users.
Controllable generation research focuses on generation conditions (style, emotion, length) or setting confidence thresholds after output to reject unreliable results. SSMT systematizes these ideas further: adding fine-grained scoring and penalties during training, providing global control at the task-set level, and offering local/global scores with confidence thresholds in the frontend for user operation.
The SSMT system consists of the following four core modules (as shown in Figure 1):
Generator (πθ): The base LLM that produces candidate outputs y for each input x;
Scorer (Sψ): A hybrid module that assigns each IO pair a multi-dimensional score vector s = [s_fact, s_rel, ...];
Reward Aggregator: Maps the scoring vector s to a scalar reward r = f(s) used in training updates;
User Frontend: Displays multi-dimensional scores, evidence snippets, and task-level overviews during the inference phase, providing threshold filtering, adoption/rejection, and feedback functions.
System Flow (Schematic):
┌────────────────────┐
│  Input Task Set T  │
└─────────┬──────────┘
          │
   ┌──────▼──────┐
   │ Generator πθ│
   └──────┬──────┘
          │ Output y
   ┌──────▼──────┐
   │  Scorer Sψ  │ → Multi-dim scores s = [s_fact, s_rel, ...]
   └──────┬──────┘
          │
   ┌──────▼──────┐
   │ Reward Aggr │ → Reward r = f(s)
   └──────┬──────┘
          │
   ┌──────▼──────┐
   │ Model Update│
   └──────┬──────┘
          │
   ┌──────▼──────┐
   │ User UI Out │
   └─────────────┘
Figure 1: SSMT System Flow Diagram
In the training phase, the scorer generates rewards and is periodically retrained/calibrated with human annotations. In the inference phase, the scorer performs real-time assessment of candidate outputs, and the reward aggregator can be used for policy fine-tuning in online learning scenarios, while the frontend provides users with immediate scoring and evidence chain views.
The scorer Sψ assigns each IO pair a multi-dimensional score vector s = [s_fact, s_rel, ...].
Dimension descriptions: s_fact (factuality) measures factual accuracy against retrieved evidence; s_rel (relevance) measures how well the output addresses the input; further dimensions cover logical coherence, verifiability (evidence strength and citation accuracy), and safety.
The scorer uses a hybrid structure:
Automated Sub-scorers: Discriminative/regression models based on retrieval-augmented features (embedding similarity, QA consistency, perplexity) provide initial scores for large batches.
Human/Expert Review Layer: Manual annotation of critical samples or outputs where the scorer is uncertain, used for calibration and retraining.
Evidence Chain Module: Performs retrieval (document retriever) for factuality and verifiability dimensions to assess evidence strength and citation accuracy.
Safety Detector: An independent model predicts risks of harmful/biased content and incorporates it into the safety dimension score.
Scorer outputs undergo probability calibration (e.g., temperature scaling) to ensure interpretability and consistency of confidence.
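As a concrete illustration, combining the automated sub-scores with optional manual calibration and temperature scaling might look like the sketch below; the logistic form, function names, and temperature value are illustrative assumptions, not a prescribed implementation.

```python
import math

def temperature_scale(logit, T):
    """Calibrate a raw sub-scorer logit via temperature scaling: sigmoid(logit / T)."""
    return 1.0 / (1.0 + math.exp(-logit / T))

def hybrid_score(auto_logits, human_score=None, temperature=1.5):
    """Combine automated sub-scorer logits into calibrated per-dimension scores.

    auto_logits: dict mapping dimension name -> raw logit from an automated
                 sub-scorer (embedding similarity, QA consistency, ...).
    human_score: optional dict of manually calibrated scores that override
                 the automated ones for critical or uncertain samples.
    """
    scores = {dim: temperature_scale(z, temperature)
              for dim, z in auto_logits.items()}
    if human_score:
        scores.update(human_score)  # manual calibration takes priority
    return scores

scores = hybrid_score({"fact": 2.0, "rel": 0.5}, human_score={"safety": 1.0})
# scores["fact"] ≈ 0.79 (sigmoid(2.0 / 1.5)); "safety" comes from the human layer
```

In practice the temperature would be fitted on a held-out set of human-annotated pairs rather than fixed by hand.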
The reward aggregator maps the multi-dimensional score vector s to a scalar training reward r = f(s). A simple form is a weighted sum, r = Σ_i w_i · s_i, optionally extended with dimension-specific penalty terms.
Example: Factuality Threshold Penalty
r = Σ_i w_i · s_i − λ · 1[s_fact < τ]
Where λ > 0 is the penalty strength, τ is the factuality threshold, and 1[·] is the indicator function: any output whose factuality score falls below τ incurs a fixed penalty on top of the weighted sum.
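A weighted-sum aggregation with a factuality threshold penalty can be sketched as follows; the weight values, threshold, and penalty constant are illustrative assumptions.

```python
def aggregate_reward(scores, weights, fact_threshold=0.7, penalty=0.5):
    """Map a multi-dimensional score dict to a scalar training reward.

    Reward = weighted sum of dimension scores, minus a fixed penalty
    when the factuality score falls below the threshold.
    """
    r = sum(weights[d] * scores[d] for d in weights)
    if scores.get("fact", 1.0) < fact_threshold:
        r -= penalty  # factuality threshold penalty
    return r

weights = {"fact": 0.5, "rel": 0.3, "safety": 0.2}
good = aggregate_reward({"fact": 0.9, "rel": 0.8, "safety": 1.0}, weights)
bad = aggregate_reward({"fact": 0.4, "rel": 0.9, "safety": 1.0}, weights)
# good = 0.45 + 0.24 + 0.20 = 0.89; bad = 0.20 + 0.27 + 0.20 - 0.5 = 0.17
```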
Define a task set T = {t_1, ..., t_m} with per-task average score s̄_j and task weight ω_j (set, e.g., by task difficulty or business priority). The task-set score is then the weighted average S_T = Σ_j ω_j · s̄_j / Σ_j ω_j.
The set-weighting mechanism allows the system to allocate resources at a global level (e.g., automatically calling more expensive evidence verification for low-scoring tasks) and supports users in setting task-level "acceptability thresholds."
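The task-set weighted average is a one-liner in practice; the weights below (doubling a hypothetical high-risk task) are purely illustrative.

```python
def task_set_score(task_scores, task_weights):
    """Weighted average of per-task scores: S_T = sum(w_j * s_j) / sum(w_j)."""
    total_w = sum(task_weights)
    return sum(w * s for w, s in zip(task_weights, task_scores)) / total_w

# Three tasks; a hypothetical high-risk (e.g. medical QA) task gets double weight.
S_T = task_set_score([0.9, 0.6, 0.8], [1.0, 2.0, 1.0])
# (0.9 + 1.2 + 0.8) / 4 = 0.725
```

A low S_T could then trigger the global-level actions described above, such as routing the whole task set to more expensive evidence verification.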
To achieve transparency and controllability, the frontend should display rich yet easy-to-understand scoring information and interactive controls:
Multi-dimensional Score Bars / Radar Charts for each output (collapsible);
Task Average Score, Distribution Histograms, and Low/High Score Examples;
Quick Filter Controls: "Only show results with score ≥ τ" toggles and adjustable threshold sliders;
Evidence Viewer: Click to expand snippets of retrieved supporting evidence and source links (if available);
Action Buttons: Adopt / Discard / Submit Feedback (for training backflow);
Uncertainty Prompts: Visual alerts for low-confidence scores or samples with high scorer uncertainty.
Interface Example (ASCII):
Question: How to reduce model hallucinations?
┌──────────────────────────────────────────────────────────────────────────┐
│ Ans 1: By introducing RAG models...     │ Factuality 0.92 │ Safety 1.00 │
│ Ans 2: Models should randomly sample... │ Factuality 0.45 │ Safety 0.80 │
│ Ans 3: Add an output filtering layer... │ Factuality 0.83 │ Safety 0.95 │
└──────────────────────────────────────────────────────────────────────────┘
Average Score: 0.86 (High Quality)
Figure 2: Schematic of Scoring Visualization and Filtering Interface
During the inference phase, the model must not only generate text answers but also output scoring information alongside them, so that users can immediately evaluate and filter the results. We structure the output-phase scoring system into three elements: Local Score, Global Score, and Confidence Threshold.
Local scoring targets sub-output units (e.g., paragraphs, individual statements in an answer, reasoning steps, or each candidate answer generated). Objectives include:
Supporting users in identifying unreliable or verification-required segments at a microscopic level;
Supporting the model in step-by-step checking and correction during multi-step reasoning/Chain-of-Thought.
Local scores can be based on the same multi-dimensional metrics (factuality, relevance, etc.) but targeted at finer-grained inputs (clauses, assertions). For Chain-of-Thought scenarios, local scoring helps locate erroneous steps for local re-generation or enhanced evidence retrieval.
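Local scoring over sub-units can be sketched as below; the sentence splitter and the per-sentence scorer stub are illustrative assumptions standing in for real sub-scorers.

```python
import re

def split_units(answer):
    """Split an answer into sentence-level units for local scoring."""
    return [u.strip() for u in re.split(r"(?<=[.!?])\s+", answer) if u.strip()]

def local_scores(answer, score_unit):
    """Score each sub-unit; score_unit is any per-unit scorer (a stub here)."""
    return [(u, score_unit(u)) for u in split_units(answer)]

def stub(u):
    # Toy heuristic standing in for a real factuality sub-scorer.
    return 0.4 if "always" in u else 0.9

for unit, s in local_scores("RAG helps. It always works.", stub):
    print(f"{s:.1f}  {unit}")
```

In a Chain-of-Thought setting, the same loop would run over reasoning steps instead of sentences, and a low-scoring step would trigger local re-generation or extra evidence retrieval.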
The global score provides a comprehensive evaluation of the entire output (e.g., a complete answer or a set of generated paragraphs), usually through a weighted aggregation of local scores or a direct judgment of the combined text by the scorer. Global scores are used to:
Quickly display overall quality and credibility to the user;
Compare against task-set thresholds to decide whether to trigger auto-regeneration or manual review.
Formally, the global score can be defined as a weighted aggregation of local scores:
S_global = Σ_k β_k · s_k^(local) / Σ_k β_k
Where s_k^(local) is the local score of the k-th sub-unit and β_k is its aggregation weight (e.g., proportional to unit length or importance); alternatively, the scorer can judge the combined text directly.
Confidence Thresholds allow users to set minimum acceptance standards for local or global scores (e.g., only display outputs whose global score exceeds a chosen threshold τ).
If an output's score falls below the set threshold, the system can:
Automatically trigger re-generation or call a stronger evidence retrieval process;
Label the output as "To be verified" and hand it over for manual review;
Hide the item in multi-candidate views, showing only those above the threshold.
The frontend should show the confidence distribution (e.g., histograms or box plots) and provide threshold sliders and sensitivity previews (showing how many outputs would be filtered at different levels) to help users balance quality and coverage.
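The threshold filtering and the sensitivity preview described above can be sketched as follows; the dict-based output representation and the "to_be_verified" status label are illustrative assumptions.

```python
def apply_threshold(outputs, tau):
    """Partition outputs by a global-score confidence threshold tau."""
    shown = [o for o in outputs if o["global_score"] >= tau]
    held = [o for o in outputs if o["global_score"] < tau]
    for o in held:
        o["status"] = "to_be_verified"  # e.g. re-generate or route to review
    return shown, held

def sensitivity_preview(outputs, taus):
    """For each candidate threshold, count how many outputs would survive."""
    return {t: sum(o["global_score"] >= t for o in outputs) for t in taus}

outs = [{"global_score": s} for s in (0.92, 0.45, 0.83)]
shown, held = apply_threshold(outs, 0.8)          # 2 shown, 1 held back
preview = sensitivity_preview(outs, [0.5, 0.8, 0.9])
# preview = {0.5: 2, 0.8: 2, 0.9: 1}
```

The preview dict is exactly what a threshold slider would render, letting users see the quality/coverage trade-off before committing to a cutoff.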
To verify the effectiveness of SSMT, evaluation is recommended on the following tasks/datasets:
Open-Domain QA (OpenQA): HotpotQA, NaturalQuestions;
Text Summarization: CNN/DailyMail, XSum (balancing creativity vs. factuality);
Multi-step Reasoning / Logical Reasoning: StrategyQA, LogiQA;
Industry Subsets (Optional): Medical QA, Legal QA (to assess high-risk scenarios).
Baseline A (RLHF): Traditional RLHF model trained on human preferences;
Baseline B (RLHF + Post-filter): RLHF model followed by an automated scoring filter, without embedding scores in training;
Proposed (SSMT): Scores embedded in training + task-set weighting + output-phase scoring and threshold control.
Hallucination Rate: Proportion of errors verified manually or via QA-based auto-validation.
Factuality Score (Avg. s_fact ↑): Average factuality score across all outputs, cross-checked against human judgment.
User Adoption Rate: The proportion of outputs accepted by real or simulated users.
Scoring Calibration: Spearman/Kendall correlation coefficients between scorer outputs and human judgment.
Interaction Cost and Latency: Average delay and computational overhead introduced by scoring and evidence retrieval.
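The calibration metric, Spearman correlation between scorer outputs and human judgments, can be computed as the Pearson correlation of ranks; the five-sample data below is a made-up illustration.

```python
def rank(values):
    """Average 1-based ranks, handling ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average position of the tied group, 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman(x, y):
    """Spearman correlation = Pearson correlation of the rank vectors."""
    rx, ry = rank(x), rank(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx) ** 0.5
    vy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (vx * vy)

# Scorer outputs vs. human judgments on five samples (perfectly monotone):
rho = spearman([0.9, 0.7, 0.4, 0.8, 0.2], [5, 4, 2, 4.5, 1])
# rho = 1.0
```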
(The following are simulated/example data for illustrative purposes; real effects require online A/B testing)
| Method | Hallucination Rate ↓ | Avg. Factuality ↑ | Adoption Rate ↑ | Calibration (Spearman) ↑ |
|---|---|---|---|---|
| RLHF | 0.18 | 0.72 | 0.65 | 0.61 |
| RLHF+Post | 0.14 | 0.78 | 0.69 | 0.73 |
| SSMT | 0.11 | 0.83 | 0.77 | 0.84 |
Ablation Study Points:
Removing the verifiability dimension (no evidence-chain scoring);
Canceling human calibration (fully automated scoring only);
Using fixed (non-adaptive) weights instead of adaptive task-set weighting.
Layered Design (Offline Training / Online Inference): High-performance scorers (including human calibration) are used in the training phase, while lightweight scorers, evidence caching, and hierarchical retrieval are used in the online phase to reduce latency.
Progressive Rollout: Start by displaying scores as "suggestions" or "labels" without immediately affecting generation; once stable, enable automated threshold filtering or re-generation.
Feedback Loop: User adoption/rejection and manual review data flow back to fine-tune scorers and strategies (Continuous Learning).
A/B Testing and Monitoring: Perform online A/B testing across different weight and threshold configurations, and continuously monitor hallucination rate, adoption rate, and latency.
Adversarial Robustness: Introduce adversarial training and anomaly detection to prevent models from generating text that "cheats the scorer" (e.g., creating statements that look high-scoring but are false).
Explainability and Compliance: Record score sources (auto vs. human), evidence links, and calibration processes to facilitate audits and compliance reviews.
Scorer Bias: If scorer training data is unbalanced or biased toward certain viewpoints, training rewards may amplify this bias. Diverse annotation and fairness testing are required.
Over-conservatism: High weights on factuality might result in overly conservative generation, sacrificing creativity. Weights and thresholds should be adjusted per scenario.
User Over-trust: Users may over-rely on scores (especially high ones). The frontend must clearly state the score source, confidence levels, and uncertainties.
Computational Cost: Item-by-item scoring and online evidence retrieval increase resource overhead. Engineering optimizations (caching, lightweight models, async retrieval) are needed for scalability.
Legal and Privacy: Evidence retrieval may touch copyrighted or private data; deployment must comply with laws and platform policies.
This paper proposes and systematizes the Scoring-System-Trained Model (SSMT). By scoring every IO pair across multiple dimensions during both training and inference and embedding these scores into reward learning and task-set weighting, we establish a closed loop from model self-assessment to visual user filtering. Results show that this paradigm holds significant potential for reducing hallucination rates, enhancing factuality, and increasing controllability, providing agents with task-level quality awareness.
Future research directions include:
Automated Learning of Weights: Learning the reward weights w_i and task-set weights ω_j automatically rather than hand-tuning them;
Multimodal Scorers: Extending the scoring system to text-image-audio multimodal scenarios;
Long-term Online Learning and Memory: Accumulating user feedback as model memory to improve continuous performance;
Adversarial Robustness: Further research into adversarial training to prevent the model from "gaming" the scorer;
Real-world Production Testing: Large-scale A/B testing with real user groups to evaluate long-term behavior and commercial viability.
Christiano, P. F., et al. (2017). Deep reinforcement learning from human preferences. NeurIPS.
Stiennon, N., et al. (2020). Learning to summarize with human feedback. NeurIPS.
Ouyang, L., et al. (2022). Training language models to follow instructions with human feedback. arXiv:2203.02155.
Zhang, T., et al. (2021). Evaluating factual consistency in generation via QA-based metrics. ACL.
Ribeiro, M. T., et al. (2020). Beyond accuracy: Behavioral testing of NLP models with CheckList. ACL.