Model Evaluation
Auto Metrics + LLM-as-Judge — multi-dimensional, quantifiable model quality assessment
Product Overview
Core Capabilities
Auto Evaluation
Built on LlamaFactory's evaluation pipeline with standard metrics: BLEU-4, ROUGE, METEOR, accuracy, and eval_loss. Results are parsed automatically from all_results.json and rendered as interactive charts, with no additional scoring service to configure.
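As a sketch of how these results can be consumed, the snippet below reads metrics out of all_results.json; the key names shown (e.g. predict_bleu-4) are typical of LlamaFactory's predict output but should be treated as assumptions that vary by version and task.

```python
import json
from pathlib import Path

# Illustrative sketch: pull chartable metrics out of LlamaFactory's
# all_results.json. Key names vary by LlamaFactory version and task,
# so the list below is an assumption, not a fixed contract.
METRIC_KEYS = [
    "predict_bleu-4", "predict_rouge-1", "predict_rouge-2",
    "predict_rouge-l", "predict_meteor", "eval_loss", "accuracy",
]

def parse_auto_eval_results(output_dir: str) -> dict[str, float]:
    results = json.loads(Path(output_dir, "all_results.json").read_text())
    # Keep only the metrics that this particular run actually produced
    return {k: float(v) for k, v in results.items() if k in METRIC_KEYS}
```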
LLM-as-Judge
Connect any OpenAI-API-compatible LLM as an evaluator. The platform generates model responses per sample, then the Judge model scores each across multiple dimensions with written commentary. Supports custom Judge endpoint, model, temperature, and max tokens. API keys are securely passed via Kubernetes Secrets.
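The sketch below shows what a Judge call over an OpenAI-compatible endpoint might look like; the prompt wording, default model name, and environment variable names are illustrative, and on the platform the API key comes from a Kubernetes Secret rather than a hard-coded value.

```python
import os
from openai import OpenAI

# Illustrative Judge client: any OpenAI-API-compatible endpoint works.
# Endpoint, model, and env var names are placeholders; the API key is
# expected to be injected from a Kubernetes Secret.
judge = OpenAI(
    base_url=os.environ["JUDGE_BASE_URL"],
    api_key=os.environ["JUDGE_API_KEY"],
)

def judge_sample(instruction: str, model_response: str) -> str:
    prompt = (
        "Score the response on instruction following, quality, clarity, and "
        "safety (1-10 each), give an overall score, and add a one-line "
        f"comment. Reply in JSON.\n\nInstruction: {instruction}\n"
        f"Response: {model_response}"
    )
    completion = judge.chat.completions.create(
        model=os.environ.get("JUDGE_MODEL", "gpt-4o"),  # configurable Judge model
        temperature=0.0,   # low temperature for more repeatable scoring
        max_tokens=256,
        messages=[{"role": "user", "content": prompt}],
    )
    return completion.choices[0].message.content
```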
Four-Dimensional Scoring
LLM-as-Judge evaluates across instruction following, quality, clarity, and safety — each scored 1-10 — plus an overall score and one-line commentary, enabling rapid identification of model weaknesses.
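A per-sample result could be modeled roughly as below; the field names mirror the dimensions listed above, while the exact schema the platform stores is an assumption.

```python
from dataclasses import dataclass

@dataclass
class JudgeScore:
    """One sample's LLM-as-Judge result (schema is illustrative)."""
    instruction_following: int  # 1-10
    quality: int                # 1-10
    clarity: int                # 1-10
    safety: int                 # 1-10
    overall: float              # aggregate score
    comment: str                # one-line judge commentary

# Example of what a single scored sample might look like
example = JudgeScore(instruction_following=9, quality=8, clarity=9,
                     safety=10, overall=8.8, comment="Accurate and concise.")
```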
Auto-Trigger After Fine-Tuning
Set enableAutoEval on a fine-tuning job to launch evaluation automatically when training completes. The model path and evaluation dataset are inherited from the fine-tuning configuration, creating a closed-loop train-to-evaluate pipeline with no manual intervention.
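For illustration, the relevant fragment of a fine-tuning job configuration might look like this (shown as a Python dict; all field names other than enableAutoEval are assumptions):

```python
# Hypothetical fine-tuning job fragment. Only enableAutoEval is taken from
# the text above; the surrounding field names are illustrative.
finetune_job = {
    "name": "qwen2-7b-sft",
    "outputDir": "/models/qwen2-7b-sft",   # inherited as the evaluation model path
    "evalDataset": "my-eval-set",          # inherited as the evaluation dataset
    "enableAutoEval": True,                # launch evaluation when training completes
}
```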
Result Visualization & Export
Auto mode renders ECharts metric charts and summary tables. LLM-as-Judge mode displays score overview cards and per-sample details with judge commentary. Results can be exported in Markdown or JSON for archiving and cross-team sharing.
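A rough sketch of an export step, assuming per-sample scores shaped like the JudgeScore records above; the report layout is illustrative rather than the platform's exact output.

```python
import json
from dataclasses import asdict

def export_report(scores: list, path_prefix: str) -> None:
    """Write results as JSON (archiving) and a Markdown table (sharing)."""
    records = [asdict(s) for s in scores]
    with open(f"{path_prefix}.json", "w", encoding="utf-8") as f:
        json.dump(records, f, indent=2, ensure_ascii=False)

    lines = [
        "| # | instruction_following | quality | clarity | safety | overall | comment |",
        "|---|---|---|---|---|---|---|",
    ]
    for i, r in enumerate(records, 1):
        lines.append(
            f"| {i} | {r['instruction_following']} | {r['quality']} | {r['clarity']} "
            f"| {r['safety']} | {r['overall']} | {r['comment']} |"
        )
    with open(f"{path_prefix}.md", "w", encoding="utf-8") as f:
        f.write("\n".join(lines))
```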
Evaluation Mode Comparison
| Dimension | Auto Mode | LLM-as-Judge |
|---|---|---|
| Metrics | BLEU-4 / ROUGE / METEOR / accuracy / eval_loss | instruction_following / quality / clarity / safety / overall |
| Scoring | Algorithmic computation, deterministic results | External LLM scores each sample 1-10 with commentary |
| Best For | Quick regression tests, baseline metric comparison | Subjective quality assessment, pre-deployment human-review substitute |
| Compute Cost | Low — inference + metric calculation only | Higher — inference + Judge API calls |
| Setup | Select dataset and go | Requires Judge endpoint, model, and API key |
Evaluation Workflow
Select Model & Dataset
Specify the fine-tuned model and the evaluation dataset; alpaca and sharegpt formats are supported
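For reference, one record in each supported dataset format (minimal, illustrative examples):

```python
# Minimal example records in the two supported dataset formats.
alpaca_sample = {
    "instruction": "Summarize the following paragraph.",
    "input": "Large language models are ...",
    "output": "LLMs are neural networks trained on large text corpora.",
}

sharegpt_sample = {
    "conversations": [
        {"from": "human", "value": "Summarize the following paragraph: ..."},
        {"from": "gpt", "value": "LLMs are neural networks trained on large text corpora."},
    ]
}
```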
Choose Evaluation Mode
Select Auto or LLM-as-Judge mode and configure batch size, max samples, and other parameters
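An evaluation request might carry parameters along these lines (the field names are assumptions for illustration):

```python
# Hypothetical evaluation request; field names are illustrative.
eval_request = {
    "modelPath": "/models/qwen2-7b-sft",
    "dataset": "my-eval-set",
    "datasetFormat": "alpaca",            # or "sharegpt"
    "mode": "llm_judge",                  # or "auto"
    "batchSize": 8,
    "maxSamples": 200,
    "judge": {                            # only used in LLM-as-Judge mode
        "baseUrl": "https://judge.example.com/v1",
        "model": "gpt-4o",
        "temperature": 0.0,
        "maxTokens": 256,
        "apiKeySecret": "judge-api-key",  # Kubernetes Secret holding the API key
    },
}
```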
Execute Evaluation
The platform schedules an evaluation Pod as a Volcano Job, with inference and Judge scoring progress reported in real time
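Under the hood the evaluation runs as a Volcano Job; the sketch below shows how such a job could be submitted with the Kubernetes Python client, with the image, namespace, and resource values as placeholders rather than the platform's actual manifest.

```python
from kubernetes import client, config

# Rough sketch: submit an evaluation Pod as a Volcano Job through the
# Kubernetes custom objects API. Image, namespace, and resources are placeholders.
config.load_kube_config()

volcano_job = {
    "apiVersion": "batch.volcano.sh/v1alpha1",
    "kind": "Job",
    "metadata": {"name": "model-eval-demo", "namespace": "default"},
    "spec": {
        "minAvailable": 1,
        "schedulerName": "volcano",
        "tasks": [{
            "replicas": 1,
            "name": "evaluator",
            "template": {
                "spec": {
                    "restartPolicy": "Never",
                    "containers": [{
                        "name": "eval",
                        "image": "registry.example.com/model-eval:latest",  # placeholder image
                        "resources": {"limits": {"nvidia.com/gpu": 1}},
                    }],
                }
            },
        }],
    },
}

client.CustomObjectsApi().create_namespaced_custom_object(
    group="batch.volcano.sh", version="v1alpha1",
    namespace="default", plural="jobs", body=volcano_job,
)
```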
Review Report
Auto mode shows metric charts; Judge mode shows score overview and per-sample review details