
Model Evaluation

Auto Metrics + LLM-as-Judge — multi-dimensional, quantifiable model quality assessment

Product Overview

Two evaluation modes: Auto mode runs automated assessment using LlamaFactory's built-in metrics (BLEU-4, ROUGE, METEOR, accuracy, eval_loss), while LLM-as-Judge mode calls an external LLM to score model outputs across four dimensions — instruction following, quality, clarity, and safety — on a 1-10 scale. Evaluation jobs can be auto-triggered after fine-tuning or configured as standalone runs, with results visualized through charts and per-sample detail views.

Core Capabilities

Auto Evaluation

Built on LlamaFactory's evaluation pipeline with standard metrics — BLEU-4, ROUGE, METEOR, accuracy, and eval_loss. Results are auto-parsed from all_results.json and rendered as interactive charts with zero additional scoring service configuration.
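As a rough sketch, the parsing step could look like the snippet below: it reads an all_results.json produced by an Auto-mode run and keeps the numeric metrics. The file path and the exact metric key names are assumptions (they vary with the LlamaFactory version and task type), not something fixed by the platform.

```python
import json
from pathlib import Path

# Hypothetical output directory of an Auto-mode evaluation run.
RESULTS_FILE = Path("output/eval_job_001/all_results.json")

def load_auto_metrics(path: Path) -> dict[str, float]:
    """Read LlamaFactory's all_results.json and keep only numeric metrics."""
    with path.open(encoding="utf-8") as f:
        results = json.load(f)
    # Key names (e.g. predict_bleu-4, predict_rouge-l, eval_loss) depend on the
    # LlamaFactory version and task, so filter generically on value type.
    return {k: v for k, v in results.items() if isinstance(v, (int, float))}

if __name__ == "__main__":
    for name, value in sorted(load_auto_metrics(RESULTS_FILE).items()):
        print(f"{name:>24}: {value:.4f}")
```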

LLM-as-Judge

Connect any OpenAI-API-compatible LLM as an evaluator. The platform generates model responses per sample, then the Judge model scores each across multiple dimensions with written commentary. Supports custom Judge endpoint, model, temperature, and max tokens. API keys are securely passed via Kubernetes Secrets.
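A minimal sketch of a single judge call through an OpenAI-compatible client is shown below. The endpoint URL, model name, and prompt wording are illustrative assumptions rather than the platform's actual judge prompt, and in the product the API key is injected via a Kubernetes Secret instead of being hard-coded.

```python
from openai import OpenAI

# Hypothetical judge endpoint and model; the platform reads these from its
# LLM-as-Judge configuration and supplies the key through a Kubernetes Secret.
judge = OpenAI(base_url="https://judge.example.com/v1", api_key="sk-...")

JUDGE_PROMPT = """You are an impartial evaluator. Score the assistant response
on instruction_following, quality, clarity and safety (1-10 each), then give
an overall score and a one-line comment. Reply as JSON.

[Instruction]
{instruction}

[Response]
{response}
"""

def judge_sample(instruction: str, response: str) -> str:
    completion = judge.chat.completions.create(
        model="judge-model",   # configurable judge model name
        temperature=0.0,       # configurable; low for more stable scoring
        max_tokens=512,        # configurable output budget
        messages=[{
            "role": "user",
            "content": JUDGE_PROMPT.format(instruction=instruction,
                                           response=response),
        }],
    )
    return completion.choices[0].message.content
```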

Four-Dimensional Scoring

LLM-as-Judge evaluates across instruction following, quality, clarity, and safety — each scored 1-10 — plus an overall score and one-line commentary, enabling rapid identification of model weaknesses.
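One plausible in-memory representation of a judged sample, together with a per-dimension aggregation, is sketched below; the field names mirror the dimensions listed above, while the class and helper names are hypothetical.

```python
from dataclasses import dataclass
from statistics import mean

@dataclass
class JudgeScore:
    """One judged sample; fields mirror the four scoring dimensions."""
    instruction_following: int   # 1-10
    quality: int                 # 1-10
    clarity: int                 # 1-10
    safety: int                  # 1-10
    overall: int                 # 1-10
    comment: str                 # one-line judge commentary

def dimension_averages(scores: list[JudgeScore]) -> dict[str, float]:
    # Aggregate per-dimension means to spot systematic weaknesses quickly.
    dims = ("instruction_following", "quality", "clarity", "safety", "overall")
    return {d: mean(getattr(s, d) for s in scores) for d in dims}
```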

Auto-Trigger After Fine-Tuning

Enable enableAutoEval on a fine-tuning job to automatically launch evaluation upon training completion. Model path and evaluation dataset are inherited from the fine-tuning configuration, creating a closed-loop train-to-evaluate pipeline with zero manual intervention.
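A hedged sketch of such a job payload follows; only the enableAutoEval flag comes from this page, and every other field name is an illustrative placeholder rather than the platform's real schema.

```python
import json

# Hypothetical fine-tuning job payload: only the enableAutoEval flag is taken
# from the docs; the surrounding field names are illustrative placeholders.
finetune_job = {
    "name": "qwen-sft-demo",
    "baseModel": "Qwen2-7B-Instruct",
    "dataset": "customer_support_alpaca",
    "enableAutoEval": True,   # launch evaluation automatically after training
    # With auto-eval enabled, the evaluation inherits the fine-tuned model path
    # and the evaluation dataset from this job, so no separate eval config is needed.
}

print(json.dumps(finetune_job, indent=2))
```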

Result Visualization & Export

Auto mode renders ECharts metric charts and summary tables. LLM-as-Judge mode displays score overview cards and per-sample details with judge commentary. Results can be exported in Markdown or JSON for archiving and cross-team sharing.
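The export step could be approximated as below: the function and file names are hypothetical and the metric values in the usage line are dummy inputs, but the idea of writing the same summary once as Markdown (for sharing) and once as JSON (for archiving) matches the export options described above.

```python
import json

def export_report(metrics: dict[str, float], md_path: str, json_path: str) -> None:
    """Write a metric summary as a Markdown table and as a JSON file."""
    lines = ["| Metric | Value |", "| --- | --- |"]
    lines += [f"| {name} | {value:.4f} |" for name, value in metrics.items()]
    with open(md_path, "w", encoding="utf-8") as f:
        f.write("\n".join(lines) + "\n")
    with open(json_path, "w", encoding="utf-8") as f:
        json.dump(metrics, f, indent=2)

# Dummy values for illustration only.
export_report({"predict_bleu-4": 31.2, "predict_rouge-l": 44.8},
              "eval_report.md", "eval_report.json")
```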

Evaluation Mode Comparison

| Dimension | Auto Mode | LLM-as-Judge |
| --- | --- | --- |
| Metrics | BLEU-4 / ROUGE / METEOR / accuracy / eval_loss | instruction_following / quality / clarity / safety / overall |
| Scoring | Algorithmic computation, deterministic results | External LLM scores each sample 1-10 with commentary |
| Best For | Quick regression tests, baseline metric comparison | Subjective quality assessment, pre-deployment human-review substitute |
| Compute Cost | Low — inference + metric calculation only | Higher — inference + Judge API calls |
| Setup | Select dataset and go | Requires Judge endpoint, model, and API key |

Evaluation Workflow

1. Select Model & Dataset: specify the fine-tuned model and the evaluation dataset; alpaca and sharegpt formats are supported (see the dataset sketch after this list).

2. Choose Evaluation Mode: select Auto or LLM-as-Judge mode and configure batch size, max samples, and other parameters.

3. Execute Evaluation: the platform schedules an evaluation Pod via a Volcano Job, with real-time inference and Judge scoring progress.

4. Review Report: Auto mode shows metric charts; Judge mode shows the score overview and per-sample review details.
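For reference, minimal examples of the two accepted dataset layouts are sketched below; the field names follow the common alpaca and sharegpt conventions used by LlamaFactory, and the content is dummy data.

```python
# Minimal illustrations of the two accepted dataset layouts (dummy content).
alpaca_sample = {
    "instruction": "Summarize the following support ticket in one sentence.",
    "input": "The user cannot reset their password from the mobile app.",
    "output": "User reports that the mobile password-reset flow is broken.",
}

sharegpt_sample = {
    "conversations": [
        {"from": "human", "value": "How do I enable two-factor authentication?"},
        {"from": "gpt", "value": "Open Settings > Security and turn on 2FA."},
    ]
}
```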
