Data Engineering
Versioned management, intelligent cleaning, LLM augmentation — full lifecycle for high-quality training data
Product Overview
Core Capabilities
Multi-Version Data Management
Git-style versioning: every cleaning or augmentation run produces a new version automatically. Supports a four-state lifecycle — Creating, Uploading, Ready, Archived. Versions are fully isolated; Ready versions can be referenced directly by training, fine-tuning, and evaluation jobs, while Archived versions are read-only for auditability.
Multi-Format Support & Online Editor
Native support for Alpaca (instruction/input/output), ShareGPT (multi-turn conversations), and Custom (user-defined fields). Built-in code editor with JSONL line-by-line validation, JSON syntax checking, and CSV/TSV column consistency checks — edit and save directly in the browser without local tooling.
Data-Juicer Intelligent Cleaning
Integrated Data-Juicer engine with 20+ rules across five categories: text cleaning, filtering, deduplication, privacy protection, and text normalization. Supports garbled text removal, length filtering, MinHash semantic dedup, PII masking, traditional-to-simplified Chinese conversion, and more — configured via a three-step wizard and submitted as a K8s job.
LLM Batch Augmentation
Connect any OpenAI-compatible API (vLLM, Ollama, etc.) for data multiplication (1-20x) and quality enhancement. Target specific fields, customize prompts to control generation strategy, and auto-write augmented results to a new dataset version — seamlessly feeding into the training pipeline.
Interactive Augmentation
Online augmentation mode for real-time single-record preview: configure model endpoint with one-click connectivity check, input raw text with a chosen multiplier, and instantly view LLM-generated results. Ideal for small-batch prompt debugging and quality verification, with results directly copyable to your dataset.
Three-Tier Visibility Control
Datasets support project (current project private), tenant (shared within workspace), and public (globally visible) scopes. Combined with RBAC, this enables development isolation and organization-level data asset reuse — eliminating redundant collection and cleaning efforts.
Cleaning Rule Categories
| Category | Rules | Examples |
|---|---|---|
| Text Cleaning | 2 | Remove garbled/invisible characters, strip HTML/XML tags |
| Filtering | 5 | Length filter, special char ratio, harmful content, language filter, perplexity filter |
| Deduplication | 3 | MinHash semantic dedup, exact dedup, intra-document n-gram repetition |
| Privacy Protection | 2 | PII replacement ([NAME]/[PHONE]), PII text removal |
| Text Normalization | 3 | Unicode normalization, whitespace standardization, traditional-to-simplified Chinese |
Data Workflow
Upload Data
Create a dataset and upload files (up to 500 MB), select Alpaca/ShareGPT/Custom format, edit and validate online
Clean / Augment
Configure Data-Juicer cleaning rules or LLM augmentation strategy, submit as a K8s job for automated processing
New Version
Cleaning or augmentation results auto-produce a new version with full data lineage and processing logs
Feed to Training / Eval
Ready versions can be directly referenced by training, fine-tuning, and evaluation jobs — mount with one click