Skip to main content

Data Engineering

Versioned management, intelligent cleaning, LLM augmentation — full lifecycle for high-quality training data

Product Overview

A full-lifecycle data management platform for LLM fine-tuning and evaluation. Git-style multi-version control, Alpaca/ShareGPT/Custom multi-format online editing, Data-Juicer cleaning across five rule categories, LLM batch and interactive augmentation, plus project/tenant/public three-tier visibility — ensuring quality and traceability from data collection to consumption.

Core Capabilities

Multi-Version Data Management

Git-style versioning: every cleaning or augmentation run produces a new version automatically. Supports a four-state lifecycle — Creating, Uploading, Ready, Archived. Versions are fully isolated; Ready versions can be referenced directly by training, fine-tuning, and evaluation jobs, while Archived versions are read-only for auditability.

Multi-Format Support & Online Editor

Native support for Alpaca (instruction/input/output), ShareGPT (multi-turn conversations), and Custom (user-defined fields). Built-in code editor with JSONL line-by-line validation, JSON syntax checking, and CSV/TSV column consistency checks — edit and save directly in the browser without local tooling.

Data-Juicer Intelligent Cleaning

Integrated Data-Juicer engine with 20+ rules across five categories: text cleaning, filtering, deduplication, privacy protection, and text normalization. Supports garbled text removal, length filtering, MinHash semantic dedup, PII masking, traditional-to-simplified Chinese conversion, and more — configured via a three-step wizard and submitted as a K8s job.

LLM Batch Augmentation

Connect any OpenAI-compatible API (vLLM, Ollama, etc.) for data multiplication (1-20x) and quality enhancement. Target specific fields, customize prompts to control generation strategy, and auto-write augmented results to a new dataset version — seamlessly feeding into the training pipeline.

Interactive Augmentation

Online augmentation mode for real-time single-record preview: configure model endpoint with one-click connectivity check, input raw text with a chosen multiplier, and instantly view LLM-generated results. Ideal for small-batch prompt debugging and quality verification, with results directly copyable to your dataset.

Three-Tier Visibility Control

Datasets support project (current project private), tenant (shared within workspace), and public (globally visible) scopes. Combined with RBAC, this enables development isolation and organization-level data asset reuse — eliminating redundant collection and cleaning efforts.

Cleaning Rule Categories

Category Rules Examples
Text Cleaning 2 Remove garbled/invisible characters, strip HTML/XML tags
Filtering 5 Length filter, special char ratio, harmful content, language filter, perplexity filter
Deduplication 3 MinHash semantic dedup, exact dedup, intra-document n-gram repetition
Privacy Protection 2 PII replacement ([NAME]/[PHONE]), PII text removal
Text Normalization 3 Unicode normalization, whitespace standardization, traditional-to-simplified Chinese

Data Workflow

1

Upload Data

Create a dataset and upload files (up to 500 MB), select Alpaca/ShareGPT/Custom format, edit and validate online

2

Clean / Augment

Configure Data-Juicer cleaning rules or LLM augmentation strategy, submit as a K8s job for automated processing

3

New Version

Cleaning or augmentation results auto-produce a new version with full data lineage and processing logs

4

Feed to Training / Eval

Ready versions can be directly referenced by training, fine-tuning, and evaluation jobs — mount with one click

Back to Rise ModelX