Training Data Management

Manage training data, fine-tuned LoRA adapters, and embeddings for your Chiro AI instance. Achiral provides secure storage with automated retention policies and version control.

Storage Overview

Each Chiro instance includes:

| Plan      | Training Data | LoRA Adapters | Embeddings Storage | Retention |
|-----------|---------------|---------------|--------------------|-----------|
| Spark     | 1 GB          | 5 adapters    | 10 GB              | 30 days   |
| Seed      | 10 GB         | 20 adapters   | 50 GB              | 90 days   |
| Scale     | 100 GB        | 50 adapters   | 200 GB             | 180 days  |
| Dedicated | Unlimited     | Unlimited     | Unlimited          | Custom    |

Storage Types

LoRA Adapters

Store fine-tuned LoRA adapters for your custom models.

Usage:

  • Custom fine-tuned adapters
  • Domain-specific adaptations
  • Industry-specialized models
  • Version history and rollback points

Best Practices:

  • Name adapters descriptively (e.g., customer-support-v2, legal-review-v1)
  • Test adapters before deploying to production
  • Keep previous versions for rollback
  • Document adapter purpose and training data

Training Data

Store datasets for fine-tuning your Chiro instance.

Usage:

  • Training datasets (JSONL, CSV, Parquet)
  • Company documents and knowledge bases
  • Historical conversations and interactions
  • Domain-specific corpora

Best Practices:

  • Use JSONL format for training data
  • Include diverse examples (minimum 100 samples recommended)
  • Validate data quality before training
  • Remove PII and sensitive information
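
The quality checks above can be sketched as a small pre-training validator. The `prompt`/`completion` field names are an assumption here — substitute whatever schema your training format actually uses.

```python
import json

REQUIRED_KEYS = {"prompt", "completion"}  # assumed schema; adjust to your format

def validate_jsonl(lines, min_samples=100):
    """Check that each line is a JSON object with the expected keys
    and that the dataset meets the recommended minimum size."""
    errors = []
    for i, line in enumerate(lines, start=1):
        try:
            record = json.loads(line)
        except json.JSONDecodeError as exc:
            errors.append(f"line {i}: invalid JSON ({exc})")
            continue
        if not isinstance(record, dict):
            errors.append(f"line {i}: expected a JSON object")
            continue
        missing = REQUIRED_KEYS - record.keys()
        if missing:
            errors.append(f"line {i}: missing keys {sorted(missing)}")
    if len(lines) < min_samples:
        errors.append(f"only {len(lines)} samples; {min_samples} recommended")
    return errors
```

Run this before uploading a dataset; an empty result means the file is at least structurally sound.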

Embeddings Storage

Store vector embeddings in your dedicated Weaviate tenant.

Usage:

  • Document embeddings for RAG
  • Knowledge base vectors
  • Semantic search indices
  • Context retrieval cache

Best Practices:

  • Chunk documents into 512-token segments
  • Use 50-token overlap for better context
  • Reindex embeddings when updating documents
  • Monitor embedding quality and relevance
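
The chunking scheme above (512-token segments with 50-token overlap) can be sketched as follows. This works on a pre-tokenized list, so plug in whatever tokenizer your embedding model uses.

```python
def chunk_tokens(tokens, chunk_size=512, overlap=50):
    """Split a token list into fixed-size chunks whose boundaries
    overlap, so adjacent chunks share context."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap  # advance less than a full chunk each time
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # this chunk already reached the end of the document
    return chunks
```

Each chunk's first 50 tokens repeat the previous chunk's last 50, so a retrieval hit near a boundary still carries its surrounding context.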

Conversation History

Store chat conversations and interaction logs.

Usage:

  • Chat conversation history
  • User feedback and ratings
  • Performance metrics
  • Audit logs for compliance

Best Practices:

  • Enable automatic retention policies
  • Export conversations for analysis
  • Anonymize data when required for compliance
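
A rough first pass at the anonymization step might look like the sketch below. Regex scrubbing of this kind is an illustrative assumption, not a compliance guarantee — strict requirements usually call for a dedicated PII-detection tool.

```python
import re

# Simple placeholder patterns; real PII removal needs broader coverage.
PATTERNS = {
    "email": re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),
    "phone": re.compile(r"\+?\d[\d\s().-]{7,}\d"),
}

def anonymize(text):
    """Replace common PII patterns with placeholder tokens before export."""
    for label, pattern in PATTERNS.items():
        text = pattern.sub(f"[{label.upper()}]", text)
    return text
```

Apply it to each message body when exporting conversations for analysis.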

Storage Management

View Storage Usage

Via Dashboard

  1. Navigate to Configuration → Storage
  2. View current usage by category
  3. See storage trends over time

Via API

curl https://api.achiral.ai/v1/organizations/{org_id}/storage \
  -H "Authorization: Bearer YOUR_API_KEY"

Response:

{
  "training_data_gb": 8.5,
  "lora_adapters_count": 12,
  "embeddings_gb": 35.2,
  "conversation_history_gb": 2.1,
  "limits": {
    "training_data_gb": 10,
    "lora_adapters_count": 20,
    "embeddings_gb": 50
  }
}
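
Assuming the response shape shown above, a small helper can flag any category nearing its limit (80% is used as the warning threshold here):

```python
def usage_report(usage):
    """Compare current usage against plan limits from the storage
    endpoint response and flag anything above 80%."""
    report = {}
    for key, limit in usage["limits"].items():
        used = usage[key]
        pct = 100 * used / limit
        report[key] = {
            "used": used,
            "limit": limit,
            "pct": round(pct, 1),
            "warn": pct > 80,  # nearing the plan limit
        }
    return report
```

With the sample response above, `training_data_gb` comes out at 85% and would be flagged.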

Upgrade Storage Limits

Via Dashboard

  1. Navigate to Configuration → Storage
  2. Click Upgrade Plan or Add Storage
  3. Select new storage tier or add-on
  4. Review pricing impact
  5. Click Confirm

Storage upgrades are applied immediately.

Via API

curl -X POST https://api.achiral.ai/v1/organizations/{org_id}/storage/upgrade \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "training_data_gb": 50,
    "embeddings_gb": 100
  }'

Storage Pricing

Additional storage beyond plan limits:

  • Training Data: $0.50 per GB per month
  • Embeddings: $0.30 per GB per month
  • LoRA Adapters: $5 per adapter per month
  • Billing: Prorated daily
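
The daily proration works out as in this sketch. The rates mirror the list above; the 30-day billing month is an assumption for illustration.

```python
# Overage rates in USD per unit per month, from the pricing list above.
RATES = {"training_data_gb": 0.50, "embeddings_gb": 0.30, "lora_adapters": 5.00}

def prorated_overage_cost(overages, days_used, days_in_month=30):
    """Monthly overage cost, prorated by the number of days the
    extra storage was actually held."""
    monthly = sum(RATES[kind] * qty for kind, qty in overages.items())
    return round(monthly * days_used / days_in_month, 2)
```

For example, 10 GB of extra training data plus 2 extra adapters is $15/month at full rate, so holding them for 15 days of a 30-day month costs $7.50.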

Data Retention Policies

Configure automatic cleanup of old data.

Set Retention Policy

curl -X POST https://api.achiral.ai/v1/organizations/{org_id}/retention \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "policies": [
      {
        "type": "conversation_history",
        "max_age_days": 90,
        "keep_latest": 1000
      },
      {
        "type": "lora_adapters",
        "keep_latest": 5
      }
    ]
  }'

Policy Options

  • max_age_days: Delete files older than N days
  • keep_latest: Always keep N most recent items
  • max_size_gb: Delete oldest files when size exceeds limit
  • pattern: Apply policy to files matching pattern

Example Policies

Keep Recent Adapters

{
  "type": "lora_adapters",
  "keep_latest": 10,
  "min_age_days": 7
}

Archive Old Conversations

{
  "type": "conversation_history",
  "max_age_days": 180,
  "export_before_delete": true,
  "export_destination": "s3://my-bucket/archives/"
}

Clean Old Results

{
  "type": "results",
  "max_age_days": 14,
  "max_size_gb": 50
}

Archive Training Data

{
  "type": "data",
  "pattern": "*.csv",
  "max_age_days": 90,
  "archive_to": "s3://my-bucket/archive/"
}

Storage Backup

Automatic Backups

All storage is automatically backed up:

  • Frequency: Every 6 hours
  • Retention: 7 days
  • Location: Separate availability zone
  • Encryption: AES-256 at rest

Manual Snapshots

Create on-demand snapshots:

curl -X POST https://api.achiral.ai/v1/nano/{nano_id}/snapshot \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "description": "Before major model update",
    "retention_days": 30
  }'

Restore from Backup

curl -X POST https://api.achiral.ai/v1/nano/{nano_id}/restore \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "snapshot_id": "snap-abc123xyz",
    "restore_path": "/var/lib/nano/models"
  }'

External Storage Integration

S3 Integration

Connect your AWS S3 buckets for data import/export.

Configure S3 Access

curl -X POST https://api.achiral.ai/v1/nano/{nano_id}/storage/s3 \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "bucket": "my-bucket",
    "region": "us-east-1",
    "access_key_id": "AKIAIOSFODNN7EXAMPLE",
    "secret_access_key": "wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY"
  }'

Import from S3

# Import dataset
curl -X POST https://api.achiral.ai/v1/nano/{nano_id}/storage/import \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "source": "s3://my-bucket/datasets/training.parquet",
    "destination": "/var/lib/nano/data/training.parquet"
  }'

Export to S3

# Export fine-tuned model
curl -X POST https://api.achiral.ai/v1/nano/{nano_id}/storage/export \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "source": "/var/lib/nano/models/my-fine-tuned-model",
    "destination": "s3://my-bucket/models/my-fine-tuned-model"
  }'

Google Cloud Storage

Similar integration available for GCS buckets:

curl -X POST https://api.achiral.ai/v1/nano/{nano_id}/storage/gcs \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "bucket": "my-gcs-bucket",
    "project_id": "my-project",
    "credentials": "base64_encoded_service_account_json"
  }'

Azure Blob Storage

Connect Azure storage accounts:

curl -X POST https://api.achiral.ai/v1/nano/{nano_id}/storage/azure \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "account_name": "mystorageaccount",
    "container": "my-container",
    "access_key": "your_access_key"
  }'

Storage Performance

IOPS and Throughput

| Tier   | Read IOPS | Write IOPS | Read MB/s | Write MB/s |
|--------|-----------|------------|-----------|------------|
| Spark  | 10,000    | 5,000      | 1,000     | 500        |
| Seed   | 50,000    | 25,000     | 5,000     | 2,500      |
| Growth | 100,000   | 50,000     | 10,000    | 5,000      |

Optimization Tips

Use Efficient Formats

# Prefer Parquet over CSV for large datasets
import pandas as pd

# Slow: CSV
df.to_csv('data.csv')

# Fast: Parquet with compression
df.to_parquet('data.parquet', compression='snappy')

Batch Operations

# Batch file operations

data = [f"record {i}" for i in range(1000)]  # example payload

# Instead of many small writes...
for i in range(1000):
    with open(f'file_{i}.txt', 'w') as f:
        f.write(data[i])

# ...prefer a single batched write
with open('combined.txt', 'w') as f:
    f.writelines([data[i] + '\n' for i in range(1000)])

Cache Frequently Accessed Data

# Cache models in memory
from functools import lru_cache

import torch

@lru_cache(maxsize=3)
def load_model(model_name):
    return torch.load(f'/var/lib/nano/models/{model_name}')

Storage Monitoring

Metrics

Monitor storage usage through the dashboard:

  • Used Space: Current storage consumption
  • Available Space: Remaining capacity
  • IOPS Usage: Current I/O operations per second
  • Throughput: Read/write bandwidth usage

Alerts

Set up alerts for:

  • Storage usage above 80%
  • Low IOPS availability
  • Failed backup operations
  • Retention policy violations

Storage Logs

Access storage operation logs:

curl https://api.achiral.ai/v1/nano/{nano_id}/storage/logs \
  -H "Authorization: Bearer YOUR_API_KEY"

Data Security

Encryption

All storage is encrypted:

  • At Rest: AES-256 encryption
  • In Transit: TLS 1.3 for all transfers
  • Key Management: AWS KMS integration
  • Customer Keys: Bring your own encryption keys (Growth plan)

Access Control

Configure access permissions:

curl -X POST https://api.achiral.ai/v1/nano/{nano_id}/storage/permissions \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "path": "/var/lib/nano/models",
    "permissions": "read-only",
    "users": ["user1@example.com", "user2@example.com"]
  }'

Data Migration

Between Chiro Instances

Copy data between your Chiro instances:

curl -X POST https://api.achiral.ai/v1/nano/{source_id}/storage/copy \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "source_path": "/var/lib/nano/models/my-model",
    "destination_nano": "{target_id}",
    "destination_path": "/var/lib/nano/models/my-model"
  }'

Bulk Transfer

For large datasets, use the bulk transfer service:

  1. Contact support to initiate bulk transfer
  2. Provide source and destination details
  3. Transfer is handled offline for optimal speed
  4. Verification and notification upon completion

Next Steps