Training Data Management
Manage training data, fine-tuned LoRA adapters, and embeddings for your Chiro AI instance. Achiral provides secure storage with automated retention policies and version control.
Storage Overview
Each Chiro instance includes:
| Plan | Training Data | LoRA Adapters | Embeddings Storage | Retention |
|---|---|---|---|---|
| Spark | 1 GB | 5 adapters | 10 GB | 30 days |
| Seed | 10 GB | 20 adapters | 50 GB | 90 days |
| Scale | 100 GB | 50 adapters | 200 GB | 180 days |
| Dedicated | Unlimited | Unlimited | Unlimited | Custom |
Storage Types
LoRA Adapters
Store fine-tuned LoRA adapters for your custom models.
Usage:
- Custom fine-tuned adapters
- Domain-specific adaptations
- Industry-specialized models
- Version history and rollback points
Best Practices:
- Name adapters descriptively (e.g., customer-support-v2, legal-review-v1)
- Test adapters before deploying to production
- Keep previous versions for rollback
- Document adapter purpose and training data
Training Data
Store datasets for fine-tuning your Chiro instance.
Usage:
- Training datasets (JSONL, CSV, Parquet)
- Company documents and knowledge bases
- Historical conversations and interactions
- Domain-specific corpora
Best Practices:
- Use JSONL format for training data
- Include diverse examples (minimum 100 samples recommended)
- Validate data quality before training
- Remove PII and sensitive information
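The validation step above can be sketched in a few lines of stdlib Python. This is illustrative only: the `prompt`/`completion` field names are an assumption, not a documented Achiral schema, and the 100-sample floor mirrors the recommendation above.

```python
# Minimal JSONL validation sketch. Field names ("prompt", "completion") are
# assumptions for illustration, not a documented Achiral training schema.
import json

MIN_SAMPLES = 100  # recommended minimum from the best practices above

def validate_jsonl(lines, required_keys=("prompt", "completion")):
    """Parse JSONL lines; return (valid_count, list of error messages)."""
    valid, errors = 0, []
    for i, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue
        try:
            record = json.loads(line)
        except json.JSONDecodeError as exc:
            errors.append(f"line {i}: invalid JSON ({exc.msg})")
            continue
        missing = [k for k in required_keys if k not in record]
        if missing:
            errors.append(f"line {i}: missing {missing}")
        else:
            valid += 1
    if valid < MIN_SAMPLES:
        errors.append(f"only {valid} valid samples; {MIN_SAMPLES}+ recommended")
    return valid, errors
```

Running this over a dataset before uploading catches malformed rows early, when they are cheap to fix.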
Embeddings Storage
Store vector embeddings in your dedicated Weaviate tenant.
Usage:
- Document embeddings for RAG
- Knowledge base vectors
- Semantic search indices
- Context retrieval cache
Best Practices:
- Chunk documents into 512-token segments
- Use 50-token overlap for better context
- Reindex embeddings when updating documents
- Monitor embedding quality and relevance
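The 512-token chunks with 50-token overlap recommended above can be produced with a simple sliding window. A minimal sketch, operating on an already-tokenized list (production code would use your embedding model's tokenizer to produce the tokens):

```python
# Sliding-window chunking: 512-token segments with 50-token overlap,
# matching the best practices above.
def chunk_tokens(tokens, size=512, overlap=50):
    """Split a token list into overlapping chunks of at most `size` tokens."""
    step = size - overlap  # 462: each chunk repeats the previous 50 tokens
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # final chunk reached the end of the document
    return chunks
```

The overlap means a sentence split across a chunk boundary still appears whole in at least one chunk, which is why retrieval quality tends to improve.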
Conversation History
Store chat conversations and interaction logs.
Usage:
- Chat conversation history
- User feedback and ratings
- Performance metrics
- Audit logs for compliance
Best Practices:
- Enable automatic retention policies
- Export conversations for analysis
- Anonymize data when required for compliance
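Anonymization before export can be as simple as pattern-based redaction. The sketch below redacts email addresses and phone-like digit runs with stdlib `re`; it only shows the shape of the step, and a real compliance pipeline would use a dedicated PII detection service rather than two regexes.

```python
# Illustrative PII scrubbing for exported conversations. Two regexes are a
# starting point, not a compliance guarantee.
import re

EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
PHONE = re.compile(r"\+?\d[\d\s().-]{7,}\d")  # loose phone-number shape

def anonymize(text):
    """Replace email addresses and phone-like sequences with placeholders."""
    text = EMAIL.sub("[EMAIL]", text)
    return PHONE.sub("[PHONE]", text)
```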
Storage Management
View Storage Usage
Via Dashboard
- Navigate to Configuration → Storage
- View current usage by category
- See storage trends over time
Via API
curl https://api.achiral.ai/v1/organizations/{org_id}/storage \
-H "Authorization: Bearer YOUR_API_KEY"
Response:
{
"training_data_gb": 8.5,
"lora_adapters_count": 12,
"embeddings_gb": 35.2,
"conversation_history_gb": 2.1,
"limits": {
"training_data_gb": 10,
"lora_adapters_count": 20,
"embeddings_gb": 50
}
}
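A small script can turn that response into usage percentages against your plan limits. The HTTP call below is a sketch (stdlib `urllib`, API key read from an assumed `ACHIRAL_API_KEY` environment variable); the percentage math uses only the documented response shape.

```python
# Compute per-category usage percentages from the storage endpoint response.
import json
import os
import urllib.request

def usage_percentages(storage):
    """Map each metered field to its percentage of the plan limit."""
    limits = storage["limits"]
    return {k: round(100 * storage[k] / limits[k], 1) for k in limits}

def fetch_storage(org_id):
    # Illustrative call to the endpoint shown above; ACHIRAL_API_KEY is an
    # assumed environment variable name.
    req = urllib.request.Request(
        f"https://api.achiral.ai/v1/organizations/{org_id}/storage",
        headers={"Authorization": f"Bearer {os.environ['ACHIRAL_API_KEY']}"},
    )
    with urllib.request.urlopen(req, timeout=30) as resp:
        return json.load(resp)
```

With the example response above, training data is at 85% of its 10 GB limit, a useful signal to upgrade before an upload fails.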
Upgrade Storage Limits
Via Dashboard
- Navigate to Configuration → Storage
- Click Upgrade Plan or Add Storage
- Select new storage tier or add-on
- Review pricing impact
- Click Confirm
Storage upgrades are applied immediately.
Via API
curl -X POST https://api.achiral.ai/v1/organizations/{org_id}/storage/upgrade \
-H "Authorization: Bearer YOUR_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"training_data_gb": 50,
"embeddings_gb": 100
}'
Storage Pricing
Additional storage beyond plan limits:
- Training Data: $0.50 per GB per month
- Embeddings: $0.30 per GB per month
- LoRA Adapters: $5 per adapter per month
- Billing: Prorated daily
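As a worked example of the rates above with daily proration: 15 GB of extra training data held for a full 30-day month, plus 2 extra adapters added mid-month. The proration formula is an illustration of "prorated daily", not billing-system source code.

```python
# Illustrative daily proration using the published rates above.
def prorated_cost(monthly_rate, units, days_used, days_in_month=30):
    """Charge for `units` at `monthly_rate` each, prorated by days used."""
    return monthly_rate * units * days_used / days_in_month

extra = (
    prorated_cost(0.50, 15, days_used=30)   # 15 GB training data, full month: $7.50
    + prorated_cost(5.00, 2, days_used=15)  # 2 adapters for half the month: $5.00
)
# total overage for the month: $12.50
```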
Data Retention Policies
Configure automatic cleanup of old data.
Set Retention Policy
curl -X POST https://api.achiral.ai/v1/organizations/{org_id}/retention \
-H "Authorization: Bearer YOUR_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"policies": [
{
"type": "conversation_history",
"max_age_days": 90,
"keep_latest": 1000
},
{
"type": "lora_adapters",
"keep_latest": 5
}
]
}'
Policy Options
- max_age_days: Delete files older than N days
- keep_latest: Always keep N most recent items
- max_size_gb: Delete oldest files when size exceeds limit
- pattern: Apply policy to files matching pattern
Example Policies
Keep Recent Adapters
{
"type": "lora_adapters",
"keep_latest": 10,
"min_age_days": 7
}
Archive Old Conversations
{
"type": "conversation_history",
"max_age_days": 180,
"export_before_delete": true,
"export_destination": "s3://my-bucket/archives/"
}
Clean Old Results
{
"type": "results",
"max_age_days": 14,
"max_size_gb": 50
}
Archive Training Data
{
"type": "data",
"pattern": "*.csv",
"max_age_days": 90,
"archive_to": "s3://my-bucket/archive/"
}
Storage Backup
Automatic Backups
All storage is automatically backed up:
- Frequency: Every 6 hours
- Retention: 7 days
- Location: Separate availability zone
- Encryption: AES-256 at rest
Manual Snapshots
Create on-demand snapshots:
curl -X POST https://api.achiral.ai/v1/nano/{nano_id}/snapshot \
-H "Authorization: Bearer YOUR_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"description": "Before major model update",
"retention_days": 30
}'
Restore from Backup
curl -X POST https://api.achiral.ai/v1/nano/{nano_id}/restore \
-H "Authorization: Bearer YOUR_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"snapshot_id": "snap-abc123xyz",
"restore_path": "/var/lib/nano/models"
}'
External Storage Integration
S3 Integration
Connect your AWS S3 buckets for data import/export.
Configure S3 Access
curl -X POST https://api.achiral.ai/v1/nano/{nano_id}/storage/s3 \
-H "Authorization: Bearer YOUR_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"bucket": "my-bucket",
"region": "us-east-1",
"access_key_id": "AKIAIOSFODNN7EXAMPLE",
"secret_access_key": "wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY"
}'
Import from S3
# Import dataset
curl -X POST https://api.achiral.ai/v1/nano/{nano_id}/storage/import \
-H "Authorization: Bearer YOUR_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"source": "s3://my-bucket/datasets/training.parquet",
"destination": "/var/lib/nano/data/training.parquet"
}'
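The import above assumes the dataset is already in your bucket. A small sketch for staging it there with boto3 (third-party; AWS credentials assumed configured), with a helper that splits the same `s3://` URI format used above into bucket and key:

```python
# Stage a local file into S3 before importing it. boto3 is a third-party
# dependency (pip install boto3); credentials come from the standard AWS chain.
from urllib.parse import urlparse

def parse_s3_uri(uri):
    """Split 's3://bucket/key' into (bucket, key)."""
    parsed = urlparse(uri)
    if parsed.scheme != "s3":
        raise ValueError(f"not an s3 URI: {uri}")
    return parsed.netloc, parsed.path.lstrip("/")

def upload(local_path, s3_uri):
    import boto3  # imported here so the helper above works without AWS installed
    bucket, key = parse_s3_uri(s3_uri)
    boto3.client("s3").upload_file(local_path, bucket, key)
```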
Export to S3
# Export fine-tuned model
curl -X POST https://api.achiral.ai/v1/nano/{nano_id}/storage/export \
-H "Authorization: Bearer YOUR_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"source": "/var/lib/nano/models/my-fine-tuned-model",
"destination": "s3://my-bucket/models/my-fine-tuned-model"
}'
Google Cloud Storage
Similar integration available for GCS buckets:
curl -X POST https://api.achiral.ai/v1/nano/{nano_id}/storage/gcs \
-H "Authorization: Bearer YOUR_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"bucket": "my-gcs-bucket",
"project_id": "my-project",
"credentials": "base64_encoded_service_account_json"
}'
Azure Blob Storage
Connect Azure storage accounts:
curl -X POST https://api.achiral.ai/v1/nano/{nano_id}/storage/azure \
-H "Authorization: Bearer YOUR_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"account_name": "mystorageaccount",
"container": "my-container",
"access_key": "your_access_key"
}'
Storage Performance
IOPS and Throughput
| Tier | Read IOPS | Write IOPS | Read MB/s | Write MB/s |
|---|---|---|---|---|
| Spark | 10,000 | 5,000 | 1,000 | 500 |
| Seed | 50,000 | 25,000 | 5,000 | 2,500 |
| Scale | 100,000 | 50,000 | 10,000 | 5,000 |
Optimization Tips
Use Efficient Formats
# Prefer Parquet over CSV for large datasets
import pandas as pd

df = pd.DataFrame({"text": ["example"], "label": [1]})  # example dataset

# Slow: CSV (text-based, no compression by default)
df.to_csv('data.csv', index=False)

# Fast: Parquet with compression (requires pyarrow or fastparquet)
df.to_parquet('data.parquet', compression='snappy')
Batch Operations
# Batch file operations
data = [f'record {i}' for i in range(1000)]  # example payload

# Instead of many small writes
for i in range(1000):
    with open(f'file_{i}.txt', 'w') as f:
        f.write(data[i])

# Use a single batched write
with open('combined.txt', 'w') as f:
    f.writelines(line + '\n' for line in data)
Cache Frequently Accessed Data
# Cache models in memory
import torch
from functools import lru_cache

@lru_cache(maxsize=3)
def load_model(model_name):
    return torch.load(f'/var/lib/nano/models/{model_name}')
Storage Monitoring
Metrics
Monitor storage usage through the dashboard:
- Used Space: Current storage consumption
- Available Space: Remaining capacity
- IOPS Usage: Current I/O operations per second
- Throughput: Read/write bandwidth usage
Alerts
Set up alerts for:
- Storage usage above 80%
- Low IOPS availability
- Failed backup operations
- Retention policy violations
Storage Logs
Access storage operation logs:
curl https://api.achiral.ai/v1/nano/{nano_id}/storage/logs \
-H "Authorization: Bearer YOUR_API_KEY"
Data Security
Encryption
All storage is encrypted:
- At Rest: AES-256 encryption
- In Transit: TLS 1.3 for all transfers
- Key Management: AWS KMS integration
- Customer Keys: Bring your own encryption keys (Scale plan and above)
Access Control
Configure access permissions:
curl -X POST https://api.achiral.ai/v1/nano/{nano_id}/storage/permissions \
-H "Authorization: Bearer YOUR_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"path": "/var/lib/nano/models",
"permissions": "read-only",
"users": ["user1@example.com", "user2@example.com"]
}'
Data Migration
Between Chiro Instances
Copy data between your Chiro instances:
curl -X POST https://api.achiral.ai/v1/nano/{source_id}/storage/copy \
-H "Authorization: Bearer YOUR_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"source_path": "/var/lib/nano/models/my-model",
"destination_nano": "{target_id}",
"destination_path": "/var/lib/nano/models/my-model"
}'
Bulk Transfer
For large datasets, use the bulk transfer service:
- Contact support to initiate bulk transfer
- Provide source and destination details
- Transfer is handled offline for optimal speed
- Verification and notification upon completion
Next Steps
- Environment Variables - Configure runtime settings
- API Reference - Storage API endpoints
- Security & Compliance - Data security features