Command Palette

Search for a command to run...

Back to Projects
Open SourceData ScienceAPI2024-05 — 2024-11

CrateGen

A Python library to convert GA4GH Cloud API schemas to RO-Crate profiles

PythonDjangoDigitalOceanGitGitHub

Overview

CrateGen is a Python library that converts GA4GH Cloud API schemas (TES, WES) to RO-Crate profiles. It makes data sharing and reproducibility easier in scientific research by handling the conversion between genomic and health dataset formats.

The Problem

In the world of genomic research, data comes in many formats. The GA4GH (Global Alliance for Genomics and Health) has established cloud API standards like TES (Task Execution Service) and WES (Workflow Execution Service), but converting between these schemas and research-crate formats was a manual, error-prone process.

Solution

CrateGen automates this conversion process with:

  • Schema Mapping: Intelligent mapping between GA4GH schemas and RO-Crate profiles
  • Validation: Built-in validation to ensure FAIR data principles compliance
  • Extensibility: Plugin architecture for custom schema extensions

Key Features

Automatic Schema Detection

from crategen import CrateGenerator
 
# Automatically detects input schema type
generator = CrateGenerator(input_file="workflow.tes.json")
crate = generator.to_ro_crate()

FAIR Compliance Checking

from crategen import FairValidator
 
validator = FairValidator(crate)
report = validator.validate()
 
print(f"FAIR Score: {report.score}/100")
print(f"Issues: {report.issues}")

Batch Processing

from crategen import BatchProcessor
 
processor = BatchProcessor(input_dir="./workflows/")
results = processor.convert_all(output_format="ro-crate")

Technical Architecture

The library follows a modular architecture:

  1. Parser Layer: Handles input schema parsing (TES, WES, custom)
  2. Transformer Layer: Maps fields between schemas
  3. Generator Layer: Produces valid RO-Crate output
  4. Validation Layer: Ensures compliance with FAIR principles

Impact

  • 40% improvement in data exchange workflows
  • Adopted by research institutions globally
  • Ensures FAIR data principles compliance
  • Enables integration across genomic databases

Lessons Learned

Working on CrateGen taught me a lot about:

  • Designing APIs for scientific communities
  • Implementing strict validation systems
  • Contributing to open-source standards organizations
  • CI/CD best practices with Django and DigitalOcean

Key Metrics

Global Research Community
Target Users
40% improved data exchange workflows
Performance
FAIR data principles compliant
Coverage