Director of Engineering,
Guardrail Technologies
– Present
Leading platform rebuild for AI safety and LLM governance serving highly regulated industries, establishing production-grade engineering practices.
Implementing observability infrastructure including monitoring, distributed tracing, and incident response procedures to enable reliable production operations for LLM systems.
Architecting LLM and MCP gateway infrastructure with authentication, authorization, audit logging, and PII redaction capabilities.
Optimizing infrastructure costs through usage visibility and right-sizing while maintaining compliance requirements for regulated customers.
Spearheaded development of an operational response process for a 35-member service team to address incident recurrence and improve time-to-mitigation.
Designed and scaled a major incident response process for a 500+ member unit, introducing documentation standards and automated repair frameworks, enhancing collaboration and efficiency.
Led operability team in creating feedback cycles and automated repair systems for distributed compute systems, enabling scalable incident response for backend systems impacting thousands of instances.
Contributed to foundational infrastructure for Kubernetes-based workloads on Oracle Cloud, benefiting high-profile customers like TikTok and xAI, enhancing service reliability and scalability.
Eliminated instance downtime by redesigning core API services to support in-place updates, scaling from dozens of beta users requiring manual intervention to thousands of automated daily updates across production infrastructure.
Managed Gartner Magic Quadrant evaluation process for Compute service, navigating cross-organizational dependencies (Linux, Storage, Networking, Images) to deliver performance benchmarks under analyst scrutiny, validating that the infrastructure could handle 1,000 simultaneous launches.
Prevented customer-impacting release delays during organizational upheaval by owning end-to-end review and coordination of Terraform provider, SDK, and CLI changes, keeping deliverables on track while peer teams experienced significant schedule slippage.
Reduced customer support burden through deep performance optimization of foundational Go SDK, enabling self-service troubleshooting from application layer down to OS-level configuration for services like Oracle Kubernetes Engine.
Accelerated enterprise customer adoption by developing reusable infrastructure patterns (Active Directory integration, Elasticsearch deployment, custom Linux migrations) using Python, Terraform, Ansible, and Cloud-Init, reducing customer onboarding time and support escalations.
Unblocked Oracle Global Business Unit migration to OCI by creating safe instance modification tooling which enabled rapid prototyping and removed deployment bottlenecks blocking Oracle Fusion Apps team adoption.
Delivered VMware-on-OCI Terraform solution under high-stakes deadline for Oracle OpenWorld 2019 on-stage demonstration and Oracle-VMware partnership announcement, meeting critical go-to-market timeline.
Established partner image security standards and review process, personally reviewing dozens of images and preventing multiple critical vulnerabilities (hardcoded credentials, SSH keys, unsafe permissions), then built and trained dedicated team to maintain quality enforcement.
Eliminated customer-reported outages by implementing Prometheus/Grafana monitoring for 3,000+ servers across 45 datacenters, replacing reactive support-ticket-based detection with proactive automated failure detection.
Established foundational source control and CI/CD by deploying GitLab and GitLab CI, eliminating single-server code storage risks and enabling versioned deployments where none existed previously.
Developed standardized cross-datacenter provisioning pipelines and decommissioning procedures, ensuring secure credential management and consistent deployment practices across distributed infrastructure.
Designed incident response procedures to reduce MTTR and coordinate remediation across geographically distributed VPN infrastructure serving privacy-critical customer workloads.
Delivered foundational Go SDK and Terraform provider for Oracle Cloud Infrastructure's competitive market launch (November 2017), enabling configuration-as-code capabilities essential for enterprise customer acquisition against AWS.
Built production-ready infrastructure tooling from minimal documentation during private beta, coordinating across Oracle service teams to define API contracts, ensure idempotent operations, and establish sustainable development practices for future team scaling.
Designed core resource support (networking, compute, load balancing) with HashiCorp integration standards, then trained newly formed Oracle team on maintenance workflows, establishing foundation now used by OpenAI, NVIDIA DGX Cloud, and TikTok for AI/ML workloads.
Infrastructure Engineer,
Ensighten
–
Automated hybrid cloud infrastructure managing 1,300 servers across 14 datacenters and 3 cloud providers using Terraform, Puppet, and Ansible, enabling multi-cloud deployment flexibility and eliminating vendor lock-in for CDN and customer data platform services.
Designed cross-region monitoring and autoscaling using Sensu/Graphite to ensure high availability across geographically distributed infrastructure supporting customer-facing CDN services.
Led 90-day DNS migration converting 2,000 manually managed records to infrastructure-as-code, directing 2-engineer team to build open-source Terraform providers for UltraDNS and NS1, preventing production errors through automated testing and detecting unauthorized configuration changes.
Optimized Kafka infrastructure for Customer Data Platform by contributing custom RoundRobinAssignor to open-source project, enabling dynamic cluster scaling during maintenance without data rejection, reducing infrastructure costs through right-sized capacity.
Infrastructure Engineer,
Simply Measured
–
Engineered and maintained scalable infrastructure for 500 servers, achieving 150:1 server-to-admin ratio and ensuring high availability.
Developed monitoring tools and metrics services, reducing downtime and enhancing operational efficiency. Streamlined 100+ SOPs, led training for 45 engineers, and optimized on-call processes, improving response times.
Implemented distributed processing using Resque/Ruby, reducing data processing time. Collaborated with cross-functional teams to integrate infrastructure solutions, improving deployment success rates.
Operations Engineer,
Bitp.it
–
Designed and implemented automated deployment pipelines using Git/Chef, reducing deployment times and enabling zero-downtime deployments, supporting continuous revenue generation.
Optimized 100 GHash/sec bitcoin mining pool via OS-level hardening, firewall tuning, and secure cold wallet management, enhancing operational security and efficiency.
Software Engineer,
Waterfield Technologies
–
Designed and implemented automated deployment strategies using VMware, Git, Chef, and Capistrano, reducing deployment errors and enabling non-specialist operators to perform upgrades via an intuitive interface.
Developed customizable web UIs for insurance-linked debit cards and engineered a revenue accounting platform for the energy sector, increasing client adaptability through enhanced upgrade support.
Automated testing, continuous integration, and site monitoring processes, ensuring uptime and preventing regressions.
Engineer,
Mobicentric
–
Developed scalable mobile real estate platform reducing agents' operational costs and enhancing branding.
Optimized Ruby API client processing 6,000 listings and 60,000 images daily on single Heroku worker, achieving 10x throughput improvement by implementing streaming SAX parser for efficient RETS XML processing, enabling memory-efficient handling of 1GB+ datasets.
Further experience available upon request
Open Source & Community Contributions
Most of my open source contributions are visible at github.com/josephholsten;
off-GitHub contributions follow: