I design, automate, and operate highly reliable, scalable, and observable
platforms.
With ~6 years of experience in infrastructure, DevOps, and platform engineering, I specialize in
Infrastructure as Code, Kubernetes, CI/CD automation, observability, and system reliability
engineering.
I focus on automation-first operations, production stability, and
cloud-native platform design.
I am a Senior DevOps / SRE / Platform Engineer with strong hands-on experience in building and operating enterprise-scale infrastructure. My career started in production infrastructure support, which gave me deep exposure to real-world outages, incident management, and reliability challenges. Over time, I transitioned into DevOps and platform engineering roles, where I now design automated, scalable, and observable systems.
I enjoy solving complex infrastructure problems, reducing manual operations through automation, and building platforms that are secure, resilient, and easy to operate. I strongly believe in Infrastructure as Code, CI/CD-driven deployments, and observability-led operations.
Terraform, Ansible, and Python for repeatable provisioning and configuration management.
Orchestration of stateful and stateless workloads with Docker and Kubernetes.
Automated build and deployment pipelines with Jenkins and GitLab CI/CD.
Deep visibility with Prometheus, Grafana, the ELK Stack, and OpenNMS.
High-availability cluster management for PostgreSQL, MySQL, and Cassandra.
Incident response, root cause analysis (RCA), system hardening, and SSL certificate automation.
Real-world problems solved with scalable architecture.
Problem: Manual VM provisioning caused inconsistency and days of delay.
Solution: Architected a Terraform + Ansible provisioning pipeline, cutting provisioning time by 90% and eliminating configuration drift.
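As a sketch of the configuration half of such a pipeline — Terraform creates the VMs, then a playbook like this applies the baseline. The inventory group name and package list are illustrative, not the original code:

```yaml
# Baseline configuration applied to every VM Terraform provisions.
- name: Baseline configuration for newly provisioned VMs
  hosts: new_vms            # inventory group emitted by the Terraform run (assumed name)
  become: true
  tasks:
    - name: Install base packages
      ansible.builtin.package:
        name:
          - chrony
          - curl
        state: present

    - name: Enforce a consistent timezone
      community.general.timezone:
        name: UTC
```

Because both layers are code, a rebuilt VM always converges to the same state — which is what eliminates drift.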
Problem: Monolithic apps on legacy VMs were hard to scale and update.
Solution: Containerized the applications and migrated them to Kubernetes, enabling auto-scaling and zero-downtime rolling deployments.
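A minimal sketch of what one such migrated service looks like — Deployment plus HorizontalPodAutoscaler. Names, image, and thresholds are placeholders, not the original manifests:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: billing-api            # placeholder service name
spec:
  replicas: 2
  selector:
    matchLabels:
      app: billing-api
  template:
    metadata:
      labels:
        app: billing-api
    spec:
      containers:
        - name: billing-api
          image: registry.example.com/billing-api:1.0.0
          readinessProbe:      # gates traffic during rollouts -> zero downtime
            httpGet:
              path: /healthz
              port: 8080
---
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: billing-api
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: billing-api
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
```

The readiness probe is the piece that makes rolling updates genuinely zero-downtime: pods receive traffic only after they report healthy.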
Problem: Manual releases were error-prone and took hours.
Solution: Unified Jenkins/GitLab pipelines with automated gates, cutting release time by 75%.
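The GitLab side of such a pipeline can be sketched as below. Job names, the test script, and the deploy target are invented for illustration; `CI_REGISTRY_IMAGE` and `CI_COMMIT_SHORT_SHA` are standard GitLab CI variables:

```yaml
stages: [build, test, deploy]

build:
  stage: build
  script:
    - docker build -t "$CI_REGISTRY_IMAGE:$CI_COMMIT_SHORT_SHA" .
    - docker push "$CI_REGISTRY_IMAGE:$CI_COMMIT_SHORT_SHA"

test:
  stage: test
  script:
    - ./run-tests.sh           # placeholder for the project's test suite

deploy:
  stage: deploy
  script:
    - kubectl set image deployment/billing-api
      billing-api="$CI_REGISTRY_IMAGE:$CI_COMMIT_SHORT_SHA"
  environment: production
  when: manual                 # human approval gate before production
```

Automated stages catch errors early; the single manual gate keeps production releases deliberate without reintroducing hours of manual work.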
Problem: Reactive incident handling due to lack of visibility.
Solution: Built a Prometheus/Grafana stack integrated with Aizen prediction data, reducing incident MTTR by ~40%.
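A Prometheus alerting rule in that stack might look like the following; the metric is the standard node_exporter CPU metric, while the threshold, labels, and group name are illustrative:

```yaml
groups:
  - name: node-health
    rules:
      - alert: HighCPUUsage
        # CPU utilization = 100 minus the idle percentage, averaged per instance
        expr: 100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 90
        for: 10m                # must persist 10 minutes before firing
        labels:
          severity: warning
        annotations:
          summary: "High CPU on {{ $labels.instance }}"
```

Rules like this turn reactive firefighting into early warning: the alert fires before users notice, not after.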
Problem: Distributed logs made debugging security incidents painful.
Solution: Centralized syslog ingestion from 100+ systems into the ELK Stack, making logs instantly searchable during investigations.
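One common way to feed syslog into ELK is a Filebeat listener shipping to Elasticsearch — a sketch, assuming Filebeat as the shipper (the original stack may have used a different ingestion path), with hostnames as placeholders:

```yaml
filebeat.inputs:
  - type: syslog               # listen for RFC 3164 syslog messages
    protocol.udp:
      host: "0.0.0.0:514"

output.elasticsearch:
  hosts: ["elasticsearch:9200"]   # placeholder Elasticsearch endpoint
```

With every host pointed at one listener, a security investigation becomes a single Kibana query instead of SSH-ing into 100+ machines.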
Problem: Single points of failure in critical database layers.
Solution: Implemented PostgreSQL high availability (HA) and performance-tuned MySQL and Cassandra clusters.
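Patroni is one common way to run PostgreSQL HA (the original setup may differ); a per-node config sketch, with cluster name, hostnames, and credentials as placeholders:

```yaml
scope: pg-cluster              # cluster name (placeholder)
name: pg-node-1                # this node's name (placeholder)
restapi:
  listen: 0.0.0.0:8008
  connect_address: pg-node-1:8008
etcd3:
  # DCS endpoints used for leader election (placeholder hosts)
  hosts: etcd-1:2379,etcd-2:2379,etcd-3:2379
postgresql:
  listen: 0.0.0.0:5432
  connect_address: pg-node-1:5432
  data_dir: /var/lib/postgresql/data
  authentication:
    replication:
      username: replicator
      password: change-me      # placeholder; use a secret store in practice
```

The key design point is that failover is decided by consensus in the DCS (etcd here), so losing any single node — including the primary — no longer takes the database layer down.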
Real failures I learned from while operating production systems.
Early scaling efforts failed due to insufficient visibility, which taught me to treat monitoring and alerting as prerequisites for platform expansion rather than afterthoughts.
Repeated incident patterns were eliminated by converting runbooks into automation, significantly reducing operational toil.