Senior DevOps | SRE | Cloud & Platform Engineer

Designing Self-Healing Systems.

I design, automate, and operate highly reliable, scalable, and observable platforms.

With ~6 years of experience in infrastructure, DevOps, and platform engineering, I specialize in Infrastructure as Code, Kubernetes, CI/CD automation, observability, and site reliability engineering.

I focus on automation-first operations, production stability, and cloud-native platform design.

vignesh@production:~/platform
100% Uptime Maintained
90% Faster Provisioning
40% MTTR Improvement
100+ Systems Monitored

About Me

I am a Senior DevOps / SRE / Platform Engineer with strong hands-on experience in building and operating enterprise-scale infrastructure. My career started in production infrastructure support, which gave me deep exposure to real-world outages, incident management, and reliability challenges. Over time, I transitioned into DevOps and platform engineering roles, where I now design automated, scalable, and observable systems.

I enjoy solving complex infrastructure problems, reducing manual operations through automation, and building platforms that are secure, resilient, and easy to operate. I strongly believe in Infrastructure as Code, CI/CD-driven deployments, and observability-led operations.

What I Do (Core Expertise)

Infrastructure & Automation

Terraform, Ansible, & Python for repeatable provisioning and config management.

Container Platforms

Orchestrating stateful & stateless workloads using Docker & Kubernetes.

CI/CD Engineering

Automated build & deployment pipelines via Jenkins and GitLab CI/CD.

Observability Stack

Deep visibility using Prometheus, Grafana, ELK Stack & OpenNMS.

Database Reliability

Managing HA clusters for PostgreSQL, MySQL, & Cassandra.

SRE & Security

Incident Response, RCA, system hardening, and SSL automation.

Featured Engineering

Real-world problems solved with scalable architecture.

IaC VMware

Infra Automation & Private Cloud

Problem: Manual VM provisioning caused inconsistency and days of delay.

Solution: Architected a Terraform + Ansible pipeline that reduced provisioning time by 90% and eliminated configuration drift.

Shell Python Ansible Terraform AWS

Kubernetes

K8s Platform Migration

Problem: Monolithic apps on legacy VMs were hard to scale and update.

Solution: Containerized the applications and migrated them to Kubernetes, enabling auto-scaling and zero-downtime deployments.

Docker K8s Helm

CI/CD

Release Engineering

Problem: Manual releases were error-prone and took hours.

Solution: Unified Jenkins/GitLab pipelines with automated gates, cutting release time by 75%.

Jenkins GitLab Maven SonarQube

Observability

Monitoring & AIOps Platform

Problem: Reactive incident handling due to lack of visibility.

Solution: Built a Prometheus/Grafana stack integrated with Aizen prediction data, reducing MTTR during incidents by ~40%.

Prometheus Grafana Alertmanager

Logging

Centralized ELK Stack

Problem: Distributed logs made debugging security incidents painful.

Solution: Ingested syslogs from 100+ systems into ELK, making all logs searchable from a single place.

Elasticsearch Kibana

Database

DB Engineering

Problem: Single points of failure in critical database layers.

Solution: Implemented PostgreSQL high availability (HA) and performance-tuned MySQL and Cassandra clusters.

PostgreSQL MySQL Cassandra

Career Trajectory

Oct 2024 – Present

Engineer – Systems (Senior DevOps / SRE / Platform Engineering)

Sify Technologies Limited

  • Own end-to-end platform automation, CI/CD pipelines, observability, and system reliability for enterprise environments.
  • Design and implement Infrastructure as Code using Terraform to provision VMware-based private cloud infrastructure, reducing provisioning time by 90%.
  • Automate system provisioning and configuration using Ansible and Shell scripting, reducing manual effort and eliminating drift.
  • Build and maintain CI/CD pipelines using Jenkins and GitLab CI/CD, cutting release time by 75% through automated builds and testing.
  • Integrate SonarQube into CI pipelines for code quality and security.
  • Deploy and operate workloads on Kubernetes clusters, including monitoring and platform services.
  • Implement Prometheus and Grafana for system and application monitoring, reducing MTTR during incidents by ~40%.
  • Design custom Grafana dashboards for data visualization, including dashboards built on Aizen prediction data.
  • Deploy and manage OpenProject (project management) and OpenNMS (network monitoring) on VMs and Kubernetes.
  • Monitor and support Everest NMS application servers, ensuring high availability and service reliability.
  • Manage PostgreSQL database replication and cloning as part of Everest platform operations.
  • Deploy and maintain ELK Stack (Elasticsearch, Logstash, Kibana) for centralized logging, including syslog and NetFlow monitoring.
  • Work with PostgreSQL, MySQL, and Cassandra databases for deployment, maintenance, and troubleshooting.
  • Automate infrastructure workflows using the Itential platform by creating reusable automation workflows.
  • Participate in incident response, RCA, and reliability improvements, following SRE best practices.

Jan 2022 – Sep 2024

DevOps Engineer

Precision Biometric India Private Limited

  • Owned DevOps and production operations for mission-critical applications in a 24/7 environment.
  • Deployed and supported Java-based web applications in multi-tier architectures.
  • Built and automated CI/CD pipelines using Jenkins, including Maven builds, artifact management, and Docker-based deployments.
  • Implemented Infrastructure as Code using Terraform for consistent and repeatable infrastructure provisioning.
  • Containerized applications using Docker and supported container-based deployments.
  • Performed Linux system administration, automating routine tasks using shell scripting and cron jobs.
  • Handled production incidents, log analysis, and root cause analysis, improving deployment reliability.

Jun 2016 – Mar 2018

Technical Support Engineer (Infrastructure & Operations – Foundation Role)

CSS Corp Pvt Ltd

  • Started career in 24/7 Infrastructure & Production Operations, building a strong foundation in systems, networking, and reliability.
  • Supported network devices, security systems, routers, and enterprise desktop environments.
  • Performed Linux and Windows troubleshooting, system health checks, and log analysis.
  • Used ServiceNow for incident, problem, and escalation management following ITIL practices.
  • Contributed to a 50% improvement in customer satisfaction metrics through operational excellence.

Technology Stack

IaC & Automation

Terraform
Ansible
Python
Shell
AWS

CI/CD & DevOps

Jenkins
GitLab
Maven
SonarQube

Containers & Orchestration

Kubernetes
Docker
Helm

Observability & Logging

Prometheus
Grafana
ELK Stack
OpenNMS

Databases

PostgreSQL
MySQL
Cassandra

Reliability Lessons

Real failures I learned from while operating production systems.

Monitoring Before Scaling

Early scaling efforts failed because of insufficient visibility. Since then, I have treated monitoring and alerting as a prerequisite for platform expansion.

Automation Over Runbooks

Converting runbooks into automation eliminated recurring incident patterns and significantly reduced operational toil.
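
As a rough illustration of the runbook-to-automation idea, here is a minimal Python sketch. It is not the automation used in production. The `RunbookStep` type, the `run_step` function, and the simulated service state are hypothetical names invented for this example. The pattern is check, remediate once, re-check, and escalate to a human if the fix did not work.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class RunbookStep:
    """One runbook entry: a health check and the fix to apply when it fails."""
    name: str
    is_healthy: Callable[[], bool]
    remediate: Callable[[], None]


def run_step(step: RunbookStep) -> str:
    """Run the check, remediate once if needed, then re-check.

    Returns "healthy", "remediated", or "escalate".
    """
    if step.is_healthy():
        return "healthy"
    step.remediate()
    # Re-check after the fix; if still unhealthy, hand over to a human.
    return "remediated" if step.is_healthy() else "escalate"


if __name__ == "__main__":
    # Simulated service state standing in for a real health probe (assumption).
    state = {"up": False}
    step = RunbookStep(
        name="restart-stuck-service",
        is_healthy=lambda: state["up"],
        remediate=lambda: state.update(up=True),
    )
    print(step.name, "->", run_step(step))
```

Limiting the step to a single remediation attempt is a deliberate choice in this sketch: it keeps the automation from looping on a fix that does not work and sends the incident to on-call instead.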