NOC Engineer
Job ID: NOC-ETP-Pun-1292
Location: Pune
Job Description – 24×7 NOC Engineer
We are looking for a 24×7 NOC Engineer to ensure the availability, performance, and reliability of production systems. This role is hands-on across monitoring/observability, incident response, troubleshooting, and automation, working closely with engineering and infrastructure teams to reduce downtime and improve operational excellence.
Shift / Support Model: 24×7 rotational shifts (including nights/weekends) with on-call participation as required.
Key Responsibilities
- Monitor applications and infrastructure using New Relic, Datadog, Grafana, and related observability tooling; maintain dashboards and actionable alerting.
- Create and tune alerts, and reduce alert noise.
- Provide L1/L2 incident response in a 24×7 environment; triage alerts, restore service quickly, and manage escalations.
- Perform deep troubleshooting across Linux systems, Kubernetes workloads, infrastructure components, and network paths.
- Conduct log analysis using New Relic/ELK (and/or similar platforms) to identify patterns, correlate events, and support root cause analysis.
- Build and enhance automation for routine operational tasks, alert remediation, and reporting using Python and Bash.
- Manage infrastructure changes using Terraform and follow Infrastructure-as-Code practices (review, version control, rollback readiness).
- Support Kubernetes platform operations by assisting with deployments, performing cluster/service health checks, executing scaling and recycling activities, monitoring capacity and performance, and troubleshooting issues.
- Maintain clear runbooks, SOPs, and shift handover notes; ensure knowledge is captured and reusable.
- Partner with engineering and cloud/infrastructure teams to improve reliability through post-incident reviews, problem management, and continuous improvements to observability.
Must-have Skills
- Monitoring & Observability: New Relic, Datadog, Grafana; strong alert triage and dashboarding skills.
- Linux: administration fundamentals, process/service troubleshooting, permissions, performance basics.
- Automation & Scripting: Bash and Python for operational tooling and automation.
- Infrastructure as Code: Terraform (hands-on).
- Containers: Kubernetes (workload troubleshooting, cluster concepts).
- Networking: TCP/IP basics, DNS, HTTP/HTTPS, load balancing concepts, connectivity troubleshooting.
- Log Analysis: ELK (or equivalent), querying/correlation for RCA support.
Secondary Skills
- Cloud infrastructure fundamentals (AWS/Azure/GCP).
- Good communication skills: clear incident updates, shift handovers, and stakeholder coordination.
Qualifications & Experience
- Bachelor’s degree (B.Tech/B.E.), MCA, or equivalent practical experience.
- 4–6 years of experience in SRE / NOC / Production Support / DevOps / Infrastructure Operations.
- Experience working in a shift-based operations environment with strong ownership and urgency.
- Ability to document clearly (runbooks, post-incident notes) and collaborate effectively with cross-functional teams.
