Job Description
Site Reliability Engineer (SRE)
Position Overview:
We are seeking a highly skilled and motivated Site Reliability Engineer (SRE) with strong expertise in Python, advanced proficiency in Azure-based infrastructure, and significant experience in Customer Reliability Engineering (CRE) and Automation. The ideal candidate will have 3 to 5 years of experience in SRE or related fields and a proven ability to design, deploy, and maintain scalable, reliable, and high-performing cloud solutions. This role focuses on driving system reliability, leveraging automation to optimize operations, and delivering robust solutions for complex infrastructure challenges.
Key Responsibilities:
Design & Plan-
- Design and implement comprehensive Elastic Stack (ELK) solutions, including Elasticsearch, Logstash, and Kibana.
- Analyze and document requirements to improve existing infrastructure through automation ("Infrastructure as Code") and seamless Azure cloud integration.
- Develop and document architectural designs for scalable Azure solutions, tailored to customer requirements.
Build & Deploy-
- Build robust CI/CD pipelines (Azure DevOps, Jenkins, ArgoCD) to support efficient code deployment and reusable automation workflows.
- Advance scripting and automation frameworks using Python, Bash, and Painless scripting languages.
- Manage, troubleshoot, and enhance Kubernetes clusters, including Azure Kubernetes Service (AKS) environments.
- Deploy production-ready Elasticsearch clusters on-premises and in Kubernetes clusters.
Operate & Support-
- Proactively monitor systems using tools like Azure Monitor, Elastic Observability, and Application Insights, ensuring high availability and performance.
- Develop self-healing mechanisms and automated scaling for distributed systems to reduce downtime and improve reliability.
- Lead incident response processes, conduct root cause analysis, and drive post-mortem discussions to prevent recurring issues.
- Collaborate with security teams to implement and maintain best practices for system security and compliance.
Automation-
- Develop robust automation scripts for repetitive operational workflows, configuration management, and deployment pipelines using tools such as Ansible, Terraform, and Helm.
- Drive enhancements in infrastructure automation to enable seamless deployments and self-service capabilities for engineering teams.
Collaboration & Customer Engagement-
- Partner with cross-functional teams (engineering, operations, and product) to design systems with reliability and performance in mind.
- Work closely with customers to address specific reliability challenges and ensure tailored Azure-based solutions meet their operational needs.
- Foster a DevOps culture and champion best practices across teams.
Qualifications:
Experience-
- 5+ years of hands-on experience as an SRE or SRE Automation Engineer.
- Proven expertise in designing, deploying, and managing Azure cloud infrastructure and services.
- Significant experience with the Elastic Stack (ELK), including managing Elasticsearch clusters, Logstash pipelines, and Kibana visualizations.
- Advanced proficiency in Python scripting and automation for large-scale systems.
- Strong knowledge of Kubernetes cluster management, including AKS.
- Demonstrated experience building CI/CD pipelines and deploying applications in distributed environments.
- Working knowledge of containerization tools like Docker and orchestration technologies.
Technical Skills-
- Azure Expertise: Azure Kubernetes Service (AKS), Azure DevOps, Application Insights, Log Analytics, and Azure security best practices.
- Automation Tools: Proficiency with Ansible, Terraform, Helm, and ArgoCD.
- Scripting: Python (advanced), Bash, Painless scripting for Elasticsearch pipelines.
- Monitoring: Elastic Observability, Grafana, and Azure-native tools.
- Networking: Understanding of virtual networks, firewalls, and RBAC in cloud environments.
- Security: Familiarity with OAuth, SAML, and secure deployment methodologies.
- Knowledge of highly scalable systems, RESTful APIs, and caching mechanisms.
Soft Skills-
- Strong problem-solving and troubleshooting skills for complex distributed systems.
- Excellent communication and collaboration skills, including the ability to liaise between technical teams and non-technical stakeholders.
- Customer-focused approach, with a track record of designing solutions that meet client-specific reliability requirements.
- Proactive, self-motivated, and committed to continuous learning and improvement.
Education-
- Bachelor’s degree in Computer Science, Engineering, or a related field (or equivalent practical experience).
Preferred Qualifications-
- Working knowledge of Elastic Cloud on Kubernetes (ECK).
- Certification in Microsoft Azure or Kubernetes.
- Experience implementing GitOps methodologies for deployment automation.
Job Tags
Remote job