Description:
About the job
You are responsible for:
- Design, implementation and maintenance of public-facing infrastructure and services
- Use of configuration management and deployment tools
- Architectural design and operation at scale
- Monitoring of systems and services, optimization of performance and resource utilization
- Common operating system level tasks such as logging and backup / restore
- Cookbook / runbook implementation for common maintenance actions
- Incident response, diagnosis and follow-up on system outages or alerts
- Automation and streamlining of tasks as well as identifying process gaps
- Collaborating with a global and asynchronously communicating team (don’t worry if you have never worked remotely, we’ll help you get used to it)
- Mentoring peers in your areas of technical and operational strength
Skills and Experience:
- Strong experience with automation and configuration management tools such as Terraform and Ansible. Proficient in at least one programming language (Python, Go, or similar).
- Strong understanding of CI/CD pipelines and deployment strategies.
- Experience managing cloud services (AWS, Azure, GCP) and identifying cost savings.
- Experience with monitoring, logging, and alerting tools such as Prometheus, Grafana, or ELK stack.
- Strong troubleshooting and problem-solving skills, and the ability to work effectively under pressure.
- Excellent communication skills, with a strong emphasis on documenting processes and runbooks, and the ability to collaborate effectively with cross-functional teams.
- Incident Management: Experience with incident management and on-call rotation practices, as well as tools like PagerDuty or Opsgenie.
- SRE Best Practices: Understanding of SRE principles, such as Service Level Objectives (SLOs), error budgets, and blameless postmortems.
- Familiarity with Wikimedia or other open source projects is a plus.
If you are passionate about building and maintaining reliable, scalable, and highly available infrastructure on AWS, and thrive in a dynamic and collaborative environment, we encourage you to apply for this exciting opportunity to join our team at Wikimedia Enterprise.
Qualities that are important to us:
- Experience with operating highly available infrastructure
- Experience with running applications and services at scale
- Proficient with shell and a programming language used in an SRE/Operations engineering context (Python, Go, etc.)
- Comfortable with open source configuration management and orchestration tools (Ansible, Terraform, etc.)
- Ability to communicate clearly in technical English