SRE Engineer

Location: USA
Type: Full-time
Experience: Senior

About The Position

Senior Site Reliability Engineer (SRE) - Infrastructure

About the team:

Our globally distributed team of senior engineers is dedicated to managing and optimizing infrastructure for multi-cloud production services. We specialize in infrastructure monitoring, automation, tools management, and deployment across various cloud platforms. As part of our responsibilities, we actively participate in follow-the-sun on-call rotations. We operate in a dynamic, multi-tasking environment that requires constant learning and adaptation. By ensuring the reliability of our systems, we directly impact the success of our business.


Tech Stack:

●     Version Control: GitLab

●     Continuous Delivery: Argo CD

●     Container Orchestration: Kubernetes

●     Configuration Management: Puppet, Ansible

●     Automation: Rundeck

●     Monitoring & Alerting: InfluxDB, Prometheus, Thanos, Grafana, Zabbix

●     Logging: Coralogix

●     Infrastructure as Code: Terraform

●     Caching: Memcached

●     Scripting: Shell, Python

●     Cloud Platforms: AWS, GCP

 

About the role:

As a Senior Site Reliability Engineer (SRE) specializing in Infrastructure, you will play a critical role in managing and optimizing our multi-cloud production services. You will be responsible for infrastructure monitoring, automation, tools management, and production stability. This role requires active participation in follow-the-sun on-call rotations to ensure the reliability and availability of our services.



What You Will Do:

●     Manage and optimize multi-cloud production services infrastructure.

●     Implement and maintain infrastructure monitoring solutions using Prometheus, Thanos, Grafana, and other tools.

●     Develop automation scripts in Bash and Python to streamline operational tasks.

●     Manage tools such as Puppet, Ansible, Rundeck, Teleport and more.

●     Collaborate with cross-functional teams to enhance system reliability and performance.

●     Contribute to the architecture and scalability of our systems.

●     Participate in follow-the-sun on-call rotation to respond to incidents and ensure system availability.

●     Troubleshoot and resolve infrastructure issues across our cloud environments.

●     Drive best practices for reliability, scalability, and observability.

●     Mentor and guide other teams in best practices and technologies.

●     Contribute to the design and implementation of scalable, reliable, and secure solutions.


 

Required Experience:

●     Minimum of 2 years of hands-on experience with Kubernetes.

●     Minimum of 5 years of experience as an SRE, DevOps, Cloud, or Systems Engineer.

●     At least 1 year of experience working with cloud environments (AWS, Google Cloud Platform).

●     Strong understanding of infrastructure monitoring tools such as Prometheus (Mimir/Thanos/Cortex), including deployment and management.

●     Proficiency in Bash and Python scripting for automation tasks.

●     Experience with SQL and NoSQL databases, such as MySQL, PostgreSQL and MongoDB.

●     Familiarity with in-memory key-value stores such as Redis and Memcached.

●     Solid understanding of networking and web applications, with emphasis on TCP/IP stack, SSL/TLS, and HTTP protocols.

 

 

Additional Skills (Preferred):

●     Experience with Terraform for infrastructure as code.

●     Knowledge of containerization technologies such as Docker.

●     Understanding of CI/CD pipelines.

●     Familiarity with logging and monitoring tools like Coralogix.

 

Why Join Us:

If you are passionate about infrastructure reliability and automation and thrive in a fast-paced environment, we would love to hear from you. Join us in delivering the best experience for our customers and ensuring the success of our business. Apply now to be part of our innovative team!

 

●     Opportunity to work with a globally distributed team of senior engineers.

●     Dynamic and challenging environment that encourages constant learning and growth.

●     Direct impact on the reliability and success of our business.

●     Exposure to cutting-edge technologies and cloud platforms.

 

 

 
