Site Reliability Engineer
We're Ushering in a New Era of Data Participation. Interested?
We’re shaping the way companies manage data by helping customers connect the right data, insights, and algorithms for all Data Citizens. When everyone across the organization is enabled with data, true transformation can take place. We are building a team of exceptional people to help us deliver on that promise. If you are interested in a career at the leading edge of technology, we look forward to hearing from you.
How you'll make an impact at Collibra:
Collibra seeks a Site Reliability Engineer with a strong focus on the reliability of our current product. As our customer base keeps growing and our product keeps evolving, it is important that we have full insight into what is going on. As our expert Reliability Engineer, you will help teams understand how to make our product more reliable, and you will mentor our engineers. You understand that this insight is the most critical component of improving reliability and the best way of giving feedback to developers, all while maintaining a good flow of new features (stability vs. throughput).
A day in the life of a Site Reliability Engineer at Collibra:
You’ll be reporting directly to the CloudOps Manager, and will be responsible for a wide range of tasks, including:
- You will enhance our customers' experience on our live services by reducing downtime and improving communication flows.
- You are responsible for leading cross-team engineering discussions to achieve scalable, measurable, fault-tolerant, and cost-effective cloud services.
- You will continuously refine monitoring processes, thresholds, and configuration (SLO/SLI).
- Work closely with product developers to ensure new features have proper operational support and maintainability. Provide deep technical guidance to development teams, promote best practices, and mentor continuously.
- Help design, build, and maintain the cloud-native platform needed to support our growth plans. We do that by managing infrastructure as code and automating as much as we can.
- Develop software for the purposes of automating, monitoring and maintaining deployed infrastructure and services.
- Help teams create and maintain documentation and runbooks/playbooks.
- Help teams eliminate their toil with automation or development tools.
- Participate in Scrum processes and ceremonies.
- Build software to help operations and support teams.
- Embrace the pyramid of reliability and ensure that each component is taken care of by the organisation.
You Have:
- A track record of working as a Site Reliability Engineer, DevOps Engineer, or Software Engineer.
- You feel at home in concepts like Infrastructure as Code and CI/CD.
- You have experience with many of the following tools: Ansible, Terraform, Jenkins.
- You feel at home on a Linux system, and it holds few secrets for you.
- Docker and Kubernetes are no longer new technologies for you; you have used them and know where they can break.
- After seeing a diagram, you know how an application works, what it needs, and how communication flows.
- You can explain why SLOs and SLIs are so important.
- You trust in data to make good decisions.
- You understand what it means to have a stable system.
- You have experience in at least one scripting language, such as Python or Go.
- Experience working with cloud platforms such as AWS or GCP.
- Experience using Agile practices.
- Fluent English and Poli