Explore Our Job Openings
You will be responsible for architecting, designing, and implementing our comprehensive observability stack, including tracing, telemetry, logging, health monitoring, visualization, and dashboards. You will play a key role in ensuring the reliability, performance, and operational efficiency of our services.
1. Design and implement a robust observability framework using technologies such as Prometheus,
Grafana, OpenTelemetry, ELK Stack, Zabbix, and Jaeger.
2. Develop and maintain health monitoring and alerting systems for our OpenStack and
Kubernetes-based platforms, with a focus on GPU-supported environments.
3. Create and manage visualization dashboards to monitor system performance, resource
utilization, and operational health.
4. Implement scalable, distributed logging and tracing solutions to diagnose, troubleshoot,
and resolve system issues effectively.
5. Collaborate with development and operations teams to integrate observability practices
into the development lifecycle.
6. Conduct performance analysis and optimization to ensure system reliability and
efficiency.
7. Stay updated with the latest trends and technologies in observability and performance
monitoring.
1. Bachelor's degree in Computer Science, Engineering, or a related field.
2. Proven experience in observability, monitoring, and system performance analysis,
particularly in a cloud or data center environment.
3. Expertise in implementing and managing observability tools such as Prometheus,
Grafana, OpenTelemetry, ELK Stack, Zabbix, and Jaeger.
4. Strong understanding of container orchestration using Kubernetes, and familiarity
with OpenStack and GPU computing.
5. Proficiency in scripting and automation using languages such as Python, Shell, or Go.
6. Excellent problem-solving skills and the ability to work independently or as part of a team.
7. Strong communication skills and the ability to work in a fast-paced, dynamic
environment.