[Job Summary]
We are looking for a Site Reliability Engineer (SRE) at Mlytics to bridge the gap between development and operations by applying a software engineering mindset to system administration. Our SREs focus on scaling and automating the production environment, striving to make our services both available and agile.
[What Your Day Looks Like:]
System Scalability and Reliability
Enhance the scalability and reliability of the infrastructure through design improvements and service capacity planning.
Implement automation tools for efficient server provisioning, deployment, and incident response.
Incident Management
Lead real-time incident response and post-mortem analysis to prevent future incidents.
Develop automation tools to reduce manual intervention in incident handling.
Service Monitoring
Design, implement, and maintain monitoring solutions that proactively detect failures, anomalies, and other issues.
Continuously improve monitoring to ensure visibility into service health and performance metrics.
Performance Optimization
Analyze and optimize the performance of key system components, ensuring efficient operation under diverse conditions.
Collaborate with cross-functional teams (RD, DevOps, SOC, CST) to ensure performance requirements are integrated.
Disaster Recovery and Backup
Design and execute disaster recovery plans to minimize downtime and ensure fast system recovery.
Together with DevOps, conduct regular backups and implement appropriate processes for data protection, disaster recovery, and failover.
Documentation and Knowledge Sharing
Maintain detailed documentation for system architecture, configurations, processes, and service records.
Conduct training sessions and provide guidance to other team members and stakeholders, including customers.
Ongoing Projects
SOC Tools (internal tools) maintenance
Slackbot (internal tools) maintenance
SLA Monitoring (StatusPage)
AIOps (Support Bot and others), with a focus on the infrastructure