Site Reliability Engineer (SRE) 網站可靠性工程師

[Job Summary]
We are looking for a Site Reliability Engineer (SRE) at Mlytics to bridge the gap between development and operations by applying a software engineering mindset to system administration. SREs focus on scaling and automating the production environment, striving to make our services both available and agile.

[What Your Day Looks Like]

System Scalability and Reliability
- Enhance the scalability and reliability of the infrastructure through design improvements and service capacity planning.
- Implement automation tools for efficient server provisioning, deployment processes, and incident response.

Incident Management
- Lead real-time incident response and post-mortem analysis to prevent future incidents.
- Develop automation tools to reduce manual intervention in incident handling.

Service Monitoring
- Design, implement, and maintain monitoring solutions that proactively detect failures, anomalies, and other issues.
- Continuously improve monitoring to ensure visibility into service health and performance metrics.

Performance Optimization
- Analyze and optimize the performance of key system components, ensuring efficient operation under diverse conditions.
- Collaborate with cross-functional teams (RD, DevOps, SOC, CST) to ensure that performance requirements are integrated.

Disaster Recovery and Backup
- Design and execute disaster recovery plans to minimize downtime and ensure fast system recovery.
- Together with DevOps, conduct regular backup operations and implement appropriate processes for data protection, disaster recovery, and failover.

Documentation and Knowledge Sharing
- Maintain detailed documentation for system architecture, configurations, processes, and service records.
- Conduct training sessions and provide guidance to other team members and stakeholders, including customers.

[Ongoing Projects]
- SOC Tools (internal tools) maintenance
- Slackbot (internal tools) maintenance
- SLA Monitoring (StatusPage)
- AIOps (Support Bot and others), focusing on the infrastructure