We are looking for a senior AI Infrastructure Engineer to design, build, and operate robust AI/ML infrastructure on public cloud platforms. The ideal candidate will have deep hands-on experience in cloud-native environments, container orchestration, Infrastructure as Code, CI/CD, and observability, and will keep AI workloads scalable, secure, and efficient.
■Design, deploy, and operate AI/ML infrastructure on public cloud platforms (AWS/Azure/GCP or domestic clouds like Alibaba Cloud/Tencent Cloud).
■Build and maintain containerized environments using Docker and manage large-scale workloads with Kubernetes.
■Use Infrastructure as Code (e.g., Terraform, Ansible) to manage and automate environment provisioning, configuration, and changes.
■Design, implement, and optimize CI/CD pipelines to support frequent, reliable, and secure deployment of AI and backend services.
■Implement and maintain monitoring, logging, and alerting systems to ensure high availability and quick incident response.
■Collaborate closely with AI/ML engineers and backend teams to ensure infrastructure meets performance, security, and compliance requirements.
■Continuously optimize the cost, performance, and reliability of the infrastructure, and drive the adoption of cloud-native and DevOps best practices.
■Cloud & Operations: Senior-level hands-on experience deploying and operating systems on public cloud platforms (AWS/Azure/GCP or domestic platforms like Alibaba Cloud/Tencent Cloud), with the ability to do so independently.
■Container & Orchestration: Proficient in containerization (Docker) and container orchestration (Kubernetes), with hands-on production experience.
■Infrastructure as Code: Skilled in using Infrastructure as Code tools (e.g., Terraform, Ansible) for environment management and operations automation.
■CI/CD: Practical experience building, maintaining, and optimizing CI/CD pipelines; familiar with tools such as GitHub Actions, GitLab CI, and Jenkins.
■Monitoring & Observability: Familiar with monitoring, logging, and alerting systems (e.g., Prometheus, Grafana, ELK Stack); able to build and optimize a monitoring stack independently.
■Networking Fundamentals: Senior-level knowledge of computer networking fundamentals, including the principles and configuration of DNS and CDN.