Sr Site Reliability Engineer
Remote, CA 94089 US
Connexion’s mission is to provide "best in class" services to job seekers. We strive to achieve excellence in job placement, staffing, and recruiting services, while treating candidates with the professionalism and respect they deserve.
Title: Sr Site Reliability Engineer - BH13226
Hiring Organization: Connexion Systems & Engineering
Sr. Site Reliability Engineer - Remote locations available
We are looking for an SRE to join our talented team and build high-quality technology solutions that revolutionize wireless networks, powered by Artificial Intelligence in the cloud. The company provides services through SaaS applications to several Fortune 100 and Fortune 500 customers. You'll take ops projects from concept through to launch. You will be responsible for maintaining and improving the company's production environment for rapid scaling and outstanding performance, and for helping us keep stellar uptime and reliability. The improvements you implement will be felt by the entire organization.
As a Site Reliability Engineer (SRE), you will be responsible for keeping our cloud-based services, streaming frameworks, NoSQL/RDBMS databases, and distributed analytical platforms running in multi-cloud environments to deliver unprecedented IT automation and insight into user experiences, driven by our AI services, across our customers' geographically distributed networks.
- Build infrastructure as code using Terraform, Ansible, and Kubernetes.
- Manage and performance-tune either databases (Postgres, Redis, Cassandra, Elasticsearch) or streaming data pipelines (Kafka, Flink, Storm, Spark frameworks).
- Manage CI/CD pipelines, configuration, and automation tools for infrastructure provisioning.
- Write and maintain runbooks for knowledge-driven automated processes and bots.
- Do capacity planning based on performance, usage, and utilization stats.
- Partner with developers and quality engineering teams to automate the monitoring, alerting, availability and scalability of our applications and systems.
- Ensure system availability and business continuity by implementing redundant servers/services.
- Manage after-hours infrastructure updates and maintenance.
- Proactively research and propose the use of new concepts, processes, technologies, and tools.
- Handle proactive monitoring, diagnosis, on-call rotation, and resolution of issues in a 24x7 multi-cloud environment (AWS/GCP); analyze failures and support software engineers in debugging production issues across microservices and distributed platforms.
- Follow SRE best practices and procedures.
Experience required for you to be successful:
- An extensive background in developing and operating large-scale cloud-based distributed applications
- Direct experience developing/running applications on AWS and Google Cloud.
- Laser focus and the ability to design infrastructure solutions for scalability, reliability, high availability, performance, software maintainability, and operational excellence
- The ability to "fix the plane while in flight" (not just support greenfield solutions)
- The ability to prioritize existing technical and infrastructure debt, and the experience to build and execute a plan to pay it off
- Delivering reliable operations for web-scale infrastructure for a global market at high release velocity
- Must have solid experience with at least two of the following languages: Go, Java, Node.js, Python.
- Strong experience with Kafka, Spark, Storm, Cassandra, Elasticsearch, PostgreSQL, Redis, ZooKeeper, Nginx.
- 10+ years of industry experience managing infrastructure.
- 6+ years maintaining production systems on AWS and/or GCP.
- 3+ years of experience managing Kubernetes in a large-scale production environment.
- Strong familiarity with running and optimizing relational and NoSQL databases.
- Strong demonstrated experience using infrastructure-as-code tools (e.g., Terraform, AWS CloudFormation, Google Cloud Deployment Manager).
- 5+ years of experience with continuous integration practices and tools (Jenkins, Travis CI, CircleCI, etc.).
- Experience with monitoring solutions such as CloudWatch, Stackdriver, Prometheus, Graphite, Grafana, ELK, SignalFx, Splunk, Alert Logic, Datadog.
- Experience with Linux administration in a large-scale SaaS environment.
Please use the apply button to submit your resume for consideration. A Connexion Representative will contact you immediately.
When responding to this job posting you MUST include the Job# and Job Title in your subject line.
If you are active in a job search but this job is not for you, please reach out to firstname.lastname@example.org. We would be glad to help you find the perfect job!