Senior Site Reliability Engineer Apigee
THIS JOB HAS EXPIRED
About Apigee, The API Company
Apps are changing the way we live, and APIs are the secret ingredient that makes apps work. Apigee gives businesses and developers everything they need to be successful in the app economy. Hundreds of companies including AT&T, eBay, Pearson, Gilt Groupe, and Walgreens use Apigee to reach new customers and drive innovation through APIs.
Apigee's API Platform enables businesses and developers to deliver well designed, scalable APIs and apps, drive developer adoption, and extract business value from their API ecosystem.
About Apigee People
Apigee hires smart people who love to solve hard problems and have fun. We're passionate. We love APIs, we love our customers, and we love application developers. We work as a team, fast and focused, learning as we go. We respect one another, our customers and everyone we do business with.
About Being a Senior Site Reliability Engineer
The Senior Site Reliability Engineer will take a key role on the SRE team, which is responsible for ensuring high site availability, service reliability and optimal service performance. Our team is responsible for almost every service offered, involving some of the largest deployments of cutting-edge technologies in the world. Through small teams based in Palo Alto, California and Bangalore, India, the SRE team provides 24/7 oversight and support of the infrastructure and services that power the Apigee platform service.
The Senior Site Reliability Engineer must be able to work independently, multi-task across several concurrent problems, and triage and prioritize as necessary, exercising discretion and independent judgment, including marshaling the appropriate internal resources during high-pressure situations. The Senior Site Reliability Engineer has a strong sense of responsibility and problem ownership and is committed to driving issues to completion. The Sr. SRE adapts quickly, can assemble working solutions across a broad technology stack, and works with engineering teams on long-term fixes.
The Senior Site Reliability Engineer will:
- Ramp up quickly and demonstrate ownership and accountability of Site Availability and Service Reliability
- Analyze product performance and reliability, proposing and driving enhancements and improvements
- Provide technical leadership across all of Apigee's production properties and SRE core competencies
- Effectively balance tactical and strategic deliverables amid changing priorities in a fast-paced environment
- Actively drive technical RCA and root cause resolution to preserve and improve availability
- Collaborate with fellow SREs and other teams to investigate and resolve complex problems
- Be responsible for daily management and oversight of SRE work queues to SLA
- Have a strong commitment to all aspects of documentation
- Own automation and tooling for systems management to support the SRE charter
- Communicate effectively to achieve quick diagnosis, resolution or effective handover of issues for progressive troubleshooting
- Perform periodic on-call duty as part of a global team maintaining the availability and performance of the Apigee site and APIs used by third-party services, as well as the various internal services and systems on which these core interfaces depend
- Handle ambiguous situations effectively
Think you might be our next Senior Site Reliability Engineer? You bring to the table...
- Prior experience in a fast-paced, high-stress environment, resolving multiple interrupt-driven priorities simultaneously
- 5+ years of experience with distributed Unix/Linux systems administration and performance tuning
- 3+ years of advanced AWS experience preferred, or similar experience with another cloud service provider
- 1-2+ years Puppet experience
- 3+ years of experience with load balancing, storage and clustering technologies
- Solid understanding of TCP/IP networking and switching and proven ability to diagnose and resolve networking issues
- Strong understanding of web application architectures
- Proficiency in one or more of the following languages for operation scripting and text processing (Python, PHP, Perl, or Ruby); Python experience is preferred
- Troubleshooting skills that range from diagnosing low-level hardware issues to large-scale failures within or across datacenter clusters
- Excellent communication abilities to all levels
- Good organization skills and the ability to work independently
- Experience setting up and administering network management systems and monitoring tools such as Nagios and Graphite
- Working experience with Incident Management, Change Management and Problem Management
Apigee offers great compensation, work-life balance, health insurance coverage, insurance for your financial protection, and savings/investment plans. This includes Medical, Dental, Vision, Life Insurance, Short Term and Long Term Disability, Flexible Spending Accounts, and 401(k).
We are a non-accrued vacation time company: we allow as much time as you need for personal and vacation matters, with proper management approval. There is freedom in planning the workday, with flexible start and stop times subject to the company's needs.
Apigee is an equal opportunity employer and does not discriminate on the basis of race, sex, age, national origin, religion, physical or mental handicaps or disabilities, marital status, veteran status, sexual orientation, or any other basis prohibited by law.
Palo Alto, CA