Site Reliability Engineer Apigee
THIS JOB HAS EXPIRED
About Apigee, The API Company
Apps are changing the way we live, and APIs are the secret ingredient that makes apps work. Apigee gives businesses and developers everything they need to be successful in the app economy. Hundreds of companies including AT&T, eBay, Pearson, Gilt Groupe, and Walgreens use Apigee to reach new customers and drive innovation through APIs.
Apigee's API Platform enables businesses and developers to deliver well-designed, scalable APIs and apps, drive developer adoption, and extract business value from their API ecosystem.
About Apigee People
Apigee hires smart people who love to solve hard problems and have fun. We're passionate. We love APIs, we love our customers, and we love application developers. We work as a team, fast and focused, learning as we go. We respect one another, our customers and everyone we do business with.
About being a Site Reliability Engineer at Apigee
As a member of Site Reliability Engineering, you'll be part of the team responsible for ensuring high site availability, service reliability and optimal service performance. Our team is responsible for almost every service offered, involving some of the largest deployments of cutting-edge technologies in the world. Through small teams based in Palo Alto, California and Bangalore, India, the SRE team provides 24/7 oversight and support of the infrastructure and services that power the Apigee platform service.
A member of the SRE team must be able to work independently and multi-task among several concurrent problems, performing triage and prioritization as necessary through the exercise of discretion and independent judgment, including marshaling the appropriate internal resources during high-pressure situations. The Site Reliability Engineer has a strong sense of responsibility and problem ownership and is committed to driving issues to completion. The SRE adapts quickly, is capable of assembling working solutions across a broad technology stack, and works with engineering teams on long-term fixes.
The Site Reliability Engineer will:
- Perform front line monitoring/support and initial response for automated and manually generated events for all Apigee properties
- Collaborate with fellow SREs and other teams on investigating and resolving complex problems
- Manage and oversee SRE work queues daily to meet SLAs
- Document tickets with actions taken and key learnings identified during incidents, including recommendations for follow-up improvements from root cause analysis
- Contribute to the design and improvement of automation and tools for systems management to support the SRE charter
- Communicate effectively with fellow SREs and other engineering teams, and describe problems succinctly with sufficient detail that you can hand off an ongoing problem to another team or a peer for completion
- Transfer information accurately and engage positively with other teams; both are vital SRE responsibilities
- Perform periodic on-call duty as part of a global team maintaining the availability and performance of the Apigee site and APIs used by third-party services, as well as the various internal services and systems that these core interfaces depend on
- Perform software installations
- Maintain a strong focus on authoring documentation and creating runbooks
- Manage configuration changes for the deployed systems (approx. 5%)
- Handle ambiguous situations effectively
Think you might be our next Site Reliability Engineer? You bring to the table...
- Prior experience in a fast-paced, high-stress environment, resolving multiple interrupt-driven priorities simultaneously, preferred
- 3-5 years of experience with distributed Unix/Linux systems administration and performance tuning
- 1-2 years of experience with AWS or a similar cloud service provider preferred
- 2-3 years of experience with load balancing, storage and clustering technologies
- Solid understanding of TCP/IP networking and switching and proven ability to diagnose and resolve networking issues
- Proficiency in one of the following languages for operations scripting and text processing is expected: Python, PHP, Perl, or Ruby. Python experience is preferred.
- Troubleshooting skills that range from diagnosing low-level hardware issues to large-scale failures within or across datacenter clusters
- The ability to work independently
- Experience with network management systems and monitoring tools such as Nagios and Graphite
- Working experience with Incident Management, Change Management and Problem Management
Apigee offers great compensation, work-life balance, health insurance coverage, insurance for your financial protection, and savings/investment plans. This includes Medical, Dental, Vision, Life Insurance, Short Term and Long Term Disability, Flexible Spending Accounts, and 401(k).
We are a non-accrued vacation time company: we allow as much time as you need for personal and vacation matters, with proper management approval. There is freedom in planning the workday, with flexible start and stop times subject to the company's needs.
Apigee is an equal opportunity employer and does not discriminate on the basis of race, sex, age, national origin, religion, physical or mental handicaps or disabilities, marital status, veteran status, sexual orientation, or any other basis prohibited by law.
Palo Alto, CA