DevOps Engineer / Site Reliability Engineer - video streaming service
We have the following urgent role with our direct client:
Title: DevOps Engineer / Site Reliability Engineer - video streaming service
Location: Sunnyvale, CA
Duration: 6+ months
Compensation: Competitive (DOE)
Priority will be given to candidates who:
- Have worked in a 24/7, paying-customer-facing environment.
- Have site reliability experience (a plus).
- Have been part of an on-call rotation for customer-facing, revenue-generating infrastructure.
As a DevOps Engineer, you will work with world-class engineering and DevOps teams to scale up our application support and application launches and to bring service availability to 99.99%. You will be responsible for managing applications using automated build, release, and deployment, along with monitoring and alerting.
• Supporting and monitoring a 24/7 high-quality video streaming service in a fast-paced, startup-like environment.
• Participating in the on-call rotation and carrying the on-call phone.
• Promptly responding to alerts and other issues raised by customers or internal business units.
• Promptly notifying relevant parties and acknowledging alerts.
• Promptly working with vendors, third-party service providers (content delivery networks, payment gateways, content licensing partners, etc.) and internal teams to solve production issues.
• Collecting approvals and running software rollouts once they are approved.
• Working on day-to-day tasks like managing user access and supporting internal business units.
• Troubleshooting and fixing infrastructure issues from the hardware layer to the application layer with minimal or no supervision.
• Tuning monitoring and alerting systems to avoid false positives, and updating alert-related settings.
• Reviewing alerts and emails, and updating relevant teams in meetings about issues and outages.
• Knowledge of, or at least familiarity with, the Linux shell, system internals, networking, Java applications, and MySQL databases.
• Love for debugging: troubleshooting and debugging monitoring alerts and internet connectivity issues.
• Ability to read and understand server and system logs and produce meaningful issue analyses.
• Good analytical and troubleshooting skills, and intuition about probable root causes.
• Familiarity with Splunk, Python, Apache, rsync, and monitoring/alerting tools such as Nagios, PagerDuty, and OpsGenie is a plus.