
SENIOR SITE RELIABILITY ENGINEER
KMC Work Location: OFFSITE
Location: Taguig City, Metro Manila
Date Posted: 2022-09-28
Hiring Organization: KMC Solutions | XTN-8CE0548
Career Category: Network / System / Database Administration
About Prime Trust
Prime Trust powers innovation in the digital economy by providing fintech and digital asset innovators with financial infrastructure. Through a full suite of APIs, we help clients build seamlessly, launch quickly, and scale securely. Regulated by the State of Nevada, Prime Trust processes hundreds of millions of API calls and settles billions of transactions per month. Prime Trust’s team has extensive regulatory and financial services backgrounds from the OCC, SEC, Federal Reserve, US Department of Justice, US Treasury/Secret Service, JPMorgan Chase, Green Dot, American Express, PNC, Bank of America, and Visa. The company is recognized by Forbes as America’s Best Startup Employer 2022 and is also Great Place to Work-Certified™ 2022. Prime Trust has also been named to CB Insights Blockchain 50 for 2022. Visit us at www.primetrust.com and connect with us on LinkedIn, Twitter, and Facebook.
Job Summary
The SRE team at Prime Trust partners with the DevOps, Release Engineering, and Engineering organizations to improve the reliability and availability of our infrastructure and services. We are growing the team and looking for an experienced and talented Senior Site Reliability Engineer to join us. The primary focus of the SRE team is knowing the availability of all environments (production and non-production). Successful engineers leverage their deep knowledge of the services to choose the signals and thresholds that best determine and maintain site health.
We are a highly professional and friendly team that enjoys working together in a collaborative environment. We work closely with the Engineering team to create the next financial solution. We loathe technical debt and toil, and we seek to have tooling and systems in place that ease the burden for everyone. We want to maximize team knowledge and professional growth by setting aside time to do research and find ways to continually improve our areas of responsibility.
Job Responsibilities
- Build, improve, and maintain a comprehensive multi-environment monitoring solution
- Leverage our monitoring infrastructure to obtain maximum observability into our environments
- Work with DevOps, Release Engineering, Engineering, and Customer Success teams to improve reporting of the applications the business needs
- Test environment resiliency by leveraging Chaos Engineering (cluster degradation, network latencies, pod/region failure) and application fuzz testing
- Perform load testing to determine our capacity limits
- Maintain a comprehensive understanding of the impact of site availability and of the availability of its dependent services
- Create and maintain dashboards that provide relevant information to outside teams (Engineering, C-Staff, etc.) and internally within the Platform Operations team
- Responsible for “right sizing” resources for all workloads
- Partner with the NOC to develop playbooks, continuously improve auto-remediation to reduce MTTR, and use predictive analytics to reduce MTTD to zero
- Provide cost reporting
- Maintain an authoritative inventory database
- Responsible for all alerting and its effectiveness
- Responsible for synthetic testing
- Work with Engineering to troubleshoot and resolve application issues that reduce availability
- Work with Product and Engineering to determine KPIs and formulate SLAs and SLOs
Experience & Skills Requirements
- Experience with instrumentation, logging, and alerting methods and best practices
- A proactive approach to problem-solving
- Be an example to others through demonstrated professionalism, discipline, humility, and collaboration
- A passion for making data-driven decisions
- 5+ years of experience working in AWS
- 5+ years of experience working as an SRE
Education
A bachelor's degree in computer science, engineering, or a related field is required
What's In It For You
- Diverse learning & growth opportunities
- Accessible Cloud HR platform (Sprout)
- Above-standard leave
- A competitive salary above market value
- Internet Allowance
- HMO coverage upon hire, including dependents
- Life Insurance
- Temporary work-from-home arrangement
- Fast-growing company