Location: Alpharetta, GA or Berkeley Heights, NJ
Work Mode: Hybrid (Onsite 2–3 days per week)
Employment Type: Contract (W2 only – No C2C)
Duration: Multi-year engagement, extended annually
H1-B Transfer: Available for the right candidate
About the Role
We are seeking four Site Reliability Engineers (SREs): two senior engineers with 8 years of experience and two junior engineers with at least 4 years of experience. All four positions require hands-on expertise in telemetry, observability, and site monitoring platforms. This is a hybrid contract role based in Alpharetta, GA or Berkeley Heights, NJ.
This role requires proven, production-level experience with enterprise observability stacks. Candidates without demonstrable telemetry and monitoring expertise will not be considered.
Key Responsibilities
- Design, implement, and maintain comprehensive telemetry and observability solutions across distributed enterprise systems with complex architectures.
- Build, optimize, and scale real-time monitoring dashboards, metrics pipelines, and intelligent alerting systems using industry-standard tools such as Datadog, Splunk, Prometheus, Grafana, the ELK Stack, and similar platforms.
- Implement end-to-end observability strategies encompassing metrics, logs, traces, and events to ensure complete system visibility.
- Develop and maintain custom instrumentation for applications and infrastructure to capture critical telemetry data.
- Collaborate with engineering teams to embed reliability practices and ensure systems are resilient, observable, and performant.
- Automate monitoring workflows, alert management, and reliability tasks using Python, Shell, or Go scripting.
- Lead incident response efforts: rapidly identify, troubleshoot, and resolve production issues using observability data and telemetry analysis.
- Design and implement SLOs/SLIs, error budgets, and reliability KPIs with corresponding monitoring and alerting for mission-critical services.
- Develop self-healing and auto-remediation capabilities leveraging observability insights.
- Partner with DevOps, Cloud, and Security teams to integrate observability into CI/CD pipelines and optimize infrastructure reliability.
- Conduct post-incident reviews with detailed telemetry analysis and drive systemic improvements.
Mandatory Skills & Qualifications
Telemetry & Observability (MANDATORY)
Candidates MUST demonstrate hands-on, production experience with the following:
- Observability Platforms (REQUIRED): Deep expertise in at least TWO of the following:
  - Datadog (metrics, APM, logs, traces)
  - Splunk (log aggregation, search, alerting, dashboards)
  - Prometheus (time-series metrics, PromQL, alerting rules)
  - Grafana (visualization, dashboard creation, data source integration)
  - ELK Stack (Elasticsearch, Logstash, Kibana)
- Telemetry & Monitoring Fundamentals (REQUIRED):
  - Building and maintaining metrics collection pipelines
  - Log aggregation, parsing, and analysis at scale
  - Distributed tracing and application performance monitoring (APM)
  - Creating actionable alerts with proper signal-to-noise ratios
  - Dashboard design for real-time system health visualization
  - Metrics instrumentation and custom telemetry implementation
- Observability Best Practices (REQUIRED):
  - Implementing the three pillars of observability: metrics, logs, and traces
  - Correlation of telemetry data across multiple sources
  - Establishing observability for microservices and distributed systems
  - Capacity planning using historical telemetry data
  - Performance baselining and anomaly detection
Core SRE Requirements (MANDATORY)
- 4–8 years of professional experience in Site Reliability Engineering or DevOps roles with significant focus on observability
- Proven track record in incident management and on-call support in enterprise production environments, using observability tools for rapid diagnosis
- Proficiency in Linux system administration, networking, and performance tuning
- Hands-on experience with cloud platforms (AWS, Azure, or GCP) including cloud-native monitoring solutions (CloudWatch, Azure Monitor, GCP Operations)
- Solid programming/scripting skills in Python, Bash, Go, or equivalent for automation and tooling
- Familiarity with containers and container orchestration (Docker, Kubernetes) and with monitoring containerized environments
- Experience designing and maintaining CI/CD pipelines (Jenkins, GitHub Actions, GitLab CI) with integrated monitoring and observability
Nice-to-Have Skills
- AIOps and intelligent monitoring: Experience with ML-based anomaly detection, predictive monitoring, and automated incident correlation
- OpenTelemetry: Implementation experience with OpenTelemetry for standardized observability instrumentation
- Infrastructure-as-code: Terraform, Ansible, Pulumi with monitoring-as-code practices
- Security observability: Integration of security monitoring, SIEM tools, and compliance frameworks with observability stacks
- Advanced telemetry tools: Experience with Jaeger, Zipkin, New Relic, AppDynamics, Dynatrace, or other specialized APM/observability platforms
- Custom metrics exporters: Development of Prometheus exporters or custom telemetry agents
- Cost optimization: Experience optimizing telemetry data retention and observability platform costs
Engagement Rules
- Contract Position (W2 only) – No C2C, No Agencies
- Number of positions: 4 (2 senior engineers with 8 years of experience and 2 junior engineers with at least 4 years of experience)
- Experience requirement: 4–8 years with mandatory telemetry/observability expertise
- Multi-year contract with annual extensions
- H1-B transfer available for the right candidate
- Hybrid onsite role (2–3 days per week, Alpharetta, GA or Berkeley Heights, NJ)
To Apply
Submit your resume with detailed descriptions of your telemetry, observability, and site monitoring experience, including:
- Specific tools and platforms you’ve implemented and managed in production
- Scale of systems monitored (number of hosts, services, data volume)
- Examples of observability solutions you’ve designed
- Incident response experiences leveraging telemetry data
Send to: jobs@clarkstech.com
Only candidates with demonstrable, production-level telemetry and observability expertise will be reviewed.