Location: Alpharetta, GA or Berkeley Heights, NJ
Work Mode: Hybrid (Onsite 2–3 days per week)
Employment Type: Contract (W2 only – No C2C)
Duration: Multi-year engagement, extended annually
H1-B Transfer: Available for the right candidate

About the Role

We are seeking four Site Reliability Engineers (SREs): two senior engineers with 8 years of experience and two junior engineers with at least 4 years of experience. All four positions require mandatory, hands-on expertise in telemetry, observability, and site monitoring platforms. This is a hybrid contract role based in Alpharetta, GA or Berkeley Heights, NJ.

This role requires proven, production-level experience with enterprise observability stacks. Candidates without demonstrable telemetry and monitoring expertise will not be considered.

Key Responsibilities

Design, implement, and maintain comprehensive telemetry and observability solutions across distributed enterprise systems with complex architectures.
Build, optimize, and scale real-time monitoring dashboards, metrics pipelines, and intelligent alerting systems using industry-standard tools such as Datadog, Splunk, Prometheus, Grafana, and the ELK Stack.
Implement end-to-end observability strategies encompassing metrics, logs, traces, and events to ensure complete system visibility.
Develop and maintain custom instrumentation for applications and infrastructure to capture critical telemetry data.
Collaborate with engineering teams to embed reliability practices and ensure systems are resilient, observable, and performant.
Automate monitoring workflows, alert management, and reliability tasks using Python, Shell, or Go scripting.
Lead incident response efforts: rapidly identify, troubleshoot, and resolve production issues using observability data and telemetry analysis.
Design and implement SLOs/SLIs, error budgets, and reliability KPIs with corresponding monitoring and alerting for mission-critical services.
Develop self-healing and auto-remediation capabilities leveraging observability insights.
Partner with DevOps, Cloud, and Security teams to integrate observability into CI/CD pipelines and optimize infrastructure reliability.
Conduct post-incident reviews with detailed telemetry analysis and drive systemic improvements.

Mandatory Skills & Qualifications

Telemetry & Observability (MANDATORY)

Candidates MUST demonstrate hands-on, production experience with the following:

Observability Platforms (REQUIRED): Deep expertise in at least TWO of the following:
- Datadog (metrics, APM, logs, traces)
- Splunk (log aggregation, search, alerting, dashboards)
- Prometheus (time-series metrics, PromQL, alerting rules)
- Grafana (visualization, dashboard creation, data source integration)
- ELK Stack (Elasticsearch, Logstash, Kibana)

Telemetry & Monitoring Fundamentals (REQUIRED):
- Building and maintaining metrics collection pipelines
- Log aggregation, parsing, and analysis at scale
- Distributed tracing and application performance monitoring (APM)
- Creating actionable alerts with proper signal-to-noise ratios
- Dashboard design for real-time system health visualization
- Metrics instrumentation and custom telemetry implementation

Observability Best Practices (REQUIRED):
- Implementing the three pillars of observability: metrics, logs, and traces
- Correlation of telemetry data across multiple sources
- Establishing observability for microservices and distributed systems
- Capacity planning using historical telemetry data
- Performance baselining and anomaly detection

Core SRE Requirements (MANDATORY)

4–8 years of professional experience in Site Reliability Engineering or DevOps roles with a significant focus on observability
Proven track record in incident management and on-call support in enterprise production environments, using observability tools for rapid diagnosis
Proficiency in Linux system administration, networking, and performance tuning
Hands-on experience with cloud platforms (AWS, Azure, or GCP) including cloud-native monitoring solutions (CloudWatch, Azure Monitor, GCP Operations)
Solid programming/scripting skills in Python, Bash, Go, or equivalent for automation and tooling
Familiarity with containers and container orchestration (Docker, Kubernetes) and with monitoring containerized environments
Experience designing and maintaining CI/CD pipelines (Jenkins, GitHub Actions, GitLab CI) with integrated monitoring and observability

Nice-to-Have Skills

AIOps and intelligent monitoring: Experience with ML-based anomaly detection, predictive monitoring, and automated incident correlation
OpenTelemetry: Implementation experience with OpenTelemetry for standardized observability instrumentation
Infrastructure-as-code: Terraform, Ansible, Pulumi with monitoring-as-code practices
Security observability: Integration of security monitoring, SIEM tools, and compliance frameworks with observability stacks
Advanced telemetry tools: Experience with Jaeger, Zipkin, New Relic, AppDynamics, Dynatrace, or other specialized APM/observability platforms
Custom metrics exporters: Development of Prometheus exporters or custom telemetry agents
Cost optimization: Experience optimizing telemetry data retention and observability platform costs

Engagement Rules

Contract Position (W2 only) – No C2C, No Agencies
Number of Positions – 4 (2 Seniors with 8 years of experience and 2 juniors with at least 4 years of experience)
Experience requirement: 4–8 years, with mandatory telemetry/observability expertise
Multi-year contract with annual extensions
H1-B transfer available for the right candidate
Hybrid onsite role (2–3 days per week, Alpharetta, GA or Berkeley Heights, NJ)

👉 To Apply

Submit your resume with detailed descriptions of your telemetry, observability, and site monitoring experience, including:

Specific tools and platforms you’ve implemented and managed in production
Scale of systems monitored (number of hosts, services, data volume)
Examples of observability solutions you’ve designed
Incident response experiences leveraging telemetry data

Send to: jobs@clarkstech.com

Only candidates with demonstrable, production-level telemetry and observability expertise will be reviewed.

ClarksTech – We provide comprehensive next-generation IT services and solutions, and specialize in delivering cutting-edge technology solutions that help businesses improve efficiency.

Site Reliability Engineer (W2 only)