ClarksTech – We provide comprehensive next generation IT services and solutions. ClarksTech – Specialized in delivering cutting-edge technology solutions that help businesses improve efficiency

Site Reliability Engineer (W2 only)

Location: Alpharetta, GA or Berkeley Heights, NJ
Work Mode: Hybrid (Onsite 2–3 days per week)
Employment Type: Contract (W2 only – No C2C)
Duration: Multi-year engagement, extended annually
H1-B Transfer: Available for the right candidate

About the Role

We are seeking four Site Reliability Engineers (SREs): two senior engineers with 8 years of experience and two junior engineers with at least 4 years of experience. All four positions require hands-on expertise in telemetry, observability, and site monitoring platforms. This is a hybrid contract role based in Alpharetta, GA or Berkeley Heights, NJ.

This role requires proven, production-level experience with enterprise observability stacks. Candidates without demonstrable telemetry and monitoring expertise will not be considered.

Key Responsibilities

  • Design, implement, and maintain comprehensive telemetry and observability solutions across distributed enterprise systems with complex architectures.
  • Build, optimize, and scale real-time monitoring dashboards, metrics pipelines, and intelligent alerting systems using industry-standard tools including Datadog, Splunk, Prometheus, Grafana, ELK Stack, and similar platforms.
  • Implement end-to-end observability strategies encompassing metrics, logs, traces, and events to ensure complete system visibility.
  • Develop and maintain custom instrumentation for applications and infrastructure to capture critical telemetry data.
  • Collaborate with engineering teams to embed reliability practices and ensure systems are resilient, observable, and performant.
  • Automate monitoring workflows, alert management, and reliability tasks using Python, Shell, or Go scripting.
  • Lead incident response efforts: rapidly identify, troubleshoot, and resolve production issues using observability data and telemetry analysis.
  • Design and implement SLOs/SLIs, error budgets, and reliability KPIs with corresponding monitoring and alerting for mission-critical services.
  • Develop self-healing and auto-remediation capabilities leveraging observability insights.
  • Partner with DevOps, Cloud, and Security teams to integrate observability into CI/CD pipelines and optimize infrastructure reliability.
  • Conduct post-incident reviews with detailed telemetry analysis and drive systemic improvements.

Mandatory Skills & Qualifications

Telemetry & Observability (MANDATORY)

Candidates MUST demonstrate hands-on, production experience with the following:

  • Observability Platforms (REQUIRED): Deep expertise in at least TWO of the following:
    • Datadog (metrics, APM, logs, traces)
    • Splunk (log aggregation, search, alerting, dashboards)
    • Prometheus (time-series metrics, PromQL, alerting rules)
    • Grafana (visualization, dashboard creation, data source integration)
    • ELK Stack (Elasticsearch, Logstash, Kibana)
  • Telemetry & Monitoring Fundamentals (REQUIRED):
    • Building and maintaining metrics collection pipelines
    • Log aggregation, parsing, and analysis at scale
    • Distributed tracing and application performance monitoring (APM)
    • Creating actionable alerts with proper signal-to-noise ratios
    • Dashboard design for real-time system health visualization
    • Metrics instrumentation and custom telemetry implementation
  • Observability Best Practices (REQUIRED):
    • Implementing the three pillars of observability: metrics, logs, and traces
    • Correlation of telemetry data across multiple sources
    • Establishing observability for microservices and distributed systems
    • Capacity planning using historical telemetry data
    • Performance baselining and anomaly detection

Core SRE Requirements (MANDATORY)

  • 4–8 years of professional experience in Site Reliability Engineering or DevOps roles with significant focus on observability
  • Proven track record in incident management and on-call support in enterprise production environments, using observability tools for rapid diagnosis
  • Proficiency in Linux system administration, networking, and performance tuning
  • Hands-on experience with cloud platforms (AWS, Azure, or GCP), including cloud-native monitoring solutions (CloudWatch, Azure Monitor, Google Cloud's operations suite)
  • Solid programming/scripting skills in Python, Bash, Go, or equivalent for automation and tooling
  • Familiarity with containers and container orchestration (Docker, Kubernetes) and with monitoring containerized environments
  • Experience designing and maintaining CI/CD pipelines (Jenkins, GitHub Actions, GitLab CI) with integrated monitoring and observability

Nice-to-Have Skills

  • AIOps and intelligent monitoring: Experience with ML-based anomaly detection, predictive monitoring, and automated incident correlation
  • OpenTelemetry: Implementation experience with OpenTelemetry for standardized observability instrumentation
  • Infrastructure-as-code: Terraform, Ansible, Pulumi with monitoring-as-code practices
  • Security observability: Integration of security monitoring, SIEM tools, and compliance frameworks with observability stacks
  • Advanced telemetry tools: Experience with Jaeger, Zipkin, New Relic, AppDynamics, Dynatrace, or other specialized APM/observability platforms
  • Custom metrics exporters: Development of Prometheus exporters or custom telemetry agents
  • Cost optimization: Experience optimizing telemetry data retention and observability platform costs

Engagement Rules

  • Contract Position (W2 only) – No C2C, No Agencies
  • Number of positions: 4 (2 senior engineers with 8 years of experience and 2 junior engineers with at least 4 years of experience)
  • Experience requirement: 4–8 years with mandatory telemetry/observability expertise
  • Multi-year contract with annual extensions
  • H1-B transfer available for the right candidate
  • Hybrid onsite role (2–3 days per week, Alpharetta, GA or Berkeley Heights, NJ)

👉 To Apply

Submit your resume with detailed descriptions of your telemetry, observability, and site monitoring experience, including:

  • Specific tools and platforms you’ve implemented and managed in production
  • Scale of systems monitored (number of hosts, services, data volume)
  • Examples of observability solutions you’ve designed
  • Incident response experiences leveraging telemetry data

Send to: jobs@clarkstech.com

Only candidates with demonstrable, production-level telemetry and observability expertise will be reviewed.
