Overview:
SOFTSWISS continues to expand its team and is looking for a Monitoring Systems Engineer. We need an experienced, accomplished professional who shares our culture and values.
Key responsibilities:
The two main pillars of our workflow are:
Maintaining and Enhancing the Monitoring Systems:
- Collaborating with other teams to understand and define their monitoring needs, then implementing the right solutions
- Setting up and adjusting the monitoring/observability systems for various teams
- Designing and tweaking alerts and dashboards to suit specific needs
- Refining alerts to reduce noise and make each notification more meaningful and actionable
- Improving dashboards for clarity and a more comprehensive view of the systems
- Building and maintaining integrations between the monitoring systems and other platforms, such as Jira and Opsgenie, when required
- Creating and maintaining a Knowledge Base covering system configurations, alert processes, troubleshooting guidelines, and user manuals
- Keeping up with the latest trends and best practices to continuously improve the organization’s monitoring capabilities
Responding to Events/Monitoring Alerts (L1/L2 support for certain parts of the system):
- Providing on-call coverage, including day and night shifts
- Handling incidents by troubleshooting and resolving issues, and escalating to third-party or vendor support when necessary
- Routing issues or queries to the relevant department as needed
- Keeping detailed records of current infrastructure issues and Root Cause Analyses (RCAs)
Key technologies:
Monitoring/observability tools (experience with at least two of the following):
- Zabbix (familiarity with concepts such as low-level discovery (LLD), prototypes, dependencies, and preprocessing)
- Grafana (knowledge of data sources, dashboard creation, and query usage)
- Prometheus/VictoriaMetrics/etc. (understanding of metrics collection and alerting)
- ELK/Splunk/etc. (ability to use queries and filters for log analysis)
- Site24x7/Pingdom/etc. (experience with web monitoring and performance metrics)
Linux-based operating systems
Strong understanding of key concepts, including:
- File systems
- Process management
- Built-in monitoring tools
- Scripting
- Troubleshooting
Experience needed:
- 3+ years of experience as a Systems Engineer, SRE, DevOps, or Monitoring Support Engineer
- Good understanding of Linux-based operating systems, particularly Debian-based distributions
- Experience with containerization, virtualization, and orchestration (Docker, Kubernetes, LXC/LXD)
- Development experience in a scripting or programming language (Bash, Python, Ruby, Golang, etc.) and familiarity with REST APIs
- Knowledge of basic database concepts, including transactions and the write-ahead log (WAL); experience with PostgreSQL is preferred
- Excellent communication skills in Russian and English at B1+ level. You must be able to understand technical terminology related to our tech stack and interpret technical documentation
Nice to have:
- Kafka
- RabbitMQ
- GitLab
- Nginx/Puma
- Saltstack/Ansible
- ClickHouse
- PostgreSQL
- MongoDB
- HashiCorp Vault
- Kubernetes
- Any Infrastructure as Code (IaC) tool