Overview:
SOFTSWISS continues to expand the team and is looking for a Monitoring Systems Engineer. We need a true, experienced, and accomplished professional who shares our culture and values.
Key responsibilities:
The two main pillars of our workflow are:
Responding to Events/Monitoring Alerts (L1/L2 tasks for certain system parts):
- Offering on-duty service coverage, encompassing day and night on-call shifts.
- Providing timely and effective solutions to technical problems reported by users.
- Communicating clearly with users to understand their issues and provide updates on resolution status.
- Addressing incidents by troubleshooting and resolving issues, and seeking assistance from third-party or vendor support when necessary.
- Directing issues or queries to the relevant department as needed.
- Keeping detailed records and documentation of current infrastructure challenges and Root Cause Analyses (RCAs).
- Creating detailed reports for all technical support incidents, including descriptions, resolutions, and timelines.
Maintaining and Enhancing the Monitoring Systems:
- Collaborating with other teams to understand and define their monitoring needs, then implementing the right solutions.
- Setting up and adjusting the monitoring/observability systems for various teams.
- Designing and tweaking alerts and dashboards to suit specific needs.
- Refining alerts to reduce irrelevant notifications and make the remaining ones more meaningful.
- Enhancing dashboards for better clarity, understanding, and a more comprehensive view.
- Building and maintaining integrations between the monitoring systems and other platforms such as Jira, Opsgenie, etc., when required.
- Establishing and updating a Knowledge Base, covering system configurations, alert processes, troubleshooting guidelines, and user manuals.
- Staying up to date with the newest trends and best practices to continuously improve our organization’s monitoring capabilities.
Requirements:
- Minimum of 3 years in technical support roles such as Systems Engineer, SRE, DevOps, or Monitoring Support Engineer (L1/L2 Technical).
- Proven track record in providing L1/L2 support, including incident management, troubleshooting, and customer interaction.
- Good understanding of Linux-like operating systems (Debian-based).
- Experience with containerization, virtualization, and orchestration (LXC/LXD, Docker, Kubernetes).
- Development experience in any scripting language (Bash, Python, Go, etc.) and familiarity with REST APIs.
- Knowledge of basic database concepts (experience with PostgreSQL is preferable), including transactions and WAL.
- English proficiency at an Intermediate (B1) level or higher. It’s crucial to understand technical terminology related to our specific tech stack and to be able to interpret technical documentation. Verbal skills are important too.
Skills & Experience:
Monitoring/observability tools (experience with at least two of the following):
- Zabbix (familiarity with concepts such as LLD, prototypes, dependencies, and preprocessing)
- Grafana (knowledge of data sources, dashboard creation, and query usage)
- Prometheus/VictoriaMetrics/etc. (understanding of metrics collection and alerting)
- ELK/Splunk/etc. (ability to use queries and filters for log analysis)
- Site24x7/Pingdom/etc. (experience with web monitoring and performance metrics)
Linux-like operating systems:
- Strong understanding of key concepts, including:
  - File systems
  - Process management
  - Built-in monitoring tools
  - Scripting
  - Troubleshooting
Familiarity with:
- Kafka
- RabbitMQ
- GitLab
- Nginx/Puma
- SaltStack/Ansible
- ClickHouse
- PostgreSQL
- MongoDB
- HashiCorp Vault
- Kubernetes
- Any IaC implementation