Case Study: German Research Center for Artificial Intelligence (DFKI)
Long-Term Storage of Machine & Job-Related Metrics in a Slurm Cluster with VictoriaMetrics
The German Research Center for Artificial Intelligence (DFKI) was founded in 1988 as a non-profit public-private partnership. It has research facilities in Kaiserslautern, Saarbrücken and Bremen, a project office in Berlin, a laboratory in Niedersachsen and branch offices in Lübeck, St. Wendel and Trier. DFKI is Germany's leading research center in the field of innovative commercial software technology using Artificial Intelligence.
Challenge
Traditionally, each research group at DFKI used its own hardware. In mid-2020, we started an initiative to consolidate existing (and future) hardware into a central Slurm cluster so that our researchers and students could run more and larger experiments.
The cluster is based on the NVIDIA DeepOps stack, which includes Prometheus for short-term metric storage. Our users liked the level of detail our custom dashboards gave them compared with our previous Zabbix-based solution, so we decided to extend the retention period to several years and set out to find the best option on the market for the task.
Ideally, we wanted PhD students to still be able to access data from even their earliest experiments while finishing their theses. Since we do everything on-premises, we needed a solution with strong compression that is extremely space-efficient.
Solution
VictoriaMetrics kept showing up in searches and benchmarks of time series database performance, and it consistently came out on top in terms of required storage. Quite frankly, the published numbers looked like magic, so we decided to put them to the test.
Why VictoriaMetrics Was Chosen Over Other Solutions
- First impressions in our testing were excellent: we simply downloaded the binary and pointed it at a storage location, with almost no configuration required. Apart from minor tweaks to the command line (turning on deduplication) and running it as a systemd unit (see the sketch after this list), we still use the same instance from those first tests today. VictoriaMetrics was superior to Prometheus in every measurable way: it used considerably less CPU time and RAM, and only a third of the storage.
- While storage efficiency was initially our primary driver, the simplicity of setting up a testbed helped guide our decision as well. Seeing how effortlessly the single-node VictoriaMetrics instance manages our current setup gives us confidence that it will keep up with our growth for quite a while. And when we do eventually outgrow it, there is always the robust cluster variant of VictoriaMetrics to turn to.
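A minimal sketch of a systemd unit along the lines described above; the unit name, paths, retention period and deduplication interval are illustrative assumptions, not our exact production settings:

```ini
# /etc/systemd/system/victoria-metrics.service (hypothetical)
[Unit]
Description=VictoriaMetrics single-node time series database
After=network.target

[Service]
# -retentionPeriod accepts suffixes such as w and y; 5y keeps five years of data.
# -dedup.minScrapeInterval drops duplicate samples within the given window.
ExecStart=/usr/local/bin/victoria-metrics-prod \
    -storageDataPath=/var/lib/victoria-metrics \
    -retentionPeriod=5y \
    -dedup.minScrapeInterval=30s
Restart=on-failure

[Install]
WantedBy=multi-user.target
```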
How VictoriaMetrics Is Used Today
- Two VictoriaMetrics instances behind promxy for high availability, each fed by its own vmagent for scraping (a minimal promxy sketch follows this list). Removing Prometheus also fixed an issue where it would scrape only a subset of all targets, requiring multiple restarts until it could be convinced to do its job properly.
- Painless migration process: we restored from the most recent backup, which took only a few minutes, and then backfilled the small gap in the data.
- Backfilled a larger dataset of 3.4B samples, which took only ~13 minutes while operations continued as normal (see the import example below).
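For illustration, a minimal promxy configuration for fronting two single-node instances as one logical datasource; the host names are hypothetical, and 8428 is simply the default single-node VictoriaMetrics port:

```yaml
# promxy config: query both VictoriaMetrics replicas as one source.
promxy:
  server_groups:
    - static_configs:
        - targets:
            - victoriametrics-a:8428
            - victoriametrics-b:8428
```

Backfilling can go through VictoriaMetrics' import API; a sketch, assuming the historical data was already exported to JSON lines (e.g. via /api/v1/export):

```sh
# Hypothetical backfill: push previously exported data into the new instance.
curl -X POST -T history.jsonl http://victoriametrics-a:8428/api/v1/import
```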
To summarize: satisfaction with VictoriaMetrics has only increased over here!
Technical Stats
The stats below were collected with the following MetricsQL queries (`median_over_time` is a VictoriaMetrics extension of PromQL):

- Peak active time series over the last 24h: `sum(max_over_time(vm_cache_entries{type="storage/hour_metric_ids"}[24h]))`
- New time series created per day (churn): `sum(increase(vm_new_timeseries_created_total[24h]))`
- Ingestion rate (samples/s): `sum(rate(vm_rows_inserted_total[24h]))`
- Total samples stored: `sum(vm_rows{type=~"storage/.+"})`
- Index entries: `sum(vm_rows{type="indexdb"})`
- Data size on disk: `sum(vm_data_size_bytes{type=~"storage/.+"})`
- Index size on disk: `sum(vm_data_size_bytes{type="indexdb"})`
- Average bytes per stored sample: `sum(vm_data_size_bytes) / sum(vm_rows{type=~"storage/.+"})`
- Range query rate (`/api/v1/query_range`): `sum(rate(vm_http_requests_total{path=~".*/api/v1/query_range"}[24h]))`
- Instant query rate (`/api/v1/query`): `sum(rate(vm_http_requests_total{path=~".*/api/v1/query"}[24h]))`
- 24h median request duration for range queries, per reported quantile, in seconds: `max(median_over_time(vm_request_duration_seconds{path=~".*/api/v1/query_range"}[24h])) by (quantile)`
  - 0.500: 0.001041498
  - 0.900: 0.003903785
  - 0.970: 0.005359947
  - 0.990: 0.006418689
- 24h median request duration for instant queries, per reported quantile, in seconds: `max(median_over_time(vm_request_duration_seconds{path=~".*/api/v1/query"}[24h])) by (quantile)`
  - 0.500: 0.000890873
  - 0.900: 0.002929274
  - 0.970: 0.005280216
  - 0.990: 0.009454632
- RAM usage (24h median resident memory): `sum(median_over_time(process_resident_memory_bytes[24h]))`
- CPU usage (average cores over 24h): `sum(rate(process_cpu_seconds_total[24h]))`
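Any of these can be evaluated ad hoc against the Prometheus-compatible HTTP API of a single-node instance; a quick sketch (host name assumed):

```sh
# Evaluate one of the queries above via the Prometheus-compatible query API.
curl -s 'http://victoriametrics-a:8428/api/v1/query' \
  --data-urlencode 'query=sum(rate(vm_rows_inserted_total[24h]))'
```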