Case Study: German Research Center for Artificial Intelligence (DFKI)
Long Term Storage of Machine & Job-related Metrics in a Slurm Cluster with VictoriaMetrics

“It’s the whole package. Easy to use with sane defaults & delivers a completely hassle-free experience. Whenever I have to deal with other software, I wish it was VictoriaMetrics instead.”
Industry: Artificial Intelligence Research
Location: Kaiserslautern, Germany

The German Research Center for Artificial Intelligence (DFKI) was founded in 1988 as a non-profit public-private partnership. It has research facilities in Kaiserslautern, Saarbrücken and Bremen, a project office in Berlin, a laboratory in Niedersachsen and branch offices in Lübeck, St. Wendel and Trier. In the field of innovative commercial software technology using Artificial Intelligence, DFKI is the leading research center in Germany.

Main Benefits of Using VictoriaMetrics

Whole-Package-Solution
Easy to Use
Hassle-Free
Fully Prometheus-Compatible
On-Premise Friendly & Space-Efficient
Performance Monitoring Grafana Dashboard

Challenge

Traditionally, each research group at DFKI used its own hardware. In mid-2020, we started an initiative to consolidate existing (and future) hardware into a central Slurm cluster to enable our researchers and students to run more and larger experiments.

The cluster, based on the Nvidia DeepOps stack, included Prometheus for short-term metric storage. Our users liked the level of detail they got from our custom dashboards compared with our previous Zabbix-based solution, so we decided to extend the retention period to several years and needed to find the best option on the market for this task.

Ideally, we wanted PhD students to be able to access data from even their earliest experiments while finishing their theses. Since we do everything on-premise, we needed a solution with strong compression that is extremely space-efficient.

Solution

VictoriaMetrics kept showing up in searches and benchmarks on time series database performance and consistently came out on top when it came to required storage. Quite frankly, the presented numbers looked like magic, so we decided to put this to the test.

Why VictoriaMetrics Was Chosen Over Other Solutions

Excellent Trial Results
Measurably Superior to Other Solutions
Consumes Less CPU Time & RAM
Consumes ⅓ of the Storage
  • First impressions in our testing were excellent. Setup was trivial: we simply downloaded the binary and pointed it at a storage location; almost no configuration was required. Apart from minor tweaks to the command line (turning on deduplication) and running it as a systemd unit, we still use the same instance from those first tests today. VictoriaMetrics was superior to Prometheus in every measurable way: it used considerably less CPU time and RAM, and a third of the storage.
  • While initially storage efficiency was our primary driver, the simplicity of setting up a testbed definitely helped guide our decision as well. Seeing how effortlessly the single-node VictoriaMetrics instance manages our current setup gives us confidence that it will keep up with our growth for quite a while. When the time comes that we do outgrow it, there is always the robust cluster variant of VictoriaMetrics that we can turn to.
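The setup described above (a single binary pointed at a storage location, deduplication enabled, run as a systemd unit) can be sketched roughly as follows. The paths, retention value, and deduplication interval here are illustrative assumptions, not DFKI's actual values; `-storageDataPath`, `-retentionPeriod`, and `-dedup.minScrapeInterval` are real single-node VictoriaMetrics flags.

```ini
# /etc/systemd/system/victoriametrics.service — hypothetical unit file sketch
[Unit]
Description=VictoriaMetrics single-node time series database
After=network.target

[Service]
User=victoriametrics
# Flags: where to keep data, how long to keep it, and the interval used
# to deduplicate samples arriving from multiple scrapers.
ExecStart=/usr/local/bin/victoria-metrics-prod \
  -storageDataPath=/var/lib/victoria-metrics \
  -retentionPeriod=10y \
  -dedup.minScrapeInterval=30s
Restart=on-failure

[Install]
WantedBy=multi-user.target
```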

How VictoriaMetrics Is Used Today

  • Two VictoriaMetrics instances behind promxy for HA, with one vmagent each for scraping. Removing Prometheus fixed the issue where it would only scrape a subset of all targets, requiring multiple restarts until it could be convinced to do its job properly
  • Painless migration process. We restored from the most recent backup, which only took a few minutes and backfilled the small gap in data
  • Backfilled a larger dataset of 3.4B samples, which only took ~13 minutes to finish while operations continued as normal
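The HA layout above (two VictoriaMetrics instances, each fed by its own vmagent, queried through promxy) can be sketched with a minimal promxy configuration. The hostnames and ports are placeholder assumptions; the `server_groups`/`static_configs` structure is promxy's standard configuration format.

```yaml
# promxy.yaml — minimal sketch, assuming two single-node instances
# (vm-a, vm-b) listening on VictoriaMetrics' default port 8428.
promxy:
  server_groups:
    - static_configs:
        - targets:
            - vm-a:8428
            - vm-b:8428
      # Merge results from both replicas so a gap in one instance
      # is filled from the other.
      anti_affinity: 10s
```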

To summarize: Satisfaction with VictoriaMetrics has only increased over here!

Technical Stats

The maximum number of active time series during the last 24 hours
sum(max_over_time(vm_cache_entries{type="storage/hour_metric_ids"}[24h]))
130212
Daily time series churn rate
sum(increase(vm_new_timeseries_created_total[24h]))
7000-20000
The average ingestion rate over the last 24h
sum(rate(vm_rows_inserted_total[24h]))
24309.404768518518
The total number of datapoints
sum(vm_rows{type=~"storage/.+"})
157440268939
The total number of entries in inverted index
sum(vm_rows{type="indexdb"})
32116049
Data Size on Disk
sum(vm_data_size_bytes{type=~"storage/.+"})
82616524124
Index size on disk
sum(vm_data_size_bytes{type="indexdb"})
294114810
The average datapoint size on disk
sum(vm_data_size_bytes) / sum(vm_rows{type=~"storage/.+"})
0.5266325618708981
The average range query rate over the last 24h
sum(rate(vm_http_requests_total{path=~".*/api/v1/query_range"}[24h]))
2.0832407648523237
The average instant query rate over the last 24h
sum(rate(vm_http_requests_total{path=~".*/api/v1/query"}[24h]))
1.2442476851851851
Median range query duration quantiles over the last 24h
max(median_over_time(vm_request_duration_seconds{path=~".*/api/v1/query_range"}[24h])) by (quantile)
1          0.00867678
0.500  0.001041498
0.900  0.003903785
0.970  0.005359947
0.990  0.006418689
Median instant query duration quantiles over the last 24h
max(median_over_time(vm_request_duration_seconds{path=~".*/api/v1/query"}[24h])) by (quantile)
1          0.01897588
0.500  0.000890873
0.900  0.002929274
0.970  0.005280216
0.990  0.009454632
Median memory usage during the last 24h
sum(median_over_time(process_resident_memory_bytes[24h]))
2855964672
The average number of cpu cores used during the last 24h
sum(rate(process_cpu_seconds_total[24h]))
0.11834062363031682

Watch Your Monitoring SkyRocket With VictoriaMetrics!