Case Study: German Research Center for Artificial Intelligence (DFKI)
Long-Term Storage of Machine & Job-Related Metrics in a Slurm Cluster with VictoriaMetrics
The German Research Center for Artificial Intelligence (DFKI) was founded in 1988 as a non-profit public-private partnership. It has research facilities in Kaiserslautern, Saarbrücken and Bremen, a project office in Berlin, a laboratory in Niedersachsen and branch offices in Lübeck, St. Wendel and Trier. DFKI is Germany's leading research center in the field of innovative commercial software technology using Artificial Intelligence.
Challenge
Traditionally, each research group at DFKI used its own hardware. In mid-2020, we started an initiative to consolidate existing (and future) hardware into a central Slurm cluster so that our researchers and students could run more and larger experiments.
The cluster is based on the NVIDIA DeepOps stack, which includes Prometheus for short-term metric storage. Our users liked the level of detail our custom dashboards gave them compared with our previous Zabbix-based solution, so we decided to extend the retention period to several years and set out to find the best option on the market for the task.
Ideally, we wanted PhD students to still be able to access data from even their earliest experiments while finishing their theses. Since we do everything on-premises, we needed a solution with strong compression that is extremely space-efficient.
Solution
VictoriaMetrics kept showing up in searches and benchmarks of time series database performance, and it consistently came out on top in terms of required storage. Quite frankly, the published numbers looked like magic, so we decided to put them to the test.
Why VictoriaMetrics Was Chosen Over Other Solutions
- First impressions in our testing were excellent: we simply downloaded the binary and pointed it at a storage location, with almost no configuration required. Apart from minor tweaks to the command line (turning on deduplication) and running it as a systemd unit (see the sketch after this list), we still use the same instance from those first tests today. VictoriaMetrics was superior to Prometheus in every measurable way: it used considerably less CPU time and RAM, and only a third of the storage.
- While storage efficiency was initially our primary driver, the simplicity of setting up a testbed helped guide our decision as well. Seeing how effortlessly the single-node VictoriaMetrics instance manages our current setup gives us confidence that it will keep up with our growth for quite a while. And when we do eventually outgrow it, there is always the robust cluster variant of VictoriaMetrics to turn to.
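A minimal sketch of a systemd unit along the lines described above; the unit name, paths, retention period and deduplication interval are illustrative assumptions, not our exact production settings:

```ini
# /etc/systemd/system/victoria-metrics.service (hypothetical)
[Unit]
Description=VictoriaMetrics single-node time series database
After=network.target

[Service]
# -retentionPeriod accepts suffixes such as w and y; 5y keeps five years of data.
# -dedup.minScrapeInterval drops duplicate samples within the given window.
ExecStart=/usr/local/bin/victoria-metrics-prod \
    -storageDataPath=/var/lib/victoria-metrics \
    -retentionPeriod=5y \
    -dedup.minScrapeInterval=30s
Restart=on-failure

[Install]
WantedBy=multi-user.target
```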
How VictoriaMetrics Is Used Today
- Two VictoriaMetrics instances behind promxy for high availability, each fed by its own vmagent for scraping (a minimal promxy sketch follows this list). Removing Prometheus also fixed an issue where it would scrape only a subset of all targets, requiring multiple restarts until it could be convinced to do its job properly.
- Painless migration process: we restored from the most recent backup, which took only a few minutes, and then backfilled the small gap in the data.
- Backfilled a larger dataset of 3.4B samples, which took only ~13 minutes while operations continued as normal (see the import example below).
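For illustration, a minimal promxy configuration for fronting two single-node instances as one logical datasource; the host names are hypothetical, and 8428 is simply the default single-node VictoriaMetrics port:

```yaml
# promxy config: query both VictoriaMetrics replicas as one source.
promxy:
  server_groups:
    - static_configs:
        - targets:
            - victoriametrics-a:8428
            - victoriametrics-b:8428
```

Backfilling can go through VictoriaMetrics' import API; a sketch, assuming the historical data was already exported to JSON lines (e.g. via /api/v1/export):

```sh
# Hypothetical backfill: push previously exported data into the new instance.
curl -X POST -T history.jsonl http://victoriametrics-a:8428/api/v1/import
```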
To summarize: satisfaction with VictoriaMetrics has only increased over here!
Technical Stats
The stats below were collected with the following MetricsQL queries (`median_over_time` is a VictoriaMetrics extension of PromQL):

- Peak active time series over the last 24h: `sum(max_over_time(vm_cache_entries{type="storage/hour_metric_ids"}[24h]))`
- New time series created per day (churn): `sum(increase(vm_new_timeseries_created_total[24h]))`
- Ingestion rate (samples/s): `sum(rate(vm_rows_inserted_total[24h]))`
- Total samples stored: `sum(vm_rows{type=~"storage/.+"})`
- Index entries: `sum(vm_rows{type="indexdb"})`
- Data size on disk: `sum(vm_data_size_bytes{type=~"storage/.+"})`
- Index size on disk: `sum(vm_data_size_bytes{type="indexdb"})`
- Average bytes per stored sample: `sum(vm_data_size_bytes) / sum(vm_rows{type=~"storage/.+"})`
- Range query rate (`/api/v1/query_range`): `sum(rate(vm_http_requests_total{path=~".*/api/v1/query_range"}[24h]))`
- Instant query rate (`/api/v1/query`): `sum(rate(vm_http_requests_total{path=~".*/api/v1/query"}[24h]))`
- 24h median request duration for range queries, per reported quantile, in seconds: `max(median_over_time(vm_request_duration_seconds{path=~".*/api/v1/query_range"}[24h])) by (quantile)`
  - 0.500: 0.001041498
  - 0.900: 0.003903785
  - 0.970: 0.005359947
  - 0.990: 0.006418689
- 24h median request duration for instant queries, per reported quantile, in seconds: `max(median_over_time(vm_request_duration_seconds{path=~".*/api/v1/query"}[24h])) by (quantile)`
  - 0.500: 0.000890873
  - 0.900: 0.002929274
  - 0.970: 0.005280216
  - 0.990: 0.009454632
- RAM usage (24h median resident memory): `sum(median_over_time(process_resident_memory_bytes[24h]))`
- CPU usage (average cores over 24h): `sum(rate(process_cpu_seconds_total[24h]))`
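Any of these can be evaluated ad hoc against the Prometheus-compatible HTTP API of a single-node instance; a quick sketch (host name assumed):

```sh
# Evaluate one of the queries above via the Prometheus-compatible query API.
curl -s 'http://victoriametrics-a:8428/api/v1/query' \
  --data-urlencode 'query=sum(rate(vm_rows_inserted_total[24h]))'
```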