“It’s the whole package. Easy to use with sane defaults & delivers a completely hassle-free experience. Whenever I have to deal with other software, I wish it was VictoriaMetrics instead.”

  • Artificial Intelligence Research
  • Kaiserslautern, Germany

The German Research Center for Artificial Intelligence (DFKI) was founded in 1988 as a non-profit public-private partnership. It has research facilities in Kaiserslautern, Saarbrücken and Bremen, a project office in Berlin, a laboratory in Niedersachsen and branch offices in Lübeck, St. Wendel and Trier. In the field of innovative commercial software technology using Artificial Intelligence, DFKI is the leading research center in Germany.

Main Benefits of Using VictoriaMetrics

  • Whole-Package-Solution

  • Easy to Use

  • Hassle-Free

  • Fully Prometheus-Compatible

  • On-Premise Friendly & Space-Efficient

  • Performance Monitoring Grafana Dashboard

Challenge

Traditionally, each research group at DFKI used its own hardware. In mid-2020, we started an initiative to consolidate existing (and future) hardware into a central Slurm cluster to enable our researchers and students to run more and larger experiments.

Based on the NVIDIA DeepOps stack, the cluster included Prometheus for short-term metric storage. Our users liked the level of detail they got from our custom dashboards compared with our previous Zabbix-based solution, so we decided to extend the retention period to several years and needed to find the best option on the market for this task.

Ideally, we wanted PhD students to be able to access data from even their earliest experiments while finishing their theses. Since we do everything on-premise, we needed a solution with strong compression that is extremely space-efficient.

Solution

VictoriaMetrics kept showing up in searches and benchmarks on time series database performance and consistently came out on top when it came to required storage. Quite frankly, the presented numbers looked like magic, so we decided to put this to the test.

Why VictoriaMetrics Was Chosen Over Other Solutions

  • Excellent Trial Results

  • Measurably Superior to Other Solutions

  • Consumes Less CPU Time & RAM

  • Consumes ⅓ of the Storage

  • First impressions in our testing were excellent: we simply downloaded the binary and pointed it at a storage location, with almost no configuration required. Apart from minor tweaks to the command line (turning on deduplication) and running it as a systemd unit, we still use the same instance from those first tests today. VictoriaMetrics was superior to Prometheus in every measurable way: it used considerably less CPU time and RAM, and a third of the storage.
  • While initially storage efficiency was our primary driver, the simplicity of setting up a testbed definitely helped guide our decision as well. Seeing how effortlessly the single-node VictoriaMetrics instance manages our current setup gives us confidence that it will keep up with our growth for quite a while. When the time comes that we do outgrow it, there is always the robust cluster variant of VictoriaMetrics that we can turn to.
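
The setup described above — a single binary pointed at a storage location, run as a systemd unit with deduplication enabled — can be sketched roughly as follows. The paths, user, and retention value here are illustrative placeholders, not DFKI's actual configuration; the flags themselves (`-storageDataPath`, `-retentionPeriod`, `-dedup.minScrapeInterval`) are standard single-node VictoriaMetrics options:

```ini
# Illustrative systemd unit for single-node VictoriaMetrics.
# Paths, user, and retention are placeholders.
[Unit]
Description=VictoriaMetrics single-node time series database
After=network.target

[Service]
User=victoriametrics
ExecStart=/usr/local/bin/victoria-metrics-prod \
  -storageDataPath=/var/lib/victoria-metrics \
  -retentionPeriod=10y \
  -dedup.minScrapeInterval=30s
Restart=on-failure

[Install]
WantedBy=multi-user.target
```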

How VictoriaMetrics Is Used Today

  • Two VictoriaMetrics instances behind promxy for HA, with one vmagent each for scraping. Removing Prometheus fixed the issue where it would only scrape a subset of all targets, requiring multiple restarts until it could be convinced to do its job properly
  • Painless migration process. We restored from the most recent backup, which only took a few minutes and backfilled the small gap in data
  • Backfilled a larger dataset of 3.4B samples, which only took ~13 minutes to finish while operations continued as normal
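
The backfill throughput implied by that last bullet can be verified with quick arithmetic:

```python
# Rough throughput check for the 3.4B-sample backfill mentioned above.
samples = 3.4e9           # samples backfilled
seconds = 13 * 60         # ~13 minutes of wall-clock time
rate = samples / seconds  # average ingestion rate during the backfill
print(f"{rate:,.0f} samples/s")  # roughly 4.4 million samples/s
```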

Summary

Our satisfaction with VictoriaMetrics has only increased!

Technical Stats

  • The maximum number of active time series during the last 24 hours

    sum(max_over_time(vm_cache_entries{type="storage/hour_metric_ids"}[24h]))

    130212

  • Daily time series churn rate

    sum(increase(vm_new_timeseries_created_total[24h]))

    7000-20000

  • The average ingestion rate over the last 24h

    sum(rate(vm_rows_inserted_total[24h]))

    24309.404768518518

  • The total number of datapoints

    sum(vm_rows{type=~"storage/.+"})

    157440268939

  • The total number of entries in inverted index

    sum(vm_rows{type="indexdb"})

    32116049

  • Data Size on Disk

    sum(vm_data_size_bytes{type=~"storage/.+"})

    82616524124

  • Index size on disk

    sum(vm_data_size_bytes{type="indexdb"})

    294114810

  • The average datapoint size on disk

    sum(vm_data_size_bytes) / sum(vm_rows{type=~"storage/.+"})

    0.5266325618708981

  • The average range query rate over the last 24h

    sum(rate(vm_http_requests_total{path=~".*/api/v1/query_range"}[24h]))

    2.0832407648523237

  • The average instant query rate over the last 24h

    sum(rate(vm_http_requests_total{path=~".*/api/v1/query"}[24h]))

    1.2442476851851851

  • Median range query duration quantiles over the last 24h

    max(median_over_time(vm_request_duration_seconds{path=~".*/api/v1/query_range"}[24h])) by (quantile)

    quantile="0.500": 0.001041498
    quantile="0.900": 0.003903785
    quantile="0.970": 0.005359947
    quantile="0.990": 0.006418689
    quantile="1":     0.00867678

  • Median instant query duration quantiles over the last 24h

    max(median_over_time(vm_request_duration_seconds{path=~".*/api/v1/query"}[24h])) by (quantile)

    quantile="0.500": 0.000890873
    quantile="0.900": 0.002929274
    quantile="0.970": 0.005280216
    quantile="0.990": 0.009454632
    quantile="1":     0.01897588

  • Median memory usage during the last 24h

    sum(median_over_time(process_resident_memory_bytes[24h]))

    2855964672

  • The average number of CPU cores used during the last 24h

    sum(rate(process_cpu_seconds_total[24h]))

    0.11834062363031682