The German Research Center for Artificial Intelligence (DFKI) was founded in 1988 as a non-profit public-private partnership. It has research facilities in Kaiserslautern, Saarbrücken and Bremen, a project office in Berlin, a laboratory in Niedersachsen and branch offices in Lübeck, St. Wendel and Trier. In the field of innovative commercial software technology using Artificial Intelligence, DFKI is the leading research center in Germany.
Traditionally, each research group at DFKI used its own hardware. In mid-2020, we started an initiative to consolidate existing (and future) hardware into a central Slurm cluster to enable our researchers and students to run more and larger experiments.
The cluster is based on the NVIDIA DeepOps stack, which included Prometheus for short-term metric storage. Our users liked the level of detail they got from our custom dashboards compared with our previous Zabbix-based solution, so we decided to extend the retention period to several years and needed to find the best option on the market for this task.
Ideally, we wanted PhD students to be able to access data from even their earliest experiments while they were finishing their theses. Since we do everything on-premises, we needed a solution with strong compression and a very small storage footprint.
VictoriaMetrics kept showing up in searches and benchmarks on time series database performance and consistently came out on top when it came to required storage. Quite frankly, the presented numbers looked like magic, so we decided to put this to the test.
sum(max_over_time(vm_cache_entries{type="storage/hour_metric_ids"}[24h]))
sum(increase(vm_new_timeseries_created_total[24h]))
sum(rate(vm_rows_inserted_total[24h]))
sum(vm_rows{type=~"storage/.+"})
sum(vm_rows{type="indexdb"})
sum(vm_data_size_bytes{type=~"storage/.+"})
sum(vm_data_size_bytes{type="indexdb"})
sum(vm_data_size_bytes) / sum(vm_rows{type=~"storage/.+"})
sum(rate(vm_http_requests_total{path=~".*/api/v1/query_range"}[24h]))
sum(rate(vm_http_requests_total{path=~".*/api/v1/query"}[24h]))
max(median_over_time(vm_request_duration_seconds{path=~".*/api/v1/query_range"}[24h])) by (quantile)
max(median_over_time(vm_request_duration_seconds{path=~".*/api/v1/query"}[24h])) by (quantile)
sum(median_over_time(process_resident_memory_bytes[24h]))
sum(rate(process_cpu_seconds_total[24h]))
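
The queries above can be evaluated against the Prometheus-compatible instant query endpoint (`/api/v1/query`). The following is a minimal sketch rather than the setup used at DFKI: the base URL (the default single-node port 8428 on localhost), the dictionary keys and the helper names `build_query_url`, `extract_values` and `run_query` are all assumptions made for illustration.

```python
import json
import urllib.parse
import urllib.request

# Assumption: a single-node VictoriaMetrics instance on its default port.
BASE_URL = "http://localhost:8428"

# Two of the queries listed above; the keys are illustrative names only.
QUERIES = {
    "active_series": 'sum(max_over_time(vm_cache_entries{type="storage/hour_metric_ids"}[24h]))',
    "bytes_per_datapoint": 'sum(vm_data_size_bytes) / sum(vm_rows{type=~"storage/.+"})',
}


def build_query_url(base_url, expr):
    """Build an instant-query URL with the expression URL-encoded."""
    params = urllib.parse.urlencode({"query": expr})
    return base_url.rstrip("/") + "/api/v1/query?" + params


def extract_values(payload):
    """Return the sample values of an instant-query response as floats.

    Each entry in data.result carries "value": [timestamp, "number-as-string"].
    Queries grouped with `by (quantile)` yield one entry per quantile.
    """
    if payload.get("status") != "success":
        raise ValueError("query failed: %s" % payload.get("error", "unknown error"))
    return [float(sample["value"][1]) for sample in payload["data"]["result"]]


def run_query(base_url, expr, timeout=10.0):
    """Send one instant query and return its sample values."""
    with urllib.request.urlopen(build_query_url(base_url, expr), timeout=timeout) as resp:
        return extract_values(json.load(resp))


if __name__ == "__main__":
    for name, expr in QUERIES.items():
        print(name, run_query(BASE_URL, expr))
```

Only the standard library is used, so the script can run on any cluster node that has Python available. For a cluster installation, the base URL would need to point at the query component instead.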