
How vmstorage handles data ingestion

by Phuong Le on Nov 20, 2024 6 Minutes Read

This article is part of the VictoriaMetrics components series. A few notes before we start:

  • Flags we mention will start with a -, like -remoteWrite.url.
  • Numbers we refer to are the defaults. Some of these can be changed with flags, while others are set in stone. That said, the defaults work fine for most setups.
  • If you’re using a Helm chart, some defaults might differ because of the chart’s configuration tweaks.
  • If you have a topic in mind that you’d like us to cover, you can drop a DM on X (@func25) or connect with us on VictoriaMetrics’ Slack. We’re always looking for ideas and will focus on the most-requested ones. Thanks for sharing your suggestions!

1. Handshake

After vminsert dials vmstorage (the so-called storage nodes), the connection goes through a special handshake process: vminsert asks vmstorage which compression it wants to use, and the two sides agree on a common compression algorithm before any data is sent.

Disabling compression (-rpc.disableCompression) can reduce the CPU usage of vminsert, but it increases the amount of network bandwidth used.
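
To make the idea concrete, here is a minimal, hypothetical sketch of such a negotiation over a raw connection. This is not VictoriaMetrics’ actual wire protocol; the single-byte exchange and the compression identifiers are assumptions for illustration only.

```go
package main

import (
	"fmt"
	"net"
)

// Hypothetical compression identifiers; the real handshake uses
// VictoriaMetrics' own wire format, not these bytes.
const (
	compressionNone = byte(0)
	compressionZSTD = byte(1)
)

// negotiateCompression sketches the vminsert side of a handshake:
// propose a compression algorithm, then read back what the server accepted.
func negotiateCompression(conn net.Conn, preferred byte) (byte, error) {
	if _, err := conn.Write([]byte{preferred}); err != nil {
		return 0, fmt.Errorf("cannot send handshake: %w", err)
	}
	buf := make([]byte, 1)
	if _, err := conn.Read(buf); err != nil {
		return 0, fmt.Errorf("cannot read handshake response: %w", err)
	}
	return buf[0], nil // both sides now use the agreed algorithm
}

func main() {
	client, server := net.Pipe()
	go func() {
		// Toy "vmstorage" side: accept whatever the client proposes.
		buf := make([]byte, 1)
		server.Read(buf)
		server.Write(buf)
	}()
	agreed, err := negotiateCompression(client, compressionZSTD)
	fmt.Println("agreed on compression:", agreed, err)
}
```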

There are some metrics in this process:

  • How many dial failures: vm_rpc_dial_errors_total.
  • How many handshakes failed: vm_rpc_handshake_errors_total.

These metrics can grow not only during startup but also in the middle of normal operation, since the handshake also works as a health-check mechanism.

2. Parse and Relabel

When data arrives at vminsert, it is uncompressed, read, and parsed depending on the source; it could be in Datadog, Graphite, or Prometheus format, remote write (compressed with zstd or Snappy), and so on.

2.1 Parsing

vminsert only processes a limited number of requests at a time, 2x the number of CPU cores by default (-maxConcurrentInserts); any request that exceeds this limit waits in a queue for at most 1 minute (-insert.maxQueueDuration) before being rejected.
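
As a rough mental model, this behaves like a semaphore with a bounded wait. The sketch below is our own simplification (the channel-based limiter and the variable names are assumptions, not vminsert’s internals):

```go
package main

import (
	"errors"
	"fmt"
	"runtime"
	"time"
)

var (
	// Default limit: 2x the number of CPU cores (-maxConcurrentInserts).
	maxConcurrentInserts = 2 * runtime.NumCPU()
	// Default queue wait: 1 minute (-insert.maxQueueDuration).
	maxQueueDuration = time.Minute

	concurrencyCh = make(chan struct{}, maxConcurrentInserts)
)

// handleInsert tries to acquire a concurrency slot; if none frees up
// within maxQueueDuration, the request is rejected.
func handleInsert(process func() error) error {
	t := time.NewTimer(maxQueueDuration)
	defer t.Stop()

	select {
	case concurrencyCh <- struct{}{}: // got a slot
		defer func() { <-concurrencyCh }()
		return process()
	case <-t.C:
		// This is where vm_concurrent_insert_limit_timeout_total would grow.
		return errors.New("the server is overloaded; try again later")
	}
}

func main() {
	err := handleInsert(func() error {
		time.Sleep(10 * time.Millisecond) // pretend to parse and buffer rows
		return nil
	})
	fmt.Println(err)
}
```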

// TODO(@phuong,image): queue

There are also some useful metrics to help you fine-tune the concurrency limit and vminsert’s resource usage:

  • How many times this concurrency limit is reached: vm_concurrent_insert_limit_reached_total.
  • How many requests timed out in the queue (> 1 minute): vm_concurrent_insert_limit_timeout_total.

2.2 Relabeling

After parsing the raw bytes into rows (metric name, labels, timestamp, value), these metric rows are relabeled according to -relabelConfig.

// TODO(@phuong,image): raw bytes – uncompressed –> bytes – parsed –> rows – relabeled –> rows

This relabeling may drop some time series, so the number of rows read from the request and the number pushed to vmstorage can differ:

  • How many rows read from raw bytes: vm_protoparser_rows_read_total.
  • How many rows remain after relabeling: vm_rows_inserted_total.

2.3 Marshaling

These rows are then, once again, marshaled into the format that vmstorage can understand.

// TODO(@phuong,image): rows – marshal –> [metric raw name][timestamp][value]

The metric raw name is the byte representation of the metric name, the labels, and the account ID and project ID (if you have enabled multi-tenancy).
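
To illustrate the shape of such a row, here is a hypothetical encoding; the real vmstorage wire format is different (and more compact), so treat the layout below purely as a sketch:

```go
package main

import (
	"encoding/binary"
	"fmt"
	"math"
)

// marshalRow sketches a made-up layout of
// [len(metricNameRaw)][metricNameRaw][timestamp][value].
func marshalRow(dst []byte, metricNameRaw []byte, timestampMs int64, value float64) []byte {
	dst = binary.AppendUvarint(dst, uint64(len(metricNameRaw)))
	dst = append(dst, metricNameRaw...)
	dst = binary.AppendVarint(dst, timestampMs)
	dst = binary.LittleEndian.AppendUint64(dst, math.Float64bits(value))
	return dst
}

func main() {
	row := marshalRow(nil, []byte(`http_requests_total{job="api"}`), 1732060800000, 42)
	fmt.Println(len(row), "bytes")
}
```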

While marshaling, vminsert enforces some restrictions on the labels:

  • Drop metrics with 30 or more labels (-maxLabelsPerTimeseries). How to check: monitor the vm_metrics_with_dropped_labels_total metric and check the warning logs.
  • Truncate label names exceeding 256 bytes (fixed limit). How to check: monitor the vm_metrics_with_truncated_label_names_total metric.
  • Truncate label values exceeding 4 KB (-maxLabelValueLength). How to check: monitor the vm_too_long_label_values_total metric and check the warning logs.
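
For intuition, here is a rough sketch of the checks in the list above. The function and struct are hypothetical; only the limits themselves (30 labels, 256-byte names, 4 KB values by default) come from the list:

```go
package main

import "fmt"

// Limits from the list above; the first and last are configurable
// via -maxLabelsPerTimeseries and -maxLabelValueLength.
const (
	maxLabelsPerTimeseries = 30
	maxLabelNameLen        = 256
	maxLabelValueLen       = 4 * 1024
)

type label struct {
	Name, Value string
}

// enforceLabelLimits is a hypothetical illustration of the rules:
// drop the metric if it carries 30 or more labels, and truncate
// overly long label names and values.
func enforceLabelLimits(labels []label) ([]label, bool) {
	if len(labels) >= maxLabelsPerTimeseries {
		// vm_metrics_with_dropped_labels_total would grow here.
		return nil, false
	}
	for i := range labels {
		if len(labels[i].Name) > maxLabelNameLen {
			labels[i].Name = labels[i].Name[:maxLabelNameLen]
		}
		if len(labels[i].Value) > maxLabelValueLen {
			labels[i].Value = labels[i].Value[:maxLabelValueLen]
		}
	}
	return labels, true
}

func main() {
	labels, ok := enforceLabelLimits([]label{{Name: "job", Value: "api"}})
	fmt.Println(labels, ok)
}
```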

3. Buffering

vminsert then decides which storage node should receive each row, using consistent hashing over the metric name and labels (plus the account ID and project ID if you have enabled multi-tenancy). So as long as your set of storage nodes stays the same, the same time series always goes to the same storage node.
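
Here is a small sketch of that idea using Jump Consistent Hash (Lamping and Veach). VictoriaMetrics’ actual node-selection code is different, and the FNV-1a hash here is just a stand-in, but it shows why the same series keeps landing on the same node while the node set is stable:

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// jumpHash maps a 64-bit key to one of numNodes buckets and keeps most
// keys in place when numNodes changes (Jump Consistent Hash).
func jumpHash(key uint64, numNodes int32) int32 {
	var b, j int64 = -1, 0
	for j < int64(numNodes) {
		b = j
		key = key*2862933555777941757 + 1
		j = int64(float64(b+1) * (float64(1<<31) / float64((key>>33)+1)))
	}
	return int32(b)
}

// pickStorageNode sketches the routing: hash the raw metric name
// (metric name + labels, plus tenant IDs when multi-tenancy is enabled),
// then pick a node index with consistent hashing.
func pickStorageNode(metricNameRaw []byte, numNodes int32) int32 {
	h := fnv.New64a()
	h.Write(metricNameRaw)
	return jumpHash(h.Sum64(), numNodes)
}

func main() {
	row := []byte(`http_requests_total{job="api",instance="10.0.0.1:8080"}`)
	fmt.Println("storage node:", pickStorageNode(row, 8))
}
```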

The data is then put into that storage node’s buffer, waiting to be sent over the network.

// TODO(@phuong,image): buffer for each vmstorage node

vminsert allocates 12.5% of its memory to the buffers for all storage nodes. If you have 8 storage nodes, each node gets 12.5% / 8 = 1.5625% of vminsert’s memory. This is capped at 30 MB per buffer, to limit how much data is sent to vmstorage at once.
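
As a worked example with our own numbers: a vminsert with 2 GiB of memory and 8 storage nodes would get 2 GiB × 12.5% / 8 = 32 MiB per buffer, which the cap then trims to 30 MB. A tiny sketch of that arithmetic (the exact cap constant is an assumption):

```go
package main

import "fmt"

// The 30 MB cap mentioned above; the exact constant is an assumption.
const maxBufSizePerStorageNode = 30 * 1024 * 1024

// bufferSizePerNode sketches the sizing rule: 12.5% of vminsert's memory,
// split evenly across storage nodes, capped per buffer.
func bufferSizePerNode(totalMemory, storageNodes int) int {
	size := totalMemory / 8 / storageNodes // 12.5% of memory, split across nodes
	if size > maxBufSizePerStorageNode {
		size = maxBufSizePerStorageNode
	}
	return size
}

func main() {
	const gib = 1 << 30
	fmt.Println(bufferSizePerNode(2*gib, 8)) // 32 MiB, capped at ~30 MB
}
```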

In this discussion, let’s assume each buffer is 30 MB.

If the node is ready and there is enough space in its buffer, the data is successfully added and no further action is needed. But what if the buffer is full, or the node is currently unavailable (crashed, unreachable)?

To avoid data loss, vminsert distributes the data to other healthy nodes using a rerouting mechanism.

Rerouting

A storage node can be in one of the following states from vminsert’s point of view:

  • Ready: Ready to accept data and its buffer has enough space to hold the incoming data.
  • Overloaded: Too much incoming data; vminsert marks a node as “overloaded” when more than 30 KB of data destined for that node is sitting in its buffer, not yet sent to vmstorage.
  • Broken: Temporarily unhealthy; the node may be overloaded on the vmstorage side (for example, hitting its network-ingestion concurrency limit), or anything else that makes it return an error.
  • Readonly: The node is in read-only mode (low disk space) and won’t accept any new data; it just replies to vminsert with a read-only ACK.

We have two different types of rerouting for these states: overloaded rerouting for overloaded nodes, and unavailable rerouting for broken and readonly nodes.

Overloaded rerouting

Overloaded rerouting is disabled by default (-disableRerouting=true) because it could push all of your storage nodes into the overloaded state together.

Instead, vminsert blocks the request and simply waits until the buffer has enough space to hold the incoming data. You can change this behavior to drop the samples instead (-dropSamplesOnOverload).

“Why not? Spreading the load across other nodes sounds like a better idea.”

It was indeed the default behavior of vminsert in older versions, but it could hit other storage nodes with a storm of new time series. And new time series aren’t cheap, as you may remember from the vmstorage article: each one has to be registered and consumes quite a lot of memory.

This makes the other nodes unhealthy, or even OOM (Out of Memory) and get killed, which again puts pressure on the remaining nodes, and so on: a domino effect.

But if you are confident that your vmstorage nodes are burstable and can handle the load, you can always enable the rerouting by setting the -disableRerouting flag to false.

Unavailable rerouting

Unlike overloaded rerouting, unavailable rerouting is enabled by default (-disableReroutingOnUnavailable=false), and it reroutes the data to other healthy nodes.

If you set that flag to true, vminsert instead blocks the request and waits until the node is ready again.

When one vmstorage node fails, vminsert redirects its data to all the remaining nodes. Each healthy node gets part of the data that was supposed to go to the failed node: if there are n nodes available in total, each healthy node gets 1/n of the failed node’s data. For example, in a 9-node cluster with one node down, each of the 8 remaining nodes picks up 1/8 of the failed node’s data.

// TODO(@phuong,ask): disableRerouting is not entirely correct as this article states, even Aliaksandr Valialkin doesn’t know why disableRerouting flag is influencing the “Unavailable rerouting”, I’m not sure could we remove that if *disableRerouting {.

There are also some metrics for rerouting:

  • How many rows pushed to the buffer successfully: vm_rpc_rows_pushed_total.
  • How many rows rerouted from storage node X: vm_rpc_rows_rerouted_from_here_total.
  • How many rows rerouted to storage node Y: vm_rpc_rows_rerouted_to_here_total.
  • How many rows dropped because of overload: vm_rpc_rows_dropped_on_overload_total.

Sending data to vmstorage

For each storage node, vminsert also has a worker that watches its buffer and runs a periodic check every 200ms: it takes all the data out of the buffer and sends it to the storage nodes with replication (-replicationFactor).

If some storage nodes are not working, the process will skip them and try other nodes. If it cannot send to enough nodes for full replication, it will:

  • Keep trying if it hasn’t sent the data anywhere yet.
  • Accept partial replication if at least one copy was made.
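
Here is a rough, self-contained sketch of that send loop. The structure and the names (sendWithReplication, storageNode) are our simplification, not vminsert’s actual code; the real worker also wakes up on the 200ms ticker mentioned above:

```go
package main

import (
	"errors"
	"fmt"
)

type storageNode struct {
	addr  string
	ready bool
}

// send pretends to ship a block of buffered rows to one storage node.
func (sn *storageNode) send(block []byte) error {
	if !sn.ready {
		return errors.New("node is unavailable")
	}
	return nil
}

// sendWithReplication tries to write the block to replicationFactor nodes,
// skipping nodes that are not working. It fails only when no copy was made;
// with at least one copy, partial replication is accepted.
func sendWithReplication(block []byte, nodes []*storageNode, replicationFactor int) error {
	copies := 0
	for _, sn := range nodes {
		if copies >= replicationFactor {
			break
		}
		if err := sn.send(block); err != nil {
			continue // skip this node and try the others
		}
		copies++
	}
	if copies == 0 {
		return errors.New("no copies made; keep retrying")
	}
	if copies < replicationFactor {
		fmt.Printf("partial replication accepted: %d/%d copies\n", copies, replicationFactor)
	}
	return nil
}

func main() {
	nodes := []*storageNode{
		{addr: "vmstorage-1:8400", ready: true},
		{addr: "vmstorage-2:8400", ready: false},
		{addr: "vmstorage-3:8400", ready: true},
	}
	fmt.Println(sendWithReplication([]byte("buffered rows"), nodes, 2))
}
```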

Stay Connected

The author’s writing style emphasizes clarity and simplicity. Instead of using complex, textbook-style definitions, we explain concepts in a way that’s easy to understand, even if it’s not always perfectly aligned with academic precision.

If you spot anything that’s outdated or if you have questions, don’t hesitate to reach out. You can drop me a DM on X (@func25).

Who We Are

If you want to monitor your services, track metrics, and see how everything performs, you might want to check out VictoriaMetrics. It’s a fast, open-source, and cost-saving way to keep an eye on your infrastructure.

Leave a comment below or Contact Us if you have any questions!