
How vmstorage Turns Raw Metrics into Organized History

by Phuong Le on Dec 6, 2024 20 Minutes Read

This article is part of VictoriaMetrics component series:

How vmstorage Turns Raw Metrics into Organized History


vmstorage is the component in VictoriaMetrics that handles long-term storage of monitoring data. It receives data from vminsert, organizes the data into efficient storage structures, and manages how long data is kept.

Before vminsert even sees the data, agents are out there collecting it. These agents gather metrics from different sources, hold onto the data briefly, and then send it over to vminsert in batches.

When vminsert receives the data, it compresses it into packets to improve transmission efficiency.

Data ingestion pipeline from agents to vmstorage

After compression, vminsert sends these packets to vmstorage. vmstorage stores the data on disk in an organized and optimized way. This structure makes it really fast to retrieve and query the data later on.

When vmstorage starts running, it opens a TCP listener on port 8400 by default (-vminsertAddr). This listener is specifically designed to accept connections from vminsert. Instead of sending separate HTTP requests and waiting for replies, vmstorage and vminsert use a continuous TCP connection. This method is much more efficient and allows them to exchange data directly.

Now, here’s what we’re really diving into today: how vmstorage handles data ingestion coming from vminsert.

The process might look a little different at first when other agents are involved, but the general idea is pretty much the same. Before we go any deeper, here are a few quick things to keep in mind:

  • Flags we mention will start with a -, like -remoteWrite.url.
  • Numbers we refer to are the defaults. Some of these can be changed with flags, while others are set in stone. That said, the defaults work fine for most setups.
  • If you’re using a Helm chart, some defaults might differ because of the chart’s configuration tweaks.
  • If you have a topic in mind that you’d like us to cover, you can drop us a DM on X (@func25) or connect with us on VictoriaMetrics’ Slack. We’re always looking for ideas and will focus on the most-requested ones. Thanks for sharing your suggestions!

Reading And Parsing Data #

When vmstorage gets data, it doesn’t jump straight into reading it.

First, it checks with the concurrent read limiter. This limiter allows up to 2x the number of CPU cores (-maxConcurrentInserts) to read data at the same time. For example, if your setup has 4 cores, you’ll get up to 8 readers working simultaneously. If more readers try to get in, they’ll end up waiting in line.

If things get too busy and the limiter gets overwhelmed, vmstorage won’t let readers wait forever. Any reader stuck in the queue for over a minute (-insert.maxQueueDuration) gets rejected to keep the system efficient under heavy load.
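To make the idea concrete, here’s a minimal sketch of such a limiter in Go: a buffered channel acts as the semaphore, and a timer rejects callers that wait too long. The names and structure are illustrative only, not the actual vmstorage code:

package main

import (
	"fmt"
	"runtime"
	"time"
)

// concurrencyCh acts as a semaphore: at most 2*CPU readers hold a slot at once.
var concurrencyCh = make(chan struct{}, 2*runtime.GOMAXPROCS(0))

// maxQueueDuration mirrors the idea behind -insert.maxQueueDuration.
const maxQueueDuration = time.Minute

// withConcurrencyLimit runs f if a slot frees up within maxQueueDuration,
// otherwise the caller is rejected instead of waiting forever.
func withConcurrencyLimit(f func()) error {
	t := time.NewTimer(maxQueueDuration)
	defer t.Stop()
	select {
	case concurrencyCh <- struct{}{}:
		defer func() { <-concurrencyCh }() // release the slot when done
		f()
		return nil
	case <-t.C:
		return fmt.Errorf("too many concurrent inserts; queued for more than %s", maxQueueDuration)
	}
}

func main() {
	if err := withConcurrencyLimit(func() { fmt.Println("reading a block...") }); err != nil {
		fmt.Println(err)
	}
}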

Now, vmstorage reads one block at a time from the stream. So what does a block look like?

vmstorage reads one block at a time

Each block starts with a simple header—8 bytes that tell us the size of the block.

This size isn’t supposed to go over 100 MB. Interestingly, if you remember from the vmagent article, there’s another limit of 32 MB for each block at the agent level. After reading the header, the body of the block is read based on the size it specifies. If the block checks out and arrives successfully, vmstorage sends back an acknowledgment (ack).
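As a rough sketch, reading a block boils down to “read the 8-byte size, validate it, read that many bytes, then acknowledge.” The byte order of the size prefix and the one-byte ack below are assumptions made for illustration; the real wire protocol differs in its details:

package ingest

import (
	"encoding/binary"
	"fmt"
	"io"
)

// maxBlockSize is the upper bound on a block body (100 MB).
const maxBlockSize = 100 * 1024 * 1024

// readBlock reads one size-prefixed block from r and writes a one-byte ack to w.
func readBlock(r io.Reader, w io.Writer) ([]byte, error) {
	var sizeBuf [8]byte
	if _, err := io.ReadFull(r, sizeBuf[:]); err != nil {
		return nil, fmt.Errorf("cannot read block header: %w", err)
	}
	size := binary.LittleEndian.Uint64(sizeBuf[:])
	if size > maxBlockSize {
		return nil, fmt.Errorf("block size %d bytes exceeds the %d-byte limit", size, maxBlockSize)
	}
	body := make([]byte, size)
	if _, err := io.ReadFull(r, body); err != nil {
		return nil, fmt.Errorf("cannot read block body: %w", err)
	}
	if _, err := w.Write([]byte{1}); err != nil { // acknowledge the block
		return nil, fmt.Errorf("cannot send ack: %w", err)
	}
	return body, nil
}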

There’s an edge case worth noting: when vmstorage is running low on disk space, it switches to read-only mode.

In this mode, it sends back a ‘read-only ack’ for received data but ignores the actual content. vminsert recognizes this type of acknowledgment and resends the data. We’ll discuss this read-only mode in more detail later.

At this stage, vmstorage only reads the raw block as a stream of bytes. It doesn’t parse anything yet and those raw bytes will eventually be broken down into rows for processing.

Raw data block parsed into structured rows

If the block is too big, vmstorage handles it in chunks, processing only 10,000 rows at a time and inserting them into storage.

Here are some metrics you might find useful, listed in the order in which the data is processed:

  • The number of times reading from the stream failed: vm_protoparser_read_errors_total.
  • The number of blocks (or 10,000-row chunks) that failed to parse: vm_protoparser_parse_errors_total.
  • The number of times sending an acknowledgment back to the client failed: vm_protoparser_write_errors_total.
  • The total number of blocks successfully read: vm_protoparser_blocks_read_total.
  • The total number of rows successfully read from the blocks: vm_protoparser_rows_read_total.

The current number of active connections from vminsert nodes to vmstorage can be tracked with vm_vminsert_conns. This metric helps in tuning vmstorage flags or resource allocations to avoid bottlenecks.

If the data is valid and the storage is writable, we’re ready to go.

Finding TSID For Each Metric #

Every piece of data, or metric row, comes with a few key parts:

  • A metric name like http_requests_total, which gives you a clear idea of what the data is tracking.
  • A set of labels (optional) like {job="my_app",instance="host1",path="/foo/bar"}. These labels add extra context, helping to locate where the data is from or what it’s monitoring.
  • A timestamp in milliseconds, such as 1731892875512, which tells when the data was captured. Metrics with timestamps more than 2 days into the future or past the retention policy get dropped.
  • A floating-point value, like 25.1, representing the actual data value.
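Putting those pieces together, a single metric row can be pictured roughly like this (an illustrative struct, not the exact internal type vmstorage uses):

package example

// MetricRow is a simplified picture of one ingested sample;
// the real internal types are more compact than this.
type MetricRow struct {
	MetricName string            // e.g. "http_requests_total"
	Labels     map[string]string // e.g. {"job": "my_app", "instance": "host1", "path": "/foo/bar"}
	Timestamp  int64             // milliseconds since the Unix epoch, e.g. 1731892875512
	Value      float64           // the sample value, e.g. 25.1
}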

Example of structured time series data

To make sense of this, vmstorage creates something called a “canonical metric name” by combining the metric name (e.g., http_requests_total) with its labels, sorted alphabetically by label name. Labels can show up in any order when the data is sent, but to store and query things properly, there needs to be a consistent format.

This sorting solves a big potential headache. Without it, two metrics with the same labels but in a different order, like metric{instance="host",job="app"} and metric{job="app",instance="host"}, would be treated as two separate time series. By sorting the labels, both get saved under the same time series, which avoids unnecessary duplication.
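Here’s a small sketch of that idea: sort the labels by name and append them to the metric name so that any incoming order produces the same canonical string. The real implementation works on raw bytes with its own encoding, so treat this as an approximation:

package example

import (
	"fmt"
	"sort"
	"strings"
)

// canonicalMetricName builds a stable representation of a metric:
// the metric name followed by its labels sorted by label name.
func canonicalMetricName(name string, labels map[string]string) string {
	keys := make([]string, 0, len(labels))
	for k := range labels {
		keys = append(keys, k)
	}
	sort.Strings(keys)

	var b strings.Builder
	b.WriteString(name)
	b.WriteByte('{')
	for i, k := range keys {
		if i > 0 {
			b.WriteByte(',')
		}
		fmt.Fprintf(&b, "%s=%q", k, labels[k])
	}
	b.WriteByte('}')
	return b.String()
}

With this in place, metric{instance="host",job="app"} and metric{job="app",instance="host"} both collapse to the same canonical string, and therefore to the same time series.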

From metric name to TSID in VictoriaMetrics

Once VictoriaMetrics has this sorted, canonical metric name, it either finds or creates a unique identifier for that time series, called the TSID (Time Series ID). The TSID is basically a unique number (with other metadata) that represents the time series. It’s what allows the system to quickly locate the data later on, without having to scan through every row manually.

Back to how the flow works. At this stage, we’ve just got the raw metric name—no sorted labels, just a bunch of bytes, right?

So, what happens next? vmstorage turns to its in-memory cache (or TSID cache) and says, “Hey, I’ve got this raw metric name. Can you tell me the TSID for it?”. This cache works like a quick-access lookup table, mapping raw metric names to their TSIDs:

In-memory cache maps raw names to TSIDs

If the metric is already in the cache, things move along fast since there’s no need to dig any deeper.

But let’s say we’re out of luck and the metric isn’t in the cache. VictoriaMetrics treats it as a slower insert and bumps up a handy metric: vm_slow_row_inserts_total. This metric is worth watching, as lower values mean the system is running more efficiently.

We’re not done yet. To figure out the TSID, VictoriaMetrics builds the canonical metric name by sorting the labels. It’s not just plain alphabetical sorting either; there’s some internal logic to how it’s done.

If the TSID isn’t already in the cache, VictoriaMetrics goes to IndexDB to look it up. This process is significantly slower because it involves random disk seeks; after all, IndexDB is disk-based, which is also what makes it durable. You’ll find IndexDB stored in the {-storageDataPath}/indexdb folder, where it maintains a mapping between metric names and TSIDs.

Once the lookup succeeds, the result is cached to save time and avoid hitting the disk again for the same time series in the future.
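The whole lookup can be summarized as a three-step fallback. The sketch below is deliberately simplified: the maps stand in for the real cache and the on-disk IndexDB, and the TSID here is just a counter:

package example

import "sync/atomic"

var (
	tsidCache      = map[string]uint64{} // raw metric name -> TSID (the in-memory TSID cache)
	indexDB        = map[string]uint64{} // canonical metric name -> TSID (stand-in for the on-disk IndexDB)
	slowRowInserts uint64                // corresponds to vm_slow_row_inserts_total
	lastTSID       uint64
)

// getTSID resolves a raw metric name to a TSID: fast path through the cache,
// slow path through IndexDB, and finally creation of a brand-new TSID.
func getTSID(rawMetricName string, canonicalize func(string) string) uint64 {
	if tsid, ok := tsidCache[rawMetricName]; ok {
		return tsid // fast path: no label sorting, no disk access
	}
	atomic.AddUint64(&slowRowInserts, 1) // this is a "slow insert"

	canonical := canonicalize(rawMetricName) // sort labels into canonical form
	if tsid, ok := indexDB[canonical]; ok {
		tsidCache[rawMetricName] = tsid // cache it so the next lookup stays in memory
		return tsid
	}

	// Brand-new series: generate a TSID and register it in both places.
	tsid := atomic.AddUint64(&lastTSID, 1)
	indexDB[canonical] = tsid
	tsidCache[rawMetricName] = tsid
	return tsid
}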

Fallback flow: cache to IndexDB to new TSID

If the system can’t find it in IndexDB either, then it’s safe to say the time series is brand new.

At this stage, the system generates a new TSID for the time series and registers it in both the in-memory cache and IndexDB. In IndexDB, this step speeds up future lookups by creating several important mappings:

  • It maps the canonical metric name to the new TSID.
  • It also sets up a reverse mapping so the TSID can point back to the canonical metric name.
  • Each label in the canonical metric name gets its own entry in the “inverted index.” This helps the system quickly search for time series based on label filters.
  • A per-day index (used when the search time range is 40 days or less) is created to optimize time range queries, especially for data spanning just a few days.
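Loosely speaking, the new entries amount to a handful of key-to-value mappings. The toy model below uses Go maps purely for illustration; IndexDB actually stores sorted key-value items on disk:

package example

// indexEntries is a toy model of what IndexDB gains when a new series is
// registered. Treat the key shapes and field types as illustrations only.
type indexEntries struct {
	NameToTSID  map[string]uint64   // canonical metric name -> TSID
	TSIDToName  map[uint64]string   // TSID -> canonical metric name
	Inverted    map[string][]uint64 // "label=value" -> TSIDs that carry this label
	PerDayIndex map[string][]uint64 // "date, label=value" -> TSIDs seen on that day
}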

You get the idea, right? For more details, check out the IndexDB brief.

Now, registering a new time series involves writing all this information to disk. This process can slow down the system, particularly when the metric has lots of labels or very lengthy ones.

That’s why keeping an eye on time series churn — how often new time series are created and old ones are dropped — is so important. A high churn rate can drag down your performance significantly because the system is constantly generating new TSIDs and registering them.

The good news is that vmstorage lets you control how many new time series can be created in an hour (-storage.maxHourlySeries) and per day (-storage.maxDailySeries). Any new time series that exceeds these limits will be rejected.
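For example, limits like the following (the values are purely illustrative) would allow at most one million new series per hour and five million per day:

-storage.maxHourlySeries=1000000 -storage.maxDailySeries=5000000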

Now, when it comes to monitoring this process, there are tons of metrics available, but from a user perspective, these are the key ones to watch:

  • The number of slow inserts: vm_slow_row_inserts_total.
  • The total count of new time series created: vm_new_timeseries_created_total.
  • The number of rows ignored for reasons like out-of-range timestamps or series cardinality limits: vm_rows_ignored_total{reason=*}.

Inserting Data to In-Memory Buffer #

Once the TSID is sorted out and registered, VictoriaMetrics takes the actual data sample, which includes the TSID, a timestamp, and a value, and places it into an in-memory buffer.

This buffer is called “raw-row shards,” and the number of shards in each partition (which represents one month of data) matches the number of CPU cores you have. So, for example, if your machine has 4 cores, you’ll get 4 shards for every month of data. Each shard can hold up to 8 MB of data, which comes out to about 149,796 rows (roughly 56 bytes per buffered row).

If a shard fills up, the rows are pushed into what’s known as “pending series.” From there, they wait to be processed into an “LSM part” and eventually written to disk.

Data enters shards, pending rows, then in-memory parts

The rows sitting in this sharded buffer aren’t searchable yet.

So, if you try to query them using Grafana or another tool, they won’t show up until they’re flushed into an LSM part. Once that happens, the data becomes available for querying. And just like that, the data ingestion process wraps up for this block. The system is ready to take on the next batch of metric rows and start the process all over again.

The process of flushing the buffer into an LSM part happens in the background, but if the data in a shard piles up too much, ingestion requests may be blocked while the flush takes place.

How Data Gets Written to Disk #

So, the sharded buffer flushes in two situations, and these are fixed behaviors (no configuration needed):

  • When the buffer size hits a threshold, roughly 120 MB (15 times the size of a shard), the pending series gets flushed.
  • If more than 2 seconds have passed since the last flush, the system automatically flushes both the pending series and the raw-row shards; this happens periodically.

During the flush, the data is converted into an LSM (Log-Structured Merge Tree) part, where entries are sorted by TSID and timestamp.
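Conceptually, building an LSM part starts with putting the buffered rows in that order. The sketch below uses hypothetical types, not the real ones:

package example

import "sort"

// rawRow is a hypothetical buffered sample: a series ID, a timestamp and a value.
type rawRow struct {
	TSID      uint64
	Timestamp int64 // milliseconds
	Value     float64
}

// sortForLSMPart orders rows the way an LSM part expects them:
// by TSID first, then by timestamp within each series.
func sortForLSMPart(rows []rawRow) {
	sort.Slice(rows, func(i, j int) bool {
		if rows[i].TSID != rows[j].TSID {
			return rows[i].TSID < rows[j].TSID
		}
		return rows[i].Timestamp < rows[j].Timestamp
	})
}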

At this point, a couple of metrics are worth keeping an eye on:

  • The number of rows currently waiting to be flushed to an LSM part: vm_pending_rows{type="storage"}, which includes rows from both the pending series and the raw-row shards.
  • The number of times errors occurred while processing this stage: vm_protoparser_process_errors_total.

But that’s not the full picture. Behind the scenes, there’s a lot more happening, with various hardworking components ensuring everything runs smoothly.

Types of LSM Parts #

Each partition, which covers one month of data, organizes its data into three types of LSM parts:

  • In-memory part: This is where data from raw-row shards lands after the first flush. At this point, your metrics become searchable and can be queried.
  • Small part: Slightly larger than in-memory parts, these are stored on persistent disk ({-storageDataPath}/data/small folder).
  • Big part: The largest of the parts, also stored on disk (usually in the {-storageDataPath}/data/big folder).

vmstorage can handle up to 60 in-memory parts at a time. Together, they use about 10% of the system’s memory. For instance, if your vmstorage memory is 10 GB, the in-memory parts will take roughly 1 GB. Each part ranges between 1 MB (minimum) and 17 MB (maximum), with these limits hard-coded into the system.

How vmstorage organizes data within a partition

As more data is ingested, more and more parts are created. When there are too many LSM parts, whether in-memory or on disk, every query needs to scan and merge data from all of them to return results. If left unchecked, this could slow the system down over time.

To prevent that, vmstorage relies on two key processes: flushing and merging.

  • Flushing: Moves all in-memory parts to small parts on disk. This is actually the second flush in the process (the first one moves data from raw-row shards to in-memory parts).
  • Merging: Combines parts to create more efficient storage, as we just mentioned. This doesn’t mean all small parts become big parts; some small parts might just merge into slightly larger small parts.

Every 5 seconds (controlled by -inmemoryDataFlushInterval), vmstorage flushes its in-memory parts to disk-based (or file-based) parts.

During this process, it merges these parts, flushes them, and ensures that recently ingested data isn’t lost, even if vmstorage crashes — whether due to an OOM error, a SIGKILL, or something else. That said, there’s always a small window of time when data could be lost before the flush occurs.

Now that we’ve got flushing covered, let’s talk about how the merging process works.

Merge Process #

Unlike flushing, merging doesn’t run on a fixed schedule. Instead, it works on a “cause and effect” basis.

For example, when in-memory parts get flushed to small parts on disk, the number of small parts grows. When this happens, vmstorage triggers a merge process for the small parts in that partition, looking for a chance to combine them into larger parts. Basically, when any type of part starts piling up, the system steps in and merges just those parts.

Most of the time, vmstorage merges up to 15 parts in a single go. These could be in-memory parts merging into a bigger in-memory part, small parts, or even big parts.

“So small parts are bigger than in-memory parts, and big parts are bigger than small parts? Is that always the case?”

It’s tempting to think of it like that because of the names, but it’s not entirely true.

Small parts might be larger than in-memory parts, or they might not. Similarly, small parts could be smaller than big parts, or not. It all depends on the timing of the merge and the available resources like memory and disk space at that moment. There’s no fixed size that decides whether a part is small or big. Instead, the system evaluates the situation during the merge to figure out what kind of part to create.

For example, if the merged part ends up larger than what’s typically considered a small part, it becomes a big part. But some rules do provide rough guidelines:

  • Small parts max out at 10 MB (assuming disk space isn’t an issue).
  • Big parts can go up to around remaining disk space / 4 but won’t exceed 1 TB.

You can keep an eye on the merge process using these metrics:

  • Total rows in parts: vm_parts_rows_total{type="storage/*"} (e.g. "storage/inmemory", "storage/small", "storage/big").
  • Total number of blocks in parts: vm_blocks{type="storage/*"}.
  • Total size of parts in memory: vm_data_size_bytes{type="storage/*"}.
  • Completed merges: vm_merges_completed_total{type="storage/*"}.
  • Current active merges (the ones happening right now): vm_active_merges{type="storage/*"}.
  • Total rows merged so far: vm_rows_merged_total{type="storage/*"}.

As small parts merge into bigger parts, the resulting blocks in those bigger parts eventually need to be written to disk. This is also where deduplication happens.

Deduplication is the process of identifying and removing data points that are almost identical but recorded at slightly different times.

This often occurs when two (or more) systems monitor the same metric for redundancy or reliability purposes and send their data to a shared storage. While the values and labels in the data are identical, their timestamps might vary by a few milliseconds or seconds. Deduplication resolves this by retaining only one version of each duplicate data point, optimizing storage and eliminating redundancy:

Deduplication filters data within time chunks

This visual isn’t an exact match for how VictoriaMetrics handles deduplication, but it gives you a good sense of the idea.

By default, deduplication is turned off. To enable it, you need to configure the deduplication window using -dedup.minScrapeInterval. When configured correctly, deduplication can significantly reduce disk usage while improving query speed and efficiency.
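The core rule is easy to sketch: within each -dedup.minScrapeInterval-sized window, keep only the last sample of a series and drop the rest (so -dedup.minScrapeInterval=60s keeps at most one sample per series per minute). The snippet below is a simplified model of that rule, not the actual VictoriaMetrics implementation:

package example

// deduplicate keeps the last sample per interval for a single series whose
// samples are sorted by timestamp (milliseconds). interval is the
// -dedup.minScrapeInterval value in milliseconds, e.g. 60000 for 60s.
func deduplicate(timestamps []int64, values []float64, interval int64) ([]int64, []float64) {
	if interval <= 0 || len(timestamps) == 0 {
		return timestamps, values
	}
	dstTS := make([]int64, 0, len(timestamps))
	dstVS := make([]float64, 0, len(values))
	for i := range timestamps {
		// If the next sample falls into the same interval, skip this one;
		// only the last sample of each interval survives.
		if i+1 < len(timestamps) && timestamps[i]/interval == timestamps[i+1]/interval {
			continue
		}
		dstTS = append(dstTS, timestamps[i])
		dstVS = append(dstVS, values[i])
	}
	return dstTS, dstVS
}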

Interestingly, deduplication can sometimes function like downsampling, as both processes reduce the amount of data stored over a given time period.

“What happens if I change the deduplication window?”

Each part has a header that tracks the deduplication window at the time it was merged. VictoriaMetrics uses a dedicated worker called the “Deduplication Watcher” to keep an eye on all partitions and parts. Every hour (with a bit of randomness added), it checks whether the deduplication window you’ve set is larger than the one recorded in the part header. If it is, the system applies the updated deduplication settings for those parts.

Behind the scenes, it triggers a merge for all the parts in the target partition. This merge process applies the new deduplication settings.

So, any new data coming in will immediately follow the new deduplication window. However, data in the current partition (for the current month) won’t be retroactively deduplicated by this change.

Along with the deduplication watcher, there are other key workers quietly operating behind the scenes. Let’s break down what they do.

Retention, Free Disk Space Guard, and Downsampling #

We don’t want to keep all the data forever, right? That’s a pretty common need, and VictoriaMetrics makes it simple with a retention policy.

This lets you decide how long to keep data on disk using the -retentionPeriod setting. By default, it’s set to 1 month (you can go as low as 1 day or as high as 100 years, though). Any samples sent from vminsert that fall outside the retention period are dropped right away. There’s also a retention watcher running every minute (with a little randomness added) to clear out old parts and partitions.

Old partitions phased out after retention window

That said, keep in mind that each part contains many samples, and if even one sample in a part falls within the retention period, that whole part will stick around. So, some old data might hang around longer than expected until the entire part is outside the retention window.

Be cautious when changing the retention policy at runtime, as there is a known issue #7609: Changing -retentionPeriod may cause earlier deletion of previous indexDB.

Retention Filters and Downsampling (Enterprise Plan) #

If you’re on the Enterprise plan, you get more flexibility with retention filters. These let you define retention periods for specific types of data based on criteria like labels.

-retentionFilter='{team="juniors"}:3d' -retentionFilter='{env=~"dev|staging"}:30d' -retentionPeriod=1y

For example:

  • Data labeled team="juniors" could have a 3-day retention.
  • Data labeled env=~"dev|staging" could have a 30-day retention.
  • Everything else could have a 1-year retention.

This gives you a lot of control over how long specific slices of your data stick around.

And then there’s downsampling, which is a lifesaver for managing high volumes of older data. Older data doesn’t get queried as often as recent data, so storing every single sample forever isn’t practical. Downsampling reduces the number of samples stored by keeping just one sample per time interval for older data.

-downsampling.period=30d:5m

In this example:

  • For data older than 30 days, the system keeps only the last sample for every 5-minute interval, dropping the rest.
  • Adding a second rule, such as 180d:1h, would keep only the last sample for every hour once data is older than 180 days.

You can combine these rules for multi-level downsampling, applying different levels of granularity as data ages. On top of that, you can even set up downsampling for specific time series using filters, just like retention filters:

-downsampling.period='{__name__=~"(node|process)_.*"}:30d:1m'

This snippet tells VictoriaMetrics to downsample data points older than 30 days to one-minute intervals, but only for time series with names that start with the node_ or process_ prefixes.

Free Disk Space Watcher: Read-Only Mode #

Earlier, we touched on how vmstorage can enter a read-only mode when disk space runs low.

In this mode, data sent from vminsert still gets acknowledged, but it’s quietly ignored. This safeguard is managed by a worker called the “free disk space watcher.” Its job is to keep an eye on available disk space and automatically switch vmstorage to read-only mode if things get too tight.

“What counts as low disk space?”

By default, the threshold is set to 10 MB (-storage.minFreeDiskSpaceBytes). The watcher checks the disk space every couple of seconds (around 2 seconds) at the storage path.

If the available disk space drops below this threshold, vmstorage switches to read-only mode. In this state, it continues serving read queries, like searching for metrics or selecting data, but it stops accepting new data writes. Any data sent from vminsert will still receive an acknowledgment but won’t be stored.

Once enough disk space is freed up (above the threshold), the watcher automatically switches vmstorage back to read-write mode. When this happens, it signals all the relevant components to resume normal operations, and new data writes start flowing in again.
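A minimal model of this watcher is a loop that samples free space every couple of seconds and flips an atomic read-only flag. The real worker also notifies other components, and freeSpace below is just a placeholder for a real disk query:

package example

import (
	"sync/atomic"
	"time"
)

// minFreeDiskSpaceBytes mirrors the default of -storage.minFreeDiskSpaceBytes (10 MB).
const minFreeDiskSpaceBytes = 10 * 1024 * 1024

var isReadOnly atomic.Bool

// watchFreeDiskSpace periodically checks how much space is left at the storage
// path and toggles read-only mode. stop terminates the loop.
func watchFreeDiskSpace(freeSpace func() uint64, stop <-chan struct{}) {
	ticker := time.NewTicker(2 * time.Second)
	defer ticker.Stop()
	for {
		select {
		case <-ticker.C:
			readOnly := freeSpace() < minFreeDiskSpaceBytes
			if isReadOnly.Swap(readOnly) != readOnly {
				// The mode just changed; the real worker notifies other
				// components here so writes pause or resume accordingly.
			}
		case <-stop:
			return
		}
	}
}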

Bonus: How a Partition is Structured #

This section goes a bit deeper into the technical details, so think of it as a bonus for anyone curious about how the system works under the hood.

Within a partition, you’ll find in-memory parts, small parts, and big parts. Regardless of the part type, the data is stored in a column-based format. This means that the TSID, timestamp, and value aren’t grouped together in a single record. Instead, they’re separated into distinct columns, with each column stored in its own file.

For in-memory parts, this columnar structure is already in place and ready to be flushed into file-based parts.

LSM parts organized into columnar data files

Here’s how it’s laid out:

  • All TSIDs are stored in a file called index.bin.
  • All timestamps are in timestamps.bin.
  • All values go into values.bin.

This columnar layout enables better compression and faster lookups, since each type of data can be handled in a way that suits its characteristics. For example, compression algorithms are chosen based on what works best for the specific type of data in each column, ensuring space is used wisely.

Each block of data within timestamps.bin and values.bin represents rows for a single TSID, with a block holding up to 8,192 rows. A cool optimization here is that multiple consecutive blocks can share the same timestamps.bin block if the data allows it, further saving space.

The index.bin file organizes things a bit differently. Each row in index.bin includes multiple block headers and can be as large as 64 KB.

“Wait, what’s a block header?”

It’s metadata that tells you about the block. For example, it includes:

  • The TSID for the block.
  • The number of rows in the block.
  • Where the block is located in timestamps.bin and values.bin.

With this metadata, the system knows exactly which TSID and timestamps correspond to the values in any given block.
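In code terms, a block header is just a handful of fields pointing into the data files. This is an illustrative shape, not the exact on-disk layout:

package example

// blockHeader describes one block of up to 8,192 rows belonging to a single TSID.
// The real header carries more fields (min/max timestamps, compression details,
// and so on); this sketch keeps only the parts discussed above.
type blockHeader struct {
	TSID             uint64 // which time series the block belongs to
	RowsCount        uint32 // number of rows in the block (up to 8,192)
	TimestampsOffset uint64 // where the block's timestamps start in timestamps.bin
	TimestampsSize   uint32 // how many bytes those timestamps occupy
	ValuesOffset     uint64 // where the block's values start in values.bin
	ValuesSize       uint32 // how many bytes those values occupy
}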

Block headers link index to data blocks

Here’s a real example of how a big part looks in a directory, with file sizes included for context:

total 641M   
-rw-r--r--    1 root     root       12.4M Oct 21 15:11 index.bin
-rw-r--r--    1 root     root       29.7K Oct 21 15:11 metaindex.bin
-rw-r--r--    1 root     root       24.7M Oct 21 15:11 timestamps.bin
-rw-r--r--    1 root     root      604.0M Oct 21 15:11 values.bin

As you can see, the sizes vary quite a bit. Each file serves a specific purpose, and they use different compression strategies to optimize storage. The values.bin file is always the largest because it holds the raw metric values. Next in size is usually timestamps.bin, followed by index.bin.

“What’s the deal with the metaindex.bin file?”

The index.bin file holds metadata for each block, but metaindex.bin adds another layer. It provides metadata about the index.bin file itself. Each row in metaindex.bin corresponds to a row in index.bin and contains:

  • The number of block headers in that row.
  • The offset and size of the row.
  • The TSID of the first block in that row.

It’s a map of the map, essentially, helping the system locate and manage data even more efficiently.

And that’s the complete picture of how vmstorage handles data ingestion. As always, feel free to reach out if you’d like to suggest or request specific topics for us to cover next!

Stay Connected #

Our writing style emphasizes clarity and simplicity. Instead of using complex, textbook-style definitions, we explain concepts in a way that’s easy to understand, even if it’s not always perfectly aligned with academic precision.

If you spot anything that’s outdated or if you have questions, don’t hesitate to reach out. You can drop me a DM on X (@func25).

Who We Are #

If you want to monitor your services, track metrics, and see how everything performs, you might want to check out VictoriaMetrics. It’s a fast, open-source, and cost-saving way to keep an eye on your infrastructure.

Leave a comment below or Contact Us if you have any questions!
