Performance Challenges and Optimization Practices for Handling Tens of Thousands of JSON Fields in AI Agent Scenarios

Written by Admin

JSON Fields in AI Agent Scenarios

As business operations evolve (vehicle models are added, data tracking changes, models are upgraded), the system constantly introduces new fields, and the field set expands rapidly. When the field union reaches tens of thousands, individual rows are highly sparse, and query requirements change frequently, traditional predefined schemas can no longer keep up. Typical characteristics include field counts ranging from hundreds to tens of thousands, rapid evolution, dispersed distribution, high write throughput, and queries that usually touch only a few fields.

Typical application scenarios are as follows:

Connected vehicles: Vehicle models, hardware, and OTA updates are combined, and the fields change frequently with devices and versions, reaching a scale of tens of thousands, with significant differences between vehicle models.

Observability: As microservices and SDKs continue to iterate, log and trace dimensions are constantly expanding to hundreds to thousands of fields.
Behavioral analysis: Business expansion leads to a wider range of attributes, with the field size reaching approximately 1k to 5k in the mature stage.

AI applications: Rapid iteration of prompts and models, agent traces, tool calls, RAG retrieval results, and evaluation data continuously introduce new keys, and the field structure evolves with changes in models and workflows.

To address these issues, Apache Doris 4.1 provides targeted optimizations such as Doc Mode and Segment V3, which improve the system’s performance, scalability, and stability in these scenarios.

Core performance bottleneck

As throughput continues to climb and the JSON path expands to tens of thousands, the system faces two main bottlenecks:

Metadata bloat: As the number of columns/fields increases, the volume of metadata (such as footers and per-column metadata) grows linearly with the number of columns, leading to a surge in system storage pressure. This is not an isolated phenomenon but a common challenge faced by columnar storage.
Premature sub-column materialization: To improve query performance, some systems (such as the early default implementation of Doris VARIANT and Elasticsearch) immediately materialize each JSON path as a separate column during writes. This practice puts pressure on both writes and compaction.

In AI scenarios, because prompt, tool, and trace data evolve much faster, changes in field count and in hot/cold data are often more pronounced than in traditional event tracking systems. Moreover, these two issues often overlap in real-world scenarios: write bottlenecks lead to increased file fragmentation, which in turn exacerbates metadata bloat, creating a vicious cycle.
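To make the linear growth concrete, here is a back-of-the-envelope sketch. The per-column metadata size (200 bytes) and the file count are hypothetical numbers chosen only for illustration; they are not measurements from Doris or any other system.

```python
def footer_metadata_bytes(num_columns: int, bytes_per_column_meta: int) -> int:
    """Footer size when every column's metadata lives in the file footer."""
    return num_columns * bytes_per_column_meta


def total_metadata_bytes(num_columns: int, bytes_per_column_meta: int,
                         num_files: int) -> int:
    """Every file carries its own footer, so fragmentation multiplies the cost."""
    return footer_metadata_bytes(num_columns, bytes_per_column_meta) * num_files


# Hypothetical assumption: 200 bytes of metadata per column, 1,000 files.
for columns in (100, 1_000, 10_000):
    per_file = footer_metadata_bytes(columns, 200)
    overall = total_metadata_bytes(columns, 200, 1_000)
    print(f"{columns} columns: {per_file} bytes per file, {overall} bytes overall")
```

The second function shows why the two bottlenecks reinforce each other: more files from a struggling write path means the same per-column cost is paid more times.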

Common Solutions

When dealing with the challenge of JSON with widths of tens of thousands or more, different systems have proposed technical paths with different focuses, but in essence, they are all making trade-offs between flexibility, write cost, and query performance.

We selected two representative solutions, ClickHouse’s Advanced Shared Data serialization and PostgreSQL’s JSONB, for further analysis.

2.3 ClickHouse: Controls the number of columns but limits coding efficiency

ClickHouse v25.8 introduced the Advanced Shared Data serialization format (see Making complex JSON 58x faster) to alleviate the problem of drastic performance degradation when there are too many JSON paths. This solution controls the total number of files by using a fixed number of buckets, avoiding the problem of the file becoming completely unusable when there are many columns.

This solution significantly improves usability, but it also introduces new costs. Because of ClickHouse’s PAX (Partition Attributes Across) storage layout, data is organized by column or attribute, and queries must repeatedly locate and jump between multiple storage locations, which amplifies random reads. In addition, to accommodate write and merge processes, the system needs to retain an additional copy of the original data, further increasing storage costs. Overall, while the problem is alleviated to some extent, query performance and resource overhead remain unsatisfactory in large-scale scenarios.

2.4 PostgreSQL: Document-friendly but limited in analytical scenarios

PostgreSQL’s JSONB type parses JSON into a binary format for storage and supports key-value retrieval using GIN indexes. JSONB performs exceptionally well in point lookups and in document readback (SELECT *) scenarios, because the original document doesn’t need to be reassembled from numerous sub-columns.

However, it is still essentially a row-based storage, lacking column-based optimization. In analytical scenarios (such as filtering and aggregation by field), even when querying a single key, it is necessary to parse the JSON row by row, making it difficult to take advantage of the compression and computational benefits of columnar storage. As the data scale increases, the query latency will increase significantly.
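The difference between the two access patterns can be sketched in a few lines. This is only an illustration: JSONB stores a binary form rather than text, so `json.loads` here is a stand-in for the per-row decoding work, and the sample rows and function names are hypothetical.

```python
import json

# Hypothetical sample rows; each document carries only some of the keys.
rows = [
    '{"device": "a", "bat_temp": 31.5, "speed": 80}',
    '{"device": "b", "bat_temp": 29.0}',
    '{"device": "c", "speed": 65}',
]


def avg_row_based(raw_rows, key):
    """Document-style storage: every row is decoded in full to read one key."""
    values = []
    for raw in raw_rows:
        doc = json.loads(raw)  # decodes all keys, even though one is needed
        if key in doc:
            values.append(doc[key])
    return sum(values) / len(values) if values else None


def to_columns(raw_rows):
    """Columnar layout: one list per key, with None where the key is absent."""
    docs = [json.loads(raw) for raw in raw_rows]
    keys = sorted({k for doc in docs for k in doc})
    return {k: [doc.get(k) for doc in docs] for k in keys}


def avg_columnar(columns, key):
    """Columnar read: only the requested column is touched."""
    values = [v for v in columns.get(key, []) if v is not None]
    return sum(values) / len(values) if values else None


columns = to_columns(rows)  # done once, at write or compaction time
print(avg_row_based(rows, "bat_temp"), avg_columnar(columns, "bat_temp"))
```

Both functions return the same answer; the point is that the row-based version repeats the decoding cost for every row on every query, while the columnar version pays it once.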

Doris 4.1: Deferred Materialization + On-Demand Loading

To address the aforementioned issues, Apache Doris introduced two key capabilities in its latest release, 4.1: Doc Mode and Segment V3, which optimize the write path and metadata management aspects, respectively.

Doc Mode (deferred sub-column materialization): Efficiently writes JSON to disk in document form during the write phase and materializes only high-frequency JSON paths on demand during the compaction phase, thereby significantly reducing write amplification and system pressure and improving overall throughput.

Segment V3 (on-demand metadata loading): It splits the column-level metadata, which was originally concentrated in the footer, into an independent storage structure. During the query, only the metadata of the relevant columns is loaded, which effectively reduces memory usage and I/O overhead at the scale of tens of thousands of columns.

The combination of these two factors enables Doris to achieve a more balanced performance across the three key dimensions of write throughput, query latency, and metadata control.

So, compared to industry-leading solutions like ClickHouse and PostgreSQL, is Doris more advantageous?

From a mechanistic perspective:

Compared to ClickHouse, Doris uses pure columnar contiguous storage after materializing sub-columns, avoiding random read amplification issues caused by layouts like PAX; at the same time, under the default strategy, there is no need to retain redundant copies, making storage overhead more controllable.

In contrast to PostgreSQL (JSONB), once the JSON path is materialized, Doris can fully leverage the advantages of columnar storage in compression and vectorized computation, while JSONB is always limited by the row-based storage model, and its I/O mode has a natural bottleneck in analytical scenarios.

In benchmark tests (tens of thousands of JSON paths, 100 million rows of data, single machine with 16 CPUs / 64GB of RAM / SSD, see “Comprehensive Performance Verification” below):

Doris vs. ClickHouse: In wide JSON aggregation and filtering scenarios, Doris’ query latency remains stable in the hundreds of milliseconds, while ClickHouse, affected by random reads due to advanced encoding, experiences query latency rising to several seconds. Furthermore, under default configurations, Doris’ storage space is approximately 60% of ClickHouse’s.

Doris vs. PostgreSQL: PostgreSQL performs exceptionally well in scenarios that read back entire documents (such as Q3); however, in aggregation and filtering scenarios (Q1 / Q2), its query latency is hundreds of times higher than that of Doris.

Doc Mode: On-demand deferred sub-column materialization

In Doris 4.1, Doc Mode decouples the write, compaction, and query processes and handles them in three stages, thereby gradually optimizing query performance while maintaining write throughput.

Write phase: JSON is stored only in document form.
Compaction phase: High-frequency paths are materialized into sub-columns as needed at appropriate times.
Query phase: The optimal execution path is selected automatically based on the materialization status of the fields.

4.1 Write Phase: Prioritize encoding JSON keys and writing them to disk as hash shards

To prioritize write throughput, the system encodes JSON data as a sharded map and performs structured disk writes using hash sharding. This design both supports efficient SELECT * return of entire rows and provides a unified fallback storage for unmaterialized fields.

The original JSON is split into multiple independent columnar map shards. The number of shards is controlled by the parameter variant_doc_hash_shard_count, which ensures that the data is evenly distributed.

In fallback scenarios, only the corresponding single shard needs to be hit and scanned, avoiding the performance overhead of scanning the entire dataset. As a result, the write complexity changes from “growing with the number of unknown paths” to “being oriented towards a fixed shard structure,” significantly improving system stability and scalability.
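The idea of routing each JSON key to one of a fixed number of shards can be sketched as follows. This is a minimal illustration, not Doris's on-disk format: the hash function (CRC32), the in-memory dictionaries, and all function names are assumptions made for the example. Only the concept of a fixed shard count comes from the text above.

```python
import zlib


def shard_of(path: str, shard_count: int) -> int:
    """Map a JSON path to a shard with a stable hash (CRC32 is an assumption)."""
    return zlib.crc32(path.encode("utf-8")) % shard_count


def write_doc(doc: dict, shard_count: int) -> list:
    """Split one document into shard_count small maps, regardless of key count."""
    shards = [{} for _ in range(shard_count)]
    for path, value in doc.items():
        shards[shard_of(path, shard_count)][path] = value
    return shards


def read_path(row_shards: list, path: str, shard_count: int):
    """Fallback read of an unmaterialized path: only one shard is inspected."""
    return row_shards[shard_of(path, shard_count)].get(path)


def read_doc(row_shards: list) -> dict:
    """Whole-document readback: merge the shards back together."""
    doc = {}
    for shard in row_shards:
        doc.update(shard)
    return doc


doc = {"bat_temp": 31.5, "speed": 80, "status": "ok", "tool_result": "done"}
row_shards = write_doc(doc, shard_count=4)
print(read_path(row_shards, "bat_temp", 4))
```

Note that the write cost depends on the shard count, which is fixed, rather than on how many distinct paths have been seen so far.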

4.2 Merging Phase: Delaying Materialization

The parameter variant_doc_materialization_min_rows defines when a path is materialized: when data batches are small or have not yet settled, data is written only to the sharded map; only after compaction is triggered and the number of rows reaches the threshold are high-frequency paths extracted as independent sub-columns. Compared to immediate columnization, this delayed decision-making mechanism significantly improves system stability in scenarios with bursty writes.

Figure 3: First, write to the doc map, and then extract commonly used fields into columns during the compaction phase.
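The deferred decision can be sketched as a function that runs at compaction time. The interpretation used here, that a path becomes a column once it appears in at least `min_rows` rows of the compacted batch, is an assumption made for illustration; the exact rule Doris applies for variant_doc_materialization_min_rows may differ, and the function name is hypothetical.

```python
from collections import Counter


def materialize_hot_paths(doc_rows: list, min_rows: int) -> dict:
    """Return columns for paths seen in at least min_rows rows (assumed rule).

    Paths below the threshold are not returned; they stay readable only
    through the document form in doc_rows.
    """
    counts = Counter(path for doc in doc_rows for path in doc)
    hot_paths = sorted(path for path, n in counts.items() if n >= min_rows)
    # One value per row for each hot path; None marks rows without the key.
    return {path: [doc.get(path) for doc in doc_rows] for path in hot_paths}


batch = [
    {"a": 1, "b": 2},
    {"a": 3},
    {"a": 5, "c": 7},
]
print(materialize_hot_paths(batch, min_rows=2))
```

Raising `min_rows` in this sketch keeps more short-lived keys out of the column set, which mirrors the stability argument in the text.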

4.3 Query Phase: Automatic Switching Between Three Read Paths

For the upper-level query engine, Doc Mode automatically routes requests to the most suitable reading path based on the materialization state of the fields, achieving a dynamic balance between performance and flexibility.

DOC Materialized: Hot fields have been extracted as sub-columns, so the pure columnar read path is used directly, giving the highest query efficiency. Optimizations such as index pushdown can also be fully utilized.

DOC Map: Used for SELECT * scenarios that read entire documents; it returns the original doc directly without concatenating sub-columns, so overall overhead is very low.

DOC Map (Sharded): For queries on cold fields that are not materialized and cannot be served as a whole document, the request is routed to the corresponding hash shard, and only that shard is scanned, which greatly reduces scanning of irrelevant data.

Figure 4 Explanation:

Materialized fields follow the columnar path (~76 ms), making full use of the advantages of columnar storage;

Unmaterialized but sharded fields undergo a single shard scan (~148 ms);

In other scenarios, the process reverts to a full Doc Map scan (~2,533 ms).

As time goes on, the background compaction process continues, and frequently accessed fields will gradually be materialized and enter the columnar path, enabling the system to continuously optimize query performance without sacrificing write efficiency and achieve a dynamic balance between read and write operations.
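The three read paths described above can be sketched as a small routing function. The function name and the returned labels are hypothetical; the sketch only mirrors the decision order in the text and is not how the Doris planner is implemented.

```python
def choose_read_path(path, materialized_paths, whole_document=False):
    """Pick a read path for one request (labels are illustrative only)."""
    if whole_document:
        # SELECT * style readback: return the stored document as-is.
        return "doc_map"
    if path in materialized_paths:
        # Hot field already extracted during compaction: pure columnar read.
        return "materialized_column"
    # Cold field: scan only the hash shard that owns this path.
    return "doc_map_sharded"


materialized = {"bat_temp", "status"}
print(choose_read_path("bat_temp", materialized))
print(choose_read_path("tool_result", materialized))
print(choose_read_path(None, materialized, whole_document=True))
```

As compaction adds entries to the materialized set, the same query moves from the sharded path to the columnar path without any change on the caller's side.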

4.4 Bounding JSON key evolution to improve system stability

In scenarios involving agent traces or upstream data lacking standardized constraints, JSON keys often evolve continuously (e.g., from score_one to another key such as tool_result) and lack stable boundaries. In such cases, appropriately increasing the variant_doc_materialization_min_rows threshold can keep many short-lived or low-frequency fields in Doc Map form, thereby achieving more robust system behavior.

Stable write path: Avoids frequent large-scale field materialization and dictionary encoding in the background, reducing write amplification and compaction pressure.

High-efficiency whole document reading: SELECT * can directly return the original record, without the need for sub-column assembly, resulting in lower overhead.

Controllable access to cold fields: Low-frequency paths can still be accessed via the sharded Doc Map. This is slightly slower than the columnar path, but it avoids the systemic burden caused by an unbounded number of columns.

Overall, this strategy essentially imposes boundary constraints between performance and controllability, allowing fields to continue evolving but avoiding uncontrolled impacts on storage structure and system resources.

4.5 Applicable Scenarios

In the following three typical business scenarios, Doc Mode can often significantly improve overall performance:

Highly sparse log/event data: The field union is huge, but each record only matches a small number of keys, and the structure is highly discrete.

Write pressure and field expansion coexist: Due to high-throughput writes and rapid field growth, significant compaction backlog or write amplification issues have emerged.

Hybrid access mode (analysis + readback): This requires both field retrieval and analysis and periodic full document readback (SELECT *).

Segment V3: On-demand metadata format

Prior to Doris 4.1, the system used the Segment V2 storage format, storing the metadata of all columns centrally at the footer of the file. This design was efficient in large-scale sequential scan scenarios, but during random reads or small-range queries, the complete metadata needed to be loaded each time, resulting in additional I/O and parsing overhead, becoming a performance bottleneck.

To address this, Doris 4.1 introduces Segment V3, drawing on the practices of newer file storage formats such as Lance and Vortex. It separates metadata from the footer and loads it on demand, resolving the most common issues encountered in scenarios with tens of thousands of columns, such as metadata bloat, slow file opening, and random read overhead. The performance improvement is particularly significant during the initial read phase. It is suitable for ultra-wide tables, tables with a large number of VARIANT sub-columns, objects sensitive to cold starts, and scenarios with frequent random reads.
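The principle of on-demand metadata loading can be sketched with a toy file layout. Everything about this layout is an assumption made for illustration (JSON-encoded metadata blobs, a trailing index of offsets, an 8-byte length suffix); it is not the actual Segment V3 format, which the text does not specify.

```python
import io
import json


def write_segment(column_metas: dict) -> bytes:
    """Toy layout: [meta blob per column][index as JSON][8-byte index length]."""
    buf = io.BytesIO()
    index = {}
    for name, meta in column_metas.items():
        blob = json.dumps(meta).encode("utf-8")
        index[name] = (buf.tell(), len(blob))  # where this column's meta lives
        buf.write(blob)
    index_blob = json.dumps(index).encode("utf-8")
    buf.write(index_blob)
    buf.write(len(index_blob).to_bytes(8, "little"))
    return buf.getvalue()


class LazySegmentReader:
    """Reads the small index on open; decodes column metadata only when asked."""

    def __init__(self, data: bytes):
        self._data = data
        index_len = int.from_bytes(data[-8:], "little")
        self._index = json.loads(data[-8 - index_len:-8])
        self.loaded = {}  # column name -> decoded metadata

    def column_meta(self, name: str) -> dict:
        if name not in self.loaded:
            offset, length = self._index[name]
            self.loaded[name] = json.loads(self._data[offset:offset + length])
        return self.loaded[name]


metas = {f"c{i}": {"type": "DOUBLE", "ordinal": i} for i in range(7000)}
reader = LazySegmentReader(write_segment(metas))
print(reader.column_meta("c42"), len(reader.loaded))
```

A query that touches one path decodes one metadata entry instead of all 7,000. In this toy version the index itself still grows with the column count, so it only illustrates the direction of the saving, not its size.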

This is relevant to semi-structured data scenarios such as the Internet of Vehicles (IoV), AI observability, prompt debugging, and online inference analytics, where only a small number of dynamic paths are accessed but rapid random queries are required.

After enabling Segment V3, Doris Doc achieves more balanced performance for both hot and cold queries, avoiding significant performance stratification.

Take an ultra-wide table with 7,000 columns and 10,000 segments as an example. During the segment opening phase, V3 achieves significant improvements compared to V2:

Opening speed increased by up to 16 times.
Memory usage reduced by up to 60 times.

In scenarios with ultra-wide tables and high-concurrency access, this translates to faster response times and lower resource costs.

Comprehensive Performance Verification

To evaluate the actual performance of each solution in a wide JSON scenario, we designed a set of benchmark tests that closely resemble real-world business scenarios. The test environment and constraints are as follows:

Data characteristics:

The union of JSON keys is approximately 10K in size (the schema cannot be defined in advance).
Each row writes 100 randomly chosen keys, each with a random numeric value, simulating a highly sparse wide table.
The total data volume is 100 million rows (approximately 160GB), split into 1,000 files to simulate a high-frequency import scenario.

Hardware environment: Single machine with 16 cores, 64GB of RAM, and an SSD

Products and Configurations

Access mode: High-concurrency writes + random field queries (avoiding optimizations for specific fields)
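Data with the characteristics listed above can be generated with a short script. This is a sketch, not the benchmark's actual generator: the key naming scheme (`key_<n>`), the value range, and the function names are assumptions, while the 10K key union and 100 keys per row come from the description.

```python
import json
import random


def generate_row(rng, key_union_size=10_000, keys_per_row=100):
    """One sparse document: keys_per_row distinct keys drawn from the union."""
    key_ids = rng.sample(range(key_union_size), keys_per_row)
    # Hypothetical key names; values are random numbers in [0, 1).
    return {f"key_{i}": rng.random() for i in key_ids}


def generate_rows(num_rows, seed=0):
    """Yield JSON lines; a fixed seed keeps the output reproducible."""
    rng = random.Random(seed)
    for _ in range(num_rows):
        yield json.dumps(generate_row(rng))


sample = list(generate_rows(3))
print(len(json.loads(sample[0])))
```

Writing the output in batches of 100,000 rows per file would reproduce the 1,000-file split for 100 million rows.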

6.1 Storage and Import Performance

Storage space: As shown in the figure above, Doris’s Variant Default is the best, thanks to its full columnar storage and the elimination of redundant data.

Import performance: PostgreSQL (JSONB) is the best, followed by Variant Doc, both of which are significantly better than ClickHouse, Elasticsearch, and Variant Default.

Query latency (cold query/hot query)

As can be seen from the above, different systems have different focuses in terms of individual capabilities: PostgreSQL performs well in document-oriented storage and read-back scenarios (Q3), but its ability to perform complex analysis is limited; ClickHouse has excellent columnar analysis capabilities when the field size is controllable, but it is easily affected by the number of paths, metadata, and the Shared Data mechanism in ultra-wide JSON scenarios.

In contrast, Doris, based on Deferred Materialization (Doc Mode) and On-Demand Metadata Loading (Segment V3), simultaneously balances write throughput, storage control, complex analysis, and full document reading in wide JSON scenarios, achieving a better balance between performance and resource overhead and demonstrating stronger comprehensive capabilities.

Overall Conclusion

Taken together, the storage, import, and query results show that the value of Doris Variant Doc Mode + Segment V3 lies in not simply choosing between document storage and columnar storage. Instead, it combines document writing, lazy materialization, on-demand metadata loading, and columnar query execution into a compromise that is better suited to analytical workloads on wide JSON. It is a better implementation path than traditional JSONB, search engine solutions, and pure columnar JSON solutions.

7. Quick start and verification

If you are struggling with the ever-growing size of JSON keys, you can quickly verify this using the following minimum viable configuration:

CREATE TABLE IF NOT EXISTS sensor_data (
    ts DATETIME NOT NULL,
    device_id VARCHAR(64) NOT NULL,
    model VARCHAR(128),
    -- Schema template: declare typed paths and column properties as needed.
    data VARIANT<
        'bat_temp': DOUBLE,
        properties('variant_enable_doc_mode' = 'true')
    >,
    INDEX idx_data(data) USING INVERTED PROPERTIES("field_pattern" = "status")
)
DUPLICATE KEY(ts, device_id)
-- The original statement omitted the hash column; device_id is assumed here.
DISTRIBUTED BY HASH(device_id) BUCKETS 16
PROPERTIES (
    "replication_num" = "1",
    "storage_format" = "V3"
);

Note: "storage_format" = "V3" is expected to become the default option for wide-table scenarios in subsequent versions.

Conclusion

As AI applications gradually enter the agent-based and real-time stage, the continuous growth of dynamic fields will become a long-term trend, which also places higher demands on the flexibility and scalability of semi-structured data systems. By introducing Doc Mode (deferred materialization) and Segment V3 (on-demand loading), Doris effectively alleviates the problems of metadata bloat and query performance degradation while maintaining high write throughput, especially demonstrating significant advantages in high-concurrency, high-sparsity, and highly evolving wide JSON scenarios. Whether in the fields of connected vehicles, observability, behavioral analysis, or AI applications, Doris can effectively address the challenges of data field bloat and query performance degradation, providing a stable and scalable solution.
