Migrate from ClickHouse
This guide explains how to smoothly migrate your business from ClickHouse to GreptimeDB. It covers pre-migration preparation, data model adjustments, table structure reconstruction, dual-write for a safe transition, and concrete methods for exporting and importing data, so that the system cutover is as seamless as possible.
Pre-Migration Notes
- Compatibility: Although GreptimeDB is SQL protocol compatible, the two databases have fundamental differences in data modeling, index design, and compression mechanisms. Be sure to refer to the SQL compatibility documentation and official modeling guidelines to refactor table structures and data flows during migration.
- Data Model Differences: ClickHouse is a general-purpose big data analytics engine, while GreptimeDB is optimized for time-series, metrics, and log observability scenarios. Their data models, index systems, and compression algorithms differ, so take these differences and your actual business scenario into account during model design.
Refactoring Data Model and Table Structure
Time Index
- ClickHouse tables do not always have a time index field. During migration, clearly select the primary time field for your business (for example, the log or trace print time) to serve as the time index, and specify it when creating the table in GreptimeDB.
- The time precision (second, millisecond, microsecond, etc.) should be chosen based on real-world requirements and cannot be changed once set; see the sketch below.
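A minimal sketch of fixing the precision at table creation, assuming GreptimeDB's TIMESTAMP(p) precision syntax (0 = second, 3 = millisecond, 6 = microsecond, 9 = nanosecond); the table and column names are hypothetical:
CREATE TABLE app_logs (
  -- Millisecond precision; this cannot be changed after the table is created
  `timestamp` TIMESTAMP(3) NOT NULL,
  host STRING,
  message STRING,
  PRIMARY KEY (host),
  TIME INDEX (`timestamp`)
);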
Primary Key and Wide Table Recommendations
- Primary Key: Roughly equivalent to the ORDER BY clause in ClickHouse, minus the timestamp column. Avoid including high-cardinality fields such as log IDs, user IDs, or UUIDs, which cause primary key bloat, excessive write amplification, and inefficient queries.
- Wide Table vs. Multiple Tables: For multiple metrics collected at the same observation point (such as on the same host), a wide table is preferable because it improves batch write efficiency and compression ratio; see the sketch below.
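A hedged sketch of the wide-table layout for the per-host scenario above; host_metrics and its metric columns are hypothetical names:
CREATE TABLE host_metrics (
  `timestamp` TIMESTAMP NOT NULL,
  host STRING,
  -- One column per metric instead of generic metric/value pairs
  cpu_usage DOUBLE,
  mem_usage DOUBLE,
  disk_io DOUBLE,
  PRIMARY KEY (host),
  TIME INDEX (`timestamp`)
);
Writing all metrics of a host in one row per timestamp generally batches and compresses better than a narrow metric/value layout.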
Index Planning
- Inverted Index: Build indexes for low-cardinality columns to improve filter efficiency.
- Skipping Index: Use as needed, typically when you query for specific values that occur sparsely or infrequently within large datasets.
- Fulltext Index: Use as needed; Designed for text search operations on string columns; avoid building unnecessary indexes on high-cardinality or highly variable fields.
- Read Data Index for more information; a combined sketch follows this list.
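Putting the three index types together, a sketch with hypothetical table and column names, using the same column-modifier syntax as the examples later in this guide:
CREATE TABLE http_requests (
  `timestamp` TIMESTAMP NOT NULL,
  region STRING INVERTED INDEX,      -- low-cardinality filter column
  request_id STRING SKIPPING INDEX,  -- sparse lookups on a high-cardinality value
  message STRING FULLTEXT INDEX,     -- text search on raw content
  latency DOUBLE,
  PRIMARY KEY (region),
  TIME INDEX (`timestamp`)
);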
Partitioning Table
ClickHouse supports table partitioning via the PARTITION BY syntax. GreptimeDB provides a similar feature with a different syntax; see the table sharding documentation for details.
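A sketch of the GreptimeDB partitioning syntax at the time of writing (verify against the table sharding documentation; the table name and split points are hypothetical):
CREATE TABLE partitioned_example (
  `timestamp` TIMESTAMP NOT NULL,
  host STRING,
  `value` DOUBLE,
  PRIMARY KEY (host),
  TIME INDEX (`timestamp`)
)
PARTITION ON COLUMNS (host) (
  host < 'host100',
  host >= 'host100' AND host < 'host200',
  host >= 'host200'
);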
TTL
GreptimeDB supports TTL (time to live) through the table option ttl. For more information, please refer to Manage data retention with TTL policies.
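For example, retention can be set when the table is created and, assuming the ALTER TABLE ... SET option syntax, adjusted later:
CREATE TABLE events (
  `timestamp` TIMESTAMP NOT NULL,
  message STRING,
  TIME INDEX (`timestamp`)
) WITH (ttl = '30d');

-- Assumed syntax for changing the option afterwards; check the TTL documentation
ALTER TABLE events SET 'ttl' = '90d';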
Example Table
Example ClickHouse table structure:
CREATE TABLE example (
timestamp DateTime,
host String,
app String,
metric String,
value Float64
) ENGINE = MergeTree()
TTL timestamp + INTERVAL 30 DAY
ORDER BY (timestamp, host, app, metric);
Recommended table structure after migrating to GreptimeDB:
CREATE TABLE example (
`timestamp` TIMESTAMP NOT NULL,
host STRING,
app STRING INVERTED INDEX,
metric STRING INVERTED INDEX,
`value` DOUBLE,
PRIMARY KEY (host, app, metric),
TIME INDEX (`timestamp`)
) with(ttl='30d');
The choice of primary key and the granularity of the time index should be carefully planned based on your business's data volume and query scenarios. If the host cardinality is high (e.g., hundreds of thousands of monitored hosts), you should remove it from the primary key and consider creating a skipping index for it.
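A sketch of that high-cardinality variant: the same schema with host moved out of the primary key and given a skipping index:
CREATE TABLE example (
  `timestamp` TIMESTAMP NOT NULL,
  host STRING SKIPPING INDEX,
  app STRING INVERTED INDEX,
  metric STRING INVERTED INDEX,
  `value` DOUBLE,
  PRIMARY KEY (app, metric),
  TIME INDEX (`timestamp`)
) with(ttl='30d');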
Migrating Typical Log Tables
GreptimeDB already provides built-in modeling for OpenTelemetry (OTel) log ingestion; please refer to the official documentation.
Common ClickHouse log table structure:
CREATE TABLE logs
(
timestamp DateTime,
host String,
service String,
log_level String,
log_message String,
trace_id String,
span_id String,
INDEX inv_idx(log_message) TYPE ngrambf_v1(4, 1024, 1, 0) GRANULARITY 1
) ENGINE = MergeTree
ORDER BY (timestamp, host, service);
Recommended GreptimeDB table structure:
- Time index: timestamp (precision set based on logging frequency)
- Primary key: host, service (fields often used in queries/aggregations)
- Field columns: log_message, trace_id, span_id (high-cardinality, unique identifiers, or raw content)
CREATE TABLE logs (
`timestamp` TIMESTAMP NOT NULL,
host STRING,
service STRING,
log_level STRING,
log_message STRING FULLTEXT INDEX WITH (
backend = 'bloom',
analyzer = 'English',
case_sensitive = 'false'
),
trace_id STRING SKIPPING INDEX,
span_id STRING SKIPPING INDEX,
PRIMARY KEY (host, service),
TIME INDEX (`timestamp`)
);
Notes:
- host and service serve as common query filters and are included in the primary key to optimize filtering. If there are very many hosts, consider leaving host out of the primary key and creating a skipping index for it instead.
- log_message is treated as raw content and receives a full-text index. For the full-text index to take effect during queries, you also need to adjust your SQL query syntax; refer to the log query documentation for details and see the sketch after this list.
- Since trace_id and span_id are mostly high-cardinality fields, they are not recommended for the primary key; skipping indexes are added for them instead.
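A query sketch, assuming the matches() full-text search function described in the log query documentation (the exact function name and query syntax may vary by version):
SELECT `timestamp`, host, service, log_message
FROM logs
WHERE matches(log_message, 'error OR timeout')
  AND `timestamp` >= '2024-04-25 00:00:00'
ORDER BY `timestamp` DESC
LIMIT 100;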
Migrating Typical Traces Tables
GreptimeDB also provides built-in modeling for OpenTelemetry (OTel) trace ingestion; please read the official documentation.
Common ClickHouse trace table structure design:
CREATE TABLE traces (
timestamp DateTime,
trace_id String,
span_id String,
parent_span_id String,
service String,
operation String,
duration UInt64,
status String,
tags Map(String, String)
) ENGINE = MergeTree()
ORDER BY (timestamp, trace_id, span_id);
Recommended GreptimeDB table structure:
- Time index: timestamp (e.g., collection/start time)
- Primary key: service, operation (commonly filtered/aggregated properties)
- Field columns: trace_id, span_id, parent_span_id, duration, tags (high-cardinality or Map type)
CREATE TABLE traces (
`timestamp` TIMESTAMP NOT NULL,
service STRING,
operation STRING,
`status` STRING,
trace_id STRING SKIPPING INDEX,
span_id STRING SKIPPING INDEX,
parent_span_id STRING SKIPPING INDEX,
duration DOUBLE,
tags STRING, -- If this is structured JSON, either store it as-is or use the pipeline to parse fields
PRIMARY KEY (service, operation),
TIME INDEX (`timestamp`)
);
Notes:
- service and operation serve as the primary key, supporting retrieval and aggregate queries by service or operation; see the query sketches after this list.
- trace_id, span_id, and parent_span_id use skipping indexes but are not part of the primary key.
- High-cardinality fields are stored as field columns for efficient writes. For complex properties like tags, JSON or string storage is recommended; they can be expanded with GreptimeDB's ETL feature (Pipeline) if necessary.
- Depending on overall business volume, consider whether to partition traces into multiple tables (for example, in massive multi-service environments).
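Query sketches under this model (the filter values are hypothetical): aggregations hit the primary-key columns, while point lookups on trace_id are accelerated by the skipping index:
-- Aggregate latency per service/operation over a time window
SELECT service, operation, avg(duration) AS avg_duration
FROM traces
WHERE `timestamp` >= '2024-04-25 00:00:00'
GROUP BY service, operation;

-- Fetch all spans of a single trace
SELECT * FROM traces
WHERE trace_id = 'abc123'
ORDER BY `timestamp`;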
Dual-write Strategy for Safe Migration
During the migration process, to avoid data loss or inconsistent writes, adopt a dual-write approach:
- The application should write to both ClickHouse and GreptimeDB simultaneously, running the two systems in parallel.
- Validate and compare data using logs and checks to ensure data consistency. Once the data has been fully validated, you can switch fully over.
Exporting and Importing Historical Data
- Enable dual-write before migration: The application should write to both ClickHouse and GreptimeDB. Check for data consistency to reduce the risk of missing data.
- Data export from ClickHouse: Use ClickHouse's native command to export data as CSV, TSV, Parquet, or other formats. For example:
clickhouse client --query="SELECT * FROM example INTO OUTFILE 'example.csv' FORMAT CSVWithNames"
The exported CSV will look like:
"timestamp","host","app","metric","value"
"2024-04-25 10:00:00","host01","nginx","cpu_usage",12.7
"2024-04-25 10:00:00","host02","redis","cpu_usage",8.4
"2024-04-25 10:00:00","host03","postgres","cpu_usage",15.3
"2024-04-25 10:01:00","host01","nginx","cpu_usage",12.5
"2024-04-25 10:01:00","host01","nginx","mem_usage",1034.5
"2024-04-25 10:01:00","host02","redis","mem_usage",876.2
"2024-04-25 10:01:00","host03","postgres","mem_usage",1145.2
"2024-04-25 10:02:00","host01","nginx","disk_io",120.3
"2024-04-25 10:02:00","host02","redis","disk_io",95.1
"2024-04-25 10:02:00","host03","postgres","disk_io",134.7
"2024-04-25 10:03:00","host02","redis","mem_usage",874
"2024-04-25 10:04:00","host03","postgres","cpu_usage",15.1
- Data import into GreptimeDB: The table must be created in GreptimeDB before importing data. GreptimeDB currently supports batch data import via SQL commands or the REST API; for large datasets, import in batches.
Use the COPY FROM command to import:
COPY example FROM "/path/to/example.csv" WITH (FORMAT = 'CSV');
Alternatively, you can convert the CSV to standard INSERT statements for batch import.
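For example, the first few exported rows as an INSERT statement:
INSERT INTO example (`timestamp`, host, app, metric, `value`) VALUES
  ('2024-04-25 10:00:00', 'host01', 'nginx', 'cpu_usage', 12.7),
  ('2024-04-25 10:00:00', 'host02', 'redis', 'cpu_usage', 8.4),
  ('2024-04-25 10:00:00', 'host03', 'postgres', 'cpu_usage', 15.3);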
Validation and Cutover
- Once the import is complete, use GreptimeDB’s query interface to compare data with ClickHouse for consistency, for example with the check sketched after this list.
- After data verification and monitoring meet requirements, you can officially switch business writes to GreptimeDB and disable dual-write mode.
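A simple consistency check to run on both systems and compare (identifier quoting may need minor adjustment per database):
SELECT count(*) AS row_count,
       min(`timestamp`) AS first_ts,
       max(`timestamp`) AS last_ts
FROM example;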
Frequently Asked Questions and Optimization Tips
What if SQL/types are incompatible?
Before migration, audit all query SQL and rewrite or translate it as necessary, referring to the official documentation (especially for log queries) for any incompatible syntax or data types.
How do I efficiently import very large datasets in batches?
For large tables or full historical data, export and import by partition or shard as appropriate. Monitor write speed and import progress closely.
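For instance, a sketch of exporting and importing one day at a time (dates and paths are illustrative):
clickhouse client --query="SELECT * FROM example WHERE timestamp >= '2024-04-25 00:00:00' AND timestamp < '2024-04-26 00:00:00' INTO OUTFILE 'example_2024-04-25.csv' FORMAT CSVWithNames"
COPY example FROM "/path/to/example_2024-04-25.csv" WITH (FORMAT = 'CSV');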
How should high-cardinality fields be handled?
Avoid using high-cardinality fields as primary key columns. Store them as field columns instead, and split data into multiple tables if necessary.
How should wide tables be planned?
For each monitoring entity or collection endpoint, consolidate all metrics into a single table. For example, use a host_metrics table to store all server statistics.
If you need a more detailed migration plan or example scripts, please provide your specific table structures and data volume; the GreptimeDB community will offer further support. You are welcome to join the Greptime Slack.