Cassandra Data Modelling
This one covers the data modelling for the next-generation SAP Ariba Catalog solution, the world's largest procurement solution, with more than 2 billion items in its index and over 3 million updates per day.
Picking the right data model is the hardest part of using Cassandra.
We need to carefully design the schema around the query patterns specific to the business problem to get the best performance out of Cassandra. This is always a challenge for complicated data sets and queries.
Basic Modelling Principle
Cassandra is a distributed database in which data is partitioned and stored across multiple nodes within a cluster.
These are the two high-level goals for the data model:
- Spread data evenly around the cluster — The idea is to have roughly the same amount of data on every node in the cluster.
- Minimize the number of partitions read — The idea here is to read rows from as few partitions as possible. Partitions are groups of rows that share the same partition key.
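The second goal hinges on how rows group by partition key. A conceptual sketch in plain Python (not the Cassandra API; the catalog and item names are made up) shows why a query that filters on the partition key touches only one group of rows:

```python
from collections import defaultdict

# Each row is (partition_key, item); rows sharing a partition key
# are stored together on the same replica nodes in Cassandra.
rows = [
    ("catalogA", "item1"),
    ("catalogA", "item2"),
    ("catalogB", "item3"),
]

partitions = defaultdict(list)
for partition_key, item in rows:
    partitions[partition_key].append(item)

# Fetching all of catalogA reads exactly one partition:
assert partitions["catalogA"] == ["item1", "item2"]
# A query that cannot name a partition key would have to scan them all.
assert len(partitions) == 2
```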
Defining the data model, i.e. the partition key, for the Ariba Catalog solution is challenging given its diverse customer base across industry verticals. Typical query look-ups are based either on the catalog name or on the item key, i.e. a compound key of supplier ID and item ID, for a particular customer.
Catalog Name as the partition key
This ensures that looking up items by catalog name is efficient; however, looking up items by item key will be suboptimal, as a catalog can have millions of items.
Similarly, defining the customer as the key will not scale well, because content volume varies widely across customers.
Item key as the partition key
While this will definitely spread items evenly across nodes, looking up items by catalog name will be highly inefficient, as it needs to scan multiple partitions.
Hybrid approach — Enforcing the number of partitions
In this approach, we force the data for a particular catalog to be spread across a fixed set of partitions, i.e. we bucket the data to break it into moderately sized partitions:
PRIMARY KEY ((customerId, catalogName, hash_prefix), itemkey)
The hash_prefix column is a kind of bucket number, i.e. itemKey modulo X. This design ensures that the items in a catalog are always partitioned into X buckets, so a catalog's items end up in a controlled number of partitions.
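As a sketch of the bucketing (plain Python, not tied to any driver; the item-key format and bucket count are assumptions for illustration), the hash_prefix should come from a hash that is stable across processes, so every writer and reader computes the same bucket for the same item:

```python
import zlib

def hash_prefix(item_key: str, num_buckets: int = 100) -> int:
    # CRC32 is deterministic across runs and machines, unlike
    # Python's built-in hash(), which is salted per process.
    return zlib.crc32(item_key.encode("utf-8")) % num_buckets

# The same item key always maps to the same bucket...
bucket = hash_prefix("supplier42:item1001")
assert bucket == hash_prefix("supplier42:item1001")
# ...and every bucket falls within the fixed range 0..num_buckets-1,
# so a catalog spreads over at most num_buckets partitions.
assert 0 <= bucket < 100
```

To read an entire catalog, the application then queries all X buckets for that (customerId, catalogName), e.g. with an `IN` clause on hash_prefix or X parallel queries, which is exactly the "controlled number of partitions" trade-off.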
The first thing that you want to look for is whether your tables will have partitions that will be overly large, or to put it another way, too wide. Partition size is measured by the number of cells (values) that are stored in the partition. Cassandra’s hard limit is two billion cells per partition, but you’ll likely run into performance issues before reaching that limit. The recommended size of a partition is not more than 100,000 cells.
We see that in our case the number of cells equals the number of rows. To support 10 million items in a catalog, we would therefore need 100 partitions with 100,000 cells (rows) per partition.