Partitioning data has two main advantages.
Every SM no longer keeps a complete copy of the database in its archive, only the parititions it is responsible for. This reduces the disk requirement of individual SMs. However, additional SMs are often required to ensure redundancy (multiple copies) of all partitions so the overall disk space needed may be higher.
When performing queries, if entire partitions can be eliminated, then the query need only be run against matching partitions. The number of rows cached and processed can be reduced.
Data partitioning in NuoDB involves the use of Storage Groups.
Within a NuoDB database, all data is assigned to a Storage Group.
Out of the box, there is a single storage group, called
UNPARTITIONED, and all SMs manage it.
In this way, out-of-the-box, all SMs hold a copy of all data in the database because it is contained in the one storage group.
You may optionally create your own storage groups and assign different SMs to look after them. Typically this involves running additional SMs, but each SM only holds a subset of data (the data in its assigned storage groups). This reduces the amount of disk needed for individual SM archives and the amount of data they need to cache. Greater data throughput may be possible if multiple storage groups are being used in parallel.
SMs handling storage groups also need to be redundant. For high-availability, assign at least two SMs to handle each storage group.
For example, in an online shopping application, you might divide all your data into three groups: Customer Data, Supplier Data and Products.
All customer-related tables such as
Payments would be allocated to the Customer Data storage group and so on.
Let’s assume we have six nodes, one for each archive with
archives-ids 0,1,2,3,4, and 5 respectively.
We will create three storage groups with two SMs managing each storage group:
nuocmd add storage-group --db-name testdb --sg-name products --archive-ids 0,1
nuocmd add storage-group --db-name testdb --sg-name customers --archive-ids 2,3
nuocmd add storage-group --db-name testdb --sg-name suppliers --archive-ids 4.5
In the example above, each storage group has two associated archives and therefore two SMs handing it. This is for redundancy.
Assigning one storage group to each archive is just one possibility. An SM can handle multiple storage groups if desired. Another arrangement could use just three archives and three SMs with each SM handling two storage groups as follows:
SG1: archives 0 and 1
SG2: archives 1 and 2
SG3: archives 2 and 0
Alternatively, rows in the same table may be partitioned across multiple storage groups. This is useful for spreading the load for very large tables.
Suppose we have storage groups
sg3. We can divide the world into three regions:
MAP PARTITION asiapac STORE IN sg1;
MAP PARTITION emea STORE IN sg2;
MAP PARTITION americas STORE IN sg3;
CREATE TABLE Customers (..., region STRING, ...) PARTITION BY LIST(region) PARTITION asiapac VALUES IN ('asia', 'australia', 'pacific'), PARTITION emea VALUES IN ('europe', 'middle-east', 'africa'), PARTITION americas VALUES IN (DEFAULT);
Note that anything other than the 5 defined region names will go in
When using any form of partitioning, a uniform spread of data gives the best results. In this case, we are distributing customers across the three regions. If the customer base is predominantly in a single region, data is unbalanced and one storage group will be overloaded. Partitioning by region, zip or postal code, the first letter of customer or business name may seem appealing, but be careful of the spread of data.
Another common approach is to partition by date. Typically only the most recent data is in current use. Older data sits in other storage groups and is rarely used. Assuming there is the same amount of data generated in any given period (typically per month or per year) then an even loading of storage groups is possible.
When partitioning, it is important to ensure data is partitioned according to the criterion most likely to improve performance and disk usage.