Data Partitioning
Partitioning data has two main advantages.
-
Every SM no longer keeps a complete copy of the database in its archive, only the partitions it is responsible for. This reduces the disk requirement of individual SMs. However, additional SMs are often required to ensure redundancy (multiple copies) of all partitions so the overall disk space needed may be higher.
-
When performing queries, if entire partitions can be eliminated, then the query need only be run against matching partitions. The number of rows cached and processed can be reduced.
Data partitioning in NuoDB involves the use of Storage Groups.
Storage Groups
Within a NuoDB database, all data is assigned to a Storage Group.
By default, there is a single storage group called UNPARTITIONED
(alias DEFAULT
), and all SMs manage it.
In this way, each SM manages a full copy of the database because it contains the single UNPARTITIONED
storage group.
You may optionally create your own storage groups and assign different SMs to look after them. Typically this involves running additional SMs, but each SM only holds a subset of data (the data in its assigned storage groups). This reduces the amount of disk needed for individual SM archives and the amount of data they need to cache. Greater data throughput may be possible if multiple storage groups are being used in parallel.
SMs handling storage groups also need to be redundant. For high-availability, assign at least two SMs to handle each storage group.
Data Type Partitioning
For example, in an online shopping application, you might divide all your data into three groups: Customer Data, Supplier Data and Products.
All customer-related tables such as Customer
, Invoices
, Payments
would be allocated to the Customer Data storage group and so on.
Let’s assume we have six nodes, one for each archive with archives-ids
0,1,2,3,4, and 5 respectively.
Running nuocmd show archives will show the SM to archive-id mappings
|
We will create three storage groups with two SMs managing each storage group:
nuocmd add storage-group --db-name testdb --sg-name products --archive-ids 0,1
nuocmd add storage-group --db-name testdb --sg-name customers --archive-ids 2,3
nuocmd add storage-group --db-name testdb --sg-name suppliers --archive-ids 4.5
SMs, Archives, and Storage Groups
In the example above, each storage group has two associated archives and therefore two SMs handing it. This is for redundancy.
Assigning one storage group to each archive is just one possibility. An SM can handle multiple storage groups if desired. Another arrangement could use just three archives and three SMs with each SM handling two storage groups as follows:
-
SG1: archives 0 and 1
-
SG2: archives 1 and 2
-
SG3: archives 2 and 0
Table Partitioning
Alternatively, rows in the same table may be partitioned across multiple storage groups. This is useful for spreading the load for very large tables.
In the following example, the Customers
table is partitioned into asiapac
, emea
, and americas
based on the values in the region
column and are stored in storage groups sg1
, sg2
, and sg3
.
Any data associated with a region other than the six defined regions will be stored within the americas
partition located in the sg3
storage group.
CREATE TABLE Customers (id INTEGER, region STRING, name STRING)
PARTITION BY LIST(region)
PARTITION asiapac VALUES IN ('asia', 'australia', 'pacific') STORE IN sg1,
PARTITION emea VALUES IN ('europe', 'middle-east', 'africa') STORE IN sg2,
PARTITION americas VALUES IN (DEFAULT) STORE IN sg3;
Or
CREATE TABLE Customers (id INTEGER, region STRING, name STRING)
PARTITION BY LIST(region) PARTITION asiapac VALUES IN ('asia', 'australia', 'pacific') STORE IN sg1;
ALTER TABLE Customers ADD PARTITION emea VALUES IN ('europe', 'middle-east', 'africa') STORE IN sg2;
ALTER TABLE Customers ADD PARTITION americas VALUES IN (DEFAULT) STORE IN sg3;
When using any form of partitioning, a uniform spread of data gives the best results. In this case, we are distributing customers across the three regions. If the customer base is predominantly in a single region, data is unbalanced and one storage group will be overloaded. Partitioning by region, zip or postal code, the first letter of customer or business name may seem appealing, but be careful of the spread of data.
Another common approach is to partition by date. Typically only the most recent data is in current use. Older data sits in other storage groups and is rarely used. Assuming there is the same amount of data generated in any given period (typically per month or per year) then an even loading of storage groups is possible.
When partitioning, it is important to ensure data is partitioned according to the criterion most likely to improve performance and disk usage.