Data Partitioning

Partitioning data has two main advantages.

  1. Every SM no longer keeps a complete copy of the database in its archive, only the parititions it is responsible for. This reduces the disk requirement of individual SMs. However, additional SMs are often required to ensure redundancy (multiple copies) of all partitions so the overall disk space needed may be higher.

  2. When performing queries, if entire partitions can be eliminated, then the query need only be run against matching partitions. The number of rows cached and processed can be reduced.

Data partitioning in NuoDB involves the use of Storage Groups.

Storage Groups

Within a NuoDB database, all data is assigned to a Storage Group. Out of the box, there is a single storage group, called UNPARTITIONED, and all SMs manage it. In this way, out-of-the-box, all SMs hold a copy of all data in the database because it is contained in the one storage group.

You may optionally create your own storage groups and assign different SMs to look after them. Typically this involves running additional SMs, but each SM only holds a subset of data (the data in its assigned storage groups). This reduces the amount of disk needed for individual SM archives and the amount of data they need to cache. Greater data throughput may be possible if multiple storage groups are being used in parallel.

SMs handling storage groups also need to be redundant. For high-availability, assign at least two SMs to handle each storage group.

Data Type Partitioning

For example, in an online shopping application, you might divide all your data into three groups: Customer Data, Supplier Data and Products. All customer-related tables such as Customer, Invoices, Payments would be allocated to the Customer Data storage group and so on.

Let’s assume we have six nodes, one for each archive with archives-ids 0,1,2,3,4, and 5 respectively.

Running nuocmd show archives will show the SM to archive-id mappings

We will create three storage groups with two SMs managing each storage group:

nuocmd add storage-group --db-name testdb --sg-name products --archive-ids 0,1
nuocmd add storage-group --db-name testdb --sg-name customers --archive-ids 2,3
nuocmd add storage-group --db-name testdb --sg-name suppliers --archive-ids 4.5

SMs, Archives, and Storage Groups

In the example above, each storage group has two associated archives and therefore two SMs handing it. This is for redundancy.

Assigning one storage group to each archive is just one possibility. An SM can handle multiple storage groups if desired. Another arrangement could use just three archives and three SMs with each SM handling two storage groups as follows:

  • SG1: archives 0 and 1

  • SG2: archives 1 and 2

  • SG3: archives 2 and 0

Assigning Storage Groups to Tables

Create the tables

CREATE TABLE Products (...) STORE IN products;
CREATE TABLE StockLevels (...) STORE IN products;

And so on.

Table Partitioning

Alternatively, rows in the same table may be partitioned across multiple storage groups. This is useful for spreading the load for very large tables.

In the following example, the Customers table is partitioned into asiapac, emea, and americas based on the values in the region column and are stored in storage groups sg1, sg2, and sg3.

Any data associated with a region other than the six defined regions will be stored within the americas partition located in the sg3 storage group.

CREATE TABLE Customers (id INTEGER, region STRING, name STRING)
PARTITION BY LIST(region)
    PARTITION asiapac VALUES IN ('asia', 'australia', 'pacific') STORE IN sg1,
    PARTITION emea VALUES IN ('europe', 'middle-east', 'africa') STORE IN sg2,
    PARTITION americas VALUES IN (DEFAULT) STORE IN sg3;

Or

CREATE TABLE Customers (id INTEGER, region STRING, name STRING)
    PARTITION BY LIST(region) PARTITION asiapac VALUES IN ('asia', 'australia', 'pacific') STORE IN sg1;
ALTER TABLE Customers ADD PARTITION emea VALUES IN ('europe', 'middle-east', 'africa') STORE IN sg2;
ALTER TABLE Customers ADD PARTITION americas VALUES IN (DEFAULT) STORE IN sg3;

When using any form of partitioning, a uniform spread of data gives the best results. In this case, we are distributing customers across the three regions. If the customer base is predominantly in a single region, data is unbalanced and one storage group will be overloaded. Partitioning by region, zip or postal code, the first letter of customer or business name may seem appealing, but be careful of the spread of data.

Another common approach is to partition by date. Typically only the most recent data is in current use. Older data sits in other storage groups and is rarely used. Assuming there is the same amount of data generated in any given period (typically per month or per year) then an even loading of storage groups is possible.

When partitioning, it is important to ensure data is partitioned according to the criterion most likely to improve performance and disk usage.