MongoDB sharding by monotonic id with zones for archiving
Situation:
I have an ever-growing collection of documents with a monotonically increasing unique id.
All queries are direct lookups for documents with a specified _id
(no range queries, no queries on other fields).
Queries on newer documents are more frequent than on older documents.
The workload is both read- and write-heavy (on newer data).
Goals:
- distributing reads/writes for the newest data onto multiple shards
- ability to move older data to archive nodes with cheaper hardware
Considerations/options:
All queries use _id, and cardinality is good (since values are unique), so the shard key should be _id.
- Use shard key
{_id:1}
- The problem, of course, is that the shard with the newest data (largest ids) will be hot, since all new writes and most reads will hit it.
- A benefit of this setup is that I can periodically tag new ranges, and then later (in a couple of months or a year) assign that tag's zone to shards on cheaper hardware, so that the whole range of data can be transparently moved onto archive nodes.
Simple outline of shards
+---------+ +------------+ +-------------+
|HDD | |HDD | |SSD |
|archive1 | |archive2 | |current |
|id:[0,99]| |id:[100,199]| |id:[200,inf] |
+---------+ +------------+ +-------------+
After some time, a new machine is added, the ranges are changed, and data is moved
new machine
+----------+ +------------+ +-------------+ +------------+
|HDD | |HDD | |HDD | |SSD |
|archive1 | |archive2 | |archive3 | |current |
|id:[0,100]| |id:[100,200]| |id:[200,300] | |id:[300,inf]|
+----------+ +------------+ +-------------+ +------------+
- Use shard key
{_id:hashed}
- A benefit of this kind of sharding is that writes and reads are evenly distributed
- The problem is that it is no longer possible to tag a range of ids to be assigned to an archive server. The cause is that the range tag is applied not to the original id but to the hashed value of the id, as stated in the documentation under Hashed Shard Keys and Zones.
Outline of evenly distributed current shards (assume equal ranges; -100 and 100 are just placeholders)
+--------------------+
|SSD |
|current1 |
|hash(id):[-inf,-100]|
+--------------------+
+--------------------+
|SSD |
|current2 |
|hash(id):[-100,100] |
+--------------------+
+--------------------+
|SSD |
|current3 |
|hash(id):[100,inf] |
+--------------------+
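For reference, a hashed shard key of this kind is declared like this in the mongo shell (the db.coll namespace is a placeholder):

```javascript
// Enable sharding and shard the collection on the hashed _id.
// Equality lookups on _id are still routed to a single shard,
// but the chunk ranges are over hash(_id), not _id itself.
sh.enableSharding("db")
sh.shardCollection("db.coll", { _id: "hashed" })
```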
Use a compound shard key
- I was thinking a compound key could help here, but I can't figure out how to achieve the desired behavior
- I was considering a shard key like
{id:1, hashedId:1}
where hashedId would be hash(id) calculated on the client side
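A minimal sketch of the client-side hashedId computation (Python; the md5-based derivation is my own illustration, not MongoDB's internal hash function):

```python
import hashlib
import struct

def hashed_id(doc_id: int) -> int:
    """Derive a stable pseudo-random signed 64-bit integer from the
    monotonic id, to be stored alongside it as `hashedId`."""
    digest = hashlib.md5(struct.pack("<q", doc_id)).digest()
    return struct.unpack("<q", digest[:8])[0]

doc = {"id": 12345, "hashedId": hashed_id(12345)}
```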
Desired outline:
The goal is to have shards with HDDs storing older data and shards with SSDs handling most read and write operations.
+----------+ . +------------+ . +--------------------+
|HDD | . |HDD | . |SSD |
|archive1 | . |archive2 | . |current1 |
|id:[0,100]| . |id:[100,200]| . |hash(id):[-inf,-100]|
+----------+ . +------------+ . +--------------------+
. . +--------------------+
. . |SSD |
. . |current2 |
. . |hash(id):[-100,100] |
. . +--------------------+
. . +--------------------+
. . |SSD |
. . |current3 |
. . |hash(id):[100,inf] |
. . +--------------------+
. .
id:[0,100] . id:[100,200] . id:[200,inf]
My current workaround
I have manually written a wrapper around the mongo client at the application level which routes requests to the appropriate mongo server, thus simulating this behavior. Currently, those servers are not connected into a cluster. The problem with this workaround is the operational complexity whenever changes are made (change the routing config, restart the application(s), ...). This is why I would like to move that logic to the database level.
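For illustration, the application-level routing amounts to something like the following sketch (server addresses and range boundaries are hypothetical placeholders):

```python
import bisect

class RangeRouter:
    """Map an id to the server owning its range.  Ranges are
    half-open [lower, upper); the final entry (upper bound None)
    is the 'current' server catching all newer ids."""
    def __init__(self, ranges):
        # ranges: list of (upper_bound_exclusive, server), ascending
        self.bounds = [b for b, _ in ranges if b is not None]
        self.servers = [s for _, s in ranges]

    def server_for(self, doc_id):
        # bisect_right picks the first range whose upper bound
        # is strictly greater than doc_id
        return self.servers[bisect.bisect_right(self.bounds, doc_id)]

router = RangeRouter([(100, "archive1:27017"),
                      (200, "archive2:27017"),
                      (None, "current:27017")])
```

Every boundary change means editing this table and redeploying the application, which is exactly the operational burden described above.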
Question
What would be a way to choose a shard key to achieve the desired architecture?
Or, in general, when using a monotonic id, how can I achieve an architecture where new data is written evenly distributed among a set of shards and, as the data ages, can be moved to archive nodes at the DB admin's initiative, all transparently to the application?
mongodb sharding archive
asked Jul 28 '18 at 8:36 by Tomac Antonio
1 Answer
Unfortunately, with a fully monotonic shard key you cannot remove hot shards entirely. However, you can somewhat mitigate the effect by directing inserts onto a faster shard. You would need to determine whether avoiding a hot shard or archiving by document age is more important to you, since the two are pretty much mutually exclusive if you have to use a single field as the shard key.
Another solution is to use a compound shard key, but this will complicate your queries.
From your description, if you need to have only _id as the shard key, one possible workaround is zone sharding. Zone sharding allows you to assign shards to specific "zones" based on the shard key. The balancer will then automatically move chunks as required, so that chunks with specific shard key ranges reside on specific shards.
To illustrate, suppose you use {_id: 1} as your shard key and you have 4 shards to store archive and current documents:
Disable the balancer.
Create zone tags according to the zones you need and the shards you have, e.g.:
sh.addShardTag("shard0000", "archive1")
sh.addShardTag("shard0001", "archive2")
sh.addShardTag("shard0002", "current")
sh.addShardTag("shard0003", "current")
Note that you can assign multiple zones into a shard, and multiple shards into a zone.
Determine that _id: MinKey to _id: 200 should be located in archive1 and archive2:
sh.addTagRange("db.coll", { _id: MinKey }, { _id: 100 }, "archive1")
sh.addTagRange("db.coll", { _id: 100 }, { _id: 200 }, "archive2")
Determine that _id: 200 to _id: MaxKey should be located in current:
sh.addTagRange("db.coll", { _id: 200 }, { _id: MaxKey }, "current")
Enable the balancer.
The balancer will split and rebalance the collection according to the zones you specify. Note that this can take some time and resources if the collection is non-empty.
At some point you might want to change the boundaries of the zones. For this purpose, the manual page Tiered Hardware for Varying SLA or SLO describes using zone sharding in a situation similar to yours, under the section Updating zone ranges.
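Updating the boundaries boils down to removing the old range and adding the new ones. A hedged mongo-shell sketch, following the example above (the archive3 zone is hypothetical, and helper availability may vary by MongoDB version):

```javascript
sh.stopBalancer()
// retire the old open-ended "current" range...
sh.removeTagRange("db.coll", { _id: 200 }, { _id: MaxKey }, "current")
// ...carve an archive range out of it and re-open "current" higher up
sh.addTagRange("db.coll", { _id: 200 }, { _id: 300 }, "archive3")
sh.addTagRange("db.coll", { _id: 300 }, { _id: MaxKey }, "current")
sh.startBalancer()
```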
Would it be a good (or bad) idea if I were to change the id generator from being strictly monotonic into monotonic in the long term and distributed in the short term? Imagine I take the 64-bit integer produced by an incrementing generator, take, say, 24 (or more, depending on insertion speed) of the least significant bits, and perform a bit-reversal permutation on them. This should allow me to specify zone ranges for archiving while giving a locally distributed space for multiple SSD shards. The only thing I'm afraid of is a possible maintenance nightmare in the future.
– Tomac Antonio
Aug 6 '18 at 20:43
That sounds like it can be achieved using a compound shard key, e.g. {created_date: 1, _id: 1}, specifying a date for the zones, and keeping your working _id. The catch is that for fast find queries you would also need to specify both fields in the query (instead of just _id as you have now).
– Kevin Adistambha
Aug 6 '18 at 23:03
The main issue is that you need two conflicting features in a single deployment: you need it to be random half the time and ordered the other half of the time. I don't think a single field can achieve this. To solve it cleanly you would need at least two fields, one providing the order and the other providing the randomness. I'm afraid that if you resort to complicated measures, it will be prone to breakage.
– Kevin Adistambha
Aug 6 '18 at 23:07
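The bit-reversal permutation proposed in the comments can be sketched as follows (k = 4 in the example call just to keep the numbers readable):

```python
def mix_id(raw_id: int, k: int = 24) -> int:
    """Reverse the k least significant bits of a monotonic id.
    The high bits keep their order (long-term monotonic, so zone
    ranges for archiving still work), while consecutive ids scatter
    across 2**k buckets (short-term distributed)."""
    high, low = raw_id >> k, raw_id & ((1 << k) - 1)
    rev = 0
    for _ in range(k):           # classic bit-reversal of the low k bits
        rev = (rev << 1) | (low & 1)
        low >>= 1
    return (high << k) | rev

# Consecutive ids 0..3 map to 0, 8, 4, 12 within a 4-bit window,
# and applying the permutation twice restores the original id.
print([mix_id(i, 4) for i in range(4)])
```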
Your Answer
StackExchange.ready(function() {
var channelOptions = {
tags: "".split(" "),
id: "182"
};
initTagRenderer("".split(" "), "".split(" "), channelOptions);
StackExchange.using("externalEditor", function() {
// Have to fire editor after snippets, if snippets enabled
if (StackExchange.settings.snippets.snippetsEnabled) {
StackExchange.using("snippets", function() {
createEditor();
});
}
else {
createEditor();
}
});
function createEditor() {
StackExchange.prepareEditor({
heartbeatType: 'answer',
autoActivateHeartbeat: false,
convertImagesToLinks: false,
noModals: true,
showLowRepImageUploadWarning: true,
reputationToPostImages: null,
bindNavPrevention: true,
postfix: "",
imageUploader: {
brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
allowUrls: true
},
onDemand: true,
discardSelector: ".discard-answer"
,immediatelyShowMarkdownHelp:true
});
}
});
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
StackExchange.ready(
function () {
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fdba.stackexchange.com%2fquestions%2f213469%2fmongodb-sharding-by-monotonic-id-with-zones-for-archiving%23new-answer', 'question_page');
}
);
Post as a guest
Required, but never shown
1 Answer
1
active
oldest
votes
1 Answer
1
active
oldest
votes
active
oldest
votes
active
oldest
votes
Unfortunately with a fully monotonic shard key, you cannot remove hot shards entirely. However, you can somewhat mitigate its effect by inserting into a faster shard. You would need to determine if a hot shard or archival using document age is more important for you, since they are pretty much mutually exclusive if you have to use a single field as the shard key.
Another solution is to use a compound shard key, but this will complicate your queries.
From your description, if you need to have only _id
as the shard key, one possible workaround is to use zone sharding. Zone sharding allows you to assign shards to specific "zones" based on the shard key. The balancer will then automatically move chunks as required so that the chunks with the specific shard key will reside in some specific shard (or shards).
To illustrate, suppose you use {_id: 1}
as your shard key, and you have 4 shards of to store archive
and current
document:
Disable the balancer.
Create zone tags according to the zones you need and the shards you have, e.g.:
sh.addShardTag("shard0000", "archive1")
sh.addShardTag("shard0001", "archive2")
sh.addShardTag("shard0002", "current")
sh.addShardTag("shard0003", "current")
Note that you can assign multiple zones into a shard, and multiple shards into a zone.
Determine that
_id:MinKey
to_id:200
should be located inarchive1
andarchive2
:
sh.addTagRange("db.coll", { _id: MinKey }, { _id: 100 }, "archive1")
sh.addTagRange("db.coll", { _id: 100 }, { _id: 200 }, "archive2")
Determine that
_id:200
to_id:MaxKey
should be located incurrent
:
sh.addTagRange("db.coll", { _id: 200 }, { _id: MaxKey }, "current")
Enable the balancer.
The balancer will split and rebalance the collection according to the zones you specify. Note that this would take some time & resources if you have a non-empty collection.
At some point, you might want to change the boundaries of the zones. For this purpose, the manual page
Tiered Hardware for Varying SLA or SLO contains a description of using zone sharding in a situation similar to what you described, under the section Updating zone ranges.
Would it be good (or bad) idea if I were to change id generator from being strictly monotonic into monotonic on long-term and being distributed on short-term? Imagine if I take 64-bit integer generated by incrementing generator and take, say... 24 (or more, depending on insertion speed) of least significant bits and perform bit reversal permutation. This should allow me to specify zone ranges for archiving and give local distributed space for multiple SSD shards. Only thing I'm afraid of is a possible maintenance nightmare in future.
– Tomac Antonio
Aug 6 '18 at 20:43
That sounds like it can be achieved using a compound shard key, e.g. usingcreated_date:1, _id:1
, and specifying a date for the zones, and keep using your working_id
. The catch is, for fastfind
queries, you would also need to specify both fields in the query (instead of just_id
like what you have now).
– Kevin Adistambha
Aug 6 '18 at 23:03
The main issue is that you need two conflicting features in a single deployment. You need it to be random half the time, while you need it to be ordered the other half of the time. I don't think a single field can achieve this. To solve this cleanly, you would need at least 2 fields, one to provide the order, and the other to provide the randomness. I'm afraid if you use complicated measures, it will be prone to breakage.
– Kevin Adistambha
Aug 6 '18 at 23:07
add a comment |
Unfortunately with a fully monotonic shard key, you cannot remove hot shards entirely. However, you can somewhat mitigate its effect by inserting into a faster shard. You would need to determine if a hot shard or archival using document age is more important for you, since they are pretty much mutually exclusive if you have to use a single field as the shard key.
Another solution is to use a compound shard key, but this will complicate your queries.
From your description, if you need to have only _id
as the shard key, one possible workaround is to use zone sharding. Zone sharding allows you to assign shards to specific "zones" based on the shard key. The balancer will then automatically move chunks as required so that the chunks with the specific shard key will reside in some specific shard (or shards).
To illustrate, suppose you use {_id: 1}
as your shard key, and you have 4 shards of to store archive
and current
document:
Disable the balancer.
Create zone tags according to the zones you need and the shards you have, e.g.:
sh.addShardTag("shard0000", "archive1")
sh.addShardTag("shard0001", "archive2")
sh.addShardTag("shard0002", "current")
sh.addShardTag("shard0003", "current")
Note that you can assign multiple zones into a shard, and multiple shards into a zone.
Determine that
_id:MinKey
to_id:200
should be located inarchive1
andarchive2
:
sh.addTagRange("db.coll", { _id: MinKey }, { _id: 100 }, "archive1")
sh.addTagRange("db.coll", { _id: 100 }, { _id: 200 }, "archive2")
Determine that
_id:200
to_id:MaxKey
should be located incurrent
:
sh.addTagRange("db.coll", { _id: 200 }, { _id: MaxKey }, "current")
Enable the balancer.
The balancer will split and rebalance the collection according to the zones you specify. Note that this would take some time & resources if you have a non-empty collection.
At some point, you might want to change the boundaries of the zones. For this purpose, the manual page
Tiered Hardware for Varying SLA or SLO contains a description of using zone sharding in a situation similar to what you described, under the section Updating zone ranges.
Would it be good (or bad) idea if I were to change id generator from being strictly monotonic into monotonic on long-term and being distributed on short-term? Imagine if I take 64-bit integer generated by incrementing generator and take, say... 24 (or more, depending on insertion speed) of least significant bits and perform bit reversal permutation. This should allow me to specify zone ranges for archiving and give local distributed space for multiple SSD shards. Only thing I'm afraid of is a possible maintenance nightmare in future.
– Tomac Antonio
Aug 6 '18 at 20:43
That sounds like it can be achieved using a compound shard key, e.g. usingcreated_date:1, _id:1
, and specifying a date for the zones, and keep using your working_id
. The catch is, for fastfind
queries, you would also need to specify both fields in the query (instead of just_id
like what you have now).
– Kevin Adistambha
Aug 6 '18 at 23:03
The main issue is that you need two conflicting features in a single deployment. You need it to be random half the time, while you need it to be ordered the other half of the time. I don't think a single field can achieve this. To solve this cleanly, you would need at least 2 fields, one to provide the order, and the other to provide the randomness. I'm afraid if you use complicated measures, it will be prone to breakage.
– Kevin Adistambha
Aug 6 '18 at 23:07
add a comment |
Unfortunately with a fully monotonic shard key, you cannot remove hot shards entirely. However, you can somewhat mitigate its effect by inserting into a faster shard. You would need to determine if a hot shard or archival using document age is more important for you, since they are pretty much mutually exclusive if you have to use a single field as the shard key.
Another solution is to use a compound shard key, but this will complicate your queries.
From your description, if you need to have only _id
as the shard key, one possible workaround is to use zone sharding. Zone sharding allows you to assign shards to specific "zones" based on the shard key. The balancer will then automatically move chunks as required so that the chunks with the specific shard key will reside in some specific shard (or shards).
To illustrate, suppose you use {_id: 1}
as your shard key, and you have 4 shards of to store archive
and current
document:
Disable the balancer.
Create zone tags according to the zones you need and the shards you have, e.g.:
sh.addShardTag("shard0000", "archive1")
sh.addShardTag("shard0001", "archive2")
sh.addShardTag("shard0002", "current")
sh.addShardTag("shard0003", "current")
Note that you can assign multiple zones into a shard, and multiple shards into a zone.
Determine that
_id:MinKey
to_id:200
should be located inarchive1
andarchive2
:
sh.addTagRange("db.coll", { _id: MinKey }, { _id: 100 }, "archive1")
sh.addTagRange("db.coll", { _id: 100 }, { _id: 200 }, "archive2")
Determine that
_id:200
to_id:MaxKey
should be located incurrent
:
sh.addTagRange("db.coll", { _id: 200 }, { _id: MaxKey }, "current")
Enable the balancer.
The balancer will split and rebalance the collection according to the zones you specify. Note that this would take some time & resources if you have a non-empty collection.
At some point, you might want to change the boundaries of the zones. For this purpose, the manual page
Tiered Hardware for Varying SLA or SLO contains a description of using zone sharding in a situation similar to what you described, under the section Updating zone ranges.
Unfortunately with a fully monotonic shard key, you cannot remove hot shards entirely. However, you can somewhat mitigate its effect by inserting into a faster shard. You would need to determine if a hot shard or archival using document age is more important for you, since they are pretty much mutually exclusive if you have to use a single field as the shard key.
Another solution is to use a compound shard key, but this will complicate your queries.
edited Aug 6 '18 at 7:07
answered Aug 6 '18 at 7:01
Kevin Adistambha
Would it be a good (or bad) idea if I were to change the id generator from being strictly monotonic into monotonic long-term but distributed short-term? Imagine I take the 64-bit integer produced by an incrementing generator, take, say, 24 (or more, depending on insertion speed) of its least significant bits, and perform a bit-reversal permutation on them. This should still allow me to specify zone ranges for archiving while giving a locally distributed key space for multiple SSD shards. The only thing I'm afraid of is a possible maintenance nightmare in the future.
– Tomac Antonio
Aug 6 '18 at 20:43
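For concreteness, the bit-reversal idea from the comment above can be sketched as a small pure function. The function name spreadId and the BigInt representation are mine, not from the question; it reverses the lowest width bits so that consecutive ids land far apart while the high (monotonic) bits still define coarse archive ranges. Since reversal is an involution, the mapping is trivially invertible:

```javascript
// Sketch (assumption): spread a monotonically increasing 64-bit id by
// reversing its lowest `width` bits. BigInt keeps the high bits exact.
function spreadId(id, width = 24n) {
  const mask = (1n << width) - 1n;
  const low = id & mask;                    // the bits to be reversed
  let reversed = 0n;
  for (let i = 0n; i < width; i++) {
    reversed = (reversed << 1n) | ((low >> i) & 1n);
  }
  // High bits keep long-term order; reversed low bits spread nearby inserts.
  return (id & ~mask) | reversed;
}
```

Applying spreadId again recovers the original id, so documents could still be looked up by their external id after mapping it through the same function.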
That sounds like it can be achieved using a compound shard key, e.g. using created_date:1, _id:1, specifying a date for the zones, and keeping your working _id. The catch is, for fast find queries, you would also need to specify both fields in the query (instead of just _id like what you have now).
– Kevin Adistambha
Aug 6 '18 at 23:03
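A hedged sketch of that compound-key setup, assuming a hypothetical collection db.coll that carries a created_date field (the dates and values below are placeholders):

```javascript
// Shard on a compound key: the date drives zone boundaries, _id adds uniqueness.
sh.shardCollection("db.coll", { created_date: 1, _id: 1 })

// Zone ranges can now be expressed in terms of dates:
sh.addTagRange("db.coll",
  { created_date: MinKey, _id: MinKey },
  { created_date: ISODate("2018-01-01"), _id: MinKey },
  "archive1")

// The catch from the comment: a targeted read must include both fields,
// otherwise mongos broadcasts the query to every shard.
db.coll.find({ created_date: ISODate("2018-03-15"), _id: 12345 })
```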
The main issue is that you need two conflicting features in a single deployment. You need it to be random half the time, while you need it to be ordered the other half of the time. I don't think a single field can achieve this. To solve this cleanly, you would need at least 2 fields, one to provide the order, and the other to provide the randomness. I'm afraid if you use complicated measures, it will be prone to breakage.
– Kevin Adistambha
Aug 6 '18 at 23:07