MongoDB sharding by monotonic id with zones for archiving

Situation:



I have an ever-growing collection of documents that have a monotonically increasing unique id.
All queries are direct lookups for documents with a specified _id (no range queries and no queries on other fields).
Queries on newer documents are more frequent than queries on older documents.
The workload is both read and write heavy, concentrated on newer data.



Goals:




  1. Distribute reads/writes for the newest data across multiple shards.

  2. Be able to move older data to archive nodes with cheaper hardware.


Considerations/options:



All queries use _id and its cardinality is good (values are unique), so the shard key should be based on _id.




  1. Use shard key {_id:1}


    • The problem, of course, is that the shard holding the newest data (the biggest ids, i.e. current) will be hot, since all new writes hit it and most reads hit it as well.

    • A benefit of this setup is that I can periodically tag new ranges and later (in a couple of months or a year) assign those zone tags to shards on cheaper hardware, so that a whole range of data can be transparently moved onto archive nodes; a sketch of the commands is below.
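
For illustration, a minimal mongo shell sketch of option 1, assuming a database db with a collection coll and placeholder shard names:

    // Range-shard the collection on the monotonic _id.
    sh.enableSharding("db")
    sh.shardCollection("db.coll", { _id: 1 })

    // Later, tag an old id range so the balancer can move it to an archive shard.
    sh.addShardTag("shard0000", "archive1")
    sh.addTagRange("db.coll", { _id: MinKey }, { _id: 100 }, "archive1")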




Simple outline of shards



+-----------+   +--------------+   +--------------+
| HDD       |   | HDD          |   | SSD          |
| archive1  |   | archive2     |   | current      |
| id:[0,99] |   | id:[100,199] |   | id:[200,inf] |
+-----------+   +--------------+   +--------------+


As time goes by, a new machine is added, the ranges are changed, and data is moved:



                                                         new machine
+------------+   +--------------+   +--------------+   +--------------+
| HDD        |   | HDD          |   | HDD          |   | SSD          |
| archive1   |   | archive2     |   | archive3     |   | current      |
| id:[0,100] |   | id:[100,200] |   | id:[200,300] |   | id:[300,inf] |
+------------+   +--------------+   +--------------+   +--------------+



  2. Use shard key {_id:hashed}


    • A benefit of this kind of sharding is that writes and reads are evenly distributed.

    • The problem is that it's no longer possible to tag a range of ids for assignment to an archive server, because the zone range applies to the hashed value of the id rather than to the original id, as stated in the documentation under Hashed Shard Keys and Zones (see the sketch below).
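
For comparison, a minimal sketch of option 2 (same placeholder names):

    // Hash-shard the collection: mongos routes by the hash of _id,
    // so monotonically increasing ids spread evenly across shards.
    sh.shardCollection("db.coll", { _id: "hashed" })

    // Any zone range would now have to be expressed on the hashed values
    // (NumberLong boundaries), not on the original ids.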




Outline of evenly distributed current shards (assume equal ranges; -100 and 100 are just placeholders):



+----------------------+
| SSD                  |
| current1             |
| hash(id):[-inf,-100] |
+----------------------+
+----------------------+
| SSD                  |
| current2             |
| hash(id):[-100,100]  |
+----------------------+
+----------------------+
| SSD                  |
| current3             |
| hash(id):[100,inf]   |
+----------------------+




  3. Use some form of compound shard key




    • I was thinking I could get some benefit from a compound key, but I can't figure out how to achieve the desired behavior.

    • I was considering a shard key like {id:1, hashedId:1}, where hashedId would be hash(id) calculated on the client side (see the sketch below).
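
A minimal sketch of that idea, with a hypothetical client-side hash function and placeholder names. Note that with id as the leading field, range routing would still follow the monotonic id:

    // Hypothetical compound shard key with a client-computed hash field.
    sh.shardCollection("db.coll", { id: 1, hashedId: 1 })

    // The application would compute hashedId itself on insert;
    // someClientSideHash is just a placeholder for that client-side function.
    db.coll.insertOne({ id: NumberLong(201), hashedId: someClientSideHash(201), payload: "..." })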




Desired outline:



The goal is to have HDD shards that store older data and SSD shards that handle most read and write operations.



+------------+  .  +--------------+  .  +----------------------+
| HDD        |  .  | HDD          |  .  | SSD                  |
| archive1   |  .  | archive2     |  .  | current1             |
| id:[0,100] |  .  | id:[100,200] |  .  | hash(id):[-inf,-100] |
+------------+  .  +--------------+  .  +----------------------+
                .                    .  +----------------------+
                .                    .  | SSD                  |
                .                    .  | current2             |
                .                    .  | hash(id):[-100,100]  |
                .                    .  +----------------------+
                .                    .  +----------------------+
                .                    .  | SSD                  |
                .                    .  | current3             |
                .                    .  | hash(id):[100,inf]   |
                .                    .  +----------------------+
                .                    .
 id:[0,100]     .   id:[100,200]     .   id:[200,inf]


My current workaround



Right now I have hand-written a wrapper around the mongo client at the application level that routes requests to the appropriate mongo server, simulating this behavior. Those servers are currently not connected into a cluster. The problem with this workaround is the operational complexity whenever something changes (update the routing config, restart the application(s), ...). This is why I would like to move that logic down to the database level. A rough sketch of the wrapper follows below.
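
For context, a rough sketch of that application-level routing (Node.js driver; all URLs, names and thresholds are hypothetical):

    // Hypothetical application-level router: picks a standalone mongod by id range.
    const { MongoClient } = require("mongodb");

    const ranges = [                                      // kept in application config
      { maxId: 100, url: "mongodb://archive1:27017" },    // HDD machine
      { maxId: 200, url: "mongodb://archive2:27017" },    // HDD machine
      { maxId: Infinity, url: "mongodb://current:27017" } // SSD machine
    ];

    async function findById(id) {
      const { url } = ranges.find(r => id < r.maxId);     // route by id range
      const client = await MongoClient.connect(url);      // real code would cache/pool clients
      try {
        return await client.db("db").collection("coll").findOne({ _id: id });
      } finally {
        await client.close();
      }
    }

Every boundary change means editing ranges and redeploying, which is exactly the operational burden described above.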



Question



How should I choose a shard key to achieve the desired architecture?
Or, more generally: with a monotonic id, how can I achieve an architecture where new data is written evenly across a set of shards and, as data gets older, it can be moved to archive nodes at the DB admin's initiative, all transparently to the application?










Tags: mongodb, sharding, archive






asked Jul 28 '18 at 8:36 by Tomac Antonio
























          1 Answer














          Unfortunately, with a fully monotonic shard key you cannot remove hot shards entirely. However, you can somewhat mitigate the effect by having the inserts land on a faster shard. You would need to decide whether avoiding a hot shard or age-based archival is more important to you, since the two are pretty much mutually exclusive if you have to use a single field as the shard key.



          Another solution is to use a compound shard key, but this will complicate your queries.



          From your description, if you need to have only _id as the shard key, one possible workaround is zone sharding. Zone sharding lets you assign shards to specific "zones" based on the shard key; the balancer then automatically moves chunks so that chunks in a given shard key range reside on a specific shard (or shards).



          To illustrate, suppose you use {_id: 1} as your shard key and you have 4 shards to store archive and current documents:




          1. Disable the balancer (the shell commands for steps 1 and 5 are sketched after this list).



          2. Create zone tags according to the zones you need and the shards you have, e.g.:



            sh.addShardTag("shard0000", "archive1")
            sh.addShardTag("shard0001", "archive2")
            sh.addShardTag("shard0002", "current")
            sh.addShardTag("shard0003", "current")


            Note that you can assign multiple zones to a shard, and multiple shards to a zone.




          3. Specify that _id:MinKey up to _id:200 should be located in archive1 and archive2:



            sh.addTagRange("db.coll", { _id: MinKey }, { _id: 100 }, "archive1")
            sh.addTagRange("db.coll", { _id: 100 }, { _id: 200 }, "archive2")



          4. Specify that _id:200 up to _id:MaxKey should be located in current:



            sh.addTagRange("db.coll", { _id: 200 }, { _id: MaxKey }, "current")


          5. Enable the balancer.
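
          For steps 1 and 5, the balancer is toggled from the mongo shell, for example:

            sh.stopBalancer()      // step 1: pause chunk migrations while the tags are set up
            // ... steps 2 to 4 ...
            sh.startBalancer()     // step 5: let the balancer move chunks into their zones
            sh.getBalancerState()  // optional check that the balancer is enabled again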



          The balancer will then split and rebalance the collection according to the zones you specify. Note that this can take some time and resources if the collection is not empty.



          At some point you might want to change the zone boundaries. For this, the manual page
          Tiered Hardware for Varying SLA or SLO describes using zone sharding in a situation similar to yours, under the section Updating zone ranges. A sketch of such an update follows.
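
          For example, a minimal sketch of later carving a new archive range out of current (the shard name and boundaries are placeholders):

            // Detach the old range from the "current" zone and re-tag it.
            sh.removeTagRange("db.coll", { _id: 200 }, { _id: MaxKey }, "current")
            sh.addShardTag("shard0004", "archive3")
            sh.addTagRange("db.coll", { _id: 200 }, { _id: 300 }, "archive3")
            sh.addTagRange("db.coll", { _id: 300 }, { _id: MaxKey }, "current")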






          answered Aug 6 '18 at 7:01, edited Aug 6 '18 at 7:07, by Kevin Adistambha

          • Would it be a good (or bad) idea to change the id generator from strictly monotonic to monotonic in the long term but distributed in the short term? Imagine taking the 64-bit integer produced by the incrementing generator and performing a bit-reversal permutation on, say, its 24 (or more, depending on insertion speed) least significant bits (sketched below). This should allow me to specify zone ranges for archiving while giving a locally distributed key space for multiple SSD shards. The only thing I'm afraid of is a possible maintenance nightmare in the future.

            – Tomac Antonio
            Aug 6 '18 at 20:43
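
            A minimal sketch of that bit-reversal idea in plain JavaScript (BigInt ids from the incrementing generator; purely illustrative):

            function scatterId(id) {                      // id: BigInt from the generator
              const LOW = 24n;                            // number of low bits to reverse
              const mask = (1n << LOW) - 1n;
              const low = id & mask;
              let reversed = 0n;
              for (let i = 0n; i < LOW; i++) {
                reversed = (reversed << 1n) | ((low >> i) & 1n);
              }
              return (id & ~mask) | reversed;             // high bits keep the long-term order
            }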













          • That sounds like it could be achieved with a compound shard key, e.g. {created_date: 1, _id: 1}, specifying dates for the zones while keeping your existing working _id. The catch is that, for fast targeted finds, you would also need to specify both fields in the query (instead of just _id as you do now); see the sketch below.

            – Kevin Adistambha
            Aug 6 '18 at 23:03
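
            A minimal sketch of that suggestion (placeholder names and dates; the zone bounds spell out both shard key fields, using MinKey for _id):

              sh.shardCollection("db.coll", { created_date: 1, _id: 1 })

              // Everything created before 2018-01-01 goes to the archive zone.
              sh.addTagRange("db.coll",
                             { created_date: MinKey, _id: MinKey },
                             { created_date: ISODate("2018-01-01"), _id: MinKey },
                             "archive1")

              // A targeted read now needs both shard key fields:
              db.coll.find({ created_date: ISODate("2018-07-28"), _id: 12345 })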











          • The main issue is that you need two conflicting features in a single deployment. You need it to be random half the time, while you need it to be ordered the other half of the time. I don't think a single field can achieve this. To solve this cleanly, you would need at least 2 fields, one to provide the order, and the other to provide the randomness. I'm afraid if you use complicated measures, it will be prone to breakage.

            – Kevin Adistambha
            Aug 6 '18 at 23:07










