Transaction terminated running SELECT on secondary AG group












1















Using SQL Server 2016, Always-On



Running a simple SELECT on a table within a secondary Availability Group.



select distinct some_column 
from some_table oo (nolock)
inner join some_other_ ss (nolock) on ss.some_column= oo.some_other_column


SSMS eventually sends back this error:




Msg 3948, Level 16, State 2, Line 14061 The transaction was terminated
because of the availability replica config/state change or because
ghost records are being deleted on the primary and the secondary
availability replica that might be needed by queries running under
snapshot isolation. Retry the transaction.




The same SELECT with (NOLOCK) works fine on the primary AG.



With or without the NOLOCK it fails most of the time 9/10 and sometimes executes correctly but that's rare. No changes were made to the AG at all.



There are. no changes in the AlwaysOn_Health extended event session.



The low_water_mark_for_ghosts in sys.dm_hadr_database_replica_states is NULL for some and for others there is a value. For those with a value, this is also the primary.



Using the script provided by Nic:



;
WITH PrimaryStats
AS ( SELECT DB_NAME(database_id) AS DatabaseName ,
low_water_mark_for_ghosts ,
ar.replica_server_name ,
ar.availability_mode_desc
FROM sys.dm_hadr_database_replica_states hdrs
JOIN sys.availability_replicas ar ON hdrs.replica_id = ar.replica_id
WHERE ar.replica_server_name = @@SERVERNAME
)
SELECT DB_NAME(database_id) AS DatabaseName ,
hdrs.low_water_mark_for_ghosts AS LowWaterMarkSecondaryReplica,
ps.low_water_mark_for_ghosts AS LowWaterMarkLocalReplica,
ps.low_water_mark_for_ghosts - hdrs.low_water_mark_for_ghosts AS GhostWatermarkDiff,
ar.replica_server_name AS ReplicaNode,
DATEDIFF(SECOND, last_redone_time, GETDATE()) AS RedoDiffSec,
last_redone_time
FROM sys.dm_hadr_database_replica_states hdrs
JOIN sys.availability_replicas ar ON hdrs.replica_id = ar.replica_id
JOIN PrimaryStats ps ON ps.DatabaseName = DB_NAME(database_id)
ORDER BY
DatabaseName ASC,'NODE ' + right(ar.replica_server_name ,1);


The output is:



enter image description here



When the LowWaterMarkSecondaryReplica is NULL, its because that's the instance I'm currently executing this query from.










share|improve this question





























    1















    Using SQL Server 2016, Always-On



    Running a simple SELECT on a table within a secondary Availability Group.



    select distinct some_column 
    from some_table oo (nolock)
    inner join some_other_ ss (nolock) on ss.some_column= oo.some_other_column


    SSMS eventually sends back this error:




    Msg 3948, Level 16, State 2, Line 14061 The transaction was terminated
    because of the availability replica config/state change or because
    ghost records are being deleted on the primary and the secondary
    availability replica that might be needed by queries running under
    snapshot isolation. Retry the transaction.




    The same SELECT with (NOLOCK) works fine on the primary AG.



    With or without the NOLOCK it fails most of the time 9/10 and sometimes executes correctly but that's rare. No changes were made to the AG at all.



    There are. no changes in the AlwaysOn_Health extended event session.



    The low_water_mark_for_ghosts in sys.dm_hadr_database_replica_states is NULL for some and for others there is a value. For those with a value, this is also the primary.



    Using the script provided by Nic:



    ;
    WITH PrimaryStats
    AS ( SELECT DB_NAME(database_id) AS DatabaseName ,
    low_water_mark_for_ghosts ,
    ar.replica_server_name ,
    ar.availability_mode_desc
    FROM sys.dm_hadr_database_replica_states hdrs
    JOIN sys.availability_replicas ar ON hdrs.replica_id = ar.replica_id
    WHERE ar.replica_server_name = @@SERVERNAME
    )
    SELECT DB_NAME(database_id) AS DatabaseName ,
    hdrs.low_water_mark_for_ghosts AS LowWaterMarkSecondaryReplica,
    ps.low_water_mark_for_ghosts AS LowWaterMarkLocalReplica,
    ps.low_water_mark_for_ghosts - hdrs.low_water_mark_for_ghosts AS GhostWatermarkDiff,
    ar.replica_server_name AS ReplicaNode,
    DATEDIFF(SECOND, last_redone_time, GETDATE()) AS RedoDiffSec,
    last_redone_time
    FROM sys.dm_hadr_database_replica_states hdrs
    JOIN sys.availability_replicas ar ON hdrs.replica_id = ar.replica_id
    JOIN PrimaryStats ps ON ps.DatabaseName = DB_NAME(database_id)
    ORDER BY
    DatabaseName ASC,'NODE ' + right(ar.replica_server_name ,1);


    The output is:



    enter image description here



    When the LowWaterMarkSecondaryReplica is NULL, its because that's the instance I'm currently executing this query from.










    share|improve this question



























      1












      1








      1








      Using SQL Server 2016, Always-On



      Running a simple SELECT on a table within a secondary Availability Group.



      select distinct some_column 
      from some_table oo (nolock)
      inner join some_other_ ss (nolock) on ss.some_column= oo.some_other_column


      SSMS eventually sends back this error:




      Msg 3948, Level 16, State 2, Line 14061 The transaction was terminated
      because of the availability replica config/state change or because
      ghost records are being deleted on the primary and the secondary
      availability replica that might be needed by queries running under
      snapshot isolation. Retry the transaction.




      The same SELECT with (NOLOCK) works fine on the primary AG.



      With or without the NOLOCK it fails most of the time 9/10 and sometimes executes correctly but that's rare. No changes were made to the AG at all.



      There are. no changes in the AlwaysOn_Health extended event session.



      The low_water_mark_for_ghosts in sys.dm_hadr_database_replica_states is NULL for some and for others there is a value. For those with a value, this is also the primary.



      Using the script provided by Nic:



      ;
      WITH PrimaryStats
      AS ( SELECT DB_NAME(database_id) AS DatabaseName ,
      low_water_mark_for_ghosts ,
      ar.replica_server_name ,
      ar.availability_mode_desc
      FROM sys.dm_hadr_database_replica_states hdrs
      JOIN sys.availability_replicas ar ON hdrs.replica_id = ar.replica_id
      WHERE ar.replica_server_name = @@SERVERNAME
      )
      SELECT DB_NAME(database_id) AS DatabaseName ,
      hdrs.low_water_mark_for_ghosts AS LowWaterMarkSecondaryReplica,
      ps.low_water_mark_for_ghosts AS LowWaterMarkLocalReplica,
      ps.low_water_mark_for_ghosts - hdrs.low_water_mark_for_ghosts AS GhostWatermarkDiff,
      ar.replica_server_name AS ReplicaNode,
      DATEDIFF(SECOND, last_redone_time, GETDATE()) AS RedoDiffSec,
      last_redone_time
      FROM sys.dm_hadr_database_replica_states hdrs
      JOIN sys.availability_replicas ar ON hdrs.replica_id = ar.replica_id
      JOIN PrimaryStats ps ON ps.DatabaseName = DB_NAME(database_id)
      ORDER BY
      DatabaseName ASC,'NODE ' + right(ar.replica_server_name ,1);


      The output is:



      enter image description here



      When the LowWaterMarkSecondaryReplica is NULL, its because that's the instance I'm currently executing this query from.










      share|improve this question
















      Using SQL Server 2016, Always-On



      Running a simple SELECT on a table within a secondary Availability Group.



      select distinct some_column 
      from some_table oo (nolock)
      inner join some_other_ ss (nolock) on ss.some_column= oo.some_other_column


      SSMS eventually sends back this error:




      Msg 3948, Level 16, State 2, Line 14061 The transaction was terminated
      because of the availability replica config/state change or because
      ghost records are being deleted on the primary and the secondary
      availability replica that might be needed by queries running under
      snapshot isolation. Retry the transaction.




      The same SELECT with (NOLOCK) works fine on the primary AG.



      With or without the NOLOCK it fails most of the time 9/10 and sometimes executes correctly but that's rare. No changes were made to the AG at all.



      There are. no changes in the AlwaysOn_Health extended event session.



      The low_water_mark_for_ghosts in sys.dm_hadr_database_replica_states is NULL for some and for others there is a value. For those with a value, this is also the primary.



      Using the script provided by Nic:



      ;
      WITH PrimaryStats
      AS ( SELECT DB_NAME(database_id) AS DatabaseName ,
      low_water_mark_for_ghosts ,
      ar.replica_server_name ,
      ar.availability_mode_desc
      FROM sys.dm_hadr_database_replica_states hdrs
      JOIN sys.availability_replicas ar ON hdrs.replica_id = ar.replica_id
      WHERE ar.replica_server_name = @@SERVERNAME
      )
      SELECT DB_NAME(database_id) AS DatabaseName ,
      hdrs.low_water_mark_for_ghosts AS LowWaterMarkSecondaryReplica,
      ps.low_water_mark_for_ghosts AS LowWaterMarkLocalReplica,
      ps.low_water_mark_for_ghosts - hdrs.low_water_mark_for_ghosts AS GhostWatermarkDiff,
      ar.replica_server_name AS ReplicaNode,
      DATEDIFF(SECOND, last_redone_time, GETDATE()) AS RedoDiffSec,
      last_redone_time
      FROM sys.dm_hadr_database_replica_states hdrs
      JOIN sys.availability_replicas ar ON hdrs.replica_id = ar.replica_id
      JOIN PrimaryStats ps ON ps.DatabaseName = DB_NAME(database_id)
      ORDER BY
      DatabaseName ASC,'NODE ' + right(ar.replica_server_name ,1);


      The output is:



      enter image description here



      When the LowWaterMarkSecondaryReplica is NULL, its because that's the instance I'm currently executing this query from.







      sql-server sql-server-2016 availability-groups ghost-cleanup






      share|improve this question















      share|improve this question













      share|improve this question




      share|improve this question








      edited 2 mins ago









      Paul White

      50.9k14278448




      50.9k14278448










      asked Nov 16 '16 at 13:58









      Craig EfreinCraig Efrein

      7,21284080




      7,21284080






















          1 Answer
          1






          active

          oldest

          votes


















          2














          This is caused by one or two things, much like the error says:




          1. A config change was made to the AG

          2. The primary restarted

          3. The ghost cleanup low watermark was set


          The first two are fairly obvious as to why this would be an issue. The 3rd is not.



          Since secondary replicas automagically remap read-committed isolation level to the snapshot isolation level a few things need to kick in. One of those things is row versioning, since after all SI is an optimistic concurrency.



          Row versioning depends on the version store rows to be available on the secondary. If the low water for ghost cleanup is changed, then the secondary may need to cleanup records that are currently used for queries being satisfied by the version store which have a timestamp lower than the new ghost cleanup low watermark. When this is the case, those versions will be cleaned up and any SI datasets that required them will no longer be valid. That's when this error will surface for the session for that dataset.



          select distinct some_column 
          from some_table oo (nolock)
          inner join some_other_ ss (nolock) on ss.some_column= oo.some_other_column


          If this query runs quickly, I don't see why it should error out like that as I doubt your low watermark is changing that rapidly. Thus, I'm going to assume this takes some time to run. Having the appropriate indexes, not using a long transaction, and up-to-date statistics will help with this query. Additionally, the NOLOCK hint is completely ignored on the readable secondary, you may already know that and it may just be code re-use which is fine but I wanted to point it out anyway.




          We are considering using read-intent to resolve the issue, do you think that will resolve the problem?




          ApplicationIntent is only used for two things in general with AOAGs:




          1. read only routing (ROR)

          2. to access AG Replicas that are set as secondary mode read-intent only.


          This will not solve the underlying problem. IMHO you have two options, make the query run faster (if it isn't already) or open up a support ticket with MS. The reason I say open a ticket is this query should not constantly fail 24/7 and the issue should be looked into.






          share|improve this answer


























          • Hi, you are definitely right about the SI kicking in, so yeah the NOLOCK and ROWLOCK are totally ignored. Since the queries failing with Always On worked perfectly in Replication, its considered a regression so we have to find the answer in Always On and not by changing the query. We are considering using read-intent to resolve the issue, do you think that will resolve the problem?

            – Craig Efrein
            Nov 18 '16 at 12:47











          Your Answer








          StackExchange.ready(function() {
          var channelOptions = {
          tags: "".split(" "),
          id: "182"
          };
          initTagRenderer("".split(" "), "".split(" "), channelOptions);

          StackExchange.using("externalEditor", function() {
          // Have to fire editor after snippets, if snippets enabled
          if (StackExchange.settings.snippets.snippetsEnabled) {
          StackExchange.using("snippets", function() {
          createEditor();
          });
          }
          else {
          createEditor();
          }
          });

          function createEditor() {
          StackExchange.prepareEditor({
          heartbeatType: 'answer',
          autoActivateHeartbeat: false,
          convertImagesToLinks: false,
          noModals: true,
          showLowRepImageUploadWarning: true,
          reputationToPostImages: null,
          bindNavPrevention: true,
          postfix: "",
          imageUploader: {
          brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
          contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
          allowUrls: true
          },
          onDemand: true,
          discardSelector: ".discard-answer"
          ,immediatelyShowMarkdownHelp:true
          });


          }
          });














          draft saved

          draft discarded


















          StackExchange.ready(
          function () {
          StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fdba.stackexchange.com%2fquestions%2f155462%2ftransaction-terminated-running-select-on-secondary-ag-group%23new-answer', 'question_page');
          }
          );

          Post as a guest















          Required, but never shown

























          1 Answer
          1






          active

          oldest

          votes








          1 Answer
          1






          active

          oldest

          votes









          active

          oldest

          votes






          active

          oldest

          votes









          2














          This is caused by one or two things, much like the error says:




          1. A config change was made to the AG

          2. The primary restarted

          3. The ghost cleanup low watermark was set


          The first two are fairly obvious as to why this would be an issue. The 3rd is not.



          Since secondary replicas automagically remap read-committed isolation level to the snapshot isolation level a few things need to kick in. One of those things is row versioning, since after all SI is an optimistic concurrency.



          Row versioning depends on the version store rows to be available on the secondary. If the low water for ghost cleanup is changed, then the secondary may need to cleanup records that are currently used for queries being satisfied by the version store which have a timestamp lower than the new ghost cleanup low watermark. When this is the case, those versions will be cleaned up and any SI datasets that required them will no longer be valid. That's when this error will surface for the session for that dataset.



          select distinct some_column 
          from some_table oo (nolock)
          inner join some_other_ ss (nolock) on ss.some_column= oo.some_other_column


          If this query runs quickly, I don't see why it should error out like that as I doubt your low watermark is changing that rapidly. Thus, I'm going to assume this takes some time to run. Having the appropriate indexes, not using a long transaction, and up-to-date statistics will help with this query. Additionally, the NOLOCK hint is completely ignored on the readable secondary, you may already know that and it may just be code re-use which is fine but I wanted to point it out anyway.




          We are considering using read-intent to resolve the issue, do you think that will resolve the problem?




          ApplicationIntent is only used for two things in general with AOAGs:




          1. read only routing (ROR)

          2. to access AG Replicas that are set as secondary mode read-intent only.


          This will not solve the underlying problem. IMHO you have two options, make the query run faster (if it isn't already) or open up a support ticket with MS. The reason I say open a ticket is this query should not constantly fail 24/7 and the issue should be looked into.






          share|improve this answer


























          • Hi, you are definitely right about the SI kicking in, so yeah the NOLOCK and ROWLOCK are totally ignored. Since the queries failing with Always On worked perfectly in Replication, its considered a regression so we have to find the answer in Always On and not by changing the query. We are considering using read-intent to resolve the issue, do you think that will resolve the problem?

            – Craig Efrein
            Nov 18 '16 at 12:47
















          2














          This is caused by one or two things, much like the error says:




          1. A config change was made to the AG

          2. The primary restarted

          3. The ghost cleanup low watermark was set


          The first two are fairly obvious as to why this would be an issue. The 3rd is not.



          Since secondary replicas automagically remap read-committed isolation level to the snapshot isolation level a few things need to kick in. One of those things is row versioning, since after all SI is an optimistic concurrency.



          Row versioning depends on the version store rows to be available on the secondary. If the low water for ghost cleanup is changed, then the secondary may need to cleanup records that are currently used for queries being satisfied by the version store which have a timestamp lower than the new ghost cleanup low watermark. When this is the case, those versions will be cleaned up and any SI datasets that required them will no longer be valid. That's when this error will surface for the session for that dataset.



          select distinct some_column 
          from some_table oo (nolock)
          inner join some_other_ ss (nolock) on ss.some_column= oo.some_other_column


          If this query runs quickly, I don't see why it should error out like that as I doubt your low watermark is changing that rapidly. Thus, I'm going to assume this takes some time to run. Having the appropriate indexes, not using a long transaction, and up-to-date statistics will help with this query. Additionally, the NOLOCK hint is completely ignored on the readable secondary, you may already know that and it may just be code re-use which is fine but I wanted to point it out anyway.




          We are considering using read-intent to resolve the issue, do you think that will resolve the problem?




          ApplicationIntent is only used for two things in general with AOAGs:




          1. read only routing (ROR)

          2. to access AG Replicas that are set as secondary mode read-intent only.


          This will not solve the underlying problem. IMHO you have two options, make the query run faster (if it isn't already) or open up a support ticket with MS. The reason I say open a ticket is this query should not constantly fail 24/7 and the issue should be looked into.






          share|improve this answer


























          • Hi, you are definitely right about the SI kicking in, so yeah the NOLOCK and ROWLOCK are totally ignored. Since the queries failing with Always On worked perfectly in Replication, its considered a regression so we have to find the answer in Always On and not by changing the query. We are considering using read-intent to resolve the issue, do you think that will resolve the problem?

            – Craig Efrein
            Nov 18 '16 at 12:47














          2












          2








          2







          This is caused by one or two things, much like the error says:




          1. A config change was made to the AG

          2. The primary restarted

          3. The ghost cleanup low watermark was set


          The first two are fairly obvious as to why this would be an issue. The 3rd is not.



          Since secondary replicas automagically remap read-committed isolation level to the snapshot isolation level a few things need to kick in. One of those things is row versioning, since after all SI is an optimistic concurrency.



          Row versioning depends on the version store rows to be available on the secondary. If the low water for ghost cleanup is changed, then the secondary may need to cleanup records that are currently used for queries being satisfied by the version store which have a timestamp lower than the new ghost cleanup low watermark. When this is the case, those versions will be cleaned up and any SI datasets that required them will no longer be valid. That's when this error will surface for the session for that dataset.



          select distinct some_column 
          from some_table oo (nolock)
          inner join some_other_ ss (nolock) on ss.some_column= oo.some_other_column


          If this query runs quickly, I don't see why it should error out like that as I doubt your low watermark is changing that rapidly. Thus, I'm going to assume this takes some time to run. Having the appropriate indexes, not using a long transaction, and up-to-date statistics will help with this query. Additionally, the NOLOCK hint is completely ignored on the readable secondary, you may already know that and it may just be code re-use which is fine but I wanted to point it out anyway.




          We are considering using read-intent to resolve the issue, do you think that will resolve the problem?




          ApplicationIntent is only used for two things in general with AOAGs:




          1. read only routing (ROR)

          2. to access AG Replicas that are set as secondary mode read-intent only.


          This will not solve the underlying problem. IMHO you have two options, make the query run faster (if it isn't already) or open up a support ticket with MS. The reason I say open a ticket is this query should not constantly fail 24/7 and the issue should be looked into.






          share|improve this answer















          This is caused by one or two things, much like the error says:




          1. A config change was made to the AG

          2. The primary restarted

          3. The ghost cleanup low watermark was set


          The first two are fairly obvious as to why this would be an issue. The 3rd is not.



          Since secondary replicas automagically remap read-committed isolation level to the snapshot isolation level a few things need to kick in. One of those things is row versioning, since after all SI is an optimistic concurrency.



          Row versioning depends on the version store rows to be available on the secondary. If the low water for ghost cleanup is changed, then the secondary may need to cleanup records that are currently used for queries being satisfied by the version store which have a timestamp lower than the new ghost cleanup low watermark. When this is the case, those versions will be cleaned up and any SI datasets that required them will no longer be valid. That's when this error will surface for the session for that dataset.



          select distinct some_column 
          from some_table oo (nolock)
          inner join some_other_ ss (nolock) on ss.some_column= oo.some_other_column


          If this query runs quickly, I don't see why it should error out like that as I doubt your low watermark is changing that rapidly. Thus, I'm going to assume this takes some time to run. Having the appropriate indexes, not using a long transaction, and up-to-date statistics will help with this query. Additionally, the NOLOCK hint is completely ignored on the readable secondary, you may already know that and it may just be code re-use which is fine but I wanted to point it out anyway.




          We are considering using read-intent to resolve the issue, do you think that will resolve the problem?




          ApplicationIntent is only used for two things in general with AOAGs:




          1. read only routing (ROR)

          2. to access AG Replicas that are set as secondary mode read-intent only.


          This will not solve the underlying problem. IMHO you have two options, make the query run faster (if it isn't already) or open up a support ticket with MS. The reason I say open a ticket is this query should not constantly fail 24/7 and the issue should be looked into.







          share|improve this answer














          share|improve this answer



          share|improve this answer








          edited 8 mins ago









          Paul White

          50.9k14278448




          50.9k14278448










          answered Nov 17 '16 at 19:52









          Sean GallardySean Gallardy

          15.8k22548




          15.8k22548













          • Hi, you are definitely right about the SI kicking in, so yeah the NOLOCK and ROWLOCK are totally ignored. Since the queries failing with Always On worked perfectly in Replication, its considered a regression so we have to find the answer in Always On and not by changing the query. We are considering using read-intent to resolve the issue, do you think that will resolve the problem?

            – Craig Efrein
            Nov 18 '16 at 12:47



















          • Hi, you are definitely right about the SI kicking in, so yeah the NOLOCK and ROWLOCK are totally ignored. Since the queries failing with Always On worked perfectly in Replication, its considered a regression so we have to find the answer in Always On and not by changing the query. We are considering using read-intent to resolve the issue, do you think that will resolve the problem?

            – Craig Efrein
            Nov 18 '16 at 12:47

















          Hi, you are definitely right about the SI kicking in, so yeah the NOLOCK and ROWLOCK are totally ignored. Since the queries failing with Always On worked perfectly in Replication, its considered a regression so we have to find the answer in Always On and not by changing the query. We are considering using read-intent to resolve the issue, do you think that will resolve the problem?

          – Craig Efrein
          Nov 18 '16 at 12:47





          Hi, you are definitely right about the SI kicking in, so yeah the NOLOCK and ROWLOCK are totally ignored. Since the queries failing with Always On worked perfectly in Replication, its considered a regression so we have to find the answer in Always On and not by changing the query. We are considering using read-intent to resolve the issue, do you think that will resolve the problem?

          – Craig Efrein
          Nov 18 '16 at 12:47


















          draft saved

          draft discarded




















































          Thanks for contributing an answer to Database Administrators Stack Exchange!


          • Please be sure to answer the question. Provide details and share your research!

          But avoid



          • Asking for help, clarification, or responding to other answers.

          • Making statements based on opinion; back them up with references or personal experience.


          To learn more, see our tips on writing great answers.




          draft saved


          draft discarded














          StackExchange.ready(
          function () {
          StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fdba.stackexchange.com%2fquestions%2f155462%2ftransaction-terminated-running-select-on-secondary-ag-group%23new-answer', 'question_page');
          }
          );

          Post as a guest















          Required, but never shown





















































          Required, but never shown














          Required, but never shown












          Required, but never shown







          Required, but never shown

































          Required, but never shown














          Required, but never shown












          Required, but never shown







          Required, but never shown







          Popular posts from this blog

          SQL Server 17 - Attemping to backup to remote NAS but Access is denied

          Always On Availability groups resolving state after failover - Remote harden of transaction...

          Restoring from pg_dump with foreign key constraints