Always On Availability groups resolving state after failover - Remote harden of transaction...












0















My explanation will be opinion based as it is based on my findings. If more information is needed, I am always happy to add it.



The setup



We have 4 SQL Server 2016 Instances, and all 4 underlying servers are in a WSFC (In my eyes a strange setup since Always On AG's are 2 by 2. But that is a different question.)



enter image description here



4 nodes + Fileshare witness



Version: 13.0.5026.0 (all nodes)



The instances that are having the failover issues are instances 1 & 2.
They have 3 AG's between them. AG1 is primary on instance 1 and is having the issue. AG2 and AG3 are primary on instance 2.



Instances 3 and 4 have their own separate AG's (2).



AG1's settings



enter image description here



The week before



Last week, after checking for long running queries (none) and the synchronization state (green accross the board) we initiated a failover on AG1 from instance1 (primary) to instance2 (secondary).



The failover failed, leaving both the former primary (instance1) and the former secondary (instance2) in a temporary resolving state. Two databases on the former primary even showed Reverting / In Recovery for a few minutes. In the error log, a message was shown on many occasions:



enter image description here



The databases where not accessible during the outage, the outage took about 10 minutes total.



Before failing over I checked:




  • Current running queries (no long running queries)

  • DBCC OPENTRAN

  • Always On dashboard (Green accross the board)


We found a post by microsoft to enable a traceflag, 3923. We enabled this one on both instances after the outage.



Take #2, today



We tried testing the failover again to see if the issues where resolved.



Spoiler alert, they where not resolved.



enter image description here



Remote harden of transaction 'GhostCleanupTask' (ID 0x00000000141b9d97 0000:000cad62) started at Feb 19 2019 11:17AM in database 'DB1' at LSN (102:23518:2) failed.

Remote harden of transaction 'GhostCleanupTask' (ID 0x00000000141b9d97 0000:000cad62) started at Feb 19 2019 11:17AM in database 'DB1' at LSN (102:23518:2) failed.

Remote harden of transaction 'GhostCleanupTask' (ID 0x00000000141b9da2 0000:078d645c) started at Feb 19 2019 11:17AM in database 'DB2' at LSN (5909:54307:137) failed.

Remote harden of transaction 'GhostCleanupTask' (ID 0x00000000141b9dee 0000:0017e40a) started at Feb 19 2019 11:17AM in database 'DB1' at LSN (164:21769:2) failed.

Remote harden of transaction 'GhostCleanupTask' (ID 0x00000000141ba1a8 0000:0024e912) started at Feb 19 2019 11:17AM in database 'DB3' at LSN (1246:7285:2) failed.


Again, the same error messages where shown.



The actual question(s)




  1. The beginning question is, am i right to assume that the failover
    not happening is due to the GhostCleanupTask not being able to
    harden the transaction log, or is it a consequence of the failure
    and not the cause.

  2. If it is the cause, what could be done to resolve this?




Something I am looking into, since the problem does not seem to be due to any long running queries rolling back, that the secondary databases are in an inconsistent state? But they show up as Synchronized.



Extra info



The only interesting message in the clusterlog from last week appeared to be this one:



 mscs::TopologyPersister::TryGetNetworkPrivateProperties: (2)' because of 'OpenSubKey failed.'


Failover cluster events




Cluster resource 'AG1' of type 'SQL Server Availability Group' in
clustered role 'AG1' failed.



Based on the failure policies for the resource and role, the cluster
service may try to bring the resource online on this node or move the
group to another node of the cluster and then restart it. Check the
resource and group state using Failover Cluster Manager or the
Get-ClusterResource Windows PowerShell cmdlet.



The Cluster service failed to bring clustered role 'AG_SP_2016_New'
completely online or offline. One or more resources may be in a failed
state. This may impact the availability of the clustered role.











share|improve this question

























  • Just to cover the obvious, the secondary is set to synchronous commit and is synchronised at the time of failover?

    – George.Palacios
    7 mins ago











  • @George.Palacios Added it, Db's where synchronized and are set to sync commit.

    – Randi Vertongen
    6 mins ago


















0















My explanation will be opinion based as it is based on my findings. If more information is needed, I am always happy to add it.



The setup



We have 4 SQL Server 2016 Instances, and all 4 underlying servers are in a WSFC (In my eyes a strange setup since Always On AG's are 2 by 2. But that is a different question.)



enter image description here



4 nodes + Fileshare witness



Version: 13.0.5026.0 (all nodes)



The instances that are having the failover issues are instances 1 & 2.
They have 3 AG's between them. AG1 is primary on instance 1 and is having the issue. AG2 and AG3 are primary on instance 2.



Instances 3 and 4 have their own separate AG's (2).



AG1's settings



enter image description here



The week before



Last week, after checking for long running queries (none) and the synchronization state (green accross the board) we initiated a failover on AG1 from instance1 (primary) to instance2 (secondary).



The failover failed, leaving both the former primary (instance1) and the former secondary (instance2) in a temporary resolving state. Two databases on the former primary even showed Reverting / In Recovery for a few minutes. In the error log, a message was shown on many occasions:



enter image description here



The databases where not accessible during the outage, the outage took about 10 minutes total.



Before failing over I checked:




  • Current running queries (no long running queries)

  • DBCC OPENTRAN

  • Always On dashboard (Green accross the board)


We found a post by microsoft to enable a traceflag, 3923. We enabled this one on both instances after the outage.



Take #2, today



We tried testing the failover again to see if the issues where resolved.



Spoiler alert, they where not resolved.



enter image description here



Remote harden of transaction 'GhostCleanupTask' (ID 0x00000000141b9d97 0000:000cad62) started at Feb 19 2019 11:17AM in database 'DB1' at LSN (102:23518:2) failed.

Remote harden of transaction 'GhostCleanupTask' (ID 0x00000000141b9d97 0000:000cad62) started at Feb 19 2019 11:17AM in database 'DB1' at LSN (102:23518:2) failed.

Remote harden of transaction 'GhostCleanupTask' (ID 0x00000000141b9da2 0000:078d645c) started at Feb 19 2019 11:17AM in database 'DB2' at LSN (5909:54307:137) failed.

Remote harden of transaction 'GhostCleanupTask' (ID 0x00000000141b9dee 0000:0017e40a) started at Feb 19 2019 11:17AM in database 'DB1' at LSN (164:21769:2) failed.

Remote harden of transaction 'GhostCleanupTask' (ID 0x00000000141ba1a8 0000:0024e912) started at Feb 19 2019 11:17AM in database 'DB3' at LSN (1246:7285:2) failed.


Again, the same error messages where shown.



The actual question(s)




  1. The beginning question is, am i right to assume that the failover
    not happening is due to the GhostCleanupTask not being able to
    harden the transaction log, or is it a consequence of the failure
    and not the cause.

  2. If it is the cause, what could be done to resolve this?




Something I am looking into, since the problem does not seem to be due to any long running queries rolling back, that the secondary databases are in an inconsistent state? But they show up as Synchronized.



Extra info



The only interesting message in the clusterlog from last week appeared to be this one:



 mscs::TopologyPersister::TryGetNetworkPrivateProperties: (2)' because of 'OpenSubKey failed.'


Failover cluster events




Cluster resource 'AG1' of type 'SQL Server Availability Group' in
clustered role 'AG1' failed.



Based on the failure policies for the resource and role, the cluster
service may try to bring the resource online on this node or move the
group to another node of the cluster and then restart it. Check the
resource and group state using Failover Cluster Manager or the
Get-ClusterResource Windows PowerShell cmdlet.



The Cluster service failed to bring clustered role 'AG_SP_2016_New'
completely online or offline. One or more resources may be in a failed
state. This may impact the availability of the clustered role.











share|improve this question

























  • Just to cover the obvious, the secondary is set to synchronous commit and is synchronised at the time of failover?

    – George.Palacios
    7 mins ago











  • @George.Palacios Added it, Db's where synchronized and are set to sync commit.

    – Randi Vertongen
    6 mins ago
















0












0








0








My explanation will be opinion based as it is based on my findings. If more information is needed, I am always happy to add it.



The setup



We have 4 SQL Server 2016 Instances, and all 4 underlying servers are in a WSFC (In my eyes a strange setup since Always On AG's are 2 by 2. But that is a different question.)



enter image description here



4 nodes + Fileshare witness



Version: 13.0.5026.0 (all nodes)



The instances that are having the failover issues are instances 1 & 2.
They have 3 AG's between them. AG1 is primary on instance 1 and is having the issue. AG2 and AG3 are primary on instance 2.



Instances 3 and 4 have their own separate AG's (2).



AG1's settings



enter image description here



The week before



Last week, after checking for long running queries (none) and the synchronization state (green accross the board) we initiated a failover on AG1 from instance1 (primary) to instance2 (secondary).



The failover failed, leaving both the former primary (instance1) and the former secondary (instance2) in a temporary resolving state. Two databases on the former primary even showed Reverting / In Recovery for a few minutes. In the error log, a message was shown on many occasions:



enter image description here



The databases where not accessible during the outage, the outage took about 10 minutes total.



Before failing over I checked:




  • Current running queries (no long running queries)

  • DBCC OPENTRAN

  • Always On dashboard (Green accross the board)


We found a post by microsoft to enable a traceflag, 3923. We enabled this one on both instances after the outage.



Take #2, today



We tried testing the failover again to see if the issues where resolved.



Spoiler alert, they where not resolved.



enter image description here



Remote harden of transaction 'GhostCleanupTask' (ID 0x00000000141b9d97 0000:000cad62) started at Feb 19 2019 11:17AM in database 'DB1' at LSN (102:23518:2) failed.

Remote harden of transaction 'GhostCleanupTask' (ID 0x00000000141b9d97 0000:000cad62) started at Feb 19 2019 11:17AM in database 'DB1' at LSN (102:23518:2) failed.

Remote harden of transaction 'GhostCleanupTask' (ID 0x00000000141b9da2 0000:078d645c) started at Feb 19 2019 11:17AM in database 'DB2' at LSN (5909:54307:137) failed.

Remote harden of transaction 'GhostCleanupTask' (ID 0x00000000141b9dee 0000:0017e40a) started at Feb 19 2019 11:17AM in database 'DB1' at LSN (164:21769:2) failed.

Remote harden of transaction 'GhostCleanupTask' (ID 0x00000000141ba1a8 0000:0024e912) started at Feb 19 2019 11:17AM in database 'DB3' at LSN (1246:7285:2) failed.


Again, the same error messages where shown.



The actual question(s)




  1. The beginning question is, am i right to assume that the failover
    not happening is due to the GhostCleanupTask not being able to
    harden the transaction log, or is it a consequence of the failure
    and not the cause.

  2. If it is the cause, what could be done to resolve this?




Something I am looking into, since the problem does not seem to be due to any long running queries rolling back, that the secondary databases are in an inconsistent state? But they show up as Synchronized.



Extra info



The only interesting message in the clusterlog from last week appeared to be this one:



 mscs::TopologyPersister::TryGetNetworkPrivateProperties: (2)' because of 'OpenSubKey failed.'


Failover cluster events




Cluster resource 'AG1' of type 'SQL Server Availability Group' in
clustered role 'AG1' failed.



Based on the failure policies for the resource and role, the cluster
service may try to bring the resource online on this node or move the
group to another node of the cluster and then restart it. Check the
resource and group state using Failover Cluster Manager or the
Get-ClusterResource Windows PowerShell cmdlet.



The Cluster service failed to bring clustered role 'AG_SP_2016_New'
completely online or offline. One or more resources may be in a failed
state. This may impact the availability of the clustered role.











share|improve this question
















My explanation will be opinion based as it is based on my findings. If more information is needed, I am always happy to add it.



The setup



We have 4 SQL Server 2016 Instances, and all 4 underlying servers are in a WSFC (In my eyes a strange setup since Always On AG's are 2 by 2. But that is a different question.)



enter image description here



4 nodes + Fileshare witness



Version: 13.0.5026.0 (all nodes)



The instances that are having the failover issues are instances 1 & 2.
They have 3 AG's between them. AG1 is primary on instance 1 and is having the issue. AG2 and AG3 are primary on instance 2.



Instances 3 and 4 have their own separate AG's (2).



AG1's settings



enter image description here



The week before



Last week, after checking for long running queries (none) and the synchronization state (green accross the board) we initiated a failover on AG1 from instance1 (primary) to instance2 (secondary).



The failover failed, leaving both the former primary (instance1) and the former secondary (instance2) in a temporary resolving state. Two databases on the former primary even showed Reverting / In Recovery for a few minutes. In the error log, a message was shown on many occasions:



enter image description here



The databases where not accessible during the outage, the outage took about 10 minutes total.



Before failing over I checked:




  • Current running queries (no long running queries)

  • DBCC OPENTRAN

  • Always On dashboard (Green accross the board)


We found a post by microsoft to enable a traceflag, 3923. We enabled this one on both instances after the outage.



Take #2, today



We tried testing the failover again to see if the issues where resolved.



Spoiler alert, they where not resolved.



enter image description here



Remote harden of transaction 'GhostCleanupTask' (ID 0x00000000141b9d97 0000:000cad62) started at Feb 19 2019 11:17AM in database 'DB1' at LSN (102:23518:2) failed.

Remote harden of transaction 'GhostCleanupTask' (ID 0x00000000141b9d97 0000:000cad62) started at Feb 19 2019 11:17AM in database 'DB1' at LSN (102:23518:2) failed.

Remote harden of transaction 'GhostCleanupTask' (ID 0x00000000141b9da2 0000:078d645c) started at Feb 19 2019 11:17AM in database 'DB2' at LSN (5909:54307:137) failed.

Remote harden of transaction 'GhostCleanupTask' (ID 0x00000000141b9dee 0000:0017e40a) started at Feb 19 2019 11:17AM in database 'DB1' at LSN (164:21769:2) failed.

Remote harden of transaction 'GhostCleanupTask' (ID 0x00000000141ba1a8 0000:0024e912) started at Feb 19 2019 11:17AM in database 'DB3' at LSN (1246:7285:2) failed.


Again, the same error messages where shown.



The actual question(s)




  1. The beginning question is, am i right to assume that the failover
    not happening is due to the GhostCleanupTask not being able to
    harden the transaction log, or is it a consequence of the failure
    and not the cause.

  2. If it is the cause, what could be done to resolve this?




Something I am looking into, since the problem does not seem to be due to any long running queries rolling back, that the secondary databases are in an inconsistent state? But they show up as Synchronized.



Extra info



The only interesting message in the clusterlog from last week appeared to be this one:



 mscs::TopologyPersister::TryGetNetworkPrivateProperties: (2)' because of 'OpenSubKey failed.'


Failover cluster events




Cluster resource 'AG1' of type 'SQL Server Availability Group' in
clustered role 'AG1' failed.



Based on the failure policies for the resource and role, the cluster
service may try to bring the resource online on this node or move the
group to another node of the cluster and then restart it. Check the
resource and group state using Failover Cluster Manager or the
Get-ClusterResource Windows PowerShell cmdlet.



The Cluster service failed to bring clustered role 'AG_SP_2016_New'
completely online or offline. One or more resources may be in a failed
state. This may impact the availability of the clustered role.








sql-server sql-server-2016 availability-groups failover






share|improve this question















share|improve this question













share|improve this question




share|improve this question








edited just now







Randi Vertongen

















asked 10 mins ago









Randi VertongenRandi Vertongen

2,711721




2,711721













  • Just to cover the obvious, the secondary is set to synchronous commit and is synchronised at the time of failover?

    – George.Palacios
    7 mins ago











  • @George.Palacios Added it, Db's where synchronized and are set to sync commit.

    – Randi Vertongen
    6 mins ago





















  • Just to cover the obvious, the secondary is set to synchronous commit and is synchronised at the time of failover?

    – George.Palacios
    7 mins ago











  • @George.Palacios Added it, Db's where synchronized and are set to sync commit.

    – Randi Vertongen
    6 mins ago



















Just to cover the obvious, the secondary is set to synchronous commit and is synchronised at the time of failover?

– George.Palacios
7 mins ago





Just to cover the obvious, the secondary is set to synchronous commit and is synchronised at the time of failover?

– George.Palacios
7 mins ago













@George.Palacios Added it, Db's where synchronized and are set to sync commit.

– Randi Vertongen
6 mins ago







@George.Palacios Added it, Db's where synchronized and are set to sync commit.

– Randi Vertongen
6 mins ago












0






active

oldest

votes











Your Answer








StackExchange.ready(function() {
var channelOptions = {
tags: "".split(" "),
id: "182"
};
initTagRenderer("".split(" "), "".split(" "), channelOptions);

StackExchange.using("externalEditor", function() {
// Have to fire editor after snippets, if snippets enabled
if (StackExchange.settings.snippets.snippetsEnabled) {
StackExchange.using("snippets", function() {
createEditor();
});
}
else {
createEditor();
}
});

function createEditor() {
StackExchange.prepareEditor({
heartbeatType: 'answer',
autoActivateHeartbeat: false,
convertImagesToLinks: false,
noModals: true,
showLowRepImageUploadWarning: true,
reputationToPostImages: null,
bindNavPrevention: true,
postfix: "",
imageUploader: {
brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
allowUrls: true
},
onDemand: true,
discardSelector: ".discard-answer"
,immediatelyShowMarkdownHelp:true
});


}
});














draft saved

draft discarded


















StackExchange.ready(
function () {
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fdba.stackexchange.com%2fquestions%2f230128%2falways-on-availability-groups-resolving-state-after-failover-remote-harden-of%23new-answer', 'question_page');
}
);

Post as a guest















Required, but never shown

























0






active

oldest

votes








0






active

oldest

votes









active

oldest

votes






active

oldest

votes
















draft saved

draft discarded




















































Thanks for contributing an answer to Database Administrators Stack Exchange!


  • Please be sure to answer the question. Provide details and share your research!

But avoid



  • Asking for help, clarification, or responding to other answers.

  • Making statements based on opinion; back them up with references or personal experience.


To learn more, see our tips on writing great answers.




draft saved


draft discarded














StackExchange.ready(
function () {
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fdba.stackexchange.com%2fquestions%2f230128%2falways-on-availability-groups-resolving-state-after-failover-remote-harden-of%23new-answer', 'question_page');
}
);

Post as a guest















Required, but never shown





















































Required, but never shown














Required, but never shown












Required, but never shown







Required, but never shown

































Required, but never shown














Required, but never shown












Required, but never shown







Required, but never shown







Popular posts from this blog

SQL Server 17 - Attemping to backup to remote NAS but Access is denied

Restoring from pg_dump with foreign key constraints