Always On Availability groups resolving state after failover - Remote harden of transaction...

Multi tool use
My explanation will be opinion based as it is based on my findings. If more information is needed, I am always happy to add it.
The setup
We have 4 SQL Server 2016 Instances, and all 4 underlying servers are in a WSFC (In my eyes a strange setup since Always On AG's are 2 by 2. But that is a different question.)
4 nodes + Fileshare witness
Version: 13.0.5026.0 (all nodes)
The instances that are having the failover issues are instances 1 & 2.
They have 3 AG's between them. AG1 is primary on instance 1 and is having the issue. AG2 and AG3 are primary on instance 2.
Instances 3 and 4 have their own separate AG's (2).
AG1's settings
The week before
Last week, after checking for long running queries (none) and the synchronization state (green accross the board) we initiated a failover on AG1 from instance1 (primary) to instance2 (secondary).
The failover failed, leaving both the former primary (instance1) and the former secondary (instance2) in a temporary resolving state. Two databases on the former primary even showed Reverting / In Recovery for a few minutes. In the error log, a message was shown on many occasions:
The databases where not accessible during the outage, the outage took about 10 minutes total.
Before failing over I checked:
- Current running queries (no long running queries)
DBCC OPENTRAN
- Always On dashboard (Green accross the board)
We found a post by microsoft to enable a traceflag, 3923. We enabled this one on both instances after the outage.
Take #2, today
We tried testing the failover again to see if the issues where resolved.
Spoiler alert, they where not resolved.
Remote harden of transaction 'GhostCleanupTask' (ID 0x00000000141b9d97 0000:000cad62) started at Feb 19 2019 11:17AM in database 'DB1' at LSN (102:23518:2) failed.
Remote harden of transaction 'GhostCleanupTask' (ID 0x00000000141b9d97 0000:000cad62) started at Feb 19 2019 11:17AM in database 'DB1' at LSN (102:23518:2) failed.
Remote harden of transaction 'GhostCleanupTask' (ID 0x00000000141b9da2 0000:078d645c) started at Feb 19 2019 11:17AM in database 'DB2' at LSN (5909:54307:137) failed.
Remote harden of transaction 'GhostCleanupTask' (ID 0x00000000141b9dee 0000:0017e40a) started at Feb 19 2019 11:17AM in database 'DB1' at LSN (164:21769:2) failed.
Remote harden of transaction 'GhostCleanupTask' (ID 0x00000000141ba1a8 0000:0024e912) started at Feb 19 2019 11:17AM in database 'DB3' at LSN (1246:7285:2) failed.
Again, the same error messages where shown.
The actual question(s)
- The beginning question is, am i right to assume that the failover
not happening is due to theGhostCleanupTask
not being able to
harden the transaction log, or is it a consequence of the failure
and not the cause. - If it is the cause, what could be done to resolve this?
Something I am looking into, since the problem does not seem to be due to any long running queries rolling back, that the secondary databases are in an inconsistent state? But they show up as Synchronized.
Extra info
The only interesting message in the clusterlog from last week appeared to be this one:
mscs::TopologyPersister::TryGetNetworkPrivateProperties: (2)' because of 'OpenSubKey failed.'
Failover cluster events
Cluster resource 'AG1' of type 'SQL Server Availability Group' in
clustered role 'AG1' failed.
Based on the failure policies for the resource and role, the cluster
service may try to bring the resource online on this node or move the
group to another node of the cluster and then restart it. Check the
resource and group state using Failover Cluster Manager or the
Get-ClusterResource Windows PowerShell cmdlet.
The Cluster service failed to bring clustered role 'AG_SP_2016_New'
completely online or offline. One or more resources may be in a failed
state. This may impact the availability of the clustered role.
sql-server sql-server-2016 availability-groups failover
add a comment |
My explanation will be opinion based as it is based on my findings. If more information is needed, I am always happy to add it.
The setup
We have 4 SQL Server 2016 Instances, and all 4 underlying servers are in a WSFC (In my eyes a strange setup since Always On AG's are 2 by 2. But that is a different question.)
4 nodes + Fileshare witness
Version: 13.0.5026.0 (all nodes)
The instances that are having the failover issues are instances 1 & 2.
They have 3 AG's between them. AG1 is primary on instance 1 and is having the issue. AG2 and AG3 are primary on instance 2.
Instances 3 and 4 have their own separate AG's (2).
AG1's settings
The week before
Last week, after checking for long running queries (none) and the synchronization state (green accross the board) we initiated a failover on AG1 from instance1 (primary) to instance2 (secondary).
The failover failed, leaving both the former primary (instance1) and the former secondary (instance2) in a temporary resolving state. Two databases on the former primary even showed Reverting / In Recovery for a few minutes. In the error log, a message was shown on many occasions:
The databases where not accessible during the outage, the outage took about 10 minutes total.
Before failing over I checked:
- Current running queries (no long running queries)
DBCC OPENTRAN
- Always On dashboard (Green accross the board)
We found a post by microsoft to enable a traceflag, 3923. We enabled this one on both instances after the outage.
Take #2, today
We tried testing the failover again to see if the issues where resolved.
Spoiler alert, they where not resolved.
Remote harden of transaction 'GhostCleanupTask' (ID 0x00000000141b9d97 0000:000cad62) started at Feb 19 2019 11:17AM in database 'DB1' at LSN (102:23518:2) failed.
Remote harden of transaction 'GhostCleanupTask' (ID 0x00000000141b9d97 0000:000cad62) started at Feb 19 2019 11:17AM in database 'DB1' at LSN (102:23518:2) failed.
Remote harden of transaction 'GhostCleanupTask' (ID 0x00000000141b9da2 0000:078d645c) started at Feb 19 2019 11:17AM in database 'DB2' at LSN (5909:54307:137) failed.
Remote harden of transaction 'GhostCleanupTask' (ID 0x00000000141b9dee 0000:0017e40a) started at Feb 19 2019 11:17AM in database 'DB1' at LSN (164:21769:2) failed.
Remote harden of transaction 'GhostCleanupTask' (ID 0x00000000141ba1a8 0000:0024e912) started at Feb 19 2019 11:17AM in database 'DB3' at LSN (1246:7285:2) failed.
Again, the same error messages where shown.
The actual question(s)
- The beginning question is, am i right to assume that the failover
not happening is due to theGhostCleanupTask
not being able to
harden the transaction log, or is it a consequence of the failure
and not the cause. - If it is the cause, what could be done to resolve this?
Something I am looking into, since the problem does not seem to be due to any long running queries rolling back, that the secondary databases are in an inconsistent state? But they show up as Synchronized.
Extra info
The only interesting message in the clusterlog from last week appeared to be this one:
mscs::TopologyPersister::TryGetNetworkPrivateProperties: (2)' because of 'OpenSubKey failed.'
Failover cluster events
Cluster resource 'AG1' of type 'SQL Server Availability Group' in
clustered role 'AG1' failed.
Based on the failure policies for the resource and role, the cluster
service may try to bring the resource online on this node or move the
group to another node of the cluster and then restart it. Check the
resource and group state using Failover Cluster Manager or the
Get-ClusterResource Windows PowerShell cmdlet.
The Cluster service failed to bring clustered role 'AG_SP_2016_New'
completely online or offline. One or more resources may be in a failed
state. This may impact the availability of the clustered role.
sql-server sql-server-2016 availability-groups failover
Just to cover the obvious, the secondary is set to synchronous commit and is synchronised at the time of failover?
– George.Palacios
7 mins ago
@George.Palacios Added it, Db's where synchronized and are set to sync commit.
– Randi Vertongen
6 mins ago
add a comment |
My explanation will be opinion based as it is based on my findings. If more information is needed, I am always happy to add it.
The setup
We have 4 SQL Server 2016 Instances, and all 4 underlying servers are in a WSFC (In my eyes a strange setup since Always On AG's are 2 by 2. But that is a different question.)
4 nodes + Fileshare witness
Version: 13.0.5026.0 (all nodes)
The instances that are having the failover issues are instances 1 & 2.
They have 3 AG's between them. AG1 is primary on instance 1 and is having the issue. AG2 and AG3 are primary on instance 2.
Instances 3 and 4 have their own separate AG's (2).
AG1's settings
The week before
Last week, after checking for long running queries (none) and the synchronization state (green accross the board) we initiated a failover on AG1 from instance1 (primary) to instance2 (secondary).
The failover failed, leaving both the former primary (instance1) and the former secondary (instance2) in a temporary resolving state. Two databases on the former primary even showed Reverting / In Recovery for a few minutes. In the error log, a message was shown on many occasions:
The databases where not accessible during the outage, the outage took about 10 minutes total.
Before failing over I checked:
- Current running queries (no long running queries)
DBCC OPENTRAN
- Always On dashboard (Green accross the board)
We found a post by microsoft to enable a traceflag, 3923. We enabled this one on both instances after the outage.
Take #2, today
We tried testing the failover again to see if the issues where resolved.
Spoiler alert, they where not resolved.
Remote harden of transaction 'GhostCleanupTask' (ID 0x00000000141b9d97 0000:000cad62) started at Feb 19 2019 11:17AM in database 'DB1' at LSN (102:23518:2) failed.
Remote harden of transaction 'GhostCleanupTask' (ID 0x00000000141b9d97 0000:000cad62) started at Feb 19 2019 11:17AM in database 'DB1' at LSN (102:23518:2) failed.
Remote harden of transaction 'GhostCleanupTask' (ID 0x00000000141b9da2 0000:078d645c) started at Feb 19 2019 11:17AM in database 'DB2' at LSN (5909:54307:137) failed.
Remote harden of transaction 'GhostCleanupTask' (ID 0x00000000141b9dee 0000:0017e40a) started at Feb 19 2019 11:17AM in database 'DB1' at LSN (164:21769:2) failed.
Remote harden of transaction 'GhostCleanupTask' (ID 0x00000000141ba1a8 0000:0024e912) started at Feb 19 2019 11:17AM in database 'DB3' at LSN (1246:7285:2) failed.
Again, the same error messages where shown.
The actual question(s)
- The beginning question is, am i right to assume that the failover
not happening is due to theGhostCleanupTask
not being able to
harden the transaction log, or is it a consequence of the failure
and not the cause. - If it is the cause, what could be done to resolve this?
Something I am looking into, since the problem does not seem to be due to any long running queries rolling back, that the secondary databases are in an inconsistent state? But they show up as Synchronized.
Extra info
The only interesting message in the clusterlog from last week appeared to be this one:
mscs::TopologyPersister::TryGetNetworkPrivateProperties: (2)' because of 'OpenSubKey failed.'
Failover cluster events
Cluster resource 'AG1' of type 'SQL Server Availability Group' in
clustered role 'AG1' failed.
Based on the failure policies for the resource and role, the cluster
service may try to bring the resource online on this node or move the
group to another node of the cluster and then restart it. Check the
resource and group state using Failover Cluster Manager or the
Get-ClusterResource Windows PowerShell cmdlet.
The Cluster service failed to bring clustered role 'AG_SP_2016_New'
completely online or offline. One or more resources may be in a failed
state. This may impact the availability of the clustered role.
sql-server sql-server-2016 availability-groups failover
My explanation will be opinion based as it is based on my findings. If more information is needed, I am always happy to add it.
The setup
We have 4 SQL Server 2016 Instances, and all 4 underlying servers are in a WSFC (In my eyes a strange setup since Always On AG's are 2 by 2. But that is a different question.)
4 nodes + Fileshare witness
Version: 13.0.5026.0 (all nodes)
The instances that are having the failover issues are instances 1 & 2.
They have 3 AG's between them. AG1 is primary on instance 1 and is having the issue. AG2 and AG3 are primary on instance 2.
Instances 3 and 4 have their own separate AG's (2).
AG1's settings
The week before
Last week, after checking for long running queries (none) and the synchronization state (green accross the board) we initiated a failover on AG1 from instance1 (primary) to instance2 (secondary).
The failover failed, leaving both the former primary (instance1) and the former secondary (instance2) in a temporary resolving state. Two databases on the former primary even showed Reverting / In Recovery for a few minutes. In the error log, a message was shown on many occasions:
The databases where not accessible during the outage, the outage took about 10 minutes total.
Before failing over I checked:
- Current running queries (no long running queries)
DBCC OPENTRAN
- Always On dashboard (Green accross the board)
We found a post by microsoft to enable a traceflag, 3923. We enabled this one on both instances after the outage.
Take #2, today
We tried testing the failover again to see if the issues where resolved.
Spoiler alert, they where not resolved.
Remote harden of transaction 'GhostCleanupTask' (ID 0x00000000141b9d97 0000:000cad62) started at Feb 19 2019 11:17AM in database 'DB1' at LSN (102:23518:2) failed.
Remote harden of transaction 'GhostCleanupTask' (ID 0x00000000141b9d97 0000:000cad62) started at Feb 19 2019 11:17AM in database 'DB1' at LSN (102:23518:2) failed.
Remote harden of transaction 'GhostCleanupTask' (ID 0x00000000141b9da2 0000:078d645c) started at Feb 19 2019 11:17AM in database 'DB2' at LSN (5909:54307:137) failed.
Remote harden of transaction 'GhostCleanupTask' (ID 0x00000000141b9dee 0000:0017e40a) started at Feb 19 2019 11:17AM in database 'DB1' at LSN (164:21769:2) failed.
Remote harden of transaction 'GhostCleanupTask' (ID 0x00000000141ba1a8 0000:0024e912) started at Feb 19 2019 11:17AM in database 'DB3' at LSN (1246:7285:2) failed.
Again, the same error messages where shown.
The actual question(s)
- The beginning question is, am i right to assume that the failover
not happening is due to theGhostCleanupTask
not being able to
harden the transaction log, or is it a consequence of the failure
and not the cause. - If it is the cause, what could be done to resolve this?
Something I am looking into, since the problem does not seem to be due to any long running queries rolling back, that the secondary databases are in an inconsistent state? But they show up as Synchronized.
Extra info
The only interesting message in the clusterlog from last week appeared to be this one:
mscs::TopologyPersister::TryGetNetworkPrivateProperties: (2)' because of 'OpenSubKey failed.'
Failover cluster events
Cluster resource 'AG1' of type 'SQL Server Availability Group' in
clustered role 'AG1' failed.
Based on the failure policies for the resource and role, the cluster
service may try to bring the resource online on this node or move the
group to another node of the cluster and then restart it. Check the
resource and group state using Failover Cluster Manager or the
Get-ClusterResource Windows PowerShell cmdlet.
The Cluster service failed to bring clustered role 'AG_SP_2016_New'
completely online or offline. One or more resources may be in a failed
state. This may impact the availability of the clustered role.
sql-server sql-server-2016 availability-groups failover
sql-server sql-server-2016 availability-groups failover
edited just now
Randi Vertongen
asked 10 mins ago
Randi VertongenRandi Vertongen
2,711721
2,711721
Just to cover the obvious, the secondary is set to synchronous commit and is synchronised at the time of failover?
– George.Palacios
7 mins ago
@George.Palacios Added it, Db's where synchronized and are set to sync commit.
– Randi Vertongen
6 mins ago
add a comment |
Just to cover the obvious, the secondary is set to synchronous commit and is synchronised at the time of failover?
– George.Palacios
7 mins ago
@George.Palacios Added it, Db's where synchronized and are set to sync commit.
– Randi Vertongen
6 mins ago
Just to cover the obvious, the secondary is set to synchronous commit and is synchronised at the time of failover?
– George.Palacios
7 mins ago
Just to cover the obvious, the secondary is set to synchronous commit and is synchronised at the time of failover?
– George.Palacios
7 mins ago
@George.Palacios Added it, Db's where synchronized and are set to sync commit.
– Randi Vertongen
6 mins ago
@George.Palacios Added it, Db's where synchronized and are set to sync commit.
– Randi Vertongen
6 mins ago
add a comment |
0
active
oldest
votes
Your Answer
StackExchange.ready(function() {
var channelOptions = {
tags: "".split(" "),
id: "182"
};
initTagRenderer("".split(" "), "".split(" "), channelOptions);
StackExchange.using("externalEditor", function() {
// Have to fire editor after snippets, if snippets enabled
if (StackExchange.settings.snippets.snippetsEnabled) {
StackExchange.using("snippets", function() {
createEditor();
});
}
else {
createEditor();
}
});
function createEditor() {
StackExchange.prepareEditor({
heartbeatType: 'answer',
autoActivateHeartbeat: false,
convertImagesToLinks: false,
noModals: true,
showLowRepImageUploadWarning: true,
reputationToPostImages: null,
bindNavPrevention: true,
postfix: "",
imageUploader: {
brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
allowUrls: true
},
onDemand: true,
discardSelector: ".discard-answer"
,immediatelyShowMarkdownHelp:true
});
}
});
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
StackExchange.ready(
function () {
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fdba.stackexchange.com%2fquestions%2f230128%2falways-on-availability-groups-resolving-state-after-failover-remote-harden-of%23new-answer', 'question_page');
}
);
Post as a guest
Required, but never shown
0
active
oldest
votes
0
active
oldest
votes
active
oldest
votes
active
oldest
votes
Thanks for contributing an answer to Database Administrators Stack Exchange!
- Please be sure to answer the question. Provide details and share your research!
But avoid …
- Asking for help, clarification, or responding to other answers.
- Making statements based on opinion; back them up with references or personal experience.
To learn more, see our tips on writing great answers.
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
StackExchange.ready(
function () {
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fdba.stackexchange.com%2fquestions%2f230128%2falways-on-availability-groups-resolving-state-after-failover-remote-harden-of%23new-answer', 'question_page');
}
);
Post as a guest
Required, but never shown
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Gp4p ug 1gWNX0iN
Just to cover the obvious, the secondary is set to synchronous commit and is synchronised at the time of failover?
– George.Palacios
7 mins ago
@George.Palacios Added it, Db's where synchronized and are set to sync commit.
– Randi Vertongen
6 mins ago