Always On Availability groups resolving state after failover

Always On Availability groups resolving state after failover - Remote harden of transaction...

My explanation will be opinion based as it is based on my findings. If more information is needed, I am always happy to add it.

The setup

We have 4 SQL Server 2016 Instances, and all 4 underlying servers are in a WSFC (In my eyes a strange setup since Always On AG's are 2 by 2. But that is a different question.)

enter image description here

4 nodes + Fileshare witness

Version: 13.0.5026.0 (all nodes)

The instances that are having the failover issues are instances 1 & 2.
They have 3 AG's between them. AG1 is primary on instance 1 and is having the issue. AG2 and AG3 are primary on instance 2.

Instances 3 and 4 have their own separate AG's (2).

AG1's settings

enter image description here

The week before

Last week, after checking for long running queries (none) and the synchronization state (green accross the board) we initiated a failover on AG1 from instance1 (primary) to instance2 (secondary).

The failover failed, leaving both the former primary (instance1) and the former secondary (instance2) in a temporary resolving state. Two databases on the former primary even showed Reverting / In Recovery for a few minutes. In the error log, a message was shown on many occasions:

enter image description here

The databases where not accessible during the outage, the outage took about 10 minutes total.

Before failing over I checked:

Current running queries (no long running queries)

DBCC OPENTRAN

Always On dashboard (Green accross the board)

We found a post by microsoft to enable a traceflag, 3923. We enabled this one on both instances after the outage.

Take #2, today

We tried testing the failover again to see if the issues where resolved.

Spoiler alert, they where not resolved.

enter image description here

Remote harden of transaction 'GhostCleanupTask' (ID 0x00000000141b9d97 0000:000cad62) started at Feb 19 2019 11:17AM in database 'DB1' at LSN (102:23518:2) failed.



Remote harden of transaction 'GhostCleanupTask' (ID 0x00000000141b9d97 0000:000cad62) started at Feb 19 2019 11:17AM in database 'DB1' at LSN (102:23518:2) failed.



Remote harden of transaction 'GhostCleanupTask' (ID 0x00000000141b9da2 0000:078d645c) started at Feb 19 2019 11:17AM in database 'DB2' at LSN (5909:54307:137) failed.



Remote harden of transaction 'GhostCleanupTask' (ID 0x00000000141b9dee 0000:0017e40a) started at Feb 19 2019 11:17AM in database 'DB1' at LSN (164:21769:2) failed.



Remote harden of transaction 'GhostCleanupTask' (ID 0x00000000141ba1a8 0000:0024e912) started at Feb 19 2019 11:17AM in database 'DB3' at LSN (1246:7285:2) failed.

Again, the same error messages where shown.

The actual question(s)

The beginning question is, am i right to assume that the failover
not happening is due to the GhostCleanupTask not being able to
harden the transaction log, or is it a consequence of the failure
and not the cause.

If it is the cause, what could be done to resolve this?

Something I am looking into, since the problem does not seem to be due to any long running queries rolling back, that the secondary databases are in an inconsistent state? But they show up as Synchronized.

Extra info

The only interesting message in the clusterlog from last week appeared to be this one:

 mscs::TopologyPersister::TryGetNetworkPrivateProperties: (2)' because of 'OpenSubKey failed.'

Failover cluster events

Cluster resource 'AG1' of type 'SQL Server Availability Group' in
clustered role 'AG1' failed.

Based on the failure policies for the resource and role, the cluster
service may try to bring the resource online on this node or move the
group to another node of the cluster and then restart it. Check the
resource and group state using Failover Cluster Manager or the
Get-ClusterResource Windows PowerShell cmdlet.

The Cluster service failed to bring clustered role 'AG_SP_2016_New'
completely online or offline. One or more resources may be in a failed
state. This may impact the availability of the clustered role.

edited just now

asked 10 mins ago

Randi Vertongen

2,711721

Just to cover the obvious, the secondary is set to synchronous commit and is synchronised at the time of failover?

– George.Palacios
7 mins ago

@George.Palacios Added it, Db's where synchronized and are set to sync commit.

– Randi Vertongen
6 mins ago

add a comment |

My explanation will be opinion based as it is based on my findings. If more information is needed, I am always happy to add it.

The setup

We have 4 SQL Server 2016 Instances, and all 4 underlying servers are in a WSFC (In my eyes a strange setup since Always On AG's are 2 by 2. But that is a different question.)

enter image description here

4 nodes + Fileshare witness

Version: 13.0.5026.0 (all nodes)

The instances that are having the failover issues are instances 1 & 2.
They have 3 AG's between them. AG1 is primary on instance 1 and is having the issue. AG2 and AG3 are primary on instance 2.

Instances 3 and 4 have their own separate AG's (2).

AG1's settings

enter image description here

The week before

Last week, after checking for long running queries (none) and the synchronization state (green accross the board) we initiated a failover on AG1 from instance1 (primary) to instance2 (secondary).

enter image description here

The databases where not accessible during the outage, the outage took about 10 minutes total.

Before failing over I checked:

Current running queries (no long running queries)

DBCC OPENTRAN

Always On dashboard (Green accross the board)

We found a post by microsoft to enable a traceflag, 3923. We enabled this one on both instances after the outage.

Take #2, today

We tried testing the failover again to see if the issues where resolved.

Spoiler alert, they where not resolved.

enter image description here

Remote harden of transaction 'GhostCleanupTask' (ID 0x00000000141b9d97 0000:000cad62) started at Feb 19 2019 11:17AM in database 'DB1' at LSN (102:23518:2) failed.



Remote harden of transaction 'GhostCleanupTask' (ID 0x00000000141b9d97 0000:000cad62) started at Feb 19 2019 11:17AM in database 'DB1' at LSN (102:23518:2) failed.



Remote harden of transaction 'GhostCleanupTask' (ID 0x00000000141b9da2 0000:078d645c) started at Feb 19 2019 11:17AM in database 'DB2' at LSN (5909:54307:137) failed.



Remote harden of transaction 'GhostCleanupTask' (ID 0x00000000141b9dee 0000:0017e40a) started at Feb 19 2019 11:17AM in database 'DB1' at LSN (164:21769:2) failed.



Remote harden of transaction 'GhostCleanupTask' (ID 0x00000000141ba1a8 0000:0024e912) started at Feb 19 2019 11:17AM in database 'DB3' at LSN (1246:7285:2) failed.

Again, the same error messages where shown.

The actual question(s)

The beginning question is, am i right to assume that the failover
not happening is due to the GhostCleanupTask not being able to
harden the transaction log, or is it a consequence of the failure
and not the cause.

If it is the cause, what could be done to resolve this?

Extra info

The only interesting message in the clusterlog from last week appeared to be this one:

 mscs::TopologyPersister::TryGetNetworkPrivateProperties: (2)' because of 'OpenSubKey failed.'

Failover cluster events

Cluster resource 'AG1' of type 'SQL Server Availability Group' in
clustered role 'AG1' failed.

Based on the failure policies for the resource and role, the cluster
service may try to bring the resource online on this node or move the
group to another node of the cluster and then restart it. Check the
resource and group state using Failover Cluster Manager or the
Get-ClusterResource Windows PowerShell cmdlet.

The Cluster service failed to bring clustered role 'AG_SP_2016_New'
completely online or offline. One or more resources may be in a failed
state. This may impact the availability of the clustered role.

edited just now

asked 10 mins ago

Randi Vertongen

2,711721

Just to cover the obvious, the secondary is set to synchronous commit and is synchronised at the time of failover?

– George.Palacios
7 mins ago

@George.Palacios Added it, Db's where synchronized and are set to sync commit.

– Randi Vertongen
6 mins ago

add a comment |

My explanation will be opinion based as it is based on my findings. If more information is needed, I am always happy to add it.

The setup

We have 4 SQL Server 2016 Instances, and all 4 underlying servers are in a WSFC (In my eyes a strange setup since Always On AG's are 2 by 2. But that is a different question.)

enter image description here

4 nodes + Fileshare witness

Version: 13.0.5026.0 (all nodes)

The instances that are having the failover issues are instances 1 & 2.
They have 3 AG's between them. AG1 is primary on instance 1 and is having the issue. AG2 and AG3 are primary on instance 2.

Instances 3 and 4 have their own separate AG's (2).

AG1's settings

enter image description here

The week before

Last week, after checking for long running queries (none) and the synchronization state (green accross the board) we initiated a failover on AG1 from instance1 (primary) to instance2 (secondary).

enter image description here

The databases where not accessible during the outage, the outage took about 10 minutes total.

Before failing over I checked:

Current running queries (no long running queries)

DBCC OPENTRAN

Always On dashboard (Green accross the board)

We found a post by microsoft to enable a traceflag, 3923. We enabled this one on both instances after the outage.

Take #2, today

We tried testing the failover again to see if the issues where resolved.

Spoiler alert, they where not resolved.

enter image description here

Remote harden of transaction 'GhostCleanupTask' (ID 0x00000000141b9d97 0000:000cad62) started at Feb 19 2019 11:17AM in database 'DB1' at LSN (102:23518:2) failed.



Remote harden of transaction 'GhostCleanupTask' (ID 0x00000000141b9d97 0000:000cad62) started at Feb 19 2019 11:17AM in database 'DB1' at LSN (102:23518:2) failed.



Remote harden of transaction 'GhostCleanupTask' (ID 0x00000000141b9da2 0000:078d645c) started at Feb 19 2019 11:17AM in database 'DB2' at LSN (5909:54307:137) failed.



Remote harden of transaction 'GhostCleanupTask' (ID 0x00000000141b9dee 0000:0017e40a) started at Feb 19 2019 11:17AM in database 'DB1' at LSN (164:21769:2) failed.



Remote harden of transaction 'GhostCleanupTask' (ID 0x00000000141ba1a8 0000:0024e912) started at Feb 19 2019 11:17AM in database 'DB3' at LSN (1246:7285:2) failed.

Again, the same error messages where shown.

The actual question(s)

The beginning question is, am i right to assume that the failover
not happening is due to the GhostCleanupTask not being able to
harden the transaction log, or is it a consequence of the failure
and not the cause.

If it is the cause, what could be done to resolve this?

Extra info

The only interesting message in the clusterlog from last week appeared to be this one:

 mscs::TopologyPersister::TryGetNetworkPrivateProperties: (2)' because of 'OpenSubKey failed.'

Failover cluster events

Cluster resource 'AG1' of type 'SQL Server Availability Group' in
clustered role 'AG1' failed.

Based on the failure policies for the resource and role, the cluster
service may try to bring the resource online on this node or move the
group to another node of the cluster and then restart it. Check the
resource and group state using Failover Cluster Manager or the
Get-ClusterResource Windows PowerShell cmdlet.

The Cluster service failed to bring clustered role 'AG_SP_2016_New'
completely online or offline. One or more resources may be in a failed
state. This may impact the availability of the clustered role.

edited just now

asked 10 mins ago

Randi Vertongen

2,711721

My explanation will be opinion based as it is based on my findings. If more information is needed, I am always happy to add it.

The setup

We have 4 SQL Server 2016 Instances, and all 4 underlying servers are in a WSFC (In my eyes a strange setup since Always On AG's are 2 by 2. But that is a different question.)

enter image description here

4 nodes + Fileshare witness

Version: 13.0.5026.0 (all nodes)

The instances that are having the failover issues are instances 1 & 2.
They have 3 AG's between them. AG1 is primary on instance 1 and is having the issue. AG2 and AG3 are primary on instance 2.

Instances 3 and 4 have their own separate AG's (2).

AG1's settings

enter image description here

The week before

Last week, after checking for long running queries (none) and the synchronization state (green accross the board) we initiated a failover on AG1 from instance1 (primary) to instance2 (secondary).

enter image description here

The databases where not accessible during the outage, the outage took about 10 minutes total.

Before failing over I checked:

Current running queries (no long running queries)

DBCC OPENTRAN

Always On dashboard (Green accross the board)

We found a post by microsoft to enable a traceflag, 3923. We enabled this one on both instances after the outage.

Take #2, today

We tried testing the failover again to see if the issues where resolved.

Spoiler alert, they where not resolved.

enter image description here

Remote harden of transaction 'GhostCleanupTask' (ID 0x00000000141b9d97 0000:000cad62) started at Feb 19 2019 11:17AM in database 'DB1' at LSN (102:23518:2) failed.



Remote harden of transaction 'GhostCleanupTask' (ID 0x00000000141b9d97 0000:000cad62) started at Feb 19 2019 11:17AM in database 'DB1' at LSN (102:23518:2) failed.



Remote harden of transaction 'GhostCleanupTask' (ID 0x00000000141b9da2 0000:078d645c) started at Feb 19 2019 11:17AM in database 'DB2' at LSN (5909:54307:137) failed.



Remote harden of transaction 'GhostCleanupTask' (ID 0x00000000141b9dee 0000:0017e40a) started at Feb 19 2019 11:17AM in database 'DB1' at LSN (164:21769:2) failed.



Remote harden of transaction 'GhostCleanupTask' (ID 0x00000000141ba1a8 0000:0024e912) started at Feb 19 2019 11:17AM in database 'DB3' at LSN (1246:7285:2) failed.

Again, the same error messages where shown.

The actual question(s)

The beginning question is, am i right to assume that the failover
not happening is due to the GhostCleanupTask not being able to
harden the transaction log, or is it a consequence of the failure
and not the cause.

If it is the cause, what could be done to resolve this?

Extra info

The only interesting message in the clusterlog from last week appeared to be this one:

 mscs::TopologyPersister::TryGetNetworkPrivateProperties: (2)' because of 'OpenSubKey failed.'

Failover cluster events

Cluster resource 'AG1' of type 'SQL Server Availability Group' in
clustered role 'AG1' failed.

Based on the failure policies for the resource and role, the cluster
service may try to bring the resource online on this node or move the
group to another node of the cluster and then restart it. Check the
resource and group state using Failover Cluster Manager or the
Get-ClusterResource Windows PowerShell cmdlet.

The Cluster service failed to bring clustered role 'AG_SP_2016_New'
completely online or offline. One or more resources may be in a failed
state. This may impact the availability of the clustered role.

sql-server sql-server-2016 availability-groups failover

edited just now

asked 10 mins ago

Randi Vertongen

2,711721

edited just now

asked 10 mins ago

Randi Vertongen

2,711721

edited just now

asked 10 mins ago

Randi Vertongen

2,711721

asked 10 mins ago

Randi Vertongen

2,711721

asked 10 mins ago

Randi Vertongen

2,711721

Just to cover the obvious, the secondary is set to synchronous commit and is synchronised at the time of failover?

– George.Palacios
7 mins ago

@George.Palacios Added it, Db's where synchronized and are set to sync commit.

– Randi Vertongen
6 mins ago

add a comment |

Just to cover the obvious, the secondary is set to synchronous commit and is synchronised at the time of failover?

– George.Palacios
7 mins ago

@George.Palacios Added it, Db's where synchronized and are set to sync commit.

– Randi Vertongen
6 mins ago

Just to cover the obvious, the secondary is set to synchronous commit and is synchronised at the time of failover?

– George.Palacios
7 mins ago

@George.Palacios Added it, Db's where synchronized and are set to sync commit.

– Randi Vertongen
6 mins ago

add a comment |

0

active

oldest

votes

Your Answer

StackExchange.ready(function() {
var channelOptions = {
tags: "".split(" "),
id: "182"
};
initTagRenderer("".split(" "), "".split(" "), channelOptions);

StackExchange.using("externalEditor", function() {
// Have to fire editor after snippets, if snippets enabled
if (StackExchange.settings.snippets.snippetsEnabled) {
StackExchange.using("snippets", function() {
createEditor();
});
}
else {
createEditor();
}
});

function createEditor() {
StackExchange.prepareEditor({
heartbeatType: 'answer',
autoActivateHeartbeat: false,
convertImagesToLinks: false,
noModals: true,
showLowRepImageUploadWarning: true,
reputationToPostImages: null,
bindNavPrevention: true,
postfix: "",
imageUploader: {
brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
allowUrls: true
},
onDemand: true,
discardSelector: ".discard-answer"
,immediatelyShowMarkdownHelp:true
});

}
});

draft saved

draft discarded

Sign up or log in

StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});

Post as a guest

Name

Required, but never shown

StackExchange.ready(
function () {
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fdba.stackexchange.com%2fquestions%2f230128%2falways-on-availability-groups-resolving-state-after-failover-remote-harden-of%23new-answer', 'question_page');
}
);

Post as a guest

Name

Required, but never shown

0

active

oldest

votes

0

active

oldest

votes

draft saved

draft discarded

Thanks for contributing an answer to Database Administrators Stack Exchange!

Please be sure to answer the question. Provide details and share your research!

But avoid …

Asking for help, clarification, or responding to other answers.

Making statements based on opinion; back them up with references or personal experience.

To learn more, see our tips on writing great answers.

draft saved

draft discarded

Sign up or log in

StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});

Post as a guest

Name

Required, but never shown

Post as a guest

Name

Required, but never shown

Sign up or log in

StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});

Post as a guest

Name

Required, but never shown

Sign up or log in

StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});

Post as a guest

Name

Required, but never shown

Sign up or log in

StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});

Post as a guest

Name

Required, but never shown

Name

Required, but never shown

Name

Required, but never shown

This page is only for reference, If you need detailed information, please check here

Gp4p ug 1gWNX0iN

搜尋此網誌

Xrhrft