Friday, March 12, 2010

WebSphere Cluster Member crashed

Have you ever been asked this question in an interview?
How do you find out which cluster member has crashed or gone down?
The usual answer is to open the administrative console and check the status of the individual server or of the cluster member.
The other option is to use a monitoring tool such as ITCAM, Wily Introscope, Unicenter, or Nagios.
But have you ever checked the SystemOut.log file of the other servers when one of the cluster members was stopped?
WebSphere has Distribution and Consistency Services (DCS), which is part of the HA architecture. The DCS messages written to the log show which member of the cluster is down.
Here is an example:


I have a cell named Test-Cell, which has a cluster spanning 6 nodes, each with 2 servers (12 members in total).
I stopped one of the cluster members. If you then look at the SystemOut.log file of another member (here, server01 on node01), you see messages similar to the following:
[3/3/10 18:00:37:758 CET] 00000026 RoleMember    W   DCSV8104W: DCS Stack DefaultCoreGroup.TestRepln at Member Test-Cell\node01\server01: Removing member [Test-Cell\node02\server02] because the member was requested to be removed  by member Test-Cell\node02\server01. Internal details VL suspects others: CC-Situation Normal
[3/3/10 18:00:38:176 CET] 00000023 VSyncAlgo1    I   DCSV2004I: DCS Stack DefaultCoreGroup at Member Test-Cell\node01\server01: View synchronization completed successfully. The View Identifier is (22898:0.Test-Cell\node02\server01). The internal details are None.
[3/3/10 18:00:38:207 CET] 00000023 VSyncAlgo1    I   DCSV2004I: DCS Stack DefaultCoreGroup.TestRepln at Member Test-Cell\node01\server01: View synchronization completed successfully. The View Identifier is (331:0.Test-Cell\node02\server01). The internal details are None.
[3/3/10 18:00:38:537 CET] 00000024 CoordinatorIm I   HMGR0218I: A new core group view has been installed. The core group is DefaultCoreGroup.
[3/3/10 18:00:39:228 CET] 00000026 DataStackMemb I   DCSV8050I: DCS Stack DefaultCoreGroup.TestRepln at Member Test-Cell\node01\server01: New view installed, identifier (332:0.Test-Cell\node02\server01), view size is 11 (AV=11, CD=12, CN=12, DF=12)
[3/3/10 18:00:39:343 CET] 00000021 DRSBuddyManag A   CWWDR0006I:  Replication instance terminated : Test-Cell\node02\server02

So, from the above messages, it is clear that server02 on node02 went down and was removed from the core group. The view size also dropped from 12 to 11.
After some troubleshooting and changes, I started the server that had been down. Now, if you look at SystemOut.log, you can see the following:
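
If you want to pull the removed member out of the log without reading it by eye, a small script can help. The following is only a sketch: the regular expression is inferred from the single DCSV8104W sample above, so the exact message layout on other WebSphere versions is an assumption, and the function name find_removed_members is made up for this example.

```python
import re

# Pattern inferred from the one DCSV8104W sample in this post (an assumption,
# not an official message grammar).
REMOVED = re.compile(
    r"DCSV8104W: DCS Stack (?P<stack>\S+) at Member (?P<reporter>\S+): "
    r"Removing member \[(?P<removed>[^\]]+)\]"
)


def find_removed_members(lines):
    """Return (timestamp, removed_member, reporting_member) for each DCSV8104W line."""
    events = []
    for line in lines:
        match = REMOVED.search(line)
        if match:
            # The timestamp is the text between the first '[' and the first ']'.
            timestamp = line[line.find("[") + 1 : line.find("]")]
            events.append((timestamp, match.group("removed"), match.group("reporter")))
    return events
```

To use it, pass the lines of a SystemOut.log file, for example `find_removed_members(open(path))`, where `path` points at the log of any surviving member.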
[3/3/10 18:17:13:245 CET] 00000026 RoleMember    I   DCSV8051I: DCS Stack DefaultCoreGroup.TestRepln at Member Test-Cell\node01\server01: Core group membership set changed. Added: [Test-Cell\node02\server02].
[3/3/10 18:17:13:315 CET] 00000023 MbuRmmAdapter I   DCSV1032I: DCS Stack DefaultCoreGroup.TestRepln at Member Test-Cell\node01\server01: Connected a defined member Test-Cell\node02\server02.
[3/3/10 18:17:30:337 CET] 00000023 VSyncAlgo1    I   DCSV2004I: DCS Stack DefaultCoreGroup.TestRepln at Member Test-Cell\node01\server01: View synchronization completed successfully. The View Identifier is (333:0.Test-Cell\node02\server01). The internal details are None.
[3/3/10 18:17:30:353 CET] 00000026 DataStackMemb I   DCSV8050I: DCS Stack DefaultCoreGroup.TestRepln at Member Test-Cell\node01\server01: New view installed, identifier (334:0.Test-Cell\node02\server01), view size is 12 (AV=12, CD=12, CN=12, DF=12)
[3/3/10 18:17:30:354 CET] 00000027 DRSBuddyManag A   CWWDR0007I:  Replication instance group membership changed: Test-Cell\node02\server02
[3/3/10 18:17:30:356 CET] 00000027 DRSBuddyManag A   CWWDR0002I: Replication instance is active : Test-Cell\node02\server02
[3/3/10 18:17:30:358 CET] 00000010 ViewReceiver  I   DCSV1033I: DCS Stack DefaultCoreGroup.TestRepln at Member Test-Cell\node01\server01: Confirmed all new view members in view identifier (334:0.Test-Cell\node02\server01). View channel type is View|Ptp.
You can see a message (DCSV8051I) showing that a member was added back to the core group, and the view size is 12 again.
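
The same idea works for the rejoin. The sketch below collects the members reported by DCSV8051I and the view sizes reported by DCSV8050I, so a drop from 12 to 11 and the return to 12 become visible at a glance. The patterns are again inferred only from the sample lines in this post, and summarize_views is a hypothetical helper name. The text inside "Added: [...]" is returned as-is, because the samples here do not show how several members would be separated.

```python
import re

# Both patterns are inferred from the sample lines in this post (assumptions).
ADDED = re.compile(
    r"DCSV8051I: .*?Core group membership set changed\. Added: \[(?P<members>[^\]]*)\]"
)
VIEW = re.compile(
    r"DCSV8050I: .*?New view installed, identifier \((?P<view_id>[^)]+)\), "
    r"view size is (?P<size>\d+)"
)


def summarize_views(lines):
    """Return (added, view_sizes): raw 'Added' entries and installed view sizes, in log order."""
    added = []
    view_sizes = []
    for line in lines:
        match = ADDED.search(line)
        if match:
            added.append(match.group("members"))
            continue
        match = VIEW.search(line)
        if match:
            view_sizes.append(int(match.group("size")))
    return added, view_sizes
```

For the two excerpts in this post, the view sizes come out as 11 and then 12, which matches the 12 cluster members (6 nodes with 2 servers each).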

About DCS:
There are two main versions of DCS: Core DCS and Data DCS. There is one Core DCS per process and it provides membership services among peer processes. These processes together form a Core Group. A process may be a member in one or more named Core Groups. Applications running on these processes can be members of application groups. Application groups are subsets of a particular named core group. A Data DCS component can be associated with each member of an application group.
DCS provides a mechanism for communicating information (distribution) among members with a given quality of service. Failure detection mechanisms that support and allow guaranteed quality of service are an inherent part of DCS and its services. DCS supports the state replication requirements of WebSphere components (such as HTTP sessions and stateful session beans) as well as the distribution and synchronization of WebSphere artifacts for performance, scalability, and availability.
I’ll soon write about WebSphere core groups, to help explain DCS and the high availability architecture of WebSphere.

