What is a core group?
A core group encapsulates processes in a Network Deployment cell to
create high availability domains.
A core group is a grouping of WebSphere Application Server cell
processes. A core group can contain standalone servers, cluster members,
node agents, and the deployment manager. A core group must contain at
least one node agent or the deployment manager.
DefaultCoreGroup is the core group that is created by default at
installation time and can be used out of the box; that is, all of its
processes automatically know about each other.
Note:
1. A core group cannot extend beyond a cell.
2. All JVMs in a core group must be able to communicate with each
other (they exchange heartbeat messages to track one another).
Core group coordinator
Once the core group stabilizes at runtime, one of its members is
elected to act as the coordinator. That member, called the core group
coordinator, is responsible for managing high availability within
that core group.
1. It maintains all group information, such as the group name,
members, and the policy of the group.
2. It keeps a record of the state of the group members as they start,
stop, or fail.
3. It assigns singleton services to group members and handles
failover based on the specified policy.
You can change which servers are preferred as the coordinator by going to:
Servers > Core groups > Core group settings > DefaultCoreGroup
> Preferred coordinator servers.
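The same configuration object can be inspected from the command line.
A minimal wsadmin (Jython) sketch, assuming the standard CoreGroup
configuration type and reusing the Test-Cell name from the logs below:

# Run inside: wsadmin -lang jython
# Look up the DefaultCoreGroup configuration object.
cg = AdminConfig.getid('/Cell:Test-Cell/CoreGroup:DefaultCoreGroup/')
# Dump its attributes, including the preferred coordinator servers
# and the number of coordinators.
print AdminConfig.show(cg)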
When a member becomes the active coordinator, you see the following
message in its SystemOut.log:
[3/3/10 18:00:37:758 CET] 00000013 CoordinatorIm I
HMGR0206I: The Coordinator is an Active Coordinator for core group
DefaultCoreGroup.
If a member fails or is stopped in the core group, you see messages like these:
[3/3/10 18:00:37:758 CET] 00000026 RoleMember W
DCSV8104W: DCS Stack DefaultCoreGroup.TestRepln at Member
Test-Cell\node01\server01: Removing member [Test-Cell\node02\server02]
because the member was requested to be removed by member
Test-Cell\node02\server01. Internal details VL suspects others:
CC-Situation Normal
[3/3/10 18:00:38:176 CET] 00000023 VSyncAlgo1 I DCSV2004I: DCS
Stack DefaultCoreGroup at Member Test-Cell\node01\server01: View
synchronization completed successfully. The View Identifier is
(22898:0.Test-Cell\node02\server01). The internal details are None.
[3/3/10 18:00:38:207 CET] 00000023 VSyncAlgo1 I DCSV2004I: DCS
Stack DefaultCoreGroup.TestRepln at Member Test-Cell\node01\server01:
View synchronization completed successfully. The View Identifier is
(331:0.Test-Cell\node02\server01). The internal details are None.
[3/3/10 18:00:38:537 CET] 00000024 CoordinatorIm I HMGR0218I: A new
core group view has been installed. The core group is DefaultCoreGroup.
[3/3/10 18:00:39:228 CET] 00000026 DataStackMemb I DCSV8050I: DCS
Stack DefaultCoreGroup.TestRepln at Member Test-Cell\node01\server01:
New view installed, identifier (332:0.Test-Cell\node02\server01), view
size is 11 (AV=11, CD=12, CN=12, DF=12)
[3/3/10 18:00:39:343 CET] 00000021 DRSBuddyManag A CWWDR0006I:
Replication instance terminated : Test-Cell\node02\server02
If a new member joins the core group, you see messages like these:
[3/3/10 18:17:13:245 CET] 00000026 RoleMember I
DCSV8051I: DCS Stack DefaultCoreGroup.TestRepln at Member
Test-Cell\node01\server01: Core group membership set changed. Added:
[Test-Cell\node02\server02].
[3/3/10 18:17:13:315 CET] 00000023 MbuRmmAdapter I DCSV1032I: DCS
Stack DefaultCoreGroup.TestRepln at Member Test-Cell\node01\server01:
Connected a defined member Test-Cell\node02\server02.
[3/3/10 18:17:30:337 CET] 00000023 VSyncAlgo1 I DCSV2004I: DCS
Stack DefaultCoreGroup.TestRepln at Member Test-Cell\node01\server01:
View synchronization completed successfully. The View Identifier is
(333:0.Test-Cell\node02\server01). The internal details are None.
[3/3/10 18:17:30:353 CET] 00000026 DataStackMemb I DCSV8050I: DCS
Stack DefaultCoreGroup.TestRepln at Member Test-Cell\node01\server01:
New view installed, identifier (334:0.Test-Cell\node02\server01), view
size is 12 (AV=12, CD=12, CN=12, DF=12)
What happens when the coordinator goes down?
When the active coordinator becomes unavailable (stopped or crashed),
the HA manager elects the first available server in the preferred
coordinator servers list. If no preferred list is specified, it
selects the lexically lowest-named server.
The newly selected coordinator initiates a state rebuild by sending a
message to all core group members to report their states.
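As a rough illustration, the election rule can be sketched in plain
Python (conceptual code with made-up names, not anything from
WebSphere):

# Conceptual sketch of coordinator election (not actual HA manager code).
def elect_coordinator(running_members, preferred_list):
    # First choice: the first server in the preferred coordinator
    # servers list that is currently running.
    for server in preferred_list:
        if server in running_members:
            return server
    # Fallback: the lexically lowest-named running server.
    return min(running_members)

# Example: with no preferred list, "node01\\server01" wins over
# "node02\\server01" because it sorts lower.
print(elect_coordinator({"node02\\server01", "node01\\server01"}, []))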
Core group settings
1. Number of coordinators
Specifies the number of coordinators for this core group. The default
value is one coordinator, although multiple coordinators are advisable
for large core groups. All of the group data must fit in the memory of
the allocated coordinators. One coordinator can run out of memory in a
system with a large core group, which can cause the system to work
improperly.
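If you script this change, a minimal wsadmin (Jython) sketch looks
like the following; the attribute name numCoordinators is an
assumption here, so confirm it against the output of
AdminConfig.show(cg) first:

# Raise the coordinator count for DefaultCoreGroup to 2.
# 'numCoordinators' is assumed -- verify with AdminConfig.show(cg).
cg = AdminConfig.getid('/Cell:Test-Cell/CoreGroup:DefaultCoreGroup/')
AdminConfig.modify(cg, [['numCoordinators', '2']])
AdminConfig.save()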
2. Transport type
Specifies the transport mechanism to use
for communication between members of a core group.
- Channel framework: the default transport type. It uses the channel
framework service to incorporate port reusability and shared port
technology into the communication system.
- Unicast: a targeted network model that focuses on a direct
recipient for communication. This type of communication is most
suitable when the intended message is sent to a specific set of
recipients.
- Multicast: a broadcast network model. This model broadcasts
communication across the defined network, depending upon the values
that are provided for the multicast settings. Multicast is suitable
when there are many recipients for the intended message; otherwise
broadcast communication tends to overload the network with traffic
and can impact performance goals.
3. Channel chain name
Specifies the name of the channel chain if you select channel
framework for the transport type.
If you select the Multicast transport type, you also set:
- Multicast port: tells the coordinator where to scan for
transmissions. When setting this value, verify that the port is not
used by another network communication device; a conflicting port
value causes problems with your high availability manager
infrastructure.
- Multicast group IP start: the starting Internet Protocol (IP)
address of the intended communication area.
- Multicast group IP end: the ending IP address of the intended
communication area. Plan the network to accommodate scalability.
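These fields can also be set through wsadmin. In the Jython sketch
below, every attribute name (transportType, multicastPort,
multicastGroupIPStart, multicastGroupIPEnd) is an assumption
mirroring the console fields, and the port and IP range are example
values only; verify the real names with AdminConfig.show(cg) before
relying on this:

# Example only: switch DefaultCoreGroup to multicast.
# All attribute names below are assumptions mirroring the console
# fields -- check AdminConfig.show(cg) for the real names.
cg = AdminConfig.getid('/Cell:Test-Cell/CoreGroup:DefaultCoreGroup/')
AdminConfig.modify(cg, [['transportType', 'MULTICAST'],
                        ['multicastPort', '23445'],
                        ['multicastGroupIPStart', '239.0.0.1'],
                        ['multicastGroupIPEnd', '239.0.0.10']])
AdminConfig.save()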
4. Additional Properties
Core group servers
Specifies the server processes that belong to the core group. Server
processes include the deployment manager, node agents, application
servers, and cluster members. You can use the panel that displays to
move server processes to a different core group.
Policies
Use this panel to define the policies that determine which members of
a high availability group are made active.
Preferred coordinator servers
Specifies which core group servers are preferred coordinator servers.
Core Group policies:
Servers > Core groups > Core group settings > New or
existing core group > Policies.
Policy types

All active: The All active policy indicates that the high
availability manager keeps all of the application components that are
running on all of the servers in the high availability group active
at all times.

M of N: The M of N policy is similar to the One of N policy. However,
it enables you to specify the number (M) of high availability group
members that you want to keep active if it is possible to do so. The
number of active members must be greater than one and less than or
equal to the number of servers in the high availability group. If the
number of active servers is set to one, this policy is a match for
the One of N policy.

No operation: The No operation policy indicates that no high
availability group members are made active.

One of N: The One of N policy keeps one member of the high
availability group active at all times. It is used by groups that
require singleton failover. If a failure occurs, the high
availability manager starts the singleton on another server.

Static: The Static policy allows you to statically define or
configure the active members of the high availability group.
Match Criteria
Specifies one or more name-value pairs that are used to associate
this policy with a high availability group. These pairs must match
attributes that are contained in the name of a high availability group
before this policy is associated with that group.
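In other words, a policy applies to a group when every pair in its
match criteria appears in the group's name (when several policies
match, the one matching the most pairs is chosen). A conceptual
plain-Python sketch with made-up data:

# Conceptual sketch of match criteria (not actual HA manager code).
def matches(match_criteria, group_name):
    # Every name-value pair in the policy's match criteria must
    # also be present in the high availability group's name.
    return all(group_name.get(k) == v for k, v in match_criteria.items())

# Hypothetical group name and policy criteria for illustration.
group_name = {"type": "WSAF_SIB", "WSAF_SIB_BUS": "MyBus"}
print(matches({"type": "WSAF_SIB"}, group_name))          # True
print(matches({"type": "WSAF_TRANSACTION"}, group_name))  # False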
Is alive timer
Specifies, in seconds, the interval at which the high availability
manager checks the health of the active group members that are
governed by this policy. If a group member has failed, the server on
which the group member resides is restarted.
Quorum
Specifies whether quorum checking is enabled for a group governed by
this policy. Quorum is a mechanism that can be used to protect resources
that are shared across members of the group in the event of a failure.
The quorum mechanism is designed to work in conjunction with a hardware
control facility that allows application servers to be shut down if a
failure causes the group to be partitioned.
Note: the Quorum setting in the policy has an effect only if the
following items are true:
* The group members are also cluster members.
* GroupName.WAS_CLUSTER=clustername must be specified as a property in
the group name of any high availability group matching this policy.
Fail back
Specifies whether work items assigned to the failing server are
moved to the server that is designated as the most preferred server
for the group if a failure occurs. This field applies only to the
M of N and One of N policies.
Preferred servers only
Specifies whether group members are activated only on servers that
are on the list of preferred servers for this group. This field
applies only to the M of N and One of N policies.
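A conceptual plain-Python sketch (made-up names, not WebSphere code)
of how these two flags drive One of N activation:

# Conceptual sketch of One of N activation (not actual HA manager code).
def choose_active(online, preferred, preferred_only, fail_back,
                  current=None):
    # Keep the current member when it is healthy and fail back is off.
    if current in online and not fail_back:
        return current
    # Otherwise activate the most preferred server that is online.
    for server in preferred:
        if server in online:
            return server
    if preferred_only:
        return None                      # nothing gets activated
    if current in online:
        return current
    return min(online) if online else None

# server01 is down, so the next preferred member runs the singleton.
print(choose_active({"server02"}, ["server01", "server02"], False, True))
# server01 comes back; fail back moves the singleton to it.
print(choose_active({"server01", "server02"}, ["server01", "server02"],
                    False, True, current="server02"))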
Core group servers:
Use this to move servers into a different core group. All members of a
cluster must be in the same core group. If you select one or more
members of a cluster, all of the members of that cluster must be moved.
Preferred coordinator servers:
Use Add and Remove to move servers into and out of the list of
preferred servers. Use Move up and Move down to adjust the order within
the list of preferred servers. Make sure that the most preferred server
is at the top of the list and the least preferred server is at the
bottom.
Core group member failure detection
The HA manager monitors all core group members. It uses two
mechanisms to detect failures:
1. Active failure detection
If heartbeats from a JVM fail for a specified interval of time, the
JVM is marked as failed. With the default settings, heartbeats are
sent every 10 seconds, and 20 consecutive heartbeats (200 seconds)
must be missed before the JVM is marked as failed. When a JVM is
marked as failed, a new view is installed, which you can see in the
SystemOut log (this rule is sketched after the note below).
2. TCP keep-alive
If one member cannot contact another member and gets a closed-socket
error, it signals the other members to treat that member as failed.
For example, if a JVM panics or a network problem occurs, the failure
is detected as soon as the TCP settings allow.
Note: the TCP keep-alive settings belong to the operating system, not
to WebSphere.
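The active-failure-detection rule from item 1 boils down to the
following plain-Python sketch (conceptual only; the constants are the
defaults described above):

import time

# Conceptual sketch of active failure detection (not HA manager code).
HEARTBEAT_INTERVAL = 10   # seconds between heartbeats (default)
MISSED_THRESHOLD = 20     # missed heartbeats before declaring failure

def is_failed(last_heartbeat, now=None):
    # A member is marked failed once 20 consecutive heartbeats
    # (200 seconds) have gone unanswered.
    if now is None:
        now = time.time()
    return now - last_heartbeat > HEARTBEAT_INTERVAL * MISSED_THRESHOLD

# A member last heard from 201 seconds ago is declared failed, which
# triggers a new view (the DCSV8050I message shown earlier).
print(is_failed(last_heartbeat=0, now=201))  # True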
For more about DCS and finding which core group member crashed or
stopped, see here.