What is a core group?
A core group encapsulates processes in a Network Deployment cell to
create high availability domains.
A core group is a grouping of WebSphere Application Server cell
processes. A core group can contain standalone servers, cluster members,
node agents, and the deployment manager. A core group must contain at
least one node agent or the deployment manager.
DefaultCoreGroup is the core group that is created by default at
installation time and can be used out of the box; that is, all of its
processes automatically know about each other.
Note:
1. A core group cannot extend beyond a cell.
2. All JVMs in a core group must be able to communicate with each
other (they exchange heartbeat messages to track one another).
Core group coordinator
Once the core group stabilizes at runtime, one of its members is
elected to act as the coordinator. That member, called the core group
coordinator, is responsible for managing high availability within
that core group.
1. It maintains all group information, such as the group name,
members, and the policy of the group.
2. It keeps a record of the state of the group members as they start,
stop, or fail.
3. It assigns singleton services to group members and handles
failover based on the specified policy.
You can change which servers are preferred as the coordinator by going to:
Servers > Core groups > Core group settings > DefaultCoreGroup
> Preferred coordinator servers.
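The same configuration object can be inspected from the command line.
A minimal wsadmin (Jython) sketch, assuming the standard CoreGroup
configuration type and reusing the Test-Cell name from the logs below:

# Run inside: wsadmin -lang jython
# Look up the DefaultCoreGroup configuration object.
cg = AdminConfig.getid('/Cell:Test-Cell/CoreGroup:DefaultCoreGroup/')
# Dump its attributes, including the preferred coordinator servers
# and the number of coordinators.
print AdminConfig.show(cg)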
When a member becomes the active coordinator, you see the following
message in its SystemOut.log:
[3/3/10 18:00:37:758 CET] 00000013 CoordinatorIm I
HMGR0206I: The Coordinator is an Active Coordinator for core group
DefaultCoreGroup.
If a member fails or is stopped in the core group, you see messages like these:
[3/3/10 18:00:37:758 CET] 00000026 RoleMember W
DCSV8104W: DCS Stack DefaultCoreGroup.TestRepln at Member
Test-Cell\node01\server01: Removing member [Test-Cell\node02\server02]
because the member was requested to be removed by member
Test-Cell\node02\server01. Internal details VL suspects others:
CC-Situation Normal
[3/3/10 18:00:38:176 CET] 00000023 VSyncAlgo1 I DCSV2004I: DCS
Stack DefaultCoreGroup at Member Test-Cell\node01\server01: View
synchronization completed successfully. The View Identifier is
(22898:0.Test-Cell\node02\server01). The internal details are None.
[3/3/10 18:00:38:207 CET] 00000023 VSyncAlgo1 I DCSV2004I: DCS
Stack DefaultCoreGroup.TestRepln at Member Test-Cell\node01\server01:
View synchronization completed successfully. The View Identifier is
(331:0.Test-Cell\node02\server01). The internal details are None.
[3/3/10 18:00:38:537 CET] 00000024 CoordinatorIm I HMGR0218I: A new
core group view has been installed. The core group is DefaultCoreGroup.
[3/3/10 18:00:39:228 CET] 00000026 DataStackMemb I DCSV8050I: DCS
Stack DefaultCoreGroup.TestRepln at Member Test-Cell\node01\server01:
New view installed, identifier (332:0.Test-Cell\node02\server01), view
size is 11 (AV=11, CD=12, CN=12, DF=12)
[3/3/10 18:00:39:343 CET] 00000021 DRSBuddyManag A CWWDR0006I:
Replication instance terminated : Test-Cell\node02\server02
If a new member joins the core group, you see messages like these:
[3/3/10 18:17:13:245 CET] 00000026 RoleMember I
DCSV8051I: DCS Stack DefaultCoreGroup.TestRepln at Member
Test-Cell\node01\server01: Core group membership set changed. Added:
[Test-Cell\node02\server02].
[3/3/10 18:17:13:315 CET] 00000023 MbuRmmAdapter I DCSV1032I: DCS
Stack DefaultCoreGroup.TestRepln at Member Test-Cell\node01\server01:
Connected a defined member Test-Cell\node02\server02.
[3/3/10 18:17:30:337 CET] 00000023 VSyncAlgo1 I DCSV2004I: DCS
Stack DefaultCoreGroup.TestRepln at Member Test-Cell\node01\server01:
View synchronization completed successfully. The View Identifier is
(333:0.Test-Cell\node02\server01). The internal details are None.
[3/3/10 18:17:30:353 CET] 00000026 DataStackMemb I DCSV8050I: DCS
Stack DefaultCoreGroup.TestRepln at Member Test-Cell\node01\server01:
New view installed, identifier (334:0.Test-Cell\node02\server01), view
size is 12 (AV=12, CD=12, CN=12, DF=12)
What happens when the coordinator goes down?
When the active coordinator becomes unavailable (stopped or crashed),
the HA manager elects the first available server in the preferred
coordinator servers list. If no preferred list is specified, it
selects the lexically lowest-named server.
The newly selected coordinator initiates a state rebuild by sending a
message to all core group members to report their states.
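As a rough illustration, the election rule can be sketched in plain
Python (conceptual code with made-up names, not anything from
WebSphere):

# Conceptual sketch of coordinator election (not actual HA manager code).
def elect_coordinator(running_members, preferred_list):
    # First choice: the first server in the preferred coordinator
    # servers list that is currently running.
    for server in preferred_list:
        if server in running_members:
            return server
    # Fallback: the lexically lowest-named running server.
    return min(running_members)

# Example: with no preferred list, "node01\\server01" wins over
# "node02\\server01" because it sorts lower.
print(elect_coordinator({"node02\\server01", "node01\\server01"}, []))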
Core group settings
1. Number of coordinators
Specifies the number of coordinators for this core group. The default
value is one coordinator, although multiple coordinators are advisable
for large core groups. All of the group data must fit in the memory of
the allocated coordinators. One coordinator can run out of memory in a
system with a large core group, which can cause the system to work
improperly.
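If you script this change, a minimal wsadmin (Jython) sketch looks
like the following; the attribute name numCoordinators is an
assumption here, so confirm it against the output of
AdminConfig.show(cg) first:

# Raise the coordinator count for DefaultCoreGroup to 2.
# 'numCoordinators' is assumed -- verify with AdminConfig.show(cg).
cg = AdminConfig.getid('/Cell:Test-Cell/CoreGroup:DefaultCoreGroup/')
AdminConfig.modify(cg, [['numCoordinators', '2']])
AdminConfig.save()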
2. Transport type
Specifies the transport mechanism to use
for communication between members of a core group.
- Channel framework: the default transport type. It uses the channel
framework service to incorporate port reusability and shared port
technology into the communication system.
- Unicast: a targeted network model that focuses on a direct
recipient for communication. This type of communication is most
suitable when the intended message is sent to a specific set of
recipients.
- Multicast: a broadcast network model. This model broadcasts
communication across the defined network, depending upon the values
that are provided for the multicast settings. Multicast is suitable
when there are many recipients for the intended message; otherwise
broadcast communication tends to overload the network with traffic
and can impact performance goals.
3. Channel chain name
Specifies the name of the channel chain if you select channel
framework for the transport type.
If you select the Multicast transport type, you also set:
- Multicast port: tells the coordinator where to scan for
transmissions. When setting this value, verify that the port is not
used by another network communication device; a conflicting port
value causes problems with your high availability manager
infrastructure.
- Multicast group IP start: the starting Internet Protocol (IP)
address of the intended communication area.
- Multicast group IP end: the ending IP address of the intended
communication area. Plan the network to accommodate scalability.
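These fields can also be set through wsadmin. In the Jython sketch
below, every attribute name (transportType, multicastPort,
multicastGroupIPStart, multicastGroupIPEnd) is an assumption
mirroring the console fields, and the port and IP range are example
values only; verify the real names with AdminConfig.show(cg) before
relying on this:

# Example only: switch DefaultCoreGroup to multicast.
# All attribute names below are assumptions mirroring the console
# fields -- check AdminConfig.show(cg) for the real names.
cg = AdminConfig.getid('/Cell:Test-Cell/CoreGroup:DefaultCoreGroup/')
AdminConfig.modify(cg, [['transportType', 'MULTICAST'],
                        ['multicastPort', '23445'],
                        ['multicastGroupIPStart', '239.0.0.1'],
                        ['multicastGroupIPEnd', '239.0.0.10']])
AdminConfig.save()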
4. Additional Properties
Core group servers
Specifies the server processes that belong to the core group. Server
processes include the deployment manager, node agents, application
servers, and cluster members. You can use the panel that displays to
move server processes to a different core group.
Policies
Use this panel to define the policies that determine which members of
a high availability group are made active.
Preferred coordinator servers
Specifies which core group servers are preferred coordinator servers.
Core Group policies:
Servers > Core groups > Core group settings > New or
existing core group > Policies.
Policy types

All active: The All active policy indicates that the high
availability manager keeps all of the application components that are
running on all of the servers in the high availability group active
at all times.

M of N: The M of N policy is similar to the One of N policy. However,
it enables you to specify the number (M) of high availability group
members that you want to keep active if it is possible to do so. The
number of active members must be greater than one and less than or
equal to the number of servers in the high availability group. If the
number of active servers is set to one, this policy is a match for
the One of N policy.

No operation: The No operation policy indicates that no high
availability group members are made active.

One of N: The One of N policy keeps one member of the high
availability group active at all times. It is used by groups that
require singleton failover. If a failure occurs, the high
availability manager starts the singleton on another server.

Static: The Static policy allows you to statically define or
configure the active members of the high availability group.
Match Criteria
Specifies one or more name-value pairs that are used to associate
this policy with a high availability group. These pairs must match
attributes that are contained in the name of a high availability group
before this policy is associated with that group.
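In other words, a policy applies to a group when every pair in its
match criteria appears in the group's name (when several policies
match, the one matching the most pairs is chosen). A conceptual
plain-Python sketch with made-up data:

# Conceptual sketch of match criteria (not actual HA manager code).
def matches(match_criteria, group_name):
    # Every name-value pair in the policy's match criteria must
    # also be present in the high availability group's name.
    return all(group_name.get(k) == v for k, v in match_criteria.items())

# Hypothetical group name and policy criteria for illustration.
group_name = {"type": "WSAF_SIB", "WSAF_SIB_BUS": "MyBus"}
print(matches({"type": "WSAF_SIB"}, group_name))          # True
print(matches({"type": "WSAF_TRANSACTION"}, group_name))  # False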
Is alive timer
Specifies, in seconds, the interval at which the high availability
manager checks the health of the active group members that are
governed by this policy. If a group member has failed, the server on
which the group member resides is restarted.
Quorum
Specifies whether quorum checking is enabled for a group governed by
this policy. Quorum is a mechanism that can be used to protect resources
that are shared across members of the group in the event of a failure.
The quorum mechanism is designed to work in conjunction with a hardware
control facility that allows application servers to be shut down if a
failure causes the group to be partitioned.
Note: the Quorum setting in the policy has an effect only if the
following items are true:
* The group members are also cluster members.
* GroupName.WAS_CLUSTER=clustername must be specified as a property in
the group name of any high availability group matching this policy.
Fail back
Specifies whether work items assigned to the failing server are
moved to the server that is designated as the most preferred server
for the group if a failure occurs. This field applies only to the
M of N and One of N policies.
Preferred servers only
Specifies whether group members are activated only on servers that
are on the list of preferred servers for this group. This field
applies only to the M of N and One of N policies.
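A conceptual plain-Python sketch (made-up names, not WebSphere code)
of how these two flags drive One of N activation:

# Conceptual sketch of One of N activation (not actual HA manager code).
def choose_active(online, preferred, preferred_only, fail_back,
                  current=None):
    # Keep the current member when it is healthy and fail back is off.
    if current in online and not fail_back:
        return current
    # Otherwise activate the most preferred server that is online.
    for server in preferred:
        if server in online:
            return server
    if preferred_only:
        return None                      # nothing gets activated
    if current in online:
        return current
    return min(online) if online else None

# server01 is down, so the next preferred member runs the singleton.
print(choose_active({"server02"}, ["server01", "server02"], False, True))
# server01 comes back; fail back moves the singleton to it.
print(choose_active({"server01", "server02"}, ["server01", "server02"],
                    False, True, current="server02"))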
Core group servers:
Use this to move servers into a different core group. All members of a
cluster must be in the same core group. If you select one or more
members of a cluster, all of the members of that cluster must be moved.
Preferred coordinator servers:
Use Add and Remove to move servers into and out of the list of
preferred servers. Use Move up and Move down to adjust the order within
the list of preferred servers. Make sure that the most preferred server
is at the top of the list and the least preferred server is at the
bottom.
Core group member failure detection
The HA manager monitors all core group members. It uses two
mechanisms to detect failures:
1. Active failure detection
If heartbeats from a JVM fail for a specified interval of time, the
JVM is marked as failed. With the default settings, heartbeats are
sent every 10 seconds, and 20 consecutive heartbeats (200 seconds)
must be missed before the JVM is marked as failed. When a JVM is
marked as failed, a new view is installed, which you can see in the
SystemOut log (this rule is sketched after the note below).
2. TCP keep-alive
If one member cannot contact another member and gets a closed-socket
error, it signals the other members to treat that member as failed.
For example, if a JVM panics or a network problem occurs, the failure
is detected as soon as the TCP settings allow.
Note: the TCP keep-alive settings belong to the operating system, not
to WebSphere.
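The active-failure-detection rule from item 1 boils down to the
following plain-Python sketch (conceptual only; the constants are the
defaults described above):

import time

# Conceptual sketch of active failure detection (not HA manager code).
HEARTBEAT_INTERVAL = 10   # seconds between heartbeats (default)
MISSED_THRESHOLD = 20     # missed heartbeats before declaring failure

def is_failed(last_heartbeat, now=None):
    # A member is marked failed once 20 consecutive heartbeats
    # (200 seconds) have gone unanswered.
    if now is None:
        now = time.time()
    return now - last_heartbeat > HEARTBEAT_INTERVAL * MISSED_THRESHOLD

# A member last heard from 201 seconds ago is declared failed, which
# triggers a new view (the DCSV8050I message shown earlier).
print(is_failed(last_heartbeat=0, now=201))  # True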
For more about DCS and finding which core group member crashed or
stopped, see here.