HA (high availability) Fundamentals for TeamForge

How to prevent disruption when hardware failure occurs

As a consultant for CollabNet I have been asked on many occasions to recommend a high availability solution for TeamForge. There is nothing unique about the architecture of TeamForge that would prevent us from taking a typical approach to redundancy. The frontend is Apache HTTPD serving a JBoss mid-tier with either PostgreSQL (default) or Oracle as the database.

A typical approach taken by countless HA vendors is to clone and duplicate the infrastructure, install the application and data on shared storage, and to configure the solution in such a way that the identity of the system can transition from one system to another. On the latter point, when transitioning an application from one system to another the following resources should be considered vital. This collection is typically called a Service or Resource group and can consist of the following:

  • Primary IP address, which “floats” to the active system. Be aware of the impact on ARP caches, which may have to be flushed. Sometimes the TTL (Time to Live) on ARP entries can be lowered to minimize the impact caused by IP-to-MAC address changes (see the gratuitous ARP example after this list)
  • Shared Storage. The application and application data. The underlying infrastructure could be SAN or iSCSI
  • Services. Start and stop scripts for the application
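One common mitigation for stale ARP caches, which a failover script could run immediately after bringing up the floating IP, is to broadcast a gratuitous ARP so that neighbours refresh their entries. A minimal sketch using arping from iputils (the interface and address match the environment described below):

# broadcast 3 gratuitous ARP replies for the floating IP from eth0

arping -U -c 3 -I eth0 10.168.1.190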

Figure (1): A generic solution (image: Teamforge_failover_generic)

The following are instructions on implementing a “generic” approach to High Availability

Scope

This is a guide on creating a primary and secondary failover for TeamForge. This document was written based on a two-node environment, i.e. a single TeamForge system that hosts the application, database and data (TeamForge + SCM).

Environment consists of

2 x CentOS 5.4, each machine has 2 network interfaces (NICs)

(1 NIC will be used for the primary IP and the 2nd NIC will be used for the service address)

NODE1: Primary IP (onboot=no): 10.168.1.190

NODE1: Secondary IP (onboot=yes): 10.168.1.191

NODE2: Primary IP (onboot=no): 10.168.1.190

NODE2: Secondary IP (onboot=yes): 10.168.1.194

1 iSCSI target used as shared storage

A fundamental rule

This is a basic primary / secondary failover that uses a shared ext3 filesystem for the TeamForge application and data. Ext3 is not a clustered filesystem and should only ever be mounted on one system at a time, or corruption will occur

What needs doing on both nodes

Install and configure the iSCSI tools/utils

(1) install iscsi tools

yum install iscsi-initiator-utils

(2) start services

/etc/init.d/iscsi start

(3) discover target

[root@node1 ~]# iscsiadm -m discovery -t sendtargets -p 10.168.1.155

10.168.1.155:3260,1 iqn.2004-04.com.qnap:ts-219:iscsi.purgatory.8ce5a6

(4) restart services

/etc/init.d/iscsi restart
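As an alternative to restarting the service, the discovered target can be logged in to explicitly with iscsiadm, using the target name returned by the discovery above:

iscsiadm -m node -T iqn.2004-04.com.qnap:ts-219:iscsi.purgatory.8ce5a6 -p 10.168.1.155 --login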

(5) get details of new disk device

fdisk -l

dmesg

On node1 only

fdisk /dev/<new disk device>

and create a partition as appropriate
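For reference, a minimal fdisk session that creates a single primary partition spanning the whole disk looks roughly like this (assuming the new device appeared as /dev/sda; substitute your own device):

fdisk /dev/sda

n        (new partition)

p        (primary)

1        (partition number 1)

<Enter>  (accept default first cylinder)

<Enter>  (accept default last cylinder)

w        (write partition table and exit)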

mkfs -t ext3 /dev/<new partition>

then create the mount point (if it does not already exist) and mount the filesystem

mkdir -p /media/iscsi

mount -t ext3 /dev/sda1 /media/iscsi

You have now created an ext3 filesystem on an iSCSI device and have it mounted

Time to set up networking

[root@node1 network-scripts]# more ifcfg-eth0

# Realtek Semiconductor Co., Ltd. RTL-8139/8139C/8139C+

DEVICE=eth0

BOOTPROTO=none

BROADCAST=10.168.1.255

HWADDR=54:52:00:06:A6:72

IPADDR=10.168.1.190

NETMASK=255.255.255.0

NETWORK=10.168.1.0

ONBOOT=no

GATEWAY=10.168.1.1

TYPE=Ethernet

VERY IMPORTANT! Set “ONBOOT=no”

This is so the primary IP does not activate at boot time and is only active on the primary node when required

[root@node1 network-scripts]# more ifcfg-eth1

# Realtek Semiconductor Co., Ltd. RTL-8139/8139C/8139C+

DEVICE=eth1

BOOTPROTO=none

ONBOOT=yes

HWADDR=54:52:00:45:96:16

NETMASK=255.255.255.0

IPADDR=10.168.1.191

GATEWAY=10.168.1.1

TYPE=Ethernet

IMPORTANT: this is the service address, i.e. the address used for SSH and remote access


On both NODES

1) Add to /etc/fstab

“/dev/sda1       /media/iscsi                    ext3    noauto          0 0”

This will allow the storage to be mounted on /media/iscsi, but it will not auto-mount or fsck at boot, which is important in the case of a SAN
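With the noauto entry in place, you can verify that mounting by mount point alone works (run this on a node that does not currently have the filesystem mounted, and unmount again afterwards):

mount /media/iscsi

mount | grep /media/iscsi

umount /media/iscsi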

2) Make sure that your nodes are aware of each other, and that /etc/hosts is set accordingly for the TeamForge installation

[root@node2 conf]# more /etc/hosts

# Do not remove the following line, or various programs

# that require network functionality will fail.

127.0.0.1               localhost.localdomain localhost

::1             localhost6.localdomain6 localhost6

10.168.1.194    node2.e-securenetworks.net node2

10.168.1.190    node1.e-securenetworks.net node1

On node2 only

[root@node2 network-scripts]# more ifcfg-eth0

# Realtek Semiconductor Co., Ltd. RTL-8139/8139C/8139C+

DEVICE=eth0

BOOTPROTO=none

BROADCAST=10.168.1.255

HWADDR=54:52:00:50:FE:83

IPADDR=10.168.1.190

NETMASK=255.255.255.0

NETWORK=10.168.1.0

ONBOOT=no

GATEWAY=10.168.1.1

TYPE=Ethernet

[root@node2 network-scripts]# more ifcfg-eth1

# Realtek Semiconductor Co., Ltd. RTL-8139/8139C/8139C+

DEVICE=eth1

BOOTPROTO=none

ONBOOT=yes

HWADDR=54:52:00:3b:51:d0

NETMASK=255.255.255.0

IPADDR=10.168.1.194

GATEWAY=10.168.1.1

TYPE=Ethernet

BACK ON NODE1

1) Install CTF

I used this in site-options.conf, so that the identity is never confused

SITE_DIR=/opt/collabnet/teamforge/

HOST_localhost=app database cvs subversion

NODE_NAME=localhost

DOMAIN_localhost=node1.e-securenetworks.net

2) Start TeamForge to make sure all is working, then import/create test data, such as users and source code; this will be used for later testing

service collabnet start all

3) Stop all collabnet services once all is good

service collabnet stop all

4) We do not want the collabnet service to start at boot time; it will be controlled by the node start/stop scripts

chkconfig --level 345 collabnet off
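You can confirm the service is disabled for the relevant runlevels with:

chkconfig --list collabnet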

5) Time to move all relevant CTF data to shared storage!

mount shared storage on node1

mount /media/iscsi/

6) I used symbolic links to map to shared storage

[root@node1 opt]# mv /opt/collabnet <to shared storage>

[root@node1 opt]# ln -s /media/iscsi/collabnet/ collabnet

7) Do the same for other data

[root@node1 /]# mv sf-svnroot svnroot cvsroot /media/iscsi/

[root@node1 /]# ln -s /media/iscsi/svnroot/ /svnroot

[root@node1 /]# ln -s /media/iscsi/cvsroot/ /cvsroot

[root@node1 /]# ln -s /media/iscsi/sf-svnroot/ /sf-svnroot

[root@node1 /]# chown sf-admin.sf-admin /cvsroot /sf-svnroot/

[root@node1 /]# chown apache.apache /svnroot/

[root@node1 lib]# mv pgsql /media/iscsi/

[root@node1 lib]# ln -s /media/iscsi/pgsql/ /var/lib/pgsql
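At this point a quick sanity check confirms that every relocated path is a symbolic link into shared storage (paths as used above); each entry should point into /media/iscsi/:

ls -ld /opt/collabnet /svnroot /cvsroot /sf-svnroot /var/lib/pgsql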

8) service collabnet start all

Check all works. Check that your test data is intact

9) Now clean up node1 so we can configure node2

stop all services and umount data

[root@node1 ~]# service collabnet stop all

[root@node1 ~]# umount /media/iscsi/

[root@node1 ~]# ifdown eth0

shut down the primary address


CONFIGURE NODE2

1) ifup eth0

this will bring up eth0 with the primary address (10.168.1.190)

2) Install TeamForge on node2

#------------------------

# Required configuration

#------------------------

SITE_DIR=/opt/collabnet/teamforge/

HOST_localhost=app database cvs subversion

NODE_NAME=localhost

DOMAIN_localhost=node1.e-securenetworks.net

3) Start TeamForge to test

service collabnet start

Make sure TeamForge is functioning as expected.

4) Stop TeamForge

service collabnet stop

5) Mount shared storage on node2

mount /media/iscsi/

6) Now move original CTF directories out of the way and create symbolic links to shared storage

(in /opt)

[root@node2 opt]# mv collabnet collabnet.orig

[root@node2 opt]# ln -s /media/iscsi/collabnet/ collabnet

[root@node2 opt]# ls -l

total 12

lrwxrwxrwx 1 root root   23 Oct 14 14:43 collabnet -> /media/iscsi/collabnet/

drwxr-xr-x 5 root root 4096 Oct 14 14:28 collabnet.orig

[root@node2 /]# mv sf-svnroot sf-svnroot.orig

[root@node2 /]# mv  svnroot svnroot.orig

[root@node2 /]# mv cvsroot cvsroot.orig

[root@node2 /]# ln -s /media/iscsi/sf-svnroot/ sf-svnroot

[root@node2 /]# ln -s /media/iscsi/svnroot/ svnroot

[root@node2 /]# ln -s /media/iscsi/cvsroot cvsroot

[root@node2 lib]# mv pgsql/ pgsql.orig

[root@node2 lib]# ln -s /media/iscsi/pgsql/ pgsql

7) start CTF and test

[root@node2 lib]# service collabnet start all

Service Update: 2010-10-14 15:15:00 EDT —

Starting all services “localhost”

apache (subversion app) (localhost:80) (ext)                                     OK

pgsql (database) (localhost:5432) (ext)                                          OK

jboss (app) (localhost:8080)                                                     ………OK

tomcat (subversion cvs) (localhost:7080)                                         OK

You should see the test data that we created on node1!

Time to deactivate node2

1) stop all collabnet services

service collabnet stop all

2) umount shared disk

umount /media/iscsi/

3) disable primary IP

ifdown eth0

Move all services back to node1 (primary)

Back on node 1

1) Bring up primary address

[root@node1 network-scripts]# ifup eth0

[root@node1 network-scripts]# ifconfig

eth0      Link encap:Ethernet  HWaddr 54:52:00:06:A6:72

inet addr:10.168.1.190  Bcast:10.168.1.255  Mask:255.255.255.0

inet6 addr: fe80::5652:ff:fe06:a672/64 Scope:Link

UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1

RX packets:1002352 errors:0 dropped:554 overruns:0 frame:0

TX packets:1843171 errors:0 dropped:0 overruns:0 carrier:0

collisions:0 txqueuelen:1000

RX bytes:370119813 (352.9 MiB)  TX bytes:2125282827 (1.9 GiB)

Interrupt:11 Base address:0x4000

2) mount shared disk

mount /media/iscsi

3) start services

service collabnet start all

We now need to automate the node up and node down procedures.

Add shell scripts to both nodes. To bring CTF up on node1, execute the start script (“node_start.sh”) on node1. If TeamForge is required on node2, execute the stop script (“node_stop.sh”) on node1 and the start script on node2

An example node start script; it needs way more error checks and logic before it is ready for prime time!

[root@node1 bin]# more node_start.sh

#!/bin/bash

set -x

# Bring up primary interface

ifup eth0

# mount share disk

mount /media/iscsi/

# start collabnet services

/etc/init.d/collabnet start all

Essentially there are only three steps in activating the primary node: (1) bring up the primary IP, (2) mount the data, (3) start the collabnet services

* Additional error checking would include (a) confirming that the shared data is not mounted on the other system (through SSH), as sketched below
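For instance, the start script could refuse to proceed if the peer still has the shared filesystem mounted. A minimal sketch, assuming passwordless SSH between the nodes and that the peer hostname is node2 (adjust per node):

#!/bin/bash

PEER=node2

# abort if the peer still has the shared filesystem mounted

if ssh "$PEER" "mount | grep -q /media/iscsi"; then

    echo "ERROR: /media/iscsi is still mounted on $PEER - aborting" >&2

    exit 1

fi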

A stop script

[root@node1 bin]# more node_stop.sh

#!/bin/bash

set -x

#stop all services

/etc/init.d/collabnet stop all

#

# Add additional error checking to determine that services are down

# service collabnet status all

# ps to check no java processes

# fuser on shared disk

#

#umount shared drive

umount /media/iscsi/

# shutdown eth0 primary interface

ifdown eth0

The stop script is the reverse of the start script.
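The checks sketched in the comments of the stop script could be fleshed out roughly like this (an illustration, using the service names and paths from this guide):

# confirm the collabnet services report down

service collabnet status all

# make sure no stray java (JBoss/Tomcat) processes remain

if pgrep -f java > /dev/null; then

    echo "WARNING: java processes are still running" >&2

fi

# list any processes still holding the shared filesystem open

fuser -vm /media/iscsi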

Summary and conclusion

There is obvious room here for improvement and I would enjoy the feedback!

Improvements would include (1) utilization of an HA cluster manager to automate the failover when the primary node is down, (2) utilization of disk fencing to protect data from being corrupted, (3) the use of a clustered filesystem, and (4) replication of the shared storage to a remote location

I hope to hear suggested improvements on these instructions, but they do represent the basis of a failover solution for TeamForge.
