Friday, May 28, 2010

Open-Source and Free Software Tools

1. Website Vulnerabilities and Nikto

Nikto is an open-source web server vulnerability scanner that performs comprehensive tests for over 6,100 potentially dangerous files/CGIs, checks for outdated versions of over 950 servers, and tests for version-specific problems on over 260 servers. This article outlines a scenario where Nikto is used to test a company's web server for vulnerabilities...

2. Bandwidth Throttling with NetEM

NetEM (Network Emulation) provides functionality for testing protocols by emulating the properties of wide-area networks. This article describes the use of NetEM in benchmarking the performance of a web application, simulating Internet-like speeds on a LAN.

How-To

#############How To Install GUI On Ubuntu Server################

1. First of all, make sure the server has a working network configuration. Run ifconfig and check for a valid IP address (like 192.168.0.101) and subnet mask (like 255.255.255.0); also confirm that the gateway (like 192.168.0.1) and DNS server (like 192.168.0.1) are set. When all these fields are valid, try to ping google.com. The output should show the name google.com being resolved to an IP address, followed by replies indicating whether each ping succeeded. If any part of this is not working, consult the Ubuntu Server networking configuration docs and troubleshoot.
2. The next step is to update the package lists. Right after installation they are most likely out of date. To do this, run sudo aptitude update and wait for it to finish completely.
3. The next step is to actually install the GNOME environment. This step is the most time-consuming; give yourself up to an hour for it to run. The easiest way is to run sudo aptitude install ubuntu-desktop and wait.
4. When the install finishes, reboot the system using sudo reboot. The server should load the GUI on reboot. If not, run sudo /etc/init.d/gdm start to bring up the GNOME interface.


########How to use rsync for transferring files under Linux or UNIX#######

rsync is a free software utility for Unix-like systems (including Linux) that synchronizes files and directories from one location to another while minimizing data transfer, using delta encoding when appropriate. An important feature of rsync not found in most similar programs/protocols is that the mirroring takes place with only one transmission in each direction.

So what is unique about rsync?
It can perform differential uploads and downloads (synchronization) of files across the network, transferring only data that has changed. The rsync remote-update protocol allows rsync to transfer just the differences between two sets of files across the network connection.

Always use rsync over ssh
Since rsync does not provide any security while transferring data, it is recommended that you run rsync over ssh. This gives you a secure remote connection. Now let us see some examples of rsync.

rsync command common options

* --delete : delete files on the receiving side that don't exist on the sender
* -v : verbose (try -vv for even more detailed information)
* -e "ssh options" : specify ssh as the remote shell
* -a : archive mode (preserves permissions, ownership, timestamps, symlinks, etc.)
* -r : recurse into directories
* -z : compress file data

Task : Copy file from a local computer to a remote server
Copy file from /www/backup.tar.gz to a remote server called openbsd.nixcraft.in
$ rsync -v -e ssh /www/backup.tar.gz jerry@openbsd.nixcraft.in:~

Task : Copy file from a remote server to a local computer

Copy file /home/jerry/webroot.txt from a remote server openbsd.nixcraft.in to a local computer /tmp directory:
$ rsync -v -e ssh jerry@openbsd.nixcraft.in:~/webroot.txt /tmp

Task: Synchronize a local directory with a remote directory
$ rsync -r -a -v -e "ssh -l jerry" --delete openbsd.nixcraft.in:/webroot/ /local/webroot

Task: Synchronize a remote directory with a local directory
$ rsync -r -a -v -e "ssh -l jerry" --delete /local/webroot openbsd.nixcraft.in:/webroot

Note:
1. Normally, the -a option can be used to mirror files faithfully. However, if the target filesystem does not support permissions or ownership, a reduced set of options should be used to avoid warnings from rsync (see the sketch below).
2. Note! Do not forget the trailing slashes after the directory names, especially when using --delete; getting them wrong can cause data on the destination to be deleted unintentionally (see the example below).
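For illustration (a sketch, not the only workaround): -a is shorthand for -rlptgoD, so on a target filesystem that cannot store Unix permissions or ownership (a FAT-formatted USB disk, for example) you can drop the -p, -o and -g flags:
$ rsync -rltDvz -e ssh /local/webroot/ jerry@openbsd.nixcraft.in:/webroot/
The trailing slash decides whether the directory itself or only its contents get copied (the host name backup below is just a placeholder):
$ rsync -av /local/webroot backup:/srv/    # creates /srv/webroot on the destination
$ rsync -av /local/webroot/ backup:/srv/   # copies the contents of webroot into /srv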

Tips & Tricks

Listing running processes
You can list every process on your system, including its memory and CPU usage, using any of the following commands:
ps -ef
ps -e
ps -eF
ps -ely
List information for particular PIDs:
ps -p 1,2
List paths that a given PID has open:
lsof -p PID
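If you want the process list ordered by resource usage, procps ps can sort its output for you (an illustrative sketch):
ps aux --sort=-%mem | head -n 10    # top 10 processes by memory usage
ps aux --sort=-%cpu | head -n 10    # top 10 processes by CPU usage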

DNS cache cleaning
There is a simple command to quickly clean the DNS cache on Linux. Make sure you have the nscd tool installed and running in the background as a daemon. In order to clear the DNS cache, simply restart the daemon as follows (you need root privileges):
/etc/init.d/nscd restart

Live Interrupts Details
To watch the live interrupt changes in your system run:
watch -d 'cat /proc/interrupts'

Access a Windows Share from Bash
Ever wanted to access a Windows share from your terminal? Well, using mount and cifs/samba, this is possible. Make sure you have smbfs/cifs support. We need to make a directory on our hard disk where we can mount the Windows share.
mkdir /mnt/location
We are now ready to mount the filesystem on our newly created directory /mnt/location.
To mount using cifs, use the following code:
mount -t cifs //server-ip-or-name/share /mnt/location -o username=user,password=pass,domain=DOMAIN
When we're done working on the share, we should exit the directory or close any programs that are accessing it, and then unmount the Windows share using the following commands:
cd /
umount /mnt/location
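If you want the share mounted automatically at boot, an equivalent /etc/fstab entry can be used (a sketch; keeping the password in a root-only credentials file keeps it out of fstab):
//server-ip-or-name/share  /mnt/location  cifs  credentials=/root/.smbcredentials,iocharset=utf8  0  0
where /root/.smbcredentials contains:
username=user
password=pass
domain=DOMAIN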

Data Recovery in Ubuntu
Install ddrescue tools:
sudo apt-get install ddrescue
Connect the failed disk to your system and create an image of it with ddrescue (see the sketch below). This can take a while...
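A minimal sketch of the imaging step, assuming the failed disk shows up as /dev/sdb and GNU ddrescue is installed (on Ubuntu the GNU version is typically packaged as gddrescue; the log file lets you resume an interrupted copy):
sudo ddrescue -r 3 /dev/sdb disk-image.img rescue.log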
We can now mount this image on our system and take a look at the files:
mount -t ext3 -o loop disk-image.img /mnt/tmp

How to install VMware Player / VMware Workstation on Ubuntu
sudo apt-get install build-essential linux-headers-$(uname -r)
sudo chmod +x VMware-Player*.bundle
sudo chmod +x VMware-Workstation-7.0.0-203739_i386-NoTools.bundle
sudo sh VMware-Workstation-7.0.0-203739_i386-NoTools.bundle

Tunnel your SSH connection via intermediate host
$ ssh -t reachable_host ssh unreachable_host
This one-liner creates an ssh connection to unreachable_host via reachable_host. It does this by executing ssh unreachable_host on reachable_host. The -t forces ssh to allocate a pseudo-tty, which is necessary for working interactively in the second ssh to unreachable_host.
This one-liner can be generalized; you can tunnel through an arbitrary number of ssh servers:
$ ssh -t host1 ssh -t host2 ssh -t host3 ssh -t host4 ...
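If you hop through the same intermediate host often, the jump can be configured permanently in ~/.ssh/config, so that a plain ssh unreachable_host does the tunnelling for you (a sketch using the classic netcat-based ProxyCommand; newer OpenSSH releases can use ssh -W instead):
Host unreachable_host
    ProxyCommand ssh reachable_host nc %h %p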

Clear the terminal screen
$ CTRL+l

Hear when the machine comes back online
$ ping -a IP
Ever had a situation where you need to know when a system comes back up after a reboot? Up until now, you probably launched ping and either watched the timeouts until the system came back, or left it running and occasionally checked its output to see if the host was up. That is unnecessary: make the ping audible with -a. As soon as the host at IP is back, ping will beep!

Shutdown a Windows machine Remotely
$ net rpc shutdown -I IP_ADDRESS -U username%password

mtr - traceroute and ping combined
$ mtr google.com
MTR, better known as "Matt's Traceroute", combines the traceroute and ping commands. After each successful hop, it sends a ping request to the discovered machine; this way it produces the output of both traceroute and ping, giving you a better picture of link quality. If it finds that a packet took an alternative route, it displays it, and by default it keeps updating the statistics so you can see what is going on in real time.

Copy your public-key to remote-machine for public-key authentication
$ ssh-copy-id remote-machine
This one-liner copies your public key, which you generated with ssh-keygen (either the SSHv1 file identity.pub or the SSHv2 file id_rsa.pub), to remote-machine and appends it to the ~/.ssh/authorized_keys file there. This ensures that the next time you try to log in to that machine, public-key authentication (commonly referred to as "passwordless authentication") will be used instead of regular password authentication.
If you wished to do it yourself, you'd have to take the following steps:
your-machine$ scp ~/.ssh/identity.pub remote-machine:
your-machine$ ssh remote-machine
remote-machine$ cat identity.pub >> ~/.ssh/authorized_keys
This one-liner saves a great deal of typing. Actually, I just found out that there is a shorter way to do it:
your-machine$ ssh remote-machine 'cat >> .ssh/authorized_keys' < .ssh/identity.pub


Linux: Recovering deleted /etc/shadow password file

Sometimes we may delete the /etc/shadow file by accident. If you boot into single-user mode, the system will ask for the root password for maintenance; now imagine you do not have a backup of /etc/shadow. How do you fix such a problem in a production environment where time is a critical factor? Below is an explanation of how to recover a deleted /etc/shadow file in five easy steps. It should take around 10 minutes to fix the problem.

Boot server into single user mode

1) Reboot server

2) Next, you will see the GRUB boot-loader screen. Select the recovery-mode entry for the kernel version you wish to boot and type e for edit. Then select the line that starts with kernel and type e to edit the line.

3) Go to the end of the line and add init=/bin/bash as a separate word (press the spacebar and then type init=/bin/bash). Press Enter to exit edit mode.
init=/bin/bash

4) Back at the GRUB screen, type b to boot into single-user mode. This causes the system to boot the kernel and run /bin/bash instead of its standard init, which gives us root privileges (without a password) and a root shell.


 Make sure you can access system partition(s)

1) Mount partitions in read write mode
Since / is currently mounted read-only and many disk partitions have not been mounted yet, you must do the following to have a reasonably functioning system.
# mount -o remount,rw /
Do not forget to (re)mount the rest of your partitions in read/write (rw) mode as well, such as /usr and /var (if any).


 Rebuild /etc/shadow file from /etc/passwd

1) You need to use the pwconv command; it creates /etc/shadow from /etc/passwd and from an optionally existing /etc/shadow.
# pwconv

2) Use passwd command to change root user password:
# passwd

Note that you will need to type the same password twice with the passwd command. If you have an admin account, set a password for that account too. On most production servers, direct root login is disabled; in our situation, admin was the only account allowed to use the su and sudo commands.
# passwd admin

3) Now root and admin accounts are ready to go in multi-user mode. Reboot the system in full multiuser mode:
# sync
# reboot

Note:
    * Sometimes the /etc/shadow- backup file can be used to replace /etc/shadow
    * If you have a backup of /etc/shadow on tape or CD-ROM, you can simply copy that /etc/shadow file back
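For example, restoring from the automatic backup copy (if it exists) is a one-liner:
# cp /etc/shadow- /etc/shadow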

Next step: block all non-root logins

Block all non-root (normal) users until all password-related problems are fixed. Since the rest of the accounts do not have passwords yet, it is necessary to prevent non-root users from logging in to the system. Create the /etc/nologin file; it will allow access only to root. Other users will be shown the contents of this file and their logins will be denied (refused).

1) Login as root user (terminal login only)

2) Create /etc/nologin file
cat > /etc/nologin
System is down due to temporary problem. We will restore your access
within 30 minutes time. If you have any questions please contact tech
support at XXX-XXXX or techsupport@mycorp.com
(Press CTRL+D to finish writing the file.)

Update all user passwords in batch mode

1) Create a random password for each non-root user using the chpasswd utility, which updates passwords in batch mode. chpasswd reads a list of user name and password pairs from a file (or standard input) and uses this information to update a group of existing users. Each line is of the format:

user_name:password

Remember that by default the supplied password must be in clear-text format. This command is intended for large environments where many accounts are created at the same time, or for an emergency like this one. First, we need to find all non-root accounts using awk:
awk -F: '{ if ($3 >= 1000) print $1 }' /etc/passwd > /root/tmp.pass

Make sure /root/tmp.pass file contains non-root usernames only.
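A sketch of the actual batch update: build user_name:password pairs from that list and pipe them to chpasswd (openssl is used here merely as one convenient way to generate random strings; keep the generated list somewhere safe so you can hand the new passwords out later):
while read user; do
  echo "$user:$(openssl rand -base64 8)"
done < /root/tmp.pass > /root/newpass.list
chpasswd < /root/newpass.list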

#############################################################################
How to figure out Differences between files and folders


Diff between folders: diff --brief --recursive
Diff between files:   diff
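For example, to get a quick list of files that differ between two directory trees and then inspect one of them in detail (the paths here are just placeholders):
diff --brief --recursive /etc/apache2 /backup/etc/apache2
diff /etc/apache2/apache2.conf /backup/etc/apache2/apache2.conf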

## Iptables Internet Access to Private Network ###

On Mgmt Server
iptables -t nat -A POSTROUTING -o bond1 -j MASQUERADE
echo "1"> /proc/sys/net/ipv4/ip_forward

On Compute Node
route add default gw 10.10.20.254
cat /etc/resolv.conf
nameserver 10.10.20.254
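A slightly fuller sketch of the management-server side, in case forwarding is filtered or you want the setting to survive a reboot (this assumes the compute nodes sit in 10.10.20.0/24, matching the gateway address above; adjust to your network):
iptables -t nat -A POSTROUTING -o bond1 -j MASQUERADE
iptables -A FORWARD -s 10.10.20.0/24 -j ACCEPT
iptables -A FORWARD -d 10.10.20.0/24 -m state --state ESTABLISHED,RELATED -j ACCEPT
sysctl -w net.ipv4.ip_forward=1    # or set net.ipv4.ip_forward=1 in /etc/sysctl.conf
Note that pointing the compute node's resolv.conf at 10.10.20.254 assumes the management server is also running a DNS forwarder (dnsmasq, for example); otherwise use a reachable DNS server's address instead.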



Thursday, May 20, 2010

Provide Internet to a Server behind firewall/NAT & Proxy Server in One Command

Provide Internet to a Server behind firewall/NAT



Provide Internet to a Server (Destination) behind firewall/NAT == SUCCESS
(Note: Don't use port 80. Apache may be running on the Destination server and cause conflicts.)
Requirement == Provide internet access to a server (Destination) which is behind a firewall/NAT.

Only mode of access to Destination is SSH to its public IP ==> NAT ==> private IP (Destination)

On the proxy server (Source):
apt-get install tinyproxy    (tinyproxy listens on port 8888 by default)
ssh username@PublicIP-of-Destination -R 8888:127.0.0.1:8888

Keep this shell open/running.

Log in to the Destination server:

a)
root@Node:~# diff /etc/wgetrc /etc/wgetrc_orig
< http_proxy = http://127.0.0.1:8888/
< use_proxy = on
---
> #use_proxy = on

b)
tail /etc/bash.bashrc
export http_proxy=http://127.0.0.1:8888/

c)
root@Node:~# more /etc/apt/apt.conf
ACQUIRE
{
http::proxy "http://127.0.0.1:8888/";
}

I hope you enjoyed this!!! The node should now be able to reach the Internet via the SSH tunnel.


Proxy with SSH


This article will be interesting for those who didn't know it already -- you can turn any Linux computer into a SOCKS5 (and SOCKS4) proxy in just one command:

ssh -N -D 0.0.0.0:1080 localhost

And it doesn't require root privileges. The ssh command starts dynamic (-D) port forwarding on port 1080 and talks to clients via the SOCKS5 or SOCKS4 protocol, just like a regular SOCKS proxy would! The -N option makes sure ssh stays idle and doesn't execute any commands on localhost.
If you also wish the command to go into background as a daemon, then add -f option:

#ssh -f -N -D 0.0.0.0:1080 localhost

To use it, just make your software use a SOCKS5 proxy at your Linux computer's IP, port 1080, and you're done; all your requests now get proxied.
Access control can be implemented via iptables. For example, to allow only people from the ip 1.2.3.4 to use the SOCKS5 proxy, add the following iptables rules:

#iptables -A INPUT --src 1.2.3.4 -p tcp --dport 1080 -j ACCEPT
#iptables -A INPUT -p tcp --dport 1080 -j REJECT

The first rule says, allow anyone from 1.2.3.4 to connect to port 1080, and the other rule says, deny everyone else from connecting to port 1080.

Another possibility is to use another computer instead of your own as exit node. What I mean is you can do the following:

#ssh -f -N -D 1080 other_computer.com

This will set up a SOCKS5 proxy on localhost:1080 but when you use it, ssh will automatically tunnel your requests (encrypted) via other_computer.com. This way you can hide what you're doing on the Internet from anyone who might be sniffing your link. They will see that you're doing something but the traffic will be encrypted so they won't be able to tell what you're doing.

SSH Port Forwarding

ssh -L Local_port:Host_IP:Host_port username@IP
If you add the -g switch, hosts other than localhost (for example, other machines on your subnet) are also allowed to connect to the forwarded port.
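For example, to reach the web interface of an internal host 192.168.1.10 through a gateway you can SSH to (the names and addresses here are placeholders):
ssh -L 8080:192.168.1.10:80 username@gateway.example.com
Then point your browser at http://localhost:8080/ on your own machine.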

Howto: Linux / UNIX setup of SSH with DSA/RSA public-key authentication (passwordless login)
 

#1 machine : your laptop called Gladiator
#2 machine : your remote server called vxdatacenter
@@@Command to type on Gladiator/Laptop@@@
- ssh-keygen -t dsa   (or -t rsa)
- make sure your key directory has safe permissions: chmod 700 ~/.ssh
@@@Copy Public key @@@
scp ~/.ssh/id_dsa.pub user@vxdatacenter:.ssh/authorized_keys
(Note: this overwrites any existing authorized_keys on the server; use ssh-copy-id or cat >> instead if you need to preserve existing keys.)
@@@ Login Test ssh/scp from your Laptop/Gladiator @@@
Make sure that on the server (vxdatacenter) the file has the right permissions: chmod 600 ~/.ssh/authorized_keys
@@@How do I log in from the client without typing the passphrase @@@
exec /usr/bin/ssh-agent $SHELL
ssh-add
Enter the passphrase once; the agent will supply it from then on, so you won't be asked for it when you log in to the remote server with the public key.


I hope you found this article informative!

Thank you for visiting my blog...

Saturday, May 15, 2010

Building a Scalable High-Availability E-Mail System with Active Directory and More

In early 2006, Marshall University laid out a plan to migrate HOBBIT (Figure 1), an HP OpenVMS cluster handling university-wide e-mail services. Plagued with increasing spam attacks, this cluster experienced severe performance degradation. Although our employee e-mail store was moved to Microsoft Exchange in recent years, e-mail routing, mailing lists and the student e-mail store (including IMAP and POP3 services) were still served by OpenVMS with about 30,000 active users. HOBBIT's e-mail software, PMDF, provided a rather limited feature set while charging a high licensing fee. A major bottleneck was discovered on its external disk storage system: the dated storage technology resulted in a limited disk I/O throughput (40MB/second at maximum) in an e-mail system doing intensive I/O operations.
To resolve the existing e-mail performance issues, we conducted brainstorming sessions, requirements analysis, product comparison and test-lab prototyping. We then came up with the design of our new e-mail system: it is named MUMAIL (Figure 2) and uses standard open-source software (Postfix, Cyrus-IMAP and MySQL) installed on Red Hat Enterprise Linux. The core system consists of front-end e-mail hub and back-end e-mail store. The front-end e-mail hub uses two Dell blade servers running Postfix on Linux. Network load balancing is configured to distribute load between them. The back-end e-mail store consists of two additional blade servers running a Cyrus-IMAP aggregation setup. Each back-end node is then attached to a different storage group on the EMC Storage Area Network (SAN). A fifth blade server is designated as a master node to store centralized user e-mail settings. Furthermore, we use LDAP and Kerberos to integrate the e-mail user identities with Windows Active Directory (AD).

Figure 3 illustrates our new e-mail system architecture and the subsystem interactions with existing services, which include Webmail, AD and the SMTP gateway. The block diagrams highlighted in red are the components to be studied in detail.



A scale-up approach built around a single large multiprocessor server was ruled out as a way of coping with the ever-increasing system load. The price of a mid-range server with four CPUs is usually much higher than the total price of three or more entry-class servers. Furthermore, a single-node architecture reduces system scalability and creates a single point of failure.
The Cyrus-IMAP package is proven to be robust and suitable in large settings. It differs from other Maildir or mbox IMAP servers in that it is intended to run as a “sealed” mailbox server—the Cyrus mailbox database is stored in parts of the filesystem that are private to the Cyrus-IMAP system. More important, a multiple server setup using Cyrus Murder aggregation is supported. It scales out the system's load by using multiple front-end IMAP proxies to direct IMAP/POP3 traffic to multiple back-end mail store nodes. Although we found other ways to scale out Cyrus-IMAP—for example, Cambridge University's pair-wise replication approach, mentioned in the Related Solutions section of this article, or using a clustered filesystem to share IMAP storage partitions between multiple servers with products like Red Hat's Global File System (GFS)—compared with the aggregation approach, these solutions either are too customized to support (the Cambridge approach) or involve extra cost (GFS is sold separately by Red Hat, Inc.).
So, the Cyrus-IMAP Aggregation approach was adopted. Figure 4 illustrates the setup: two Cyrus back-end servers were set up, and each handles half the user population. Two Postfix MTA front-end nodes are designated to serve the proxy functions. When e-mail clients connect through SMTP/IMAP/POP3 to the front-end servers, the Cyrus Proxy service will communicate with the Cyrus Master node using the MUPDATE protocol, so that it gets the information about which Cyrus back-end node stores e-mail for the current client. Furthermore, the back-end Cyrus nodes will notify the Master node about the mailbox changes (creating, deleting and renaming mailboxes or IMAP folders) in order to keep the Master updated with the most current mailbox location information. The Master node replicates these changes to the front-end proxy nodes, which direct the incoming IMAP/POP3/LMTP traffic. The MUPDATE protocol is used to transmit mailbox location changes.



Cyrus-IMAP Aggregation Setup
Although it is not a fully redundant solution (the Master node is still a single point of failure), and half our users will suffer a usage outage if either one of the back-end nodes is down, the aggregator setup divides the IMAP processing load across multiple servers with each taking 50% of the load. As a result of this division of labor, the new mail store system is now scalable to multiple servers and is capable of handling a growing user population and increasing disk usage. More back-end Cyrus nodes can join with the aggregator to scale up the system.


A large-scale implementation of a scalable Linux e-mail system with Active Directory.
Integration with Active Directory
One of the requirements of our new e-mail system is to integrate user identities with the university directory service. Because Microsoft Active Directory services have been made a standard within our centralized campus IT environment, Cyrus (IMAP/POP3) and Postfix (SMTP) are architected to obtain user authentication/authorization from AD. After the integration, all e-mail user credentials can be managed from AD. Most directory services are constructed based on LDAP. AD uses LDAP for authorization, and it has its own Kerberos implementation for authentication. The goal of integrated AD authentication is to allow the Linux e-mail servers to use AD to verify user credentials. The technology used to support the AD integration scheme is based mainly on Kerberos and LDAP support, which comes with native Linux components.



Linux Authentication and Authorization Against AD
Here is how it works. First, we use AD Kerberos to authenticate Linux clients. Pluggable Authentication Module (PAM) is configured to get the user credentials and pass them to the pam_krb5 library, which is then used to authenticate users using the Linux Kerberos client connection to the Key Distribution Center (KDC) on Active Directory. This practice eliminates the need for authentication administration on the Linux side. However, with only the Kerberos integration, Linux has to store authorization data in the local /etc/passwd file. To avoid managing a separate user authorization list, LDAP is used to retrieve user authorization information from AD. The idea is to let authorization requests be processed by the Name Service Switch (NSS) first. NSS allows the replacement of many UNIX/Linux configuration files (such as /etc/passwd, /etc/group and /etc/hosts) with a centralized database or databases, and the mechanisms used to access those databases are configurable. NSS then uses the Name Service Caching Dæmon (NSCD) to improve query performance. (NSCD is a dæmon that provides a cache for the most common name service requests.) This can be very important when used against a large AD user container. Finally, NSS_LDAP is configured to serve as an LDAP client to connect to Active Directory to retrieve the authorization data from the AD users container. (NSS_LDAP, developed by PADL, is a set of C library extensions that allow LDAP directory servers to be used as a primary source of aliases, ethers, groups, hosts, networks, protocol, users, RPCs, services and shadow passwords.) Now, with authorization and authentication completely integrated with AD using both LDAP and Kerberos, no local user credentials need to be maintained.
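To give a feel for where this is wired together on the Linux side, here is a minimal sketch of the relevant client configuration (illustrative only; exact PAM file names and option sets differ between distributions):
# /etc/nsswitch.conf -- fall back to LDAP (Active Directory) for user/group lookups
passwd: files ldap
group:  files ldap
shadow: files ldap

# PAM auth stack (e.g. /etc/pam.d/system-auth) -- try Kerberos against AD first, then local files
auth    sufficient  pam_krb5.so
auth    required    pam_unix.so try_first_pass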
In order to support LDAP authorization integration with Linux, Windows Server 2003 Release 2 (R2), which includes support for RFC 2307, is installed on each of the AD domain controllers. R2 introduces new LDAP attributes used to store UNIX or Linux user and group information. Without an extended AD LDAP schema, like the one used by R2, the automatic Linux authorization integration with AD is not possible. It is also important to mention that the SASL authentication
layer shown in Figure 3 uses Cyrus-SASL, which is distributed as a standard package by Carnegie Mellon University. The actual setup uses PAM for authenticating IMAP/POP3 users. It requires a special Cyrus dæmon, saslauthd, which the SASL library communicates with via a named (UNIX-domain) socket.
Conclusion
Our new e-mail system is mostly based on open-source software. The incorporation of Postfix, Cyrus-IMAP and MySQL helped fulfill most of the system requirements. From the hardware perspective, the technologies used, such as Storage Area Network (SAN), blade server and the Intel x86_64 CPUs, helped to meet the requirements of fast access, system scalability and high availability. However, the use of open-source software and new hardware technologies may introduce new management overhead. Although all the open-source software packages used on the new system are mature products, compared with commercial software, they typically lack a GUI for system management. Their configuration and customization are completely based on a set of plain-text configuration files. Initially, this may present a learning curve, as the syntax of these configuration files must be studied. But, once the learning curve is passed, future management easily can be automated, as scripts can be written to manage the configuration parameters and store them in a centralized location. On the hardware side, complex settings also may imply complex network and server management settings, which also may introduce overhead during system management. However, the benefits of using the technologies discussed outweigh the complexities and learning curves involved. It is easy to overcome the drawbacks through proper design, configuration management and system automation.
At the time of this writing, our new Linux e-mail system (MUMAIL) has been running in production for ten months. The entire system has been running in a stable state with minimal downtime throughout this period. All user e-mail messages originally on HOBBIT were moved successfully to MUMAIL in a three-day migration window with automated and non-disruptive migration processes. Users now experience significantly faster IMAP/POP3 access speed. Their e-mail storage quota is raised from 20MB to 200MB, and there is potential to increase the quota to a higher number (1GB). With the installation of gateway-level spam/virus firewalls as well as increased hardware speed, no e-mail backlog has been experienced on MUMAIL during recent spam/virus outbreaks. With an Active Directory integrated user authentication setup, user passwords or other sensitive information are no longer stored on the e-mail system. This reduces user confusion and account administration overhead and increases network security. Mail store backup speed is improved significantly with faster disk access in the SAN environment. Finally, the new system has provided a hardware and software environment that supports future growth with the adoption of a scalable design. More server nodes—both front end and back end—and storage can be added when system usage grows in the future.

An Introduction to Using Linux as a Multipurpose Firewall

Feeling insecure? Here's a guide for getting the protection you need.
High-speed Internet connections are becoming more readily available and popular for home computer users. ADSL (Asymmetric Digital Subscriber Line), Nortel's 1MB modem and cable modems all offer connection speeds many times faster than that of a standard 56K POTS (plain old telephone service) modem that most of us know all too well. The other big advantage of these new services is that they are always connected. That is, you don't need to dial your service provider with your modem to start up your Internet connection. When you turn on your computer, the connection is already there, and your operating system will establish a link as it boots up.
Like the standard modem, these connections allow only one computer to connect to the Internet at a time. In some cases, additional IP addresses can be assigned to additional computers, but there is usually a monthly cost involved in providing this service.
By installing Linux on that old 486 you have sitting in the corner collecting dust, you can create a firewall so all the computers on your local LAN can see the Internet, and at the same time, transfer data back and forth between each other, (see Figure 1). You don't even need a dedicated PC. A faster PC can simultaneously be used for other purposes while acting as the firewall; however, there are two main drawbacks with this approach:
Users on your LAN may experience a slower connection to the Internet.
You could inadvertently open a security hole, allowing someone on the Internet to get in and play havoc with your system or files.

Figure 1. Generic Diagram of Small LAN Configuration
I will be discussing two different types of Linux firewalls. The first type consists of a 486 with 12MB of RAM, and a 200MB hard drive using either Red Hat 6.0 or Slackware 3.6. The second, called the Linux Router Project (LRP), uses a 486, 12MB of RAM, a 1.44MB floppy and no hard drive. Two Ethernet network interface cards (NICs) will be required, regardless of which firewall configuration you install.
Security
Someone is always watching, and people are always on the lookout for computers on the Internet with poor security. Their motivation can be as simple as boredom, or more seriously, a need to find a system to penetrate so they can use it to hide behind while they continue breaking into other systems, leaving evidence that points to you.
If you are running a standard Windows installation, you probably don't have the means to see who is trying to check out your machine. As long as “File and Print Sharing” is turned off inside of Windows, for the most part, you are safe. However, it is possible someone may find a new security hole in your PC and exploit it.
If you have Linux running, you can check out your system logs. Upon doing an informal survey with friends who run Linux firewalls, I found on average five attempts by outsiders each day to use TELNET or FTP to break into their Linux boxes. In the case of a firewall, you can turn off or restrict most services. In general, the strength of your firewall security decreases for each service you open up to the Internet, since each service is an invitation for someone to try and sneak in to your system. For example, if you open TELNET, someone can use it to break in. A safer alternative is to restrict TELNET to certain incoming IP addresses, such as the IP addresses you might use to access your home system from work. If you have no plans to TELNET or FTP into your firewall from the Internet and all your traffic is originated inside your local LAN, you can lock your firewall
fairly tightly. It is always a good idea to stay caught up on new security holes and the fixes for them. Check out http://www.cert.org/ for more information.
Theory of Operation
There are many reasons for having a firewall, some of which I have already mentioned. They include:
Ensuring that local traffic on your intranet does not spill out to the Internet.
Allowing the full use of file and print sharing in your LAN without having to worry about unwanted intrusions.
Providing security for your LAN.
Allowing yourself and authorized users access to your LAN to read e-mail, listen to MP3s or access file backups.

LAN and Internet Traffic Routing
When you copy a file from another local PC on your LAN using Windows “Network Neighbourhood”, or when you FTP a file from another PC on your LAN, that traffic has no reason to go to the Internet. If you had a high-speed modem directly connected to your LAN, it would send out that data, because it has no way of knowing it should not be sent there. By default, it sends all traffic it sees, and although it won't likely get past the next router in the chain, it is sending out data that does not need to be there. This may impact the overall speed of your local LAN. You probably don't want this particular traffic to go out over the Internet, so the firewall prevents it.
One of the TCP/IP settings on our PCs, regardless of the operating system, is the “default route”. When the destination IP address cannot be found on our local LAN (this is determined by the subnet mask), then the default route is used. The default route in this example will point to the IP address of the NIC on the local LAN side of the firewall (Ethernet 0 in Figure 1). This IP address usually ends in 1. For example, if you have a local LAN with a network address of 192.168.0.0 and a subnet mask of 255.255.255.0, you have 192.168.0.1 to 192.168.0.254 available for local IP addresses (see Resources for more information on the Linux NET-3-HOWTO ). In this case, 192.168.0.1 would normally be assigned to the NIC on the firewall.
Any traffic intended for an IP address outside our local LAN will go into the firewall. The firewall will replace (masquerade) the source address of the PC in the local LAN that originated the packet with the firewall's IP address (assigned by your ISP), so that to the Internet, the traffic looks as though it originates from the firewall and is coming from a valid IP address. Any return packets relating to this originating packet will go through the reverse transformation, so the traffic finds its way back to the originating PC.
Rules can be set up to allow certain packets to make it through the firewall or to be stopped dead. By default, nothing gets passed. A small set of rules are needed to support features such as TELNET, HTTP, IMAP and POP3, and a few extra rules are needed to allow other features such as RealAudio or on-line gaming to function. (Gaming can be a little more difficult to set up, as each game is different.)
Designing the Network
Table 1
In Figure 1, you can see how a typical small LAN/firewall configuration might look. You need to determine how many PCs will be in your network, and how many of them will be connecting to the Internet. The IP addresses chosen for your internal network will be determined by the size of the network. Table 1 shows which groups of IP addresses have been reserved for private LANs, such as the one we are designing. For the most part, a class C network address will be sufficient, as it will allow up to 253 hosts or PCs in our LAN, leaving one IP for the firewall. Table 2 shows an example configuration.
Table 2
The complete firewall will be built over several stages. These include building and configuring the hardware, installing and configuring Linux, configuring the network cards, building a new kernel, establishing routing between the networks, then introducing security and locking down the PC and the local LAN.
Building Up the Hardware
First, you must decide what type of system you want to build. If you want to use your firewall only for firewall/routing purposes, then once it is set up and running, it does not need a keyboard or a monitor. In fact, many systems will run without a video card; however, you might want to keep one handy in the event of a system failure. Software changes can be done by either connecting to the firewall over your local LAN via TELNET, or using a modem program on a laptop (such as Hyperterm) and connecting to the firewall via the serial port. This type of configuration is well-suited to LRP. If you actually want to have a few users on your machine reading things like e-mail (either locally or via TELNET), you will need a hard drive and RAM sufficient to handle that. A 200MB hard drive and 16MB of RAM will work for this if you don't load unneeded packages, such as the X server or source code, and your users keep space constraints in mind. If you plan on the LAN using your firewall PC for additional functions, you will need to upgrade it appropriately in all respects: memory, hard drive size and processor speed.
You will need at least two network cards in your firewall. One card will face the Internet and the other will face your local network (Figure 1). If you can't afford a hub and you have only a few PCs to connect, you can put multiple cards in your firewall, one for each PC, and wire an Ethernet cable as a “turn-around” cable. ISA network cards can be found inexpensively in some markets these days, and may be less expensive than an Ethernet hub. The use of more than two network cards in your firewall machine will require more rules in your firewall, but that is easily handled.
You will need the DOS configuration disks for your network cards if they are jumperless cards which use non-volatile RAM (NVRAM) to remember their settings (I/O address, IRQ, etc.). The configuration software for most cards can be found on the Internet at the card manufacturer's web page or at some of those helpful Windows driver repository sites.
Make a DOS floppy boot disk, and have the configuration program for each card handy on floppy.
Table 3
Install one network card at a time and boot your PC. Run the configuration software for that card, and set the I/O address and IRQ settings. Make sure you don't configure the card to a setting already in use by some other card. For I/O addresses, the only item you may have trouble with is an old CD-ROM drive with a proprietary controller (see Table 3). Once configured, remove the card, insert the next network card and repeat the procedure. Once you have your network cards individually configured, you can install them all in the firewall. In my firewall, my first network card is set to an I/O address of 300 and an IRQ of 12, while my second network card is set to an I/O address of 320 and an IRQ of 15.
It is now time to install Linux. The sample configuration that follows is based on Linux kernel 2.2.9. If you install a Linux distribution from the Net or from a recent CD, you may find this kernel included. If not, you can get it from http://www.kernel.org/. The more recent the distribution, the less likely it is that you will run into outdated libraries or utilities. One of the utilities we will be using to control the firewall is called ipchains. This program runs only on kernel version 2.2.x and higher. If you plan on using an earlier version of the kernel, you will need to find ipfwadm. It is always best to use a recent (but not necessarily the most recent) kernel version. Follow the
instructions provided with your distribution, and install the distribution. If the default kernel on the CD is not of the 2.2.x variety, don't worry; you will need to build a new kernel later anyway. If you are building a small system, you will want to install as little of the distribution as possible. At a minimum, you will need to install the base files and networking support.

Configuring the Network
At one point during the installation, you will be asked to configure the Ethernet interfaces (ports). Generally, you will be able to configure only one of the interfaces during the installation. The remaining interfaces can be configured by editing the configuration files. Alternatively, Red Hat 6 offers a GUI-based application called netcfg; however, it requires you to install the X server, something I don't recommend if you are tight on hard drive space or don't plan on leaving a monitor connected to the PC. When you do come across the configuration request for the first interface (generally called eth0), you should enter the information for your local LAN. In our example as per Table 2, we configure this interface as:
IP address: 192.168.0.1
Subnet Mask: 255.255.255.0
Listing 1
The default gateway of the firewall (not the PCs inside your LAN) is that of the gateway provided by your ISP. If the subnet mask provided by your ISP ends in a .0, your ISP gateway IP address will generally end in .1, for example 193.181.132.1. After the setup and installation of the distribution is complete, you will have to add the additional information on your second Ethernet interface (generally called eth1). We will need to edit or create configuration files for both Slackware (Listing 1) and Red Hat 6.0 (Red Hat sidebar).
Red Hat Configuration Files
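As a rough illustration of what those Red Hat files typically contain (a sketch using our example addressing; the eth1 values must come from your ISP):
# /etc/sysconfig/network-scripts/ifcfg-eth0  (local LAN side)
DEVICE=eth0
IPADDR=192.168.0.1
NETMASK=255.255.255.0
ONBOOT=yes

# /etc/sysconfig/network-scripts/ifcfg-eth1  (Internet side; example values only)
DEVICE=eth1
IPADDR=193.181.132.2
NETMASK=255.255.255.0
ONBOOT=yes

# /etc/sysconfig/network  (default route toward the ISP)
GATEWAY=193.181.132.1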
If your NIC cards are all of the same type or all use the same driver, you must tell Linux to search for more than one card of that type at boot time. LILO provides a nice way of doing this that works for most Ethernet drivers I have tried. Edit the file /etc/lilo.conf and add the line
append="ether=0,0,eth1"
anywhere in the LILO global section near the top of the file. If you have more than two Ethernet cards, you would add
append="ether=0,0,eth1 ether=0,0,eth2"
You can also explicitly define all the cards instead of just telling the system to look for additional cards by using the following on one line:
append="ether=irq_card0,io=0xaddress0,eth0 ether=irq_card1,io=0xaddress1,eth1"
In the example of my configured cards above, I could then use
append="ether=12,0x300,eth0 ether=15,0x320,eth1"
Don't forget to type lilo after you have finished editing the file so the new lilo parameters are read and installed, but, more importantly at this point, so you know you have not created any errors in the LILO configuration file.
Do not reboot yet, as we still need to build a kernel to support our various hardware and firewalling needs.
Building the Kernel
Listing 2
A variety of configurations are required to make the kernel run. Listing 2 shows the settings I have used in my system. If you have never built a kernel before, see “Linux Kernel Installation” by David Bandel in the November 1997, issue 43 of Linux Journal. A quick summary is as follows:
cd /usr/src/linux
make menuconfig
Look at the many screens, read the help and any other reference documentation it points to. This will help you determine which options you need. After you have finished choosing your options, save the kernel, then type:
make dep
make bzlilo
make modules
make modules_install
The resulting files, the new kernel file called vmlinuz and a new System.map file, will likely be located in the root directory /. If your distribution expects System.map in /boot, copy it there, like this:
cp /System.map /boot
Also make sure the file /etc/lilo.conf and the line inside it which reads image=IMAGENAME (where IMAGENAME is the name and location of your kernel used at boot time) is correct. If it does not point to the correct location, change it and re-execute the lilo command to complete the process of setting up the new kernel.
This will build and install the Linux kernel, update LILO to reflect the new kernel and install a variety of modules, such as support for RealAudio which by default is blocked by the firewall.
Listing 2 includes only those options required to make the firewall function. Other options such as processor type are left out, since these are specific to the hardware you are using for this project. As a rule, I put as little in the kernel as required, and I minimize the use of modules. If you are not sure how an option I have shown in the table is used, or where it shows up in the kernel-configuration program, you can match up the item by clicking on help for the items in that section. You will find its configuration file name at the top of the help page. Similarly, if it's not shown in Listing 2 and you don't need it to make your hardware run or support some other feature, then it should be set to off.
If you are building a bare-bones system and are going to compile the kernel elsewhere, be sure to save your kernel on that machine first, and also save that PC's kernel configuration in an alternate file (see the bottom of the kernel configuration program menu). After it is built, you need to copy the files over to the firewall PC via sneaker net or LAN. Don't forget to copy the modules installed in /lib/modules/2.2.9 as well.
It is now time to reboot the PC and cross your fingers. If everything works correctly, both Ethernet cards will be recognized, and they will both be configured. When the system is fully booted, log in as root and type ifconfig. It should show detailed information about three interfaces: lo: the local loopback interface
eth0: the Ethernet port pointing to your local LAN
eth1: the Ethernet port pointing to the Internet
You can also type route and see what default routes are up. It should show a default route to the Internet, as well as some information about your local LAN. At this point, the firewall should be able to see both your local LAN and the Internet. If you hook your high-speed modem to the eth1
port at this point, you should be able to ping sites on the Internet (e.g., ping www.linuxjournal.com) and see an answer coming back about once per second. Press CTRL-c to stop the pings. Some high-speed modems need to learn your Ethernet card's MAC address, and only do so each time they are turned on. Therefore, if you are connecting your modem to a different Ethernet card than it was previously connected to, you will have to power off both the modem and your firewall PC, power the modem on, wait a few seconds, then turn the PC on. If you don't do this, you may find you can't see the Internet at all.

Locking Down the Firewall
We currently have a Linux PC, connected in the middle of two networks. It can see both, and both can see it. The PC is also wide open with all the default ports turned on. We want to restrict this as much as possible. People are always looking for new ways of breaking into systems. The more we lock down this firewall to the outside, the less vulnerable we are to attacks. Nothing is perfect, and the only true way to be sure people are kept out is to unplug your Ethernet connection when you are not there. Since that's undesirable for most of us, this is the next best thing.
What needs to be done now is disabling all services we don't need. If you are making this a true firewall, you can disable almost everything except TELNET and FTP, and these two will be limited to ports from only inside your LAN and trusted outside IP addresses.
Listing 3
The file /etc/inetd.conf, as shown in Listing 3, is where these ports are configured. This file affects traffic terminating at the firewall, not passing through it. Disabling something like POP3 or IMAP is acceptable, since when you go to get your mail from a PC inside your network, this traffic will pass through the firewall (but not stop) on its way to your ISP's POP3 or IMAP mail server.
Remember, the more ports and addresses you choose to leave open, the more closely you will need to watch your firewall for break-in attempts. We have left TELNET and FTP open, so we'll want to restrict the originating IP addresses on both networks to those we want to let in.
Setting Services
This is done by editing the files /etc/hosts.deny and /etc/hosts.allow. By editing these files, you can deny access to everyone except a few specific addresses or range of addresses, or you can allow everyone in by default and disable problem IP addresses down the road when you discover unwanted access from those points. If this is the case, be sure to watch your system logs closely. See the “Setting Services” sidebar for more details. In one sense, we could have left inetd.conf alone and restricted people from those ports via the /etc/hosts.deny table; however, it is always best to lock down ports in multiple ways.
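A common pattern (a sketch only; substitute your own addresses) is to deny everything in /etc/hosts.deny and then explicitly allow your LAN plus any trusted outside address in /etc/hosts.allow:
# /etc/hosts.deny
ALL: ALL

# /etc/hosts.allow
in.telnetd: 192.168.0. 203.0.113.25
in.ftpd:    192.168.0.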
By default, most UNIX systems do not allow root to log in from anywhere but the console. If your system is not set up that way, it should be. You will at least want to slow down someone who might want in your system. If they can't log in directly as root, this is an additional security benefit. Check the file /etc/securetty. In Red Hat 6.0, look for pty1, pty2, etc. entries in the table. In Slackware, look for ttyp0, ttyp1, etc. entries in the table. If these entries are in place, root login is allowed on those TELNET ttys; therefore, remove the entries. The other remaining entries in the table cover your various consoles and serial ports.
Since you can't log in remotely as root and you do not have a console with a monitor and keyboard, it would be best to add a second user to the firewall to ensure you can “su to root” to do work on the firewall.
useradd -g 100 -d /home/USER -s /bin/tcsh -c 'YOURNAME' USER
passwd USER
The -g controls which group this user will belong to. In this example, 100 was used, as this is the user's group in Red Hat 6.0. If this does not work for you, check out /etc/group to find a suitable group. YOURNAME is whatever you want to put in the Name field of the user account, and USER is the ID chosen for the user, i.e., I may choose to use jeff as my ID.
Stopping Extra Processes
In a small system, the only processes we want running are ones that pertain to the operation of the firewall. This means disabling processes: all but one or two consoles, Sendmail and anything else you don't need. You can see what is running right now by typing:
ps -xa
To keep Sendmail from starting next time, you will need to move or edit the file where it starts. Linux usually starts up in runlevel 3. In Red Hat 6.0, you can check that by looking at /etc/inittab and looking for the line that reads id:3:initdefault:. The 3 indicates runlevel 3. Therefore, in /etc/rc.d/rc3.d, there is a file called S80sendmail. Move this file to 80sendmail, as follows:
mv /etc/rc.d/rc3.d/S80sendmail\
/etc/rc.d/rc3.d/80sendmail
Some programs like elm require that sendmail be running to operate properly. This opens up a potential hole to the outside world, since it also means port 25 will be open to possible attacks and possibly even mail relaying, allowing others on the Internet to use your firewall to send out spam mail. Turning off port 25 access is the easiest way to prevent this problem. Other solutions can be found at http://www.sendmail.org/.
In Slackware, edit /etc/rc.d/rc.M and change the line:
/usr/sbin/sendmail -bd -q15m
to:
/usr/sbin/sendmail -bm -q15m
In Red Hat 6.0, edit /etc/rc.d/rc3.d/S80sendmail and change the line:
daemon /usr/sbin/sendmail $([ "$DAEMON" = yes ] && echo -bd) \
to:
daemon /usr/sbin/sendmail $([ "$DAEMON" = yes ] && echo -bm) \
Creating the Firewall
Currently, we have a reasonably secure PC quite incapable of passing the network traffic from the local LAN to the Internet. It is now time to set up and configure the rules that will make our firewall function. As mentioned earlier, these rules allow acceptable packets to pass through the firewall, while still offering various levels of security to unacceptable packets.
Download (with FTP) the ipchains package from http://www.rustcorp.com/ipchains/. Follow the installation instructions you obtained with the package to install it on your system.
Listing 4
Listing 4 shows the /etc/rc.d/rc.local file which is used to start any process not normally started as part of the distribution's installation. It is here where we set the rules for our firewall. Since our firewall is fairly straightforward, all we need to do is set up forwarding of masqueraded packets. To be able to use the full functionality of FTP, RealAudio, IRC and others, we need to support their port requirements as well. Many of these can be supported using the ipchains command above, but there are loadable modules that will take care of this, such as those shown in the sample rc.local file in Listing 4. See /lib/modules/2.2.9/ipv4 for a list of modules supported in your kernel. This directory should have been created earlier when you built the kernel.
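As an illustration of the idea (a minimal sketch for our example 192.168.0.0/24 LAN, not a reproduction of Listing 4), such an rc.local can contain:
#!/bin/sh
# turn on IP forwarding and masquerade everything leaving the local LAN
echo 1 > /proc/sys/net/ipv4/ip_forward
ipchains -P forward DENY
ipchains -A forward -s 192.168.0.0/24 -j MASQ
# masquerading helper modules for protocols that need special handling
/sbin/modprobe ip_masq_ftp
/sbin/modprobe ip_masq_raudio
/sbin/modprobe ip_masq_irc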
That should do it. You are now ready to test your network firewall. Set one of your PCs inside your local LAN to one of the sample settings shown in Figure 1. For example, on Windows 95, you will need to enter a local LAN IP (such as 192.168.0.10), a subnet mask of 255.255.255.0, a gateway IP of 192.168.0.1 and DNS entries given to you by the ISP. If the high-speed modem was originally connected to this PC, the DNS entries in the PC should already be set.
To test out your new firewall, try connecting to a web site with one of the PCs on your internal LAN. Try using RealAudio, FTP and other functions you regularly use. If none of these work, try using TELNET to get to the firewall PC. If you can do so, and you can ping a site on the Internet (or get to it via TELNET) from the firewall PC, check your rules in the /etc/rc.d/rc.local file, as you might not have turned on IP forwarding. If web access works, but (say) IRC does not, check to see if you loaded the IRC module correctly. Use the command lsmod to show which modules are loaded.
Building a Firewall Using the Linux Router Project
The configuration of LRP I will describe also uses the setup in Figure 1. It was set up on a 486 with 12MB of memory, a 1.44MB floppy drive, two Western Digital ISA network cards and no hard drive. For your system, install and configure the network cards in the same way as for the full firewall build earlier in this article. LRP version 2.9.4 is based on kernel version 2.0.36. This kernel is older than the 2.2.9 used above, and as a result, does not offer some features you may require if you want an advanced firewall. By the time you read this, there will likely be a new version available based on version 2.2.x of the kernel. I will describe setting up version 2.9.4, and if you need some of the 2.2.x features, you have a foundation from which to work.
LRP uses a DOS-formatted floppy, either formatted as a standard 1.44MB disk or larger. (A utility called 2m can squeeze additional, usable storage space out of a floppy.) During boot time, a RAM disk is created, which is used as the live file system. Various portions of the system are created from compressed archive files (tar) that end in .lrp and are found on the floppy. In general, the floppy can run with write protect on. This means if someone were to find a way in to your firewall, any changes they made would disappear when the system is rebooted.
LRP is available in many forms. The hard way is to create a disk, make it bootable using a program called syslinux, and install the kernel and various LRP files required. However, at ftp://ftp.linuxrouter.org/linux-router/dists/2.9.4/, you will find in the download section a file called idiot-image_1440KB_2.9.4. The name might not be flattering, but it is the easiest way to start building an LRP disk. After you get the file via FTP, copy it to the floppy in one of two ways. In DOS, use the rawrite utility that came with your Linux distribution. In Linux, type:
cp idiot-image_1440KB_2.9.4 /dev/fd0
I have assumed /dev/fd0 is your 1.44MB floppy, but if it is not, change fd0 to the correct device name.
Now go to http://www.linuxrouter.org/modmaker/ and make a kernel that includes hardware support for our network cards and includes any modules required to support FTP, RealAudio, etc. This web site is a very nice way to generate a kernel. Click on 2.0.36final and tick off the modules you require. Unless you know you don't want support for one of the few masquerading modules in this list (like IRC), tick off all options that start with ip_masq such as ip_masq_irc and ip_masq_ftp. Then go down the list and find the drivers for your hardware. You may have to do some research as to the driver your NIC cards require. If you don't know which driver to pick, run make menuconfig on a working full Linux system and look at the devices under Network Device Support. When you find your card, look at the help and find out its module name. This module name is what you need. You can also increase the overall security by implementing additional rules to prevent IP spoofing in the full firewall. These rules are already included in the LRP.

Use Linux as a SAN Provider

At one-tenth the cost of the typical commercial appliance, Linux can deliver storage with speed and redundancy. Make the move toward a full-featured iSCSI SAN solution with what you already have in your server room.
Storage Area Networks (SANs) are becoming commonplace in the industry. Once restricted to large data centers and Fortune 100 companies, this technology has dropped in price to the point that small startups are using them for centralized storage. The strict definition of a SAN is a set of storage devices that are accessible over the network at a block level. This differs from a Network Attached Storage (NAS) device in that a NAS runs its own filesystem and presents that volume to the network; it does not need to be formatted by the client machine. Whereas a NAS usually is presented with the NFS or CIFS protocol, a SAN running on the same Ethernet often is presented as iSCSI, although other technologies exist.
iSCSI is the same SCSI protocol used for local disks, but encapsulated inside IP to allow it to run over the network in the same way any other IP protocol does. Because of this, and because it is seen as a block device, it often is almost indistinguishable from a local disk from the point of view of the client's operating system and is completely transparent to applications.
The iSCSI protocol is defined in RFC 3720 and runs over TCP ports 860 and 3260. In addition to the iSCSI protocol, many SANs implement Fibre Channel as a transport mechanism. This is an improvement over Gigabit Ethernet, mainly because it runs at 4 or 8Gb/s as opposed to 1Gb/s. In the same vein, 10 Gigabit Ethernet would have an advantage over Fibre Channel.
The downside to Fibre Channel is the expense. A Fibre Channel switch often runs many times the cost of a typical Ethernet switch and comes with far fewer ports. There are other advantages to Fibre Channel, such as the ability to run over very long distances, but these aren't usually the decision-making factors when purchasing a SAN.
In addition to Fibre Channel and iSCSI, ATA over Ethernet (AoE) also is starting to make some headway. In the same way that iSCSI provides SCSI commands over an IP network, AoE provides ATA commands over an Ethernet network. AoE actually runs directly on Ethernet, not on top of IP the way iSCSI does. Because of this, it has less overhead and often is faster than iSCSI in the same environment. The downside is that it cannot be routed. AoE also is far less mature than iSCSI, and fewer large networking companies are looking to support it. Another disadvantage of AoE is that it has no built-in security outside of MAC filtering. As it is relatively easy to spoof a MAC address, this means anyone on the local network can access any AoE volumes.
Should You Use a SAN?
The first step in moving down the road to a SAN is the choice of whether to use it at all. Although a SAN often is faster than a NAS, it also is less flexible. For example, the size or filesystem of a NAS usually can be changed on the host system without the client system having to make any changes. With a SAN, because it is seen as a block device like a local disk, it is subject to many of the same rules as a local disk. So, if a client is running its /usr filesystem on an iSCSI device, the volume would have to be taken off-line and modified not just on the server side, but also on the client side; the client would have to grow the filesystem on top of the device.
There are some significant differences between a SAN volume and a local disk. A SAN volume can be shared between computers. Often, this presents all kinds of locking problems, but with an application aware that its volume is shared out to multiple systems, this can be a powerful tool for failover, load balancing or communication. Many filesystems exist that are designed to be shared. GFS from Red Hat and OCFS from Oracle (both GPL) are great examples of these kinds of filesystems.
The network is another consideration in choosing a SAN. Gigabit Ethernet is the practical minimum for running modern network storage. Although a 100- or even a 10-megabit network theoretically would work, the practical results would be extremely slow. If you are running many volumes or requiring lots of reads and writes to the SAN, consider running a dedicated gigabit network. This will prevent the SAN data from conflicting with your regular IP data and, as an added bonus, increase security on your storage.
Security also is a concern. Because none of the major SAN protocols are encrypted, a network sniffer could expose your data. In theory, iSCSI could be run over IPsec or a similar protocol, but without hardware acceleration, doing so would mean a large drop in performance. Failing that, keeping the SAN traffic on its own VLAN is the bare minimum.
Because it is the most popular of the various SAN protocols available for Linux, I use iSCSI in the examples in this article. But, the concepts should transfer easily to AoE if you've selected that for your systems. If you've selected Fibre Channel, things are still broadly similar, though less so; you will need to rely more on your switch for most of your authentication and routing. On the positive side, most modern Fibre Channel switches provide excellent setup tools for doing this.
To this point, I have been using the terms client and server, but that is not completely accurate for iSCSI technology. In the iSCSI world, people refer to clients as initiators and servers or other iSCSI storage devices as targets. Here, I use the Open-iSCSI Project to provide the initiator and the iSCSI Enterprise Target (IET) Project to provide the target. These pieces of software are available in the default repositories of most major Linux distributions. Consult your distribution's documentation for the package names to install or download the source from www.open-iscsi.org and iscsitarget.sourceforge.net. Additionally, you'll need iSCSI over TCP/IP in your kernel, selectable in the low-level SCSI drivers section.
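On a Debian-style system, for instance, the pieces might be installed like this (the exact package names are an assumption here and vary by distribution):

apt-get install open-iscsi iscsitarget
modprobe iscsi_tcp

The open-iscsi package provides the initiator, iscsitarget provides IET, and loading iscsi_tcp pulls in the iSCSI-over-TCP/IP kernel support if it was built as a module.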

Setting Up the Initiator and Target
In preparation for setting up the target, you need to provide it with a disk. This can be a physical disk or you can create a disk image. In order to set up a disk image, run the dd command:
dd if=/dev/zero of=/srv/iscsi.image.0 bs=1 seek=10M count=1
Because of the seek option, this command creates a sparse file of about 10MB called /srv/iscsi.image.0 that reads back as all zeros. This will represent the first iSCSI disk. To create another, do this:
dd if=/dev/zero of=/srv/iscsi.image.1 bs=1 seek=10M count=1
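Because they are sparse, these images occupy almost no space on disk until data is written to them. You can compare the apparent size with the blocks actually allocated like this:

ls -lhs /srv/iscsi.image.*

The first column shows the allocated blocks, while the size column shows the apparent 10MB.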
Configuration for the IET software is located in /etc/ietd.conf. Though a lot of tweaks are available in the file, the important lines really are just the target name and LUN. For each target, exported disks must have a unique LUN. Target names are formatted specially. The official term for this name is the iSCSI Qualified Name (IQN).
The format is:
iqn.yyyy-mm.(reversed domain name):label
where iqn is required, yyyy signifies a four-digit year, followed by mm (a two-digit month) and a reversed domain name, such as org.michaelnugent. The label is a user-defined string in order to better identify the target.
Here is an example ietd.conf file using the images created above and a physical disk, sdd:
Target iqn.2009-05.org.michaelnugent:iscsi-target
IncomingUser michael secretpasswd
OutgoingUser michael secretpasswd
Lun 0 Path=/srv/iscsi.image.0,Type=fileio
Lun 1 Path=/srv/iscsi.image.1,Type=fileio
Lun 2 Path=/dev/sdd,Type=blockio
The IncomingUser is used during discovery to authenticate iSCSI initiators. If it is not specified, any initiator will be allowed to connect and open a session. The OutgoingUser is used during discovery to authenticate the target to the initiator. For simplicity, I made them the same in this example, but they don't need to be. Note that the RFC requires both of these secrets to be at least 12 characters long. The Microsoft initiator enforces this strictly, though the Linux one does not.
Start the server using /etc/init.d/iscsitarget start (this may change depending on your distribution). Running ps ax | grep ietd will show you that the server is running.
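You also can verify that the target is listening on the standard iSCSI port:

netstat -tln | grep 3260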
Now you can move on to setting up the initiator to receive data from the target. To set up an initiator, place its name (in IQN format) in the /etc/iscsi/initiatorname.iscsi file (or possibly /etc/initiatorname.iscsi). An example of a well-formatted file would be the following:
InitiatorName=iqn.2009-05.org.michaelnugent:iscsi-01
In addition, you also need to modify the /etc/iscsi/iscsid.conf file to match the user names and passwords set in the ietd.conf file above:
node.session.auth.authmethod = CHAP
node.session.auth.username = michael
node.session.auth.password = secretpasswd
node.session.auth.username_in = michael
node.session.auth.password_in = secretpasswd
discovery.sendtargets.auth.authmethod = CHAP
discovery.sendtargets.auth.username = michael
discovery.sendtargets.auth.password = secretpasswd
discovery.sendtargets.auth.username_in = michael
discovery.sendtargets.auth.password_in = secretpasswd
Once this is done, run the iscsiadm command to discover the target.
iscsiadm -m discovery -t sendtargets -p 192.168.0.1 -P 1
This should output the following:
Target: iqn.2009-05.org.michaelnugent:iscsi-target
Portal: 192.168.0.1:3260,1
IFace Name: default
Now, at any time, you can run:
iscsiadm -m node -P1
which will redisplay the target information.
Now, run /etc/init.d/iscsi restart. Doing so will connect to the new block devices. Run dmesg and fdisk -l to view them. Because these are raw block devices, they look like physical disks to Linux. They'll show up as the next SCSI device, such as /dev/sdb. They still need to be partitioned and formatted to be usable. After this is done, mount them normally and they'll be ready to use.
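As a minimal sketch of those last steps, assuming the new volume showed up as /dev/sdb and you want it mounted at /mnt/iscsi (both names are assumptions):

fdisk /dev/sdb
mkfs.ext3 /dev/sdb1
mkdir -p /mnt/iscsi
mount /dev/sdb1 /mnt/iscsi

In fdisk, create a single partition spanning the disk; it then appears as /dev/sdb1 and can be formatted and mounted like any local partition.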
This sets up the average iSCSI volume. Often though, you may want machines to run entirely diskless. For that, you need to run root on iSCSI as well, which is a bit more involved. The easiest, but more expensive, way is to employ a network card with iSCSI support built in. That allows the card to mount the volume and present it without the operating system having to do any additional work. On the downside, these cards are significantly more expensive than the average network card.
To create a diskless system without an iSCSI-capable network card, you need to employ PXE boot. This requires that a DHCP server be available in order for the initiator to receive an address. That DHCP server will have to refer to a TFTP server in order for the machine to download its kernel and initial ramdisk. That kernel and ramdisk will have iSCSI and discovery information in it. This enables the average PXE-enabled card to act as a more expensive iSCSI-enabled network card.
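The DHCP side of that chain might look something like this dhcpd.conf fragment (the addresses, the TFTP server at 192.168.0.2 and the pxelinux.0 filename are all assumptions for illustration):

subnet 192.168.0.0 netmask 255.255.255.0 {
    range 192.168.0.100 192.168.0.150;
    next-server 192.168.0.2;
    filename "pxelinux.0";
}

Here next-server points at the TFTP server that holds the boot loader, kernel and initial ramdisk, and the ramdisk carries the iSCSI initiator configuration needed to mount the root volume.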
Multipathing
Another feature often run with iSCSI is multipathing. This allows Linux to use multiple networks at once to access the iSCSI target. It usually is run on separate physical networks, so in the event that one fails, the other still will be up and the initiator will not experience loss of a volume or a system crash. Multipathing can be set up in two ways, either active/passive or active/active. Active/active generally is the preferred way, as it can be set up not only for redundancy, but also for load balancing. Like Fibre Channel, multipath assigns World Wide Identifiers (WWIDs) to devices. These are guaranteed to be unique and unchanging. When one of the paths is removed, the other one continues to function. The initiator may experience slower response time, but it will continue to function. Re-integrating the second path allows the system to return to its normal state.
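With the Linux device-mapper multipath tools installed (package names vary; often multipath-tools or device-mapper-multipath), you can inspect the WWIDs and the state of each path with:

multipath -ll

and a minimal /etc/multipath.conf might contain only:

defaults {
    user_friendly_names yes
}

which maps the long WWIDs to friendlier /dev/mapper/mpathN names. Treat this as a starting sketch; the defaults shipped with your distribution usually are a reasonable starting point.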
RAID
When working with local disks, people often turn to Linux's software RAID or LVM systems to provide redundancy, growth and snapshotting. Because SAN volumes show up as block devices, it is possible to use these tools on them as well. Use them with care though. Setting up RAID 5 across three iSCSI volumes causes a great deal of network traffic and almost never gives the results you're expecting. However, if you have enough bandwidth available and you aren't doing many writes, a RAID 1 setup across multiple iSCSI volumes is not completely out of the question. If one of these volumes drops, rebuilding can be an expensive process. Be careful about how much bandwidth you allocate to rebuilding the array if you're in a production environment. Note that RAID can be used at the same time as multipathing in order to increase your bandwidth.
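On Linux's md software RAID, the resync rate can be capped through sysctl, which is one way to keep a rebuild from saturating the storage network (the values below are only illustrative, in KB/s per device):

sysctl -w dev.raid.speed_limit_max=10000
sysctl -w dev.raid.speed_limit_min=1000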
To set up RAID 1 over iSCSI, first load the RAID 1 module:
modprobe raid1
After partitioning your first disk, /dev/sdb, copy the partition table to your second disk, /dev/sdc. Remember to set the partition type to Linux RAID autodetect:
sfdisk -d /dev/sdb | sfdisk /dev/sdc
Assuming you set up only one partition, use the mdadm command to create the RAID group:
mdadm --create /dev/md0 --level=1 --raid-disks=2 /dev/sdb1 /dev/sdc1
After that, cat the /proc/mdstat file to watch the state of the synchronization of the iSCSI volumes. This also is a good time to measure your network throughput to see whether it will stand up under production conditions.
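One simple way to take that measurement, assuming iperf is installed on both machines, is to run a server on the target and point the initiator at it (the address is the target address used in the discovery step above):

iperf -s
iperf -c 192.168.0.1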
Conclusion
Running a SAN on Linux is an excellent way to bring up a shared environment in a reasonable amount of time using commodity parts. Spending a few thousand dollars to create a multiterabyte array is a small budget when many commercial arrays easily can extend into the tens to hundreds of thousands of dollars. In addition, you gain flexibility. Linux allows you to manipulate the underlying technologies in ways most of the commercial arrays do not. If you're looking for a more-polished solution, the Openfiler Project provides a nice layout and GUI to navigate. It's worth noting that many commercial solutions run a Linux kernel under their shell, so unless you specifically need features or support that isn't available with standard Linux tools, there's little reason to look to commercial vendors for a SAN solution.

DRBD in a Heartbeat

About three years ago, I was planning a new server setup that would run our new portal as well as e-mail, databases, DNS and so forth. One of the most important goals was to create a redundant solution, so that if one of the servers failed, it wouldn't affect company operation.
I looked through a lot of the redundant solutions available for Linux at the time, and with most of them, I had trouble getting all the services we needed to run redundantly. After all, there is a very big difference in functionality between a Sendmail dæmon and a PostgreSQL dæmon.
In the end, though, I did find one solution that worked very well for our needs. It involves setting up a disk mirror between machines using the software DRBD and a high-availability monitor on those machines using Heartbeat.
DRBD mirrors a partition between two machines allowing only one of them to mount it at a time. Heartbeat then monitors the machines, and if it detects that one of the machines has died, it takes control by mounting the mirrored disk and starting all the services the other machine is running.
I've had this setup running for about three years now, and it has made the inevitable hardware failures unnoticeable to the company.
In this tutorial, I show you how to set up a redundant Sendmail system, because once you do that, you will be able to set up almost any service you need. We assume that your master server is called server1 and has an IP address of 192.168.1.1, and your slave server is called server2 and has an IP address of 192.168.1.2.
And, because you don't want to have to access your mail server on any of these addresses in case they are down, we will give it a virtual address of 192.168.1.5. You can, of course, change this to whatever address you want in the Heartbeat configuration that I discuss near the end of this article.
How It Works
This high-availability solution works by replicating a disk partition in a master/slave mode. The server running as the master has full read/write access to that partition, whereas the server running as the slave has no access to the partition but silently replicates all changes made by the master.
Because of this, all the processes that need to access the replicated partition must be running on the master server. If the master server fails, the Heartbeat dæmon running on the slave server will tell DRBD that it is now the master, mount the replicated partition, and then start all the processes that have data stored on the replicated partition.
How to Get It Running
The first step for running a redundant system is having two machines ready to try it out. They don't need to have identical specs, but they should meet the following requirements:
* Enough free space on both machines to create an equal-sized partition on each of them.
* The same versions of the dæmons you want to run across both machines.
* A network card with crossover cable or a hub/switch.
* An optional serial port and serial port crossover cable for additional monitoring.
You also should think carefully about which services you want running on both machines, as this will affect the amount of hard disk space you will need to dedicate to replication across them and how you will store the configuration and data files of these services.
It's very important that you have enough space on this shared partition, because it will be the main data storage location for all of these services. So, if you are going to be storing a large Sendmail spool or a database, you should make sure it has more than enough space to run for a long time before having to repartition and reconfigure DRBD for a larger disk size.
Setting Up the Basics on Your Servers
Once you've made sure your machines are ready, you can go ahead and create an equal-sized partition on both machines. At this stage, you do not need to create a filesystem on that partition, because you will do that only once it is running mirrored over DRBD.
For my servers, I have one DRBD replicated drive that looks like this on my partition tables:
/dev/sda5 7916 8853 7534453+ 83 Linux
Note: type fdisk -l at your command prompt to view a listing of your partitions in a format similar to that shown here. Also, in my case, the partition table is identical on both redundant machines.
The next step after partitioning is getting the packages for Heartbeat version 1.2+ and DRBD version 0.8+ installed and the DRBD kernel module compiled. If you can get these prepackaged for your distribution, it will probably be easier, but if not, you can download them from www.linux-ha.org/DownloadSoftware and www.drbd.org/download.html.
Now, go to your /etc/hosts file and add a couple lines, one for your primary and another for your secondary redundant server. Call one server1, the other server2, and finally, call one mail, and set the IP addresses appropriately. It should look something like this:
192.168.1.1 server1
192.168.1.2 server2
192.168.1.5 mail
Finally, on both your master and slave server, make a folder called /replicated, and add the following line to the /etc/fstab file:
/dev/drbd0 /replicated ext3 noauto 0 0
Configuring DRBD
After you've done that, you have to set up DRBD before moving forward with Heartbeat. In my setup, the configuration file is /etc/drbd.conf, but that can change depending on distribution and compile time options, so try to find the file and open it now so you can follow along. If you can't find it, simply create one called /etc/drbd.conf.
Listing 1 is my configuration file. I go over it line by line and add explanations as comments that begin with the # character.
Listing 1. /etc/drbd.conf
# Each resource is a configuration section for a
# mirrored disk.
# The drbd0 is the name we will use to refer
# to this disk when starting or stopping it.

resource drbd0 {
protocol C;
handlers {
pri-on-incon-degr "echo 'DRBD: primary requested but inconsistent!'
↪| wall; /etc/init.d/heartbeat stop"; #"halt -f";
pri-lost-after-sb "echo 'DRBD: primary requested but lost!'
↪| wall; /etc/init.d/heartbeat stop"; #"halt -f";
}

startup {
degr-wfc-timeout 120; # 2 minutes.
}

disk {
on-io-error detach;
}
# These are the network settings that worked best for me.
# If you want to play around with them, go
# ahead, but take a look in the man pages of drbd.conf
# and drbdadm to see what each does.

net {
timeout 120;
connect-int 20;
ping-int 20;
max-buffers 2048;
max-epoch-size 2048;
ko-count 30;

# Remember to change this shared-secret on both the master
# and slave machines.

cram-hmac-alg "sha1";
shared-secret "FooFunFactory";
}

syncer {
rate 10M;
al-extents 257;
}

# This next block defines the settings for the server
# labeled as server1. This label should be in your
# /etc/hosts file and point to a valid host.

on server1 {

# The following device will be created automatically by
# the drbd kernel module when the DRBD
# partition is in master mode and ready to write.
# If you have more than one DRBD resource, name
# this device drbd1, drbd2 and so forth.

device /dev/drbd0;

# Put the partition device name you've prepared here.

disk /dev/sda5;

# Now put the IP address of the primary server here.
# Note: you will need to use a unique port number for
# each resource.

address 192.168.1.1:7788;
meta-disk internal;
}

# This next block is identical to that of server1 but with
# the appropriate settings of the server called
# server2 in our /etc/hosts file.

on server2 {
device /dev/drbd0;
disk /dev/sda5;
address 192.168.1.2:7788;
meta-disk internal;
}
}
Now, let's test it by starting the DRBD driver to see if everything works as it should. On your command line on both servers type:
drbdadm create-md drbd0; /etc/init.d/drbd restart; cat /proc/drbd
If all goes well, the output of the last command should look something like this:
0: cs:Connected st:Secondary/Secondary ds:Inconsistent/Inconsistent r---
ns:0 nr:0 dw:0 dr:0 al:0 bm:0 lo:0 pe:0 ua:0 ap:0
resync: used:0/7 hits:0 misses:0 starving:0 dirty:0 changed:0
act_log: used:0/257 hits:0 misses:0 starving:0 dirty:0 changed:0
Note: you always can find information about the DRBD status by typing:
cat /proc/drbd
Now, type the following command on the master system:
drbdadm -- --overwrite-data-of-peer primary drbd0; cat /proc/drbd
The output should look something like this:
0: cs:SyncSource st:Primary/Secondary ds:UpToDate/Inconsistent r---
ns:65216 nr:0 dw:0 dr:65408 al:0 bm:3 lo:0 pe:7 ua:6 ap:0
[>...................] sync'ed: 2.3% (3083548/3148572)K
finish: 0:04:43 speed: 10,836 (10,836) K/sec
resync: used:1/7 hits:4072 misses:4 starving:0 dirty:0 changed:4
act_log: used:0/257 hits:0 misses:0 starving:0 dirty:0 changed:0
This means it is syncing your disks from the master computer that is set as the primary one to the slave computer that is set as secondary.
Next, create the filesystem by typing the following on the master system:
mkfs.ext3 /dev/drbd0
Once that is done, on the master computer, go ahead and mount the drive /dev/drbd0 on the /replicated directory we created for it. We'll have to mount it manually for now until we set up Heartbeat.
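Given the fstab entry added earlier, on the master that is simply:

mount /replicated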
Preparing Your Services
An important part of any redundant solution is properly preparing your services so that when the master machine fails, the slave machine can take over and run those services seamlessly. To do that, you have to move not only the data to the replicated DRBD disk, but also move the configuration files.
Let me show you how I've got Sendmail set up to handle the mail and store it on the replicated drive. I use Sendmail for this example because it is one step more complicated than the other services: even when a machine is running in slave mode, it may need to send e-mail notifications from internal applications, and if Sendmail can't access its configuration files, it won't be able to do that.
On the master machine, first make sure Sendmail is installed but stopped. Then create an etc directory on your /replicated drive, copy your /etc/mail directory into /replicated/etc, and replace /etc/mail with a symlink pointing to /replicated/etc/mail.
Next, make a var directory on the /replicated drive, and copy /var/mail, /var/spool/mqueue and any other mail data folders into that directory. Then, of course, create the appropriate symlinks so that the new folders are accessible from their previous locations.
Your /replicated directory structure should now look something like:
/replicated/etc/mail
/replicated/var/mail
/replicated/var/spool/mqueue
/replicated/var/spool/mqueue-client
/replicated/var/spool/mail
And, on your main drive, those folders should be symlinks and look something like:
/etc/mail -> /replicated/etc/mail
/var/mail -> /replicated/var/mail
/var/spool/mqueue -> /replicated/var/spool/mqueue
/var/spool/mqueue-client -> /replicated/var/spool/mqueue-client
/var/spool/mail -> /replicated/var/spool/mail
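As a concrete sketch, the copy-and-symlink steps for the configuration directory look like this on the master, with Sendmail stopped (the .orig backup name is just for illustration; repeat the pattern for each spool directory):

mkdir -p /replicated/etc
cp -a /etc/mail /replicated/etc/
mv /etc/mail /etc/mail.orig
ln -s /replicated/etc/mail /etc/mail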
Now, start Sendmail again and give it a try. If all is working well, you've successfully finished the first part of the setup.
The next part is to make sure mail keeps working even on the slave. The trick is to keep the real Sendmail binary on the mounted /replicated drive, while leaving a symlink to the lightweight ssmtp binary in the same place on the unmounted /replicated directory.
First, make sure you have ssmtp installed and configured on your system. Next, make a directory /replicated/usr/sbin, copy /usr/sbin/sendmail into it, and then replace /usr/sbin/sendmail with a symlink pointing to /replicated/usr/sbin/sendmail.
Once that's done, shut down Sendmail and unmount the /replicated drive. Then, on both the master and slave computers, create the folder /replicated/usr/sbin on the underlying filesystem and, inside it, a symlink named sendmail that points to /usr/sbin/ssmtp. That way, whichever machine does not have the replicated drive mounted still resolves /usr/sbin/sendmail to ssmtp and can send local notifications.
After setting up Sendmail, setting up other services like Apache and PostgreSQL will seem like a breeze. Just remember to put all their data and configuration files on the /replicated drive and to create the appropriate symlinks.
Configuring Heartbeat
Heartbeat is designed to monitor your servers, and if your master server fails, it will start up all the services on the slave server, turning it into the master. To configure it, we need to specify which servers it should monitor and which services it should start when one fails.
Let's configure the services first. We'll take a look at the Sendmail we configured previously, because the other services are configured the same way. First, go to the directory /etc/heartbeat/resource.d. This directory holds all the startup scripts for the services Heartbeat will start up.
Now add a symlink from /etc/init.d/sendmail to /etc/heartbeat/resource.d.
Note: keep in mind that these paths may vary depending on your Linux distribution.
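Using the paths above, the symlink is created like this:

ln -s /etc/init.d/sendmail /etc/heartbeat/resource.d/sendmail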
With that done, set up Heartbeat to start up services automatically on the master computer and to turn the slave into the master if it fails. Listing 2 shows the file that does that, and in it, you can see we have only one line, which lists the different resources to be started on the given server, separated by spaces.
Listing 2. /etc/heartbeat/haresources
server1 IPaddr::192.168.1.5/24 datadisk::drbd0 sendmail
The first field, server1, defines which server should be the default master of these services; the second, IPaddr::192.168.1.5/24, tells Heartbeat to configure this as an additional IP address on the master server with the given netmask. Next, with datadisk::drbd0 we tell Heartbeat to mount this drive automatically on the master, and after that we list the names of all the services we want to start up; in this case, sendmail.
Note: these names should be the same as the filename for their startup script in /etc/heartbeat/resource.d.
Next, let's configure the /etc/heartbeat/ha.cf file (Listing 3). The main things you would want to change in it are the hostnames of the master/slave machine at the bottom, and the deadtime and initdead. These specify how many seconds of silence should be allowed from the other machine before assuming it's dead and taking over.
If you set this too low, you might have false positives, and unless you've got a system called STONITH in place, which will kill the other machine if it thinks it's already dead, you can have all kinds of problems. I set mine at two minutes; it's what has worked best for me, but feel free to experiment.
Also keep in mind two points. For the serial connection to work, you need to plug in a crossover serial cable between the machines. And if you don't use a crossover network cable between the machines, but instead go through a hub or switch shared with other Heartbeat nodes, you have to change the udpport for each master/slave node set, or your log file will fill with warning messages.
Listing 3. /etc/heartbeat/ha.cf
debugfile /var/log/ha-debug
logfile /var/log/ha-log
logfacility local0
keepalive 2
deadtime 120
initdead 120
serial /dev/ttyS1
baud 9600
udpport 694
udp eth0
nice_failback on
node server1
node server2
Now, all that's left to do is start your Heartbeat on both the master and slave server by typing:
/etc/init.d/heartbeat start
Once you've got that up and running, it's time to test it. You can do that by stopping Heartbeat on the master server and watching to see whether the slave server becomes the master. Then, of course, you might want to try it by completely powering down the master server or any other disconnection tests.
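For example, stop Heartbeat on the master and watch the slave's log as it claims the resources (the log path is the one configured in ha.cf):

/etc/init.d/heartbeat stop
tail -f /var/log/ha-log

On the slave, you should see it bring up the 192.168.1.5 address, mount the DRBD partition and start sendmail.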
Congratulations on setting up your redundant server system! And, remember, Heartbeat and DRBD are fairly flexible, and you can put together some complex solutions, including having one server being a master of one DRBD partition and a slave of another. Take some time, play around with them and see what you can discover.

The Beowulf Evolution

Second-generation Beowulf clusters offer single-process I/O space, thin slave nodes, GUI utilities and more for adaptability and manageability.

Imagine, for a moment, if you will, driving your car into a full-service gas station—a near anachronism—pulling up to the attendant and saying, “Fill'er up, check the oil and wipers, and...give me 20 more horsepower, would you?” The attendant, not fazed by the request, looks at you and says, “Would you like four-wheel drive with that? I hear it might snow tonight.” You think for a moment and respond positively—four-wheel drive would be good to have.
If only automobiles, and Beowulf clusters, were so adaptable. Yet, the single most important distinguishing feature of Beowulf 2 technology is adaptability—the ability to add more computing power to meet changing needs. To understand and appreciate how Beowulf technology has become so adaptable, an understanding of Beowulf 1 is in order.
The Roots of Beowulf
As we all know by now, the original concept for Beowulf clusters was conceived by Donald Becker while he was at NASA Goddard in 1994. The premise was that commodity computing parts could be used, in parallel, to produce an order-of-magnitude leap in computing price/performance for a certain class of problems. The proof of concept was the first Beowulf cluster, Wiglaf, which was operational in late 1994. Wiglaf was a 16-processor system with 66MHz Intel 80486 processors that were later replaced with 100MHz DX4s, achieving a sustained performance of 74Mflops/s (74 million floating-point operations per second). Three years later, Becker and the CESDIS (Center of Excellence in Space Data and Information Sciences) team won the prestigious Gordon Bell award. The award was given for a cluster of Pentium Pros assembled for SC'96 (the 1996 SuperComputing Conference) that achieved 2.1Gflops/s (2.1 billion floating-point operations per second). The software developed at Goddard was in wide use by then at many national labs and universities.
First-Generation Beowulf
The first generation of Beowulf clusters had the following characteristics: commodity hardware, open-source operating systems such as Linux or FreeBSD and dedicated compute nodes residing on a private network. In addition, all of the nodes possessed a full operating system installation, and there was individual process space on each node.
These first-generation Beowulfs ran software to support a message-passing interface, either PVM (parallel virtual machine) or MPI (message-passing interface). Message-passing typically is how slave nodes in a high-performance computing (HPC) cluster environment exchange information.
Some common problems plagued the first-generation Beowulf clusters, largely because the system management tools for controlling the new clusters did not scale well; they were more platform- or operating-system-specific than the parallel programming software. After all, Beowulf is all about running high-performance parallel jobs, and far less attention went into writing robust, portable system administration code. The following types of problems hampered early Beowulfs:
Early Beowulfs were difficult to install. There was either the labor-intensive, install-each-node-manually method, which was error-prone and subject to typos, or the more sophisticated install-all-the-nodes-over-the-network method using PXE/TFTP/NFS/DHCP—clearly getting all one's acronyms properly configured and running all at once is a feat in itself.
Once installed, Beowulfs were hard to manage. If you think about a semi-large cluster with dozens or hundreds of nodes, what happens when the new Linux kernel comes out, like the 2.4 kernel optimized for SMP? To run a new kernel on a slave node, you have to install the kernel in the proper space and tell LILO (or your favorite boot loader) all about it, dozens or hundreds of times. To facilitate node updates the r commands, such as rsh and rcp, were employed. The r commands, however, require user account management accessibility on the slave nodes and open a plethora of security holes.
It was hard to adapt the cluster: adding new computing power in the form of more slave nodes required fervent prayers to the Norse gods. To add a node, you had to install the operating system, update all the configuration files (a lot of twisty little files, all alike), update the user space on the nodes and, of course, all the HPC code that had configuration requirements of its own—you do want PBS to know about the new node, don't you?
It didn't look and feel like a computer; it felt like a lot of little independent nodes off doing their own thing, sometimes playing together nicely long enough to complete a parallel programming job.
In short, for all the progress made in harnessing the power of commodity hardware, there was still much work to be done in making Beowulf 1 an industrial-strength computing appliance. Over the last year or so, the Rocks and OSCAR clustering software distributions have developed into the epitome of Beowulf 1 implementations [see “The Beowulf State of Mind”, LJ May 2002, and “The OSCAR Revolution”, LJ June 2002]. But if Beowulf commodity computing was to become more sophisticated and simpler to use, it was going to require extreme Linux engineering. Enter Beowulf 2, the next generation of Beowulf.
Second-Generation Beowulf
The hallmark of second-generation Beowulf is that the most error-prone components have been eliminated, making the new design far simpler and more reliable than first-generation Beowulf. Scyld Computing Corporation, led by CTO Don Becker and some of the original NASA Beowulf staff, has succeeded in a breakthrough in Beowulf technology as significant as the original Beowulf itself was in 1994. The commodity aspects and message-passing software remain constant from Beowulf 1 to Beowulf 2. However, significant modifications have been made in node setup and process space distribution.
BProc
At the very heart of the second-generation Beowulf solution is BProc, short for Beowulf Distributed Process Space, which was developed by Erik Arjan Hendriks of Los Alamos National Lab. BProc consists of a set of kernel modifications and system calls that allows a process to be migrated from one node to another. The process migrates under the complete control of the application itself—the application explicitly decides when to move over to another node and initiates the process via an rfork system call. The process is migrated without its associated file handles, which makes the process lean and quick. Any required files are re-opened by the application itself on the destination node, giving complete control to the application process.
Of course, the ability to migrate a process from one node to another is meaningless without the ability to manage the remote process. BProc provides such a method by putting a “ghost process” in the master node's process table for each migrated process. These ghost processes require no memory on the master—they merely are placeholders that communicate signals and perform certain operations on behalf of the remote process. For example, through the ghost process on the master node, the remote process can receive signals, including SIGKILL and SIGSTOP, and fork child processes. Since the ghost processes appear in the process table of the master node, tools that display the status of processes work in the same familiar ways.
The elegant simplicity of BProc has far-reaching effects. The most obvious effect is the Beowulf cluster now appears to have a single-process space managed from the master node. This concept of a single, cluster-wide process space with centralized management is called single-system image or, sometimes, single-system illusion because the mechanism provides the illusion that the cluster is a single-compute resource. In addition, BProc does not require the r commands (rsh and rlogin) for process management because processes are managed directly from the master. Eliminating the r commands means there is no need for user account management on the slave nodes, thereby reducing a significant portion of the operating system on the slaves. In fact, to run BProc on a slave node, only a couple of dæmons are required to be present on the slave: bpslave and sendstats.
The Scyld Implementation
Scyld has completely leveraged BProc to provide an expandable cluster computing solution, eliminating everything from the slave nodes except what is absolutely required in order to run a BProc process. The result is an ultra-thin compute node that has only a small portion of Linux running—enough to run BProc. The power of BProc and the ultra-thin Scyld node, taken in conjunction, has great impact on the way the cluster is managed. There are two distinguishing features of the Scyld distribution and of Beowulf 2 clusters. First, the cluster can be expanded by simply adding new nodes. Because the nodes are ultra-thin, installation is a matter of booting the node with the Scyld kernel and making it a receptacle for BProc migrated processes. Second, version skew is eliminated. Version skew is what happens on clusters with fully installed slave nodes. Over time, because of nodes that are down during software updates, simple update failures or programmer doinking, the software on the nodes that is supposed to be in lockstep shifts out of phase, resulting in version skew. Since only the bare essentials are required on the nodes to run BProc, version skew is virtually eliminated.
Of course, having the ability to migrate processes to thin nodes is not a solution in itself. Scyld provides the rest of the solution as part of the special Scyld Beowulf distribution, which includes features such as:
* BeoMPI: a message-passing library that meets the MPI standard, is derived from the MPICH (MPI Chameleon) Project from Argonne National Lab and is modified specifically for optimization with BProc.
* BeoSetup: a GUI for creating BeoBoot floppy boot images for slave nodes.
* Beofdisk: a utility for partitioning slave node hard drives.
* BeoStatus: a GUI for monitoring the status of the cluster.
Let's take a look at how to use these tools while building a Scyld Beowulf cluster.
You can purchase the Scyld Beowulf Professional Edition (www.scyld.com) that comes with a bootable master node installation CD, documentation and one year of support. The Professional Edition is spectacular and supports many advanced cluster software tools such as the parallel virtual filesystem (PVFS). Alternatively, you can purchase a Scyld Basic Edition CD for $2.95 at Linux Central (www.linuxcentral.com). The Basic Edition is missing some of the features present in the Professional Edition and arrives without documentation or support. I've built clusters with both without any problems.
It's important that you construct your Beowulf similar to Figure 1, which illustrates the general Beowulf (1 and 2) layout. The master node has two network interfaces that straddle the public network and the private compute node LAN. Scyld Beowulf assumes you've configured the network so that eth0 on the master is the public network interface and eth1 is the interface to the private compute node network. To begin the installation, take your Scyld CD, put it in the master node's CD drive and power-cycle the machine.

Figure 1. General Beowulf Layout
You'll discover that the Scyld Beowulf master node installation is almost identical to a Red Hat Linux installation. At the boot prompt, type install to trigger a master node installation. Allowing the boot prompt to time out will initiate a slave node installation by default.
Step through the simple installation procedure as you would for Red Hat Linux. For first-time cluster builders, we're going to recommend (and assume here) that you select a GNOME controller installation instead of a text-console-only installation. Choosing the GNOME installation will give you access to all the slick GUI Beo* tools integrated into the GNOME desktop environment that make building the rest of the cluster a snap.
After the typical configuration of eth0, you'll come upon the key difference with the Scyld installation: the configuration of eth1 on the master and the IP addresses of the compute nodes. The installation will prompt you for an IP address (like 192.168.1.1) for eth1 and an IP address range (such as 192.168.1.2 through 192.168.1.x) for your compute nodes. Simple enough, but make sure you select an IP range large enough to give each of your compute nodes its own address.
Continue through the remaining installation steps, such as X configuration. For simplicity's sake, select the graphical login option. Wrap up the master node installation by creating a boot disk, removing the CD (and the boot disk) and rebooting the master node.
Log in to the master as root and the Scyld-customized GNOME desktop is fired up for you, including the BeoSetup and BeoStatus GUIs and a compute node quick install guide.
Initially, all compute nodes require a BeoBoot image to boot, either on a floppy or the Scyld CD. Rather than move the Scyld CD from node to node, I prefer to create several slave node boot images on floppies, one for each slave node. The BeoBoot images are created with the BeoSetup tool by clicking the Node Floppy button in BeoSetup. Insert a blank formatted floppy into the master's floppy drive; click OK to create the BeoBoot boot image and write it to the floppy. Repeat this for as many floppies as you like. Insert the boot floppies into the slave node floppy drives and turn on the power.
What happens next is pretty cool but is hidden from the user (unless you hook up a monitor to a slave node). Each slave node boots the BeoBoot image, autodetects network hardware, installs network drivers and then sends out RARP requests. These RARP requests are answered by the Beoserv dæmon on the master, which in turn sends out an IP address, kernel and RAM disk to each slave node. This process, where the slave node bootstraps itself with a minimal kernel on a floppy disk, which is then replaced with a final, more sophisticated kernel from the master, is dubbed the Two Kernel Monte. The slave node then reboots itself with the final kernel and repeats the hardware detection and RARP steps. Then the slave node contacts the master to become integrated into BProc.
During the kernel monte business, slave node Ethernet MAC addresses will appear in the Unknown Addresses window in BeoSetup on the master. You can integrate them into your cluster by highlighting the addresses, dragging them into the central Configured Nodes column and clicking the Apply button. Once the master finishes integrating the slave nodes into BProc, the nodes will be labeled “up”. Node status will appear in BeoStatus as well.
You can partition the slave node hard drives with the default configuration in /etc/beowulf/fdisk:
beofdisk -d
beofdisk -w
The -d option tells beofdisk to use the default configuration in /etc/beowulf/fdisk and the -w option writes the tables out to all the slave nodes. You then need to edit /etc/beowulf/fstab to map the swap and / filesystems to the new partitions. Simply comment out the $RAMDISK line in /etc/beowulf/fstab that was used to mount a / filesystem in the absence of a partitioned hard drive, and edit the next two lines to map the swap and / filesystems to /dev/hda2 and /dev/hda3 (/dev/hda1 is reserved as a bootable partition). If you would like to boot from the hard drive, you can write the Beoboot floppy image to the bootable partition like this:
beoboot-install -a /dev/hda1
You'll have to add a line in /etc/beowulf/fstab after doing this:
/dev/hda1 beoboot ext2 defaults 0 0
Reboot all slave nodes for the partition table changes to take effect:
bpctl -S all -s reboot
It doesn't get much easier than that. Unlike Beowulf 1, building a Scyld Beowulf requires a full Linux installation on only the master node. Nothing is written out to permanent storage on the slave nodes during their installation, making them ultra-thin, easily maintained and quick to reboot.
To test your cluster, you can run the high-performance Linpack benchmark included with the distribution by typing linpack at the command line.
For a little flashier demonstration, launch a visual Mandelbrot set with the included mpi-mandel application. Starting mpi-mandel on five nodes from the command line would look like:
NP=5 mpi-mandel
Collectively, the single-process ID space, the ability to quickly migrate processes under the control of the application, thin slave nodes and the GUI utilities for building and monitoring a Scyld cluster provide a cluster solution that distinguishes itself from Beowulf 1 by its completeness, adaptability and manageability. So, the answer is yes, you really can add more horsepower to that cluster.
Acknowledgements
The authors would like to thank Donald Becker, Tom Quinn and Rick Niles of Scyld Computing Corporation and Erik Arjan Hendriks of Los Alamos National Lab for patiently answering all our questions related to second-generation Beowulf.