Hi,
For the last few months I have been working on a system that supports global failover and has no single points of failure, without any special hardware or special hosting.
Here is our current setup:
3 (or 4) servers in 2 or more data centres (e.g. England and the Netherlands):
2 servers running the full version of cPanel (M(aster)1 and M2).
1 or 2 servers running Nagios and the DNS-only cluster version of cPanel (S(urveillance)3 and S4).
The current plan consists of several layers:
1. Database replication (MySQL and PostgreSQL)
2. Filesystem and e-mail replication.
3. Configuration replication.
4. Detection of failure
5. Domain switchover
Before I get into details, I would like to tell you a little about the background of our setup:
We run a sophisticated online software package that provides a broad array of tools and applications, as well as the usual hosting of websites and e-mail. Our customers have front-end access only, but the system we designed is also made to accommodate users who upload their own scripts.
Our service is made available from multiple locations, so that if one data centre goes down the customer can immediately switch to another one. The customer may also switch manually if their particular route to our hosts is "jammed".
Our setup therefore comes in two flavours: 1 primary domain for all applications + 1 domain per customer for their website etc.
The primary domain must always be available and accessible and have 99.99%+ uptime, regardless of major backbone failures etc.
So we have decided on the following course of action:
1: 2 servers actively replicating so they are always in sync with each other. The primary domain example.org has three pointers in it:
1. www1.example.org (manual access to M1)
2. www2.example.org (manual access to M2)
3. (www.)example.org (chooses M1 if both are available and M2 if M1 is down.)
This allows a customer to manually choose another server if DNS caching is a bit on the eager side, while everyone else is switched over automatically.
It should be mentioned that this is meant as an HA setup, NOT a cluster or a load-balancing solution.
The above is for our primary application domain; our customers may choose to have the same sort of setup for an extra charge. If not, they are only reachable through the automated switchover (no manual choices).
So, enough about the background. Let's cut to the chase:
1. Database replication (MySQL and PostgreSQL)
2. Filesystem and e-mail replication.
3. Configuration replication (cPanel).
4. Detection of failure
5. Domain switchover
1. This one is fairly easy. For MySQL, a simple master-master replication scheme is set up (thanks, howtoforge.com), and for PostgreSQL, PGCluster will be set up. If you need more details about these processes, feel free to ask. All applications always use localhost when they access the database, to make sure that no problems arise with DNS etc.
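For those who do ask, the MySQL side follows the usual master-master recipe. A minimal sketch is below; the server-ids, hostnames, replication user/password and binlog file/position are placeholders, so treat it as an outline rather than a drop-in config:

/etc/my.cnf on M1, [mysqld] section (placeholder values):
  server-id                = 1
  log-bin                  = mysql-bin
  auto_increment_increment = 2    # two masters writing, so step auto-increment ids by 2
  auto_increment_offset    = 1    # M2 uses server-id = 2 and offset = 2

On each master, create a replication user for the other box and point the slave thread at it:
  GRANT REPLICATION SLAVE ON *.* TO 'repl'@'m2.example.org' IDENTIFIED BY 'secret';
  -- on M2 (and the mirror image on M1), using the file/position from SHOW MASTER STATUS on M1:
  CHANGE MASTER TO MASTER_HOST='m1.example.org', MASTER_USER='repl',
    MASTER_PASSWORD='secret', MASTER_LOG_FILE='mysql-bin.000001', MASTER_LOG_POS=98;
  START SLAVE;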
2. Filesystem and mail turned out to be the most difficult to solve, and we spent many months looking for a solution.
Before I tell you how we solved it, have a look at what we tried and rejected:
1. NAS - too costly if you need it without a single point of failure, and it requires a lot of bandwidth and low latency, as all files must be read from an external server.
2. NFS - as above.
3. Replicated block device / filesystem (DRBD) - again a problem with latency and the sheer amount of data that is sent back and forth (especially with large files).
4. rsync - not real-time enough out of the box; it caused very high loads when replicating 100k+ files every x minutes.
So our requirements are:
Close to "realtime" syncing.
Minimal CPU and bandwith use.
Must travel securly over the open net.
Must not delay or slow down applications.
Must allow long latency (50-100ms+) and still work quickly (10 s or less for a small file)
Must be two way.
After a lot of work we finally managed to put a script together that uses a combination of inotify, Perl and rsync, plus some nifty SSH, to get it all working.
Basically, the Perl script uses inotify to be alerted whenever a file is changed; when that happens it triggers rsync to sync that file to the other server, and vice versa.
As the system only inspects and syncs changed files, it keeps bandwidth and CPU use to a minimum.
As both servers have a local copy of each and every file, security is high and performance is unscathed.
I will be putting my script on SourceForge soon; it was loosely based on the brilliant iWatch script.
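Until it is up on SourceForge, here is a stripped-down, one-way sketch of the idea (the watched path, the peer hostname and the SSH key are placeholders; the real script also adds watches for newly created directories, handles deletes and prevents sync loops, none of which is shown here). Run the mirror image on the other server to make it two-way:

#!/usr/bin/perl
# Minimal sketch: watch a tree with inotify, rsync each changed file to the peer.
use strict;
use warnings;
use Linux::Inotify2;
use File::Find;

my $root = '/home';              # tree to replicate (placeholder)
my $peer = 'm2.example.org';     # the other master (placeholder)

my $inotify = Linux::Inotify2->new or die "inotify: $!";

# Put a watch on every directory under $root (a real script would also
# watch directories that are created later on).
find(sub {
    return unless -d $File::Find::name;
    $inotify->watch($File::Find::name,
        IN_CLOSE_WRITE | IN_CREATE | IN_MOVED_TO, \&changed)
        or warn "watch $File::Find::name: $!";
}, $root);

sub changed {
    my $event = shift;
    my $file  = $event->fullname;
    return if -d $file;          # only push individual files from here
    # Push just this one file over ssh; -a-style flags keep owner/permissions/times.
    system('rsync', '-az', '-e', 'ssh -i /root/.ssh/replica_key',
           $file, "$peer:$file") == 0
        or warn "rsync of $file failed: $?";
}

1 while $inotify->poll;          # block and dispatch events forever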
3. This is what I see as the hardest part, and to be frank it is why I am writing this post. cPanel has a wealth of files it edits whenever a new user or a new e-mail account is created. Syncing these files between two servers is very hard, as each file contains unique IPs and other server-specific entries.
I see three different ways of doing this:
1. Have cPanel Inc. change the cPanel code so that it replicates to multiple servers (pretty please).
2. Create a syncing app that syncs the files and rewrites IPs etc. in the process.
3. Create a PHP config frontend that uses cURL to perform all management actions on both servers. This would, however, require you to disable customer access to cPanel (a rough sketch of this approach follows below).
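To give an idea of what I mean by option 3: every management action goes through a small frontend that calls the WHM API (json-api on port 2087, authenticated with the remote access hash) on both masters, so nothing needs to be synced afterwards. The sketch below is in Perl rather than PHP simply because that is what the rest of our scripts use; it only covers account creation, the hostnames, hash file and account details are placeholders, and the hard part, rolling back when only one of the two servers succeeds, is not shown:

#!/usr/bin/perl
# Sketch: perform one WHM action (createacct) on both masters via the json-api.
use strict;
use warnings;
use LWP::UserAgent;
use URI::Escape;

my @servers = ('m1.example.org', 'm2.example.org');    # placeholders

# Remote access hash from WHM (Setup Remote Access Key), newlines stripped.
open my $fh, '<', '/root/whm-accesshash' or die "accesshash: $!";
my $hash = do { local $/; <$fh> };
$hash =~ s/\n//g;

my %acct  = (username => 'newuser', domain => 'newuser-example.com',
             password => 'ChangeMe123', plan => 'default');
my $query = join '&', map { "$_=" . uri_escape($acct{$_}) } keys %acct;

my $ua = LWP::UserAgent->new(ssl_opts => { verify_hostname => 0 });

for my $host (@servers) {
    my $res = $ua->get("https://$host:2087/json-api/createacct?$query",
                       'Authorization' => "WHM root:$hash");
    # A real frontend must parse the JSON reply and roll back / alert
    # if the account only got created on one of the two servers.
    print "$host: ", $res->status_line, "\n";
}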
If anybody has done something like this before, please let me know, as this is the last piece of the puzzle I need!
Points 4 and 5 have two different solutions. The simplest is to use an external DNS failover provider; the other is to set up a monitoring system and do it yourself, which gives much more flexibility and features as well as better warnings.
4. Detecting the failure
We use Nagios for monitoring all our services. We have multiple plugins that check ping, service availability and disk space, as well as running real tests on real web pages.
If a service fails we are warned by SMS, and we can have the server trigger various failsafe programs.
Nagios can be set up in a single-server mode or a failsafe two-server mode (perhaps a bit of overkill, but if you need 100% then be my guest).
One of these programs is a DNS switchover program.
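For those who want to do the same, the glue between the two is a Nagios event handler: it is attached to the service check and only acts on a HARD state, so a single dropped ping does not flip the DNS. A rough sketch follows (the command name, paths and the switchover script are placeholders; the switchover script itself is what point 5 describes):

#!/usr/bin/perl
# Event handler called by Nagios, wired up roughly like this in the
# object configuration (placeholder names):
#
#   define command {
#       command_name  handle-master-down
#       command_line  /usr/local/bin/handle-master-down.pl $SERVICESTATE$ $SERVICESTATETYPE$
#   }
#   define service {
#       ...
#       event_handler handle-master-down
#   }
use strict;
use warnings;

my ($state, $state_type) = @ARGV;

# Only act on confirmed (HARD) failures, never on the first soft retry.
exit 0 unless defined $state_type && $state_type eq 'HARD';

if ($state eq 'CRITICAL') {
    # M1 is confirmed down: point example.org at M2 (see point 5).
    system('/usr/local/bin/dns-switchover.pl', 'to-m2');
}
elsif ($state eq 'OK') {
    # Service recovered: move the domain back to M1.
    system('/usr/local/bin/dns-switchover.pl', 'to-m1');
}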
5. Switching over DNS names
Let's face it: when a user wants to use our application they go to example.org; if it is down they call support, and they probably do not try www1, www2 or any other variant. Hence we must make example.org point the right way.
We do this in a two-step process:
1. Change the info in the DNS.
2. Replicate the info out into the world.
1. This is done using a specially crafted script that runs on the Nagios server (which is also a cPanel DNS-only server). It opens all the zone .db files and replaces the old server's IP address with the new one, for the main record and all subdomains, and bumps the serial number (a rough sketch of such a script is shown after this list).
2. The script then triggers the cPanel DNS cluster sync script. With low enough DNS TTLs the domain is switched over in less than 5 minutes.
3. When the problem is resolved, the system points the domain back at master 1.
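As promised, a rough sketch of the zone-rewriting script from step 1 (the IP addresses and paths are placeholders; on a cPanel DNS-only box the zone files normally live in /var/named, and the cluster-sync call at the end should be replaced with whatever your cPanel version actually provides; the real script also logs and notifies):

#!/usr/bin/perl
# Sketch: flip every zone from the failed master's IP to the standby's IP,
# bump the serials, then push the zones out through the DNS cluster.
use strict;
use warnings;

# Switching to M2 means: replace M1's IP with M2's (placeholder addresses).
my %ip = ('to-m2' => { old => '192.0.2.10', new => '192.0.2.20' },
          'to-m1' => { old => '192.0.2.20', new => '192.0.2.10' });

my $dir  = '/var/named';                    # zone files on the cPanel DNS-only box
my $move = $ip{ $ARGV[0] || '' } or die "usage: $0 to-m1|to-m2\n";

for my $zone (glob "$dir/*.db") {
    open my $in, '<', $zone or next;
    my $text = do { local $/; <$in> };
    close $in;

    next unless $text =~ /\Q$move->{old}\E/;            # nothing to change here

    $text =~ s/\Q$move->{old}\E/$move->{new}/g;         # main record + all subdomains
    $text =~ s/^(\s*)(\d+)(\s*;\s*serial)/$1 . ($2 + 1) . $3/mei;  # bump the SOA serial

    open my $out, '>', $zone or die "write $zone: $!";
    print {$out} $text;
    close $out;
}

# Finally tell cPanel to push the changed zones to the rest of the DNS
# cluster (placeholder command; substitute your own cluster-sync call).
system('/scripts/dnscluster', 'synczones');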
I have not gone into great detail in this post, because we are not entirely done; right now we are still looking for a solution to problem three. Feel free, however, to post replies and ask questions, and I will nudge you in the right direction.
But again, if anyone has a suggestion for problem three, please let me know.
Cheers