Network Monitoring and Management (Not just PING anymore)

How times have changed. In the past, network management was all about polling or pinging a device to determine whether it was alive and reachable. That is still a requirement, but there is much more to do to ensure your network is running optimally. A perfect example is the ever-popular dual internet circuit scenario. More and more companies today have two internet connections and attempt to use both of them through techniques such as load balancing or load sharing (see previous post on load sharing).

There are two popular methods that we run into quite often. The first is the Multilink connection, where a carrier provides two identical circuits and distributes the load across them equally. In this configuration your network is not impervious to failure: you are most likely using the same cable run to reach the carrier's POP (point of presence), and because both circuits come from the same carrier, any failure within that carrier would affect both. It does, however, allow you to maximize the use of the available bandwidth. In this case the traditional method of polling (ping or SNMP poll) is not good enough. If only one of the circuits goes down, the poll is still successful and therefore no alarms are generated. Users may experience slower response, but many times this goes unreported.

The second popular method is to use BGP (Border Gateway Protocol). In short, BGP can allow you to run in a fully redundant configuration, where the IP addresses assigned to one circuit are routed over the second circuit if the first one fails. I won’t go into all the details of BGP in this post, as it is a relatively complex protocol and requires quite a bit of carrier involvement to deploy. As with the Multilink connection, when one circuit goes down, polling is still successful and no alarms are generated.

The issue in both cases is that if the failure goes unnoticed, you can run on a single circuit for days, weeks, even months. Eventually the second circuit will fail and you will be completely down, which is exactly what you were trying to avoid in the first place.

The solution to both is quite simple to deploy, and depending on the hardware involved there is more than one way to achieve it. How the carrier has deployed the circuits will largely determine which method you use.

A very rudimentary method is to poll the WAN interface IPs of each of the connections from an external source. Do not attempt to do this from inside the network, as it is not guaranteed that the poll will fail when the circuit goes down.
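As a rough illustration of the external poll described above, here is a minimal Python sketch meant to run on a host outside your network. The circuit names, the addresses (taken from the documentation ranges), and the helper names `ping_once`, `check_wan_links`, and `summarize` are all hypothetical, and the `ping` flags assume a Linux-style `ping` command. The point is only that each WAN IP is checked separately, so losing one circuit raises a warning instead of passing silently.

```python
import subprocess
from typing import Callable, Dict


def ping_once(ip: str, timeout_s: int = 2) -> bool:
    """Send one ICMP echo with the system ping (Linux-style flags assumed)."""
    result = subprocess.run(
        ["ping", "-c", "1", "-W", str(timeout_s), ip],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0


def check_wan_links(
    links: Dict[str, str],
    probe: Callable[[str], bool] = ping_once,
) -> Dict[str, bool]:
    """Probe every WAN interface IP on its own and record up/down per circuit."""
    return {name: probe(ip) for name, ip in links.items()}


def summarize(status: Dict[str, bool]) -> str:
    """Turn per-circuit results into one alarm line."""
    down = sorted(name for name, up in status.items() if not up)
    if not down:
        return "OK: all circuits reachable"
    if len(down) == len(status):
        return "CRITICAL: all circuits down"
    # The case a single reachability poll misses: still online, but no backup.
    return "WARNING: redundancy lost, down: " + ", ".join(down)


if __name__ == "__main__":
    # Hypothetical WAN interface addresses; replace with your own.
    wan_links = {"circuit_a": "192.0.2.1", "circuit_b": "198.51.100.1"}
    print(summarize(check_wan_links(wan_links)))
```

The probe is passed in as a parameter so that the same logic can be exercised with a different check (or a fake one during a failover drill) without touching the alarm logic.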

Using an SNMP management system and enabling traps can be an effective method, provided the hardware supports the generation of specific traps that can alarm on route changes, interface state changes, and BGP or Multilink state changes. On many devices traps cannot be limited to forward only the information you require, and your management system will get flooded with messages that are of no use. If traps are not supported on your device, or cannot be limited to the information you require, look into using SNMP to poll the device for the same type of information.
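If you go the SNMP polling route, one way to avoid the flood that unfiltered traps produce is to compare each poll with the previous one and alarm only on transitions. The Python sketch below assumes you already collect `ifOperStatus` (IF-MIB, OID 1.3.6.1.2.1.2.2.1.8) per interface and `bgpPeerState` (BGP4-MIB, OID 1.3.6.1.2.1.15.3.1.2) per peer with whatever SNMP tool you use; the collection step is left out, and the function name `state_changes`, the interface names, and the peer address are hypothetical.

```python
from typing import Dict, List

# Value names as defined in IF-MIB (ifOperStatus) and BGP4-MIB (bgpPeerState).
IF_OPER_STATUS = {
    1: "up", 2: "down", 3: "testing", 4: "unknown",
    5: "dormant", 6: "notPresent", 7: "lowerLayerDown",
}
BGP_PEER_STATE = {
    1: "idle", 2: "connect", 3: "active",
    4: "opensent", 5: "openconfirm", 6: "established",
}


def state_changes(
    previous: Dict[str, int],
    current: Dict[str, int],
    names: Dict[int, str],
) -> List[str]:
    """Report only the entries whose polled value changed since the last poll."""
    changes = []
    for key in sorted(current):
        old, new = previous.get(key), current[key]
        # Skip entries seen for the first time and entries that did not change.
        if old is not None and old != new:
            changes.append(f"{key}: {names.get(old, old)} -> {names.get(new, new)}")
    return changes


if __name__ == "__main__":
    # Hypothetical poll results: one member of a multilink bundle drops.
    last_poll = {"Serial0/0": 1, "Serial0/1": 1}
    this_poll = {"Serial0/0": 1, "Serial0/1": 2}
    for line in state_changes(last_poll, this_poll, IF_OPER_STATUS):
        print("ALARM", line)
```

The same function works for BGP sessions by passing `BGP_PEER_STATE` and a mapping of peer address to polled state, so a session that falls out of `established` is reported even though the device itself still answers a ping.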

Be sure to test not just your failover configuration, but also the effectiveness of your monitoring and reporting system. There is nothing worse than investing in a redundant architecture only to find yourself explaining to your management team why it didn’t work as promised.
