Friday, May 25, 2012

The Case of the Missing Switches

One of my clients had me set up a new Cisco network side by side with their existing HP ProCurve network last fall.  The two networks are connected by a gigabit Ethernet link, and the Cisco core (4507R+E) serves as the default gateway for all of the VLANs on the Cisco side and for the one legacy VLAN on the HP ProCurve side.  Everything is working fine for normal network clients on the various VLANs.

Recently, the network administrator at the client site found time to connect to the switches to learn more about the config.  In doing so, he discovered that the access layer switches (2960S) are not reachable from any device that is not on the same management VLAN as their management IPs.  The core switch, which is on the same VLAN, is reachable from any other VLAN.  Right now the only ways to reach the access switches are to SSH from the core or to place a workstation on the management VLAN.

So far I have ruled out/checked out the following:


  • Duplicate IPs
  • Network Loops
  • The management VLAN is trunked properly across the 10Gbps uplinks to the access switches.
  • The management VLAN is in the core's routing table.
  • The access switches are listed in both the MAC address table and the ARP table of the core switch.
  • The SVI for the management VLAN is UP/UP
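
For reference, most of the checks in the list above can be repeated from the core with standard IOS show commands.  This is only a sketch: the CORE prompt and VLAN 99 are placeholders I picked for illustration, so substitute the real management VLAN ID.

```
! VLAN 99 and the CORE hostname are placeholders, not the client's real values.
! Is the management VLAN allowed and forwarding on the uplink trunks?
CORE#show interfaces trunk
! Spanning-tree state for the management VLAN (helps spot loops or blocked uplinks).
CORE#show spanning-tree vlan 99
! Is the management subnet present as a connected route?
CORE#show ip route connected
! Are the access switches learned in the MAC address and ARP tables?
CORE#show mac address-table vlan 99
CORE#show ip arp vlan 99
! Is the SVI up/up?
CORE#show ip interface brief | include Vlan99
```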

I'm looking for ideas on where else to look for the cause of this behavior.  Thanks in advance for any help, and I promise to post the solution when I find it.

Friday, March 30, 2012

Backup Your VLAN Database

A junior admin at XYZ corporation was tasked with adding a switch to the XYZ network.  He grabbed a previously used spare switch out of stock.  After he plugged in the switch, most users complained that they couldn't connect to company resources over the network.  Your manager has tasked you with determining the cause of the problems and fixing them.

Sounds like a test question, doesn't it?  Well, unfortunately it happens often enough in real production networks.  A new switch is added with VTP server mode turned on and a higher configuration revision number than the current VLAN database.  If VTP is enabled on the production switches and the new switch is in the same VTP domain, a totally bogus VLAN database can be propagated to the network.  While there are plenty of ways to prevent this from happening, even the best network team can occasionally have a bad day.
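
As one example of prevention, a common precaution is to reset the configuration revision on a reused switch before it is cabled into production.  The sketch below is an illustration, not a complete procedure: moving the switch to VTP transparent mode resets its revision number to 0, and show vtp status lets you confirm the Configuration Revision field before connecting it.

```
! Run this on the spare switch while it is still disconnected from production.
SWITCH#configure terminal
! Changing to transparent mode resets the configuration revision to 0.
SWITCH(config)#vtp mode transparent
SWITCH(config)#end
! Check the "Configuration Revision" line in the output.
SWITCH#show vtp status
```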

Cisco's EEM provides a handy way of backing up your vlan.dat file so that you can quickly and relatively easily restore your VLAN database.

! Determines the user that the script runs as.  If you use TACACS+ command authorization this is important.
event manager session cli username "user"
event manager applet backup-vlan
 ! Schedules the script to run at 23:00 every day.  maxrun is specified in seconds.
 event timer cron cron-entry "0 23 * * *" maxrun 60000
 action 1 cli command "enable"
 action 2 cli command "configure terminal"
 ! Eliminates the "Are you sure?" prompts.
 action 3 cli command "file prompt quiet"
 action 4 cli command "end"
 ! Copies vlan.dat to an SCP server.  On platforms that keep vlan.dat in flash, use flash:vlan.dat instead.
 action 5 cli command "copy const_nvram:/vlan.dat scp://user:password@FQDN/vlan.dat"
 action 6 cli command "configure terminal"
 ! Restores the "Are you sure?" prompts.
 action 7 cli command "no file prompt quiet"
 action 8 cli command "end"
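
Once the applet is configured, it is worth confirming that EEM has registered it and, after the first scheduled run, that it actually fired.  A minimal sketch using standard EEM show commands (the SWITCH prompt is just a placeholder):

```
! Lists registered applets; backup-vlan should appear with its timer cron event.
SWITCH#show event manager policy registered
! Shows recently triggered EEM events, including the applet's nightly run.
SWITCH#show event manager history events
```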

Sunday, March 11, 2012

Automatic Recovery for Err-disabled Interfaces

There are four primary states for interfaces on Cisco switches: up, down, administratively down and err-disabled.  Up and down are fairly self-explanatory.  Administratively down means that an administrator has shut the port down with the shutdown command in the CLI.  Err-disabled, though, can be a bit baffling to a new network engineer.

The err-disabled interface state can be caused by many situations including:

  • Bad cabling
  • Duplex mismatch
  • BPDU guard violation
  • Port-Security violation
  • Link-flap detection
The complete list is on Cisco's site.

An engineer can recover an interface by entering configuration mode for the interface and issuing the shutdown and then no shutdown commands.  By default, the interface will remain err-disabled until a human intervenes, because auto recovery is disabled for every cause, as shown by the following show command.

SWITCH#show errdisable recovery
ErrDisable Reason            Timer Status
-----------------            --------------
arp-inspection               Disabled
bpduguard                    Disabled
channel-misconfig (STP)      Disabled
dhcp-rate-limit              Disabled
dtp-flap                     Disabled
gbic-invalid                 Disabled
inline-power                 Disabled
l2ptguard                    Disabled
link-flap                    Disabled
mac-limit                    Disabled
loopback                     Disabled
pagp-flap                    Disabled
port-mode-failure            Disabled
pppoe-ia-rate-limit          Disabled
psecure-violation            Disabled
security-violation           Disabled
sfp-config-mismatch          Disabled
small-frame                  Disabled
storm-control                Disabled
udld                         Disabled
vmps                         Disabled

Timer interval: 300 seconds

Interfaces that will be enabled at the next timeout:
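
The manual recovery described before the output above can be sketched as follows.  The interface name GigabitEthernet1/0/1 is a placeholder chosen for illustration; the first command lists the ports that are currently err-disabled so you can pick the right one.

```
! List err-disabled ports and the reason each one was disabled.
SWITCH#show interfaces status err-disabled
SWITCH#configure terminal
! GigabitEthernet1/0/1 is a placeholder; use the affected interface.
SWITCH(config)#interface GigabitEthernet1/0/1
SWITCH(config-if)#shutdown
SWITCH(config-if)#no shutdown
SWITCH(config-if)#end
```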

In some cases it is safe to let the switch automatically recover the interface once the condition that caused the err-disabled state has cleared.  For this example, let's assume that a port-security violation caused the error (psecure-violation).  This is a relatively benign error to auto recover because, if the violation still exists, port security will quickly trip again and put the interface back into err-disabled.  Once recovery is enabled for a cause, the switch tries to bring the interface back after the timer interval, which defaults to 300 seconds (5 minutes).  To have the switch auto recover the interface, add the following configuration.

SWITCH#configure terminal
! Default interval shown for completeness.
SWITCH(config)#errdisable recovery interval 300
SWITCH(config)#errdisable recovery cause psecure-violation
SWITCH(config)#end
SWITCH#copy running-config startup-config

Similar commands can be entered for the other reasons listed in the show command output above, or you can enable recovery for every reason by using the keyword all.  Be careful where you enable auto recovery; it might not be your friend on all switches.  For example, you wouldn't want a problem link on a core switch to start flapping because of auto recovery, causing a network convergence every 5 minutes (or whatever you set the timer to).
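
As a rough sketch of the all keyword, the commands below enable recovery for every cause and then verify the result.  Treat this as an illustration for a switch where blanket recovery is acceptable, not a recommendation for every device.

```
SWITCH#configure terminal
! Enables timed recovery for every err-disable cause.
SWITCH(config)#errdisable recovery cause all
SWITCH(config)#end
! Timer Status should now read Enabled for each reason.
SWITCH#show errdisable recovery
! Shows any ports that are still err-disabled and why.
SWITCH#show interfaces status err-disabled
```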