
Monday, July 8, 2013

How to Waste an Entire Day Troubleshooting One Phone... and Then Fix It in Less Than 5 Seconds

So last week was a short one for me with the combination of the 4th of July holiday and some vacation time.  Unfortunately the last working day ended up being a rather long and frustrating day.  

Around 9am a ticket came in that one of the managers' phones was dead.  Keeping in mind that my phone system is an ancient Rolm/Siemens 9751 system, I started with the assumption of hardware failure.  My first step was to try a known good phone on the line.  Nope, still dead as a doornail.  I also tried moving the line to a different port on the system, fearing that the line card could be the problem, but that didn't change anything either.

By this time I was figuring on a cabling issue.  Over the next 4 hours or so I toned and re-terminated the cabling from the phone to the PBX.  Normally this wouldn't take that long, but I kept getting tone all the way to the phone while the phone itself still refused to work, so I kept repeating the process and segmenting the run.  Finally I believed I had isolated the problem to the stretch between the last IDF and the phone jack itself.

At this point in the day, the manager had already left, so I had to get the plant engineering director to unlock the office for me.  My plan was to re-terminate the phone jack, hoping there was a short in the wall jack.  The plant engineering director was a bit intrigued by my day of troubleshooting, so he stuck around to help me move the desk out of the way.  As we were doing this, he started to trace where the line left the room and noticed that it went through a box on the wall with a switch.  I flipped the switch and, bam, the phone worked again.  He explained that he recognized the switch as an old line switch (used to switch a single-line phone from one line to another) because his father used to work for the phone company.

So that others may learn from my day of troubleshooting, here is what the switch looked like after I added some labels for the future.  My lesson: expect the unexpected.

Phone Switch

Wednesday, June 5, 2013

Neuron: Using the ESXi CLI to Fix a VMK0 Mistake

In VMware ESXi, the host's management traffic is carried by vmk0, a virtual VMkernel interface.  This morning, while troubleshooting another vmk* interface because of a vMotion problem, I accidentally changed the dvsPortGroup (VLAN) on vmk0.  As soon as that took effect, the host could no longer be seen by vCenter.  Thankfully the guest VMs continued to run without any failure.

Now came a chicken-and-egg problem.  I needed to change the dvsPortGroup on vmk0 back, but I couldn't access the host through vCenter until vmk0 was back online.  This led me to Google to find a way to accomplish the same thing using the CLI on the individual host.  This article pointed me in the right direction for the commands.

What I ended up doing was the following:

1. Look up the DVPort number with the esxcfg-vmknic -l command.  The DVPort currently used by each vmk* interface is easy to spot in that output.

2. Look up the DVPort number of a free port in the proper port group on the distributed vSwitch (in our case a Nexus 1000V) using vCenter.
3. Delete the existing vmk* NIC with the command:

esxcfg-vmknic -d -s DVSwitch_name -p DVPort

4. Recreate the vmk* NIC with the command below, using the free DVPort found in step 2.

esxcfg-vmknic -a -s DVSwitch_name -p DVPort -i IPAddress -n NetMask
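
To put the whole sequence together, here is a sketch with made-up values (the DVS name, DVPort numbers, IP address, and netmask below are placeholders; substitute the ones from your own environment):

# Note the DVPort that vmk0 currently occupies
esxcfg-vmknic -l

# Remove the misconfigured vmk0 from its current DVPort
esxcfg-vmknic -d -s "dvSwitch01" -p 101

# Recreate vmk0 on the free DVPort in the correct port group with its original IP settings
esxcfg-vmknic -a -s "dvSwitch01" -p 205 -i 10.1.1.50 -n 255.255.255.0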

At this point I had my vmk0 back with the proper IP and VLAN, so I was able to reconnect the host to vCenter and all was well.  The moral of the story: be careful what you're clicking on.

Friday, June 29, 2012

Case of the Missing Switch -- Solved

A month or so ago I posted a plea for help in reference to a strange problem I was having with a customer's switches.  The basic problem was that the network worked fine except that the management interfaces on the IDF switches could only be pinged or connected to from the MDF switch.

One of my astute readers, Eric Lund, hit the nail on the head with his comment:

Is ip default-gateway x.x.x.x set on the switch? It should be on the same segment as your management SVI.

This led me to look again at the IDF configurations.  Sure enough, the ip default-gateway was still set to the default gateway of the old network.  We had been using the management ports on the switches to access and configure them from the old HP ProCurve production network, and when I cut over, I forgot to change those default-gateway statements.  Thankfully it didn't impact production, but it was one of those head-scratchers when it came to managing the network.
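
For the record, the fix itself is a one-liner.  Here is a sketch with made-up addressing (the VLAN number, subnet, and gateway addresses are placeholders); the point is that the gateway has to sit on the same segment as the management SVI:

! Management SVI on the IDF switch (hypothetical addressing)
interface Vlan10
 ip address 10.10.10.21 255.255.255.0
!
! Replace the old network's gateway with the new core's SVI on that same segment
no ip default-gateway 192.168.1.1
ip default-gateway 10.10.10.1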

Friday, May 25, 2012

The Case of the Missing Switches

One of my clients had me set up a new Cisco network side by side with their existing HP ProCurve network last fall.  The two networks are linked by a Gigabit Ethernet link, and the Cisco core (4507R+E) serves as the default gateway for all of the VLANs on the Cisco side and the one legacy VLAN on the HP ProCurve side.  Everything works fine for normal network clients on the various VLANs.

Recently, the network administrator at the client site got time to connect to the switches to learn more about the config.  In doing so he discovered that the access-layer switches (2960S) are not accessible from any device that isn't on the same management VLAN as their management IPs.  The core switch, which is on the same VLAN, is accessible from any other VLAN.  Right now the only way to reach the access switches is to SSH from the core or to place a workstation on the management VLAN.

So far I have ruled out/checked out the following:


  • Duplicate IPs
  • Network Loops
  • The management VLAN is trunked properly across the 10Gbps uplinks to the access switches.
  • The management VLAN is in the core's routing table.
  • The access switches are listed in both the mac address table and the ARP table of the core switch.
  • The SVI for the management VLAN is UP/UP

I'm looking for ideas of where else to check for the cause of this behavior.  Thanks in advance for any help and I promise to post the solution when it comes.
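
For anyone retracing these steps, these are the kinds of commands behind the checklist above, run from the core (the VLAN number and subnet are placeholders for the client's real management VLAN):

! Management SVI state and route
show ip interface brief | include Vlan10
show ip route 10.10.10.0
! The access switches' management IPs and MACs as the core sees them
show ip arp Vlan10
show mac address-table dynamic vlan 10
! The management VLAN on the 10Gbps uplinks to the access switches
show interfaces trunk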

Tuesday, February 28, 2012

Neuron: SharePoint Slow via UNC

Microsoft's SharePoint can be accessed through a web browser or via a UNC path, as if it were a file server.  We store a lot of documentation on SharePoint, and I find it easiest to manage via the UNC path.  Lately I had noticed that opening the UNC path could take almost a minute.  After digging around a bit, I found that this is a known issue with Windows 7.  Unchecking "Automatically detect settings" in Internet Explorer's LAN settings (Internet Options > Connections > LAN settings, even if you don't use IE as your browser) speeds things back up.


Friday, February 10, 2012

Hey Window! You Make a Better Door Than You Do a Window.

Over the past two months, I have been battling with a Cisco Aironet 1310 wireless bridge between our main building and a physician's office that we rent across the street.  This physician's office is one building out of a cluster of 15 in a "Doctor's Park".

A few years ago when we put in this link, ours was the only wireless in the area.  Within the last 6 months, several of the physicians' offices and a neighboring nursing home have added wireless infrastructure.

When our problems started, I noticed that our signal strength had dropped from an average of -69dBm to -82dBm and our signal-to-noise ratio was down around 10 dB.  My first thought was an obstruction, since I had recently had a different link knocked out when a new HVAC project on the roof of one of the buildings intruded too far into that link's Fresnel zone.  I went up to where the radio sits on the main building and took a look.  Nothing new was in the line of sight.

Having ruled out the obvious, I fired up Cisco Spectrum Expert on my laptop and sat near the radio.  Almost immediately I saw something that wasn't quite right.  All of my APs and bridges are set to use channels 1, 6, or 11, but the laptop was seeing active APs on channels 3 and 8.  Since the 2.4 GHz channels overlap, this was causing adjacent-channel interference on all three of the channels my bridge was configured to use.  As I looked more closely at the Spectrum Expert information, I found the SSIDs associated with these interfering APs.  This is where I got a break: I recognized one of the SSIDs as belonging to a healthcare entity from another town.  I went back to looking outside and thinking about where this might be coming from.  Eureka: one of the physicians' offices had a new sign out front with the other entity's logo.  Thankfully the network engineers on their side were gracious enough to reconfigure their standalone APs to use channels 1 and 11, leaving channel 6 for my bridge.

You're probably asking, what does this have to do with a window being a door?  Well, yesterday, after my detective work had kept the bridge happy for almost a month, the problem came back.  This time, though, I couldn't find any interference.  Instead, Spectrum Expert showed something on channel 6 with a very high duty cycle, and I confirmed it was my own bridge on channel 6.  This puzzled me, so I drove over to the far end of the bridge to see if something was amiss there.  I walked up to the radio with my laptop and saw the same kind of high duty cycle, but again no interference.

As I did a visual sweep of the room, I saw that the window in front of the AP (I know this isn't ideal, but we rent the building) was partially open, such that the metal frame of the window sat just about centered in front of the radio.  I asked the clinician whose office it was how long the window had been open.  They told me it had been open for about 2 hours... exactly the amount of time the bridge had been having problems.  I promptly closed the window and, presto, the signal strength and duty cycle both returned to normal.  Once again, the network was undone by the Human Network.

Friday, May 6, 2011

Fun with Port-Channels

This week I did something I've done hundreds of times before: I configured a new port-channel on a Cisco switch for a server.  In doing so I ran into two scenarios that I had never seen before.


The first issue was more of a head-scratcher than a production problem, but nonetheless it caused me to stop and wonder why.


PAHA-1A-6509E-SW1#show etherchannel 403 summary
Flags:  D - down        P - bundled in port-channel
        I - stand-alone s - suspended
        H - Hot-standby (LACP only)
        R - Layer3      S - Layer2
        U - in use      N - not in use, no aggregation
        f - failed to allocate aggregator

        M - not in use, no aggregation due to minimum links not met
        m - not in use, port not aggregated due to minimum links not met
        u - unsuitable for bundling
        d - default port

        w - waiting to be aggregated
Number of channel-groups in use: 34
Number of aggregators:           35

Group  Port-channel  Protocol    Ports
------+-------------+-----------+-----------------------------------------------
403    Po403(SD)       LACP
403    Po403A(SU)      LACP      Gi1/3/45(P)    Gi1/4/14(P)    Gi2/3/45(P)

Last applied Hash Distribution Algorithm: Fixed
Obviously it's not exactly normal to see a port-channel identifier listed twice, with one of them carrying an alphabetic character on the end.  Digging around on the Internet, I finally found a document on Cisco's support community that explained the issue.  What I had done was create the port-channel with Gi1/3/45 and Gi2/3/45.  Later, when the server guys asked me to add Gi1/4/14, I added it to the channel-group before making its configuration identical to the other members', so IOS allocated a second aggregator for it (hence the extra A, and the 35 aggregators for 34 channel-groups in the output above).  The document's resolution of shutting down the ports, defaulting their configurations, and then recreating the port-channel did remove the crazy A from my show commands.

My second issue was that a server connected to a port-channel kept disabling one of its NICs, reporting it as faulted.  Again, defaulting the configurations and recreating the port-channel cleared that up.
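
For reference, the cleanup went roughly like this.  The member interfaces and channel-group number come from the output above and the LACP mode matches it, but the switchport settings are placeholders for whatever the server ports actually need:

! Shut down and default the member ports, then remove the port-channel
interface range GigabitEthernet1/3/45, GigabitEthernet1/4/14, GigabitEthernet2/3/45
 shutdown
!
default interface GigabitEthernet1/3/45
default interface GigabitEthernet1/4/14
default interface GigabitEthernet2/3/45
no interface Port-channel403
!
! Recreate the members with identical configuration and bring them back up
interface range GigabitEthernet1/3/45, GigabitEthernet1/4/14, GigabitEthernet2/3/45
 switchport
 switchport mode access
 switchport access vlan 100
 channel-group 403 mode active
 no shutdown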

The moral of my story: even when your configuration looks 100% correct, you may have confused IOS, so try starting over.