I was swapping out a router today. Our old ones have served us long and well, but the vendor is no longer supporting them as they used to.
We also had some creatures that I wanted to get rid of. On the wall in the telecommunications shelter there were three cheap 100Base-FX to 100BaseT converters, with wall transformers (we call them wall-warts) on an outlet strip. The outlet strip was plugged into an outlet that was backed up by a generator transfer switch but no UPS.
The station had a complete telecommunications battery supply which the router used. The new router had a small switch with extra SFP units in it.
By the way, for those of you who aren’t familiar with SFP sockets, they’re more of a suggestion than a standard. Vendors would like you to think that they’re actually a “standard” but there are a lot of stupid compatibility games played by many vendors to try and lock you in to their name brand SFP modules that they sell for literally ten times the price of an after-market vendor. Tread with care.
Nevertheless, I plowed into this project thinking that it would be easy. We’ve done this before, I thought. This should work, I thought (famous last words).
First problem: how long does a unit wait to time out an ARP cache? For some embedded devices, the answer is never. Others might be hours or more.
The problem is that the original fiber converters were configured not to shut down the 100BaseT connector if the optical link went away. If I had to guess why, it may have been for recovery time issues. Remember what I wrote about the converter power supplies being plugged into a generator-backed outlet strip? There would be a brief power outage while the generator starts and the transfer switch puts the load online. Someone knew this would add another two to three minutes to the recovery time. It could have been fixed with appropriate configuration changes to the switches and the router. But that’s work. So they probably chose a cheap and dirty trick where the media converter maintains the port state regardless of what the remote end does. The downside is that the remote end doesn’t know that the link went away, and so it doesn’t know that it needs to refresh its ARP cache entry.
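To make the stale-entry problem concrete, here is a minimal toy model in Python. It is a sketch under assumptions, not how any particular device behaves: the `ArpCache` class, the `simulate` helper, the MAC and IP addresses, and the idea that a device flushes its cache when it sees its port go down are all made up for illustration.

```python
from dataclasses import dataclass, field

OLD_MAC = "02:00:00:00:00:01"  # hypothetical MAC of the router being removed
NEW_MAC = "02:00:00:00:00:02"  # hypothetical MAC of the replacement router


@dataclass
class ArpCache:
    """Toy ARP cache mapping IP -> (MAC, time the entry was learned)."""
    timeout_s: float
    entries: dict = field(default_factory=dict)

    def learn(self, ip, mac, now):
        self.entries[ip] = (mac, now)

    def on_link_down(self):
        # Assumption of this model: a device that sees its port drop
        # throws away what it learned on that port.
        self.entries.clear()

    def lookup(self, ip, now):
        entry = self.entries.get(ip)
        if entry is None:
            return None  # nothing cached: the device has to ARP again
        mac, learned = entry
        if now - learned > self.timeout_s:
            del self.entries[ip]
            return None  # entry aged out: the device has to ARP again
        return mac


def simulate(timeout_s, link_drop_visible, check_time=600.0):
    """Return what a far-end device believes the gateway MAC is at check_time.

    The router swap happens somewhere between t=0 and check_time.
    """
    cache = ArpCache(timeout_s)
    cache.learn("10.0.0.1", OLD_MAC, now=0.0)
    if link_drop_visible:
        cache.on_link_down()  # converter passed the link loss through
    return cache.lookup("10.0.0.1", now=check_time)


# Four-hour timeout, converter hides the link loss: still aiming at the old router.
print(simulate(14400, link_drop_visible=False))
# Same timeout, but the port actually went down: cache is empty, device re-ARPs.
print(simulate(14400, link_drop_visible=True))
# Five-minute timeout: the entry has aged out on its own ten minutes later.
print(simulate(300, link_drop_visible=False))
```

A result of `None` is the good outcome here: it means the device will ask again and find the new router. Getting the old MAC back means frames are still being addressed to hardware that is no longer there.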
Well, there we were. The router was replaced, new SFP modules inserted into the switch, new patch cables with LC connectors on one end and ST connectors on the other… and WTF?? The brand new fiber patch cable was very clearly kinked about 1 cm down from the connector. Then I looked at the other cables. They also had some evidence of kinking, but not as bad. Each cable had one of those little wire twist ties holding it in a neat loop so it could be packaged in a nice plastic bag. The vendor had carefully assembled and tested these things and then promptly destroyed them with a twist tie that was twisted too tightly. It’s a good thing that this link is only running at 100 Mbps, otherwise the reflections and distortion would make the cable useless.
Picking through the pile for the least damaged cables, I replaced the patch cable. The port lights up, amber, then green. But I can’t ping anything. Let it sit for a bit. Finally we got tired of waiting. We went to each of the other three ends of these fiber cables and reset the far ends to force the interface down and then back up again.
Okay. Now I can ping some of the addresses. But where is the third PLC? I ping for it. Nothing.
Our engineer walks up the hill to the far end. Crap. Someone installed the media converter to a cabinet door and zip tied the fiber too tightly so that it actually pulled the ST connector block off of the circuit board. It worked intermittently. We tried using one of the units that we replaced in the telecom shelter. Nope. It doesn’t work well either. So we called down to our other engineers at a nearby plant. Got any spare ST 100 Mbps fiber to 100BaseT converters? Yup. They’ll bring a couple up to the site. Half an hour later it was in the panel and…
Finally! Things are working. Mostly. Kinda sorta. It’s good enough for what we want to do. The other two links are at the bleeding edge of not working. But the error rate is acceptable enough to leave overnight. I suspect there is a dirty connector somewhere, but I have no patience to go looking for it right now. We’re hungry and running out of energy because it is late in the afternoon and we haven’t had lunch, so we’ll come back and look at this tomorrow. Fortunately, the facility is down for other construction activities.
We wouldn’t have known the link was this bad anyway because the external media converter had no diagnostics. In other words, it’s probably been like this for a while. With all the construction activity taking place, the fact that dust got in somewhere it didn’t belong is not a surprise to me.
Rules for working with fiber:
1. Do not bind fiber tightly to anything. It can easily kink and break.
2. Keep connectors clean and have a cleaning kit handy at all times.
3. A power meter for the 850 and 1310 nm wavelengths is very useful.
4. SFP Transceiver diagnostics are cheap and easy ways to diagnose things without disassembling anything. Beware, not all transceivers have those features built in.
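For what it’s worth, the arithmetic you do with a power meter reading (or a receive-power figure from transceiver diagnostics) is simple enough to sketch in a few lines of Python. Everything numeric below is an assumption for illustration: the transmit power, the receive readings, the receiver sensitivity, and the 3 dB minimum margin are placeholders. Take the real figures from your own datasheets and measurements.

```python
import math


def mw_to_dbm(milliwatts):
    """Convert optical power in milliwatts to dBm (0 dBm = 1 mW)."""
    return 10 * math.log10(milliwatts)


def path_loss_db(tx_dbm, rx_dbm):
    """Loss across the fiber, connectors, and splices between two readings."""
    return tx_dbm - rx_dbm


def assess(rx_dbm, rx_sensitivity_dbm, min_margin_db=3.0):
    """Classify a receive level against the receiver's rated sensitivity.

    min_margin_db is a rule-of-thumb placeholder, not a standard value.
    """
    margin = rx_dbm - rx_sensitivity_dbm
    if margin < 0:
        return "below sensitivity"
    if margin < min_margin_db:
        return "marginal"
    return "ok"


# Hypothetical readings: launch power at the near end, receive power at the far end.
tx_dbm = -17.0
sensitivity_dbm = -31.0
for rx_dbm in (-28.0, -30.0, -33.0):
    loss = path_loss_db(tx_dbm, rx_dbm)
    print(f"rx {rx_dbm} dBm, loss {loss} dB: {assess(rx_dbm, sensitivity_dbm)}")
```

A link that comes back "marginal" is the kind that works until someone bumps a cabinet door, which is exactly the situation you want to find before the plant is running.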
Finally, a reminder for those who are new to networks in ICS: TURN OFF ALL AUTONEGOTIATIONS! Turn off trunk/access negotiations. Turn off speed negotiations. Turn off duplex negotiations. And set the ARP timeout to some reasonable period of no more than five minutes. Some vendors default to values of around four hours for an ARP Cache Timeout.
When a line goes down you want it back as soon as possible. Do not wait for the switch to try dozens of iterations one at a time before it hits a magic sweet spot. Force things so that the port comes up quickly. Then you won’t have to do the cheap tricks that our other staff did. I mean, it sort of worked, so I’m not complaining too loudly, but it does make for a very confusing situation for diagnostics and upgrades.
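If you keep an inventory of port settings somewhere, the checklist above is easy to turn into a quick audit. The sketch below is only that: `PortConfig` and its field names are hypothetical, not any vendor’s data model, and it treats the ARP timeout as a per-port value purely to keep the example small. The five-minute ceiling is the one suggested above.

```python
from dataclasses import dataclass

MAX_ARP_TIMEOUT_S = 300  # five minutes, per the advice above


@dataclass
class PortConfig:
    """Hypothetical record of one port's settings, e.g. from your own inventory."""
    name: str
    speed: str               # "auto" or a fixed value such as "100"
    duplex: str              # "auto", "full", or "half"
    trunk_negotiation: bool  # True if trunk/access mode is negotiated
    arp_timeout_s: int


def audit(port):
    """Return a list of human-readable problems for one port."""
    problems = []
    if port.speed == "auto":
        problems.append("speed is negotiated")
    if port.duplex == "auto":
        problems.append("duplex is negotiated")
    if port.trunk_negotiation:
        problems.append("trunk/access mode is negotiated")
    if port.arp_timeout_s > MAX_ARP_TIMEOUT_S:
        problems.append(f"ARP timeout is {port.arp_timeout_s} s")
    return problems


# Made-up examples: one port left at defaults, one forced as recommended.
ports = [
    PortConfig("fx1", "auto", "auto", True, 14400),
    PortConfig("fx2", "100", "full", False, 300),
]
for port in ports:
    print(port.name, audit(port) or "ok")
```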