I recently ran into an issue where a Ceph cluster showed intermittent instability that initially looked like a higher-level networking or service problem, but ultimately turned out to be caused by a failing network interface.

I wanted to share the troubleshooting process because the symptoms can easily send you in the wrong direction.

Symptoms Observed

The environment showed several intermittent issues, including:

Ceph HEALTH_WARN messages
Slow ops
Storage latency spikes
VM responsiveness issues
Packet loss/intermittent communication
OSD recovery/rebalancing activity
Interfaces occasionally flapping up/down

At first glance, it looked like:

DNS instability
DHCP issues
Ceph service problems
cluster communication bugs

However, the actual issue ended up being at the physical network layer.

First Step: Check Ceph Health

Start by validating the cluster state:

ceph -s

Look for:

degraded OSDs
heartbeat warnings
slow ops
network-related health warnings

This helps determine whether the issue is isolated or affecting cluster-wide communication.
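
On a busy cluster the status output can be noisy. Here is a minimal sketch for keeping only the lines that hint at communication trouble. The helper name ceph_net_hints and the keyword list are my own choices, not anything Ceph defines, so adjust the pattern to whatever your version prints:

```shell
# Hypothetical helper: keep only health lines that mention the symptoms above.
# Reads text on stdin, so it works with any ceph status/health command.
ceph_net_hints() {
    # "|| true" keeps the exit status at 0 when nothing matches.
    grep -Ei 'slow ops|heartbeat|degraded|osds? down' || true
}
```

Typical use would be piping the detailed health output through it:

ceph health detail | ceph_net_hints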

Check Network Interface Status

Validate all interfaces are behaving normally:

ip link

or:

nmcli device status

Things to watch for:

interfaces repeatedly disconnecting/reconnecting
unexpected failovers
interfaces stuck in degraded states
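
If you want a compact view of every interface at once, the same information is exposed under /sys/class/net. The sketch below is an assumption-laden convenience, not part of the original troubleshooting: link_states is a made-up name, and the carrier_changes file is only present on reasonably recent kernels (the function prints n/a when it is missing). A carrier_changes value that keeps rising between runs is a sign of an interface repeatedly dropping and regaining link.

```shell
# Print "<interface> <operstate> <carrier_changes>" for every interface.
# $1 is the sysfs directory to scan; it defaults to /sys/class/net and is
# only a parameter so the function can be pointed at a copy for testing.
link_states() (
    base=${1:-/sys/class/net}
    for dir in "$base"/*; do
        # Skip plain files such as bonding_masters.
        [ -d "$dir" ] || continue
        state=$(cat "$dir/operstate" 2>/dev/null || echo unknown)
        changes=$(cat "$dir/carrier_changes" 2>/dev/null || echo n/a)
        printf '%s %s %s\n' "${dir##*/}" "$state" "$changes"
    done
)
```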

The Step That Actually Identified the Problem

The issue became obvious after checking interface counters:

ip -s link

Specifically:

dropped packets
RX/TX errors
overruns
carrier errors

In this case, the dropped packet counter was rapidly increasing during normal operation.

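Example (simplified; the real ip -s link output lays these counters out as columns under RX: and TX: headers):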

RX errors: 15234
TX errors: 0
dropped: 9821

That strongly pointed toward:

failing NIC
bad cable
bad transceiver
switch port issue
hardware instability
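
A large counter on its own can be left over from an old event, so what matters is whether it is still climbing. A rough sketch for measuring that, assuming the per-interface counter files under /sys/class/net/<interface>/statistics; counter_delta and its arguments are names I made up for illustration:

```shell
# Print how much each error/drop counter grew for one interface over an
# interval. Usage: counter_delta <interface> [seconds] [sysfs_dir]
# sysfs_dir defaults to /sys/class/net; it is a parameter only for testing.
counter_delta() (
    ifc=$1
    secs=${2:-10}
    stats=${3:-/sys/class/net}/$ifc/statistics
    names="rx_errors rx_dropped tx_errors tx_dropped"
    [ -d "$stats" ] || { echo "no such interface: $ifc" >&2; exit 1; }

    # First sample: collect the four values in order.
    before=""
    for n in $names; do
        before="$before $(cat "$stats/$n")"
    done

    sleep "$secs"

    # Second sample: walk the saved values as positional parameters.
    set -- $before
    for n in $names; do
        now=$(cat "$stats/$n")
        printf '%s %s\n' "$n" "$((now - $1))"
        shift
    done
)
```

For example, counter_delta <interface> 30 prints the growth of each counter over 30 seconds. Anything other than zero on an otherwise healthy link deserves a closer look.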

Monitor for Link Flapping

Another useful step:

journalctl -f

or:

dmesg -w

Watch for:

Link is Down
Link is Up
carrier lost
NIC reset messages

Repeated flapping is usually a major clue that the issue is at the physical or network layer rather than in Ceph itself.
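
Watching the log live works, but counting events over a time window makes flapping easier to quantify. A small sketch, assuming your NIC driver logs the "Link is Down" / "Link is Up" wording listed above (drivers differ, so treat the patterns as an assumption); count_link_events is a hypothetical helper name:

```shell
# Count link down/up messages in kernel log text read from stdin.
count_link_events() (
    log=$(cat)
    # grep -c exits non-zero when the count is 0, hence "|| true".
    down=$(printf '%s\n' "$log" | grep -ic 'link is down' || true)
    up=$(printf '%s\n' "$log" | grep -ic 'link is up' || true)
    printf 'down=%s up=%s\n' "$down" "$up"
)
```

To look at the last hour of kernel messages:

journalctl -k --since "1 hour ago" | count_link_events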

Monitor Interface Statistics Live

This was also helpful:

watch -n 1 cat /proc/net/dev

This makes it easy to watch the following in real time while the cluster is under load:

dropped packets
interface errors
abnormal traffic behavior
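
The raw /proc/net/dev table is wide and hard to read at a glance. As a sketch, the awk filter below prints only interfaces whose error or drop counters are non-zero. The function name noisy_interfaces is made up, and the column positions assume the standard /proc/net/dev layout (errs and drop are the 3rd and 4th receive columns and the 3rd and 4th transmit columns):

```shell
# Show only interfaces with non-zero error or drop counters.
# $1 is the file to read; it defaults to /proc/net/dev.
noisy_interfaces() {
    awk 'NR > 2 {
        sub(/:/, " ")   # "eth0:123" -> "eth0 123" so the fields line up
        if ($4 + $5 + $12 + $13 > 0)
            printf "%s rx_errs=%s rx_drop=%s tx_errs=%s tx_drop=%s\n", $1, $4, $5, $12, $13
    }' "${1:-/proc/net/dev}"
}
```

Because watch cannot call a shell function directly, a plain loop gives the same live view:

while sleep 1; do clear; noisy_interfaces; done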

Isolating the Faulty Port

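After identifying the suspected interface, temporarily disabling it helped confirm the issue. This is only safe when the node has another working network path (for example, a second link in a bond); otherwise taking the interface down cuts the node off.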

Example:

sudo ip link set <interface> down

Once the interface was disabled:

dropped packets stopped increasing
cluster communication stabilized
Ceph recovery behavior normalized
storage latency improved

Additional Validation

After disabling the interface:

ip link show <interface>

Expected:

state DOWN

Then re-check cluster health:

ceph -s

Look for:

stabilized OSD communication
reduced slow ops
recovery progressing normally

Main Takeaway

The biggest lesson from this issue:

Ceph instability is not always a Ceph problem.

It’s easy to immediately start troubleshooting:

DNS
DHCP
authentication
Ceph services
storage daemons

But physical network instability can create symptoms that look like higher-level application failures.

Checking interface counters early in troubleshooting can save a huge amount of time. In particular, check these before diving deeper into Ceph-specific troubleshooting:

dropped packets
RX/TX errors
link flapping

Bo Morgan Tech

Troubleshooting Intermittent Ceph Instability Caused by a Faulty Network Interface