Troubleshooting Intermittent Ceph Instability Caused by a Faulty Network Interface

I recently ran into an issue where a Ceph cluster started showing intermittent instability. It initially looked like a higher-level networking or service problem, but the cause turned out to be a failing network interface.

I wanted to share the troubleshooting process because the symptoms can easily send you in the wrong direction.

Symptoms Observed

The environment showed several intermittent issues including:

  • Ceph HEALTH_WARN messages
  • Slow ops
  • Storage latency spikes
  • VM responsiveness issues
  • Packet loss/intermittent communication
  • OSD recovery/rebalancing activity
  • Interfaces occasionally flapping up/down

At first glance, it looked like:

  • DNS instability
  • DHCP issues
  • Ceph service problems
  • cluster communication bugs

However, the actual issue ended up being at the physical network layer.


First Step: Check Ceph Health

Start by validating the cluster state:

ceph -s

Look for:

  • OSDs marked down or out, and degraded placement groups
  • heartbeat warnings
  • slow ops
  • network-related health warnings

This helps determine whether the issue is isolated or affecting cluster-wide communication.
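
To cut the health output down to the warnings that tend to accompany network trouble, a small filter like the one below can help. This is only a sketch: filter_health is a made-up helper name, and the keyword pattern is an assumption about what is worth surfacing, not an exhaustive list of Ceph health messages.

```shell
# filter_health: hypothetical helper. Reads health text on stdin and keeps
# lines that mention slow ops, slow pings/heartbeats, down OSDs or degraded PGs.
filter_health() {
  grep -Ei 'slow[ _]ops|slow[ _]ping|heartbeat|down|degraded'
}

# Usage on a cluster node:
#   ceph health detail | filter_health
```

If the matching lines keep naming OSDs on the same host, that host's network path is a good place to look next.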


Check Network Interface Status

Validate all interfaces are behaving normally:

ip link

or, on hosts where NetworkManager manages the interfaces:

nmcli device status

Things to watch for:

  • interfaces repeatedly disconnecting/reconnecting
  • unexpected failovers
  • interfaces stuck in degraded states

The Step That Actually Identified the Problem

The issue became obvious after checking interface counters:

ip -s link

Specifically:

  • dropped packets
  • RX/TX errors
  • overruns
  • carrier errors

In this case, the dropped packet counter was rapidly increasing during normal operation.

Example (counter values summarized from the ip -s link output, not its exact layout):

RX errors: 15234
TX errors: 0
dropped: 9821

That strongly pointed toward:

  • failing NIC
  • bad cable
  • bad transceiver
  • switch port issue
  • hardware instability
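
Since the telling sign was a counter that kept climbing, it can help to measure how fast it grows. The sketch below reads the per-interface counters that Linux exposes under /sys/class/net/<interface>/statistics/ and prints the increase over a short interval. The function names are hypothetical, and the optional last argument (an alternate sysfs directory) is there only to make the functions easy to test.

```shell
# read_counter <iface> <counter> [sysfs_net_dir]
# Prints the current value of one counter, e.g. rx_dropped or rx_errors.
read_counter() {
  cat "${3:-/sys/class/net}/$1/statistics/$2"
}

# counter_delta <iface> <counter> <seconds> [sysfs_net_dir]
# Samples the counter twice, <seconds> apart, and prints the difference.
counter_delta() {
  before=$(read_counter "$1" "$2" "$4")
  sleep "$3"
  after=$(read_counter "$1" "$2" "$4")
  echo $((after - before))
}

# Usage (replace <interface> with the suspect NIC):
#   counter_delta <interface> rx_dropped 10
```

A result that stays at zero on healthy interfaces but is consistently non-zero on one port is a reasonable reason to suspect that port, its cable, or its transceiver.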

Monitor for Link Flapping

Another useful step is to follow the system journal or kernel log live:

journalctl -f

or:

dmesg -w

Watch for:

  • Link is Down
  • Link is Up
  • carrier lost
  • NIC reset messages

Repeated flapping is usually a strong clue that the problem sits at the physical or network layer rather than in Ceph itself.
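
If nothing flaps while you are watching the logs, the kernel also keeps a per-interface count of link state transitions in /sys/class/net/<interface>/carrier_changes. The sketch below lists that count for every interface. carrier_changes as a function name is an assumption for illustration, and the optional directory argument exists only for testing.

```shell
# carrier_changes [sysfs_net_dir]
# Prints "<iface> <count>" for each interface that exposes carrier_changes.
carrier_changes() {
  dir="${1:-/sys/class/net}"
  for path in "$dir"/*/carrier_changes; do
    [ -r "$path" ] || continue          # skip if the glob matched nothing
    iface=$(basename "$(dirname "$path")")
    printf '%s %s\n' "$iface" "$(cat "$path")"
  done
}

# Usage:
#   carrier_changes
```

Running it twice a few minutes apart and comparing the numbers shows which interface, if any, is still flapping.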


Monitor Interface Statistics Live

This was also helpful:

watch -n 1 cat /proc/net/dev

This makes it easy to watch:

  • dropped packets
  • interface errors
  • abnormal traffic behavior

in real time while the cluster is under load.
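
The raw /proc/net/dev table is wide, so the error and drop columns are easy to lose. As a rough sketch, the awk one-liner below keeps only those columns. net_dev_errors is a hypothetical name, and the optional file argument is there for testing. The field positions follow the standard /proc/net/dev layout (eight receive columns, then eight transmit columns).

```shell
# net_dev_errors [file]
# Prints one line per interface with the RX/TX error and drop counters.
net_dev_errors() {
  awk 'NR > 2 {
         sub(/:/, " ")   # split "eth0:" (or "eth0:123") into separate fields
         printf "%s rx_errs=%s rx_drop=%s tx_errs=%s tx_drop=%s\n",
                $1, $4, $5, $12, $13
       }' "${1:-/proc/net/dev}"
}

# Usage, refreshing every second:
#   while sleep 1; do clear; net_dev_errors; done
```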


Isolating the Faulty Port

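After identifying the suspected interface, temporarily disabling it helped confirm the issue. This is only safe when the node has another working path for that traffic (for example a second link or another bond member); otherwise the node drops off the network.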

Example:

sudo ip link set <interface> down

Once the interface was disabled:

  • dropped packets stopped increasing
  • cluster communication stabilized
  • Ceph recovery behavior normalized
  • storage latency improved

Additional Validation

After disabling the interface:

ip link show <interface>

Expected:

state DOWN
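
The same check can be scripted by reading the interface's operstate file under sysfs, which reports values such as "up" or "down". This is a minimal sketch: iface_state is a made-up helper name, and the optional directory argument is only for testing.

```shell
# iface_state <iface> [sysfs_net_dir]
# Prints the operational state of the interface (e.g. "up" or "down").
iface_state() {
  cat "${2:-/sys/class/net}/$1/operstate"
}

# Usage (replace <interface> with the port you disabled):
#   [ "$(iface_state <interface>)" = "down" ] && echo "interface is down"
```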

Then re-check cluster health:

ceph -s

Look for:

  • stabilized OSD communication
  • reduced slow ops
  • recovery progressing normally

Main Takeaway

The biggest lesson from this issue:

Ceph instability is not always a Ceph problem.

It’s easy to immediately start troubleshooting:

  • DNS
  • DHCP
  • authentication
  • Ceph services
  • storage daemons

But physical network instability can create symptoms that look like higher-level application failures.

Checking interface counters early in troubleshooting can save a huge amount of time. In particular, look at:

  • dropped packets
  • RX/TX errors
  • link flapping

before diving deeper into Ceph-specific troubleshooting.