Add Diagnosing EP06-A Drops After Attach

2025-11-24 08:43:25 +00:00
commit 3746576b38
1 changed files with 58 additions and 0 deletions
--- a/Attach.-.md
+++ b/Attach.-.md
@@ -0,0 +1,58 @@
 Diagnosing EP06-A Drops After Attach: Carrier vs. QMI vs. Internal Causes
 Potential Causes of the 3-Minute Drop
 Carrier-Side PDP Timeout or Inactivity: Many mobile carriers enforce an idle timeout on PDP contexts. If the device sends no data for a certain period, the network’s gateway (GGSN/PGW) may automatically deactivate the PDP context
 [networkengineering.stackexchange.com](https://networkengineering.stackexchange.com/questions/23810/in-which-cases-is-the-pdp-context-terminated#:~:text=The%20PDP%20context%20can%20be,be%20reset%20all%20the%20time)
 . Modern networks often set long timeouts (hours or more), but some carriers use short timers (on the order of a few minutes) for certain SIMs or APNs. In fact, reports exist of mobile NAT gateways dropping connections after just 3–5 minutes of no traffic
 [blog.wirelessmoves.com](https://blog.wirelessmoves.com/2020/09/carrier-grade-nat-timeouts-and-how-to-configure-your-xmpp-server.html#:~:text=file%20takes%20immediate%20effect%2C%20even,kill%20the%20TCP%20session%20and)
 [blog.wirelessmoves.com](https://blog.wirelessmoves.com/2020/09/carrier-grade-nat-timeouts-and-how-to-configure-your-xmpp-server.html#:~:text=So%20a%20TCP%20keep%20alive,alive%20was)
 . In your case, the ~180 second window strongly suggests the carrier is tearing down the data session due to inactivity, especially since keeping the link busy past 3 minutes prevents the drop. A related scenario is a walled garden or provisional session – e.g. the network initially allows attach but will cut it off quickly if you don’t start using it or fulfill some login/authorization. This would also manifest as the PDP context dropping from the carrier side (likely with no explicit error beyond a deactivation).
 NAT Mapping Expiration (Carrier CGNAT): Even if the PDP context itself isn’t torn down, carrier-grade NAT can cause connectivity to vanish after a few minutes of no traffic. Mobile networks often use CGNAT with short UDP/TCP binding timeouts (sometimes just a few minutes) to conserve resources
 [blog.wirelessmoves.com](https://blog.wirelessmoves.com/2020/09/carrier-grade-nat-timeouts-and-how-to-configure-your-xmpp-server.html#:~:text=file%20takes%20immediate%20effect%2C%20even,kill%20the%20TCP%20session%20and)
 [blog.wirelessmoves.com](https://blog.wirelessmoves.com/2020/09/carrier-grade-nat-timeouts-and-how-to-configure-your-xmpp-server.html#:~:text=So%20a%20TCP%20keep%20alive,alive%20was)
 . If your router sends no packets after connecting, the NAT mapping in the network may expire: inbound traffic can no longer reach you, and existing outbound flows may stall until a new session is established. A pure NAT timeout typically does not make the modem report a disconnect (the PDP context stays up), but it can look like a drop in connectivity. NAT timeouts can also coincide with the carrier’s idle PDP timeout: if no traffic flows, the NAT entry expires and the network might then tear down the PDP context as well. That could explain a drop at ~3 minutes when nothing on the router generates keep-alive traffic.
 QMI Client or Modem Internal Issues: It’s also possible the cause is on the device side – for example, a QMI client state mismatch or software bug. The Quectel EP06-A in QMI mode (often managed by uqmi/netifd on OpenWrt) requires a QMI “client ID” to manage the data session. If the QMI control process misbehaves, loses its client ID, or if another process (or a netifd script) inadvertently issues a disconnect, the link could drop without a direct carrier request. There are known quirks where uqmi can get into a bad state or the interface resets unexpectedly
 [forum.gl-inet.com](https://forum.gl-inet.com/t/e750-wan-not-reconnecting/19576#:~:text=Yes%2C%20when%20it%20happened%2C%20LTE,or%20reboot%20the%20whole%20router)
 [forum.gl-inet.com](https://forum.gl-inet.com/t/uqmi-hangs-when-modem-is-active/5982#:~:text=I%E2%80%99m%20finding%20that%20uqmi%20or,bug%20in%20the%20EP06%20firmware)
 . For instance, if uqmi/netifd times out or restarts the QMI connection around that 3-minute mark (perhaps due to not seeing expected keep-alive responses or a firmware quirk), it might terminate the PDP context from the host side. Another possibility is a modem firmware issue – though less likely, the EP06-A could conceivably crash or reset its data session internally (you’d often see the USB interface reset if it fully rebooted). In summary, a QMI/host-initiated drop would mean the device or its software decided to disconnect (or lost track of the session) rather than the network explicitly dropping you.
 Identifying Which Layer Initiated the Drop
 To pinpoint the culprit, you should gather evidence at multiple layers (QMI status, modem AT state, network registration, and traffic behavior) in the moments before and after the 3-minute drop. Here’s a step-by-step approach to diagnose which layer is causing the disconnection:
 Monitor QMI Connection Status: Use the QMI control utility to watch the link’s state in real time. For example, run uqmi -d /dev/cdc-wdm0 --get-data-status and uqmi -d /dev/cdc-wdm0 --get-serving-system periodically (say every 10 seconds) immediately after a successful --start-network. --get-data-status reports "connected" while the PDP context is up and typically changes to "disconnected" once the QMI layer knows the session ended. --get-serving-system shows the registration state (e.g. registered or not, the serving network, and roaming vs. home); PS attach state may not appear in uqmi’s output, in which case a tool that reports it (such as qmicli --nas-get-serving-system) can fill the gap. If get-data-status flips from “connected” to “disconnected” at ~180s, the QMI layer received a disconnection event, triggered either by the network (e.g. the modem got a PDP deactivate from the carrier) or by the host software. Correlate this with the serving system: if the modem is still “registered” on LTE after the drop but has no data connection (or shows PS detached), that supports the network having ended the data session while the UE stayed registered. On the other hand, if QMI continues to show “connected” even when traffic stops, it implies the modem/QMI didn’t realize a drop occurred
 [forum.gl-inet.com](https://forum.gl-inet.com/t/e750-wan-not-reconnecting/19576#:~:text=Yes%2C%20when%20it%20happened%2C%20LTE,or%20reboot%20the%20whole%20router)
 . In that case the issue might be a silent network drop or a desync between the modem and host. (In summary, a carrier-initiated drop will usually be reflected in QMI status switching to disconnected relatively promptly, whereas a purely host-side glitch might not update the status correctly.)
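The polling described above can be sketched in Python as follows. This is a minimal sketch under assumptions, not a tested tool for the EP06-A: it assumes `uqmi` prints the data status as a JSON string, it uses the `/dev/cdc-wdm0` path from the text, and the helper names (`query_uqmi`, `normalize_status`, `find_transitions`, `poll`) are made up for illustration.

```python
import json
import subprocess
import time

DEVICE = "/dev/cdc-wdm0"  # device path used in the text; adjust for your router


def query_uqmi(flag, device=DEVICE):
    """Run one uqmi query and return its raw output, or an error marker."""
    try:
        proc = subprocess.run(
            ["uqmi", "-d", device, flag],
            capture_output=True, text=True, timeout=8,
        )
        return proc.stdout.strip() or proc.stderr.strip()
    except (OSError, subprocess.TimeoutExpired) as exc:
        return f"ERROR: {exc}"


def normalize_status(raw):
    """Strip JSON quoting if present (assumption: output looks like "connected")."""
    try:
        value = json.loads(raw)
    except ValueError:
        return raw
    return value if isinstance(value, str) else raw


def find_transitions(samples):
    """Return (elapsed, old, new) for every change between consecutive samples."""
    changes = []
    for (_, prev), (t, cur) in zip(samples, samples[1:]):
        if cur != prev:
            changes.append((t, prev, cur))
    return changes


def poll(duration=300.0, interval=10.0, query=query_uqmi,
         sleep=time.sleep, clock=time.monotonic):
    """Sample --get-data-status every `interval` seconds for `duration` seconds."""
    start = clock()
    samples = []
    while True:
        elapsed = clock() - start
        if elapsed > duration:
            break
        status = normalize_status(query("--get-data-status"))
        samples.append((round(elapsed, 1), status))
        print(f"T+{elapsed:6.1f}s data-status={status}")
        sleep(interval)
    return samples

# On the router, right after --start-network succeeds:
#   print(find_transitions(poll(duration=300, interval=10)))
```

A transition reported near T+180s means the QMI layer saw the session end. No transition while traffic has stopped points to the silent-drop or desync case described above.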
 Poll PDP Context via AT Commands: In parallel with QMI, query the modem’s own view of the PDP context. Using the AT command interface (e.g. via AT+CGACT? and AT+CGPADDR=1 on the modem’s AT USB port) will show whether the PDP context is active and its IP address. Do this at startup, and around the 3-minute mark. Before the drop, you should see the context activated (e.g. +CGACT: 1,1 and an IP address from +CGPADDR: 1,<ip>). If the link drops, check these again: if you now see +CGACT: 1,0 (inactive) and no IP, that means the modem itself acknowledges the PDP context deactivated. In conjunction with QMI reporting a disconnect, this confirms the session truly ended (likely carrier or modem triggered it). Conversely, if the AT commands still show the context up (1,1 with an IP) even when you’ve lost connectivity or QMI says disconnected, that indicates a mismatch – the modem thinks it’s still attached while the host/network layer is out of sync. For example, a QMI client might have dropped while the modem PDP is actually still active, or the context is “stuck” active in the modem despite the network path being gone. Detecting such a state means the drop was not cleanly handled: possibly a QMI client issue or a network drop that the modem firmware didn’t report properly. Using AT+CGACT/CGPADDR polling basically lets you double-check the modem’s truth vs. the QMI/OS view at the critical moment.
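To compare the modem's view before and after the drop, the AT responses can be parsed mechanically. The sketch below only covers parsing; how you send the commands to the AT port is left out. It assumes the usual `+CGACT: <cid>,<state>` and `+CGPADDR: <cid>,<address>` response shapes shown in the text, and the function names are hypothetical.

```python
import re

CGACT_RE = re.compile(r"\+CGACT:\s*(\d+),\s*([01])")
# The address may be quoted or bare, or missing entirely when the context is down.
CGPADDR_RE = re.compile(r"\+CGPADDR:\s*(\d+)(?:,\s*\"?([^\",\r\n]*)\"?)?")


def parse_cgact(response):
    """Map context ID -> True if active, from an AT+CGACT? response."""
    return {int(cid): state == "1" for cid, state in CGACT_RE.findall(response)}


def parse_cgpaddr(response):
    """Map context ID -> first reported address, or None if absent/zero."""
    addrs = {}
    for cid, addr in CGPADDR_RE.findall(response):
        addr = addr.strip()
        addrs[int(cid)] = addr if addr and addr != "0.0.0.0" else None
    return addrs


def context_is_up(cgact_response, cgpaddr_response, cid=1):
    """True only if the context is marked active AND has an address."""
    active = parse_cgact(cgact_response).get(cid, False)
    has_addr = parse_cgpaddr(cgpaddr_response).get(cid) is not None
    return active and has_addr
```

Calling `context_is_up` on the responses captured at startup and again just after the 3-minute mark gives a single yes/no per snapshot that can be logged next to the QMI status.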
 Inspect System Logs (netifd/QMI Events): Examine the logs on the router (e.g. logread or dmesg on OpenWrt) for any messages around the 180-second mark. You’ll want to look for QMI or wwan interface related logs. For instance, OpenWrt’s netifd might log events like “Interface ‘wwan’ is now down” when the cellular interface drops. You might also see messages from uqmi or the QMI driver if an error occurred (e.g. “uqmi[xxxx]: Failed to connect to service” or “Call failed” or QMI error codes). In some cases, the modem may output an unsolicited message on the AT log (e.g. a +QIND: PDP DEACT or similar) if the network initiated a cut – though uqmi should catch it. Key things to grep for: “wwan” (to see interface up/down changes), “qmi” (any QMI client or driver errors), and any obvious error or disconnect messages. If the drop is carrier-initiated, often you’ll see a log around that time indicating loss of data service or network detach – for example, a message that the modem is no longer registered or that the PDP context was lost (on some systems, you might see a change of state logged, similar to how MikroTik logs show “not registered, state: 0” when the link drops)
 [forum.mikrotik.com](https://forum.mikrotik.com/t/lte-cat6-modem-disconnecting-every-2-3-minutes/135493#:~:text=23%3A09%3A22%20lte%2Cinfo%20WAN2,LTE%20link%20up)
 . If the logs clearly show the interface going down at 3m with no manual intervention, that implies something (network or device) triggered it – this is strong evidence. On the other hand, if nothing is logged at all and the interface only drops much later when you, say, manually restart it, that suggests the modem didn’t inform the host immediately (which again points to a silent network drop or a host not listening for the event). Also check for any “client ID released” or “USB disconnect” in dmesg – if, for example, the USB interface reset (would hint the modem rebooted) or netifd closed the QMI client, you’d catch it here. The presence of a QMI error or timeout in logs at the drop would lean towards a QMI/host issue (e.g. uqmi might have crashed or given up), whereas a clean “network disconnected” type message would point to the carrier/network layer initiating it.
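The keyword search described above can be sketched as a small filter over log lines. The keywords come from the text; the function name and the idea of tagging each hit with the first matching keyword are illustrative assumptions. The exact log wording on your firmware may differ.

```python
KEYWORDS = ("wwan", "qmi", "usb disconnect")  # search terms suggested in the text


def filter_log(lines, keywords=KEYWORDS):
    """Return (keyword, line) for every line containing one of the keywords.

    Matching is case-insensitive; each line is reported once, tagged with the
    first keyword that matched.
    """
    hits = []
    for line in lines:
        lowered = line.lower()
        matched = [k for k in keywords if k in lowered]
        if matched:
            hits.append((matched[0], line.rstrip("\n")))
    return hits

# On OpenWrt the input could come from logread, for example:
#   import subprocess
#   out = subprocess.run(["logread"], capture_output=True, text=True).stdout
#   for keyword, line in filter_log(out.splitlines()):
#       print(keyword, "|", line)
```

Comparing the timestamps of the hits with the T+180s mark shows whether anything was logged at the moment of the drop or only later.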
 Use Heartbeat Traffic (Ping Tests): Sending periodic pings through the cellular interface shows connectivity in real time. Run ping -I wwan0 -c 1 8.8.8.8 every 15–30 seconds from a small looping script (cron alone only schedules at one-minute granularity) and timestamp the results. This reveals exactly when connectivity is lost and whether it recovers. Interpretation: If pings respond normally for, say, 170 seconds and then consistently time out after ~180s (and keep failing), traffic can no longer get through, a strong sign the PDP context is down or the path is blocked. If those failures align with QMI reporting a disconnect and a log event, the case for a carrier drop is very strong. If instead one or two pings drop around 3 minutes and then replies resume on their own, that suggests the PDP context remained active and a NAT binding simply timed out and was re-established when you sent new traffic. In other words, a temporary outage with self-recovery points to a NAT timeout rather than a full PDP teardown: the ping you sent after the idle period effectively refreshed the NAT mapping and restored traffic flow
 [blog.wirelessmoves.com](https://blog.wirelessmoves.com/2020/09/carrier-grade-nat-timeouts-and-how-to-configure-your-xmpp-server.html#:~:text=So%20a%20TCP%20keep%20alive,alive%20was)
 . On the flip side, if pings never actually fail (suppose you had a continuous ping running and it sails through the 3-minute mark without issues), yet around that time your management interface says “disconnected,” that would mean QMI/netifd thought the link dropped even though data was still flowing. That scenario would implicate a false drop indication by the software (a QMI client bug or mis-detection). Using pings in conjunction with the above checks not only helps detect the drop moment, but also can prevent drops due to inactivity. In fact, many implementations recommend periodic pings or keep-alive packets to keep the cellular link alive
 [docs.monogoto.io](https://docs.monogoto.io/getting-started/general-device-configurations/iot-devices/simcom-sim7600g-h#:~:text=When%20cellular%20modems%20are%20idle,device%20as%20being%20actively%20used)
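 . In your testing, the pings serve as a detector and, if the cause is an idle timeout, potentially as a workaround that stops the network from dropping the session in the first place. Because the probe is itself traffic, it can mask an idle-triggered drop; to reproduce the original failure, also capture one run in which the link stays idle until after the 180-second mark and only then start pinging.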
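The ping interpretation above can be turned into a small classifier over timestamped results. This is a sketch under assumptions: the three labels and the function names are invented for illustration, and `ping_once` assumes a `ping` binary that accepts `-I`, `-c`, and `-W` (as BusyBox and iputils do).

```python
import subprocess


def ping_once(iface="wwan0", target="8.8.8.8", wait_s=3):
    """Send one echo request through `iface`; True if a reply arrived."""
    try:
        proc = subprocess.run(
            ["ping", "-I", iface, "-c", "1", "-W", str(wait_s), target],
            capture_output=True, timeout=wait_s + 2,
        )
    except (OSError, subprocess.TimeoutExpired):
        return False
    return proc.returncode == 0


def classify_pings(results):
    """Classify (seconds_since_connect, reply_received) pairs in time order.

    "no loss"        - every probe answered
    "transient loss" - some probes failed, but a later one succeeded
                       (consistent with a NAT binding being re-created)
    "sustained loss" - nothing succeeded after the first failure
                       (consistent with the PDP context or path being gone)
    """
    if not results:
        return "no data"
    failures = [t for t, ok in results if not ok]
    if not failures:
        return "no loss"
    first_fail = failures[0]
    recovered = any(ok for t, ok in results if t > first_fail)
    return "transient loss" if recovered else "sustained loss"
```

If the result is "no loss" while the interface was reported down in the same window, that matches the false-drop case described above, where the software reports a disconnect although data is still flowing.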
 Correlate Multi-Layer Data to Pinpoint the Cause: Finally, bring all the observations together to conclude who “pulled the plug”:
 Carrier/Network Initiated: If you see the modem’s PDP context go down (AT reports inactive) and QMI status shows disconnected right at ~3 minutes, and logs/netifd indicate the link dropped without your input, it’s likely the carrier ended the session. This would align with an inactivity timeout or some network policy expiring the PDP context
 [networkengineering.stackexchange.com](https://networkengineering.stackexchange.com/questions/23810/in-which-cases-is-the-pdp-context-terminated#:~:text=Yes%2C%20simply%20not%20sending%20any,minutes)
 . The fact it stays up indefinitely after traffic is introduced reinforces this – essentially the network requires early traffic or it will assume the session isn’t needed. In this case, focusing on keep-alives (or contacting the carrier about PDP timeout settings) is the solution. (Example cause: GGSN/PGW idle timer expired – the network sent a PDP deactivate, which the modem obeyed, dropping the link.)
 NAT Timeout (no explicit PDP drop): If the only symptom of the “drop” is that traffic stops after 3 minutes idle but the QMI/AT status still show as if connected, then the PDP context is still up but the path was broken by NAT. In this scenario, you might notice that sending a new ping or some data after the drop reanimates the connection (since it causes a new NAT mapping). The modem never indicated a disconnect in this case. The layer that “initiated” the apparent drop is the carrier’s NAT gateway – it silently stopped forwarding traffic due to inactivity. The best detection here is the pattern of ping failures that recover with a new ping, combined with steady “connected” status in both QMI and AT. The remedy is to implement a periodic keepalive packet (ping or a UDP packet) to keep the NAT binding alive
 [blog.wirelessmoves.com](https://blog.wirelessmoves.com/2020/09/carrier-grade-nat-timeouts-and-how-to-configure-your-xmpp-server.html#:~:text=file%20takes%20immediate%20effect%2C%20even,kill%20the%20TCP%20session%20and)
 [docs.monogoto.io](https://docs.monogoto.io/getting-started/general-device-configurations/iot-devices/simcom-sim7600g-h#:~:text=When%20cellular%20modems%20are%20idle,device%20as%20being%20actively%20used)
 . This will prevent the illusion of a drop by ensuring the network sees the device as active.
 Device/QMI Initiated: If the modem’s PDP context was actually still active (AT says active, or it reconnects quickly without an OTA attach) but the host network interface went down around 3 minutes (for example, the log shows “wwan down” or a uqmi error), then the drop was triggered internally. netifd or the QMI driver may have decided the link was dead (perhaps due to a missed heartbeat or a misread state) and issued a disconnect or reset, or uqmi may have crashed or timed out and lost its client ID. In this case, QMI status might show “disconnected” (because netifd closed the session) while the modem was in fact still registered and reachable. The telltale signs are a log message about QMI or the interface shutting down without a corresponding network deregistration, or finding the data session still present when you query the modem (via AT or another QMI client) immediately after the drop. To double-check, run a manual uqmi -d /dev/cdc-wdm0 --get-data-status or a modem-side ping at that time (AT+QPING on Quectel modules, assuming the firmware’s internal stack can use that context); if it works despite the interface being marked down, the host side dropped the session. In summary, a host-initiated drop means the issue lies in the device firmware, QMI software, or configuration. The solution might be updating firmware, using a more robust connection manager, or adding explicit watchdog logic to reconnect when this condition is detected. Logging the QMI client ID allocation and any “release” events also helps confirm this. (For instance, a log line such as “<wizard> releasing QMI client” around 180s would be the smoking gun that the software closed it.)
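The three cases above can be summarized as a small decision function over the observations gathered at the drop. This is a rough sketch of the reasoning in the text, not a definitive classifier: the `Observation` fields and the returned labels are hypothetical names, and real captures may be ambiguous.

```python
from dataclasses import dataclass


@dataclass
class Observation:
    qmi_connected: bool      # uqmi --get-data-status still says "connected"
    at_context_active: bool  # +CGACT shows the context active with an address
    iface_down_logged: bool  # netifd/uqmi logged an interface-down or QMI error
    traffic_passes: bool     # heartbeat pings are still being answered


def likely_initiator(o):
    """Map one snapshot taken at the drop to the layer the text would suspect."""
    if not o.at_context_active:
        # Modem and QMI both agree the session ended: carrier (or modem) teardown.
        # Modem says down but QMI says up: states disagree, gather more data.
        return "carrier" if not o.qmi_connected else "inconclusive"
    if not o.qmi_connected or o.iface_down_logged:
        # Modem context still up, but the host side gave up on it.
        return "device"
    if not o.traffic_passes:
        # Everything reports connected, yet packets do not flow: NAT/path issue.
        return "nat"
    return "none"
```

For example, `likely_initiator(Observation(False, False, True, False))` returns `"carrier"`, matching the case where the AT and QMI views both show the session gone at ~3 minutes.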
 By implementing the above logging and checks, you can catch the moment of failure in real time and see which layer’s status changes first. In practice, a QMI status change together with the modem’s PDP context dropping is usually a carrier-triggered event, whereas a host interface drop with the modem still reporting connected points to a QMI/client issue. If everything stays nominal except the ability to pass traffic, that points to a path issue such as NAT. Using a 3-minute marker in your logs (as you suggested) is wise: print uqmi --get-data-status, uqmi --get-serving-system, AT+CGACT?, etc. at T+180s, and continue for a few minutes. This combined view should show whether the EP06-A is dropping because the carrier’s PDP context timed out (a clean teardown from the network side) or because of something internal such as a QMI client state mismatch or software reset (the network was fine but the device dropped). Once you identify who initiates the drop, you can take targeted action, e.g. enable periodic keep-alive pings to appease a carrier idle timer
 [docs.monogoto.io](https://docs.monogoto.io/getting-started/general-device-configurations/iot-devices/simcom-sim7600g-h#:~:text=When%20cellular%20modems%20are%20idle,device%20as%20being%20actively%20used)
 , or fix the QMI driver/client usage if that’s the culprit. This layered diagnostic approach helps you catch the trigger of the 3-minute dropout and address the correct layer.
 [networkengineering.stackexchange.com](https://networkengineering.stackexchange.com/questions/23810/in-which-cases-is-the-pdp-context-terminated#:~:text=Yes%2C%20simply%20not%20sending%20any,minutes)
 [forum.gl-inet.com](https://forum.gl-inet.com/t/e750-wan-not-reconnecting/19576#:~:text=Yes%2C%20when%20it%20happened%2C%20LTE,or%20reboot%20the%20whole%20router)
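If the diagnosis points to an idle timeout, the keep-alive remedy mentioned above could look roughly like this. It is a minimal sketch under several assumptions: the 60-second interval is simply a value comfortably below the ~180-second window, the destination (UDP port 9 on 8.8.8.8) is an arbitrary placeholder, `wwan0` is the interface name used in the text, and binding to a device with `SO_BINDTODEVICE` is Linux-only and normally needs root. The function names are hypothetical.

```python
import socket
import time


def make_udp_sender(host="8.8.8.8", port=9, iface="wwan0"):
    """Return a function that sends one tiny UDP datagram via `iface`."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    if iface:
        # Linux-only; usually requires root. Pass iface=None to skip binding.
        sock.setsockopt(socket.SOL_SOCKET, socket.SO_BINDTODEVICE,
                        iface.encode() + b"\0")

    def send():
        sock.sendto(b"\0", (host, port))

    return send


def run_keepalive(send, interval=60.0, count=None, sleep=time.sleep):
    """Call `send` every `interval` seconds; stop after `count` attempts if given.

    Send errors are printed and do not stop the loop, so a brief outage does
    not end the keep-alive. Returns the number of attempts made.
    """
    attempts = 0
    while count is None or attempts < count:
        try:
            send()
        except OSError as exc:
            print(f"keepalive send failed: {exc}")
        attempts += 1
        sleep(interval)
    return attempts

# On the router (runs until interrupted):
#   run_keepalive(make_udp_sender(), interval=60)
```

A datagram to a fixed host resets the session's idle timer but only refreshes the NAT mapping for that one flow, so long-lived application connections may still need their own keep-alives.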