Enabling LWIP core locking to fix crashes during fast TCP transmission on an ESP8266-based programmable controller

zhuyue - Oct 21 - - Dev Community

Changing the LWIP configuration fixes a crash and restart of the ESP8266 that occurs when the LWIP library sends data quickly.

A few days ago I optimized the SPI flash read/write operations and adjusted the priorities of the controller's main thread and the LWIP thread. The main thread now has a higher priority than the LWIP thread, so it can preempt it immediately, which keeps the controller responsive.

I also optimized the TCP packet processing state machine. In the end, an HTTP send and receive of a single 1200-byte packet takes about 40 ms.

But I immediately ran into a problem. When the PC tool started an automated web page download test, the ESP8266 crashed and restarted after only a small amount of data had been sent and received, reporting the following error:
Guru Meditation Error: Core 0 panic'ed (LoadProhibited). Exception was unhandled.

Or:
assertion "tcp_write: no pbufs on queue => both queues empty" failed: file "E:/software/esptool/ESP8266_RTOS_SDK/components/lwip/lwip/src/core/tcp_out.c", line 347, function: tcp_write_checks

Resolving the PC value at the time of the crash with xtensa-lx106-elf-addr2line.exe shows that the crash almost always happens while the following line is executing:
if (TCP_SEQ_LT(lwip_ntohl(seg->tcphdr->seqno), lwip_ntohl(useg->tcphdr->seqno)))
For this line to fault, one of seg, useg, seg->tcphdr or useg->tcphdr must be NULL.
I added test code and log output, which confirmed that the fault is caused by useg being NULL. Careful reading of the code before this line showed that this can only happen in one way: pcb->unacked is NULL when the function is entered, but by the time execution reaches the faulting line it has unexpectedly changed to a non-zero value.

Next I added some variables and checks to narrow this down. Nothing in the function itself, including the sub-functions it calls, modifies pcb->unacked, yet it was being overwritten for no apparent reason. More precisely, in the call chain tcp_output -> ip4_output_if_opt_src -> etharp_output, the overwrite happened while etharp_output was executing.

The most likely explanation is that this function, or one of the sub-functions it calls, is not thread-safe and not reentrant, but is being entered by several tasks at the same time.
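To make the suspected failure mode concrete, here is a small host-side model of it. This is only a sketch under my own assumptions, not the LWIP source: the struct names, the hook, and both functions are hypothetical stand-ins. It shows how a pointer that is snapshotted on entry (useg) can stay NULL while the shared field (pcb->unacked) is changed by someone else before the later check.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, stripped-down stand-ins for the LWIP types (not the real ones). */
struct seg_model { struct seg_model *next; unsigned seqno; };
struct pcb_model { struct seg_model *unacked; };

/* Hook that plays the role of a second task running in the middle of the call. */
typedef void (*interference_fn)(struct pcb_model *pcb);

static struct seg_model other_task_seg = { NULL, 100u };

/* Simulated second task: it queues its own segment on the shared pcb. */
void other_task_queues_segment(struct pcb_model *pcb)
{
    pcb->unacked = &other_task_seg;
}

/* Returns true if the comparison line would dereference a NULL useg. */
bool would_deref_null_useg(struct pcb_model *pcb, interference_fn interfere)
{
    /* Snapshot taken near the top of the function. */
    struct seg_model *useg = pcb->unacked;

    /* The segment is handed down to the IP/ARP layers here. */
    if (interfere != NULL) {
        interfere(pcb);
    }

    /* The later check re-reads pcb->unacked instead of using the snapshot. */
    if (pcb->unacked == NULL) {
        return false;    /* empty-queue branch: useg is not dereferenced */
    }
    return useg == NULL; /* non-empty branch reads through useg */
}
```

Called with no interference, the model takes the safe branch. Called with other_task_queues_segment as the hook, it reports the NULL dereference, which matches the symptom seen on the ESP8266.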

To check this, I obtained the name of the calling task with pcTaskGetName(NULL) inside ip4_output_if_opt_src and etharp_output and printed it.

The log showed that the LWIP thread, named tiT, called the function, and before it had finished executing, another thread named http server entered the same function.
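The same kind of tracing can be wrapped in a small entry/exit guard. The sketch below is a debugging aid only, with hypothetical names (trace_enter, trace_exit, trace_reentry_count). On the target, the task name would come from pcTaskGetName(NULL); here it is passed in as a string so the code can run on a PC. The guard itself is not atomic, so it can miss or misreport an overlap; it is meant for finding a problem, not for preventing one.

```c
#include <stdio.h>
#include <string.h>

static const char *g_task_inside = NULL; /* task currently inside the traced function */
static int g_reentry_count = 0;          /* number of overlapping entries seen so far */

/* Call at the top of the traced function. Returns 1 if another task is already inside. */
int trace_enter(const char *task_name)
{
    if (g_task_inside != NULL && strcmp(g_task_inside, task_name) != 0) {
        g_reentry_count++;
        printf("re-entry: \"%s\" entered while \"%s\" is still inside\n",
               task_name, g_task_inside);
        return 1;
    }
    g_task_inside = task_name;
    return 0;
}

/* Call just before the traced function returns. */
void trace_exit(const char *task_name)
{
    if (g_task_inside != NULL && strcmp(g_task_inside, task_name) == 0) {
        g_task_inside = NULL;
    }
}

int trace_reentry_count(void)
{
    return g_reentry_count;
}
```

Replaying the sequence from my log, trace_enter("tiT") followed by trace_enter("http server") before tiT has left, makes the second call return 1 and print the overlap.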

http server is the name of the controller's main thread. It calls tcp_output right after tcp_write so that LWIP sends the data out immediately.

Without the core locking feature, the main task must not call tcp_output directly. Instead, it should send a TCPIP_MSG_API message so that the LWIP thread calls tcp_output on its behalf.
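For comparison, the message-passing approach can be modelled as a mailbox that the application task posts to and that only the stack thread drains. This is a single-threaded model with hypothetical names (post_to_stack_thread, stack_thread_drain, flush_connection), not LWIP code. In LWIP itself the message goes through the tcpip thread's mailbox, for example via tcpip_callback(), and that mailbox is protected by the OS.

```c
#include <stddef.h>

typedef void (*core_fn)(void *ctx);
struct core_msg { core_fn fn; void *ctx; };

#define MBOX_SLOTS 8
static struct core_msg g_mbox[MBOX_SLOTS];
static unsigned g_head = 0;  /* next slot to read */
static unsigned g_count = 0; /* number of queued messages */

/* Called by the application task: queue the work instead of running it. */
int post_to_stack_thread(core_fn fn, void *ctx)
{
    if (fn == NULL || g_count == MBOX_SLOTS) {
        return -1; /* invalid message or mailbox full */
    }
    unsigned tail = (g_head + g_count) % MBOX_SLOTS;
    g_mbox[tail].fn = fn;
    g_mbox[tail].ctx = ctx;
    g_count++;
    return 0;
}

/* Called only from the stack thread's loop: run everything that is queued. */
int stack_thread_drain(void)
{
    int ran = 0;
    while (g_count > 0) {
        struct core_msg msg = g_mbox[g_head];
        g_head = (g_head + 1) % MBOX_SLOTS;
        g_count--;
        msg.fn(msg.ctx); /* runs in the stack thread's context */
        ran++;
    }
    return ran;
}

/* Example work item; on the target this is where tcp_output(pcb) would be called. */
void flush_connection(void *ctx)
{
    int *flushed = (int *)ctx;
    (*flushed)++;
}
```

The cost of this design is latency: the data is not pushed out until the stack thread gets scheduled and processes the message, which is why I did not choose it here.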

For maximum communication efficiency, I still decided to use LWIP_TCPIP_CORE_LOCKING to avoid function reentry and solve the kernel crash problem.

In my lwipopts.h file, I define:

#define LWIP_TCPIP_CORE_LOCKING 1

Add LOCK_TCPIP_CORE() before calling tcp_write and tcp_output in the main thread, and UNLOCK_TCPIP_CORE() after the functions.
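A minimal sketch of such a locked send helper follows. The helper name send_locked is my own, and everything between the stand-in markers is a host-side substitute so the sketch compiles on a PC: on the target those definitions come from lwip/tcpip.h and lwip/tcp.h, and the real LOCK_TCPIP_CORE() takes a mutex rather than incrementing a counter. The stand-in tcp_write uses an arbitrary length limit only to exercise the error path.

```c
#include <stddef.h>

/* ---- Host-side stand-ins so the sketch compiles without the SDK ---- */
typedef int err_t;
#define ERR_OK   0
#define ERR_MEM (-1)
#define TCP_WRITE_FLAG_COPY 0x01
struct tcp_pcb { int unused; };

static int g_core_lock_depth = 0; /* 1 while the (fake) core lock is held */
static int g_unlocked_calls = 0;  /* raw API calls made without the lock */
#define LOCK_TCPIP_CORE()   (g_core_lock_depth++)
#define UNLOCK_TCPIP_CORE() (g_core_lock_depth--)

static err_t tcp_write(struct tcp_pcb *pcb, const void *data,
                       unsigned short len, unsigned char flags)
{
    (void)pcb; (void)data; (void)flags;
    if (g_core_lock_depth != 1) g_unlocked_calls++;
    return (len > 1460) ? ERR_MEM : ERR_OK; /* arbitrary limit for the demo */
}

static err_t tcp_output(struct tcp_pcb *pcb)
{
    (void)pcb;
    if (g_core_lock_depth != 1) g_unlocked_calls++;
    return ERR_OK;
}
/* ---- End of stand-ins ---- */

/* Queue the data and push it out, holding the core lock across both calls. */
err_t send_locked(struct tcp_pcb *pcb, const void *data, unsigned short len)
{
    err_t err;

    LOCK_TCPIP_CORE();
    err = tcp_write(pcb, data, len, TCP_WRITE_FLAG_COPY);
    if (err == ERR_OK) {
        err = tcp_output(pcb);
    }
    UNLOCK_TCPIP_CORE(); /* released on the error path as well */
    return err;
}

int core_lock_depth(void) { return g_core_lock_depth; }
int unlocked_calls(void)  { return g_unlocked_calls; }
```

Keeping both calls inside one lock/unlock pair, with a single exit point, makes it harder to return early with the lock still held.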

With this change, the crash no longer occurs.

An uncompressed HD audio signal with a 16 kSa/s sampling rate and 16-bit depth requires a bandwidth of 256 kb/s, and the TCP communication speed now fully meets this requirement.
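The bandwidth figure is simply sample rate times bit depth times channel count. The helper below (a hypothetical name) spells out the arithmetic; the single channel is my assumption, since it is what makes the numbers come out to 256 kb/s.

```c
/* Raw PCM bitrate in bits per second. */
unsigned long pcm_bitrate_bps(unsigned long sample_rate_hz,
                              unsigned bits_per_sample,
                              unsigned channels)
{
    return sample_rate_hz * bits_per_sample * channels;
}

/* 16000 Sa/s * 16 bit * 1 channel = 256000 bit/s, i.e. 32000 bytes/s. */
```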
