Using the TCP RST flag to trace back protocol issues

 

Starter:

Performance is one of the most critical metrics for any IT environment because it maps directly to productivity. To figure out which components are slowing down overall performance, we need evidence to support our conclusions. Today let’s look at one piece of TCP-layer knowledge, the RST flag, and use it to locate a performance issue.

Knowledge Prerequisite:

Before we dig in, we need to know what RST is and how it is used.

Let’s first review the layout of the TCP header and locate the RST bit.

The header defined in RFC 793 has six control bits. From left to right they are URG, ACK, PSH, RST, SYN, FIN.
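To make the bit positions concrete, here is a small sketch that decodes a flags value into the six names listed above. The function name decode_tcp_flags is my own and not part of any tool; the bit values (URG=0x20 down to FIN=0x01) follow the RFC 793 header layout.

```shell
#!/bin/sh
# Decode the six RFC 793 control bits from a TCP flags value.
# decode_tcp_flags is a hypothetical helper name, not part of any tool.
decode_tcp_flags() {
    flags=$(( $1 ))          # accepts decimal or 0x-prefixed hex
    out=""
    # Bit values in header order, left to right: URG=32 ... FIN=1
    for pair in URG:32 ACK:16 PSH:8 RST:4 SYN:2 FIN:1; do
        name=${pair%%:*}
        bit=${pair##*:}
        if [ $(( flags & bit )) -ne 0 ]; then
            out="${out:+$out, }$name"
        fi
    done
    echo "${out:-none}"
}

decode_tcp_flags 0x14   # prints: ACK, RST
decode_tcp_flags 0x02   # prints: SYN
```

The value 0x14 is the combination that matters later in this post. Wireshark lists the same two bits starting from the low bit, so it shows them as [RST, ACK].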

 

 

  1. What is RST? It is short for “reset”.
  2. What is RST used for? It resets the connection.
  3. Does it happen in the middle of a data transfer or at the end? It typically happens in the middle, to abort the connection. A graceful close at the end of a transfer uses FIN instead.
  4. Why does RST happen? If everything runs smoothly, the TCP connection is closed gracefully. If an exception occurs before the connection can be closed safely, such as an application crash, an RST is sent.
  5. What is the relationship between TCP and RST? RST is a feature of the TCP protocol. TCP keeps track of whether the packets sent by one host were received by the other. RST allows TCP or an application to close a connection abruptly when the application has crashed or for some reason can no longer continue with the connection.

There are three reasons that a TCP Reset might be sent by a host.

  • There is no listening process on the IP/port pair on which an application is attempting to make a connection.
  • An application may abort the connection in the middle of a data transfer due to an application anomaly or crash.
  • A host may disconnect its side of the TCP connection if it does not receive acknowledgments from a host to which it was sending TCP data.
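The first reason in the list is the easiest to reproduce. The sketch below tries to open a TCP connection and reports how the attempt ended. It assumes bash (for its /dev/tcp redirection) and the coreutils timeout command; probe_port is a hypothetical helper name, and the port in the usage comment is only an example of a port that normally has no listener.

```shell
#!/usr/bin/env bash
# Probe a TCP port and report how the connection attempt ended.
# probe_port is a hypothetical helper; it assumes bash (for /dev/tcp)
# and the coreutils `timeout` command are available.
probe_port() {
    host=$1
    port=$2
    # The inner bash opens a TCP connection and exits right away.
    timeout 3 bash -c "exec 3<>/dev/tcp/$host/$port" 2>/dev/null
    case $? in
        0)   echo "open" ;;
        124) echo "timeout" ;;   # neither SYN-ACK nor RST within 3 seconds
        *)   echo "refused" ;;   # typically an RST answering our SYN
    esac
}

# Usage, assuming nothing listens on local port 1:
# probe_port 127.0.0.1 1
```

The catch-all branch also covers errors such as an unreachable host, so "refused" should be read as "most likely a reset" rather than proof of one. A packet capture taken during the probe would confirm it.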

 

Real-life Scenario:

Issue Description:
======================

We deployed a file cluster containing five hosts and noticed slow performance on node 3. When Windows clients accessed the file server SMB shares through node 3, testing showed that displaying file properties took 2 minutes and updating a SID took 4 minutes. The same operations through node 4 completed without abnormal delay.

(Node 3 and node 4 are hosts within the same cluster; at the back end they are attached to the same file system.)

Troubleshooting steps:

  1. First, let’s look at the service response time of the SMB protocol to rule out a problem with a particular operation:

Node 3’s SMB service response time is at the same level as node 4’s. Node 3 responds as quickly as node 4, with no significant delay.

 

 

  2. By comparison, however, we notice that for the same operations node 3 shows more Negotiate and Session Setup requests and responses between the server and the client. Per the SMB protocol specification, Negotiate and Session Setup are the two initial phases before any file operation, so why does node 3 show two times more of these calls than node 4 for the same operations? This is what we need to look into next.

Moreover, no access-denied or unable-to-open errors occurred during the file operations (checking file properties and updating user SID information).

  3. We need to revisit the packets between two Negotiate requests. From the statistics we already know another key fact: there was no Session Logoff call while reproducing the issue on node 3, which means the client kept using the same session for the file operations. That makes it even more suspicious that the same session shows so many Negotiate and Session Setup exchanges.

 

  4. We found that before the Negotiate request there is an ‘RST, ACK’ sent from the client to the file server. That is a good lead for finding the cause.

 

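  5. Using a filter, we find that ‘RST, ACK’ packets are sent repeatedly from the client to the file server.

The filter that I used: "(tcp.flags.reset eq 1 && tcp.port eq 445) || smb.cmd == 0x72 || smb2.cmd eq 0 || (smb2.cmd eq 1 && smb2.flags.response == 1)". It matches resets on port 445, SMB1 Negotiate (command 0x72), SMB2 Negotiate (command 0), and SMB2 Session Setup responses (command 1).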
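To check which side is sending the resets without scrolling through the packet list, the reset part of the filter can be combined with tshark’s field output. This is a sketch, not the command used in the original investigation: count_rst_by_source is a hypothetical helper name, and it assumes IPv4 traffic because it prints ip.src.

```shell
#!/bin/sh
# Count TCP resets on port 445 per sending IP address.
# count_rst_by_source is a hypothetical helper; the capture file name
# is passed in by the caller.
count_rst_by_source() {
    tshark -nr "$1" \
        -Y "tcp.flags.reset eq 1 && tcp.port eq 445" \
        -T fields -e ip.src |
        sort | uniq -c | sort -rn
}

# Usage: count_rst_by_source file_server_node3.pcap
```

Each output line is a count followed by a source address, highest count first. If the client address dominates the list, that supports the observation that the client is the one resetting the connection.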

 

  6. Finally, from the packets we can clearly see that the issue was caused by the server side not responding to SMB requests from the client. The RSTs helped us trace back which procedures the client had called.

Script that I used:

    for i in $(tshark -t ad -nr file_server_node3.pcap -Y "(tcp.flags.reset eq 1 && tcp.port eq 445)" | awk '{print $(NF-2)}' | sed 's/Ack=//g'); do
        echo "Ack Number $i"
        tshark -nr file_server_node3.pcap -Y "tcp.ack eq $i || tcp.seq eq $i"
        echo; echo
    done
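The awk and sed stage of the script depends on the layout of tshark’s summary line, so it is worth seeing in isolation. The sample line below is made up for illustration (addresses, ports, and numbers are not from the real capture); extract_ack is a hypothetical name for the same two commands the script uses.

```shell
#!/bin/sh
# Extract the Ack value from a tshark summary line, as the loop does.
# The sample line is made up for illustration; real captures will differ.
sample='  42 2020-01-01 10:00:00.000000 192.0.2.10 → 192.0.2.20 TCP 54 50123 → 445 [RST, ACK] Seq=1 Ack=1234 Win=0 Len=0'

extract_ack() {
    # Counting from the end: Len=0 is $NF, Win=0 is $(NF-1), Ack=... is $(NF-2)
    awk '{print $(NF-2)}' | sed 's/Ack=//g'
}

printf '%s\n' "$sample" | extract_ack   # prints 1234
```

Because the extraction counts fields from the end of the line, a summary line with a different set of trailing fields would yield a different token. Asking tshark for the value directly with -T fields -e tcp.ack is one way to avoid that dependency.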

 

 

Summary:

We covered what RST is and how it works, and used a real issue to show how RST packets can be traced back to the call that caused the problem.

 

Referenced Links and documents:

Current RFC:

  • RFC 793, Transmission Control Protocol

Other docs:

  • TCP/IP Analysis and Troubleshooting Toolkit

 

 
