TCP/IP Works-then stops
Ron Jameson
04-04-2000, 10:40 PM
I have a customer whose customer is having MAJOR issues with IP. They are running a Sun firewall, Cisco routers and switches, NT servers and NT workstations. They say that IP will work (you can bombard the servers and it is fine), but then it will fail on the next batch of 20 or so packets. They have tried replacing the routers, hubs and the NIC in the Sun firewall but still have problems. This customer is holding out on $75k in payment until it is fixed, so I am trying to help.
They have had Cisco people, the ISP and other tech support on site searching for the cause for 2 weeks. Nothing. Has ANYONE ever heard of such a problem? I am trying to think what can cause this.
Server NICs, network wiring (can a port tester figure this out?). What program can find out what is going on here? Any help is appreciated.
Ron
Daniel Neises
04-12-2000, 10:58 PM
We're experiencing something similar:
Problem Statement:
Our customers are currently experiencing unacceptable / erratic behavior when using some types of applications in a client-server or shared-file environment.
Current Environments:
11 Mbps Lucent wireless network with roughly 10-35 Dell Optiplex GX1 workstations running Windows 2000 Professional. Each site or LAN has one Dell PowerEdge 2400 server running Windows 2000 Server with an Intel Pro 100+ Management adapter. The server is plugged directly into a hub, and the workstations talk to a wireless access point that is plugged into the same hub.
100 Mbps hardwired network with roughly 4 Dell Optiplex GX1 workstations running Windows 2000 Professional and 6 Dell Optiplex GX1s running Windows 2000 Server with Intel Pro 100+ adapters. The servers and all workstations are plugged directly into the hub.
Sample Problems:
Example 1: A user on a workstation tries to open or update a 4 MB Access database file on a remote server share. The end user experiences delayed, extremely slow and sometimes failed connections.
Example 2: A customer is using the QuickBooks client application on the local workstation and is trying to access a file located in a share on a server. When the user tries to update information on the server share, one of three things happens: the application crashes; an error window pops up saying “the connection to the QuickBooks company file has been lost” and then the application crashes; or a pop-up window reports “Windows- Delayed Write failed …. Windows was unable to save all the data for the file ***.qbw. The data has been lost. This error may be caused by a failure of your computer hardware or network connection.” It is important to note that there are no problems using this application in a non-client-server mode.
Example 3: A customer using ACT tries to update information that is located on a server share, and the application crashes or fails to update the information. If the user leaves ACT open while it is connected to the shared database, without entering any information, the application will eventually throw a Dr. Watson error and crash. If this same user switches to using ACT locally, there are no crashes or application problems.
Troubleshooting / Remedial Steps to Date:
1. Updated the Intel server NIC driver to the latest version
2. Updated the Lucent wireless adapter driver to the latest version
3. Updated the firmware on the wireless cards (version x / date)
4. Verified TCP/IP is the only protocol loaded (version x / date)
Tools Used:
Network Associates Sniffer Pro 3.5 for network analysis
Windows Performance Monitor for TCP statistics on the server.
Detailed Analysis:
We have captured data from 5 different networks, 4 of them wireless and 1 of them completely wired, to rule out the possibility that this only occurs on wireless networks. The traffic patterns are the same regardless of whether the network is wired or wireless.
Looking at the various network captures, there is always the same recurring theme: excessive ACK Too Long errors, Window Frozen errors, and Idle Too Long errors. As a statistic, a typical 11-minute capture contained 254 instances of the above problems in traffic from the workstation to the server and from the server to the workstation. Going by these statistics and the high-level analysis alone, it looks like there is a problem with the workstations talking to the server and the server talking to the workstations. If you look at the actual decode of the network capture, you see normal conversations in the higher-level protocol, SMB, and then the workstation sends a delayed ACK to the server and closes the connection. The server thinks it is still talking to the workstation and tries to send information to it, and a series of errors occurs on the server. Then the workstation tears down the session and reestablishes it with the server, and communication continues as normal. Then this whole process starts up again.
With NT Performance Monitor we set up two counters:
1. Connection Failures: the number of times TCP connections have made a direct transition to the CLOSED state from the SYN-SENT state or the SYN-RCVD state, plus the number of times TCP connections have made a direct transition to the LISTEN state from the SYN-RCVD state.
2. Connections Reset: the number of times TCP connections have made a direct transition to the CLOSED state from either the ESTABLISHED state or the CLOSE-WAIT state.
Both of these counters are off the charts all the time, somewhere in the range of 4000.
Does anyone have an idea why we're seeing the TCP stack break down and try to rebuild these sessions?
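To make the two counter definitions above concrete, here is a minimal Python sketch that tallies them from a list of TCP state transitions. The transition sets come straight from the definitions in this post; the function name `tally` and the sample transitions are made up for illustration and are not taken from our captures.

```python
# Transitions counted by each counter, as (from_state, to_state) pairs,
# following the two counter definitions quoted above.
CONNECTION_FAILURES = {
    ("SYN-SENT", "CLOSED"),
    ("SYN-RCVD", "CLOSED"),
    ("SYN-RCVD", "LISTEN"),
}
CONNECTIONS_RESET = {
    ("ESTABLISHED", "CLOSED"),
    ("CLOSE-WAIT", "CLOSED"),
}


def tally(transitions):
    """Count how many transitions fall under each counter definition."""
    counts = {"Connection Failures": 0, "Connections Reset": 0}
    for transition in transitions:
        if transition in CONNECTION_FAILURES:
            counts["Connection Failures"] += 1
        elif transition in CONNECTIONS_RESET:
            counts["Connections Reset"] += 1
    return counts


if __name__ == "__main__":
    # Hypothetical transitions, for illustration only.
    sample = [
        ("SYN-SENT", "CLOSED"),         # connection attempt failed
        ("ESTABLISHED", "CLOSED"),      # established session reset
        ("ESTABLISHED", "FIN-WAIT-1"),  # orderly close, counted by neither
        ("SYN-RCVD", "LISTEN"),         # half-open attempt abandoned
    ]
    print(tally(sample))
```

An orderly close (ESTABLISHED to FIN-WAIT-1) is counted by neither counter, which is why a high Connections Reset value points at sessions being dropped abruptly rather than shut down normally.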
Ron Jameson
04-13-2000, 09:24 PM
In my client's client's case, they found multiple routers with the same IP address. A rookie problem, I say, for the rookie IT guy over there. It pays to have experience out there; this guy is flying by the seat of his pants.
The same IP address on two devices on a network will shut down IP. That is what is happening.
When debugging any network: simplify, simplify, simplify. I use the same techniques to debug SCSI as I do (odd) network problems. Strip the network down to one server and one workstation. Add components to the mix one at a time and find out where the problem is. I usually find a rogue NIC with buffer overruns/underruns delaying transmission.
Ron
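One way to spot the duplicate-IP situation described above is to collect (IP, MAC) pairs seen on the wire and flag any IP address that shows up with more than one MAC address. The Python sketch below assumes you already have those pairs from somewhere (ARP tables, a sniffer export); the function name `find_duplicate_ips` and the sample addresses are hypothetical.

```python
from collections import defaultdict


def find_duplicate_ips(observations):
    """Return {ip: [macs]} for every IP seen with more than one MAC address.

    `observations` is an iterable of (ip, mac) string pairs.
    """
    macs_by_ip = defaultdict(set)
    for ip, mac in observations:
        # Normalize case so the same MAC is not counted twice.
        macs_by_ip[ip].add(mac.lower())
    return {
        ip: sorted(macs)
        for ip, macs in macs_by_ip.items()
        if len(macs) > 1
    }


if __name__ == "__main__":
    # Hypothetical observations, for illustration only.
    seen = [
        ("192.0.2.1", "00:00:5E:00:53:01"),
        ("192.0.2.1", "00:00:5e:00:53:02"),  # second device, same IP
        ("192.0.2.10", "00:00:5e:00:53:0a"),
        ("192.0.2.10", "00:00:5E:00:53:0A"),  # same device, different case
    ]
    for ip, macs in find_duplicate_ips(seen).items():
        print(f"{ip} is claimed by {len(macs)} devices: {', '.join(macs)}")
```

A device with several legitimate interfaces, or a failover pair sharing a virtual address, can also trip this check, so treat a hit as a lead to follow up on rather than proof.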
Daniel Neises
04-13-2000, 11:02 PM
Thanks for the follow-up. Unfortunately, we are now getting a lot more data from our sniffer analysis that points at a combination of Lucent wireless, Windows 2000, and SMS as the culprits. I'll post what we find either way.