WAN Optimization Support

Technical => Hardware => Topic started by: Spiffster on March 19, 2015, 01:26:09 PM

Title: Identifying System Bottlenecks
Post by: Spiffster on March 19, 2015, 01:26:09 PM
We have 2.0.4 set up in production between two offices:

HQ: 1Gb/1Gb
Remote: 50Mb/10Mb

On hot transfers to the remote site we are seeing a 25% improvement over that 50Mb link, while we are seeing over 300% improvement on the 10Mb link. So effectively things are operating at 60Mb/30Mb, which is very impressive, but I can't help but think we could do even better on that 50Mb link. During file transfers I don't see any one thread go over 40% utilization on an E3-1245. That system is running a single SSD (Crucial MX100 128GB).

Can we squeeze even more performance out of wanos than this setup currently gives us? If so, is there a reliable method to determine where the bottleneck may be? I would imagine it would have to be IO, right?
Title: Re: Identifying System Bottlenecks
Post by: ahenning on March 19, 2015, 02:21:25 PM
v.2.0.4 and v.2.0.5 are basically identical except for the one-line MultiSite patch and version numbers.

Bottlenecks:
At 40% I don't think it is in Wanos or the hardware, but one way to check whether more efficiency would improve throughput is to set CLICK=false in /tce/etc/wanos/wanos.conf
It still uses too much memory to be the default mode, but we are working on it. If 3GB+ RAM is available it will be fine.
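For reference, the edit itself is a one-liner. The sketch below demonstrates it on a scratch copy of the config file, assuming wanos.conf uses simple KEY=value lines; on a live appliance you would edit /tce/etc/wanos/wanos.conf itself (and the exact restart procedure, not shown here, depends on your deployment).

```shell
# Work on a scratch copy; on the appliance the real path is
# /tce/etc/wanos/wanos.conf.
conf=$(mktemp)
printf 'CLICK=true\n' > "$conf"        # stand-in for the existing setting

# Flip CLICK to false in place, or append the line if it is missing.
if grep -q '^CLICK=' "$conf"; then
    sed -i 's/^CLICK=.*/CLICK=false/' "$conf"
else
    echo 'CLICK=false' >> "$conf"
fi

grep '^CLICK=' "$conf"                 # prints: CLICK=false
rm -f "$conf"
```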

Note the receiving side can also be a bottleneck. E.g. the Wanos 200 appliance tops out at receiving 30 Mbps even though the head end can process a lot more.

I faintly recall noting 20MB/s copy speed late yesterday, but I could be wrong? Is it possible that hot transfers improve after hours, when providers' WAN links are least busy and contention ratios are low? I am wondering if contention, congestion and latency are starting to affect throughput during business hours. Another control check is to determine whether the 50 Mbps link actually runs at 50 Mbps during office hours without Wanos.

A test with multiple simultaneous transfers can be used to determine if the bottleneck is perhaps on the TCP layer. If this is the case we can look at implementing TCP window scaling tweaks.

A 60 Mbps hot transfer should translate into <6 Mbps of WAN bandwidth. It might be only a small speed boost over 50 Mbps, but the saving on the link, which is then not contending with other transfers, is significant.
Title: Re: Identifying System Bottlenecks
Post by: Spiffster on March 19, 2015, 02:33:06 PM
Don't worry, I realize that Mb / 8 = MB.
The testing I was doing was last night around midnight, and online speed tests were showing full bandwidth available on both ends, so bandwidth contention should not be an issue. I do see that when file transfers start they burst to higher speeds, then taper off to the numbers I provided... so those numbers are probably conservative. The thing is, when I see that wanos is capable of a 300% increase, I am trying to find out if I can see similar improvements on the other side.

I'm not complaining by any means; wanos is performing quite well. I'm just being a bit greedy now :)

I will try the CLICK=false option, as we have oodles of memory available on both ends (12GB).
Title: Re: Identifying System Bottlenecks
Post by: ahenning on March 19, 2015, 02:40:55 PM
Ok great, that config parameter made a significant difference in some of our tests, so maybe it does the trick.

Another useful feature I think we need to add is a Diagnostics > Benchmark page to provide some insight into what is possible on the hardware.
Title: Re: Identifying System Bottlenecks
Post by: Spiffster on March 19, 2015, 02:43:05 PM
Another useful feature I think we need to add is a Diagnostics > Benchmark page to provide some insight into what is possible on the hardware.

That would be great to have!
Title: Re: Identifying System Bottlenecks
Post by: Spiffster on March 20, 2015, 04:41:02 PM
OK I tested last night again and saw a slight improvement in performance in both directions. This was after making the CLICK=false edit in /tce/etc/wanos/wanos.conf

Perhaps latency is another factor? Latency is pretty low between these sites though. Pings over the VPN from firewall to firewall are around 15ms.
Title: Re: Identifying System Bottlenecks
Post by: ahenning on March 20, 2015, 05:02:33 PM
Ok, 15ms is not much, so I doubt it's a bottleneck, unless window scaling is broken (I've had that before with Cisco ASAs).
The aggregate throughput of multiple simultaneous copies would tell us if the hardware is running at peak. If throughput doubles, for example, it means the hardware is capable and it is just a single-session limitation.

Perhaps your original take on IO was spot on. We know the SSDs are capable of much more, but could the VM side perhaps introduce IO latency? This is probably already trialed and tested, but just in case, reserve memory and CPU MHz on the VM. It would have been ideal if another SSD-based appliance was available to test with.

Regarding the CLICK setting: just keep an eye on memory usage. It should use at most about 1.9GB of actual memory in the single-site config. After all testing, if the gains are not substantial enough, rather disable it again until Beyers manages to tweak it down to 1GB.
Title: Re: Identifying System Bottlenecks
Post by: Spiffster on May 04, 2015, 02:14:09 PM
I have done some testing with SSDs on both ends. With hot transfers, when transferring over the 100Mb pipe I get around 105Mb, so very little optimization. Going the other way on the 20Mb pipe I get around 55Mb, so almost 300% optimization.

(These numbers are calculated simply by dividing the file size of a test (uncompressed) Revit file by the transfer time.)

I'm not sure what is limiting the 100Mb pipe, but it almost seems like it bursts to 200% then tapers off quickly; then again, that may be Windows reporting transfer speeds inconsistently. Again, this is with an E3-1245 processor sending on the fast pipe and an X5450 receiving, so pretty fast processing on both ends. The "top" interface doesn't show an IO or CPU bottleneck, but I am only running a single MX100 128GB on both ends... for these speeds, Riverbed and Silverpeak recommend 4-6 SSD drives!

BTW, ESXi introduces almost no IO overhead... maybe 3-5% at most from testing I have done.

That said, I have ordered one of these: http://ark.intel.com/products/67008/Intel-SSD-910-Series-400GB-12-Height-PCIe-2_0-25nm-MLC

If that doesn't eliminate IO as a possible bottleneck, I don't know what will. :)

I will report back after installing the new SSD and testing transfer speeds.
Title: Re: Identifying System Bottlenecks
Post by: ahenning on May 04, 2015, 02:31:26 PM
Hi Jeremy,

It could also be TCP backing off. There are two performance improvements that are 90% complete:

The Async IO has a little room for improvement, which will improve cold throughput by a few %.
We rolled back some of the previous throughput gains due to heavy RAM usage. Once this is optimized hot and cold throughput will benefit by a few %.

To do:
Some early benchmarks on CentOS indicated a few % performance increase.
Possibly revisit Netmap support in Click to improve packet processing.

I think if we ace the above, the 200% burst will be a bit higher and remain longer, and if there is a drop-off, it will not drop as much.

Essentially this means the more efficient the software, the fewer hardware resources required.
Title: Re: Identifying System Bottlenecks
Post by: Spiffster on May 04, 2015, 02:32:54 PM
That's good news, thanks!

One other thing I should point out is that we are indeed getting between 3:1 and 5:1 optimization in both directions despite the limit in transfer speeds, so we are seeing a huge benefit in bandwidth utilization, just not in transfer speed on the 100Mb connection. Very happy with both metrics on the 20Mb upload side of things.
Title: Re: Identifying System Bottlenecks
Post by: Spiffster on May 12, 2015, 02:11:53 PM
OK, so everything is up and running, and I even have wanos loaded onto an Intel 750 Series SSD! Same transfer speeds and optimization, so the SSD isn't a limiting factor. I noticed that I get a fairly consistent 3x optimization on Revit files and anywhere from 3x to 12x on other files. I'm pretty happy with the performance but still think the 100Mb connections could get better throughput. Again, I'm seeing about 105Mb throughput on the 100Mb connection with 3x optimization... whereas on the upload side of things I'm seeing 3x or more on both optimization and throughput. So optimization is good, but there seems to be a bottleneck on throughput at the higher speeds... if that makes sense.
Title: Re: Identifying System Bottlenecks
Post by: ahenning on May 12, 2015, 05:05:37 PM
Hi Jeremy,

Thanks for the feedback. Yes, you are right: since the 3+ times optimization is there, it indicates that we just need to focus on keeping the initial burst of e.g. 200 Mbps constant. The memory optimizations, async IO and the improvements we see on 64-bit CentOS are incremental performance improvements that combined might do the job. Beyers has been very keen to get cracking on performance optimizations, so I am sure the results will be positive.

At some stage a test with iperf could also be useful to find the saturation point. After the software upgrades we will then be able to determine the % performance increase in this particular case.
Title: Re: Identifying System Bottlenecks
Post by: Beyers Cronje on May 12, 2015, 10:05:58 PM
Hi Jeremy,

We are indeed working on performance optimizations, the first of which should be available in the next major release. We have some really exciting developments in the pipeline.

That said, reviewing your particular test case, I have a strong suspicion that you are hitting TCP throughput limitations such as poorly performing or faulty window scaling. This can easily be verified or ruled out using iperf. We are happy to help set up such a test with you.

Beyers
Title: Re: Identifying System Bottlenecks
Post by: ahenning on May 12, 2015, 10:30:16 PM
Another key test would be to run the benchmark in the low setting. This would give us a good indication of where to focus.

If low flies at a constant 200 Mbps with a compressible file, then we know we need to focus on making the high optimization more efficient. If not, then we need to dig a bit deeper.
Title: Re: Identifying System Bottlenecks
Post by: Spiffster on May 13, 2015, 02:05:22 PM
While I'm vaguely familiar with iperf and what it's used for, I am not familiar with this TCP window scaling... I'm not sure if it has any bearing, but I can confirm that the MTU size is the default of 1500 across the board.

The setup is pretty simple:

LAN <--> wanos <--> GW <--> VPN <--> GW <--> wanos <--> LAN

Potentially stupid question: Will I need to install iperf on the wanos box or on a Windows machine?
Title: Re: Identifying System Bottlenecks
Post by: Beyers Cronje on May 13, 2015, 02:32:05 PM
The maximum throughput of a TCP session is expressed as:
Code: [Select]
TCP-Window-Size-in-bits / Latency-in-seconds = Bits-per-second-throughput
Without window scaling, the maximum window size is 64KB. Given a 15ms RTT and the default 64KB window, the maximum TCP throughput is:

Code: [Select]
Window size = 64KB = 65536 Bytes.   65536 * 8 = 524288 bits
15ms RTT = 0.015 seconds
524288 / 0.015 = 34.95 Mbps

So the maximum throughput of a single TCP session over a WAN link with 15ms RTT will be less than 34.95 Mbps.
The throughput drops significantly as the RTT gets higher.
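To illustrate how quickly throughput falls off as RTT grows, here is a quick shell sketch of the same formula, evaluated at a few RTT values (the Mbps figures divide by 10^6 bits, matching the calculation above):

```shell
# Maximum single-session TCP throughput for a fixed 64KB (non-scaled)
# window, at increasing round-trip times.
for rtt_ms in 15 50 100; do
    awk -v rtt="$rtt_ms" 'BEGIN {
        window_bits = 65536 * 8                     # 64KB window in bits
        mbps = window_bits / (rtt / 1000) / 1000000 # bits / seconds
        printf "%3d ms RTT -> %.2f Mbps\n", rtt, mbps
    }'
done
# prints:
#  15 ms RTT -> 34.95 Mbps
#  50 ms RTT -> 10.49 Mbps
# 100 ms RTT -> 5.24 Mbps
```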

Some more information on TCP performance at http://en.wikipedia.org/wiki/TCP_tuning

You run iperf on both ends on the LAN, so on a Windows or Linux box. One side runs as the client while the other runs as the server. iperf is just a command-line tool and is available for Windows, Linux and OSX. Using iperf you can manually set the TCP window size as well as the number of threads to run, or you can use it to send a UDP stream at a given rate. This will help to see what the theoretical maximum throughput of your WAN link is.
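As a rough sketch, a typical test run with the classic iperf (v2) command line might look like the following; the server address 10.0.1.10 is just a placeholder for a LAN host at the far site:

```shell
# On a LAN host at the remote site, start the server (run first):
iperf -s

# From a LAN host at HQ, a 30-second TCP test with an explicit
# 512KB window, to see the effect of a larger window:
iperf -c 10.0.1.10 -t 30 -w 512k

# Four parallel streams, to check whether aggregate throughput
# scales past the single-session number:
iperf -c 10.0.1.10 -t 30 -P 4

# A UDP stream at a fixed 100 Mbps, to estimate raw link capacity
# independently of TCP windowing:
iperf -c 10.0.1.10 -u -b 100M
```

Comparing the single-stream, multi-stream and UDP numbers should show fairly quickly whether the limit is the TCP layer or the link itself.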

Edit: to emphasize window scaling, which most stacks support.
Title: Re: Identifying System Bottlenecks
Post by: ahenning on May 13, 2015, 03:35:42 PM
Wanos by default bypasses UDP, so it could provide some useful stats on native traffic speeds, but it's also useful for providing latency and packet loss stats. I can help with command line flags.

I think that because we see the throughput stabilize at 105 Mbps, it could mean that window scaling is relatively ok.

The test in 'low' cuts out the heavy computational and IO tasks of dedup, and it will tell us whether to focus there or look for bottlenecks somewhere else.

What was the average throughput on the 100 Mbps link before optimization or in other words while in bypass?
Title: Re: Identifying System Bottlenecks
Post by: Spiffster on May 13, 2015, 05:20:06 PM
OK, so I have iperf loaded on a few endpoints and have played around with it a bit... a very nice tool to have. I will probably need to run some tests during lunchtime to get accurate measurements when bandwidth utilization is low.

I have a server at each location that can be bypassed in traffic policies, so I can test with and without optimization.

What flags would you like me to use to test? Thanks.