Friday, February 12, 2016

OAM 11g Webgate Tuning

Introduction

This post is part of a larger series on Oracle Access Manager 11g called Oracle Access Manager Academy. An index to the entire series with links to each of the separate posts is available.
People are typically introduced to Webgate tuning in one of two ways: either they are forced into it by a crisis, or they are actively preparing an environment for aggressive load testing.  Hopefully you are in the latter group.  Unfortunately, there is still a lot of mystery behind tuning some of these Webgate parameters, and creating a comprehensive article that covers all aspects of tuning is a real challenge.  That said, this article focuses on what I feel are the most important tuning parameters: 1) Max Connections, including the relationship between Max Connections and Max Number of Connections, 2) the Failover Threshold, and 3) the AAA Timeout Threshold.  If you can grasp the concepts around these few key parameters, your chances of getting better performance and stability out of the Webgates and Access Servers will greatly increase.

Quick Overview

The knowledge in this article is based on extensive experience in the field, discussions with Oracle Webgate developers, and of course invaluable peers.  As I mentioned in the introduction, I will break Webgate tuning into three areas to make it a little easier to digest.  The three parameters are not necessarily related to or dependent on each other, so you are free to jump to the section you are interested in.  However, I highly advise that you read the entire article before making any major tuning changes.  Below is a screenshot of an 11gR2PS3 (OAM 11.1.2.3.0) Webgate definition that highlights the parameters I will cover plus any associated fields; all settings are R2PS3 default values.


Max Connections --- Not so Literal

The Max Connections parameter can reap some big improvements in performance, but beware: increasing the value does not necessarily equate to more performance and can in fact have a negative impact. The official Oracle OAM 11g 11.1.2 Administration Guide says, “Max Connections is the maximum number of connections that a Webgate can establish with the Access Server.” This statement is a bit confusing and could lead you to believe that setting Max Connections to X means only X connections will ever be opened to the Access Server, but that is completely false.
Before jumping into the Max Connections parameter, we first need to understand how connections work in web servers and how they relate to the Webgate module. Since the majority of the audience uses OHS (Oracle HTTP Server) or Apache, I will focus on OHS to keep things simple; at a fundamental level it is basically Apache, so what I explain for OHS going forward also applies to Apache. If you use a different web server supported by the 11g Webgate, connections can work differently, so please extrapolate this information and try to apply it to your environment.

Worker or Pre-Fork Mode

OHS will run in one of two modes, “Worker” or “Pre-Fork”. The default in OHS is Worker mode. With Apache the mode can depend on how it was compiled, though typical implementations use Worker mode. Be sure to verify what mode you are running in. Worker mode uses multiple child processes with several threads for each process. Each thread will handle one connection at a time.
Now, the thing that is important to understand here is that the Webgate module is actually instantiated by the child processes directly, rather than by the OHS parent process. Again focusing purely on the multi-threaded “Worker” mode, a number of directives within the web server configuration file control exactly how many child processes will be spawned based on the number of incoming requests. From a Webgate point of view, we must bear in mind that each of these child processes will open its own pool of connections to the Access Servers, as defined by the Max Connections setting in the Webgate profile.
As a working example, let us specify Max Connections as “12” on a web server that is configured to spawn up to 20 child processes. The total number of connections from the web server as a whole to the OAM servers will thus be 12 Max Connections times 20 child processes, for a total of 240 connections (12 x 20 = 240). We should always keep this multiplicative effect in mind when defining “Max Connections”, since we don’t want to end up opening too many connections and risk overloading the Access Servers. In the sections that follow, this multiplicative effect will not be explicitly called out, but please remember that it still applies in every case. Let’s work through another example so we fully understand the ramifications of the Max Connections and OHS configuration settings and how they relate.
Take, for example, the default mpm_worker_module section from an OHS httpd.conf file, shown below. We see ThreadsPerChild is set to 25, MaxClients is 150, and StartServers equals 2. The MaxClients value limits the maximum number of threads that OHS can open, while StartServers tells OHS to start 2 child processes at startup. That means at startup we will immediately get 2 children times 25 threads for a total of 50 threads. We know that each child has X Webgate connections, where X is defined by the Max Connections setting in the Webgate profile.  So if our Max Connections is 12, we will immediately have a total of 24 connections (2 StartServers x 12 Max Connections = 24 Webgate connections).  As traffic increases, OHS/Apache will spawn more children, and therefore more Webgate connections, until the MaxClients limit is reached.  With MaxClients set to 150 and ThreadsPerChild set to 25, we can expect somewhere between 6 and 8 children at most (the extras are due to the spare threads portion of the algorithm).  With 12 connections per child, this means a maximum of somewhere between 72 and 96 connections for our example OHS/Apache server.

<IfModule mpm_worker_module>
     StartServers         2
     MaxClients         150
     MinSpareThreads     25
     MaxSpareThreads     75
     ThreadsPerChild     25
     MaxRequestsPerChild  0
     AcceptMutex fcntl
     LockFile 
</IfModule>

If Max Connections is changed to 24, the numbers double: 48 connections at startup (2 StartServers x 24 Max Connections = 48 Webgate connections) and somewhere between 144 and 192 connections at the MaxClients limit (6 to 8 children x 24 Max Connections). As the web server accepts greater loads it will spawn additional child processes as needed, and each new child brings 25 threads plus its own pool of Webgate connections. We can easily see how the Webgate connections can multiply to become hundreds or even thousands of connections from one OHS server to the Access Servers. The only throttle is MaxClients, which limits the total number of threads OHS will open. Keep in mind that a production environment will have several OHS servers, so the load on the Access Servers can grow quite fast. When tuning Max Connections, it is important to monitor CPU and memory utilization, plus the TCP connections on each Access Server, as you tweak Webgate Max Connections or even the OHS ThreadsPerChild and MaxClients values. It is also important to understand that the specific number of threads per process is governed by the setting "ThreadsPerChild".  The takeaway from this lesson is that a few Max Connections can go a long way, but too much of a good thing can be bad. Remember, Mom always knew best when she said everything in moderation.
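The arithmetic from the 12-connection Worker mode example can be captured in a few lines of Python. This is only a rough sketch, not anything OHS or the Webgate provides: the function name is made up for illustration, it assumes every child process fills its whole connection pool, and it ignores the extra spare-thread children mentioned above, so the full-load figure is the lower end of the range.

```python
import math

def worker_mode_connections(max_connections, threads_per_child, max_clients, start_servers):
    """Estimate Webgate connections opened by ONE OHS/Apache server in Worker mode.

    Hypothetical helper for illustration; assumes each child process opens
    its full pool of max_connections.
    """
    # Each child process holds its own pool, so startup cost is per child, not per thread.
    at_startup = start_servers * max_connections
    # MaxClients caps total threads, which caps children at MaxClients / ThreadsPerChild.
    max_children = math.ceil(max_clients / threads_per_child)
    at_full_load = max_children * max_connections
    return at_startup, at_full_load

# Values from the default mpm_worker_module section with Max Connections = 12.
print(worker_mode_connections(max_connections=12, threads_per_child=25,
                              max_clients=150, start_servers=2))  # (24, 72)
```

Multiply the result by the number of OHS servers in front of the Access Servers to get a feel for the total load.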



Now, if your web server is configured for Pre-Fork mode, be especially careful, because each request to the web server is handled by a dedicated (i.e. single-threaded) child process.  It follows that the maximum number of child processes, and hence the total number of Access Server connections, can quickly grow to a very large number.  I am sure you are asking, so what is a good value for Max Connections?  Beyond calculating the sum of the Max Number of Connections values from each primary Access Server (more on that in the next section), unfortunately there is no magical sweet spot.  The value needs to be determined by experimenting with load tests and recording the results, so they can be compared to see which values reap the best performance.  No two implementations are alike, and I have seen nearly as many different values as I have seen deployments.  Before you decide on the Max Connections value, you need to read the next section.
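To put a number on the Pre-Fork warning, here is a small Python sketch. It is an illustration under stated assumptions: the function name is invented, the MaxClients value of 150 is simply borrowed from the Worker example (your Pre-Fork settings may differ), and it assumes every child fills its connection pool.

```python
def prefork_mode_connections(max_connections, max_clients):
    """Worst-case Webgate connections from ONE server in Pre-Fork mode.

    Hypothetical helper; assumes each child opens its full pool.
    """
    # Every Pre-Fork child is single-threaded, so MaxClients caps the
    # number of child processes directly.
    return max_clients * max_connections

# Same Max Connections = 12 as before, with an assumed MaxClients of 150.
print(prefork_mode_connections(max_connections=12, max_clients=150))  # 1800
```

Compare that with the 72 to 96 connections the same Max Connections value produces in the Worker mode example.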

Making the Connection to Max Connections

No pun intended in the connection between Max Connections and Max Number of Connections. In a nutshell, the value for the Max Connections parameter should be the sum of the Max Number of Connections values from each Primary Server. Take the following diagram as an example.




The value for Max Connections in the diagram is 12. If you add up the Max Number of Connections from each of the three Primary Servers it totals 12 (4+4+4=12).
Let’s take another example, but this time change OAM 3 primary Access Server to a secondary server, and also update the Max Number of Connections value for each OAM Server from 4 to 6.


The first thing I want to point out is that the secondary Access Server will not get requests from the Webgate until connections to any primary Access Server fall below the Failover Threshold; more on that later. Since we have two primary OAM servers with Max Number of Connections values of 6 each, the total Max Connections value for the Webgate would be 12 (6+6=12); it is pretty simple. Now that we understand how to get the value for the Max Connections parameter, you may be wondering what value to use for Max Number of Connections: 4, 6, 20, 100? Good question, and fortunately Chris Johnson wrote a great article on this very subject, “How many connections do I need from the WebGate to the OAM Server?”. Again, it must be called out that the number you define in the Webgate profile will be multiplied by the number of web server child processes to determine the actual number of connections, so a little can often go a long way!
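The rule from the two diagrams can be written as a short Python sketch. The dictionary layout and the function name are assumptions made for this example; they do not correspond to any OAM API or file format.

```python
def max_connections_for(servers):
    """Sum Max Number of Connections across PRIMARY servers only.

    Hypothetical helper; 'servers' is an illustrative data layout.
    """
    return sum(s["max_number_of_connections"]
               for s in servers if s["role"] == "primary")

# First diagram: three primaries with 4 connections each.
three_primaries = [
    {"name": "OAM 1", "role": "primary", "max_number_of_connections": 4},
    {"name": "OAM 2", "role": "primary", "max_number_of_connections": 4},
    {"name": "OAM 3", "role": "primary", "max_number_of_connections": 4},
]

# Second diagram: OAM 3 becomes a secondary and every server is raised to 6.
two_primaries_one_secondary = [
    {"name": "OAM 1", "role": "primary", "max_number_of_connections": 6},
    {"name": "OAM 2", "role": "primary", "max_number_of_connections": 6},
    {"name": "OAM 3", "role": "secondary", "max_number_of_connections": 6},
]

print(max_connections_for(three_primaries))              # 12
print(max_connections_for(two_primaries_one_secondary))  # 12
```

The secondary server's value is deliberately left out of the sum, which is why both layouts arrive at 12.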

Does each Max Number of Connections need to be Symmetrical?

So far in my examples I have made each OAM server Max Number of Connections the same or symmetrical, but you don’t necessarily have to do that. You can optionally add more connections to different primary servers if you want more requests to go to any specific server. This strategy is basically a type of load balancing using the Webgate Max Number of Connections configuration value instead of using an actual physical load balancer appliance; take the following diagram as an example.


Notice that the OAM 1 primary server has 8 Max Number of Connections while the OAM 2 and OAM 3 primary servers have 4 each. So the total Max Connections value would be 16 (8+4+4=16). In this particular configuration the OAM 1 server would get double the number of connections from the Webgates compared to each of the other two primary OAM servers. One reason to do this would be that OAM 1 is a much larger server (more memory, etc.) and can handle more traffic, or maybe OAM 1 is physically closer to the Webgate so it can process requests much faster. Even though this is an option, I have never really seen it in practice, because normally all the servers have equivalently sized hardware and are in the same network, so there is no need to distribute more requests to any one server. That said, I did want to at least bring this up so you understand that the option exists if you decide it makes sense.
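A quick Python sketch shows the split this asymmetric layout produces. It assumes connections are spread in proportion to each server's Max Number of Connections, which is the behavior described above; the function name is invented for the example.

```python
def connection_share(max_number_of_connections):
    """Fraction of Webgate connections each primary server would receive.

    Hypothetical helper; assumes a split proportional to the configured values.
    """
    total = sum(max_number_of_connections.values())
    return {name: count / total
            for name, count in max_number_of_connections.items()}

print(connection_share({"OAM 1": 8, "OAM 2": 4, "OAM 3": 4}))
# {'OAM 1': 0.5, 'OAM 2': 0.25, 'OAM 3': 0.25}
```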

The Skinny on Failover Threshold

The latest (at the time of this post) official 11g Access Manager documentation, in Table 16-3 Elements on Expanded 11g and 10g WebGate/Access Client Registration Pages, says the Failover Threshold parameter is a “Number representing the point when this Webgate opens connections to a Secondary OAM Server.” It also gives an example: if 30 were used as a value and the number of connections to primary servers drops to 29, connections begin to open to the secondary Access Server. The default value is 1. This description gives some idea of what is happening, but it offers no recommendations and some find it confusing, so I wanted to add some recommendations from my own experience.

1. First, the word “Failover” in the parameter name means exactly what it says. As connections are lost from each primary OAM server, the Webgate will try to make up for those connections by connecting to a secondary OAM server; hence the word “Failover”.  A big note here: this setting only works if at least one secondary OAM server is defined in the Webgate profile. The Failover Threshold parameter does nothing if no secondary OAM server is defined.


2. Second, the word “Threshold” in the parameter name refers to the point at which connections begin to go over to the secondary OAM server(s).   Based on the official documentation, which is correct, if the Failover Threshold is set to 6 and the Max Number of Connections is also set to 6, then as soon as the number of connections going from the Webgate to the OAM server drops below the Failover Threshold of 6, connections will start to be sent to the secondary OAM server(s).   If there are two secondary OAM servers, the first in the list will be the one getting all the connections. As soon as the first secondary OAM server fills up its Max Number of Connections, the second secondary OAM server will start getting connections. Are you following?
So the big question is, what is the best setting? My recommendation is twofold.
1. If you DO have Secondary OAM Servers configured:
Set the Failover Threshold value equal to the Max Number of Connections only if you have at least one secondary OAM server. Take my examples above, if the OAM server Max Number of Connections is 4, then set the Failover Threshold to 4. The reason for this is that you engage all the processing power needed as connections drop from any one primary OAM server since the secondary OAM server will start picking up the slack. As soon as the primary server having connection problems corrects itself, the Webgate will start failing back to the primary OAM server and slowly drop the connections from the secondary server until all the Max Number of Connections are met.
2. If you DO NOT have Secondary OAM Servers configured:
If you decide not to configure any secondary OAM server, you can leave the Failover Threshold value at the default of 1 because it will never be used. Remember, Failover Threshold requires a secondary OAM server to be configured. In practice, most clients like to see all their hardware provide some value, which means keeping all the servers working to get their money's worth. So I typically see all OAM servers configured as primary servers; there is nothing wrong with this. That said, I have also seen various configurations with a mix of primary and secondary servers in a crisscross fashion that is a bit more complicated, but that certainly has merits too depending on the situation.
If you follow either of the points above you should have a solid configuration.
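The failover behavior described in this section can be modeled with a short Python sketch. This is a simplified model and not the Webgate's actual algorithm: it assumes the Webgate tries to make up exactly the shortfall below the threshold, and the function and argument names are invented for illustration.

```python
def secondary_connections(live_primary, failover_threshold, secondaries):
    """Plan how many connections to open on each secondary server.

    Simplified, assumed model: 'secondaries' is an ordered list of
    (name, max_number_of_connections) pairs, filled in list order.
    """
    # Nothing fails over while live primary connections meet the threshold.
    shortfall = max(0, failover_threshold - live_primary)
    plan = {}
    for name, max_conns in secondaries:
        # The first secondary in the list fills up before the next one is used.
        take = min(shortfall, max_conns)
        plan[name] = take
        shortfall -= take
    return plan

# Healthy: 6 live primary connections with a threshold of 6, so no failover.
print(secondary_connections(6, 6, [("OAM 3", 3), ("OAM 4", 3)]))
# {'OAM 3': 0, 'OAM 4': 0}

# Degraded: only 2 live connections, so 4 move over, filling OAM 3 first.
print(secondary_connections(2, 6, [("OAM 3", 3), ("OAM 4", 3)]))
# {'OAM 3': 3, 'OAM 4': 1}
```

With an empty list of secondaries the plan is always empty, which matches the point that the threshold does nothing without a secondary server.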

AAA Timeout Threshold

The AAA Timeout Threshold parameter setting determines how long the Webgate will wait on a connection response before it gives up and attempts to request a new connection. For example let’s say the Webgate has a connection opened, and a request comes through to validate some credentials. This process normally should take a fraction of a second, but there could be all sorts of variables to make this request take much longer. If the wait for the response is longer than the AAA Timeout Threshold, it will abandon the connection for that request, toss it back in the pool, and open a new connection to try again.
For most of OAM's life (prior to R2 PS3), the default value for AAA Timeout Threshold was “-1” (minus one). The -1 is a special value that tells the Webgate to use the operating system's TCP timeout, which could easily be 2 minutes or even more! I have seen actual cases in practice where something goes awry with an Access Server, and while the Webgate tries to connect to it or get some response from it, the Webgate keeps trying for a long time because the AAA Timeout Threshold was left at the default of -1. As each connection tries for a very long time, the Webgate gets into a state that gives the impression it is down, when in reality the Webgate is doing what it was told, which was to wait a long time before retrying. When all the connections start doing this we have an OAM zombie apocalypse problem. Zombies are bad, but we can try to avoid this behavior by shortening that wait time.
The recommended value is anywhere from 5 to 10 seconds. For example, if you set the AAA Timeout Threshold to 5, the Webgate will open its connection, send its request, and expect a response within 5 seconds. If none arrives, it opens a new connection and tries again, while the old connection is freed up and tossed back into the pool. If the value is set too short, say 1 second, a legitimate authentication or authorization request could take longer than that because the Access Server is waiting for a long LDAP search to return. That would send us into a tailspin, because the request would never complete; there is never enough time allotted for the LDAP search. So we have found that a value of 5 to 10 seconds seems to be a fair and balanced approach.  In R2 PS3 the default is now 5 seconds, which is reasonable.

User-Defined Webgate Parameters

One parameter worth mentioning that many may not know about is “client_request_retry_attempts”. A description of this parameter can be found in the latest (at the time of this article) official Oracle online documentation at https://docs.oracle.com/cd/E40329_01/admin.1112/e27239/register.htm#AIAAG5856. The official description says, “WebGate-to-OAM Server timeout threshold specifies how long (in seconds) the WebGate waits for the OAM Server before it considers it unreachable and attempts the request on a new connection.” At first this seems similar to the AAA Timeout Threshold, but the difference is that this parameter is about how many times the WebGate will retry its request before attempting the secondary server.
So if the AAA Timeout Threshold is set to 5 seconds, the Webgate will time out a connection after 5 seconds with no response, while client_request_retry_attempts tells the Webgate how many times to retry the request. If the value is set to 2, the Webgate will wait 5 seconds (assuming the AAA Timeout Threshold is set to 5), and if that times out it will try up to 2 times before giving up on the connection. This configuration may be useful if you think the network connectivity between the Webgates and the Access Servers is not stable and you want the Webgate to try more than once before closing its connection.
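How the timeout and the retry count combine can be illustrated with a small Python simulation. This is a sketch, not the Webgate's implementation: the function is hypothetical, the latencies are made-up inputs, and max_attempts is treated as the total number of tries, following the "try up to 2 times" wording above (whether the real parameter counts the first attempt is not covered here).

```python
def simulate_request(latencies, timeout_s, max_attempts):
    """Simulate one request that is retried after each timeout.

    Hypothetical model: latencies[i] is how long the Access Server would
    take to answer attempt i, in seconds.
    Returns (succeeded, total_seconds_waited).
    """
    waited = 0.0
    for latency in latencies[:max_attempts]:
        if latency <= timeout_s:
            # A response arrived in time on this attempt.
            return True, waited + latency
        # Timed out: give up on this connection and try again on a new one.
        waited += timeout_s
    return False, waited

# First attempt hangs, second answers in half a second.
print(simulate_request([30.0, 0.5], timeout_s=5, max_attempts=2))   # (True, 5.5)

# Both attempts hang: the caller waits 2 x 5 seconds before giving up.
print(simulate_request([30.0, 30.0], timeout_s=5, max_attempts=2))  # (False, 10.0)
```

The second case shows why the two settings should be tuned together: the worst-case wait is roughly the timeout multiplied by the number of attempts.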

Summary

I realize there are a lot of details in this blog, but it is all very useful and you may need to read each section carefully to absorb the data.  I can say that tuning the Webgate profile is a very important part of an OAM deployment and can save you lots of late nights worrying about performance or outages.  Good luck and be sure to load test your configurations before going live.
