Monday, November 5, 2012

OAM/OVD JVM Tuning

Over the past few weeks I've been involved in several performance tuning exercises involving OAM and OVD.  I thought it would be helpful if I created a post sharing the process I use to analyse and improve performance in OVD and OAM.


The scenario our customer's are seeing are authentication failures in OAM.  Looking at the OAM logs we have the following:

<Oct 16, 2012 11:12:56 AM MDT> <Warning> <oracle.oam.engine.authz>
<BEA-000000> <OAMSSA-06302
oracle.security.am.common.policy.common.response.ResponseException: oracle.security.am.engines.common.identity.provider.exceptions.IdentityProviderException: OAMSSA-20012: Exception in getting user attributes for user : testuser1, idstore MyIdentityStore with exception javax.naming.NamingException: LDAP response read timed out, timeout used:2000ms.; remaining name 'ou=people,dc=oracle,dc=com' at oracle.security.am.common.policy.common.response.IdentityValueProvider.getUserAttribute(IdentityValueProvider.java:243) 
...

During the authentication and authorization process, OAM complains that the LDAP repository is taking too long to return user attributes.  The default value is 2 seconds as can be seen from the exception, "2000ms".   This value can be changed in the oam-config.xml, however, I do not recommend to increase this value until you verify that there is no underlying performance issue with LDAP.  And in the case with OVD, this includes whatever back end LDAP repositories OVD is connected to.

For reference, the attribute to add in the oam-config.xml file is:
<Setting Name="LdapReadTimeout" Type="xsd:string">2000</Setting>

Steps to troubleshoot

First you want to verify that all back end LDAP servers connected via OVD are responding well.  If there are performance issues, these must be resolved first before moving on to OVD.

Second, go through the OVD Tuning Guide located here.  You should also check out this great post from our Fusion Security blog here.

Based on my experience with our customers, there are two configuration changes that reap the greatest performance improvement.  The first, is a single parameter that should be added in the 'listener_os.xml' file.  For each endpoint, add the following parameter, <useNIO>false</useNio>.  This disables JAVA new IO api, which includes support for non-blocking IO.  OVD currently has partial support for NIO, so we need to disable this.

Here is a snippet:

<ldap version="8" id="LDAP Endpoint">
      <port>389</port>
      <host>0.0.0.0</host>
      <threads>100</threads>
      <active>true</active>
      <ssl enabled="false">
         <protocols>SSLv3,TLSv1</protocols>
         <cipherSuites>
            <cipher>SSL_RSA_WITH_RC4_128_MD5</cipher>
            <cipher>SSL_RSA_WITH_RC4_128_SHA</cipher>
         </cipherSuites>
         <authType>Mutual</authType>
         <keyStore>keys.jks</keyStore>
         <trustStore>keys.jks</trustStore>
      </ssl>
      <extendedOps/>
      <anonymousBind>Allow</anonymousBind>
      <workQueueCapacity>8096</workQueueCapacity>
      <allowStartTLS>false</allowStartTLS>
      <socketOptions>
         <backlog>128</backlog>
         <reuseAddress>false</reuseAddress>
         <keepAlive>false</keepAlive>
         <tcpNoDelay>true</tcpNoDelay>
         <readTimeout>0</readTimeout>
      </socketOptions>
      <useNIO>false</useNIO>
   </ldap>

Note: Prior to OVD 11.1.1.6, if you modify the "listener_os.xml" file manually, you will lose your changes if you also make a change in EM.  Be aware of this!

The second change requires a little more work.  Here, we need to determine what the JVM is doing in term of Garbage Collection (GC).  This is a common issue with Java applications.  When major or minor GC occurs, all threads wait while the GC takes place.  So first we need to understand how OVD handles GC's during load and how long each GC takes.  To do this, we need to add a few JAVA options to the start script for OVD.  OVD uses OPMN to manage the process, therefore we must modify the "opmn.xml" file.

<category id="start-options"><data id="java-bin" value="$ORACLE_HOME/jdk/bin/java"/><data id="java-options" value="-server -Xms4096m -Xmx4096m -Dvde.soTimeoutBackend=0 -Didm.oracle.home=$ORACLE_HOME -Dcommon.components.home=$ORACLE_HOME/../oracle_common -XX:+PrintGCDetails -XX:+PrintGCDateStamps -Xloggc:/stage/domains/idmDomain/asinst_1/diagnostics/logs/OVD/ovd1/ovdGClog.log -Doracle.security.jps.config=$ORACLE_INSTANCE/config/JPS/jps-config-jse.xml"/><data id="java-classpath" value="$ORACLE_HOME/ovd/jlib/vde.jar$:$ORACLE_HOME/jdbc/lib/ojdbc6.jar"/></category>

After making the changes you will need to restart OVD.  Now you should have a log file that was specified in the "-Xloggc" parameter.  Below is a sample of the logs:

2012-09-18T10:27:00.709-0700: 19.883: [GC [PSYoungGen: 524288K->10000K(611648K)] 524288K->10000K(2009792K), 0.1052698 secs] [Times: user=3.03 sys=1.09, real=0.11 secs]
2012-09-18T10:28:29.930-0700: 109.099: [GC [PSYoungGen: 534288K->22923K(611648K)] 534288K->22923K(2009792K), 0.1447090 secs] [Times: user=2.69 sys=0.63, real=0.15 secs]
2012-09-18T10:29:25.315-0700: 164.480: [GC [PSYoungGen: 547211K->25151K(611648K)] 547211K->25151K(2009792K), 0.2550807 secs] [Times: user=2.69 sys=1.17, real=0.26 secs]
2012-09-18T10:30:58.813-0700: 257.972: [GC [PSYoungGen: 549439K->27566K(611648K)] 549439K->27566K(2009792K), 0.1709691 secs] [Times: user=1.03 sys=0.02, real=0.17 secs]

You will notice that there are a lot of minor GC's taking over .11 seconds long.  And that the GCs are occurring approximately every 90 seconds.  If the time of the GC are taking longer than the customer SLA in terms of authentication time, this will be a problem.  So what to do, here are some steps to think about.

1) Lower the memory heap size.  Initially the heap was set to 4096 MB, reducing the heap size will of course quicken the GC time as there is less memory in the new generation space to clean up.
2) Change the GC model from the default to Concurrent Mark Sweep.  This GC model does the best when concerned with latency.

So now, here is the latest parameter changes in opmn,xml. Here I'm adding the Concurrent Mark Sweep GC model and lowered the total heap size to 1GB:

<category id="start-options"><data id="java-bin" value="$ORACLE_HOME/jdk/bin/java"/><data id="java-options" value="-server -Xms1024m -Xmx1024m -Dvde.soTimeoutBackend=0 -Didm.oracle.home=$ORACLE_HOME -Dcommon.components.home=$ORACLE_HOME/../oracle_common -XX:+PrintGCDetails -XX:+PrintGCDateStamps -Xloggc:/stage/domains/idmDomain/asinst_1/diagnostics/logs/OVD/ovd1/ovdGClog.log -XX:+UseConcMarkSweepGC -Doracle.security.jps.config=$ORACLE_INSTANCE/config/JPS/jps-config-jse.xml"/><data id="java-classpath" value="$ORACLE_HOME/ovd/jlib/vde.jar$:$ORACLE_HOME/jdbc/lib/ojdbc6.jar"/></category>

After testing we found the following in the GC logs:

2012-09-26T15:43:24.652-0700: 165.404: [GC 165.404: [ParNew: 59008K->2766K(59008K), 0.0423653 secs] 81941K->29498K(517760K), 0.0435140 secs] [Times: user=0.33 sys=0.05, real=0.04 secs]
2012-09-26T15:43:29.488-0700: 170.240: [GC 170.240: [ParNew: 55246K->1914K(59008K), 0.0274577 secs] 81978K->28646K(517760K), 0.0285813 secs] [Times: user=0.24 sys=0.02, real=0.03 secs]
2012-09-26T15:43:32.748-0700: 173.499: [GC 173.500: [ParNew: 54394K->1679K(59008K), 0.0255866 secs] 81126K->28410K(517760K), 0.0265438 secs] [Times: user=0.19 sys=0.01, real=0.03 secs]
2012-09-26T15:43:36.071-0700: 176.822: [GC 176.823: [ParNew: 54159K->2069K(59008K), 0.0297551 secs] 80890K->28801K(517760K), 0.0310910 secs] [Times: user=0.28 sys=0.01, real=0.03 secs]
2012-09-26T15:43:38.963-0700: 179.714: [GC 179.715: [ParNew: 54549K->1444K(59008K), 0.0262057 secs] 81281K->28618K(517760K), 0.0273837 secs] [Times: user=0.16 sys=0.02, real=0.03 secs]
2012-09-26T15:43:41.404-0700: 182.155: [GC 182.156: [ParNew: 53924K->1481K(59008K), 0.0277875 secs] 81098K->28754K(517760K), 0.0289236 secs] [Times: user=0.19 sys=0.01, real=0.03 secs]
2012-09-26T15:43:44.122-0700: 184.872: [GC 184.873: [ParNew: 53961K->1504K(59008K), 0.0286972 secs] 81234K->28918K(517760K), 0.0296503 secs] [Times: user=0.19 sys=0.01, real=0.03 secs]

Notice the difference, the time it took for GC has reduced drastically.  The trade-off is that now the GC's are occurring more frequently, this is okay.  This is almost an 8x better performance!

Make sure that you also run some longevity test to see if any major GC's are occurring.  These should rarely occur, if at all.  As you run through this iterative process, you may need to adjust the perm size, max/min heap etc. for optimal performance.  

For OAM you will need to apply the same steps describe to reduce the latency that is caused by GC's.  However, for OAM I would not recommend the total heap size to be below 2GB.  You need to keep mindful that OAM does more caching than OVD (unless the cache plug-in is enabled) so set the heap size accordingly.











No comments:

Post a Comment

Note: Only a member of this blog may post a comment.