
A story about Data Guard and network latency

Imagine you run a trading application that, every morning at market open, processes many thousands of orders from a queue using dozens of database connections. And as we all know, trade order processing is a time-critical task.
The system is a two-node RAC cluster backed by a Data Guard physical standby database with the exact same specifications and configuration. To give you an idea of what is going on in the database during this peak order processing, here are a few AWR stats (this is from node 1, but node 2 looks very much the same):

									Per Second
DB Time(s):							23.7
DB CPU(s):							4.8
Background CPU(s):					0.6
Redo size (bytes):					2,569,543.4
Logical read (blocks):				692,837.9
Block changes:						11,668.0
Physical read (blocks):				463.7
Physical write (blocks):			219.8
Read IO requests:					416.8
Write IO requests:					99.6
Global Cache blocks received:		2,125.3
Global Cache blocks served:			2,368.7
User calls:							9,354.2
Parses (SQL):						6,208.7
Hard parses (SQL):					1.6
Executes (SQL):						14,642.5
Rollbacks:							6.9
Transactions:						518.9

This is a fairly busy system with 9,000+ user calls and 500+ transactions per second. “DB CPU” is about 1/5 of “DB Time”, and with 32 physical cores per node the system certainly is not CPU bound. The stats below are based on a 30-second snapshot interval (I know, it’s insanely short, but we had our reasons 🙂 ).

Event								Waits		Total Wait Time (sec)		Wait Avg(ms)	% DB time	Wait Class
log file sync						15,404		244.9						15.90			29.8		Commit
DB CPU							 				167.9	 									20.4	 
gc buffer busy release				2,601		101.1						38.87			12.3		Cluster
gc buffer busy acquire				7,240		81							11.19			9.8			Cluster
gc current block busy				4,662		78.8						16.91			9.6			Cluster
gc cr block busy					3,945		68.7						17.40			8.3			Cluster
cell single block physical read 	13,029		12.3						0.94			1.5			User I/O
enq: TX - index contention			653			7.3							11.21			.9			Concurrency
enq: TX - row lock contention		378			5.3							14.08			.6			Application
gc current block 2-way				24,480		4.4							0.18			.5			Cluster

The top 10 waits show that 30% of the time was spent on “log file sync” with an average wait of almost 16ms. 16ms commit latency seems high, so let’s take a closer look at the wait event histogram:

Event			Waits 64ms to 2s		<32ms	<64ms	<1/8s	<1/4s	<1/2s	<1s		<2s		>=2s
log file sync	801						94.8	 .7		  .5	   3.7		.2		 .2

Most outliers (15,404/100*3.7 ≈ 570) are in the 128ms–250ms bucket. “log file sync” is the wait seen by the foreground processes, and in Oracle 12c the wait chain for Data Guard SYNC mode is as follows:

"log file sync" (foreground process)
	=> "SYNC Remote Write" (LGWR)
		=> "Redo Transport MISC" (NSSn)

So let’s see what we have in the wait stats for the background processes:

Event								Waits		%Time -outs		Total Wait Time (s)		Avg wait (ms)	Waits /txn		% bg time
gcs log flush sync					91,603		1				45						0.49			5.09			19.64
Redo Transport MISC					6,335		0				29						4.59			0.35			12.62
SYNC Remote Write					6,335		0				28						4.34			0.35			11.93
log file parallel write				6,297		0				3						0.54			0.35			1.49
gc current block busy				53			0				2						43.59			0.00			1.00
gc buffer busy release				18			0				1						72.30			0.00			0.56
...

While not great, the average for “Redo Transport MISC” is “only” 4.59ms as compared to the 16ms “log file sync”.
Background wait event histogram:

Event					Waits 64ms to 2s	<32ms	<64ms	<1/8s	<1/4s	<1/2s	<1s		<2s		>=2s
Redo Transport MISC		57					99.1	 .0		  .0	   .8		.0		 .0	 	 	 	 
SYNC Remote Write		57					99.1	 .0		  .0	   .8		.0		 .0	 	 

As with “log file sync”, most of the outliers for the redo transmission wait events are in the 128ms–250ms range.

It looks like the redo transmission from primary to standby is sometimes stalling LGWR and thus causing many foreground transactions to wait for LGWR (that’s why the average “log file sync” wait is so much higher).

SQL traces from the NSS/RFS processes confirmed that there are around 80-120 outliers per minute. While the redo transmission is held up at the TCP layer (or below), the corresponding RFS process on the standby is waiting on “SQL*Net message from client” or “SQL*Net more data from client”.
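In case you want to collect such traces yourself, one way is to attach to the NSS (or RFS) process with oradebug, reproduce the workload, and then switch the trace off again. Just a sketch; the OS PID below is a placeholder, find the real one via ps first:

$ ps -ef | grep ora_nss          # NSSn = SYNC redo transport process on the primary

SQL> oradebug setospid 12345
SQL> oradebug event 10046 trace name context forever, level 8
SQL> oradebug event 10046 trace name context off
SQL> oradebug tracefile_name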
NSS2 trace:

WAIT #0: nam='Redo Transport MISC' ela= 205429 p1=0 p2=0 p3=0 obj#=-1 tim=12736668661230

corresponding RFS trace:

WAIT #0: nam='SQL*Net message from client' ela= 207662 driver id=1413697536 #bytes=1 p3=0 obj#=-1 tim=18731316629403

This told us that both the sending and the receiving process were waiting on the network stack and, interestingly, in all instances the waits took just a little over 200ms.

The next thing was running a TCP packet capture (tcpdump) on the primary & standby database servers to see what was going on on the network stack. This revealed that every time we saw a ~200ms “Redo Transport MISC” event in the NSS2 trace, a TCP retransmission had occurred.
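For reference, the capture itself can be as simple as this (a sketch; the interface name, standby host, and port are placeholders for this environment):

# on the primary: capture the redo transport traffic towards the standby
tcpdump -i eth2 -s 0 -w /tmp/dg_primary.pcap host standby-host and tcp port 1521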

Why does it consistently take 200ms before the packet is retransmitted?
There is an exponential back-off algorithm in the Linux kernel for TCP retransmissions and it starts at 200ms in this environment:

$ grep '^CONFIG_HZ' /boot/config-$(uname -r)
CONFIG_HZ_1000=y
CONFIG_HZ=1000

$ grep '#define TCP_RTO_MIN' /usr/src/kernels/$(uname -r)/include/net/tcp.h
#define TCP_RTO_MIN	((unsigned)(HZ/5))

With HZ=1000, TCP_RTO_MIN = HZ/5 = 200 jiffies = 200ms. Since in this case retransmissions occur only occasionally relative to the total packet volume, the exponential back-off never kicks in and the RTO stays at 200ms.

By the way, the retransmission timeout is calculated per connection and the current value can be displayed using the “ss” command. For instance:

$ ss -o -p -n -i sport = :2483
Netid State      Recv-Q Send-Q                                                          Local Address:Port                                                                         Peer Address:Port              
tcp   ESTAB      0      0                                                        ::ffff:192.168.56.60:2483                                                                  ::ffff:192.168.56.1:46218               users:(("oracle_3485_dev",pid=3485,fd=16)) timer:(keepalive,9min52sec,0)
	 ts sack cubic wscale:7,7 rto:208 rtt:7.382/13.049 ato:40 mss:1448 rcvmss:1448 advmss:1448 cwnd:10 bytes_acked:5897 bytes_received:4860 send 15.7Mbps lastsnd:8237 lastrcv:8238 lastack:8237 pacing_rate 31.4Mbps rcv_rtt:60 rcv_space:28960
tcp   ESTAB      0      0                                                        ::ffff:192.168.56.60:2483                                                                  ::ffff:192.168.56.1:46086               users:(("oracle_2024_dev",pid=2024,fd=16)) timer:(keepalive,4min45sec,0)
	 ts sack cubic wscale:7,7 rto:212 rtt:11.107/15.77 ato:40 mss:1448 rcvmss:1448 advmss:1448 cwnd:10 bytes_acked:7009 bytes_received:7710 send 10.4Mbps lastsnd:1515530 lastrcv:1515611 lastack:315015 pacing_rate 20.9Mbps rcv_rtt:54 rcv_space:28960

You can see one connection has rto=208 and the other rto=212; both are close to the 200ms minimum.

What can we do about it?
Well, ideally you want no TCP retransmissions to occur in your network, but that’s just not going to happen; after all, TCP was designed to deal with lossy networks. In this system the retransmissions make up less than 0.1% of all the Data Guard traffic, yet the impact on the trading application is real. Since TCP_RTO_MIN and the back-off algorithm are hard-wired into the Linux kernel, I only know of one way to change the minimum RTO:
Set up a separate network route for the Data Guard traffic (if you haven’t already for other reasons) and set the RTO at the IP route level:

ip route change <dest> dev <device> proto kernel scope link src <IP> \
  rto_min 10ms

With that, retransmissions happen after 10ms instead of 200ms, which mitigates the impact on LGWR and the foreground processes waiting for the redo to be shipped. How low you can set the RTO depends on your network characteristics, and you may need to dial in the value carefully so as not to cause spurious retransmissions.

$ ss -o -p -n -i sport = :2483
Netid State      Recv-Q Send-Q                                                          Local Address:Port                                                                         Peer Address:Port              
tcp   ESTAB      0      0                                                        ::ffff:192.168.56.60:2483                                                                  ::ffff:192.168.56.1:45430               users:(("oracle_1651_dev",pid=1651,fd=16)) timer:(keepalive,9min52sec,0)
	 ts sack cubic wscale:7,7 rto:11 rtt:0.303/0.43 ato:40 mss:1448 rcvmss:1448 advmss:1448 cwnd:10 bytes_acked:5897 bytes_received:4860 send 382.3Mbps lastsnd:5082 lastrcv:5421 lastack:5082 pacing_rate 764.3Mbps retrans:0/2 rcv_rtt:31 rcv_space:28960
tcp   ESTAB      0      0                                                        ::ffff:192.168.56.60:2483                                                                  ::ffff:192.168.56.1:45438               users:(("oracle_1655_dev",pid=1655,fd=16)) timer:(keepalive,9min54sec,0)
	 ts sack cubic wscale:7,7 rto:11 rtt:0.291/0.334 ato:40 mss:1448 rcvmss:1448 advmss:1448 cwnd:10 bytes_acked:5896 bytes_received:4860 send 398.1Mbps lastsnd:5082 lastrcv:5556 lastack:5082 pacing_rate 794.1Mbps retrans:0/2 rcv_rtt:69 rcv_space:28960

Thanks to the IP route configuration, the socket-level RTO now starts at 10ms (actually at 11ms in the above example).
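While dialing in the value, it helps to keep an eye on the kernel’s retransmission counters before and after the change (a sketch; this is a standard Linux tool):

# cumulative TCP retransmission counters since boot
netstat -s | grep -i retrans

(nstat from the iproute2 package is handy too, as it reports deltas between runs.)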

For single instance the setup is fairly simple; for RAC you need to provision an additional set of SCAN listeners.

To make the route change persistent across OS reboots I use a systemd oneshot unit, sketched below.
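A minimal sketch of such a unit (unit name, destination network, device, and source IP are placeholders; adjust them to your Data Guard route):

cat > /etc/systemd/system/dg-rto-min.service <<'EOF'
[Unit]
Description=Lower TCP rto_min on the Data Guard route
After=network-online.target
Wants=network-online.target

[Service]
Type=oneshot
RemainAfterExit=yes
ExecStart=/usr/sbin/ip route change 192.168.57.0/24 dev eth2 proto kernel scope link src 192.168.57.10 rto_min 10ms

[Install]
WantedBy=multi-user.target
EOF

systemctl daemon-reload
systemctl enable --now dg-rto-min.service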

12c Oracle DCD – kernel call dive

As you have probably read in the 12c documentation, Oracle has changed the mechanism for DCD (Dead Connection Detection) from sending TNS packets to using TCP keepalive on the socket. Unfortunately, the documentation is extremely sparse about implementation details.

When I was tracking down a firewall configuration issue recently I was in need of a few more details. Specifically, I wanted to know what TCP_KEEPIDLE, TCP_KEEPINTVL, and TCP_KEEPCNT are set to for various values of SQLNET.EXPIRE_TIME.
Since Oracle reads sqlnet.ora and sets the socket options early during process startup (spawned by the listener), I opted for SystemTap to capture all calls to “setsockopt”. Unfortunately, on my Oracle Linux 7.4 box I ran into a stupid SystemTap bug. So, DTrace it is:

/* on entry: print the caller and the setsockopt() arguments */
syscall::setsockopt:entry
{
  printf("%s | socket=%d | level=%d | optname=%d | optlen=%d", execname, arg0, arg1, arg2, arg4);
  /* copy the option value from user space and remember it (interpreted as int) */
  self->optval = *(int *)copyin(arg3, arg4);
}

/* on return: print the option value captured on entry */
syscall::setsockopt:return
{
  printf("optval=%d\n", self->optval);
}
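I saved the two probes as “sockopt.d” and started the capture as root while establishing a test connection (stop it with Ctrl-C when done):

dtrace -s sockopt.d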

I ran it on a dedicated lab VM to minimize clutter in the output, as the D script captures all “setsockopt” calls, not just the ones from Oracle.
But before I show the output, we need to know a few things.

The function definition for “setsockopt”:
This is the system call used to modify socket options such as SO_KEEPALIVE.

int setsockopt(
  int sockfd,
  int level,
  int optname,
  const void *optval,
  socklen_t optlen);
sockfd  => file descriptor for the socket
level   => the level on which to set the option/value (socket or any protocol)
optname => the number representing the option name according to header files
optval  => the value to set for the given option
optlen  => the size of the option value in bytes

Note that the sequence of arguments matches arg0 .. arg4 in the above D script.

The relevant socket level & options (from socket.h):

# level
1 => SOL_SOCKET
 
# options
9  => SO_KEEPALIVE
20 => SO_RCVTIMEO
21 => SO_SNDTIMEO

The relevant TCP level & options (from tcp.h):

# level
6 => SOL_TCP
 
# options
1 => TCP_NODELAY
4 => TCP_KEEPIDLE
5 => TCP_KEEPINTVL
6 => TCP_KEEPCNT

Now we’ve got all the information necessary to interpret the output from my D script.

SQLNET.EXPIRE_TIME = 1

dtrace: script 'sockopt.d' matched 2 probes
CPU     ID                    FUNCTION:NAME
  0    125                 setsockopt:entry tnslsnr | socket=14 | level=6 | optname=1 | optlen=4
  0    126                setsockopt:return optval=1
  
  1    125                 setsockopt:entry oracle_7090_dev | socket=14 | level=6 | optname=1 | optlen=4
  1    126                setsockopt:return optval=1
  
  1    125                 setsockopt:entry oracle_7090_dev | socket=14 | level=1 | optname=9 | optlen=4
  1    126                setsockopt:return optval=1
  
  1    125                 setsockopt:entry oracle_7090_dev | socket=14 | level=6 | optname=4 | optlen=4
  1    126                setsockopt:return optval=60
  
  1    125                 setsockopt:entry oracle_7090_dev | socket=14 | level=6 | optname=5 | optlen=4
  1    126                setsockopt:return optval=6
  
  1    125                 setsockopt:entry oracle_7090_dev | socket=14 | level=6 | optname=6 | optlen=4
  1    126                setsockopt:return optval=10

The output shows the following:

  • SO_KEEPALIVE (level=1, optname=9) is enabled
  • TCP_KEEPIDLE (level=6, optname=4) is set to SQLNET.EXPIRE_TIME converted to seconds (EXPIRE_TIME is specified in minutes)
  • TCP_KEEPINTVL (level=6, optname=5) is set to 6
  • TCP_KEEPCNT (level=6, optname=6) is set to 10

OK, let’s gradually increase SQLNET.EXPIRE_TIME and see what happens.

SQLNET.EXPIRE_TIME = 10

... 
  0    125                 setsockopt:entry oracle_6400_dev | socket=14 | level=6 | optname=4 | optlen=4
  0    126                setsockopt:return optval=600
  
  0    125                 setsockopt:entry oracle_6400_dev | socket=14 | level=6 | optname=5 | optlen=4
  0    126                setsockopt:return optval=6
  
  0    125                 setsockopt:entry oracle_6400_dev | socket=14 | level=6 | optname=6 | optlen=4
  0    126                setsockopt:return optval=10
...

SQLNET.EXPIRE_TIME = 60

...
  1    125                 setsockopt:entry oracle_7147_dev | socket=14 | level=6 | optname=4 | optlen=4
  1    126                setsockopt:return optval=3600
  
  1    125                 setsockopt:entry oracle_7147_dev | socket=14 | level=6 | optname=5 | optlen=4
  1    126                setsockopt:return optval=6
  
  1    125                 setsockopt:entry oracle_7147_dev | socket=14 | level=6 | optname=6 | optlen=4
  1    126                setsockopt:return optval=10
...

SQLNET.EXPIRE_TIME = 120

...
  1    125                 setsockopt:entry oracle_7163_dev | socket=14 | level=6 | optname=4 | optlen=4
  1    126                setsockopt:return optval=7200
  
  1    125                 setsockopt:entry oracle_7163_dev | socket=14 | level=6 | optname=5 | optlen=4
  1    126                setsockopt:return optval=6
  
  1    125                 setsockopt:entry oracle_7163_dev | socket=14 | level=6 | optname=6 | optlen=4
  1    126                setsockopt:return optval=10
...

Conclusion
From this I conclude that Oracle always sets TCP_KEEPIDLE to the value of SQLNET.EXPIRE_TIME converted to seconds.
TCP_KEEPINTVL and TCP_KEEPCNT are not adjusted based on SQLNET.EXPIRE_TIME but are always set to 6 and 10, respectively.
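If you want to cross-check this on a live connection, the keepalive timer is visible per socket with the same “ss -o” command used earlier (a sketch, assuming the listener port is 1521):

$ ss -o -p -n sport = :1521

The timer:(keepalive,...) field shows the countdown until the next keepalive probe.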

For the sake of completeness, here’s what happens when the connection closes.

 
  1    125                 setsockopt:entry oracle_7090_dev | socket=14 | level=1 | optname=21 | optlen=16
  1    126                setsockopt:return optval=0
  
  1    125                 setsockopt:entry oracle_7090_dev | socket=14 | level=1 | optname=20 | optlen=16
  1    126                setsockopt:return optval=0
  
  1    125                 setsockopt:entry oracle_7090_dev | socket=14 | level=1 | optname=9 | optlen=4
  1    126                setsockopt:return optval=0

Are you fishing or catching? – Server-side SQL*Net tracing for specific clients

SQL*Net tracing on the database server is all or nothing (sqlnet.ora):
TRACE_LEVEL_SERVER=support

On a busy system this quickly gets out of hand. More often than not I find myself in need of enabling SQL*Net trace on the server for a particular database client.
The first step is to set up a separate TNS_ADMIN location that we want the server processes for our client connections to pick up:

$ mkdir /u01/app/oracle/network/admin-trc
$ cp /u01/app/oracle/network/admin/sqlnet.ora /u01/app/oracle/network/admin-trc/.

Then enable SQL*Net tracing in the new TNS_ADMIN location:

vi /u01/app/oracle/network/admin-trc/sqlnet.ora

...
TRACE_LEVEL_SERVER=support
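While you are at it, it can be worth pointing the traces at a dedicated directory and capping their size. The parameters below are standard sqlnet.ora trace settings, the values are just examples (DIAG_ADR_ENABLED=OFF makes the server honour the TRACE_* directory settings instead of writing to the ADR):

TRACE_DIRECTORY_SERVER=/u01/app/oracle/network/trace-trc
TRACE_FILELEN_SERVER=102400
TRACE_FILENO_SERVER=3
TRACE_TIMESTAMP_SERVER=ON
DIAG_ADR_ENABLED=OFF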

If you can modify the client’s connection string, things are easy. All we need is a separate service on the existing listener that the client then connects to:
vi /u01/app/oracle/network/admin/listener.ora

SID_LIST_LISTENER =
  (SID_LIST =
    (SID_DESC =
      (GLOBAL_DBNAME = pdbdev1_trc.localdomain)
      (ORACLE_HOME = /u01/app/oracle/product/ora12201)
      (SID_NAME = DEV1)
      (ENVS="TNS_ADMIN=/u01/app/oracle/network/admin-trc")
    )
  )

Line 4: Defines our new service name
Line 7: This makes server processes for connections through that service read our sqlnet.ora in the “admin-trc” directory and therefore enable SQL*Net tracing.
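Don’t forget to make the listener re-read its configuration; the new service should then show up in the services listing:

$ lsnrctl reload
$ lsnrctl services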

Now, change the client’s connection string to use the “pdbdev1_trc” service and we’re done.
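For reference, a connect descriptor pointing at the new service could look like this (hostname taken from the listener example below, port 1521 assumed for the original listener):

(DESCRIPTION =
  (ADDRESS = (PROTOCOL = TCP)(HOST = ol7ora122dev1.localdomain)(PORT = 1521))
  (CONNECT_DATA = (SERVICE_NAME = pdbdev1_trc.localdomain))
)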
But what if, for some reason, you cannot easily change the client’s connect string? That’s when I like to resort to port forwarding using firewalld.

First, we spin up an additional listener on another port. Let’s assume our original listener runs on port 1521.
vi /u01/app/oracle/network/admin/listener.ora

LISTENER_TRC =
  (DESCRIPTION_LIST =
    (DESCRIPTION =
      (ADDRESS = (PROTOCOL = TCP)(HOST = ol7ora122dev1.localdomain)(PORT = 1522))
    )
  )

SID_LIST_LISTENER_TRC =
  (SID_LIST =
    (SID_DESC =
      (GLOBAL_DBNAME = pdbdev1.localdomain)
      (ORACLE_HOME = /u01/app/oracle/product/ora12201)
      (SID_NAME = DEV1)
      (ENVS="TNS_ADMIN=/u01/app/oracle/network/admin-trc")
    )
  )

Line 4: the new listener listens on port 1522
Line 11: the service name is the same as on the original listener (the one the client connects to)
Line 14: again, this sets the TNS_ADMIN location so the server process picks up our sqlnet.ora with SQL*Net tracing enabled.
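The new listener needs to be started separately, and its services can be checked the same way:

$ lsnrctl start LISTENER_TRC
$ lsnrctl services LISTENER_TRC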

Second, add the port-forwarding rule:

firewall-cmd --zone=oradb-server \
  --add-rich-rule='rule family="ipv4" source address="<client-IP>" forward-port port=1521 protocol=tcp to-port=1522'
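To double-check that the rule is active, list the rich rules of the zone:

firewall-cmd --zone=oradb-server --list-rich-rules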

Now, every connection request from the client we want a server-side SQL*Net trace for gets forwarded to our new listener, and the resulting server process picks up our modified sqlnet.ora.
Obviously, this assumes you already have firewalld configured and running. If not, await my next blog post on basic firewalld configuration for Oracle database servers.

Enjoy!

P.S. When you have all the traces you need, you can remove the port-forwarding rule like this:

firewall-cmd --zone=oradb-server \
  --remove-rich-rule='rule family="ipv4" source address="<client-IP>" forward-port port=1521 protocol=tcp to-port=1522'