{"id":416,"date":"2019-01-21T00:49:22","date_gmt":"2019-01-21T00:49:22","guid":{"rendered":"https:\/\/www.spotonoracle.com\/?p=416"},"modified":"2019-01-21T18:39:19","modified_gmt":"2019-01-21T18:39:19","slug":"a-story-about-data-guard-and-network-latency","status":"publish","type":"post","link":"https:\/\/www.spotonoracle.com\/?p=416","title":{"rendered":"A story about Data Guard and network latency"},"content":{"rendered":"<p>Imagine you run a trading application that, every morning at market open, processes many thousands of orders from a queue using dozens of database connections. And as we all know, trade order processing is a time critical task.<br \/>\nThe system is a two node RAC cluster that is backed by a Data Guard physical standby database with the exact same specifications and configuration. To give you an idea what is going on in the database during this peak order processing, here are a few AWR stats (this is from node 1, but node 2 very much looks the same):<\/p>\n<pre class=\"brush: plain; collapse: false; highlight: [2,3,14,19]; title: ; wrap-lines: false; notranslate\" title=\"\">\r\n\t\t\t\t\t\t\t\t\tPer Second\r\nDB Time(s):\t\t\t\t\t\t\t23.7\r\nDB CPU(s):\t\t\t\t\t\t\t4.8\r\nBackground CPU(s):\t\t\t\t\t0.6\r\nRedo size (bytes):\t\t\t\t\t2,569,543.4\r\nLogical read (blocks):\t\t\t\t692,837.9\r\nBlock changes:\t\t\t\t\t\t11,668.0\r\nPhysical read (blocks):\t\t\t\t463.7\r\nPhysical write (blocks):\t\t\t219.8\r\nRead IO requests:\t\t\t\t\t416.8\r\nWrite IO requests:\t\t\t\t\t99.6\r\nGlobal Cache blocks received:\t\t2,125.3\r\nGlobal Cache blocks served:\t\t\t2,368.7\r\nUser calls:\t\t\t\t\t\t\t9,354.2\r\nParses (SQL):\t\t\t\t\t\t6,208.7\r\nHard parses (SQL):\t\t\t\t\t1.6\t0.0\r\nExecutes (SQL):\t\t\t\t\t\t14,642.5\r\nRollbacks:\t\t\t\t\t\t\t6.9\t0.0\r\nTransactions:\t\t\t\t\t\t518.9\r\n<\/pre>\n<p>This is a fairly busy system with 9000+ user calls and 500+ transactions per second. 
&#8220;DB CPU&#8221; is about 1\/5 of &#8220;DB Time&#8221;, and with 32 physical cores per node the system is certainly not CPU bound. The stats below are based on a 30-second snapshot interval (I know, it&#8217;s insanely short, but we had our reasons \ud83d\ude42 ).<\/p>\n<pre class=\"brush: plain; collapse: false; highlight: [2]; title: ; wrap-lines: false; notranslate\" title=\"\">\r\nEvent\t\t\t\t\t\t\t\tWaits\t\tTotal Wait Time (sec)\t\tWait Avg(ms)\t% DB time\tWait Class\r\nlog file sync\t\t\t\t\t\t15,404\t\t244.9\t\t\t\t\t\t15.90\t\t\t29.8\t\tCommit\r\nDB CPU\t\t\t\t\t\t\t \t\t\t\t167.9\t \t\t\t\t\t\t\t\t\t20.4\t \r\ngc buffer busy release\t\t\t\t2,601\t\t101.1\t\t\t\t\t\t38.87\t\t\t12.3\t\tCluster\r\ngc buffer busy acquire\t\t\t\t7,240\t\t81\t\t\t\t\t\t\t11.19\t\t\t9.8\t\t\tCluster\r\ngc current block busy\t\t\t\t4,662\t\t78.8\t\t\t\t\t\t16.91\t\t\t9.6\t\t\tCluster\r\ngc cr block busy\t\t\t\t\t3,945\t\t68.7\t\t\t\t\t\t17.40\t\t\t8.3\t\t\tCluster\r\ncell single block physical read \t13,029\t\t12.3\t\t\t\t\t\t0.94\t\t\t1.5\t\t\tUser I\/O\r\nenq: TX - index contention\t\t\t653\t\t\t7.3\t\t\t\t\t\t\t11.21\t\t\t.9\t\t\tConcurrency\r\nenq: TX - row lock contention\t\t378\t\t\t5.3\t\t\t\t\t\t\t14.08\t\t\t.6\t\t\tApplication\r\ngc current block 2-way\t\t\t\t24,480\t\t4.4\t\t\t\t\t\t\t0.18\t\t\t.5\t\t\tCluster\r\n<\/pre>\n<p>The top 10 waits show that 30% of the time was spent on &#8220;log file sync&#8221; with an average wait time of almost 16ms. 16ms commit latency seems high, so let&#8217;s take a closer look at the wait event histogram:<\/p>\n<pre class=\"brush: plain; collapse: false; highlight: [2]; title: ; wrap-lines: false; notranslate\" title=\"\">\r\nEvent\t\t\tWaits 64ms to 2s\t\t&lt;32ms\t&lt;64ms\t&lt;1\/8s\t&lt;1\/4s\t&lt;1\/2s\t&lt;1s\t\t&lt;2s\t\t&gt;=2s\r\nlog file sync\t801\t\t\t\t\t\t94.8\t .7\t\t  .5\t   3.7\t\t.2\t\t .2\r\n<\/pre>\n<p>Most of the outliers (15404\/100*3.7 &#8776; 570) are in the 128ms &#8211; 250ms bucket. 
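As a quick sanity check of that estimate (the 15,404 waits and the 3.7% bucket share are taken from the AWR output above), the arithmetic can be scripted:

```shell
# Back-of-the-envelope count of "log file sync" waits in the <1/4s
# bucket: total waits from the top-10 list times the bucket percentage.
total_waits=15404
bucket_pct=3.7
awk -v t="$total_waits" -v p="$bucket_pct" 'BEGIN { printf "%d\n", t * p / 100 }'
```

This prints 569, i.e. roughly 570 commits in the 30-second snapshot stalled for an eighth to a quarter of a second.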
&#8220;log file sync&#8221; is the wait seen by the foreground processes, and in Oracle 12c the wait chain for Data Guard SYNC mode is as follows:<\/p>\n<pre class=\"brush: plain; collapse: false; title: ; wrap-lines: false; notranslate\" title=\"\">\r\n&quot;log file sync&quot; (foreground process)\r\n\t=&gt; &quot;SYNC Remote Write&quot; (LGWR)\r\n\t\t=&gt; &quot;Redo Transport MISC&quot; (NSSn)\r\n<\/pre>\n<p>So let&#8217;s see what we have in the wait stats for the background processes:<\/p>\n<pre class=\"brush: plain; collapse: false; highlight: [3,4,5]; title: ; wrap-lines: false; notranslate\" title=\"\">\r\nEvent\t\t\t\t\t\t\t\tWaits\t\t%Time -outs\t\tTotal Wait Time (s)\t\tAvg wait (ms)\tWaits \/txn\t\t% bg time\r\ngcs log flush sync\t\t\t\t\t91,603\t\t1\t\t\t\t45\t\t\t\t\t\t0.49\t\t\t5.09\t\t\t19.64\r\nRedo Transport MISC\t\t\t\t\t6,335\t\t0\t\t\t\t29\t\t\t\t\t\t4.59\t\t\t0.35\t\t\t12.62\r\nSYNC Remote Write\t\t\t\t\t6,335\t\t0\t\t\t\t28\t\t\t\t\t\t4.34\t\t\t0.35\t\t\t11.93\r\nlog file parallel write\t\t\t\t6,297\t\t0\t\t\t\t3\t\t\t\t\t\t0.54\t\t\t0.35\t\t\t1.49\r\ngc current block busy\t\t\t\t53\t\t\t0\t\t\t\t2\t\t\t\t\t\t43.59\t\t\t0.00\t\t\t1.00\r\ngc buffer busy release\t\t\t\t18\t\t\t0\t\t\t\t1\t\t\t\t\t\t72.30\t\t\t0.00\t\t\t0.56\r\n...\r\n<\/pre>\n<p>While not great, the average for &#8220;Redo Transport MISC&#8221; is &#8220;only&#8221; 4.59ms compared to the 16ms for &#8220;log file sync&#8221;.<br \/>\nBackground wait event histogram:<\/p>\n<pre class=\"brush: plain; collapse: false; highlight: [2]; title: ; wrap-lines: false; notranslate\" title=\"\">\r\nEvent\t\t\t\t\tWaits 64ms to 2s\t&lt;32ms\t&lt;64ms\t&lt;1\/8s\t&lt;1\/4s\t&lt;1\/2s\t&lt;1s\t\t&lt;2s\t\t&gt;=2s\r\nRedo Transport MISC\t\t57\t\t\t\t\t99.1\t .0\t\t  .0\t   .8\t\t.0\t\t .0\t \t \t \t \r\nSYNC Remote Write\t\t57\t\t\t\t\t99.1\t .0\t\t  .0\t   .8\t\t.0\t\t .0\t \t \r\n<\/pre>\n<p>As with &#8220;log file sync&#8221;, most of the outliers for the redo transmission wait events are in 
the 128ms &#8211; 250ms range.<\/p>\n<p>It looks like the redo transmission from primary to standby sometimes stalls LGWR, thereby causing <strong>many<\/strong> foreground transactions to wait on LGWR (which is why the average &#8220;log file sync&#8221; wait is so much higher).<\/p>\n<p>SQL traces from the NSS\/RFS processes confirmed that there are around 80-120 outliers per minute. While the redo transmission is held up at the TCP layer (or below), the corresponding RFS process on the standby is waiting on &#8220;SQL*Net message from client&#8221; or &#8220;SQL*Net more data from client&#8221;.<br \/>\nNSS2 trace:<\/p>\n<pre class=\"brush: plain; collapse: false; highlight: [2]; title: ; wrap-lines: false; notranslate\" title=\"\">\r\nWAIT #0: nam='Redo Transport MISC' ela= 205429 p1=0 p2=0 p3=0 obj#=-1 tim=12736668661230\r\n<\/pre>\n<p>Corresponding RFS trace:<\/p>\n<pre class=\"brush: plain; collapse: false; highlight: [2]; title: ; wrap-lines: false; notranslate\" title=\"\">\r\nWAIT #0: nam='SQL*Net message from client' ela= 207662 driver id=1413697536 #bytes=1 p3=0 obj#=-1 tim=18731316629403\r\n<\/pre>\n<p>This told us that both the sender and the receiving process were waiting on the network stack and, interestingly, in all instances the waits took just a little over 200ms.<\/p>\n<p>The next step was running a TCP packet capture (tcpdump) on the primary &#038; standby database servers to see what was going on on the network stack. 
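Outliers like the trace lines above can be pulled out of a trace file with a small filter. This is only an illustrative sketch: the sample trace lines are fabricated and the 100ms threshold is arbitrary.

```shell
# Print WAIT lines whose elapsed time ('ela=', in microseconds) exceeds
# 100ms. The sample trace content below is made up for illustration.
cat > /tmp/nss_sample.trc <<'EOF'
WAIT #0: nam='Redo Transport MISC' ela= 1342 p1=0 p2=0 p3=0 obj#=-1 tim=12736668455801
WAIT #0: nam='Redo Transport MISC' ela= 205429 p1=0 p2=0 p3=0 obj#=-1 tim=12736668661230
EOF
awk -F'ela= ' '/Redo Transport MISC/ { split($2, a, " "); if (a[1] + 0 > 100000) print }' /tmp/nss_sample.trc
```

Counting such lines per minute of trace gives the outlier rate without eyeballing the whole file.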
This revealed that every time we saw a 200ms-long &#8220;Redo Transport MISC&#8221; event in the NSS2 trace, a TCP retransmission had occurred, e.g.:<br \/>\n<a href=\"https:\/\/www.spotonoracle.com\/wp-content\/uploads\/2019\/01\/tcpdump_retransmission.png\"><img loading=\"lazy\" decoding=\"async\" src=\"https:\/\/www.spotonoracle.com\/wp-content\/uploads\/2019\/01\/tcpdump_retransmission-1024x80.png\" alt=\"\" width=\"625\" height=\"49\" class=\"alignnone size-large wp-image-415\" srcset=\"https:\/\/www.spotonoracle.com\/wp-content\/uploads\/2019\/01\/tcpdump_retransmission-1024x80.png 1024w, https:\/\/www.spotonoracle.com\/wp-content\/uploads\/2019\/01\/tcpdump_retransmission-300x23.png 300w, https:\/\/www.spotonoracle.com\/wp-content\/uploads\/2019\/01\/tcpdump_retransmission-768x60.png 768w, https:\/\/www.spotonoracle.com\/wp-content\/uploads\/2019\/01\/tcpdump_retransmission-624x49.png 624w, https:\/\/www.spotonoracle.com\/wp-content\/uploads\/2019\/01\/tcpdump_retransmission.png 1503w\" sizes=\"auto, (max-width: 625px) 100vw, 625px\" \/><\/a><\/p>\n<p><strong>Why does it consistently take 200ms before the packet is retransmitted?<\/strong><br \/>\nThe Linux kernel uses an exponential back-off algorithm for TCP retransmissions, and in this environment it starts at 200ms:<\/p>\n<pre class=\"brush: bash; collapse: false; highlight: [3,7]; title: ; wrap-lines: false; notranslate\" title=\"\">\r\n$ grep '^CONFIG_HZ' \/boot\/config-$(uname -r)\r\nCONFIG_HZ_1000=y\r\nCONFIG_HZ=1000\r\n\r\n$ grep '#define TCP_RTO_MIN' \/usr\/src\/kernels\/$(uname -r)\/include\/net\/tcp.h\r\n#define TCP_RTO_MIN\t((unsigned)(HZ\/5))\r\n<\/pre>\n<p>1000 HZ \/ 5 = <strong>200ms<\/strong> (the minimum RTO period). 
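To illustrate the back-off (a toy sketch, not kernel code): each further retransmission of the same segment roughly doubles the timeout, starting from TCP_RTO_MIN:

```shell
# Exponential back-off starting at TCP_RTO_MIN (HZ=1000 -> 200ms); the
# kernel doubles the RTO on each successive retransmission of a segment,
# capped at TCP_RTO_MAX (120s, not reached in these five attempts).
rto=200
for attempt in 1 2 3 4 5; do
  printf 'retransmission %d: RTO %dms\n' "$attempt" "$rto"
  rto=$((rto * 2))
done
```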
As retransmissions in this case occur only occasionally relative to the total packet volume, the back-off algorithm never kicks in and the RTO stays at 200ms.<\/p>\n<p>By the way, the retransmission timeout is calculated per connection, and the current value can be displayed using the &#8220;ss&#8221; command. For instance:<\/p>\n<pre class=\"brush: bash; collapse: false; highlight: [4,6]; title: ; wrap-lines: false; notranslate\" title=\"\">\r\n$ ss -o -p -n -i sport = :2483\r\nNetid State      Recv-Q Send-Q                                                          Local Address:Port                                                                         Peer Address:Port              \r\ntcp   ESTAB      0      0                                                        ::ffff:192.168.56.60:2483                                                                  ::ffff:192.168.56.1:46218               users:((&quot;oracle_3485_dev&quot;,pid=3485,fd=16)) timer:(keepalive,9min52sec,0)\r\n\t ts sack cubic wscale:7,7 rto:208 rtt:7.382\/13.049 ato:40 mss:1448 rcvmss:1448 advmss:1448 cwnd:10 bytes_acked:5897 bytes_received:4860 send 15.7Mbps lastsnd:8237 lastrcv:8238 lastack:8237 pacing_rate 31.4Mbps rcv_rtt:60 rcv_space:28960\r\ntcp   ESTAB      0      0                                                        ::ffff:192.168.56.60:2483                                                                  ::ffff:192.168.56.1:46086               users:((&quot;oracle_2024_dev&quot;,pid=2024,fd=16)) timer:(keepalive,4min45sec,0)\r\n\t ts sack cubic wscale:7,7 rto:212 rtt:11.107\/15.77 ato:40 mss:1448 rcvmss:1448 advmss:1448 cwnd:10 bytes_acked:7009 bytes_received:7710 send 10.4Mbps lastsnd:1515530 lastrcv:1515611 lastack:315015 pacing_rate 20.9Mbps rcv_rtt:54 rcv_space:28960\r\n<\/pre>\n<p>You can see one connection has RTO=208 and the other RTO=212, but both are close to 200ms.<\/p>\n<p><strong>What can we do about it?<\/strong><br \/>\nWell, ideally you want no TCP retransmissions to occur 
in your network, but that&#8217;s just not going to happen; after all, TCP was designed to deal with lossy networks. In this system the retransmissions make up less than 0.1% of all Data Guard traffic; nevertheless, the impact on the trading application is real. Since TCP_RTO_MIN and the back-off algorithm are hard-wired into the Linux kernel, I only know of one way to change the minimum RTO:<br \/>\nSet up a separate network route for the Data Guard traffic (if you haven&#8217;t already for other reasons) and set the RTO at the IP route level:<\/p>\n<pre class=\"brush: bash; collapse: false; highlight: [2]; title: ; wrap-lines: false; notranslate\" title=\"\">\r\nip route change &lt;dest&gt; dev &lt;device&gt; proto kernel scope link src &lt;IP&gt; \\\r\n  rto_min 10ms\r\n<\/pre>\n<p>With that, retransmissions happen after 10ms instead of 200ms, which mitigates the impact on LGWR and on the foreground processes waiting for the redo data to be shipped. How low you can set the RTO depends on your network characteristics, and you may need to tune the value carefully so as not to cause spurious extra retransmissions.<\/p>\n<pre class=\"brush: bash; collapse: false; highlight: [4,6]; title: ; wrap-lines: false; notranslate\" title=\"\">\r\nss -o -p -n -i sport = :2483\r\nNetid State      Recv-Q Send-Q                                                          Local Address:Port                                                                         Peer Address:Port              \r\ntcp   ESTAB      0      0                                                        ::ffff:192.168.56.60:2483                                                                  ::ffff:192.168.56.1:45430               users:((&quot;oracle_1651_dev&quot;,pid=1651,fd=16)) timer:(keepalive,9min52sec,0)\r\n\t ts sack cubic wscale:7,7 rto:11 rtt:0.303\/0.43 ato:40 mss:1448 rcvmss:1448 advmss:1448 cwnd:10 bytes_acked:5897 bytes_received:4860 send 382.3Mbps lastsnd:5082 lastrcv:5421 lastack:5082 pacing_rate 764.3Mbps retrans:0\/2 
rcv_rtt:31 rcv_space:28960\r\ntcp   ESTAB      0      0                                                        ::ffff:192.168.56.60:2483                                                                  ::ffff:192.168.56.1:45438               users:((&quot;oracle_1655_dev&quot;,pid=1655,fd=16)) timer:(keepalive,9min54sec,0)\r\n\t ts sack cubic wscale:7,7 rto:11 rtt:0.291\/0.334 ato:40 mss:1448 rcvmss:1448 advmss:1448 cwnd:10 bytes_acked:5896 bytes_received:4860 send 398.1Mbps lastsnd:5082 lastrcv:5556 lastack:5082 pacing_rate 794.1Mbps retrans:0\/2 rcv_rtt:69 rcv_space:28960\r\n<\/pre>\n<p>Thanks to the IP route configuration, the socket-level RTO now starts at 10ms (actually 11ms in the above example).<\/p>\n<p>For single instance the setup is fairly simple; for RAC you need to provision an additional set of SCAN listeners.<\/p>\n<p>To make the route change persistent across OS reboots, I use a oneshot systemd unit.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Imagine you run a trading application that, every morning at market open, processes many thousands of orders from a queue using dozens of database connections. And as we all know, trade order processing is a time-critical task. 
The system is a two node RAC cluster that is backed by a Data Guard physical standby [&hellip;]<\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[3,16],"tags":[],"class_list":["post-416","post","type-post","status-publish","format-standard","hentry","category-internals","category-linux"],"_links":{"self":[{"href":"https:\/\/www.spotonoracle.com\/index.php?rest_route=\/wp\/v2\/posts\/416","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.spotonoracle.com\/index.php?rest_route=\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.spotonoracle.com\/index.php?rest_route=\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.spotonoracle.com\/index.php?rest_route=\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/www.spotonoracle.com\/index.php?rest_route=%2Fwp%2Fv2%2Fcomments&post=416"}],"version-history":[{"count":30,"href":"https:\/\/www.spotonoracle.com\/index.php?rest_route=\/wp\/v2\/posts\/416\/revisions"}],"predecessor-version":[{"id":448,"href":"https:\/\/www.spotonoracle.com\/index.php?rest_route=\/wp\/v2\/posts\/416\/revisions\/448"}],"wp:attachment":[{"href":"https:\/\/www.spotonoracle.com\/index.php?rest_route=%2Fwp%2Fv2%2Fmedia&parent=416"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.spotonoracle.com\/index.php?rest_route=%2Fwp%2Fv2%2Fcategories&post=416"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.spotonoracle.com\/index.php?rest_route=%2Fwp%2Fv2%2Ftags&post=416"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}