Today my database hung. The alert log showed:
WAITED TOO LONG FOR A ROW CACHE ENQUEUE LOCK! pid=31
The environment is AIX 5.3, a RAC setup running Oracle 10.2.0.4.
Since the symptoms are identical, I have not pulled the logs out of my own production environment here; instead I am following a blog post, which goes as follows:
1.
Shanxi Telecom alert log alert_mega.log:
Mon Jun 01 02:50:03 CST 2015
>>> WAITED TOO LONG FOR A ROW CACHE ENQUEUE LOCK! pid=31
System State dumped to trace file /opt/oracle/admin/dba/udump/mega_ora_27190.trc
Tue Jun 02 02:50:03 CST 2015
>>> WAITED TOO LONG FOR A ROW CACHE ENQUEUE LOCK! pid=32
System State dumped to trace file /opt/oracle/admin/dba/udump/mega_ora_2063.trc
Tue Jun 02 17:53:40 CST 2015
>>> WAITED TOO LONG FOR A ROW CACHE ENQUEUE LOCK! pid=37
System State dumped to trace file /opt/oracle/admin/dba/udump/mega_ora_7006.trc
Tue Jun 02 18:00:41 CST 2015
>>> WAITED TOO LONG FOR A ROW CACHE ENQUEUE LOCK! pid=38
System State dumped to trace file /opt/oracle/admin/dba/udump/mega_ora_7120.trc
Wed Jun 03 02:50:02 CST 2015
>>> WAITED TOO LONG FOR A ROW CACHE ENQUEUE LOCK! pid=39
System State dumped to trace file /opt/oracle/admin/dba/udump/mega_ora_9632.trc
Wed Jun 03 20:54:39 CST 2015
>>> WAITED TOO LONG FOR A ROW CACHE ENQUEUE LOCK! pid=41
System State dumped to trace file /opt/oracle/admin/dba/udump/mega_ora_15023.trc
Wed Jun 03 20:55:07 CST 2015
>>> WAITED TOO LONG FOR A ROW CACHE ENQUEUE LOCK! pid=42
System State dumped to trace file /opt/oracle/admin/dba/udump/mega_ora_15029.trc
Wed Jun 03 20:55:26 CST 2015
>>> WAITED TOO LONG FOR A ROW CACHE ENQUEUE LOCK! pid=43
System State dumped to trace file /opt/oracle/admin/dba/udump/mega_ora_15033.trc
Thu Jun 04 02:50:02 CST 2015
>>> WAITED TOO LONG FOR A ROW CACHE ENQUEUE LOCK! pid=44
System State dumped to trace file /opt/oracle/admin/dba/udump/mega_ora_16725.trc
Fri Jun 05 02:50:02 CST 2015
>>> WAITED TOO LONG FOR A ROW CACHE ENQUEUE LOCK! pid=45
System State dumped to trace file /opt/oracle/admin/dba/udump/mega_ora_23781.trc
Sat Jun 06 02:50:03 CST 2015
>>> WAITED TOO LONG FOR A ROW CACHE ENQUEUE LOCK! pid=46
System State dumped to trace file /opt/oracle/admin/dba/udump/mega_ora_30833.trc
Sat Jun 06 23:50:03 CST 2015
>>> WAITED TOO LONG FOR A ROW CACHE ENQUEUE LOCK! pid=47
System State dumped to trace file /opt/oracle/admin/dba/udump/mega_ora_4951.trc
Sun Jun 07 02:50:02 CST 2015
>>> WAITED TOO LONG FOR A ROW CACHE ENQUEUE LOCK! pid=48
System State dumped to trace file /opt/oracle/admin/dba/udump/mega_ora_5891.trc
Mon Jun 08 02:50:02 CST 2015
>>> WAITED TOO LONG FOR A ROW CACHE ENQUEUE LOCK! pid=49
System State dumped to trace file /opt/oracle/admin/dba/udump/mega_ora_7520.trc
Tue Jun 09 02:50:02 CST 2015
>>> WAITED TOO LONG FOR A ROW CACHE ENQUEUE LOCK! pid=50
System State dumped to trace file /opt/oracle/admin/dba/udump/mega_ora_14768.trc
Wed Jun 10 02:50:02 CST 2015
>>> WAITED TOO LONG FOR A ROW CACHE ENQUEUE LOCK! pid=51
System State dumped to trace file /opt/oracle/admin/dba/udump/mega_ora_23975.trc
Wed Jun 10 22:24:05 CST 2015
>>> WAITED TOO LONG FOR A ROW CACHE ENQUEUE LOCK! pid=52
System State dumped to trace file /opt/oracle/admin/dba/udump/mega_ora_29755.trc
Wed Jun 10 22:25:03 CST 2015
>>> WAITED TOO LONG FOR A ROW CACHE ENQUEUE LOCK! pid=53
System State dumped to trace file /opt/oracle/admin/dba/udump/mega_ora_29760.trc
Thu Jun 11 02:50:02 CST 2015
>>> WAITED TOO LONG FOR A ROW CACHE ENQUEUE LOCK! pid=54
System State dumped to trace file /opt/oracle/admin/dba/udump/mega_ora_31040.trc
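As an aside that is not part of the referenced post: if the instance is still responsive while these warnings are being written, the same picture can usually be pulled live instead of waiting for a dump. For the 'row cache lock' event, p1 is the dictionary cache id, so a query along the following lines (run as SYSDBA, 10.2 column names assumed) shows who is waiting and on which cache:

-- Sessions currently waiting on 'row cache lock' and the dictionary cache involved.
-- p1 = cache id, p2 = mode held, p3 = mode requested; note v$rowcache may return
-- more than one row per cache# where subordinate caches exist.
SELECT sw.sid,
       sw.p1          AS cache_id,
       rc.parameter   AS cache_name,     -- e.g. dc_users
       sw.p3          AS requested_mode,
       sw.seconds_in_wait
  FROM v$session_wait sw, v$rowcache rc
 WHERE sw.event  = 'row cache lock'
   AND rc.cache# = sw.p1;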
2.
The trace file contains a systemstate dump. Let's first run it through ass.awk to analyze the systemstate dump:
Object Names
~~~~~~~~~~~~
Enqueue PR-00000000-00000000
Latch c0000000c2df3b70 holding (efd=5) c0000000c2df3b70 slave cl
Rcache object=c000000f9fdf8160,
Enqueue US-0000004C-00000000
Enqueue FB-00000006-0186BDC8
LOCK: handle=c000000f388db3d0 TABL:REPORT.STATQ_AGENT_SUBS_NEW

It looks like the session with ORAPID=41 may be the blocker, and the ass report also shows a large number of 'log file switch (checkpoint incomplete)' waits across the system. Let's first look at the holder of latch c0000000c2df3b70:
PROCESS 41:
----------------------------------------
SO: c000000fbe3c1910, type: 2, owner: 0000000000000000, flag: INIT/-/-/0x00
(process) Oracle pid=41, calls cur/top: c000000f9d63cbd0/c000000fa2c63088, flag: (2) SYSTEM
          int error: 0, call error: 0, sess error: 0, txn error 0
(post info) last post received: 0 0 33
            last post received-location: ksrpublish
            last process to post me: c000000fba3f0d80 1 6
            last post sent: 0 0 24
            last post sent-location: ksasnd
            last process posted by me: c000000fb93cbc00 1 6
(latch info) wait_event=0 bits=1
  holding    (efd=5) c0000000c2df3b70 slave class create level=0
      Location from where latch is held: ksvcreate:
      Context saved from call: 0
      state=busy, wlstate=free
      waiters [orapid (seconds since: put on list, posted, alive check)]:
       48 (61788, 1251079676, 3)
       118 (61728, 1251079676, 0)
       waiter count=2
  Process Group: DEFAULT, pseudo proc: c000000fbc44e418
  O/S info: user: oracle, term: UNKNOWN, ospid: 16929
  OSD pid info: Unix process pid: 16929, image: (MMON)

MMON is waiting on:
service name: SYS$BACKGROUND
last wait for 'os thread startup' blocking sess=0x0000000000000000 seq=59956 wait_time=1006468
    seconds since wait started=62472
    =0, =0, =0
Dumping Session Wait History
 for 'os thread startup' count=1 wait_time=1006468
    =0, =0, =0
 for 'os thread startup' count=1 wait_time=958707
    =0, =0, =0
 for 'os thread startup' count=1 wait_time=1017506
    =0, =0, =0
 for 'os thread startup' count=1 wait_time=975671
    =0, =0, =0
 for 'os thread startup' count=1 wait_time=976846
    =0, =0, =0
 for 'os thread startup' count=1 wait_time=984219
    =0, =0, =0
 for 'os thread startup' count=1 wait_time=585799
    =0, =0, =0
 for 'os thread startup' count=1 wait_time=1355858
    =0, =0, =0
 for 'os thread startup' count=1 wait_time=979583
    =0, =0, =0
 for 'os thread startup' count=1 wait_time=990744
    =0, =0, =0

It looks like MMON is waiting for another process that it spawned to finish starting up:

SO: c000000f9e6e3a20, type: 16, owner: c000000fbe3c1910, flag: INIT/-/-/0x00
(osp req holder)
  CHILD REQUESTS:
  (osp req) type=2(BACKGROUND) flags=0x20001(STATIC/-) state=8(DEAD) err=0
            pg=129(default) arg1=0 arg2=(null) reply=(null) pname=m000
            pid=0 parent=c000000f9e6e3a20 fulfill=0000000000000000

So MMON is waiting for m000 to start. But would MMON waiting on m000 be enough to hang the whole system? We need to dig further.
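Another aside that is not part of the original trace walk-through: on a live instance the same MMON symptom can be spotted without a systemstate dump by listing the current wait of every background process. The sketch below assumes the 10.2 column names in v$bgprocess, v$process and v$session:

-- Current wait event of each running background process
-- (MMON stuck on 'os thread startup' would stand out here).
SELECT bg.name, s.sid, p.spid, s.event, s.seconds_in_wait
  FROM v$bgprocess bg, v$process p, v$session s
 WHERE bg.paddr = p.addr
   AND s.paddr  = p.addr
   AND bg.paddr <> hextoraw('00')
 ORDER BY s.seconds_in_wait DESC;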
3.
Now let's come back to the >>> WAITED TOO LONG FOR A ROW CACHE ENQUEUE LOCK! pid=XXX messages.
These waits had been present from shortly after 4 o'clock in the early morning right up until the system went down. From the ass output we can see:
254:waiting for 'row cache lock' [Rcache object=c000000f9fdf8160,] wait
255:waiting for 'row cache lock' [Rcache object=c000000f9fdf8160,] wait
256:waiting for 'row cache lock' [Rcache object=c000000f9fdf8160,] wait
257:waiting for 'row cache lock' [Rcache object=c000000f9fdf8160,] wait
258:waiting for 'row cache lock' [Rcache object=c000000f9fdf8160,] wait
259:waiting for 'row cache lock' [Rcache object=c000000f9fdf8160,] wait
260:waiting for 'row cache lock' [Rcache object=c000000f9fdf8160,] wait
261:waiting for 'row cache lock' [Rcache object=c000000f9fdf8160,] wait
262:waiting for 'row cache lock' [Rcache object=c000000f9fdf8160,] wait
263:waiting for 'row cache lock' [Rcache object=c000000f9fdf8160,] wait
264:waiting for 'row cache lock' [Rcache object=c000000f9fdf8160,] wait
265:waiting for 'row cache lock' [Rcache object=c000000f9fdf8160,] wait
266:waiting for 'row cache lock' [Rcache object=c000000f9fdf8160,] wait
267:waiting for 'row cache lock' [Rcache object=c000000f9fdf8160,] wait
268:waiting for 'row cache lock' [Rcache object=c000000f9fdf8160,] wait
269:waiting for 'row cache lock' [Rcache object=c000000f9fdf8160,] wait
270:waiting for 'row cache lock' [Rcache object=c000000f9fdf8160,] wait
271:waiting for 'row cache lock' [Rcache object=c000000f9fdf8160,] wait
272:waiting for 'row cache lock' [Rcache object=c000000f9fdf8160,] wait
273:waiting for 'row cache lock' [Rcache object=c000000f9fdf8160,] wait
274:waiting for 'row cache lock' [Rcache object=c000000f9fdf8160,] wait
275:waiting for 'row cache lock' [Rcache object=c000000f9fdf8160,] wait
276:waiting for 'row cache lock' [Rcache object=c000000f9fdf8160,] wait
277:waiting for 'row cache lock' [Rcache object=c000000f9fdf8160,] wait
278:waiting for 'row cache lock' [Rcache object=c000000f9fdf8160,] wait
279:waiting for 'row cache lock' [Rcache object=c000000f9fdf8160,] wait
280:waiting for 'row cache lock' [Rcache object=c000000f9fdf8160,] wait
281:waiting for 'row cache lock' [Rcache object=c000000f9fdf8160,] wait

Let's check process 254:
service name: SYS$USERS
O/S info: user: report, term: , ospid: 17444, machine: qdbb2
program: (TNS V1-V3)
application name: (TNS V1-V3), hash value=516955253
waiting for 'row cache lock' blocking sess=0x0000000000000000 seq=1 wait_time=0
    seconds since wait started=3425
    cache id=7, mode=0, request=3
Dumping Session Wait History
 for 'row cache lock' count=1 wait_time=2615950
    cache id=7, mode=0, request=3
 for 'row cache lock' count=1 wait_time=2929664
    cache id=7, mode=0, request=3
 for 'row cache lock' count=1 wait_time=2929674
    cache id=7, mode=0, request=3
 for 'row cache lock' count=1 wait_time=2929666
    cache id=7, mode=0, request=3
SO: c000000f9f9198b8, type: 50, owner: c000000ee30a3060, flag: INIT/-/-/0x00
row cache enqueue: count=1 session=c000000fba59b660 object=c000000f9fdf8160, request=S
    savepoint=0x4
row cache parent object: address=c000000f9fdf8160 cid=7(dc_users)
    hash=cdb2e447 typ=11 transaction=c000000faae8c710 flags=0000012a
    own=c000000f9fdf8230[c000000f9f919a08,c000000f9f919a08]
    wat=c000000f9fdf8240[c000000f9f9198e8,c000000f9f918b08] mode=X

This is a dc_users object. The session is requesting access in S mode and is waiting for a holder in mode=X to release it. So we search for c000000f9fdf8160 with mode=X and find:
SO: c000000f9f9199d8, type: 50, owner: c000000faae8c710, flag: INIT/-/-/0x00
row cache enqueue: count=1 session=c000000fba5f3b30 object=c000000f9fdf8160, mode=X
    savepoint=0x6
row cache parent object: address=c000000f9fdf8160 cid=7(dc_users)
    hash=cdb2e447 typ=11 transaction=c000000faae8c710 flags=0000012a
    own=c000000f9fdf8230[c000000f9f919a08,c000000f9f919a08]
    wat=c000000f9fdf8240[c000000f9f9198e8,c000000f9f918b08] mode=X
    status=VALID/UPDATE/-/-/IO/-/-/-/-
    request=N release=TRUE flags=0
    instance lock id=QH b6ae8f92 4cc6da8c
    set=0, complete=FALSE
    set=1, complete=FALSE
    set=2, complete=FALSE

This is the session that holds the dc_users row cache object, and it holds it in an incompatible mode. Scrolling back, we find that this session belongs to:

PROCESS 61:
----------------------------------------
SO: c000000fbf3bfe48, type: 2, owner: 0000000000000000, flag: INIT/-/-/0x00
(process) Oracle pid=61, calls cur/top: c000000fa0c0eb30/c000000f8b0fdef8, flag: (0) -
          int error: 0, call error: 0, sess error: 0, txn error 0
(post info) last post received: 109 0 4
            last post received-location: kslpsr
            last process to post me: c000000fbc3c2d98 1 6
            last post sent: 0 0 24
            last post sent-location: ksasnd
            last process posted by me: c000000fbc3c2d98 1 6
(latch info) wait_event=109 bits=0
  Process Group: DEFAULT, pseudo proc: c000000fbc44e418
  O/S info: user: oracle, term: UNKNOWN, ospid: 2006
  OSD pid info: Unix process pid: 2006, image:
SO: c000000fba5f3b30, type: 4, owner: c000000fbf3bfe48, flag: INIT/-/-/0x00
(session) sid: 1973 trans: c000000faae8c710, creator: c000000fbf3bfe48, flag: (41) USR/- BSY/-/-/-/-/-
          DID: 0001-003D-041A6D36, short-term DID: 0000-0000-00000000
          txn branch: 0000000000000000
          oct: 0, prv: 0, sql: 0000000000000000, psql: 0000000000000000, user: 0/SYS
service name: SYS$USERS
O/S info: user: liping, term: L902054, ospid: 163000:163128, machine:
program: plsqldev.exe
application name: plsqldev.exe, hash value=2126984564
waiting for 'log file switch (checkpoint incomplete)' blocking sess=0x0000000000000000 seq=9 wait_time=0
    seconds since wait started=2290
    =0, =0, =0
Dumping Session Wait History
 for 'log file switch (checkpoint incomplete)' count=1 wait_time=87117
    =0, =0, =0
 for 'log file switch (checkpoint incomplete)' count=1 wait_time=976593
    =0, =0, =0
 for 'log file switch (checkpoint incomplete)' count=1 wait_time=979179
    =0, =0, =0
 for 'log file switch (checkpoint incomplete)' count=1 wait_time=7099
    =0, =0, =0
 for 'log file switch (checkpoint incomplete)' count=1 wait_time=976471
    =0, =0, =0
 for 'log file switch (checkpoint incomplete)' count=1 wait_time=976611
    =0, =0, =0
 for 'log file switch (checkpoint incomplete)' count=1 wait_time=973437
    =0, =0, =0
 for 'log file switch (checkpoint incomplete)' count=1 wait_time=969860
    =0, =0, =0
 for 'log file switch (checkpoint incomplete)' count=1 wait_time=976535
    =0, =0, =0
 for 'log file switch (checkpoint incomplete)' count=1 wait_time=978815
    =0, =0, =0

This is a PL/SQL Developer user. From the above, this session is the source of the blocking that is holding up a large number of sessions, yet it is itself also a victim: it is waiting on 'log file switch (checkpoint incomplete)'. Why is that?
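Before moving on to LGWR, a side note that is not in the original post: the same holder/waiter picture on the dc_users row cache object can in principle be read live from v$rowcache_parent, which exposes lock_mode and lock_request per parent object. The sketch below assumes that view is populated on this 10.2 system:

-- Holders (lock_mode > 0) and waiters (lock_request > 0) on dc_users entries,
-- mapped back to their sessions.
SELECT rp.cache_name, rp.lock_mode, rp.lock_request,
       s.sid, s.serial#, s.username, s.program, s.event
  FROM v$rowcache_parent rp, v$session s
 WHERE rp.saddr = s.saddr
   AND rp.cache_name = 'dc_users'
   AND (rp.lock_mode > 0 OR rp.lock_request > 0);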
4.
So let's take a look at LGWR's state:
SO: c000000fba6167c0, type: 4, owner: c000000fbc3c2d98, flag: INIT/-/-/0x00
(session) sid: 2155 trans: 0000000000000000, creator: c000000fbc3c2d98, flag: (51) USR/- BSY/-/-/-/-/-
          DID: 0001-0024-00000005, short-term DID: 0000-0000-00000000
          txn branch: 0000000000000000
          oct: 0, prv: 0, sql: 0000000000000000, psql: 0000000000000000, user: 0/SYS
service name: SYS$BACKGROUND
waiting for 'rdbms ipc message' blocking sess=0x0000000000000000 seq=13185 wait_time=0
    seconds since wait started=0
    timeout=d3, =0, =0
Dumping Session Wait History
 for 'control file sequential read' count=1 wait_time=233
    file#=0, block#=2e, blocks=1
 for 'control file sequential read' count=1 wait_time=210
    file#=0, block#=2c, blocks=1
 for 'control file sequential read' count=1 wait_time=255
    file#=0, block#=29, blocks=1
 for 'control file sequential read' count=1 wait_time=213
    file#=0, block#=27, blocks=1
 for 'control file sequential read' count=1 wait_time=222
    file#=2, block#=1, blocks=1
 for 'control file sequential read' count=1 wait_time=221
    file#=1, block#=1, blocks=1
 for 'control file sequential read' count=1 wait_time=282
    file#=0, block#=1, blocks=1
 for 'rdbms ipc message' count=1 wait_time=4564
    timeout=d4, =0, =0
 for 'rdbms ipc message' count=1 wait_time=3188
    timeout=d4, =0, =0
 for 'rdbms ipc message' count=1 wait_time=1
    timeout=d4, =0, =0

Surprisingly, LGWR is waiting on 'rdbms ipc message', which means LGWR is not actively working: either it has nothing to do, or it is waiting to be posted once another session finishes its work. With everyone waiting on a log file switch, LGWR has no reason to be sitting idle, so we have good reason to believe the checkpoint is the problem and that LGWR is waiting for CKPT to finish its work. Let's go on and check CKPT's state:
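A quick cross-check that is not from the original post: when sessions queue up on 'log file switch (checkpoint incomplete)', v$log would normally show the online log groups stuck in ACTIVE status, meaning they have been written but their checkpoint has not yet completed. On this RAC system, gv$log could be used to see all instances at once:

-- Online redo log status; (nearly) all groups of a thread showing ACTIVE is the
-- classic signature of 'log file switch (checkpoint incomplete)'.
SELECT thread#, group#, sequence#, status, archived
  FROM v$log
 ORDER BY thread#, group#;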
5.
Search for (ckpt):
SO: c000000fbb632130, type: 4, owner: c000000fbb40bc08, flag: INIT/-/-/0x00
(session) sid: 2154 trans: 0000000000000000, creator: c000000fbb40bc08, flag: (51) USR/- BSY/-/-/-/-/-
          DID: 0001-0025-00000004, short-term DID: 0001-0025-00000005
          txn branch: 0000000000000000
          oct: 0, prv: 0, sql: 0000000000000000, psql: 0000000000000000, user: 0/SYS
service name: SYS$BACKGROUND
waiting for 'enq: PR - contention' blocking sess=0xc000000fbf5e4278 seq=61352 wait_time=0
    seconds since wait started=62446
    name|mode=50520006, 0=0, 0=0
Dumping Session Wait History
 for 'enq: PR - contention' count=1 wait_time=429069
    name|mode=50520006, 0=0, 0=0
 for 'enq: PR - contention' count=1 wait_time=2929688
    name|mode=50520006, 0=0, 0=0
 for 'enq: PR - contention' count=1 wait_time=2929668
    name|mode=50520006, 0=0, 0=0
 for 'enq: PR - contention' count=1 wait_time=2929682
    name|mode=50520006, 0=0, 0=0
 for 'enq: PR - contention' count=1 wait_time=2929671
    name|mode=50520006, 0=0, 0=0
 for 'enq: PR - contention' count=1 wait_time=2929679
    name|mode=50520006, 0=0, 0=0
 for 'enq: PR - contention' count=1 wait_time=2929666
    name|mode=50520006, 0=0, 0=0
 for 'enq: PR - contention' count=1 wait_time=2929683
    name|mode=50520006, 0=0, 0=0
 for 'enq: PR - contention' count=1 wait_time=2929624
    name|mode=50520006, 0=0, 0=0
 for 'enq: PR - contention' count=1 wait_time=2929676
    name|mode=50520006, 0=0, 0=0

CKPT has been waiting on the PR enqueue all along, and this time we finally see blocking sess=0xc000000fbf5e4278. That session is the one blocking CKPT; let's see who it is:

SO: c000000fbf5e4278, type: 4, owner: c000000fbe3c1910, flag: INIT/-/-/0x00
(session) sid: 2150 trans: 0000000000000000, creator: c000000fbe3c1910, flag: (100051) USR/- BSY/-/-/-/-/-
          DID: 0001-0029-00000002, short-term DID: 0001-0029-00000003
          txn branch: 0000000000000000
          oct: 0, prv: 0, sql: c000000fa3ed29b8, psql: 0000000000000000, user: 0/SYS
service name: SYS$BACKGROUND
last wait for 'os thread startup' blocking sess=0x0000000000000000 seq=59956 wait_time=1006468
    seconds since wait started=62472
    =0, =0, =0

And here we find MMON again: MMON is in fact the real culprit behind the system hang.
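One more side note that is not from the original post: the same CKPT-blocked-by-MMON relationship could also have been confirmed live from v$lock, since 'enq: PR - contention' is an ordinary enqueue. The sketch below assumes SYSDBA access; on RAC, gv$lock would be the cluster-wide equivalent:

-- Holder (lmode > 0) and waiter (request > 0) of the PR enqueue.
SELECT l.sid, s.program, l.lmode, l.request, l.ctime
  FROM v$lock l, v$session s
 WHERE l.type = 'PR'
   AND s.sid  = l.sid;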
6.
Case summary:
1. So far we have established the main cause of the hang: when MMON went to start m000, m000 failed to start properly, so MMON kept holding the PR lock for a long time. That blocked CKPT, which in turn prevented log switches from completing. Why MMON was unable to start m000 properly has not yet been determined; that would require more material to analyze and is beyond the scope of this article.
2. How should this kind of problem be handled when you hit it? In fact, once the analysis is clear, simply killing the MMON process, or adjusting statistics_level, would probably have resolved the issue without restarting the instance (a rough sketch of that mitigation follows below). The analysis itself takes time, though, at least 30-60 minutes, and when this happens in production it is hard to get that much time. Note, however, that the problem first appeared a little after 4 o'clock, while the system did not hang until after 10 o'clock, which left a long window for analysis. If we had spotted the anomaly promptly, it might well have been possible to resolve it before the system hung.
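The following is only a sketch of the mitigation described in point 2 above, under the assumption that the diagnosis has been confirmed first; it is not a recipe to run blindly on production. The SPID lookup and the statistics_level change use standard views and parameters; killing MMON at the operating-system level is the step the summary refers to.

-- Find MMON's OS process id so it can be killed from the shell
-- (e.g. "kill -9 <spid>" on AIX); MMON is normally respawned automatically
-- without bringing the instance down.
SELECT p.spid
  FROM v$process p, v$bgprocess bg
 WHERE bg.paddr = p.addr
   AND bg.name  = 'MMON';

-- Alternatively, temporarily reduce the AWR/MMON workload
-- (revert to TYPICAL once the underlying problem is fixed;
-- BASIC disables AWR snapshots and related statistics collection).
ALTER SYSTEM SET statistics_level = BASIC;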