<?xml version="1.0" encoding="UTF-8"?> <rss
version="2.0"
xmlns:content="http://purl.org/rss/1.0/modules/content/"
xmlns:wfw="http://wellformedweb.org/CommentAPI/"
xmlns:dc="http://purl.org/dc/elements/1.1/"
xmlns:atom="http://www.w3.org/2005/Atom"
xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
> <channel><title>Professional VMware &#187; PSOD</title> <atom:link href="http://professionalvmware.com/category/psod/feed/" rel="self" type="application/rss+xml" /><link>http://professionalvmware.com</link> <description>How Many Turtles Can You Fit On A Rock?</description> <lastBuildDate>Fri, 10 Feb 2012 00:37:53 +0000</lastBuildDate> <language>en</language> <sy:updatePeriod>hourly</sy:updatePeriod> <sy:updateFrequency>1</sy:updateFrequency> <generator>http://wordpress.org/?v=3.2.1</generator> <item><title>vSphere 4 and Core Dumps (vmkdump)</title><link>http://professionalvmware.com/2010/02/vsphere-4-and-core-dumps-vmkdump/</link> <comments>http://professionalvmware.com/2010/02/vsphere-4-and-core-dumps-vmkdump/#comments</comments> <pubDate>Fri, 26 Feb 2010 17:59:29 +0000</pubDate> <dc:creator>bunchc</dc:creator> <category><![CDATA[Crash Dump]]></category> <category><![CDATA[ESX]]></category> <category><![CDATA[PSOD]]></category> <category><![CDATA[Troubleshooting]]></category> <category><![CDATA[Uncategorized]]></category> <category><![CDATA[vSphere]]></category> <guid
isPermaLink="false">http://professionalvmware.com/?p=1121</guid> <description><![CDATA[Today I was reviewing my post on ESX Crash Dumps and found that well, for vSphere, it is quite broken. How? Well&#8230; No /usr/sbin/vmkdump in ESX 4 As referenced in this KB article, vmkdump has been replaced with some additional flags on esxcfg-dumppart: In ESX 4.X, esxcfg-dumppart is now used to extract the logs files. [...]]]></description> <content:encoded><![CDATA[<p></p><p>Today I was reviewing my post on <a
href="http://professionalvmware.com/2009/02/how-to-read-a-dump-esx-crash-dumps-that-is/">ESX Crash Dumps</a> and found that well, for vSphere, it is quite broken. How? Well&#8230;</p><h4>No /usr/sbin/vmkdump in ESX 4</h4><p>As referenced in this KB article, vmkdump has been replaced with some additional flags on esxcfg-dumppart:</p><blockquote><p>In ESX 4.X, esxcfg-dumppart is now used to extract the logs files.<br
/> The syntax is:<br
/> esxcfg-dumppart --log &lt;ESX dump file&gt;<br
/> esxcfg-dumppart -L &lt;ESX dump file&gt;</p></blockquote><h4>Here it is in action:</h4><p>The file:<br
/> -rw-r--r-- 1 root root 6790236 Feb 18 10:11 vmkernel-zdump-021810.10.11.1</p><p># esxcfg-dumppart --log vmkernel-zdump-021810.10.11.1<br
/> Created file vmkernel-log.1<br
/> Log wrapped</p><p># ls -l | grep log<br /> -rw-r--r-- 1 root root 262144 Feb 25 06:49 vmkernel-log.1</p><p>There it is. Woot! Questions? Comments? Drop us a line.</p> ]]></content:encoded> <wfw:commentRss>http://professionalvmware.com/2010/02/vsphere-4-and-core-dumps-vmkdump/feed/</wfw:commentRss> <slash:comments>6</slash:comments> </item> <item><title>Using AMD&#8217;s mcat.exe to Debug your PSOD MCE (Machine Check Exception)</title><link>http://professionalvmware.com/2009/12/using-amds-mcat-exe-to-debug-your-psod-mce-machine-check-exception/</link> <comments>http://professionalvmware.com/2009/12/using-amds-mcat-exe-to-debug-your-psod-mce-machine-check-exception/#comments</comments> <pubDate>Wed, 16 Dec 2009 18:38:00 +0000</pubDate> <dc:creator>bunchc</dc:creator> <category><![CDATA[ESX]]></category> <category><![CDATA[PSOD]]></category> <category><![CDATA[Troubleshooting]]></category> <category><![CDATA[machine check]]></category> <category><![CDATA[mce]]></category> <guid
isPermaLink="false">http://professionalvmware.com/2009/12/using-amds-mcat-exe-to-debug-your-psod-mce-machine-check-exception/</guid> <description><![CDATA[&#34;Sokath, his eyes opened&#34; or roughly “Understanding”. So what does the Tamarian language have to do with PSODs or Machine Check Exceptions (MCEs)? Well, neither one of them make much sense, and need some understanding in order to translate them appropriately. What is an MCE (Machine Check Exception) A machine check exception, or MCE is [...]]]></description> <content:encoded><![CDATA[<p></p><p><a
href="http://professionalvmware.com/wp-content/uploads/2009/12/tamarian1.jpg"><img
style="border-bottom: 0px; border-left: 0px; margin: 0px 10px 0px 0px; display: inline; border-top: 0px; border-right: 0px" title="tamarian[1]" border="0" alt="tamarian[1]" align="left" src="http://professionalvmware.com/wp-content/uploads/2009/12/tamarian1_thumb.jpg" width="184" height="171" /></a> &quot;Sokath, his eyes opened&quot; or roughly “Understanding”. So what does the Tamarian language have to do with PSODs or Machine Check Exceptions (MCEs)? Well, neither one of them makes much sense, and both need some understanding in order to be translated appropriately.</p><h4>What is an MCE (Machine Check Exception)</h4><p>A machine check exception, or MCE, is the system’s way of reporting a hardware error up through the operating system when the error is severe enough to warrant a system halt. On ESX, these look similar to:</p><blockquote><p><font
size="1" face="Courier New">VMware ESX Server [Releasebuild-113339] <br
/>Machine Check Exception: Unable to continue <br
/>frame=0x3ad3d2c ip=0x625eb0 cr2=0xff400000 cr3=0x3f737000 cr4=0x168 <br
/>es=0xffffffff ds=0xffffffff fs=0xffffffff gs=0xffffffff <br
/>eax=0xffffffff ebx=0xffffffff ecx=0xffffffff edx=0xffffffff <br
/>ebp=0x3ad3e88 esi=0xffffffff edi=0xffffffff err=-1 eflags=0xffffffff <br
/>0:1024/console *1:1076/vmware-vm 2:1120/vmware-vm 3:1121/vmm0:2551 <br
/>4:1128/mks:20480 5:1084/mks:20480 6:1126/vmm0:2048 7:1108/vmm0:2308 <br
/>@BlueScreen: Machine Check Exception: Unable to continue <br
/>0x3ad3e88:[0x625eb0]Panic+0x17 stack: 0x8424f0, 0x3ad3ea4, 0x3ad3eb0 <br
/>0x3ad3e98:[0x625eb0]Panic+0x17 stack: 0x8424f0, 0x0, 0x0 <br
/>0x3ad3eb0:[0x6667f8]MCE_HandleException+0x6b stack: 0x3ad3ef8, 0xbf5feaeb, 0x3ad3f20 <br
/>0x3ad3ec0:[0x62093d]Int18_MachineCheck+0x4c stack: 0x3ad3ef8, 0x4028, 0x4028 <br
/>0x3ad3f20:[0x692cac]CommonTrap+0xb stack: 0x23, 0xbf5feaea, 0xc1e40ee <br
/>0x3ad3f3c:[0x7024a7]User_CopyOut+0x52 stack: 0xbf5feaea, 0xc1e40ee, 0x2 <br
/>0x3ad3f74:[0x722975]LinuxFileDesc_Poll+0x120 stack: 0xbf5feae4, 0x10, 0x64 <br
/>0x3ad3fa8:[0x70304b]User_LinuxSyscallHandler+0x6a stack: 0x3ad3fe0, 0x23, 0x23 <br
/>0xbf5fda98:[0x692cac]CommonTrap+0xb stack: 0x0, 0x0, 0x0 <br
/>VMK uptime: 169:17:26:00.887 TSC: 27792623958323601 <br
/>169:17:26:00.884 cpu1:1076)MCE: 169: Machine Check Exception: General Status 0000000000000004 <br
/>169:17:26:00.884 cpu1:1076)MCE: 193: Machine Check Exception: Bank 0, Status b673400000000145 <br
/>169:17:26:00.884 cpu1:1076)MCE: 226: Machine Check Exception: Bank 0, Addr 00000000206e38e0, Valid TRUE</font></p></blockquote><p>Now you start to see the similarities to the Tamarian language, no? Well perhaps the Star Trek metaphor doesn’t work for you, but you can agree that the above is a little obtuse. How do we read it then?</p><h4>Translating Using mcat.exe</h4><p>Glad you asked… about translation that is. For both AMD and Intel Platforms, this <a
href="http://kb.vmware.com/kb/1005184">VMware KB</a> provides excellent detail and guidance. If you are running an AMD platform, you can use some additional toolage (yes, it is a word!). First go download and install <a
href="http://support.amd.com/us/Processor_TechDownloads/MCAT_1.1.10.0132.zip">mcat.exe</a> from the AMD site. Running mcat.exe /? gives us quite a bit of output. I’ve included the bits that are relevant to us below:</p><blockquote><p><font
size="1" face="Courier New">Machine Check Analysis Tool (MCAT) Version 1.1.10 <br
/>USAGE: <br
/>&#160;&#160; mcat /? [/cmd] bank status address misc] | <br
/>where <br
/>&#160;&#160; /cmd <br
/>&#160;&#160;&#160;&#160;&#160; bank&#160;&#160;&#160;&#160;&#160; MCA error bank number <br
/>&#160;&#160;&#160;&#160;&#160; status&#160;&#160;&#160; MCA error status register value (prefix with 0x for hex) <br
/>&#160;&#160;&#160;&#160;&#160; address&#160;&#160; MCA error address register value (prefix with 0x for hex) <br
/>&#160;&#160;&#160;&#160;&#160; misc&#160;&#160;&#160;&#160;&#160; MCA error misc register value (prefix with 0x for hex)&#160; <br
/> /cmd&#160;&#160; Decode bank, status, address, misc provided in command line</font></p></blockquote><p>You can see we’re primarily concerned with the /cmd switch, but where do we get the parameters to feed it? They’re in our PSOD message… These lines specifically:</p><p>169:17:26:00.884 cpu1:1076)MCE: 193: Machine Check Exception: Bank 0, Status b673400000000145 <br
/>169:17:26:00.884 cpu1:1076)MCE: 226: Machine Check Exception: Bank 0, Addr 00000000206e38e0, Valid TRUE</p><p>On the command line it looks like this:</p><blockquote><p><font
size="2" face="Courier New">C:\Program Files\AMD\MCAT&gt;mcat /cmd 0 0xb673400000000145 0x00000000206e38e0 0 <br
/>Processor Number&#160; : Unknown <br
/>Bank Number&#160;&#160;&#160;&#160;&#160;&#160; : 0 <br
/>Time Stamp&#160;&#160;&#160; (0x): 00000000 00000000 <br
/>Error Status&#160; (0x): B6734000 00000145 <br
/>Error Address (0x): 00000000 206E38E0 <br
/>Error Misc.&#160;&#160; (0x): 00000000 00000000 <br
/>Status Bit Decode: <br
/>&#160;&#160; Correctable ECC error <br
/>&#160;&#160; Processor state corrupted by error <br
/>&#160;&#160; Error address valid in MCi_ADDR <br
/>&#160;&#160; Error reporting enabled <br
/>&#160;&#160; Error not corrected <br
/>&#160;&#160; Error valid <br
/>Memory Error Code: <br
/>&#160;&#160; Memory transaction type: Data write (DWR) <br
/>&#160;&#160; Transaction type: Data <br
/>&#160;&#160; Cache level: Level 1 (L1) <br
/>Data Cache Error MC0: <br
/>&#160;&#160; Data array Store (DWR) error on Level 1 (L1) data cache <br
/>&#160;&#160; Syndrome: 0xE6</font></p></blockquote><p>A bit less obtuse this time. Reading it over, we find a “Correctable ECC error”, but the next line reports “Processor state corrupted by error”, indicating that the exact details of the original memory error may have been lost. At this point, your best bet is to schedule a maintenance window and run Memtest86 or a similar diagnostic to rule RAM out of the equation. The decode also places the error in the Level 1 data cache, which lives on the processor itself, so once RAM is ruled out, contact your hardware vendor for some replacement procs.</p><p>Questions? Comments? Drop a line in the comments.</p> ]]></content:encoded> <wfw:commentRss>http://professionalvmware.com/2009/12/using-amds-mcat-exe-to-debug-your-psod-mce-machine-check-exception/feed/</wfw:commentRss> <slash:comments>1</slash:comments> </item> <item><title>How To Read Dumps – ESX Crash Dumps That Is</title><link>http://professionalvmware.com/2009/02/how-to-read-a-dump-esx-crash-dumps-that-is/</link> <comments>http://professionalvmware.com/2009/02/how-to-read-a-dump-esx-crash-dumps-that-is/#comments</comments> <pubDate>Wed, 11 Feb 2009 20:12:13 +0000</pubDate> <dc:creator>bunchc</dc:creator> <category><![CDATA[Crash Dump]]></category> <category><![CDATA[ESX]]></category> <category><![CDATA[PSOD]]></category> <category><![CDATA[Troubleshooting]]></category> <guid
isPermaLink="false">http://professionalvmware.com/?p=495</guid> <description><![CDATA[About thirty years ago in the jungle in South Korea I was spending some time living as a monk. One of the things I learned from these monks, was the ancient art of Dump reading. Yes! That’s right, I can tell the future by reading the finer texture and smell of a dump. Ok, while [...]]]></description> <content:encoded><![CDATA[<p></p><p><a
href="http://professionalvmware.com/wp-content/uploads/2009/02/2331307556-84c8bb52c7-o1.jpg"><img
style="border-right-width: 0px; display: inline; border-top-width: 0px; border-bottom-width: 0px; margin-left: 0px; border-left-width: 0px; margin-right: 0px" title="2331307556_84c8bb52c7_o[1]" src="http://professionalvmware.com/wp-content/uploads/2009/02/2331307556-84c8bb52c7-o1-thumb.jpg" border="0" alt="2331307556_84c8bb52c7_o[1]" width="184" height="244" align="right" /></a></p><p>About thirty years ago in the jungle in South Korea I was spending some time living as a monk. One of the things I learned from these monks was the ancient art of Dump reading. Yes! That’s right, I can tell the future by reading the finer texture and smell of a dump.</p><p>OK, that’s not true (I’m naught but 26), and I can’t tell the future by reading dumps. I can tell you, however, that reading ESX dumps would be conducive to your future.</p><h3>What Makes A Dump?</h3><p>Lots and lots of fiber in your diet. That… and PSODs (Purple Screens of Death). They’ll generate an ESX kernel dump and drop a crash dump file into the /root/ directory, named something like: ‘vmkernel-zdump-&lt;reversed date&gt;.#.#.#’</p><p>This file is created on the first reboot following your PSOD and is generated from the contents of your VMKCORE partition. You did make a VMKCORE partition, right? It&#8217;s the one labeled &#8216;fc&#8217;. Can&#8217;t find the file? Sure? Did you look in your sock drawer? Ok&#8230; well, in that case run &#8220;vmkdump -d /dev/sda5&#8221;, where /dev/sda5 is the partition reported by esxcfg-dumppart -l.</p><h3>I Have My Dump, Now What?</h3><p>So you can do a few things. First is to generate a <a
href="http://professionalvmware.com/2009/01/27/log-bundles-of-the-virtual-center-variety/">support bundle</a> and send it off to VMware for analysis (which you should do anyway). However, if you’re like me and can’t wait, you can do the following from the service console:</p><p>Here is where the dump hides:</p><p><span
style="font-family: Courier New; color: #ff8040;"># ls -alh<br
/> total 14M<br
/> -rw-r--r--    1 root     root          13M Feb  6 04:40 vmkernel-zdump-020609.04.40.1</span></p><p>Let’s extract it:</p><p><span
style="font-family: Courier New; color: #ff8040;"># vmkdump -l vmkernel-zdump-020609.04.40.1<br
/> created file vmkernel-log.1</span></p><p><span
style="font-family: Courier New; color: #ff8040;"># ls -alh<br
/> -rw-r--r--    1 root     root         186K Feb 11 14:32 vmkernel-log.1<br
/> -rw-r--r--    1 root     root          13M Feb  6 04:40 vmkernel-zdump-020609.04.40.1</span></p><p>So there it is… now let’s take a look at the insides:</p><p><span
style="font-family: Courier New; color: #ff8040;">54:01:08:11.385 cpu15:1166)&lt;6&gt;Debug scsi underrun<br
/> 54:01:08:11.385 cpu15:1166)&lt;6&gt;Debug scsi underrun<br
/> 54:01:08:11.385 cpu15:1166)&lt;6&gt;Debug scsi underrun<br
/> 54:01:08:11.386 cpu15:1166)&lt;6&gt;Debug scsi underrun<br
/> 54:01:08:11.386 cpu15:1166)&lt;6&gt;Debug scsi underrun<br
/> 54:06:35:47.637 cpu7:1074)&lt;6&gt;qla24xx_abort_command(0): handle to abort=1457<br
/> VMware ESX Server [Releasebuild-113339]<br
/> Exception type 13 in world 1169:vmm0:197830- @ 0x6ff49b<br
/> frame=0x3c47cec ip=0x6ff49b cr2=0x8617c88 cr3=0x3f686000 cr4=0x2660<br
/> es=0x3ee64028 ds=0x4028 fs=0x1580000 gs=0x4041<br
/> eax=0x2a ebx=0xb3f0f80 ecx=0x9ff47e90 edx=0x50<br
/> ebp=0x3c47ed4 esi=0xe edi=0x15806c8 err=0 eflags=0x10286<br
/> 0:1024/console 1:1196/vmware-vm 2:1200/mks:19783 3:1186/mks:19783<br
/> *4:1169/vmm0:1978 5:1161/vmware-vm 6:1170/vmm1:1978 7:1179/mks:19783<br
/> 8:1176/vmm0:1978 9:1184/vmm1:1978 10:1182/vmware-vm 11:1177/vmm1:1978<br
/> 12:1162/vmm0:1978 13:1198/vmm1:1978 14:1197/vmm0:1978 15:1039/idle15<br
/> @BlueScreen: Exception type 13 in world 1169:vmm0:197830- @ 0x6ff49b<br
/> 0x3c47ed4:[0x6ff49b]E1000PollTxRing+0x366 stack: 0x7030140, 0xb3f0fb4, 0x0<br
/> 0x3c47f2c:[0x701474]E1000_PollRings+0x1d7 stack: 0x3ee6a308, 0x704, 0x267d49c0<br
/> 0x3c47f84:[0x618647]BH_Check+0x2ee stack: 0x1, 0x82000000, 0x85f7d70<br
/> 0x3c47fd8:[0x62249c]VMKCall+0x147 stack: 0x2d, 0x85f7d70, 0x82000000<br
/> 0x3c47ffc:[0x67af0b]VMKVMMEnterVMKernel+0x8e stack: 0x0, 0x0, 0x0<br
/> VMK uptime: 57:17:09:07.125 TSC: 11937242658207618<br
/> Starting coredump to disk Starting coredump to disk Dumping using slot 1 of 1&#8230; using slot 1 of 1&#8230; log</span></p><p>The first column is your uptime. The last event before the crash was the aborted handle:</p><p><span
style="font-family: Courier New; color: #ff8040;">54:06:35:47.637 cpu7:1074)&lt;6&gt;qla24xx_abort_command(0): handle to abort=1457</span></p><p>The uptime of the kernel when the crash occurred is on the second-to-last line:</p><p><span
style="font-family: Courier New; color: #ff8040;">VMK uptime: 57:17:09:07.125 TSC: 11937242658207618</span></p><p>The uptime format is days:hours:minutes:seconds, so roughly three and a half days (3 days, 10 hours, 33 minutes) passed between the last message and the time of the crash. This means that those debug scsi underrun messages can basically be ignored.</p><p>Now let’s move on to the backtrace itself:</p><p><span
style="font-family: Courier New; color: #ff8040;">@BlueScreen: Exception type 13 in world 1169:vmm0:notthemama- @ 0x6ff49b<br
/> 0x3c47ed4:[0x6ff49b]E1000PollTxRing+0x366 stack: 0x7030140, 0xb3f0fb4, 0x0<br
/> 0x3c47f2c:[0x701474]E1000_PollRings+0x1d7 stack: 0x3ee6a308, 0x704, 0x267d49c0<br
/> 0x3c47f84:[0x618647]BH_Check+0x2ee stack: 0x1, 0x82000000, 0x85f7d70<br
/> 0x3c47fd8:[0x62249c]VMKCall+0x147 stack: 0x2d, 0x85f7d70, 0x82000000<br
/> 0x3c47ffc:[0x67af0b]VMKVMMEnterVMKernel+0x8e stack: 0x0, 0x0, 0x0</span></p><p>The backtrace lists the most recent call first: the crash happened in E1000PollTxRing, which was called from E1000_PollRings, then BH_Check, then VMKCall, and finally VMKVMMEnterVMKernel.</p><p>Based on the name of that most recent function, this host probably crashed due to some type of packet or frame corruption in the Intel E1000 driver in the VM that was running with world ID 1169 in vmm0, named &#8216;notthemama&#8217;.</p><p>Thanks for playing along. If you have questions, hit me up in the comments or on Twitter @cody_bunch</p> ]]></content:encoded> <wfw:commentRss>http://professionalvmware.com/2009/02/how-to-read-a-dump-esx-crash-dumps-that-is/feed/</wfw:commentRss> <slash:comments>24</slash:comments> </item> </channel> </rss>
