Ubuntu 20.04 system freezes with MCE memory errors: locating and fixing the problem
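The system runs Ubuntu 20.04.6 on a Supermicro X9DRi-F dual-socket motherboard with two E5-2600 v2 series CPUs and every memory slot populated. The machine is mainly used to train deep learning models, and it frequently froze during training. After enabling mcelog, the following errors were reported:

    [49150.466577] mce: [Hardware Error]: Machine check events logged
    [49150.466591] mce: [Hardware Error]: Machine check events logged
    Hardware event. This is not a software error.
    MCE 0
    CPU 12 BANK 9 TSC 49c688083e1dd
    MISC d221010001000c8c
    TIME 1686062534 Tue Jun 6 22:42:14 2023
    MCG status:
    MCi status:
    Error overflow
    Corrected error
    MCi_MISC register valid
    MCA: MEMORY CONTROLLER RD_CHANNEL0_ERR
    Transaction: Memory read error
    MemCtrl: Corrected memory read error
    STATUS c800008600800090 MCGSTATUS 0
    MCGCAP 1000c1d APICID 20 SOCKETID 1
    MICROCODE 42e
    CPUID Vendor Intel Family 6 Model 62 Step 4
    Hardware event. This is not a software error.
    MCE 0
    CPU 12 BANK 9 TSC 49c689e9ed41b
    MISC d221010001000c8c
    TIME 1686062534 Tue Jun 6 22:42:14 2023
    MCG status:
    MCi status:
    Error overflow
    Corrected error
    MCi_MISC register valid
    MCA: MEMORY CONTROLLER RD_CHANNEL0_ERR
    Transaction: Memory read error
    MemCtrl: Corrected memory read error
    STATUS c800008600800090 MCGSTATUS 0
    MCGCAP 1000c1d APICID 20 SOCKETID 1
    MICROCODE 42e
    CPUID Vendor Intel Family 6 Model 62 Step 4

A few details stand out: both records are corrected memory read errors reported by the memory controller, so this looks like a memory problem. After a lot of searching I could not find a way to map "CPU 12 BANK 9" to a physical location. In the end, with help from ChatGPT, I found the following command:

    sudo lshw -class memory   # list every memory device in the system, including the slot each module sits in

Running it on my machine gives:

    *-memory
         description: System Memory
         physical id: 2d
         slot: System board or motherboard
         size: 208GiB
         capabilities: ecc
         configuration: errordetection=multi-bit-ecc
       *-bank:0
       .......................
       *-bank:9
            description: DIMM Synchronous [empty]
            product: Dimm1_PartNum
            vendor: Dimm1_Manufacturer
            physical id: 9
            serial: Dimm1_SerNum
            slot: P2_DIMME2
            width: 64 bits
       ..............................................
       *-bank:14
       *
       *-bank:15
       *

Here bank 9 corresponds to the E2 memory slot (P2_DIMME2) on my machine, so the module in that slot is probably faulty and should be pulled or replaced. Note that the BANK field in an MCE record is a machine-check bank number rather than a DIMM index, so treat this match as a lead to confirm by swapping the module; SOCKETID 1 in the log is at least consistent with a slot on the second CPU (P2).

Installing mcelog

Some blogs say to install it directly with sudo apt-get install mcelog, but that did not work on my system. One of the methods given on the official site:

    git clone git://git.kernel.org/pub/scm/utils/cpu/mce/mcelog.git   # download the source
    cd mcelog
    make                                                              # build the mcelog binary
    sudo make install                                                 # install the binary that the service runs
    sudo cp mcelog.service /usr/lib/systemd/system                    # put the service file in the systemd directory
    sudo systemctl enable mcelog.service                              # start at boot
    sudo systemctl start mcelog.service                               # start now

If errors occur, the output is written to /var/log/syslog.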
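The two manual steps above (reading the CPU/BANK pair from the mcelog records, then finding that bank's slot in the lshw output) can be sketched as a pair of small shell helpers. This is only a rough sketch: the function names count_mce_by_cpu_bank and slot_for_bank are made up for illustration, the parsing assumes the text layouts shown in this post (one field per line, as lshw and mcelog normally print them), and it follows the post's own reading that the MCE bank number can be looked up as an lshw bank.

```shell
#!/usr/bin/env bash
# Sketch only: hypothetical helper names; parsing assumes the mcelog and
# lshw text layouts shown in this post.

# Read mcelog-style text on stdin and print how many records mention each
# "CPU <n> BANK <m>" pair, one pair per line.
count_mce_by_cpu_bank() {
    awk '
        {
            line = $0
            # A line may hold more than one record if the log was flattened.
            while (match(line, /CPU [0-9]+ BANK [0-9]+/)) {
                split(substr(line, RSTART, RLENGTH), f, " ")
                seen[f[2] " " f[4]]++
                line = substr(line, RSTART + RLENGTH)
            }
        }
        END {
            for (key in seen) {
                split(key, part, " ")
                printf "cpu=%s bank=%s count=%d\n", part[1], part[2], seen[key]
            }
        }
    ' | sort
}

# Read "lshw -class memory" text on stdin and print the slot: value of the
# "*-bank:<n>" section whose number is given as the first argument.
slot_for_bank() {
    awk -v want="*-bank:$1" '
        # Every section header starts with "*-"; track whether we are
        # inside the requested bank.
        $1 ~ /^\*-/ { in_bank = ($1 == want) }
        in_bank && !found && $1 == "slot:" {
            sub(/^[ \t]*slot:[ \t]*/, "")
            print
            found = 1
        }
    '
}
```

With these defined, something like count_mce_by_cpu_bank < /var/log/syslog would summarize which CPU/BANK pairs keep reporting errors, and sudo lshw -class memory | slot_for_bank 9 would print the slot label for bank 9 (P2_DIMME2 in the output above).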