Monit

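Monit, not to be confused with M/Monit, is an AGPL-3.0-licensed system and process monitoring tool. Monit can automatically restart crashed services and display temperatures from standard hardware (via lm_sensors) and hard drives (for example, via smartmontools). Service alerts can be sent based on a wide range of criteria, including a single occurrence or several occurrences over a period of time. It can be accessed directly from the command line or run as a web application using its integrated HTTP(S) server. This gives a quick, streamlined snapshot of a given system's state.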

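Installation

Install the monit package and any software needed for optional tests, such as lm_sensors or smartmontools. Once configuration is complete, be sure to enable and start monit.service.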

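Configuration

Monit keeps its main configuration file at /etc/monitrc. You can edit this file directly, but if you want to run scripts (for example, to read hard drive temperature or health status), you should uncomment the last directive, include /etc/monit.d/*, save /etc/monitrc, and create /etc/monit.d/.

Note: Monit requires the /etc/monitrc file (and possibly the files stored in /etc/monit.d) to have permissions no more permissive than 0700, that is, no access for group or others. If this is not the case, Monit will fail to start.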
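The permission requirement in the note above can be checked before starting the service. The following is a minimal sketch, not part of Monit: the function name mode_ok is hypothetical, and it assumes GNU stat (coreutils), whose -c '%a' option prints the octal mode.

```shell
#!/bin/sh
# Hypothetical helper (not part of Monit): succeeds when the file grants
# no permissions to group or others, i.e. its mode is at most 0700.
# Assumes GNU stat, where -c '%a' prints the octal mode (for example 600).
mode_ok() {
    mode=$(stat -c '%a' "$1") || return 2
    # Prefix a 0 so the shell reads the value as octal, then mask the
    # group and other permission bits.
    [ $(( 0$mode & 077 )) -eq 0 ]
}
```

For example, `mode_ok /etc/monitrc || chmod 600 /etc/monitrc` would tighten the file only when needed; 0600 is one mode that satisfies the limit.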

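Configuration syntax

Monit uses a configuration syntax designed to be easy to read: essentially check WHAT, followed by statements of the form if THING condition then action. In the configuration file, the words if, and, with(in), has, us(ing|e), on(ly), then, for and of exist purely for human readability; Monit ignores them completely.

Checks are usually performed in cycles. The cycle length is defined at the beginning of the configuration file; for example, a 30-second polling interval is defined as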

set daemon 30

A check configured for every 4 cycles therefore runs once every 2 minutes (4 × 30 seconds).
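The cycle arithmetic can be sketched as a small shell helper. The function name cycles_to_seconds is hypothetical and only illustrates the calculation; Monit itself does this internally.

```shell
#!/bin/sh
# Hypothetical helper: seconds between runs of a check that fires every N cycles.
# $1 = cycle length in seconds (the value given to "set daemon")
# $2 = number of cycles
cycles_to_seconds() {
    echo $(( $1 * $2 ))
}

cycles_to_seconds 30 4    # prints 120, i.e. 2 minutes
```

With the same 30-second cycle, a check set to every 120 cycles runs once per hour (3600 seconds).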

Configuration examples

Mail server declaration

set mailserver smtp.myserver.com port 587
        username "MyUser" password "MyPassW0rd"
        using tlsv12

Mail notification format

set mail-format {
      from: Monit@MyServer
   subject: $SERVICE $EVENT at $DATE
   message: Monit $ACTION $SERVICE at $DATE on $HOST: $DESCRIPTION.
}

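Note: The variables above (such as $SERVICE) are not generic placeholders but specific variable names, which Monit replaces with details of the alert, the system, and so on.

CPU, memory and swap utilization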

check system $HOST
    if loadavg (15min) > 15 for 5 times within 15 cycles then alert
    if memory usage > 80% for 4 cycles then alert
    if swap usage > 20% for 4 cycles then alert

Filesystem usage

check filesystem rootfs with path /
    if space usage > 90% then alert

check filesystem NFS with path /mnt/nfs_share
    if space usage > 90% then alert

Process monitoring

check process sshd with pidfile /var/run/sshd.pid
   start program = "/usr/bin/systemctl start sshd"
   stop program  = "/usr/bin/systemctl stop sshd"
   if failed port 22 protocol ssh then restart

check process smbd with pidfile /run/samba/smbd.pid
   group samba
   start program = "/etc/init.d/samba start"
   stop  program = "/etc/init.d/samba stop"
   if failed host 192.168.1.250 port 139 type TCP  then restart
   depends on smbd_bin

check file smbd_bin with path /usr/bin/smbd
   group samba
   if failed permission 755 then unmonitor
   if failed uid root then unmonitor
   if failed gid root then unmonitor

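Note: In the Samba example above, the first block contains depends on smbd_bin, which makes the smbd process check depend on the check of the smbd binary itself (the check file smbd_bin block).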

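Monitoring hard drive health and temperature with scripts

Temperature

Create the /etc/monit.d/scripts directory (if necessary) and the file /etc/monit.d/scripts/hdtemp.sh, and make the script executable.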

/etc/monit.d/scripts/hdtemp.sh

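 #!/usr/bin/sh
 # $1 is the drive letter, e.g. "a" for /dev/sda.
 # The raw temperature value (column 10 of the attribute table) becomes the exit status.
 HDDTP=$(/usr/bin/smartctl -A "/dev/sd${1}" | grep 'Temp.*Cels' | awk '{printf "%d",$10}')
 #echo $HDDTP # for debug only
 exit $HDDTP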

monitrc or /etc/monit.d/*.monit file

check program SSD-A-Temp with path "/etc/monit.d/scripts/hdtemp.sh a"
    every 5 cycles
    if status > 40 then alert
    group health

check program HDD-B-Temp with path "/etc/monit.d/scripts/hdtemp.sh b"
    every 5 cycles
    if status > 40 then alert
    group health

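In this example, the /etc/monit.d/scripts/hdtemp.sh script assumes that your drive path is /dev/sdX, where X is the letter given at the end of the check declaration. The same approach is used for SMART health status in the next example.

SMART health status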
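To see what the temperature script extracts, the parsing step can be tried without a real drive. This is a minimal sketch: the function name parse_temp is hypothetical, and the sample row contains made-up values laid out in the usual ten-column smartctl -A attribute format.

```shell
#!/bin/sh
# Hypothetical helper: reads `smartctl -A` output on stdin and prints the
# raw value (10th column) of the first row matching Temp.*Cels.
parse_temp() {
    awk '/Temp.*Cels/ { printf "%d\n", $10; exit }'
}

# Illustrative attribute row with made-up values, not real drive output.
sample='194 Temperature_Celsius     0x0022   036   053   000    Old_age   Always       -       36 (Min/Max 20/53)'

printf '%s\n' "$sample" | parse_temp    # prints 36
```

Because the script passes the temperature back as its exit status, the value must fit in the 0 to 255 range that a shell exit status can carry, which is sufficient for drive temperatures in degrees Celsius.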

/etc/monit.d/scripts/hdhealth.sh

 #!/usr/bin/sh
 # $1 is the drive letter, e.g. "a" for /dev/sda.
 STATUS=$(/usr/bin/smartctl -H "/dev/sd${1}" | grep overall-health | awk 'match($0,"result:"){print substr($0,RSTART+8,6)}')
 if [ "$STATUS" = "PASSED" ]
 then
     # 1 implies PASSED
     TP=1
 else
     # 2 implies FAILED
     TP=2
 fi
 #echo $TP # for debug only
 exit $TP

monitrc or /etc/monit.d/*.monit file

check program SSD-A-Health with path "/etc/monit.d/scripts/hdhealth.sh a"
    every 120 cycles
    if status != 1 then alert
    group health

check program HDD-B-Health with path "/etc/monit.d/scripts/hdhealth.sh b"
    every 120 cycles
    if status != 1 then alert
    group health

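Tip: The group statement makes Monit display all checks assigned the same group name (health in this case) together.

Alert recipients: global or subsystem level

Alerts can be set globally, so that a given user or email address is notified of any alert condition, or you can set alert recipients for each type of check (for example, network alerts go to recipient A and process alerts go to recipient B). You can set any number of global or subsystem recipients simply by making multiple declarations.

Global alerts

Global alerts are set outside any subsystem check; for readability, they should be placed alongside the mail server declaration.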
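The mapping from the smartctl result line to the exit codes tested by the health checks above can be sketched on its own. The function name health_code is hypothetical, and the sketch assumes the "SMART overall-health self-assessment test result:" line that the script above greps for.

```shell
#!/bin/sh
# Hypothetical helper: reads `smartctl -H` output on stdin and prints
# 1 when the overall-health result is PASSED, 2 otherwise
# (including when no result line is found).
health_code() {
    status=$(awk '/overall-health/ { print $NF; exit }')
    if [ "$status" = "PASSED" ]; then
        echo 1
    else
        echo 2
    fi
}

printf '%s\n' 'SMART overall-health self-assessment test result: PASSED' | health_code    # prints 1
```

Treating every non-PASSED outcome as 2 means that a missing drive or an unreadable result also triggers the `if status != 1 then alert` rule, which is usually the safer default.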

SET ALERT email@domain

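Subsystem alerts

Subsystem alerts are set in much the same way as global alerts, except that they omit the SET keyword and are placed inside the relevant check block.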

ALERT email@domain

See also