Software to do monitoring and keep track of updates on multiple (10+) servers?

kawaiipunk · 6 February 2018 18:05

There isn’t really a category for this yet

As we in Autonomic are managing quite a few servers now, I was wondering what solutions co-ops are using to monitor (and possibly apply) updates on multiple Linux servers in a way that scales well?

All our servers are Debian or Ubuntu, and I would prefer notifications not to go by email if possible. We do use Ansible, so that’s a possible solution, but we don’t really have time to write stuff from scratch. It has to be free and open source software, ofc.

A couple of solutions I have been looking at:

finn · 6 February 2018 19:46

We use a combination of open source tools.

For monitoring:

For provisioning and updates:

Ansible
Puppet

Perhaps we should try to share recipes on our setups where that would make sense, and avoid duplication of effort.

kawaiipunk · 6 February 2018 20:30

oh yes I forgot to say…

We are using netdata for monitoring:

Alerts will be sent by email or perhaps Telegram. We don’t have any longer term graphing right now.

The netdata dashboard is served on localhost:19999 on the remote server and then forwarded to the admin’s local browser via ssh using this command:
ssh -f user@host.foo -L 19998:localhost:19999 -N

This is good for security: no extra ports need to be opened on the server, as the dashboard is only reachable through the existing ssh connection.

p.s. I should add that there is another layer of monitoring at the provider level in case the server crashes or whatnot.

chris · 7 February 2018 09:18

We also use Munin, and it is set to send emails when updates are available; however, this fix and this fix are needed for it to work properly.

When we first started using Xen on CentOS5 servers (probably about 12 years ago) I found that doing a yum update on multiple virtual servers at exactly the same time on the same physical server caused such a load spike that they would stop responding. Since then I have been updating servers sequentially, using this script. It is rather old and could probably do with improving, but it means that updating a server is a matter of sshing to it and running sudo -i and then a-up. This writes the changes that are to be made to a /root/Changelog file; see the logchange script.

To make life easier my (Ansible provisioned) ~/.bash_aliases file contains sections like this:

ssh-stretch() {
  ssh server1
  ssh server2
}

And my (Ansible provisioned) ~/.ssh/config file contains corresponding entries like this:

Host server1
  User foo
  HostName server1.example.org

So to update all the Stretch servers I type ssh-stretch and then sudo -i, a-up, exit, exit, and then do the next one. To make this easier when out and about, I have shortcuts for all these commands in the terminal client on my Ubuntu Touch phone, which has an encrypted Debian chroot on it, so it is four button presses per server…

If all these servers were on other people’s hardware, and I didn’t need to worry about the impact of updating 30 virtual servers on a physical host all at the same time, then I’d consider sorting out a quicker way of doing this. However, I’d still worry about the one in fifty or so updates that require interaction. I guess most people enable automatic updates, but I had some bad experiences with this over a decade ago and vowed to do things manually. Most updates don’t require much thought, but when one changes a key PHP, Apache, Nginx or other config file for security reasons, you do need to look at the diffs and manually sort things out. My approach is time consuming, but it minimises the risk of an automatic update leaving a key service unable to restart.
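
The one-at-a-time routine described above could be scripted roughly as follows. This is only a sketch: the SSH_CMD and UPDATE_CMD variables, the host names, and the default apt-get command are illustrative assumptions, not the a-up script itself.

```shell
#!/bin/sh
# Sketch: update servers strictly one after another, never in parallel,
# so that guests on the same physical host do not all spike at once.
# SSH_CMD and UPDATE_CMD are hypothetical overrides used for illustration.
update_sequentially() {
  for host in "$@"; do
    echo "== $host =="
    # -t allocates a terminal so that prompts about changed config files
    # can still be answered by hand.
    ${SSH_CMD:-ssh -t} "$host" "${UPDATE_CMD:-sudo apt-get update && sudo apt-get dist-upgrade}" || {
      echo "Update failed on $host, stopping." >&2
      return 1
    }
  done
}

# Intended use (not run here):
#   update_sequentially server1 server2
```

Stopping at the first failure keeps a broken update from being repeated on the remaining servers before anyone has looked at it.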

harry · 7 February 2018 09:27

I believe @nick @joaquimds and @matt have been working on some basic monitoring via Ansible scripts…

joaquimds · 7 February 2018 12:59

We’re using Ansible to check the status of all our servers on AWS.

All our servers are identified using their Name tag.

We have an inventory.yml that lists all our servers and puts them into groups:

tag_Name_bgv_wordpress_staging:
tag_Name_bgv_wordpress_production:
...

active:
  children:
    # The "active" group is used as the "host" value in the playbooks
    # If you don't include your project in here it will never be run.
    bgv:


bgv:
  children:
    tag_Name_bgv_wordpress_staging:
    tag_Name_bgv_wordpress_production:

...

We can then run playbooks against each group of servers:

ansible-playbook check.yml --limit bgv

This playbook checks for Meltdown and Spectre, and whether a reboot is required, and looks like this:

---
- name: Check for Meltdown/Spectre vulnerabilities
  hosts: active
  remote_user: ubuntu
  become: yes
  gather_facts: yes
  roles:
  - reboot-required
  - meltdown

The roles are specific to Ubuntu (perhaps Debian also) distributions. The reboot-required role looks like this:

---
- stat:
    path: /var/run/reboot-required
  register: output

- fail:
    msg: "Server needs reboot"
  when: output.stat.exists
  ignore_errors: yes

We use Ubuntu’s unattended-upgrades service (documentation) to automatically install security updates, so we only need to check if reboot is required.
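
For a quick manual check on a single host, the same test that the reboot-required role performs can be sketched in shell. The function name and the optional directory argument are assumptions made for illustration; the reboot-required.pkgs file is normally written alongside reboot-required on Debian and Ubuntu and lists the packages that asked for the reboot.

```shell
#!/bin/sh
# Sketch: report whether a Debian/Ubuntu host is flagged as needing a reboot.
# The directory argument is a hypothetical convenience for testing;
# it defaults to /var/run.
reboot_status() {
  run_dir=${1:-/var/run}
  if [ -e "$run_dir/reboot-required" ]; then
    echo "reboot required"
    # List the packages that requested the reboot, if the file is present.
    [ -r "$run_dir/reboot-required.pkgs" ] && sort -u "$run_dir/reboot-required.pkgs"
    return 1
  fi
  echo "no reboot required"
}

# Intended use (not run here):
#   reboot_status
```

The non-zero return status when a reboot is pending mirrors the fail task in the role, so the function can also be used in a cron job or a monitoring check.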

Our code isn’t open source and freely available, as it contains references to all our servers and login users. However, we could share a sanitised version of the repo if there’s interest.

kawaiipunk · 7 February 2018 13:12

We’ve been testing the Netdata to Telegram bot alerts today and it’s working great. Should tide us over for alerts for the time being so we can focus on our clients. We could perhaps try and write a Signal plugin sometime.

Deffo agree with @chris about the unattended-upgrades. It’s only on non-critical and somewhat simplistic servers that we do that.

Interesting to see the range of solutions that folks are using. This has been super useful thanks

stephen · 9 February 2018 16:24

To elaborate a bit on what Finn has said, we use Icinga2 to monitor when security updates are required. I then have an Ansible script that runs through all of our servers, checks for updates, applies any required and optionally reboots the servers. I generally do this once a week or when a patch for a serious issue is released.

I’m happy to supply this Ansible script if others would find it useful. It only works on Ubuntu servers, but could easily be adapted to other flavours of Linux.

This works for about 70 servers and I can see it working on over 100, but I don’t think it will scale much beyond that.

chris · 9 February 2018 18:29

I’d be interested to see that. You could upload it to a repo at git.coop and make it available only to people with accounts, if you just want to share with other co-operators.

stephen · 12 February 2018 10:57

The script is part of a big Ansible repo that contains client info. I’ll look at sanitising it at some point. In the meantime the script is:

# Upgrade Ubuntu based systems and reboot if necessary.
---

- name: Update and reboot
  hosts: all
  become: true
  tasks:

  - name: Update apt cache
    apt:
      update_cache: yes
      cache_valid_time: 0
    tags:
      - update

  - name: Check for updates
    command: /usr/lib/update-notifier/apt-check --package-names
    register: packages
    tags:
      - update

  - name: Updates required
    debug:
      msg: "{{ packages.stderr }}"
    when: packages.stderr != ""
    tags:
      - update

  - name: Dist upgrade
    apt:
      update_cache: no
      upgrade: dist
    when: packages.stderr != ""
    tags:
      - update

  - name: Check for reboot
    stat:
      path: /var/run/reboot-required
      get_md5: no
    register: reboot
    tags:
      - update
      - reboot
      - check-reboot

  - name: Reboot required
    debug:
      msg: "Reboot required"
    when: reboot.stat.exists == true
    tags:
      - update
      - check-reboot

  - name: Reboot
    shell: sleep 2 && /sbin/shutdown -r now "Reboot triggered by Ansible"
    async: 1
    poll: 0
    when: reboot.stat.exists == true
    tags:
      - reboot

  - name: Rebooting?
    debug:
      msg: "Rebooting"
    when: reboot.stat.exists == true
    tags:
      - reboot

  - name: Pause for server reboot
    pause:
      seconds: 30
    when: reboot.stat.exists == true
    tags:
      - reboot

  - name: Wait for reboot
    become: false
    local_action: shell ansible -u {{ ansible_user_id }} -m ping {{ inventory_hostname }}
    register: result
    until: result.rc == 0
    retries: 30
    delay: 10
    when: reboot.stat.exists == true
    tags:
      - reboot

chris · 17 July 2018 08:48

Just to note that the Ubuntu-specific "Check for updates" task in the playbook above, which runs /usr/lib/update-notifier/apt-check, can be changed to the following for Debian (you need to install the apt-show-versions package):

- name: Check for updates
  command: apt-show-versions -b -u
  register: packages
  tags:
    - update

Note that apt-show-versions prints its list to stdout rather than stderr, so the later conditions in the playbook would need to test packages.stdout instead of packages.stderr.

matt · 17 July 2018 10:24

Has anyone experimented with using Prometheus to track software updates on a server? The node_exporter can expose metrics on pending software updates on the server (for example through its textfile collector). These could then be displayed in a dashboard tool like Grafana, making it easier for people to see the state of a server. You could also use Alertmanager with Prometheus to raise alerts when the number of pending package updates goes over a threshold.

I’ve been working on deploying Prometheus and Grafana using docker-compose and found it relatively easy to do and could share the Ansible scripts if people also wanted to give it a go.
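
As a rough illustration of the Prometheus idea, a small script run from cron on each server could publish the number of pending updates through node_exporter's textfile collector. This is a sketch under assumptions: the metric name apt_upgrades_pending and the directory /var/lib/node_exporter/textfile_collector are made up for the example, and the directory has to match whatever node_exporter's --collector.textfile.directory flag is set to.

```shell
#!/bin/sh
# Sketch: expose the number of pending package updates as a Prometheus gauge
# via node_exporter's textfile collector.
# Assumed names (not from the thread): apt_upgrades_pending, apt.prom.

# Count the "Inst" lines in simulated apt-get output read from stdin.
count_pending() {
  awk '$1 == "Inst" { n++ } END { print n + 0 }'
}

# Write the metric to "$1/apt.prom". The file is written under a temporary
# name and then renamed, so node_exporter never reads a half-written file.
write_metric() {
  dir=$1
  count=$2
  tmp=$(mktemp "$dir/apt.prom.XXXXXX") || return 1
  {
    echo '# HELP apt_upgrades_pending Packages a dist-upgrade would install or upgrade.'
    echo '# TYPE apt_upgrades_pending gauge'
    echo "apt_upgrades_pending $count"
  } > "$tmp"
  # mktemp creates the file private to its owner; node_exporter may run as another user.
  chmod 644 "$tmp"
  mv "$tmp" "$dir/apt.prom"
}

# Intended use from cron (not run here):
#   write_metric /var/lib/node_exporter/textfile_collector \
#     "$(apt-get -s -q dist-upgrade | count_pending)"
```

With a metric like this in place, a Grafana panel or an Alertmanager rule only needs to compare the gauge against a chosen threshold.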

kawaiipunk · 24 July 2018 09:39

matt wrote: "I’ve been working on deploying Prometheus and Grafana using docker-compose and found it relatively easy to do and could share the Ansible scripts if people also wanted to give it a go."

Yes please

chris · 15 August 2018 08:29

If anyone is looking for an Ansible role to apply the fix for the latest Debian Linux kernel security update, you could copy this one:

---
- name: Set net.ipv4.ipfrag_low_thresh
  command: /sbin/sysctl -w net.ipv4.ipfrag_low_thresh=196608

- name: Stat net.ipv6.ip6frag_low_thresh
  stat:
    path: /proc/sys/net/ipv6/ip6frag_low_thresh
  register: ip6frag_low_thresh

- name: Set net.ipv6.ip6frag_low_thresh
  command: /sbin/sysctl -w net.ipv6.ip6frag_low_thresh=196608
  when: ip6frag_low_thresh.stat.exists

- name: Set net.ipv4.ipfrag_high_thresh
  command: /sbin/sysctl -w net.ipv4.ipfrag_high_thresh=262144

- name: Stat net.ipv6.ip6frag_high_thresh
  stat:
    path: /proc/sys/net/ipv6/ip6frag_high_thresh
  register: ip6frag_high_thresh

- name: Set net.ipv6.ip6frag_high_thresh
  command: /sbin/sysctl -w net.ipv6.ip6frag_high_thresh=262144
  when: ip6frag_high_thresh.stat.exists

- name: Update /etc/sysctl.conf
  blockinfile:
    dest: /etc/sysctl.conf
    marker: "# {mark} ANSIBLE MANAGED BLOCK https://lists.debian.org/debian-security-announce/2018/msg00201.html"
    block: |
      net.ipv4.ipfrag_high_thresh = 262144
      net.ipv6.ip6frag_high_thresh = 262144
      net.ipv4.ipfrag_low_thresh = 196608
      net.ipv6.ip6frag_low_thresh = 196608

chris · 8 November 2018 11:23

The Ansible code above was failing today: the latest version of Docker also installs two new packages (containerd.io and docker-ce-cli), and apt-show-versions -b -u wasn’t listing these. So I have switched to using this command, which shows what is available to be upgraded and also shows new packages:

- name: Check for updates and new packages using apt-get dist-upgrade -q -s
  shell: apt-get dist-upgrade -q -s | grep '^The following' -A1 | grep '^ ' | xargs
  args:
    warn: False
    executable: /bin/bash
  register: packages

And I have moved this code from a private repo to a public Ansible role:

@stephen, I hope you can come to the Ansible session at the CoTech hack. The use of Ansible Galaxy to make roles public and shareable is one of the main things we want to discuss.

I’m now about to use this role to upgrade the server running this Discourse site so it is going to go down for a few minutes…