GeekSocket Plug in and be Geekified

Monitoring workstation with Prometheus

Prometheus is a monitoring system and a time series database. It collects metrics from different places and stores them as series of values over time. It uses a pull-based mechanism to collect the metrics: applications expose their metrics in a plain text format over HTTP, and Prometheus fetches them at regular intervals. This fetching of metrics is called scraping.
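To make the plain text format a bit more concrete, here is a minimal Python sketch that renders one metric family the way an application might expose it. The function names and the `demo_temperature_celsius` metric are made up for illustration; real applications usually rely on an official client library instead of formatting the lines by hand.

```python
def escape_label_value(value):
    """Escape backslash, double quote and newline inside a label value."""
    return (
        str(value)
        .replace("\\", "\\\\")
        .replace('"', '\\"')
        .replace("\n", "\\n")
    )


def render_metric(name, help_text, metric_type, samples):
    """Render one metric family in the Prometheus text exposition format.

    samples is a list of (labels, value) pairs, where labels is a dict.
    """
    lines = [
        f"# HELP {name} {help_text}",
        f"# TYPE {name} {metric_type}",
    ]
    for labels, value in samples:
        if labels:
            label_str = ",".join(
                f'{key}="{escape_label_value(val)}"'
                for key, val in sorted(labels.items())
            )
            lines.append(f"{name}{{{label_str}}} {value}")
        else:
            lines.append(f"{name} {value}")
    # The format is line based and ends with a trailing newline.
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Hypothetical metric, only for demonstration.
    print(render_metric(
        "demo_temperature_celsius",
        "Temperature reading.",
        "gauge",
        [({"sensor": "cpu"}, 54.0)],
    ), end="")
```

Each sample line carries a metric name, optional labels in curly braces, and a value. That is all Prometheus needs in order to store the sample as a point in a time series.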

For systems which don’t expose metrics in the Prometheus exposition format themselves, we can use exporters. When Prometheus requests metrics, an exporter calculates the current values and serves them to Prometheus. For metrics related to a machine, we can use node_exporter. It supports collecting metrics for CPU, memory, disk and so on. The node_exporter documentation lists all the supported collectors.
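The key idea of an exporter, computing values only when a scrape arrives, can be sketched with the Python standard library. This is a toy illustration and not how node_exporter is implemented; the metric name `demo_load1`, the port 9200 and the function names are assumptions of mine, and `os.getloadavg()` is only available on Unix-like systems.

```python
import os
from http.server import BaseHTTPRequestHandler, HTTPServer


def collect():
    """Compute the metric values at the moment of the scrape."""
    load1, _, _ = os.getloadavg()  # 1-minute load average (Unix only)
    return (
        "# HELP demo_load1 1-minute load average.\n"
        "# TYPE demo_load1 gauge\n"
        f"demo_load1 {load1}\n"
    )


class MetricsHandler(BaseHTTPRequestHandler):
    """Answer GET /metrics with freshly collected values."""

    def do_GET(self):
        if self.path != "/metrics":
            self.send_error(404)
            return
        body = collect().encode("utf-8")
        self.send_response(200)
        self.send_header(
            "Content-Type", "text/plain; version=0.0.4; charset=utf-8"
        )
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(port=9200):
    """Serve /metrics on localhost only; the port number is arbitrary."""
    HTTPServer(("127.0.0.1", port), MetricsHandler).serve_forever()


if __name__ == "__main__":
    serve()
```

Nothing is measured until a request for `/metrics` comes in, so the value Prometheus stores always reflects the state of the machine at scrape time.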

We can query this time series data using Prometheus. It provides a web UI and API to run queries. The queries are written using a query language called PromQL.
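As a rough sketch of the API side, the snippet below builds and runs an instant query against the `/api/v1/query` endpoint of the Prometheus HTTP API using only the Python standard library. The base URL is an assumption (Prometheus listens on port 9090 by default), and the helper names are made up for this example.

```python
import json
import urllib.parse
import urllib.request


def build_query_url(base_url, promql):
    """Return the instant-query URL for a PromQL expression."""
    params = urllib.parse.urlencode({"query": promql})
    return base_url.rstrip("/") + "/api/v1/query?" + params


def run_query(base_url, promql, timeout=5):
    """Run an instant query and return the list of result samples."""
    url = build_query_url(base_url, promql)
    with urllib.request.urlopen(url, timeout=timeout) as response:
        body = json.load(response)
    if body.get("status") != "success":
        raise RuntimeError(f"query failed: {body.get('error')}")
    return body["data"]["result"]


if __name__ == "__main__":
    # Assumes a Prometheus server on the default address.
    # `up` is 1 for every target that was scraped successfully.
    for sample in run_query("http://127.0.0.1:9090", "up"):
        print(sample["metric"], sample["value"])
```

The same expression can be typed into the web UI; the API is convenient when the result should feed a script or another tool.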

The illustrated story “Phippy and Zee Go To The Mountains — A Prometheus Story” explains querying and the types of vectors in a very interesting way.

Running Prometheus on workstation

When I bought my current workstation (a Lenovo ThinkPad T480), I had heating issues with it. Those were mostly caused by the unsupported discrete graphics card (the machine has two GPUs, an Intel UHD Graphics 620 and an Nvidia GeForce MX150). I wanted something that could show me graphs of the CPU usage and of the CPU and graphics card temperatures. So I thought, why not just use Prometheus for this purpose?

By downloading two binaries (Prometheus and node_exporter) and running them in separate terminals, I was ready to collect the metrics on my machine. I was mostly interested in CPU usage and temperature from different sensors.

Prometheus dashboard showing a temperature graph

After observing the values, I could confirm that the graphics card was heating up. Prometheus showed its temperature as 90 °C, although these readings may have been wrong because the hardware was unsupported. I wanted to include a graph of the elevated temperatures from the different sensors here, but that data was deleted after 15 days, which is the default retention period. In the end, I disabled the Nvidia graphics card by blacklisting the nouveau kernel module.

Writing systemd unit files

I kept using it that way for many days. After every reboot, I had to start both the node_exporter and prometheus binaries by hand. Doing that manually every time is not ideal, so I decided to write systemd unit files for both Prometheus and node_exporter.

Read more about systemd and writing unit files on systemd - ArchWiki.

While reading more about systemd, I found that it has a user mode as well. This mode is useful when we want to manage units per user, so that the units stay under that user’s control.

On Fedora, systemd is also started in user mode, i.e. as systemd --user. You can check that by running pgrep on systemd.

$ pgrep --list-full systemd
1 /usr/lib/systemd/systemd --switched-root --system --deserialize 31
…
2244 /usr/lib/systemd/systemd --user

Check out systemd/User - ArchWiki for more information about user mode.

I wrote service files for Prometheus and node_exporter. After that, I copied these files to ~/.config/systemd/user, so that systemd can manage them. The binaries are kept in ~/.local/bin.

$ systemctl --user status prometheus
● prometheus.service - Monitoring system & time series database
     Loaded: loaded (/home/bhavin/.config/systemd/user/prometheus.service; enabled; vendor preset: disabled)
     Active: active (running) since Sat 2020-05-16 21:27:47 IST; 4 days ago
       Docs: https://prometheus.io/docs
   Main PID: 2294 (prometheus)
      Tasks: 21 (limit: 18865)
     Memory: 143.4M
        CPU: 2min 43.824s
     CGroup: /user.slice/user-1000.slice/user@1000.service/prometheus.service
             └─2294 /home/bhavin/.local/bin/prometheus --config.file=…

You can find the systemd service files on GitLab: prometheus.service, node_exporter.service.

Prometheus configuration and directory structure

For storing the data and configuration of Prometheus, I’m following the XDG Base Directory Specification.

  • Configuration file prometheus.yaml is stored in ~/.config/prometheus/.
  • Data is stored in ~/.local/share/prometheus/data.
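The two locations above match the defaults that the XDG Base Directory Specification defines for `XDG_CONFIG_HOME` and `XDG_DATA_HOME`. As a small sketch of how a script could resolve the same paths while still honouring those environment variables, consider the following; the function names are made up, and the setup described in this post simply uses the default locations.

```python
import os
from pathlib import Path


def xdg_base(env, variable, default):
    """Return an XDG base directory, falling back to the spec default.

    The specification says that unset, empty, or relative values
    should be ignored.
    """
    value = env.get(variable, "")
    if value and Path(value).is_absolute():
        return Path(value)
    return default


def prometheus_paths(env=None, home=None):
    """Resolve the Prometheus config file and data directory."""
    env = os.environ if env is None else env
    home = Path.home() if home is None else Path(home)
    config_home = xdg_base(env, "XDG_CONFIG_HOME", home / ".config")
    data_home = xdg_base(env, "XDG_DATA_HOME", home / ".local" / "share")
    return {
        "config_file": config_home / "prometheus" / "prometheus.yaml",
        "data_dir": data_home / "prometheus" / "data",
    }


if __name__ == "__main__":
    for name, path in prometheus_paths().items():
        print(f"{name}: {path}")
```

With neither variable set, this resolves to the same `~/.config/prometheus/prometheus.yaml` and `~/.local/share/prometheus/data` paths listed above.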

Apart from the data and configuration file paths, I’m passing the following flags to prometheus (snippet from the prometheus.service file).

ExecStart=%h/.local/bin/prometheus \
	  --storage.tsdb.retention.time="180d" \
	  --web.listen-address="127.0.0.1:9099" \
	  …

The --storage.tsdb.retention.time flag tells Prometheus to retain the collected data for 180 days. Any data older than this time period will be deleted automatically.

The second flag, --web.listen-address, makes sure that the Prometheus endpoint is exposed only on localhost and not on any other interface. Its default value is 0.0.0.0:9090, which exposes the endpoint on all available interfaces. The same applies to node_exporter, which accepts the same flag.
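The difference between the two listen addresses can be checked with Python’s `ipaddress` module. The helper below is a hypothetical sketch that classifies a `host:port` value; it only understands literal IP addresses, an empty host and the name `localhost`, and raises `ValueError` for any other hostname.

```python
import ipaddress


def classify_listen_address(listen_address):
    """Classify a host:port listen address by how widely it is exposed."""
    host, separator, _port = listen_address.rpartition(":")
    if not separator:
        raise ValueError(f"expected host:port, got {listen_address!r}")
    host = host.strip("[]")  # IPv6 literals are written as [::1]:9090
    if host == "":
        # An empty host such as ":9100" means every interface.
        return "all interfaces"
    if host == "localhost":
        return "loopback only"
    address = ipaddress.ip_address(host)
    if address.is_unspecified:  # 0.0.0.0 or ::
        return "all interfaces"
    if address.is_loopback:  # 127.0.0.0/8 or ::1
        return "loopback only"
    return "specific interface"


if __name__ == "__main__":
    for value in ("127.0.0.1:9099", "0.0.0.0:9090"):
        print(value, "->", classify_listen_address(value))
```

An address classified as "all interfaces" is reachable from other machines on the network unless a firewall blocks it, which is why binding to 127.0.0.1 is the safer choice on a workstation.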

The %h in the prometheus.service file is a systemd specifier that resolves to the user’s home directory.

This setup is automated using these tasks from my Ansible playbook.

With the release of version 1.0.0-rc.1, node_exporter added support for the power supply class. This makes it possible to collect power supply metrics, such as the battery percentage of a laptop.

Prometheus dashboard showing a battery percentage graph

