
PI Administrator runbook: daily, weekly and monthly checks


PI Systems seldom fail completely; availability usually degrades incrementally: delayed events, late interfaces, rising archive queues, a backup that stopped, or a “temporary” trust that was never removed. A reliable runbook emphasises consistent checks that catch drift early rather than heroic troubleshooting.

This runbook assumes basic observability (PI Server health dashboards, interface/service status visibility, Windows event collection and alerting). Where those are missing, the checks highlight the minimum you should standardise before you rely on them.

If you need a refresher on hardening, review PI Admin guidance: Securing the AVEVA PI System in Modern Enterprise Environments.

How to use this runbook

  • Treat schedules as cadence, not doctrine: daily for active degradation, weekly for trends/backlogs, monthly for recovery confidence and controlled change.
  • If you cannot define what “good” looks like for a metric, capture 2–4 weeks of baseline behaviour before automating alerts. For a new site, alert on deviations from that baseline rather than a single absolute threshold (a minimal sketch follows this list).
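
As a minimal sketch of that approach in Python: compute a mean and standard deviation over the baseline window and flag values outside a configurable band. The metric, sample values and three-sigma band below are illustrative assumptions, not PI defaults.

```python
"""Baseline-deviation alerting sketch.

Assumes 2-4 weeks of samples for one metric (e.g. events/minute for a
representative point, or event-queue depth) exported as a list of floats.
The three-sigma band is an illustrative starting point, not a PI default.
"""
from statistics import mean, stdev

def build_baseline(samples: list[float]) -> tuple[float, float]:
    """Return (mean, standard deviation) of the baseline window."""
    return mean(samples), stdev(samples)

def is_deviation(value: float, baseline: tuple[float, float], sigmas: float = 3.0) -> bool:
    """Flag values outside mean +/- sigmas * stdev of the baseline."""
    mu, sd = baseline
    return abs(value - mu) > sigmas * sd

if __name__ == "__main__":
    history = [120, 118, 125, 119, 122, 117, 121, 123, 116, 124]  # baseline samples
    baseline = build_baseline(history)
    for current in (121, 60, 310):
        status = "DEVIATION" if is_deviation(current, baseline) else "ok"
        print(f"events/min={current}: {status}")
```

The same pattern works for event rates, queue depths or connection counts once a few weeks of history exist.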

Daily operational checks (15–30 minutes)

  1. Confirm end-to-end data flow
  • Pick a handful of representative points (high-rate, low-rate, critical) and check that last event times are recent and plausible.
  • If buffering is used, verify buffer queues are stable or draining. Flatlined data with no alarms is a common early sign of trouble.
  2. Review interfaces and connectors
  • Focus on exceptions: repeated reconnects, authentication failures, timeouts.
  • If you have a central view (PI ICU, connector tooling), prioritise state changes. Otherwise check Windows service status and confirm that logs are still being written.
  3. Check PI Server resource headroom and backlogs
  • Verify there are no sustained CPU/memory spikes, excessive disk queue lengths or I/O contention.
  • Check PI-specific backlogs (event queue and archive queue) and client connection churn; sustained growth in either tends to precede incidents.
  4. Validate backups and their usability
  • Confirm scheduled backups completed, files are of plausible size in the expected location, and retention is working.
  • For stricter recovery SLAs, spot-check their presence in the secondary location or vault and that copy jobs ran. This check often prevents prolonged outages.
  5. Scan for security and access anomalies
  • Look for unexpected trusts, new identities, mapping changes, repeated failed logons or unusual increases in client connections.
  • Follow your internal hardening and change-control standard; if you need a reference, use the PI Admin guidance linked above.
  6. Confirm time synchronisation
  • Ensure the PI Server and key interface nodes are synchronised to your authoritative time source. Time drift causes future-dated events, compression anomalies and odd calculations.
  7. Check relevant disk space
  • Beyond C:, check archive volumes, interface log volumes, buffering locations and any shared folders used by exports or integrations. Disk-full events cascade and erase evidence. A minimal sketch combining this check with the data-flow check in item 1 follows the list.
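
The sketch below combines checks 1 and 7: it flags representative points whose last event has gone stale and volumes that are low on headroom. It assumes you can export point names and last event timestamps to a CSV with your existing tooling; the column names, the 15-minute staleness limit and the volume list are assumptions to adapt locally.

```python
"""Minimal daily-check sketch covering items 1 and 7.

Assumes a CSV export of representative points with columns 'point' and
'last_event' (ISO 8601 timestamps), produced by whatever query or
reporting tooling you already use. Column names, the staleness limit
and the volume list are assumptions to adapt locally.
"""
import csv
import shutil
from datetime import datetime, timedelta, timezone

STALE_AFTER = timedelta(minutes=15)
VOLUMES = ["C:\\", "E:\\"]        # archive, buffer and log volumes (adjust)
MIN_FREE_FRACTION = 0.15          # flag volumes below 15% free space

def stale_points(csv_path: str) -> list[str]:
    """Return points whose last event is older than STALE_AFTER."""
    now = datetime.now(timezone.utc)
    stale = []
    with open(csv_path, newline="") as f:
        for row in csv.DictReader(f):
            last = datetime.fromisoformat(row["last_event"])
            if last.tzinfo is None:              # treat naive timestamps as UTC
                last = last.replace(tzinfo=timezone.utc)
            if now - last > STALE_AFTER:
                stale.append(row["point"])
    return stale

def low_volumes() -> list[str]:
    """Return volumes with less free space than MIN_FREE_FRACTION."""
    low = []
    for vol in VOLUMES:
        usage = shutil.disk_usage(vol)
        if usage.free / usage.total < MIN_FREE_FRACTION:
            low.append(vol)
    return low

if __name__ == "__main__":
    print("Stale points:", stale_points("representative_points.csv") or "none")
    print("Low-headroom volumes:", low_volumes() or "none")
```

Run it from the same scheduled task that drives your other daily checks and alert on any non-empty result.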

Incident triage workflow (use when something looks off)

Step A — Rapid impact classification

  • Decide scope: a single tag/asset, a single interface, a site or network segment, or PI Server-wide. Compare points that share a point source with points from different sources to localise the problem (a minimal sketch follows this step).
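
A minimal classification sketch, assuming the same kind of CSV export with a 'pointsource' column added: staleness concentrated in one point source points at that interface or its device, while staleness across every source points at the PI Server or a shared dependency.

```python
"""Step A sketch: classify incident scope from a point export.

Assumes a CSV with columns 'point', 'pointsource' and 'last_event'
(ISO 8601 timestamps). Column names and the 15-minute limit are
assumptions to adapt locally.
"""
import csv
from collections import defaultdict
from datetime import datetime, timedelta, timezone

STALE_AFTER = timedelta(minutes=15)

def classify_scope(csv_path: str) -> str:
    now = datetime.now(timezone.utc)
    stale = defaultdict(int)   # point source -> stale point count
    total = defaultdict(int)   # point source -> total point count
    with open(csv_path, newline="") as f:
        for row in csv.DictReader(f):
            src = row["pointsource"]
            total[src] += 1
            last = datetime.fromisoformat(row["last_event"])
            if last.tzinfo is None:
                last = last.replace(tzinfo=timezone.utc)
            if now - last > STALE_AFTER:
                stale[src] += 1
    fully_stale = [s for s in total if stale[s] == total[s]]
    if not fully_stale:
        return "No fully stale sources: suspect individual tags or devices."
    if len(fully_stale) == len(total):
        return "All sources stale: suspect the PI Server or a shared dependency."
    return f"Stale sources {fully_stale}: suspect those interfaces or connectors."

if __name__ == "__main__":
    print(classify_scope("representative_points.csv"))
```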

Step B — Determine symptom type

  • “No data”: service stoppage, permissions, device outage or mapping break.
  • “Late data”: buffering/backlog or resource contention.
  • “Bad data”: scaling, timestamp issues or source config changes.

Step C — Work upstream first

  • From PI point → interface/connector → source. Avoid starting with consuming applications; they usually see downstream symptoms.

Step D — Preserve evidence before restarts

  • Capture logs, queue sizes and timestamps. If you restart, record the exact times and before/after observations (a minimal evidence-snapshot sketch follows this step).
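
A minimal evidence-snapshot sketch: before anything is restarted, it records file names, sizes and modification times under the buffering and log directories you nominate into a timestamped JSON file. The directory paths are placeholders, not PI defaults.

```python
"""Step D sketch: snapshot evidence before any restart.

Records file names, sizes and modification times under the directories
you care about (buffer queues, interface logs) into a timestamped JSON
file. The directory paths below are placeholders, not PI defaults.
"""
import json
from datetime import datetime, timezone
from pathlib import Path

EVIDENCE_DIRS = [r"D:\pi\buffering", r"D:\pi\interface-logs"]   # placeholders

def snapshot(dirs: list[str]) -> dict:
    """Collect path, size and mtime for every file under each directory."""
    files = []
    for d in dirs:
        for p in Path(d).rglob("*"):
            if p.is_file():
                st = p.stat()
                files.append({
                    "path": str(p),
                    "bytes": st.st_size,
                    "modified_utc": datetime.fromtimestamp(st.st_mtime, timezone.utc).isoformat(),
                })
    return {"taken_utc": datetime.now(timezone.utc).isoformat(), "files": files}

if __name__ == "__main__":
    out = f"evidence-{datetime.now(timezone.utc):%Y%m%dT%H%M%SZ}.json"
    with open(out, "w") as f:
        json.dump(snapshot(EVIDENCE_DIRS), f, indent=2)
    print("Wrote", out, "- record any restart times alongside this file")
```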

Step E — Choose recovery path

  • If buffered: plan safe catch-up without overwhelming storage or network.
  • If not buffered: stabilise rapidly and communicate the data gap.

Step F — Close the loop

  • Record scope, duration, root cause (or best hypothesis) and one preventive action. Use these notes to refine the runbook.

Weekly operational checks (45–90 minutes)

  1. Trend leading health signals
  • Review the week for recurring warnings: intermittent reconnects, auth failures, archive or I/O pressure, or rising memory correlated with workloads. Identify two or three reliable indicators to alert on (a minimal log-pattern counter sketch follows this list).
  2. Review buffering behaviour and recovery
  • Ensure buffers have been exercised and have recovered as expected. A buffer that never runs may be misconfigured; frequent recovery indicates upstream instability.
  3. Check capacity and growth
  • Sanity-check archive growth and volume headroom. Identify unplanned ingestion (scan class changes, compression changes, new points) before storage incidents occur.
  4. Review system/application events for patterns
  • Inspect Windows and PI logs for certificate issues, DCOM problems, failed scheduled tasks or AV/EDR interference. Confirm AV exclusions and EDR policies remain PI-aware.
  5. Validate change control and temporary access
  • Remove temporary trusts, expire emergency access and reconcile changes made under pressure. Prevent “permanent temporary” configurations that become audit findings.
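
A minimal log-pattern counter sketch for weekly check 1, assuming the week's interface and connector logs can be exported as plain-text files; the patterns are illustrative and should be replaced with the phrases your logs actually use.

```python
"""Weekly check 1 sketch: count recurring warning patterns.

Assumes the week's interface/connector logs have been exported as plain
text files into one directory. The pattern list is illustrative; replace
it with the phrases your own logs actually use.
"""
import re
from collections import Counter
from pathlib import Path

PATTERNS = {
    "reconnect": re.compile(r"reconnect", re.IGNORECASE),
    "auth failure": re.compile(r"(authentication|logon) fail", re.IGNORECASE),
    "timeout": re.compile(r"time[- ]?out", re.IGNORECASE),
}

def count_patterns(log_dir: str) -> Counter:
    """Count how often each pattern appears across all exported logs."""
    counts: Counter = Counter()
    for log in Path(log_dir).glob("*.log"):
        for line in log.read_text(errors="ignore").splitlines():
            for name, rx in PATTERNS.items():
                if rx.search(line):
                    counts[name] += 1
    return counts

if __name__ == "__main__":
    for name, n in count_patterns("exported-logs").most_common():
        print(f"{name}: {n}")
```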

Monthly operational checks (half‑day, scheduled)

  1. Test restore procedures
  • Perform a documented restore test aligned to your RTO/RPO. For small estates, restore key configuration to a non‑production host and confirm PI services start cleanly. For larger estates, run tabletop exercises and rotating component restores. If full restores are politically difficult, at minimum prove you can retrieve last night’s backup from the secondary location within a defined time (a minimal retrievability sketch follows this list).
  2. Review patch cadence and maintenance windows
  • Reconcile what was patched (OS and supporting components) and what wasn’t. Define approvers, post‑patch validation (including data flow) and rollback procedures.
  3. Review user access and service accounts
  • Reconcile identities, mappings and service-account ownership against your access records, and remove access that is no longer required.
  4. Check architecture and documentation hygiene
  • Maintain an authoritative inventory: what runs where, an interface inventory, simple data‑flow diagrams and key dependencies (DNS, time, certificates, firewall rules, service accounts). Reconcile the inventory with reality each month.
  5. Review alert quality and reduce noise
  • Tune thresholds based on observed behaviour. Remove alerts that don’t lead to action. Where trusted, move to anomaly detection rather than static thresholds.
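
A minimal retrievability sketch for monthly check 1, assuming the secondary copy location is reachable as a UNC or local path; the 26-hour age limit and minimum size are assumptions aligned to a nightly backup and should be tuned to your RPO.

```python
"""Monthly check 1 sketch: prove last night's backup is retrievable.

Assumes the secondary copy location is reachable as a UNC or local path.
The 26-hour age limit and minimum size are assumptions for a nightly
backup; adjust to your RPO.
"""
from datetime import datetime, timedelta, timezone
from pathlib import Path

SECONDARY = Path(r"\\backup-share\pi\nightly")   # placeholder location
MAX_AGE = timedelta(hours=26)
MIN_BYTES = 1_000_000

def latest_backup_ok(location: Path) -> bool:
    """Check the newest file in the secondary location is recent and non-trivial."""
    files = [p for p in location.glob("*") if p.is_file()]
    if not files:
        print("No backup files found in", location)
        return False
    newest = max(files, key=lambda p: p.stat().st_mtime)
    age = datetime.now(timezone.utc) - datetime.fromtimestamp(newest.stat().st_mtime, timezone.utc)
    ok = age <= MAX_AGE and newest.stat().st_size >= MIN_BYTES
    print(f"Newest: {newest.name}, age={age}, bytes={newest.stat().st_size}, ok={ok}")
    return ok

if __name__ == "__main__":
    latest_backup_ok(SECONDARY)
```

Time how long the retrieval takes and compare it against the recovery window you have promised.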

Minimum viable monitoring

  • Prioritise visibility that supports the runbook: ingestion, buffering, storage, backups and basic security posture.
  • Two dashboards are useful: “today” (ingestion freshness, buffering/backlog, storage headroom, backup status) and “trend” (growth, recurring errors, restarts, access anomalies). Keep them simple and actionable (a minimal “today” summary sketch follows).
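
A minimal “today” summary sketch: it composes the daily signals into one JSON document that a simple dashboard page or chat webhook can render. The input values are placeholders; in practice they would come from the daily-check sketches above.

```python
"""'Today' dashboard sketch: one JSON status document per run.

The input values here are placeholders; in practice they come from the
daily-check sketches earlier in this runbook.
"""
import json
from datetime import datetime, timezone

def today_summary(stale_points: list[str], buffer_draining: bool,
                  low_volumes: list[str], backup_ok: bool) -> dict:
    """Assemble the four 'today' panels into one document."""
    return {
        "generated_utc": datetime.now(timezone.utc).isoformat(),
        "ingestion": {"stale_points": stale_points, "ok": not stale_points},
        "buffering": {"draining_or_stable": buffer_draining, "ok": buffer_draining},
        "storage": {"low_volumes": low_volumes, "ok": not low_volumes},
        "backup": {"last_backup_ok": backup_ok, "ok": backup_ok},
    }

if __name__ == "__main__":
    summary = today_summary([], True, [], True)      # placeholder inputs
    summary["overall_ok"] = all(panel["ok"] for panel in summary.values()
                                if isinstance(panel, dict))
    print(json.dumps(summary, indent=2))
```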

Getting specialist help

  • For large, regulated or near‑capacity estates, consider a short health check or runbook validation by experienced practitioners. Start with the PIAdmin directory.

Skills and careers

  • Operational excellence is learnable. Rotate runbook ownership to build resilience and reduce single points of failure. Benchmark PI admin roles here: View PI Administrator roles.
