Running the PI System Day-to-Day: a PI Administrator playbook

Audience and scope

PI Systems rarely fail loudly; they degrade quietly: buffer queues grow, scans slip, archives fill, certificates expire, and temporary fixes become tribal knowledge. This playbook gives PI Administrators a pragmatic operating model: what to do daily/weekly/monthly, how to manage change safely, how to recover quickly, and how to raise maturity over time.

Intended readers: PI Administrators, PI Engineers, and OT/IT architects responsible for operating PI Server and its surrounding services, not just installing them.

If you’re designing or re-platforming, start with Designing a Scalable and Resilient PI System Architecture and return here once the platform is live.

The PI Administrator role

A PI Administrator bridges OT realities (devices, networks, maintenance windows) and IT realities (patching, identity, backups, monitoring, audit). The role is about reducing operational risk rather than knowing every product detail.

Core responsibilities:

  • Service ownership: PI Server availability, data continuity, client access.
  • Change control: upgrades, patches, configuration drift, permissions.
  • Data reliability: end-to-end acquisition health.
  • Security posture: least privilege, segmentation, identity integration, auditability.
  • Operational hygiene: runbooks, documentation, repeatable procedures.

Treat PI as a product: maintain a backlog, define SLOs, and measure outcomes (data loss, latency, incident frequency). For context on ingestion, see How Data Gets Into the PI System: Interfaces, Adapters, and MQTT.

Day-to-day operational tasks

Operate to confirm normal behaviour, spot drift early, and keep changes small and reversible.

Daily checks (10–20 minutes; automate where possible)

Focus on signals that indicate user impact or impending failure:

  • Data flow: key sources current; no widespread stale tags; no buffer accumulation or flush failures; no scan overruns or measurement gaps.
  • Core services: PI services stable (no flapping); dependent components reachable.
  • Storage/archives: archive creation proceeding; free disk above threshold; no unusual archive I/O latency.
  • Time synchronisation: PI and acquisition nodes aligned to your time source.
  • Access/auth: no unusual authentication failures or locked service accounts.
  • Change awareness: review changes since yesterday (patches, firewall, switch maintenance, certificate updates).

Drive checks from dashboards rather than ad-hoc RDP sessions. For performance baselining, see Keeping PI Fast, Stable, and Predictable at Scale.
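
Much of the daily data-flow check can be scripted. The sketch below is a minimal example, assuming PI Web API is enabled at a hypothetical base URL and that the tag list, staleness threshold, and session authentication/TLS settings are all placeholders to adapt; endpoint behaviour can also differ between PI Web API versions.

```python
# Daily stale-tag check (sketch). BASE, the tag list and the threshold are
# placeholders; configure authentication/TLS on the session per your environment.
from datetime import datetime, timedelta, timezone
import requests

BASE = "https://pi-web.example.com/piwebapi"      # hypothetical PI Web API URL
CRITICAL_TAGS = [r"\\PISERVER01\Unit1.Flow", r"\\PISERVER01\Unit1.Pressure"]
STALE_AFTER = timedelta(minutes=15)               # tune per data source

session = requests.Session()                      # add auth/verify settings as required

def latest_timestamp(tag_path: str) -> datetime:
    """Resolve a PI point by path, then read its most recent recorded value."""
    point = session.get(f"{BASE}/points", params={"path": tag_path}).json()
    end = session.get(f"{BASE}/streams/{point['WebId']}/end").json()
    ts = end["Timestamp"].split(".")[0].rstrip("Z")   # drop sub-second precision
    return datetime.strptime(ts, "%Y-%m-%dT%H:%M:%S").replace(tzinfo=timezone.utc)

now = datetime.now(timezone.utc)
for tag in CRITICAL_TAGS:
    try:
        age = now - latest_timestamp(tag)
        if age > STALE_AFTER:
            print(f"STALE: {tag} last updated {age} ago")
    except Exception as exc:                      # surface check failures, don't hide them
        print(f"CHECK FAILED: {tag}: {exc}")
```

Schedule it alongside whatever feeds the dashboard, and treat a failed check as a finding in its own right rather than noise.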

Weekly tasks

  • Review event logs for recurring warnings.
  • Verify backups are not just completed but restorable.
  • Triage small risks: expiring certificates, low disk, failing interface nodes.
  • Update documentation and runbooks after changes.
  • Review new point creation or configuration changes for standards compliance.

Monthly tasks

  • Plan patch windows: OS and PI updates coordinated with OT stakeholders.
  • Capacity review: archive growth, AF database trends, interface node load.
  • Access review: group membership, service accounts, stale identities.
  • DR readiness: verify credentials, media, and runbooks for recovery.

Standardise early

If you standardise only three things, pick these:

  1. Naming and ownership: every critical data stream has an owner and on-call path.
  2. Maintenance windows: a known cadence beats “whenever possible”.
  3. One source of truth: controlled repository for runbooks, diagrams, and change history.

Backup, restore and disaster recovery

Treat backups as a recovery capability, not just a scheduled job: set RPO/RTO targets for each PI component and prove you can meet them.

What to restore

A recovery plan should include:

  • PI Data Archive: archives, configuration, server-level settings.
  • AF and related databases: SQL backups (full/diff/log) aligned to RPO.
  • Interface/connector configuration: interface setups, service configs, connection strings, certificates.
  • Client assets: PI Vision displays (where applicable).
  • Security and identity: trust relationships, AD groups, service accounts.
  • Certificates and secrets: stored and rotated securely.

Treat VM snapshots as a convenience, not a substitute for application-consistent backups and tested restores.

Backup design principles

  • Align backup frequency with data criticality.
  • Keep backups local for speed and offsite/immutable for ransomware resilience.
  • Back up configurations as code: exports, scripts, versioned config.
  • Document dependencies required during restore: DNS, time sync, firewall rules, SQL availability.
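
The first principle above is easy to verify automatically: check that the newest backup artifact is younger than the RPO target. A minimal sketch, assuming backups land as files under a hypothetical D:\PIBackups path; adapt the path and threshold to however your backup job actually writes its output.

```python
# Backup freshness check (sketch): warn if the newest backup artifact is
# older than the RPO target. The path and threshold are illustrative.
from datetime import datetime, timedelta
from pathlib import Path

BACKUP_DIR = Path(r"D:\PIBackups")     # hypothetical backup destination
RPO = timedelta(hours=24)              # example target; set per component

files = [p for p in BACKUP_DIR.rglob("*") if p.is_file()]
if not files:
    print(f"ALERT: no backup files found under {BACKUP_DIR}")
else:
    newest = max(files, key=lambda p: p.stat().st_mtime)
    age = datetime.now() - datetime.fromtimestamp(newest.stat().st_mtime)
    status = "ALERT" if age > RPO else "OK"
    print(f"{status}: newest backup {newest.name} is {age} old (RPO {RPO})")
```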

Restore testing

Quarterly restore tests typically give the best ROI.

Test plan:

  1. Restore to an isolated environment with controlled name/IP handling.
  2. Validate: services start cleanly; data is queryable and time-aligned; AF elements and calculations load; acquisition resumes or is simulated.
  3. Measure actual RTO against targets.
  4. Update the runbook with surprises.

DR patterns

Architecture determines DR options; see High availability patterns for PI Data Archive, AF, and PI Vision for trade-offs.

Common approaches:

  • Backup/restore to standby: simpler, slower; RTO in hours.
  • Warm standby: faster; requires discipline and validation.
  • Highly available designs: fastest but complex; still require ongoing validation.

Patch and upgrade strategies

Treat patching and upgrades as repeatable processes to reduce unknowns: scope, dependencies, rollback, testing.

Patching vs upgrading

  • Patching: security/stability fixes within a release line; still requires testing.
  • Upgrading: functional and schema changes; needs broader testing, communication and training.

Practical change strategy

  1. Inventory: list PI components, versions and locations; a drift-check sketch follows this list.
  2. Define blast radius: the expected impact if a node is down for 15 minutes or 1 hour.
  3. Use a staging environment that represents critical workflows.
  4. Test actual user workflows: trending, key AF searches, calculations, interfaces.
  5. Plan rollback explicitly as a documented sequence with prerequisites.
  6. Communicate operationally: what changes, when, how to detect issues, and who to call.
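
Step 1 works best when the inventory lives in version control rather than in someone's head. A minimal sketch, assuming a hypothetical pi_inventory.csv with node, component, expected_version and installed_version columns that you keep current from exports or manual checks:

```python
# Inventory drift check (sketch). The CSV file name and columns
# (node, component, expected_version, installed_version) are illustrative.
import csv
from pathlib import Path

INVENTORY = Path("pi_inventory.csv")    # hypothetical, version-controlled file

with INVENTORY.open(newline="") as fh:
    rows = list(csv.DictReader(fh))

drift = [r for r in rows if r["installed_version"] != r["expected_version"]]
for r in drift:
    print(f"DRIFT: {r['node']} {r['component']}: "
          f"installed {r['installed_version']}, expected {r['expected_version']}")
print(f"{len(rows)} components checked, {len(drift)} drifted")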

Coordinating with data ingestion

Many issues appear as data latency rather than outright failure. Treat ingestion as a first-class dependency:

  • Know buffering behaviour and safe buffer duration (a back-of-the-envelope example follows this list).
  • Validate time sync, firewall rules, certificates and endpoint names before windows.
  • After change, confirm end-to-end flow from representative sources.
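
Safe buffer duration is worth calculating rather than guessing. A back-of-the-envelope sketch; every number below is illustrative, so substitute measured event rates, per-event sizes and buffer limits from your own interface nodes.

```python
# Buffering headroom estimate (sketch). All inputs are illustrative;
# measure real event rates and per-event sizes on your own nodes.
def buffer_headroom_hours(buffer_bytes: float,
                          events_per_second: float,
                          bytes_per_event: float) -> float:
    """Hours of outage the buffer can absorb at a steady event rate."""
    return buffer_bytes / (events_per_second * bytes_per_event) / 3600

# Example: a 10 GB buffer, 2,000 events/s, ~50 bytes per buffered event
print(f"{buffer_headroom_hours(10e9, 2000, 50):.1f} hours")   # ~27.8 hours
```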

For context, see: How Data Gets Into the PI System: Interfaces, Adapters, and MQTT.

Troubleshooting workflows

Effective troubleshooting is repeatable and evidence-driven.

Fault domains

When users say “PI is down”, investigate these zones:

  1. Client layer: network path, DNS, browser, user permissions, client service health.
  2. Application layer: PI Vision, AF services, calculation engines, connectors.
  3. Core PI services: Data Archive/AF services, configuration, performance saturation.
  4. Acquisition layer: interface/adapter health, buffering, source system downtime.
  5. Infrastructure: storage latency, VM host contention, firewall changes, domain controller, time sync.

A runbook should force you to declare the zone before experimenting.

Incident workflow

  1. Define impact: scope (site, app, everyone) and symptom (delay, loss, query failure).
  2. Check recent changes in the last 24–72 hours.
  3. Validate core service health: status, utilisation, disk latency, errors.
  4. Validate ingestion: buffer state, connections, source availability.
  5. Confirm recovery: data flow resumed, latency normalised, backfill understood.
  6. Capture learning: root cause, detection gap, preventive action.

Evidence to capture

  • Incident timeline (start, detect, mitigate, recover).
  • Symptom examples: tag names, AF paths, screenshots, error messages.
  • Key metrics at the time: CPU, memory, disk queue, network latency.
  • Relevant logs and exported events stored with the ticket.
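
Capturing this consistently is easier with a fixed structure. A minimal sketch; the field names are one reasonable layout rather than a standard, and the serialised record can be attached to whatever ticketing tool you use.

```python
# Incident evidence record (sketch). Field names are illustrative;
# serialise the record into the ticket so the timeline isn't rebuilt from memory later.
from dataclasses import dataclass, field, asdict
from datetime import datetime
import json

@dataclass
class IncidentEvidence:
    started: datetime
    detected: datetime
    mitigated: datetime | None = None
    recovered: datetime | None = None
    symptoms: list[str] = field(default_factory=list)      # tag names, AF paths, errors
    metrics: dict[str, str] = field(default_factory=dict)  # CPU, disk queue, latency
    log_refs: list[str] = field(default_factory=list)      # exported logs stored with the ticket

record = IncidentEvidence(
    started=datetime(2024, 5, 1, 6, 0),
    detected=datetime(2024, 5, 1, 6, 25),
    symptoms=["Unit1.Flow stale for 25 min", "buffer queue growing on INT-NODE-02"],
    metrics={"archive disk queue": "elevated", "interface node CPU": "normal"},
)
print(json.dumps(asdict(record), default=str, indent=2))
```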

For performance baselining, see Performance, Scaling & Reliability; for security symptoms, see Security, Identity & Compliance.

Monitoring and alerting

Monitoring must change behaviour: alerts should be actionable, routed to owners, and tuned to avoid fatigue.

Minimum viable monitoring

Availability

  • Core service uptime (PI/AF services and key connectors).
  • Endpoint reachability from representative client networks.

Data quality and timeliness

  • Stale data rate for critical tags.
  • Ingestion latency (buffer depth, queue age).
  • Gap detection for important measurements.

Capacity and performance

  • Archive disk free space and growth rate.
  • Storage latency indicators.
  • CPU/memory saturation and sustained utilisation.
  • SQL health for AF backends.
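
The free-space and growth bullets above reduce to one number: days of runway. A trivial sketch with illustrative figures:

```python
# Archive disk runway estimate (sketch). Feed it real free-space and
# growth figures from your trending; the example numbers are illustrative.
def days_until_full(free_gb: float, growth_gb_per_day: float) -> float:
    return float("inf") if growth_gb_per_day <= 0 else free_gb / growth_gb_per_day

# Example: 800 GB free, archives growing ~6 GB/day -> roughly 133 days
print(f"{days_until_full(800, 6):.0f} days of runway")
```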

Security signals

  • Spikes in authentication failures.
  • Certificate expiry and failed handshakes.
  • Unexpected permission changes (where auditable).
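
Certificate expiry, at least, is easy to automate with the standard library alone. A sketch with hypothetical host names; point it at the PI Web API, PI Vision and connector endpoints you actually operate, and note that the default TLS context will also fail loudly on an already-invalid certificate, which is itself a useful signal.

```python
# TLS certificate expiry check (sketch). Host names are hypothetical;
# an invalid certificate will raise during the handshake.
import socket, ssl
from datetime import datetime, timedelta, timezone

ENDPOINTS = [("pi-web.example.com", 443), ("pi-vision.example.com", 443)]
WARN_WITHIN = timedelta(days=30)

context = ssl.create_default_context()
for host, port in ENDPOINTS:
    with socket.create_connection((host, port), timeout=5) as sock:
        with context.wrap_socket(sock, server_hostname=host) as tls:
            cert = tls.getpeercert()
    # notAfter looks like 'Jun  1 12:00:00 2025 GMT'
    expires = datetime.strptime(cert["notAfter"], "%b %d %H:%M:%S %Y %Z")
    remaining = expires.replace(tzinfo=timezone.utc) - datetime.now(timezone.utc)
    status = "WARN" if remaining < WARN_WITHIN else "OK"
    print(f"{status}: {host} certificate expires in {remaining.days} days")
```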

Alert design

  • Alert on user-impacting symptoms, not every warning.
  • Use multi-signal conditions (e.g. stale data + buffer growth), as sketched below.
  • Route alerts to the owner of the interface/source, not only to PI Admins.
  • Document first actions: what to check, what’s normal, how to escalate.
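
The multi-signal rule is worth writing down as code, even if your monitoring tool ultimately implements it. A sketch with invented signal names and thresholds: page only when independent symptoms agree and the state has persisted.

```python
# Multi-signal alert condition (sketch). Signal names and thresholds are
# invented; wire them to whatever your monitoring stack already collects.
def should_page(stale_tag_pct: float,
                buffer_depth_events: int,
                minutes_in_state: int) -> bool:
    """Page only when staleness and buffer growth agree, and the state persists."""
    data_stale = stale_tag_pct > 5              # >5% of critical tags stale
    buffer_backing_up = buffer_depth_events > 100_000
    sustained = minutes_in_state >= 10          # avoid paging on brief blips
    return data_stale and buffer_backing_up and sustained

# One symptom alone does not page:
print(should_page(stale_tag_pct=8, buffer_depth_events=2_000, minutes_in_state=15))    # False
print(should_page(stale_tag_pct=8, buffer_depth_events=250_000, minutes_in_state=15))  # True
```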

For metrics and baselining, start with Performance, Scaling & Reliability.

Documentation and runbooks

If it isn’t written, you don’t own it.

Runbook essentials

A runbook should be an operational decision tree with commands, expected outcomes, locations and contacts.

Include:

  • Service map: what runs where, dependencies (DNS, AD, SQL, storage, firewall).
  • SOPs: restart sequences (and what not to restart), certificate renewal, adding a data source, user access flows.
  • Incident playbooks: data delay/buffering, no data from a site, client connection failures, archive disk full.
  • Change records: what changed, who approved, rollback steps, validation checklist.
  • Contacts and escalation: OT owners, network/security teams, storage admins, vendors.

Make docs usable under pressure

  • Put the “first five minutes” at the top of procedures.
  • Prioritise exact paths, service names and log locations over large screenshots.
  • Store runbooks in a controlled repository with version history.
  • Review quarterly and exercise them regularly.

Link runbooks to security guidance: Securing the AVEVA PI System in Modern Enterprise Environments.

Operational maturity roadmap

Most PI estates follow a predictable path: initial success, organic growth, then operational strain. Use a roadmap to prioritise improvements.

Level 1: Reactive

  • Minimal monitoring; users report issues.
  • Backups exist but restores untested.
  • Ad-hoc changes; sparse documentation.

Next steps: daily health dashboard, minimal runbook for top incidents, start quarterly restore tests.

Level 2: Managed

  • Basic monitoring and alert routing.
  • Defined maintenance windows and change control.
  • Documented ingestion ownership.

Next steps: capacity trending, standardise onboarding, align security to least privilege.

Level 3: Proactive

  • Baselines and SLOs; drift detected before outages.
  • Regular patch cadence and staging environment.
  • Tested DR with known RTO/RPO.

Next steps: reduce alert noise, automate common checks, conduct post-incident reviews.

Level 4: Optimised

  • Architecture and operations reinforce each other.
  • Continuous documentation updates tied to change workflow.
  • Platform roadmap and stakeholder governance.

Next steps: periodic architecture reviews, improve ingestion resilience and segmentation, treat PI as a funded product.

When to get external help

Involve specialists for major upgrades, repeated data loss, chronic performance issues, or DR redesigns—especially when outage windows are small. A short engagement to validate runbooks, monitoring or upgrade plans often prevents costly downtime.
