High availability patterns for PI Data Archive, AF and PI Vision
High availability for PI is not just “add a secondary server”. It’s designing for the specific failure modes you must tolerate: a failed node, a site outage, patch windows, certificate rollover, SAN firmware faults, DNS changes, or a Group Policy that breaks Kerberos. Teams often make one tier resilient (for example PI Data Archive) while leaving single points of failure elsewhere (often AF, PI Vision, SQL Server, load balancers or name resolution).
This article describes pragmatic HA patterns for PI Data Archive, PI AF and PI Vision — what fails in production, what stays running during maintenance, and which operational practices turn “technically redundant” into “actually available”.
If you haven’t recently revisited the foundations, start with the Architecture & Design pillar on PIAdmin.com and then return here: Designing a Scalable and Resilient PI System Architecture.
Define availability the way operators experience it
- Agree what “available” means per workload. Examples:
- PI Data Archive: trends load, new values update, interfaces don’t buffer.
- PI AF: asset navigation, analyses and event frames run.
- PI Vision: displays load, authentication completes.
- For each item decide whether it must survive:
- A node failure
- Routine maintenance (patching, cert renewal, service restarts)
- A full site loss
- Those decisions determine whether you need local HA only or also site resilience (DR), and which shared dependencies must be redundant; a simple way to record them is sketched below.
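One lightweight way to capture those decisions is a small matrix of workloads against the failure modes each must survive. The sketch below is illustrative only: the workload names and True/False targets are examples to be agreed with operators, not recommendations.

```python
# Illustrative only: a tiny matrix of PI workloads against the failure modes
# each must survive. Workload names and the True/False targets are examples,
# not recommendations; agree them with operators and the business.
REQUIREMENTS = {
    "PI Data Archive (reads and ingestion)": {
        "node failure": True,
        "routine maintenance": True,
        "site loss": False,  # handled as a DR event in this example
    },
    "PI AF (navigation, analyses, event frames)": {
        "node failure": True,
        "routine maintenance": True,
        "site loss": False,
    },
    "PI Vision (displays load, authentication completes)": {
        "node failure": True,
        "routine maintenance": True,
        "site loss": False,
    },
}

for workload, modes in REQUIREMENTS.items():
    must_survive = [mode for mode, required in modes.items() if required]
    print(f"{workload}: must survive {', '.join(must_survive)}")
```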
PI Data Archive HA: collectives and practical implications
- The standard pattern is a PI collective: multiple archive servers configured so clients can fail over between members. This protects server availability and enables rolling updates.
- A collective protects the archive servers, not upstream/downstream dependencies. Common pitfalls:
- Interfaces pointing to a single hostname that resolves to one member.
- Firewall rules bound to a specific host.
- Clients using hard-coded servers instead of collective-aware connection methods.
- Decide write-path behaviour deliberately:
- Some interfaces write to a primary and rely on buffering; that preserves data continuity but may pause ingestion during failover.
- Reads via a secondary member can continue while ingestion buffers — that changes the availability promise.
- Test under load. Buffering reduces data-loss risk but does not guarantee an uninterrupted user experience; a basic reachability check for collective members is sketched below.
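As a deliberately simple starting point, the sketch below (placeholder hostnames, default PI Data Archive port) confirms each collective member accepts TCP connections. It says nothing about replication or buffer health, but it catches the "secondary unreachable through the firewall" class of problem before a failover does.

```python
# Minimal sketch: confirm each collective member accepts TCP connections on
# the PI Data Archive port (5450 by default). Hostnames are placeholders.
# This checks reachability only, not replication state or buffer health.
import socket

MEMBERS = ["pi-primary.example.com", "pi-secondary.example.com"]
PI_PORT = 5450

def is_reachable(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

for member in MEMBERS:
    state = "reachable" if is_reachable(member, PI_PORT) else "UNREACHABLE"
    print(f"{member}:{PI_PORT} {state}")
```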
Data continuity versus user continuity
- Buffering and its health determine data loss risk.
- User continuity (trends, displays) depends on client failover, name resolution and use of collective-aware connection strings.
- If you must prioritise, eliminate single points that prevent reads. Operators tolerate short ingestion delays more readily than a system that appears down while a secondary member is running. A quick name-resolution check is sketched below.
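Because user continuity hinges on name resolution, a useful sanity check is to resolve every name clients actually use (collective members plus any aliases) and confirm the answers are what you expect. The sketch below uses placeholder names.

```python
# Minimal sketch: resolve every name clients actually use (collective members
# plus any aliases) and print the addresses returned, so stale or inconsistent
# DNS records show up before a failover does. Names are placeholders.
import socket

CLIENT_FACING_NAMES = [
    "pi-collective.example.com",  # an alias some clients may still point at
    "pi-primary.example.com",
    "pi-secondary.example.com",
]

for name in CLIENT_FACING_NAMES:
    try:
        addresses = sorted({info[4][0] for info in socket.getaddrinfo(name, None)})
        print(f"{name} -> {', '.join(addresses)}")
    except socket.gaierror as exc:
        print(f"{name} -> resolution failed: {exc}")
```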
PI AF HA: the SQL Server dependency
- AF Server is an application tier; the AF database lives in SQL Server. You cannot separate AF HA from SQL availability.
- Typical resilient pattern: multiple AF Server nodes pointing at a highly available SQL platform designed to meet your RPO/RTO and patching model; one way to verify its health is sketched after this list.
- If SQL is a single VM, adding AF nodes does not deliver meaningful HA.
- Consider the workloads you depend on: AF Analyses and Event Frames require AF Server availability — you’re protecting a workload, not just a metadata repository.
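As a sketch of what "highly available SQL" should mean in practice, the query below assumes pyodbc, ODBC Driver 17 for SQL Server and an Always On Availability Group hosting the AF database; the listener name is a placeholder. It reports each replica's role and synchronization health, which is the state AF availability ultimately depends on.

```python
# Minimal sketch, assuming pyodbc, ODBC Driver 17 for SQL Server and an
# Always On Availability Group hosting the AF database. The listener name is
# a placeholder.
import pyodbc

CONN_STR = (
    "DRIVER={ODBC Driver 17 for SQL Server};"
    "SERVER=af-sql-listener.example.com;"
    "DATABASE=master;"
    "Trusted_Connection=yes;"
    "MultiSubnetFailover=Yes;"
)

QUERY = """
SELECT ar.replica_server_name,
       rs.role_desc,
       rs.synchronization_health_desc
FROM sys.dm_hadr_availability_replica_states AS rs
JOIN sys.availability_replicas AS ar
  ON rs.replica_id = ar.replica_id;
"""

with pyodbc.connect(CONN_STR, timeout=5) as conn:
    for server, role, health in conn.cursor().execute(QUERY):
        print(f"{server}: role={role}, sync_health={health}")
```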
Identity and trust dependencies
- Failovers can change SPNs, break constrained delegation or expose time skew, producing outages that look like “AF is down” but are authentication-related.
- Revisit Securing the AVEVA PI System in Modern Enterprise Environments during HA design. HA tests that ignore Kerberos, delegation, certificates and service accounts are incomplete.
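A small starting point for those identity checks, sketched below for Windows, is simply confirming that the SPNs you expect actually exist, using setspn.exe. The SPN strings are placeholders for your real service accounts and hostnames; delegation and time-skew checks still need to be tested separately.

```python
# Windows-only sketch: query whether the expected SPNs exist, using setspn.exe
# (present on domain-joined Windows servers). SPN strings are placeholders;
# delegation and time-skew checks are separate exercises.
import subprocess

SPNS_TO_CHECK = [
    "HTTP/pivision.example.com",   # placeholder web SPN used for Kerberos to PI Vision
    "AFServer/af01.example.com",   # placeholder AF Server SPN
]

for spn in SPNS_TO_CHECK:
    result = subprocess.run(["setspn", "-Q", spn], capture_output=True, text=True)
    print(f"--- setspn -Q {spn} ---")
    print(result.stdout.strip() or result.stderr.strip())
```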
PI Vision load balancing: scale-out with state awareness
- PI Vision is commonly scaled behind a load balancer to provide resilience and concurrency.
- Use load balancer health checks that reflect user experience, not just “port 443 open”. Health probes should include authentication and basic display rendering where feasible (see the probe sketch after this list).
- Configure graceful node drains for patching so authentication flows and sessions aren’t disrupted.
- PI Vision depends on AF, PI Data Archive, Active Directory and DNS. Web-tier HA by itself only survives IIS-level failures; preserving business continuity requires upstream redundancy.
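A minimal probe along those lines is sketched below. It authenticates with Windows integrated authentication via the requests-negotiate-sspi package (an assumed choice; adapt to your authentication method) and checks that a real page comes back. The URL and the success marker are placeholders for your deployment.

```python
# Minimal sketch of a probe that goes beyond "port 443 is open": it
# authenticates (requests-negotiate-sspi, an assumed choice, Windows only)
# and checks that a real page comes back. URL and marker are placeholders.
import sys
import requests
from requests_negotiate_sspi import HttpNegotiateAuth

NODE_URL = "https://pivision01.example.com/PIVision/"

try:
    resp = requests.get(NODE_URL, auth=HttpNegotiateAuth(), timeout=10)
    healthy = resp.status_code == 200 and "PI Vision" in resp.text
except requests.RequestException as exc:
    print(f"probe failed: {exc}")
    sys.exit(2)

print(f"{NODE_URL}: status={resp.status_code}, healthy={healthy}")
sys.exit(0 if healthy else 1)
```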
Capacity and failure-mode planning
- Ensure remaining nodes have headroom to absorb extra load without resource exhaustion (CPU, memory, thread pools), especially during peak periods.
- Validate by draining a node during a busy period and observing responsiveness and latency, not just uptime.
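One way to observe that, sketched below with a placeholder URL, is to sample page latency at a fixed interval before, during and after the drain. Even an unauthenticated 401 response is a useful latency signal for this purpose.

```python
# Minimal sketch: sample page latency at a fixed interval around a node drain,
# so you see user-facing impact rather than just uptime. URL is a placeholder.
# Requires the requests package. Stop it with Ctrl+C.
import time
import requests

URL = "https://pivision.example.com/PIVision/"
INTERVAL_S = 5

while True:
    start = time.monotonic()
    try:
        status = requests.get(URL, timeout=15).status_code
        elapsed_ms = (time.monotonic() - start) * 1000
        print(f"{time.strftime('%H:%M:%S')} status={status} latency={elapsed_ms:.0f} ms")
    except requests.RequestException as exc:
        print(f"{time.strftime('%H:%M:%S')} request failed: {exc}")
    time.sleep(INTERVAL_S)
```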
Reference architectures
Most estates adopt one of three pragmatic approaches:
- Single site, local HA
- PI Data Archive collective; AF servers on redundant nodes with SQL HA; PI Vision scaled behind a load balancer.
- Suitable when the site is reliable and a site outage is handled as a DR event.
- Single site, resilient + recoverable
- As above, but with explicit, tested DR procedures rather than active-active across sites.
- Simpler, often aligns better with actual business needs.
- Multi-site resilience
- Designed to survive a data-centre or campus loss.
- Latency, replication behaviour and identity dependencies dominate complexity. Start with clear RTO/RPO targets and run tabletop exercises with OT and IT.
Trade-offs
- HA increases operational complexity. A PI collective adds coordination: patching order, member health checks, client validation and disciplined DNS/network change control.
- AF HA requires mature SQL operations. If SQL HA is immature, AF inherits slow failovers and untested procedures. Investment in runbooks and DBA collaboration is essential.
- PI Vision scale-out adds load balancer configuration, certificate management across nodes and a larger surface to patch and harden. You need a repeatable certificate rotation and node-draining process.
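A small piece of that certificate discipline can be automated. The sketch below, with placeholder hostnames, reports TLS certificate expiry for each PI Vision node so renewal becomes planned rolling maintenance rather than an outage; the default SSL context verifies the certificate, so the issuing CA must be trusted locally.

```python
# Minimal sketch: report TLS certificate expiry for each PI Vision node.
# Hostnames are placeholders; the issuing CA must be trusted locally.
import socket
import ssl
import time

NODES = ["pivision01.example.com", "pivision02.example.com"]
context = ssl.create_default_context()

for node in NODES:
    with socket.create_connection((node, 443), timeout=5) as sock:
        with context.wrap_socket(sock, server_hostname=node) as tls:
            cert = tls.getpeercert()
    expires = ssl.cert_time_to_seconds(cert["notAfter"])
    days_left = int((expires - time.time()) // 86400)
    print(f"{node}: certificate expires {cert['notAfter']} ({days_left} days left)")
```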
Anti-patterns
- Redundancy in one tier with single-point dependencies elsewhere: e.g. archive collective with a single AF server, or many PI Vision nodes relying on a fragile AF/SQL stack.
- Treating buffering as a substitute for HA: buffering preserves data but can mask instability and create long delays in current values.
- “HA by DNS trickery”: manual or ad-hoc DNS changes fail under real events due to cached DNS, firewall rules, SPNs and application config inconsistencies. Prefer tested client failover mechanisms and stable naming patterns.
- Health checks that only test a port: a service can respond to HTTPS while upstream authentication or data access fails. Probes should exercise the critical path.
Operational habits that make HA real
- Validate HA in routine maintenance, not just in diagrams.
- Rolling patching is the key test: patch one PI Vision node at a time, drain and reboot an AF node, update a PI Data Archive member without breaking client connectivity (a skeleton of that loop follows this list).
- Tight change control is essential: certificate renewals, service account updates and cipher policy changes are common failure points.
- Treat identity, certificates and name resolution as first-class HA components with owners, runbooks and test windows.
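The rolling-patch loop itself is simple; the discipline is in the hooks. The skeleton below is illustrative only: drain_node, patch_and_reboot and node_is_healthy are hypothetical stand-ins to wire up to your load balancer API, patching tooling and an authenticated health probe, and the node names are placeholders.

```python
# Skeleton only: the shape of a rolling-patch loop for a PI Vision farm.
# drain_node, patch_and_reboot and node_is_healthy are hypothetical hooks;
# node names are placeholders.
import time

NODES = ["pivision01.example.com", "pivision02.example.com"]

def drain_node(node: str) -> None:
    print(f"draining {node}")          # hypothetical: remove node from the pool

def patch_and_reboot(node: str) -> None:
    print(f"patching {node}")          # hypothetical: apply updates and reboot

def node_is_healthy(node: str) -> bool:
    return True                        # hypothetical: run the real health probe here

for node in NODES:
    drain_node(node)
    patch_and_reboot(node)
    while not node_is_healthy(node):   # only re-enable once the probe passes
        time.sleep(30)
    print(f"{node} patched and back in the pool")
```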
Getting specialist help
If you want a design review, failover test plan or a migration path from single-server to HA, engage implementers who have done this work operationally. You can browse PI System integrators and consultants at https://piadmin.com/directory and focus on organisations experienced in both OT constraints and enterprise infrastructure.
Where to go next on PIAdmin.com
