Vision Pro app development rarely breaks at the code level. We've seen it stall consistently at a layer earlier than that — the experience design layer, where teams realize that the spatial interface they imagined is not actually spatial. It's a flat workflow wearing a headset. This post is a five-variable decision framework to catch that problem before it costs a sprint.
The framework draws directly from our work on Work Spatial, an AI-driven XR onboarding MVP that combined spatial computing, Generative AI, and WebGL. That project forced explicit decisions at every branch: which renderer, which UI mode, which interaction model, which parts of the experience genuinely belonged in 3D space, and whether the enterprise deployment context would actually support what we were building. The same five decisions appear in every visionOS project we've evaluated since.
Work through them in order. Each one narrows the solution space before you write a line of visionOS code.
Variable 1 — Does Your Use Case Have Inherent Spatial Meaning?
This is the gate. If the answer is no, the rest of the framework is moot — you should not be building for Apple Vision Pro at all, or you should be building something different from what you're currently imagining.
A use case has inherent spatial meaning when spatial relationships between elements encode information that would be lost or degraded in two dimensions. Architectural walkthroughs, surgical anatomy planning, industrial assembly guidance, and collaborative virtual environments all qualify. A document editor, a spreadsheet, a chat interface, and a task list do not — regardless of how they look floating in passthrough mode.
The test is simple: if you removed the spatial dimension entirely and displayed the same content on a flat monitor, would the user lose meaningful information or capability? If yes, you have a spatial use case. If no, you have a flat interface seeking a novel display medium — and the additional complexity of visionOS app development will work against you, not for you.
Work Spatial passed this test. Onboarding involves learning physical environments and spatial procedures. A new employee learning a manufacturing floor or a hospital ward needs to understand where things are in relation to each other — that's inherently spatial. The AI-guided exploration component made sense precisely because the environment being learned had real three-dimensional structure. The generative AI layer provided contextual guidance tied to the user's spatial position, which a PDF or a video cannot replicate.
Decision output: If your use case passes the spatial meaning test, continue. If it fails, reconsider the problem framing before evaluating any toolchain.
Variable 2 — SwiftUI Volumes or Fully Immersive Space?
Assuming your use case is genuinely spatial, the next decision is the fundamental structure of the user interface. The visionOS UI architecture offers two primary modes, and choosing the wrong one creates architectural debt that compounds through the entire project.
SwiftUI volumes are bounded three-dimensional windows. They follow familiar windowed-application conventions extended into depth. They use standard SwiftUI layout systems, integrate naturally with visionOS UI conventions, and are appropriate when your application has a clear panel-or-window metaphor — even if that panel contains 3D content. A product configurator, a data visualization dashboard, or a guided step-by-step training tool can all work well as volumes. The visionOS SDK exposes volume sizing, positioning, and RealityKit content embedding through clean SwiftUI APIs, and the development path is familiar to iOS engineers.
Fully immersive spaces fill the user's visible environment and require custom spatial navigation, scene management, and orientation management. They are appropriate when spatial navigation is itself the product — when the user needs to move through an environment, when environmental presence is core to the experience, or when the content is inherently unbounded. VR training scenarios, architectural walkthroughs, and collaborative virtual spaces belong here.
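To make the distinction concrete, here is a minimal sketch of how the two modes are declared at the scene level of a visionOS app target. The view names, scene identifiers, and sizes are placeholders rather than code from Work Spatial; the point is that this choice is made before any content exists.

```swift
import SwiftUI
import RealityKit

// Minimal sketch: declaring both UI modes at the scene level.
// View names, IDs, and sizes are placeholders.
@main
struct SpatialTrainingApp: App {
    var body: some Scene {
        // Bounded volume: a windowed 3D panel laid out with standard SwiftUI.
        WindowGroup(id: "product-volume") {
            ProductVolumeView()
        }
        .windowStyle(.volumetric)
        .defaultSize(width: 0.6, height: 0.4, depth: 0.4, in: .meters)

        // Fully immersive space: fills the user's environment and owns its
        // own navigation, lighting, and scene management.
        ImmersiveSpace(id: "training-environment") {
            TrainingEnvironmentView()
        }
        .immersionStyle(selection: .constant(.full), in: .full)
    }
}

// Placeholder views so the sketch stands alone.
struct ProductVolumeView: View {
    var body: some View {
        RealityView { content in
            // Bounded 3D content lives inside the volume.
            content.add(ModelEntity(mesh: .generateBox(size: 0.2)))
        }
    }
}

struct TrainingEnvironmentView: View {
    var body: some View {
        RealityView { _ in
            // Unbounded environment content is built here.
        }
    }
}
```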
The antipattern we see repeatedly: teams choose immersive space because it looks more impressive in demos, then spend weeks engineering spatial navigation, environmental lighting, and presence mechanics for an application whose underlying content would have worked fine in a volume. On Work Spatial, we used a hybrid approach — an immersive space for the spatial environment exploration component and SwiftUI windows for the AI dialogue interface — because the two components had genuinely different requirements.
Decision output: Default to volumes unless spatial navigation or full environmental presence is a core requirement. Use immersive space only when the user's movement through an environment is part of the value.
Variable 3 — RealityKit or Unity PolySpatial for Vision Pro App Development?
This is the decision that generates the most debate in Vision Pro app development discussions, and it has a cleaner answer than most teams expect.
Choose RealityKit when:
- You are targeting visionOS exclusively, or visionOS is the primary platform with other platforms secondary
- Your team has Swift and Apple platform experience
- You need deep integration with ARKit (plane detection, scene reconstruction, image tracking, mesh generation)
- You need tight conformance with visionOS UI conventions and the Vision Pro App Store review guidelines
- Your application is primarily a productivity or enterprise tool rather than a game or interactive entertainment product
RealityKit uses an entity-component-system architecture that maps cleanly to spatial scene management. Physics simulation, spatial audio, animation, and ARKit integration are all first-party and well-tested on Apple Vision Pro hardware, and RealityKit is the rendering path Apple tunes for the device's M2 chip, GPU, and Neural Engine.
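As a concrete illustration of that pattern, the sketch below shows a custom component, a per-frame system that queries for it, and an entity assembled from first-party and custom components. The names are illustrative, not from a production codebase.

```swift
import RealityKit

// Sketch of a custom component, a system that queries for it, and an
// entity assembled from components. Names are illustrative.

// Marks stations the onboarding flow should highlight.
struct GuidanceTargetComponent: Component {
    var stationName: String
    var visited: Bool = false
}

// Runs every frame over entities that carry the component.
struct GuidanceHighlightSystem: System {
    static let query = EntityQuery(where: .has(GuidanceTargetComponent.self))

    init(scene: RealityKit.Scene) {}

    func update(context: SceneUpdateContext) {
        for entity in context.entities(matching: Self.query, updatingSystemWhen: .rendering) {
            guard let target = entity.components[GuidanceTargetComponent.self] else { continue }
            // Dim stations the user has already visited.
            entity.components[OpacityComponent.self] = OpacityComponent(opacity: target.visited ? 0.4 : 1.0)
        }
    }
}

// Assembling an entity from first-party and custom components.
// Register once at launch: GuidanceTargetComponent.registerComponent(),
// GuidanceHighlightSystem.registerSystem().
func makeStationMarker(named name: String) -> Entity {
    let marker = ModelEntity(
        mesh: .generateSphere(radius: 0.05),
        materials: [SimpleMaterial(color: .cyan, isMetallic: false)]
    )
    marker.components.set(GuidanceTargetComponent(stationName: name))
    marker.components.set(CollisionComponent(shapes: [.generateSphere(radius: 0.05)]))
    marker.components.set(InputTargetComponent())   // makes the entity a gesture target
    return marker
}
```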
Choose Unity PolySpatial when:
- You need to ship to both Vision Pro and Meta Quest from a shared codebase
- Your team's existing Unity expertise makes native Swift development a meaningful bottleneck
- You are building a game or interactive entertainment experience where Unity's asset ecosystem provides direct value
- You have existing Unity projects whose assets, shaders, or middleware you need to reuse
PolySpatial maps Unity's rendering architecture to visionOS rendering through an abstraction layer. That abstraction layer is the key risk: it occasionally surfaces missing features, performance surprises, or interaction handling differences that require workarounds. Benchmark your specific workload on device before committing to PolySpatial for a production project.
On Work Spatial, we used RealityKit as the primary renderer, with WebGL content embedded via web views for visualizations that already existed in that format. Rebuilding those WebGL assets in RealityKit would have added weeks without adding user value. Pragmatism beat architectural purity.
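For reference, embedding existing WebGL content alongside native scenes is usually just a web view wrapped for SwiftUI. The sketch below is an illustration under stated assumptions, not the Work Spatial implementation; the URL and view name are placeholders and the JavaScript messaging bridge is omitted.

```swift
import SwiftUI
import WebKit

// Sketch of embedding existing WebGL content via a web view. The URL is a
// placeholder; error handling and the JS messaging bridge are omitted.
struct WebGLVisualizationView: UIViewRepresentable {
    let url: URL

    func makeUIView(context: Context) -> WKWebView {
        let webView = WKWebView(frame: .zero, configuration: WKWebViewConfiguration())
        webView.isOpaque = false   // let the surrounding window material show through
        return webView
    }

    func updateUIView(_ webView: WKWebView, context: Context) {
        webView.load(URLRequest(url: url))
    }
}
```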
Decision output: RealityKit for Apple-native single-platform projects with Swift teams. PolySpatial for multi-platform projects or Unity-native teams. Hybrid approaches are legitimate when existing assets justify the integration complexity.
Variable 4 — Which Interaction Model Does Your Use Case Require?
Apple Vision Pro supports hand tracking with pinch gestures, eye gaze combined with hand confirmation, and game controller input via Bluetooth-paired controllers. These are not interchangeable — each has distinct technical characteristics, user experience implications, and appropriate use cases.
Hand tracking with pinch is the default and the most natural for most spatial applications. Users point at objects and pinch to select. The interaction is intuitive for new users and requires no hardware beyond the headset. The technical constraint is update frequency (approximately 60-90 Hz depending on conditions) and graceful degradation when hands are occluded or in poor lighting. Every application should handle invalid hand tracking states explicitly — a state machine, not an assumption that tracking is always available.
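A minimal sketch of that state machine, assuming the ARKit HandTrackingProvider API on visionOS; the state names and fallback policy are illustrative choices, not a standard:

```swift
import SwiftUI
import ARKit

// Sketch of treating hand-tracking availability as an explicit state.
// Assumes the visionOS HandTrackingProvider; a real app also needs the
// hands-tracking usage description and user authorization.
enum HandInputState {
    case tracking      // valid anchors arriving
    case occluded      // provider running, but the anchor is not currently tracked
    case unavailable   // unsupported device, denied permission, or session failure
}

@MainActor
final class HandInputModel: ObservableObject {
    @Published var state: HandInputState = .unavailable

    private let session = ARKitSession()
    private let handTracking = HandTrackingProvider()

    func start() async {
        guard HandTrackingProvider.isSupported else {
            state = .unavailable
            return
        }
        do {
            try await session.run([handTracking])
        } catch {
            state = .unavailable
            return
        }
        for await update in handTracking.anchorUpdates {
            // Downstream UI switches to a fallback affordance whenever the
            // state leaves .tracking, instead of assuming hands are visible.
            state = update.anchor.isTracked ? .tracking : .occluded
        }
    }
}
```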
Eye gaze combined with hand confirmation — look at an object, pinch to select — is appropriate for applications where precise targeting of small elements is important. Eye gaze has approximately ±2-3 degrees of visual angle accuracy, which means UI elements must be sized and spaced to accommodate this inherent uncertainty. The interaction is fast and low-fatigue for short sessions but tiring over extended use. visionOS 2 improved gaze tracking calibration, which reduced targeting errors in our testing.
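A small sketch of what gaze-friendly sizing looks like in SwiftUI terms; the 60-point floor reflects Apple's visionOS interface guidance as we read it, and the spacing is an illustrative choice rather than a measured value:

```swift
import SwiftUI

// Sketch of sizing controls for gaze-and-pinch targeting. The 60-point
// minimum reflects Apple's visionOS interface guidance as we read it;
// the spacing is illustrative, not measured.
struct GazeFriendlyToolbar: View {
    let actions = ["Previous", "Repeat step", "Next"]

    var body: some View {
        HStack(spacing: 16) {                            // keep neighbors apart
            ForEach(actions, id: \.self) { label in
                Button(label) { /* handle selection */ }
                    .frame(minWidth: 60, minHeight: 60)  // at least 60pt per target
                    .contentShape(.hoverEffect, .rect(cornerRadius: 16))
            }
        }
        .padding()
        .glassBackgroundEffect()                         // standard visionOS panel material
    }
}
```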
Game controller input, via a Bluetooth-paired controller, provides the most familiar input model for developers with gaming backgrounds and the highest-frequency, most precise input. It is appropriate for applications where precise, repeated input is required and where the hardware dependency is acceptable. Most enterprise and productivity applications should not require controller input; requiring users to manage a separate controller accessory adds friction to deployment.
The architectural decision that teams avoid: which interaction mechanism does the application assume is available? Work Spatial used hand tracking as the primary interaction model for spatial navigation and pinch-to-select for AI dialogue prompts, with web-based UI for detailed content interaction. Segregating interaction paradigms by component — spatial gestures for 3D manipulation, web UI for detailed forms — produced better usability than forcing uniform spatial interaction across all components.
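The sketch below shows one way to express that segregation: a spatial tap gesture targeted to RealityKit entities for selection, with a flat SwiftUI surface for the detailed content. Entity names, view names, and layout are placeholders.

```swift
import SwiftUI
import RealityKit

// Sketch of segregating interaction by component: pinch selection on 3D
// entities, a flat SwiftUI surface for detailed content. Names are placeholders.
struct ExplorationView: View {
    @State private var selectedStation: String?

    var body: some View {
        RealityView { content in
            let marker = ModelEntity(mesh: .generateSphere(radius: 0.05))
            marker.name = "assembly-station-3"
            marker.components.set(CollisionComponent(shapes: [.generateSphere(radius: 0.05)]))
            marker.components.set(InputTargetComponent())   // opt the entity into gestures
            content.add(marker)
        }
        // Spatial path: look-and-pinch (or direct pinch) selects the entity.
        .gesture(
            SpatialTapGesture()
                .targetedToAnyEntity()
                .onEnded { value in
                    selectedStation = value.entity.name
                }
        )
        // Conventional path: detailed guidance stays in a flat panel.
        .overlay(alignment: .bottom) {
            if let selectedStation {
                Text("Guidance for \(selectedStation)")
                    .padding()
                    .glassBackgroundEffect()
            }
        }
    }
}
```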
Decision output: Default to hand tracking with pinch for most applications. Add gaze-confirmation for precision-targeting requirements. Avoid controller dependency in enterprise deployments. Design explicit state machines for interaction fallback.
Variable 5 — Is Your Enterprise Context Actually Ready to Deploy?
This variable is specific to enterprise Vision Pro app development, which is where most of our client work lives. The technical build is rarely the blocker. The deployment context almost always is.
Four friction points appear in every enterprise spatial computing project:
IT infrastructure. Does the enterprise environment support MDM enrollment for Vision Pro? Does VPN work with the headset's network stack? Can the application authenticate through the company's identity provider? These questions have answers, but they require IT involvement before development begins — not after the build is complete.
Workflow integration. A Vision Pro application that exists in isolation from the rest of the enterprise toolchain provides limited value. Work Spatial integrated with existing HR and onboarding systems precisely because a standalone spatial experience without connection to real employee data would have been a demo, not a product.
Device economics. At the current hardware cost, enterprise deployment economics require a clear per-device ROI calculation. The use cases that justify the hardware cost are concentrated: spatial visualization for high-stakes decisions (surgical planning, architectural review, industrial training), remote collaboration where spatial presence reduces travel or error costs, and training scenarios where VR simulation measurably reduces real-world risk.
Change management. Staff who have never used a spatial wearable need onboarding, not just training. The interaction model of visionOS apps — hand tracking, spatial audio, passthrough mixed reality — is unfamiliar to most enterprise users. Deployment timelines that do not account for change management consistently underestimate adoption friction.
The Empathy Lab project for the UK rail industry illustrates what happens when these variables align. The client said: "Putting staff through the VR scenarios changed the vocabulary we hear back in the control room." That outcome required not just a technically sound build but an enterprise deployment context where the IT infrastructure was in place, the workflow integration was clear, and the change management process was owned by the client.
Decision output: Resolve IT infrastructure, workflow integration, device economics, and change management questions before finalizing scope. If any of these are unresolved, add discovery phases to the project plan.
The 1-Page Decision Matrix
Copy this matrix into your project brief before scoping any visionOS build.
| Variable | Option A | Option B | Your Answer |
|---|---|---|---|
| 1. Spatial meaning | Use case has inherent 3D structure → proceed | Use case is 2D content in 3D display → reconsider | |
| 2. UI mode | SwiftUI volumes (panel/window metaphor, bounded 3D content) | Immersive space (spatial navigation is the product) | |
| 3. Renderer | RealityKit (Apple-native, single platform, Swift team) | Unity PolySpatial (multi-platform, Unity team) | |
| 4. Interaction | Hand tracking + pinch (default, no accessories) | Gaze + confirm (precision targeting required) | |
| 5. Enterprise readiness | IT, MDM, workflow integration resolved → build | Unresolved deployment questions → discovery first | |
If Variable 1 is Option B, stop and reframe the brief. Every other decision depends on it being Option A.
If Variables 2 and 3 are both the more complex option (immersive space + PolySpatial), expect a significantly longer timeline and allocate explicit budget for cross-platform testing on device — the simulator will not surface the integration edge cases.
If Variable 5 has unresolved questions, add a 2-4 week discovery phase before committing to a build timeline. Enterprise spatial computing projects that skip this step consistently encounter expensive scope changes mid-build.
Related Reading
- Apple Vision Pro development services and capabilities — our full overview of what we build on visionOS and spatial computing platforms
- Work Spatial project case study — the AI-driven XR onboarding MVP that combined spatial computing, Generative AI, and WebGL in a single build
- Enterprise VR training services — how we approach spatial computing for enterprise L&D and onboarding beyond the Vision Pro ecosystem
If you are scoping a Vision Pro app development project and want a second opinion on where your build fits this framework, talk to our team. We will tell you plainly whether spatial computing is the right tool for your use case — and if it is, what the build should actually look like.