Vision Pro app development rarely breaks at the code level. We've seen it stall consistently at a layer earlier than that — the experience design layer, where teams realize that the spatial interface they imagined is not actually spatial. It's a flat workflow wearing a headset. This post is a five-variable decision framework to catch that problem before it costs a sprint.
The framework draws directly from our work on Work Spatial, an AI-driven XR onboarding MVP that combined spatial computing, Generative AI, and WebGL. That project forced explicit decisions at every branch: which renderer, which UI mode, which interaction model, which parts of the experience genuinely belonged in 3D space, and whether the enterprise deployment context would actually support what we were building. The same five decisions appear in every visionOS project we've evaluated since.
Work through them in order. Each one narrows the solution space before you write a line of visionOS code.
Variable 1 — Does Your Use Case Have Inherent Spatial Meaning?
This is the gate. If the answer is no, the rest of the framework is moot — you should not be building for Apple Vision Pro at all, or you should be building something different from what you're currently imagining.
A use case has inherent spatial meaning when spatial relationships between elements encode information that would be lost or degraded in two dimensions. Architectural walkthroughs, surgical anatomy planning, industrial assembly guidance, and collaborative virtual environments all qualify. A document editor, a spreadsheet, a chat interface, and a task list do not — regardless of how they look floating in passthrough mode.
The test is simple: if you removed the spatial dimension entirely and displayed the same content on a flat monitor, would the user lose meaningful information or capability? If yes, you have a spatial use case. If no, you have a flat interface seeking a novel display medium — and the additional complexity of visionOS app development will work against you, not for you.
Work Spatial passed this test. Onboarding involves learning physical environments and spatial procedures. A new employee learning a manufacturing floor or a hospital ward needs to understand where things are in relation to each other — that's inherently spatial. The AI-guided exploration component made sense precisely because the environment being learned had real three-dimensional structure. The generative AI layer provided contextual guidance tied to the user's spatial position, which a PDF or a video cannot replicate.
Decision output: If your use case passes the spatial meaning test, continue. If it fails, reconsider the problem framing before evaluating any toolchain.
Variable 2 — SwiftUI Volumes or Fully Immersive Space?
Assuming your use case is genuinely spatial, the next decision is the fundamental structure of the user interface. The visionOS UI architecture offers two primary modes, and choosing the wrong one creates architectural debt that compounds through the entire project.
SwiftUI volumes are bounded three-dimensional windows. They follow familiar windowed-application conventions extended into depth. They use standard SwiftUI layout systems, integrate naturally with visionOS UI conventions, and are appropriate when your application has a clear panel-or-window metaphor — even if that panel contains 3D content. A product configurator, a data visualization dashboard, or a guided step-by-step training tool can all work well as volumes. The visionOS SDK exposes volume sizing, positioning, and RealityKit content embedding through clean SwiftUI APIs, and the development path is familiar to iOS engineers.
Fully immersive spaces fill the user's visible environment and require custom spatial navigation, scene management, and orientation management. They are appropriate when spatial navigation is itself the product — when the user needs to move through an environment, when environmental presence is core to the experience, or when the content is inherently unbounded. VR training scenarios, architectural walkthroughs, and collaborative virtual spaces belong here.
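To make the distinction concrete, here is a minimal sketch of how the two modes are declared at the scene level of a visionOS app target. The view names, scene identifiers, and sizes are placeholders rather than code from Work Spatial; the point is that this choice is made before any content exists.

```swift
import SwiftUI
import RealityKit

// Minimal sketch: declaring both UI modes at the scene level.
// View names, IDs, and sizes are placeholders.
@main
struct SpatialTrainingApp: App {
    var body: some Scene {
        // Bounded volume: a windowed 3D panel laid out with standard SwiftUI.
        WindowGroup(id: "product-volume") {
            ProductVolumeView()
        }
        .windowStyle(.volumetric)
        .defaultSize(width: 0.6, height: 0.4, depth: 0.4, in: .meters)

        // Fully immersive space: fills the user's environment and owns its
        // own navigation, lighting, and scene management.
        ImmersiveSpace(id: "training-environment") {
            TrainingEnvironmentView()
        }
        .immersionStyle(selection: .constant(.full), in: .full)
    }
}

// Placeholder views so the sketch stands alone.
struct ProductVolumeView: View {
    var body: some View {
        RealityView { content in
            // Bounded 3D content lives inside the volume.
            content.add(ModelEntity(mesh: .generateBox(size: 0.2)))
        }
    }
}

struct TrainingEnvironmentView: View {
    var body: some View {
        RealityView { _ in
            // Unbounded environment content is built here.
        }
    }
}
```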
The antipattern we see repeatedly: teams choose immersive space because it looks more impressive in demos, then spend weeks engineering spatial navigation, environmental lighting, and presence mechanics for an application whose underlying content would have worked fine in a volume. On Work Spatial, we used a hybrid approach — an immersive space for the spatial environment exploration component and SwiftUI windows for the AI dialogue interface — because the two components had genuinely different requirements.
Decision output: Default to volumes unless spatial navigation or full environmental presence is a core requirement. Use immersive space only when the user's movement through an environment is part of the value.
Variable 3 — RealityKit or Unity PolySpatial for Vision Pro App Development?
This is the decision that generates the most debate in Vision Pro app development discussions, and it has a cleaner answer than most teams expect.
Choose RealityKit when:
- You are targeting visionOS exclusively, or visionOS is the primary platform with other platforms secondary
- Your team has Swift and Apple platform experience
- You need deep integration with ARKit (plane detection, scene reconstruction, image tracking, mesh generation)
- You need tight conformance with visionOS UI conventions and the Vision Pro App Store review guidelines
- Your application is primarily a productivity or enterprise tool rather than a game or interactive entertainment product
RealityKit uses an entity-component-system architecture that maps cleanly to spatial scene management. Physics simulation, spatial audio, animation, and ARKit integration are all first-party and well-tested on Apple Vision Pro hardware, and RealityKit is the rendering path Apple tunes for the device's M2 chip, GPU, and Neural Engine.
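As a concrete illustration of that pattern, the sketch below shows a custom component, a per-frame system that queries for it, and an entity assembled from first-party and custom components. The names are illustrative, not from a production codebase.

```swift
import RealityKit

// Sketch of a custom component, a system that queries for it, and an
// entity assembled from components. Names are illustrative.

// Marks stations the onboarding flow should highlight.
struct GuidanceTargetComponent: Component {
    var stationName: String
    var visited: Bool = false
}

// Runs every frame over entities that carry the component.
struct GuidanceHighlightSystem: System {
    static let query = EntityQuery(where: .has(GuidanceTargetComponent.self))

    init(scene: RealityKit.Scene) {}

    func update(context: SceneUpdateContext) {
        for entity in context.entities(matching: Self.query, updatingSystemWhen: .rendering) {
            guard let target = entity.components[GuidanceTargetComponent.self] else { continue }
            // Dim stations the user has already visited.
            entity.components[OpacityComponent.self] = OpacityComponent(opacity: target.visited ? 0.4 : 1.0)
        }
    }
}

// Assembling an entity from first-party and custom components.
// Register once at launch: GuidanceTargetComponent.registerComponent(),
// GuidanceHighlightSystem.registerSystem().
func makeStationMarker(named name: String) -> Entity {
    let marker = ModelEntity(
        mesh: .generateSphere(radius: 0.05),
        materials: [SimpleMaterial(color: .cyan, isMetallic: false)]
    )
    marker.components.set(GuidanceTargetComponent(stationName: name))
    marker.components.set(CollisionComponent(shapes: [.generateSphere(radius: 0.05)]))
    marker.components.set(InputTargetComponent())   // makes the entity a gesture target
    return marker
}
```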
Choose Unity PolySpatial when:
- You need to ship to both Vision Pro and Meta Quest from a shared codebase
- Your team's existing Unity expertise makes native Swift development a meaningful bottleneck
- You are building a game or interactive entertainment experience where Unity's asset ecosystem provides direct value
- You have existing Unity projects whose assets, shaders, or middleware you need to reuse
PolySpatial maps Unity's rendering architecture to visionOS rendering through an abstraction layer. That abstraction layer is the key risk: it occasionally surfaces missing features, performance surprises, or interaction handling differences that require workarounds. Benchmark your specific workload on device before committing to PolySpatial for a production project.
On Work Spatial, we used RealityKit as the primary renderer, with WebGL content embedded via web views for visualizations that already existed in that format. Rebuilding those WebGL assets in RealityKit would have added weeks without adding user value. Pragmatism beat architectural purity.
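For reference, embedding existing WebGL content alongside native scenes is usually just a web view wrapped for SwiftUI. The sketch below is an illustration under stated assumptions, not the Work Spatial implementation; the URL and view name are placeholders and the JavaScript messaging bridge is omitted.

```swift
import SwiftUI
import WebKit

// Sketch of embedding existing WebGL content via a web view. The URL is a
// placeholder; error handling and the JS messaging bridge are omitted.
struct WebGLVisualizationView: UIViewRepresentable {
    let url: URL

    func makeUIView(context: Context) -> WKWebView {
        let webView = WKWebView(frame: .zero, configuration: WKWebViewConfiguration())
        webView.isOpaque = false   // let the surrounding window material show through
        return webView
    }

    func updateUIView(_ webView: WKWebView, context: Context) {
        webView.load(URLRequest(url: url))
    }
}
```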
Decision output: RealityKit for Apple-native single-platform projects with Swift teams. PolySpatial for multi-platform projects or Unity-native teams. Hybrid approaches are legitimate when existing assets justify the integration complexity.
Variable 4 — Which Interaction Model Does Your Use Case Require?
Apple Vision Pro supports hand tracking with pinch gestures, eye gaze combined with hand confirmation, and game controller input via Bluetooth-paired controllers. These are not interchangeable — each has distinct technical characteristics, user experience implications, and appropriate use cases.
Hand tracking with pinch is the default and the most natural for most spatial applications. Users point at objects and pinch to select. The interaction is intuitive for new users and requires no hardware beyond the headset. The technical constraint is update frequency (approximately 60-90 Hz depending on conditions) and graceful degradation when hands are occluded or in poor lighting. Every application should handle invalid hand tracking states explicitly — a state machine, not an assumption that tracking is always available.
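A minimal sketch of that state machine, assuming the ARKit HandTrackingProvider API on visionOS; the state names and fallback policy are illustrative choices, not a standard:

```swift
import SwiftUI
import ARKit

// Sketch of treating hand-tracking availability as an explicit state.
// Assumes the visionOS HandTrackingProvider; a real app also needs the
// hands-tracking usage description and user authorization.
enum HandInputState {
    case tracking      // valid anchors arriving
    case occluded      // provider running, but the anchor is not currently tracked
    case unavailable   // unsupported device, denied permission, or session failure
}

@MainActor
final class HandInputModel: ObservableObject {
    @Published var state: HandInputState = .unavailable

    private let session = ARKitSession()
    private let handTracking = HandTrackingProvider()

    func start() async {
        guard HandTrackingProvider.isSupported else {
            state = .unavailable
            return
        }
        do {
            try await session.run([handTracking])
        } catch {
            state = .unavailable
            return
        }
        for await update in handTracking.anchorUpdates {
            // Downstream UI switches to a fallback affordance whenever the
            // state leaves .tracking, instead of assuming hands are visible.
            state = update.anchor.isTracked ? .tracking : .occluded
        }
    }
}
```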
Eye gaze combined with hand confirmation — look at an object, pinch to select — is appropriate for applications where precise targeting of small elements is important. Eye gaze has approximately ±2-3 degrees of visual angle accuracy, which means UI elements must be sized and spaced to accommodate this inherent uncertainty. The interaction is fast and low-fatigue for short sessions but tiring over extended use. visionOS 2 improved gaze tracking calibration, which reduced targeting errors in our testing.
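A small sketch of what gaze-friendly sizing looks like in SwiftUI terms; the 60-point floor reflects Apple's visionOS interface guidance as we read it, and the spacing is an illustrative choice rather than a measured value:

```swift
import SwiftUI

// Sketch of sizing controls for gaze-and-pinch targeting. The 60-point
// minimum reflects Apple's visionOS interface guidance as we read it;
// the spacing is illustrative, not measured.
struct GazeFriendlyToolbar: View {
    let actions = ["Previous", "Repeat step", "Next"]

    var body: some View {
        HStack(spacing: 16) {                            // keep neighbors apart
            ForEach(actions, id: \.self) { label in
                Button(label) { /* handle selection */ }
                    .frame(minWidth: 60, minHeight: 60)  // at least 60pt per target
                    .contentShape(.hoverEffect, .rect(cornerRadius: 16))
            }
        }
        .padding()
        .glassBackgroundEffect()                         // standard visionOS panel material
    }
}
```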
Game controller input, via a Bluetooth-paired controller, provides the most familiar input model for developers with gaming backgrounds and the highest-frequency, most precise input. It is appropriate for applications where precise, repeated input is required and where the hardware dependency is acceptable. Most enterprise and productivity applications should not require controller input; requiring users to manage a separate controller accessory adds friction to deployment.
The architectural decision that teams avoid: which interaction mechanism does the application assume is available? Work Spatial used hand tracking as the primary interaction model for spatial navigation and pinch-to-select for AI dialogue prompts, with web-based UI for detailed content interaction. Segregating interaction paradigms by component — spatial gestures for 3D manipulation, web UI for detailed forms — produced better usability than forcing uniform spatial interaction across all components.
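The sketch below shows one way to express that segregation: a spatial tap gesture targeted to RealityKit entities for selection, with a flat SwiftUI surface for the detailed content. Entity names, view names, and layout are placeholders.

```swift
import SwiftUI
import RealityKit

// Sketch of segregating interaction by component: pinch selection on 3D
// entities, a flat SwiftUI surface for detailed content. Names are placeholders.
struct ExplorationView: View {
    @State private var selectedStation: String?

    var body: some View {
        RealityView { content in
            let marker = ModelEntity(mesh: .generateSphere(radius: 0.05))
            marker.name = "assembly-station-3"
            marker.components.set(CollisionComponent(shapes: [.generateSphere(radius: 0.05)]))
            marker.components.set(InputTargetComponent())   // opt the entity into gestures
            content.add(marker)
        }
        // Spatial path: look-and-pinch (or direct pinch) selects the entity.
        .gesture(
            SpatialTapGesture()
                .targetedToAnyEntity()
                .onEnded { value in
                    selectedStation = value.entity.name
                }
        )
        // Conventional path: detailed guidance stays in a flat panel.
        .overlay(alignment: .bottom) {
            if let selectedStation {
                Text("Guidance for \(selectedStation)")
                    .padding()
                    .glassBackgroundEffect()
            }
        }
    }
}
```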
Decision output: Default to hand tracking with pinch for most applications. Add gaze-confirmation for precision-targeting requirements. Avoid controller dependency in enterprise deployments. Design explicit state machines for interaction fallback.
Variable 5 — Is Your Enterprise Context Actually Ready to Deploy?
This variable is specific to enterprise Vision Pro app development, which is where most of our client work lives. The technical build is rarely the blocker. The deployment context almost always is.
Four friction points appear in every enterprise spatial computing project:
IT infrastructure. Does the enterprise environment support MDM enrollment for Vision Pro? Does VPN work with the headset's network stack? Can the application authenticate through the company's identity provider? These questions have answers, but they require IT involvement before development begins — not after the build is complete.
Workflow integration. A Vision Pro application that exists in isolation from the rest of the enterprise toolchain provides limited value. Work Spatial integrated with existing HR and onboarding systems precisely because a standalone spatial experience without connection to real employee data would have been a demo, not a product.
Device economics. At the current hardware cost, enterprise deployment economics require a clear per-device ROI calculation. The use cases that justify the hardware cost are concentrated: spatial visualization for high-stakes decisions (surgical planning, architectural review, industrial training), remote collaboration where spatial presence reduces travel or error costs, and training scenarios where VR simulation measurably reduces real-world risk.
Change management. Staff who have never used a spatial wearable need onboarding, not just training. The interaction model of visionOS apps — hand tracking, spatial audio, passthrough mixed reality — is unfamiliar to most enterprise users. Deployment timelines that do not account for change management consistently underestimate adoption friction.
The Empathy Lab project for the UK rail industry illustrates what happens when these variables align. The client said: "Putting staff through the VR scenarios changed the vocabulary we hear back in the control room." That outcome required not just a technically sound build but an enterprise deployment context where the IT infrastructure was in place, the workflow integration was clear, and the change management process was owned by the client.
Decision output: Resolve IT infrastructure, workflow integration, device economics, and change management questions before finalizing scope. If any of these are unresolved, add discovery phases to the project plan.
The 1-Page Decision Matrix
Copy this matrix into your project brief before scoping any visionOS build.
| Variable | Option A | Option B | Your Answer |
|---|---|---|---|
| 1. Spatial meaning | Use case has inherent 3D structure → proceed | Use case is 2D content in 3D display → reconsider | |
| 2. UI mode | SwiftUI volumes (panel/window metaphor, bounded 3D content) | Immersive space (spatial navigation is the product) | |
| 3. Renderer | RealityKit (Apple-native, single platform, Swift team) | Unity PolySpatial (multi-platform, Unity team) | |
| 4. Interaction | Hand tracking + pinch (default, no accessories) | Gaze + confirm (precision targeting required) | |
| 5. Enterprise readiness | IT, MDM, workflow integration resolved → build | Unresolved deployment questions → discovery first | |
If Variable 1 is Option B, stop and reframe the brief. Every other decision depends on it being Option A.
If Variables 2 and 3 are both the more complex option (immersive space + PolySpatial), expect a significantly longer timeline and allocate explicit budget for cross-platform testing on device — the simulator will not surface the integration edge cases.
If Variable 5 has unresolved questions, add a 2-4 week discovery phase before committing to a build timeline. Enterprise spatial computing projects that skip this step consistently encounter expensive scope changes mid-build.
Related Reading
- Apple Vision Pro development services and capabilities — our full overview of what we build on visionOS and spatial computing platforms
- Work Spatial project case study — the AI-driven XR onboarding MVP that combined spatial computing, Generative AI, and WebGL in a single build
- Enterprise VR training services — how we approach spatial computing for enterprise L&D and onboarding beyond the Vision Pro ecosystem
If you are scoping a Vision Pro app development project and want a second opinion on where your build fits this framework, talk to our team. We will tell you plainly whether spatial computing is the right tool for your use case — and if it is, what the build should actually look like.