Using LookAtPoint to Pinpoint the User's Gaze on the Mobile Screen Plane
In this article, we'll explore the practical application of eye gaze tracking on a mobile screen. By leveraging ARKit's ARFaceAnchor and its lookAtPoint property, we can determine precisely where a user is looking on their device.
The code and resources you need are hosted in the GitHub repository. The basics of ARKit, ARFaceAnchor, BlendShapeLocations, and lookAtPoint are covered in the first part.
At a recent Apple event, the frontier of eye tracking technology was pushed further with the Vision Pro mixed-reality headset, showcasing a future where users can interact with applications using just their eyes.
Local Coordinate Space vs. World Coordinate System
Local Coordinate Space:
Imagine the local coordinate space of the ARFaceAnchor as a 3D grid centered on the face. The origin of this grid is at the center of the face, and positions and rotations are defined relative to this origin. The lookAtPoint is a coordinate in this local space and indicates the direction the user is looking relative to their own face.
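For reference, this local-space value can be read straight off the anchor; a minimal sketch (the helper name is ours, ARFaceAnchor's lookAtPoint property is the only ARKit API used):
import ARKit

// lookAtPoint is a simd_float3 in the face anchor's local coordinate space,
// estimating the point the eyes are directed at, relative to the face origin.
func localGazePoint(of faceAnchor: ARFaceAnchor) -> simd_float3 {
    return faceAnchor.lookAtPoint
}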
World Coordinate System:
Now, let’s expand this view to the world coordinate system. In the broader world space (the entire environment captured by the camera), every object, including the face anchor, has its own position and orientation. The world coordinate system provides a global reference frame for all these objects.
The lookAtPoint is a point expressed in the local coordinate space of the face anchor. To convert it into a meaningful global position, you need to consider the anchor's own position and orientation in the world coordinate system.
Simplified Explanation: Local Coordinate Space vs. World Coordinate System and their relation to lookAtPoint & ARFaceAnchor
Think of the ARFaceAnchor as the center of a little world on your face. This is the Local Coordinate Space. In this space, everything is measured from the center of your face, like how you might give directions using your own body as a reference point (raise your right hand).
Now, imagine you’re in a larger world, like a room. This is the World Coordinate System. In this big world, there are not only things on your face (like the ARFaceAnchor or where you’re looking), but also everything else around you, including the device’s camera capturing your face.
So, when we talk about where you’re looking (lookAtPoint), we first figure it out in the little world on your face (Local Coordinate Space with the ARFaceAnchor as the center). Then, we take that information and translate it to make sense in the bigger world around you (World Coordinate System), where there’s more than just your face — there’s the whole environment captured by the camera. It’s like saying, “I’m pointing to my nose in my personal space, and now let’s find where that is in the entire room.”
lookAtPoint w.r.t. the world coordinate system
To convert the lookAtPoint from the local coordinate space of the ARFaceAnchor to the world coordinate system, we can use matrix multiplication. The process involves applying the transformation matrix of the ARFaceAnchor to the lookAtPoint. This brings the point from the local coordinate space of the face anchor into the world coordinate system.
let lookAtPointInWorld = faceAnchor.transform * simd_float4(lookAtPoint, 1)
In this line, faceAnchor.transform represents the transformation matrix of the ARFaceAnchor, and lookAtPoint is extended to a simd_float4 to facilitate matrix multiplication. Setting the fourth (w) component to 1 marks the value as a position in homogeneous coordinates, so the matrix's translation is applied as well as its rotation (a pure direction would use w = 0). The multiplication combines the transformation matrix and the point, resulting in a new point that is now in the world coordinate system.
After this transformation, lookAtPointInWorld contains the coordinates of the lookAtPoint with respect to the world coordinate system.
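Putting this step into context, here is a minimal helper sketch (the function name is ours; faceAnchor.transform and lookAtPoint are the ARKit properties discussed above):
import ARKit

// Converts the anchor's lookAtPoint from its local coordinate space
// into the world coordinate system.
func lookAtPointInWorld(for faceAnchor: ARFaceAnchor) -> simd_float4 {
    // Promote to homogeneous coordinates (w = 1) so the 4x4 transform
    // applies its translation as well as its rotation.
    return faceAnchor.transform * simd_float4(faceAnchor.lookAtPoint, 1)
}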
Camera Transformations
The camera is a crucial component in AR, responsible for capturing the real-world environment and providing the necessary data for rendering virtual objects in the correct perspective.
In ARKit, the camera.transform represents the transformation matrix of the device's camera. The transformation matrix is a mathematical construct that encapsulates translation, rotation, and scaling operations in 3D space. Essentially, it contains information about how the camera is positioned and oriented relative to the world coordinate system.
let cameraTransform = session.currentFrame?.camera.transform
A breakdown of what the camera.transform includes (see the sketch after this list):
- Translation (Position): Specifies the location of the camera in 3D space (X, Y, Z coordinates).
- Rotation: Describes the orientation of the camera. This can include rotations around the X, Y, and Z axes.
- Scaling: Defines any scaling factors applied to the camera.
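As a quick illustration, the translation component can be read from the fourth column of the 4x4 matrix; a small sketch (the helper name is ours):
import simd

// Reads a world-space position out of a transform matrix.
// simd_make_float3 drops the homogeneous w component.
func position(from transform: simd_float4x4) -> simd_float3 {
    return simd_make_float3(transform.columns.3)
}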
camera.transform is useful for bringing the lookAtPoint (gaze direction), once it has been expressed in the world coordinate system, into the camera's own coordinate space: in the next section we multiply by its inverse to do exactly that.
In summary, the camera.transform provides a comprehensive representation of how the device's camera is positioned and oriented in 3D space, enabling accurate transformations between the real world and the virtual world in AR applications.
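In practice, the transform is fetched from the session's current frame and inverted whenever we need to map world-space points back into camera space; a sketch, assuming access to the app's ARSession:
import ARKit

// Returns a matrix that maps points from world space into camera space,
// or nil if the session hasn't produced a frame yet.
func worldToCameraTransform(in session: ARSession) -> simd_float4x4? {
    guard let cameraTransform = session.currentFrame?.camera.transform else { return nil }
    return simd_inverse(cameraTransform)
}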
lookAtPoint onto the mobile screen
To find the reflection of a point in the XY plane (phone screen plane) of a given camera transform, you can follow these steps:
1. Transform the lookAtPoint to camera coordinates:
- We have the lookAtPointInWorld in global coordinates (world coordinate system).
- Use the inverse of the camera transform to convert it to camera coordinates. This new point is referred to as lookAtPointInCamera.
let lookAtPointInCamera = simd_mul(simd_inverse(cameraTransform), lookAtPointInWorld)
2. Reflection in the XY Plane (phone screen plane):
- To project the point onto the XY plane, we can simply drop its z-coordinate, since lookAtPointInCamera is already expressed in camera coordinates.
- The resulting coordinates are (transformedLookAtPoint.x, transformedLookAtPoint.y); see the sketch after this list.
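A small sketch of this projection step, assuming transformedLookAtPoint is the name the screen-mapping code in the next section expects:
import simd

// Drops the z (and w) components of the camera-space gaze point,
// leaving its coordinates on the camera's XY plane.
func projectOntoScreenPlane(_ lookAtPointInCamera: simd_float4) -> simd_float2 {
    return simd_float2(lookAtPointInCamera.x, lookAtPointInCamera.y)
}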
In conclusion, by transforming the lookAtPoint from global coordinates to camera coordinates and projecting it onto the XY plane, we obtain the user's gaze point in the camera coordinate system.
FocusPoint on Mobile Screen
After obtaining the user’s gaze point coordinates in the camera coordinates system, the next step is to project this point onto the mobile screen. The process involves several key calculations:
// Map the camera-space gaze point to screen points. Note that x and y are
// swapped when going from the camera's coordinate space to the portrait screen.
let screenX = transformedLookAtPoint.y / (Float(Device.screenSize.width) / 2) * Float(Device.frameSize.width)
let screenY = transformedLookAtPoint.x / (Float(Device.screenSize.height) / 2) * Float(Device.frameSize.height)

// Clamp so the focus point always lands within the visible screen bounds.
let focusPoint = CGPoint(
    x: CGFloat(screenX).clamped(to: Ranges.widthRange),
    y: CGFloat(screenY).clamped(to: Ranges.heightRange)
)
Normalization:
- transformedLookAtPoint.y / (Float(Device.screenSize.width) / 2): this step normalizes the y-coordinate of the gaze point relative to half of the screen width, resulting in a value between -1 and 1.
- transformedLookAtPoint.x / (Float(Device.screenSize.height) / 2): similarly, the x-coordinate is normalized relative to half of the screen height.
Scaling to Screen Size:
- * Float(Device.frameSize.width): the normalized coordinate is then scaled to the full width of the screen.
- * Float(Device.frameSize.height): likewise, the other normalized coordinate is scaled to the full height of the screen.
Creating a CGPoint:
- let focusPoint = CGPoint(...): the scaled coordinates are then used to create a CGPoint, representing the user's focus point on the mobile screen.
Clamping to Screen Boundaries:
- clamped(to: Ranges.widthRange) and clamped(to: Ranges.heightRange): to ensure the focus point stays within the screen boundaries, these clamping operations restrict the coordinates to valid ranges (a sketch of this helper follows).
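Note that clamped(to:) is not part of the Swift standard library; a minimal version, together with hypothetical Ranges spanning the screen's point dimensions, might look like this:
import UIKit

extension Comparable {
    // Restricts a value to the given closed range.
    func clamped(to range: ClosedRange<Self>) -> Self {
        return min(max(self, range.lowerBound), range.upperBound)
    }
}

// Hypothetical ranges covering the visible screen in points.
enum Ranges {
    static let widthRange: ClosedRange<CGFloat> = 0...UIScreen.main.bounds.width
    static let heightRange: ClosedRange<CGFloat> = 0...UIScreen.main.bounds.height
}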
Limitations of Eye Tracking with ARKit
- Head movement: Head movement can affect the accuracy of eye tracking, especially if the user is moving their head quickly.
- Blinking: Blinking can also affect the accuracy of eye tracking, as the camera may not be able to track the user’s eyes during a blink.
- Ambient lighting: Ambient lighting can also affect the accuracy of eye tracking, as the camera may not be able to see the user’s eyes properly if the lighting is too bright or too dark.
- Other factors: glasses and accessories, eyelid occlusion, eye conditions, distance and angle, limited field of view, environmental factors, and hardware limitations.