EffectivePose Report: UCF Spring 2024 Independent Study

Ethan Legum | CAP 6908

Abstract

EffectivePose is an Android app that uses machine learning to transform the live video stream from a device's camera into 3D body pose estimates. Running in real time on an ordinary mobile phone, with no external tracking devices, it detects and analyzes the people in frame and overlays the resulting pose estimates on the video output. Using Google’s ML Kit, I aim to bring the sophistication of advanced 3D body pose estimation to everyday users. Future extensions of this technology have significant implications across a variety of sectors, including sports training, security analysis, and virtual/augmented reality experiences.

Introduction

In pursuit of seamless integration of technology into human life, accessible 3D body pose estimation stands out as a foundational missing key. As we progress towards a future where software acts not merely as a tool but as an extension of our daily activities, the necessity for more natural, intuitive methods of input becomes paramount. EffectivePose aims to herald this future by using ML Kit to enable real-time human body tracking on nothing more than an ordinary Android phone.

The significance of enhancing 3D body pose estimation lies in its vast applications across domains that each demand an understanding of human motion.

The extensions of EffectivePose's core technology are poised to revolutionize various sectors by enabling more natural, intuitive forms of human-computer interaction. Preliminary findings suggest that EffectivePose's high accuracy in body pose estimation holds significant promise for enabling other Android applications, ranging from improving sports training methodologies to enhancing accessibility for individuals with physical limitations. By eliminating the need for the expensive specialized hardware previously required for precise tracking, the technology within EffectivePose represents a meaningful advancement toward creating intelligent systems that understand and interact with us in three dimensions.

Related Works

The journey to developing EffectivePose was shaped by recent innovations in the field of pose estimation, with three pivotal papers significantly informing development. Each of these papers has not only pushed the boundaries of what's possible in pose estimation but also directly inspired aspects of EffectivePose's approach.

The first paper, SportsPose [1], laid a foundation for understanding 3D human pose estimation and its limitations at the time, illuminating the complexities and challenges of accurately estimating poses in high-motion scenarios. Its emphasis on dynamic movements, particularly the nuanced motion of wrists and ankles captured by its novel "local movement" metric, showcased the importance of covering the full spectrum of human activity. This insight was crucial in choosing which model to integrate into the app, since the candidate models differed in size and training data. Understanding the importance of high-motion datasets guided the choice of a larger pre-trained model for EffectivePose.

Following this, RTMO [2] introduced a one-stage framework optimized for real-time multi-person pose estimation. RTMO's novel integration of coordinate classification within the YOLO architecture, together with its use of 1D heatmaps for keypoint representation, significantly enhances spatial resolution, allowing more accurate predictions without substantially increasing computational cost. The paper showed that RTMO achieves accuracy comparable to state-of-the-art methods while maintaining remarkable speed, demonstrating an average precision of 74.8% on the COCO dataset while operating at 141 frames per second (FPS) on a single NVIDIA V100 GPU. The efficiency and accuracy achieved by RTMO inspired the optimization strategies employed in EffectivePose, particularly in sharpening the 2D overlay under real-time processing constraints.
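To make the 1D-heatmap idea concrete, here is a minimal sketch of the general decoding technique (an illustration, not RTMO's actual code): each keypoint coordinate is represented as a vector of per-bin logits along one axis, and decoded to a sub-pixel position via a softmax-weighted expectation over bin indices.

```kotlin
import kotlin.math.exp

// Softmax over per-bin logits; subtracting the max keeps exp() stable.
fun softmax(logits: DoubleArray): DoubleArray {
    val m = logits.maxOrNull()!!
    val exps = logits.map { exp(it - m) }
    val sum = exps.sum()
    return exps.map { it / sum }.toDoubleArray()
}

// Decode one axis of a keypoint: expected bin index under the
// softmax distribution, yielding a sub-pixel coordinate.
fun decode1d(logits: DoubleArray): Double {
    val p = softmax(logits)
    return p.withIndex().sumOf { (i, pi) -> i * pi }
}
```

A sharp peak at bin 3 decodes to roughly 3.0, while the expectation lets neighboring bins pull the estimate to fractional positions, which is what gives the 1D representation its sub-pixel resolution.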

The last paper I’ll discuss, SparsePoser [3], represents a significant leap in 3D body pose estimation by enabling full-body motion reconstruction from the sparse input of six 6-DoF tracking devices. Its novel approach, combining a convolutional-based autoencoder with a learned inverse kinematics (IK) component, not only ensures superior pose quality but also addresses end-effector accuracy with unprecedented precision. SparsePoser's method of handling sparse inputs to generate high-quality, continuous human poses adaptable to users with varying body dimensions informed the decision to use 3D pose estimation, rather than 2D, in EffectivePose to refine joint estimations between identified landmark points.

While the aforementioned studies have significantly advanced the field of pose estimation, traditional tracking systems often rely on specialized hardware or complex setups, limiting their accessibility and convenience for everyday users. EffectivePose addresses these challenges by leveraging mobile devices that are already integral to our daily routines. Drawing upon the insights from these foundational papers, EffectivePose bridges the technological divide, melding the precision of high-accuracy, real-time 2D pose estimation with the depth of 3D pose refinement techniques. My goal is to deliver an exceptionally accurate and user-friendly pose estimation experience that seamlessly functions without the need for specialized hardware, thus making advanced pose tracking accessible to a broader audience.

Methods

In embarking on the development of an Android app for pose estimation, I explored two primary methods, aiming to strike an optimal balance between efficiency, precision, and practicality. These criteria are crucial for the success of the project, given the constraints of time and resources available to me.

The initial approach considered was incorporating RTMO’s Python code into the Android app via an open-source transfer API. RTMO stands out for its ability to overcome the limitations of dense prediction models, delivering high accuracy and efficient processing, as evidenced by its performance of 141 FPS on a V100 GPU. This method suggested the possibility of embedding a real-time full-body tracking system within an Android app that interfaces directly with the camera feed. However, the primary challenge with RTMO is its lack of native Android support, necessitating transfer layers to adapt Python code for Android. This adaptation significantly hampers processing speed, dropping frame rates to less than 1 FPS, which makes it untenable for a real-time camera-based application on Android.

Figure 1: Flow diagram for option 1 ML pipeline utilizing Python

As I continued my research, Google’s ML Kit emerged as a robust alternative, boasting a comprehensive array of machine learning functionalities tailored for mobile development, including pose detection. ML Kit's native compatibility with Android eliminates the need for intermediate layers or modifications, facilitating direct integration into mobile apps. It's optimized for performance on mobile platforms, ensuring an effective compromise between processing speed and detection accuracy. This makes ML Kit a prime candidate for achieving real-time pose estimation within an Android environment, aligning perfectly with the project's objectives of efficiency, reliability, and feasibility.

Figure 2: Flow diagram for option 2 ML pipeline utilizing ML Kit


After thorough consideration and analysis, I decided to proceed with ML Kit for this project. This decision was driven by ML Kit’s native Android support, proven track record in mobile applications, and alignment with my project’s core criteria: efficiency, accuracy, and practicality. The platform's design for Android ensures streamlined development and integration, increasing my confidence in the project's timely and successful completion.

To effectively implement real-time pose estimation on Android devices, it was crucial to establish and meet specific performance benchmarks aimed at minimizing frame processing times for a smooth, real-time experience. Initial testing revealed performance issues, primarily a noticeable lag between the predicted poses and the subject's actual movements, which was attributed to low frame rates. Substantial optimizations were made to address these challenges.

Notable improvements included upgrading from the base model in the ML Kit sample code to the fast and accurate pose detector model, and leveraging GPU processing, which together reduced pose estimation time to under 40 milliseconds. Additionally, rendering of both the camera preview and the graphical overlays was cut to approximately 1 millisecond each by refactoring the overlay code to issue a single draw call and by caching the drawable objects used in the pose rather than instantiating them every frame. Collectively, these enhancements raised the overall frame rate to approximately 30 frames per second, reducing latency and keeping the overlay in step with the user's movements.
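The drawable-caching change can be illustrated with a minimal sketch (the class and method names here are hypothetical, not the app's actual code): buffers are allocated once and refilled each frame, and landmark coordinates are packed into one flat array so the overlay can be drawn in a single batched call.

```kotlin
// Reuse-instead-of-reallocate pattern for the per-frame overlay.
// ML Kit's pose detector reports 33 landmarks per tracked body.
class PoseOverlayBuffer(landmarkCount: Int = 33) {
    // Allocated once and refilled every frame, avoiding per-frame
    // object creation and the garbage-collection pauses it causes.
    private val points = FloatArray(landmarkCount * 2)

    fun update(landmarks: List<Pair<Float, Float>>): FloatArray {
        for ((i, lm) in landmarks.withIndex()) {
            points[2 * i] = lm.first       // x
            points[2 * i + 1] = lm.second  // y
        }
        return points // flat (x, y) buffer suitable for one batched draw call
    }
}
```

Packing coordinates into one flat array is what enables the single-draw-call refactor: the renderer hands the whole buffer to one drawing call instead of issuing one call per joint.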

Results

EffectivePose successfully met its performance target of maintaining over 24 fps on modern Android devices equipped with at least an integrated GPU, achieving frame rates of 30 to 60 fps. On older hardware reliant solely on CPU processing, the application still managed real-time operation, albeit at reduced frame rates of roughly 5-15 fps (Table 1).

Figure 4: Demo running on Google Pixel 8 Pro (CPU+GPU)

Figure 4 presents a demonstration of the application running on a Google Pixel 8 Pro, leveraging both CPU and GPU. As the clips show, the app applies well to multiple use cases, including but not limited to sports training, security analysis, and human-computer interaction.

Figure 5: Comparison of different hardware levels running EffectivePose

EffectivePose Performance Benchmarks

Device                   Samsung A03s (CPU)*   Google Pixel 8 Pro (CPU + GPU)*
Camera Frame Capture     ~1 ms                 ~1 ms
MLKit Processing         <175 ms               <40 ms
Camera Preview Render    ~3 ms                 ~1 ms
Graphic Overlay Render   ~5 ms                 ~1 ms
Final Average FPS        5-15 fps              24-60 fps

*Specifications
Samsung A03s | CPU: Octa-Core, 1.8 GHz | Camera: 13 MP, 30 fps
Pixel 8 Pro  | CPU + GPU: Google Tensor G3, 3.0 GHz | Camera: 50 MP, 60 fps

Table 1: Performance benchmarks on high-end and low-end Android devices

The figure and table above show test results from two devices, highlighting the application’s performance across hardware tiers, from the budget Samsung A03s to the high-end Pixel 8 Pro. As the videos show, fast-paced activity such as basketball requires a higher-end Android phone for accurate pose estimation, while slower movement, such as transitions between yoga poses, remains accurately tracked even on a lower-end phone.
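The figures in Table 1 can be roughly sanity-checked with a back-of-envelope bound, assuming (conservatively) that the four stages run strictly sequentially on one thread; in practice capture, inference, and rendering overlap across threads, so real throughput can exceed this bound.

```kotlin
// If capture, inference, and rendering ran strictly sequentially,
// the frame rate could not exceed 1000 / (sum of stage latencies in ms).
fun sequentialFpsBound(stageMillis: List<Double>): Double =
    1000.0 / stageMillis.sum()

// Stage latencies from Table 1: capture, ML Kit, preview render, overlay render.
val a03sBound = sequentialFpsBound(listOf(1.0, 175.0, 3.0, 5.0))   // ≈ 5.4 fps
val pixelBound = sequentialFpsBound(listOf(1.0, 40.0, 1.0, 1.0))   // ≈ 23 fps
```

The A03s bound of ~5.4 fps matches the low end of its measured 5-15 fps, while the Pixel 8 Pro's measured 24-60 fps exceeds its ~23 fps sequential bound, consistent with the pipeline stages overlapping rather than running back to back.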

Conclusion

EffectivePose successfully delivered accurate pose estimation at satisfactory frame rates across a spectrum of hardware, from the latest high-end smartphones to older, CPU-only models. By leveraging the ubiquitous nature of smartphones and the power of ML Kit’s cutting-edge models for real-time processing, EffectivePose has brought high-level computer vision capabilities to everyday users. The results of this testing highlight EffectivePose's versatility and robustness, underscoring its potential as a widely accessible tool for Android app development involving human-computer interaction.

Looking forward, there are several potential improvements to consider. One is integrating data from the phone's inertial measurement unit (IMU) to compensate for the motion of the handheld device and so stabilize the estimates. A more targeted approach is cropping the camera's input to focus solely on the subject, rather than reducing the overall resolution. This adjustment would provide more consistent input data to the model, improving the accuracy of pose estimation. It would also let the application take full advantage of the high resolutions of modern smartphones, potentially reducing the slight jitter in estimated poses between successive frames.
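One lightweight way to attack that inter-frame jitter (an illustrative sketch, not something EffectivePose currently implements) is to exponentially smooth each landmark coordinate across frames:

```kotlin
// Exponential moving average over per-frame landmark coordinates.
// alpha near 1.0 favors the newest frame (less lag, more jitter);
// alpha near 0.0 favors history (more lag, less jitter).
class LandmarkSmoother(private val alpha: Double = 0.5) {
    private var state: DoubleArray? = null

    fun smooth(coords: DoubleArray): DoubleArray {
        val prev = state
        val next = if (prev == null) coords.copyOf()
                   else DoubleArray(coords.size) { i ->
                       alpha * coords[i] + (1 - alpha) * prev[i]
                   }
        state = next
        return next
    }
}
```

The single `alpha` parameter makes the latency/stability trade-off explicit, and the filter adds only one multiply-add per coordinate per frame, so it would not threaten the frame budget discussed above.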

Overall, EffectivePose not only demonstrates the potential of mobile-based pose estimation to become an accessible, practical tool in applications from virtual reality to sports training, but also democratizes advanced computer vision technologies that were once confined to high-end systems. Achieving up to 60 frames per second on modern devices, and ensuring functionality even on older models, EffectivePose has bridged the gap between accessibility and high performance. As this technology continues to advance, apps like EffectivePose are poised to redefine the standards of user-device interaction, making complex pose estimation an integral part of our digital experiences and shaping the future of how we interact with technology in our everyday lives.

References

[1] Ingwersen, C. K., Mikkelstrup, C. M., Jensen, J. N., Hannemose, M. R., & Dahl, A. B. (2023). SportsPose - A Dynamic 3D Sports Pose Dataset. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW). https://doi.org/10.1109/cvprw59228.2023.00550

[2] Lu, P., Jiang, T., Li, Y., Li, X., Chen, K., & Yang, W. (2023). RTMO: Towards High-Performance One-Stage Real-Time Multi-Person Pose Estimation. ArXiv (Cornell University). https://doi.org/10.48550/arxiv.2312.07526

[3] Ponton, J. L., Yun, H., Aristidou, A., Andújar, C., & Pelechano, N. (2023). SparsePoser: Real-time Full-body Motion Reconstruction from Sparse Data. ACM Transactions on Graphics, 43(1), 1–14. https://doi.org/10.1145/3625264


[4] Pose detection | ML Kit. (n.d.). Google Developers. https://developers.google.com/ml-kit/vision/pose-detection

Appendix

Samsung A03s

Pixel 8 Pro