In a recent development, Apple’s research team has released a new paper on artificial intelligence (AI) models, this time homing in on the comprehension and navigation of smartphone user interfaces (UI). The spotlight of this yet-to-be-peer-reviewed research is Ferret-UI, a multimodal large language model (MLLM) engineered to go beyond conventional computer vision and make sense of intricate smartphone displays. This isn’t the tech giant’s first venture into AI research; previous publications have delved into multimodal LLMs and on-device AI models.
The pre-print edition of the research paper is available on arXiv, an open-access repository for academic papers. Titled “Ferret-UI: Grounded Mobile UI Understanding with Multimodal LLMs,” the paper aims to broaden the utility of MLLMs. It notes that most language models with multimodal capabilities typically falter when confronted with anything beyond natural images, which constrains their functionality, and argues that there is a pronounced need for AI models able to grapple with the intricacies of dynamic interfaces, such as those found on smartphones.
According to the paper, Ferret-UI is specifically “crafted to execute precise referring and grounding tasks unique to UI screens, while proficiently deciphering and responding to open-ended language instructions.” Put simply, this vision language model not only sifts through a smartphone screen adorned with myriad elements representing diverse information but also furnishes users with pertinent details upon request.
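To make the paper’s distinction between referring and grounding concrete: referring goes from a region of the screen to a description of what is there, while grounding goes from a description to a location on the screen. The following is a minimal, purely illustrative Python sketch of those two task shapes; the dataclasses, the `ferret_ui` stub, and its canned answers are assumptions made for explanation, not Apple’s actual interface or code.

```python
# Illustrative sketch only: these dataclasses and the `ferret_ui` stub are
# hypothetical, meant to show the shape of referring vs. grounding tasks,
# not Apple's actual Ferret-UI interface.
from dataclasses import dataclass
from typing import Tuple

Box = Tuple[float, float, float, float]  # normalized (x1, y1, x2, y2) on the screenshot


@dataclass
class ReferringQuery:
    """Referring: given a region of the screen, describe the element there."""
    screenshot_path: str
    region: Box
    instruction: str = "Describe the UI element in this region."


@dataclass
class GroundingQuery:
    """Grounding: given a description, locate the matching element on screen."""
    screenshot_path: str
    instruction: str  # e.g. "Find the icon that opens the Reminders app."


def ferret_ui(query) -> str:
    """Stand-in for the model call; a real MLLM would consume the screenshot
    plus the instruction and return text (and, for grounding, coordinates).
    The answers below are hard-coded placeholders."""
    if isinstance(query, ReferringQuery):
        return "A toggle labelled 'Wi-Fi', currently switched on."
    return "Reminders app icon at box (0.62, 0.41, 0.71, 0.48)."


if __name__ == "__main__":
    print(ferret_ui(ReferringQuery("home_screen.png", (0.10, 0.20, 0.30, 0.25))))
    print(ferret_ui(GroundingQuery("home_screen.png", "Where is the Reminders app?")))
```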
An image included in the paper demonstrates the model’s ability to recognize and categorize widgets and icons, and to answer queries such as “Where is the launch icon?” or “How do I open the Reminders app?” This shows that the AI is not merely capable of describing the contents of the screen it perceives but can also guide users through various sections of an iPhone based on their prompts.
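As a loose illustration of how an answer like the one above might be acted upon, the sketch below converts a hypothetical grounding response containing a normalized bounding box into a tap coordinate; the answer format, helper functions, and screen dimensions are all assumptions rather than details from the paper.

```python
# Hypothetical follow-on: one simple way to act on a grounding result is to
# tap the centre of the returned box. The parsing format and helpers here are
# assumptions, not part of the Ferret-UI paper.
import re
from typing import Optional, Tuple


def parse_box(answer: str) -> Optional[Tuple[float, ...]]:
    """Pull a normalized (x1, y1, x2, y2) box out of the model's text answer."""
    m = re.search(r"\(([\d.]+),\s*([\d.]+),\s*([\d.]+),\s*([\d.]+)\)", answer)
    return tuple(float(v) for v in m.groups()) if m else None


def tap_target(answer: str, screen_w: int, screen_h: int) -> Optional[Tuple[int, int]]:
    """Convert the grounded box into pixel coordinates for a tap at its centre."""
    box = parse_box(answer)
    if box is None:
        return None
    x1, y1, x2, y2 = box
    return int((x1 + x2) / 2 * screen_w), int((y1 + y2) / 2 * screen_h)


if __name__ == "__main__":
    answer = "Reminders app icon at box (0.62, 0.41, 0.71, 0.48)."
    # Example screen size only; prints (784, 1137) for the box above.
    print(tap_target(answer, screen_w=1179, screen_h=2556))
```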