WO2024254051A1 - Autonomous visual information seeking with machine-learned language models - Google Patents

Autonomous visual information seeking with machine-learned language models

Info

Publication number
WO2024254051A1
Authority
WO
WIPO (PCT)
Prior art keywords
data
image
model
machine
output
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Ceased
Application number
PCT/US2024/032378
Other languages
French (fr)
Inventor
Ahmet ISCEN
Alireza Fathi
Ziniu HU
David Alexander Ross
Cordelia Luise SCHMID
Chen Sun
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Google LLC
Original Assignee
Google LLC
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Google LLC
Priority to EP24736245.2A (EP4705905A1)
Publication of WO2024254051A1
Legal status: Ceased (current)

Classifications

    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/50 Information retrieval; Database structures therefor; File system structures therefor of still image data
    • G06F16/53 Querying
    • G06F16/532 Query formulation, e.g. graphical querying
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/50 Information retrieval; Database structures therefor; File system structures therefor of still image data
    • G06F16/53 Querying
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00 Machine learning
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/74 Image or video pattern matching; Proximity measures in feature spaces
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks

Definitions

  • the present disclosure relates generally to autonomous visual information seeking with machine-learned language models. More particularly, the present disclosure relates to leveraging one or more machine-learned language models for planning data processing calls and information reasoning to generate a response to a visual information query.
  • BACKGROUND
  • Large language models can display impressive language understanding and predictive capabilities.
  • language models may struggle with fact-specific queries. Additionally, language models may not be trained for non-text-based and/or non-embedding-based data. For example, language models may be unable to properly process visual information queries associated with one or more images.
  • Additionally, understanding the world at large can be difficult. Whether an individual is trying to understand what the object in front of them is, trying to determine where else the object can be found, and/or trying to determine where an image on the internet was captured, text searching alone can be difficult. In particular, users may struggle to determine which words to use. Additionally, the words may not be descriptive enough and/or abundant enough to generate desired results.
  • One example aspect of the present disclosure is directed to a computing system for visual information seeking.
  • the system can include one or more processors and one or more non-transitory computer-readable media that collectively store instructions that, when executed by the one or more processors, cause the computing system to perform operations.
  • the operations can include obtaining input data.
  • the input data can include image data and text data.
  • the text data can include a query associated with the image data.
  • the operations can include processing the input data with a machine-learned model to generate first planning data.
  • the first planning data can be descriptive of instructions to provide the input data to a first data processing tool.
  • the operations can include transmitting, based on the first planning data, the input data to the first data processing tool to retrieve first output data.
  • the operations can include processing the input data and the first output data with the machine-learned model to generate second planning data.
  • the second planning data can be descriptive of instructions to transmit data to a second data processing tool.
  • the operations can include transmitting, based on the second planning data, data to the second data processing tool to retrieve second output data and processing the input data and the second output data with the machine-learned model to generate response data.
  • the response data can be descriptive of a response to the query.
  • the first data processing tool can include an object detection model.
  • the first output data can include one or more bounding boxes associated with one or more objects in the image data.
  • the first output data can include one or more segmented portions of one or more images of the image data and caption data associated with the one or more segmented portions.
  • the caption data can be descriptive of an object classification associated with one or more objects detected in the one or more segmented portions of one or more images.
  • the second data processing tool can include a search engine.
  • the second planning data can include a model-generated query.
  • the model-generated query can be transmitted to the second data processing tool to retrieve the second output data.
  • the model-generated query can be generated based on the input data and the first output data.
  • the model-generated query can be descriptive of the query of the input data modified based on the first output data.
  • the response data can include a natural language text string that is responsive to the query of the input data.
  • the operations can include processing the input data and the second output data with the machine-learned model to generate third planning data.
  • the third planning data can be descriptive of instructions to transmit data to a third data processing tool.
  • the operations can include transmitting, based on the third planning data, data to the third data processing tool to retrieve third output data and processing the input data and the third output data with the machine-learned model to generate the response data.
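  • The model-generated query described above (the query of the input data modified based on the first output data) can be illustrated with a minimal sketch. In the disclosure the machine-learned model produces this query; the string substitution below is only a stand-in for that step, and the function name, the detection record fields (`caption`, `score`), and the list of deictic phrases are assumptions made for illustration.

```python
def build_search_query(user_query: str, detections: list[dict]) -> str:
    """Stand-in for the model-generated query: rewrite the user's question
    using the caption of the highest-scoring detection (hypothetical fields)."""
    if not detections:
        # Nothing was detected, so the original query is passed through.
        return user_query
    best = max(detections, key=lambda d: d["score"])
    # Replace the first deictic phrase found with the detected caption.
    for phrase in ("this object", "this thing", "this"):
        if phrase in user_query:
            return user_query.replace(phrase, "the " + best["caption"], 1)
    # No deictic phrase: append the caption as extra context instead.
    return f"{user_query} ({best['caption']})"


detections = [
    {"caption": "cat", "score": 0.41},
    {"caption": "electric kettle", "score": 0.93},
]
print(build_search_query("When was this object invented?", detections))
```

  • The rewritten text is what would be transmitted to the second data processing tool (e.g., a search engine) in place of the original, image-dependent question.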
  • Another example aspect of the present disclosure is directed to a computer-implemented method for responding to visual prompts.
  • the method can include obtaining, by a computing system including one or more processors, input data.
  • the input data can include image data and text data.
  • the text data can include a query associated with the image data.
  • the method can include processing, by the computing system, the input data with a machine-learned model to generate first planning data.
  • the first planning data can be descriptive of instructions to provide the image data to a first data processing tool.
  • the method can include providing, by the computing system, the image data to the first data processing tool to receive first output data.
  • the first output data can be generated with the first data processing tool based on the image data.
  • the method can include processing, by the computing system, the first output data with the machine-learned model to generate second planning data.
  • the second planning data can be descriptive of instructions to provide a particular portion of the image data to a second data processing tool.
  • the method can include providing, by the computing system, the particular portion of the image data to the second data processing tool to receive second output data.
  • the second output data can be generated with the second data processing tool based on the particular portion of the image data.
  • the method can include processing, by the computing system, the text data and the second output data with the machine-learned model to generate response data.
  • the response data can be descriptive of a response to the query.
  • the machine-learned model may have been conditioned on a training dataset including a plurality of training examples. Each training example of the plurality of training examples can include a training input, a training output, and a training rationale.
  • the training rationale can be descriptive of a sequence of processing instances and tool calls for determining the output data.
  • the machine-learned model may have been conditioned on a training dataset including sequence data.
  • sequence data can be descriptive of a sequence of actions for generating a training response in response to obtaining a particular type of input.
  • the machine-learned model may have been conditioned on human input data descriptive of actions a human selected as being particular actions to perform when a particular input type is received.
  • the human input data may have been obtained via a user interface that provides a plurality of selectable options for a user.
  • the plurality of selectable options can include a plurality of external tools to call and a final output option.
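  • One way to picture a training example with a training input, a training output, and a training rationale is the sketch below. The class names, field names, option names, and the prompt layout are assumptions for illustration; the disclosure does not specify a storage or prompt format.

```python
from dataclasses import dataclass, field

# Hypothetical set of selectable options: external tools plus a final output option.
SELECTABLE_OPTIONS = {"object_detection", "image_search", "web_search", "final_output"}


@dataclass
class RationaleStep:
    action: str       # one of SELECTABLE_OPTIONS
    observation: str  # what the tool returned, or the conclusion reached


@dataclass
class TrainingExample:
    training_input: str
    training_output: str
    rationale: list[RationaleStep] = field(default_factory=list)


def to_prompt(example: TrainingExample) -> str:
    """Render one example as few-shot prompt text, one rationale step per line."""
    lines = [f"Input: {example.training_input}"]
    for i, step in enumerate(example.rationale, start=1):
        if step.action not in SELECTABLE_OPTIONS:
            raise ValueError(f"unknown action: {step.action}")
        lines.append(f"Step {i}: {step.action} -> {step.observation}")
    lines.append(f"Output: {example.training_output}")
    return "\n".join(lines)


example = TrainingExample(
    training_input="What type of bird is this? [image]",
    training_output="toy answer",
    rationale=[
        RationaleStep("object_detection", "one bird segment found"),
        RationaleStep("image_search", "toy matching page titles"),
        RationaleStep("final_output", "toy answer"),
    ],
)
print(to_prompt(example))
```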
  • the operations can include obtaining input data.
  • the input data can include image data and text data.
  • the text data can include a query associated with the image data.
  • the operations can include processing the input data with a machine-learned model to generate first planning data.
  • the first planning data can be descriptive of instructions to provide the image data to a first data processing tool.
  • the operations can include providing the image data to the first data processing tool and receiving first output data from the first data processing tool based on the image data.
  • the operations can include processing the first output data with the machine-learned model to generate second planning data.
  • the second planning data can be descriptive of instructions to provide a particular portion of the image data to a second data processing tool.
  • the operations can include providing the particular portion of the image data to the second data processing tool and receiving second output data from the second data processing tool based on the particular portion of the image data.
  • the operations can include processing the text data and the second output data with the machine-learned model to generate response data.
  • the response data can be descriptive of a response to the query.
  • the query can be descriptive of a question associated with a particular object depicted in the image data.
  • the first data processing tool can detect a plurality of objects depicted in the image data, generate a plurality of bounding boxes associated with the plurality of objects, generate a plurality of image segments based on the plurality of bounding boxes, classify each of the plurality of objects in the plurality of image segments to generate a plurality of object classifications, and generate the first output data.
  • the first output data can include the plurality of image segments and the plurality of object classifications.
  • the second data processing tool can include an image search engine. The second data processing tool can process one or more of the plurality of image segments with the image search engine to determine one or more web resources associated with the one or more of the plurality of image segments.
  • the one or more of the plurality of image segments can be selected by the machine-learned model based on the input data and the plurality of object classifications.
  • the second data processing tool can generate the second output data based on the one or more web resources.
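  • The first-tool steps described above (detect objects, produce bounding boxes, cut image segments, classify each segment) and the later selection of segments can be sketched as follows. This is a toy illustration only: the image is a nested list of integers, the detector and classifier are hard-coded stubs, and the label-matching selection stands in for the choice the machine-learned model would make. All names are hypothetical.

```python
from dataclasses import dataclass

Image = list[list[int]]  # rows of pixel values; a toy stand-in for real image data


@dataclass
class Box:
    left: int
    top: int
    right: int   # exclusive
    bottom: int  # exclusive


def crop(image: Image, box: Box) -> Image:
    """Generate an image segment from a bounding box."""
    return [row[box.left:box.right] for row in image[box.top:box.bottom]]


def detect_and_classify(image, detector, classifier):
    """Return (image segments, object classifications) as the first output data."""
    boxes = detector(image)
    segments = [crop(image, b) for b in boxes]
    labels = [classifier(s) for s in segments]
    return segments, labels


def select_segments(query: str, segments, labels):
    """Stand-in for the model's selection: keep segments whose label occurs in the query."""
    return [s for s, label in zip(segments, labels) if label in query.lower()]


# Toy stubs (assumptions): two fixed boxes and a brightness-based classifier.
def toy_detector(image):
    return [Box(0, 0, 2, 2), Box(2, 2, 4, 4)]


def toy_classifier(segment):
    return "drum" if segment[0][0] > 5 else "flag"


image = [
    [1, 1, 0, 0],
    [1, 1, 0, 0],
    [0, 0, 9, 9],
    [0, 0, 9, 9],
]
segments, labels = detect_and_classify(image, toy_detector, toy_classifier)
chosen = select_segments("When was the drum first used?", segments, labels)
print(labels, chosen)
```

  • Only the chosen segments would then be provided to the second data processing tool (e.g., an image search engine), rather than the whole image.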
  • Another example aspect of the present disclosure is directed to a computing system.
  • the system can include one or more processors and one or more non-transitory computer-readable media that collectively store instructions that, when executed by the one or more processors, cause the computing system to perform operations.
  • the operations can include obtaining input data.
  • the input data can include image data and text data.
  • the text data can include a query associated with the image data.
  • the operations can include processing the input data with an artificial intelligence system to generate response data.
  • the response data can be descriptive of a response to the query.
  • the artificial intelligence system can include a language model planner block, a language model reasoner block, and a working memory.
  • the language model planner block may have been conditioned to determine one or more data processing tools to utilize for processing data associated with responding to the query.
  • the language model reasoner block may have been conditioned to process outputs of the one or more data processing tools to determine information associated with responding to the query.
  • the working memory can store acquired information obtained and generated with the artificial intelligence system.
  • the operations can include providing the response data as an output.
  • the artificial intelligence system can include one or more machine-learned models conditioned on user behavior data.
  • the user behavior data can be processed to generate a transition graph that is descriptive of a determined sequence of decisions made by users when performing a particular information seeking task.
  • the transition graph can be descriptive of a plurality of distinct states and can indicate a particular set of actions available at each state of the plurality of distinct states.
  • the user behavior data may have been utilized to condition the language model planner block and the language model reasoner block for particular context-based processing.
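  • A transition graph of the kind described above can be sketched as a mapping from each state to the set of actions users were observed to take in that state. The state names, action names, and session format below are hypothetical; the disclosure does not fix a representation.

```python
from collections import defaultdict


def build_transition_graph(sessions):
    """Collect, for each state, the actions observed in recorded user sessions.

    Each session is a list of (state, action) pairs (an assumed log format).
    """
    graph = defaultdict(set)
    for session in sessions:
        for state, action in session:
            graph[state].add(action)
    # Sort for a stable, readable result.
    return {state: sorted(actions) for state, actions in graph.items()}


def allowed_actions(graph, state):
    """Actions available at a state; empty when the state was never observed."""
    return graph.get(state, [])


sessions = [
    [("START", "select_object"), ("OBJECT_SELECTED", "image_search"), ("SEARCH_DONE", "final_answer")],
    [("START", "select_object"), ("OBJECT_SELECTED", "web_search"), ("SEARCH_DONE", "final_answer")],
]
graph = build_transition_graph(sessions)
print(allowed_actions(graph, "OBJECT_SELECTED"))
```

  • Restricting a planner to the actions listed for its current state is one plausible way such a graph could narrow the search space at each step.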
  • the one or more data processing tools can include at least one of a computer vision tool, a web search tool, or an image search tool.
  • the computer vision tool can include at least one of an object detection model, an optical character recognition model, an image captioning model, or a visual question-and-answer model.
  • the web search tool can retrieve open world knowledge from web resources.
  • the image search tool can identify relevant information from metadata associated with visually similar images.
  • the language model planner block may have been conditioned to generate a processing tool query to provide to the one or more data processing tools based on the input data.
  • Figure 1 depicts a block diagram of an example visual information determination system according to example embodiments of the present disclosure.
  • Figure 2 depicts a block diagram of an example visual information seeking system according to example embodiments of the present disclosure.
  • Figure 3 depicts a flow chart diagram of an example method to perform visual query processing according to example embodiments of the present disclosure.
  • Figure 4A depicts a block diagram of an example automated visual information seeking system according to example embodiments of the present disclosure.
  • Figure 4B depicts a block diagram of an example automated visual information seeking workflow system according to example embodiments of the present disclosure.
  • Figure 5 depicts a block diagram of an example API tool call system according to example embodiments of the present disclosure.
  • Figure 6 depicts an illustration of an example user data collection interface according to example embodiments of the present disclosure.
  • Figure 7 depicts a flow chart diagram of an example method to perform visual query response generation according to example embodiments of the present disclosure.
  • Figure 8 depicts a flow chart diagram of an example method to perform response generation according to example embodiments of the present disclosure.
  • Figure 9A depicts an illustration of an example visual information determination process according to example embodiments of the present disclosure.
  • Figure 9B depicts an illustration of an example visual query processing system according to example embodiments of the present disclosure.
  • Figure 10 depicts a flow chart diagram of an example method to perform artificial intelligence system processing according to example embodiments of the present disclosure.
  • Figure 11A depicts a block diagram of an example computing system that performs visual query processing according to example embodiments of the present disclosure.
  • Figure 11B depicts a block diagram of an example computing system that performs visual query processing according to example embodiments of the present disclosure.
  • Reference numerals that are repeated across plural figures are intended to identify the same features in various implementations.
  • DETAILED DESCRIPTION
  • Generally, the present disclosure is directed to systems and methods for visual information query processing and response generation.
  • the systems and methods disclosed herein can leverage one or more machine-learned language models (e.g., a tuned and/or conditioned large language model) for planning and reasoning, which can include application programming interface tool calls.
  • input data including text data and image data can be obtained.
  • the text data can be descriptive of a query associated with the image data (e.g., “When was this object invented?”).
  • the input data can be processed with a machine-learned model to generate first planning data.
  • the first planning data can include an application programming interface call to provide a first set of data (e.g., one or more images of the image data) to a first data processing tool (e.g., an object detection and captioning model).
  • First output data (e.g., one or more segmented image patches with captions and/or classifications) can then be obtained from the first data processing tool based on the transmission of the first set of data.
  • the first output data can then be processed with the machine-learned model to determine whether a response can be generated and/or to generate second planning data for calling another data processing tool.
  • the systems and methods can iteratively generate application programming interface calls and output data processing until a response (e.g., a response to the query) is generated.
  • the task can present a combinatorial search space that demands a sequence of actions, including invoking APIs, analyzing their responses, and making informed decisions.
  • Existing language models and/or search engines alone may struggle with the task as language understanding and web resource identification separately may not provide an adequate response.
  • the systems and methods disclosed herein can leverage a machine-learned language model and one or more data processing tools to perform visual information seeking.
  • the systems and methods can utilize a machine-learned language model for planning and reasoning.
  • the machine-learned language model can process input data and/or output data from a data processing tool to determine an action (e.g., a next action) for obtaining relevant information for responding to the input data.
  • the machine-learned language model can determine application programming interface calls to request information from one or more data processing tools.
  • the machine-learned language model can generate planning data that includes the API call and may include a model-generated query to be provided to the one or more data processing tools.
  • the machine-learned language model can process the outputs from the one or more data processing tools to determine the relevant information from the outputs.
  • the machine-learned language model can then determine whether further data processing tools are to be called before generating the response data to provide to the user.
  • the planning and reasoning processing can be performed iteratively until the machine-learned language model determines a response can be generated and provided.
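  • The iterative planning and reasoning described above can be sketched as a loop that alternates between a planner, a tool call, and a reasoner until the planner decides a response can be given. In this sketch the planner and reasoner are plain functions standing in for the machine-learned language model, the tools are stubs, and the plan dictionary keys (`tool`, `argument`, `answer`) and the `max_steps` cap are assumptions.

```python
def seek(query, image, planner, reasoner, tools, max_steps=5):
    """Alternate planning, tool calls, and reasoning until an answer is produced."""
    memory = []  # working memory of (tool name, extracted information)
    for _ in range(max_steps):
        plan = planner(query, image, memory)
        if "answer" in plan:
            return plan["answer"]
        raw_output = tools[plan["tool"]](plan["argument"])
        # The reasoner keeps only what is relevant to the query.
        memory.append((plan["tool"], reasoner(query, raw_output)))
    return None  # no answer within the step budget


# Toy stand-ins (assumptions) for the model and the tools.
def toy_planner(query, image, memory):
    if not memory:
        return {"tool": "detect", "argument": image}
    if len(memory) == 1:
        return {"tool": "search", "argument": f"{memory[0][1]} {query}"}
    return {"answer": memory[-1][1]}


def toy_reasoner(query, raw_output):
    # Keep the first detection from a list; pass text through unchanged.
    return raw_output[0] if isinstance(raw_output, list) else raw_output


toy_tools = {
    "detect": lambda image: ["kettle", "table"],
    "search": lambda text: "toy search result for: " + text,
}

answer = seek("when was this invented?", "image-bytes", toy_planner, toy_reasoner, toy_tools)
print(answer)
```

  • The step cap is a design choice of this sketch, not something stated in the disclosure; it simply guarantees the loop terminates if the planner never emits an answer.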
  • the systems and methods of the present disclosure provide a number of technical effects and benefits.
  • the systems and methods can be utilized to leverage a machine-learned language model for tool processing planning and reasoning, which can enable the system to accurately and efficiently respond to visual information queries.
  • a language model can be conditioned to iteratively process data to determine when and/or how to utilize one or more data processing tools (e.g., an object detection tool, an image captioning tool, a web search tool, an image search tool, etc.).
  • the language model can be conditioned to process the outputs of the data processing tools to extract relevant information that can then be utilized to determine another API call and/or to generate the final response.
  • a user interface can be utilized to collect user behavior data that can be utilized to condition the language model based on the actions performed by a set of users.
  • Another example technical effect and benefit relates to improved computational efficiency and improvements in the functioning of a computing system.
  • a technical benefit of the systems and methods of the present disclosure is the ability to reduce the computational resources needed for machine-learned model visual information seeking by reducing the instances of useless tool calls.
  • the language model may process data and generate planning data one state at a time in order to mitigate the instances of a tool being utilized in a useless manner.
  • FIG. 1 depicts a block diagram of an example visual information determination system 10 according to example embodiments of the present disclosure.
  • the visual information determination system 10 is configured to receive, and/or obtain, a set of input data 12 descriptive of a prompt associated with requesting information associated with one or more images and, as a result of receipt of the input data 12, generate, determine, and/or provide response data 20 that is descriptive of a response to the prompt.
  • the visual information determination system 10 can include a machine-learned model 14 that is operable to plan data processing tool 18 calls and reason whether further calls are to be performed before generating a response.
  • the visual information determination system 10 can obtain input data 12.
  • the input data 12 can include text data and image data descriptive of a prompt.
  • the prompt may be associated with a request to receive information associated with one or more details in one or more images of the image data (e.g., “What is the origin of this building?”).
  • the input data 12 can be processed with a machine-learned model 14 (e.g., a large language model) to generate planning data 16.
  • the planning data 16 can be descriptive of an action to perform.
  • the planning data 16 can include an application programming interface call to transmit data to a data processing tool 18.
  • the planning data 16 can include a model-generated dataset that may be generated to provide the data processing tool 18 with a particular set of data to obtain information.
  • the data processing tool 18 may include an object detection model, an image classification model, an image captioning model, a segmentation model, an object classification model, a computer vision model, an optical character recognition model, an augmentation model, a generative model, a visual question answering model, a web search engine, an image search engine, and/or another tool.
  • the data processing tool 18 may be separate from the machine-learned model 14 that generated the planning data 16.
  • the output of the data processing tool 18 may be obtained and processed with the machine-learned model to determine (or extract) the relevant information from the output.
  • the output and the input data 12 can be processed with the machine-learned model 14 to determine whether another data processing tool call is to be performed.
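  • As a rough illustration of planning data 16 that includes an application programming interface call, the sketch below parses a model's text output into a structured call and checks it against a set of known tools before dispatch. The disclosure does not specify a wire format; the JSON shape, the tool names, and the `PlanningData` class are all hypothetical.

```python
import json
from dataclasses import dataclass

# Hypothetical tool identifiers; the real set depends on the deployment.
KNOWN_TOOLS = {"object_detection", "image_search", "web_search"}


@dataclass
class PlanningData:
    tool: str      # which data processing tool to call
    payload: dict  # model-generated dataset to send to that tool


def parse_planning_data(model_text: str) -> PlanningData:
    """Parse model output (assumed to be JSON) into a validated tool call."""
    data = json.loads(model_text)
    tool = data["tool"]
    if tool not in KNOWN_TOOLS:
        # Reject calls to tools that do not exist instead of dispatching them.
        raise ValueError(f"unknown tool: {tool}")
    return PlanningData(tool=tool, payload=data.get("payload", {}))


plan = parse_planning_data('{"tool": "web_search", "payload": {"query": "origin of this building"}}')
print(plan.tool, plan.payload["query"])
```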
  • FIG. 2 depicts a block diagram of an example visual information seeking system 200 according to example embodiments of the present disclosure.
  • the visual information seeking system 200 is similar to the visual information determination system 10 of Figure 1 except that the visual information seeking system 200 further includes a first data processing tool 208 and a second data processing tool 214.
  • the visual information seeking system 200 can utilize any number of different data processing tools to perform visual information seeking.
  • input data 202 can be obtained from a user computing system.
  • the input data 202 can be descriptive of a query associated with one or more features in one or more images of the input data 202.
  • the input data 202 may be obtained via one or more user interfaces.
  • the input data 202 can be processed with a machine-learned model 204 to generate first planning data 206.
  • the machine-learned model 204 can include an LLM-powered planner block, an LLM-powered reasoner block, and an active memory.
  • the LLM-powered planner block can determine when, what, and how to utilize one or more data processing tools.
  • the LLM-powered reasoner block can extract relevant information from the outputs of the data processing tools. Additionally and/or alternatively, the LLM-powered planner block can determine when enough information is obtained to respond to the query.
  • the active memory can continually obtain and store the data obtained and/or generated throughout the visual information seeking process.
  • the first planning data 206 can include an application programming interface call to transmit a first set of data to a first data processing tool 208.
  • the first set of data can include a portion of the input data 202 and/or a model-generated dataset.
  • the first data processing tool 208 can process the first set of data to generate first output data 210.
  • the first output data 210 can be obtained and then processed with the machine-learned model 204 to generate second planning data 212.
  • the second planning data 212 can include an application programming interface call to transmit a second set of data to a second data processing tool 214.
  • the second set of data can include a portion of the input data 202, a portion of the first output data 210, and/or a model-generated dataset.
  • the second data processing tool 214 can process the second set of data to generate second output data 216.
  • the first data processing tool 208 and the second data processing tool 214 may differ.
  • the first data processing tool 208 may include an image segmentation model and an object classification model, and the second data processing tool 214 may include one or more search engines.
  • the second output data 216 can be obtained and processed with the machine-learned model 204.
  • the input data 202, the first output data 210, and/or the second output data 216 can then be utilized to generate response data 218 descriptive of a response to the query of the input data 202.
  • Figure 3 depicts a flow chart diagram of an example method to perform visual query processing according to example embodiments of the present disclosure. Although Figure 3 depicts steps performed in a particular order for purposes of illustration and discussion, the methods of the present disclosure are not limited to the particularly illustrated order or arrangement.
  • a computing system can obtain input data.
  • the input data can include image data and text data.
  • the text data can include a query associated with the image data.
  • the image data can include one or more objects. Additionally and/or alternatively, the text data can be descriptive of one or more questions associated with object details for one or more objects depicted in one or more images of the image data (e.g., “What year was this building built?”, “What type of bird is this?”, and/or “How do you make this thing?”).
  • the computing system can process the input data with a machine-learned model to generate first planning data.
  • the first planning data can be descriptive of instructions to provide the input data to a first data processing tool.
  • the first data processing tool can include an object detection model.
  • the machine-learned model can include an autoregressive language model.
  • the machine-learned model may be conditioned (e.g., parameter tuned and/or few-shot example conditioned) for visual information seeking based planning and/or visual information seeking based reasoning.
  • the machine-learned model may be conditioned to determine when and/or what application programming interface calls are to be performed for different visual information seeking tasks.
  • the machine-learned model may be conditioned to process the received outputs from the application programming interface calls to determine when and/or what relevant information was retrieved.
  • the machine-learned model may be conditioned to iteratively determine API calls and process API outputs until a determined end output is received. The end output may then be processed to generate a response.
  • the computing system can transmit, based on the first planning data, the input data to the first data processing tool to retrieve first output data.
  • the first output data can include one or more bounding boxes associated with one or more objects in the image data.
  • the first output data can include one or more segmented portions of one or more images of the image data and caption data associated with the one or more segmented portions.
  • the caption data can be descriptive of an object classification associated with one or more objects detected in the one or more segmented portions of one or more images.
  • the computing system can process the input data and the first output data with the machine-learned model to generate second planning data.
  • the second planning data can be descriptive of instructions to transmit data to a second data processing tool.
  • the second data processing tool can include a search engine.
  • the second planning data can include a model-generated query.
  • the model-generated query can be transmitted to the second data processing tool to retrieve the second output data.
  • the model-generated query can be generated based on the input data and the first output data.
  • the model-generated query can be descriptive of the query of the input data modified based on the first output data.
  • the computing system can transmit, based on the second planning data, data to the second data processing tool to retrieve second output data.
  • the first planning data and/or the second planning data may include a model-generated query that may be provided to and/or processed with the respective data processing model associated with the planning data.
  • the second data processing tool may receive data via an application programming interface that was instructed to transmit the data based on the second planning data.
  • the computing system can process the input data and the second output data with the machine-learned model to generate response data.
  • the response data can be descriptive of a response to the query.
  • the response data can include a natural language text string that is responsive to the query of the input data.
  • the computing system can process the input data and the second output data with the machine-learned model to generate third planning data.
  • the third planning data can be descriptive of instructions to transmit data to a third data processing tool.
  • the computing system can transmit, based on the third planning data, data to the third data processing tool to retrieve third output data and can process the input data and the third output data with the machine-learned model to generate the response data.
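  • The plan, transmit, and respond sequence described above can be sketched as a short loop. This is a minimal illustration only: `PlanningData`, `answer_query`, the toy planner, and the toy tools are hypothetical names introduced here, and plain Python functions stand in for the machine-learned model and the data processing tools.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional


@dataclass
class PlanningData:
    """Hypothetical container for one planning step."""
    tool_name: Optional[str]  # None means "no further tool call is needed"
    payload: str              # data (e.g., a model-generated query) sent to the tool


def answer_query(
    input_data: str,
    plan: Callable[[str, List[str]], PlanningData],
    respond: Callable[[str, List[str]], str],
    tools: Dict[str, Callable[[str], str]],
    max_steps: int = 3,
) -> str:
    """Alternate between planning and tool calls, then generate response data."""
    outputs: List[str] = []
    for _ in range(max_steps):
        planning = plan(input_data, outputs)   # first, second, third planning data
        if planning.tool_name is None:
            break
        tool = tools[planning.tool_name]
        outputs.append(tool(planning.payload))  # first, second, third output data
    return respond(input_data, outputs)


# Toy stand-ins (assumptions, not the disclosed models or tools).
def toy_plan(input_data: str, outputs: List[str]) -> PlanningData:
    if not outputs:
        return PlanningData("detect_and_caption", input_data)
    if len(outputs) == 1:
        return PlanningData("web_search", f"{input_data} | {outputs[0]}")
    return PlanningData(None, "")


toy_tools = {
    "detect_and_caption": lambda payload: "caption: drum",
    "web_search": lambda payload: f"results for [{payload}]",
}


def toy_respond(input_data: str, outputs: List[str]) -> str:
    return outputs[-1] if outputs else "no answer"


result = answer_query("what is this?", toy_plan, toy_respond, toy_tools)
```

  • In this sketch the second planning step folds the first output data into the payload, mirroring how the model-generated query may be the input query modified based on the first output data.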
  • Figure 4A depicts a block diagram of an example automated visual information seeking system 400 according to example embodiments of the present disclosure.
  • the automated visual information seeking system 400 can obtain an image (e.g., an image depicting a parade ceremony) and a question 402 associated with the image (e.g., “When was the drum first used for this event?”).
  • the question 402 can be processed with a plan with a large language model block 404, which may decompose the question and determine the image is to be transmitted to an object detection and captioning tool 406A. Additionally, the plan with a large language model block 404 may generate a visual question (e.g., “In the image, what is the drum and event?”). The tool call and the visual question may be part of a first planning dataset.
  • the object detection and captioning tool 406A can process the image to generate a list of objects 408A that includes one or more segmented portions of the image with determined captions for each of the respective segmented portions.
  • the segmented portions, respective captions, and the visual question may be processed with the plan with a large language model block 404 to generate second planning data.
  • the second planning data can select a particular segmented portion and respective caption.
  • the selected segmented portion of the image can then be transmitted to an image search tool 406B based on the second planning data.
  • the image search tool 406B can process the selected segmented portion to identify similar images with alt-texts 408B (e.g., web descriptions associated with similar images to the segmented portion).
  • the similar images with alt-texts 408B can then be processed with a reason with the large language model block 410 to generate an answer to the visual question.
  • the machine-learned model can then determine if the visual question is answered 412 and whether or not other segmented portions are to be processed with the image search tool 406B and/or another tool.
  • the visual answer can then be processed with the plan with a large language model block 404 to generate third planning data.
  • the third planning data can include a generated search query that includes the information from the visual answer (e.g., “When was Taiko first used for Aoi Festival?”) and an application programming interface call to transmit the generated search query to a web search tool 406C.
  • the web search tool 406C can process the search query to determine search results including searched web pages 408C.
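  • The generated search query in the example above replaces the vague references in the question with the entities found in the visual answer. The sketch below illustrates only that rewriting step with plain string substitution. In the described system a large language model would produce the query, so `rewrite_query` and the phrase-to-entity mapping are hypothetical stand-ins.

```python
from typing import Dict


def rewrite_query(question: str, visual_answer: Dict[str, str]) -> str:
    """Replace each vague phrase in the question with its resolved entity."""
    rewritten = question
    for phrase, entity in visual_answer.items():
        rewritten = rewritten.replace(phrase, entity)
    return rewritten


question = "When was the drum first used for this event?"
# Assumed form of the visual answer: vague phrase -> identified entity.
visual_answer = {"the drum": "Taiko", "this event": "Aoi Festival"}
search_query = rewrite_query(question, visual_answer)
```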
  • Figure 4B depicts a block diagram of an example automated visual information seeking workflow system 450 according to example embodiments of the present disclosure.
  • the automated visual information seeking workflow system 450 can obtain a multimodal query 452 (e.g., an image and a question) and generate a response 468 (e.g., an answer generated based on multi-tool usage) based on a plurality of processing instances with a plurality of different tools.
  • a multimodal query 452 can be obtained from a user computing system.
  • the multimodal query 452 can include an image and a text string descriptive of a question about one or more details from the image.
  • the multimodal query 452 may be processed with the planner model to determine one or more processing actions (e.g., using one or more processing tools).
  • An image search may first be performed with a search engine to determine an image search result 454 (e.g., a similar image and related metadata (e.g., a caption, entity tags, and/or location information)).
  • the image search may include embedding the image of the multimodal query 452 then performing a nearest neighbor embedding determination. Alternatively and/or additionally, the image search may include image matching.
  • the image search may be determined to be uninformative, which may lead to backtracking.
  • the planner model may process the image search result 454 and the multimodal query 452 to determine a next action (e.g., a different processing tool).
  • a particular object 456 may be selected based on one or more factors, which may include foreground determination, focus determination, and/or selection based on the content of the text string.
  • the particular object 456 can be detected with a detection model then segmented with a segmentation model.
  • the object selection may be determined to be uninformative, which may lead to backtracking.
  • the planner model may process the particular object 456, the image search result 454, and the multimodal query 452 to determine a next action (e.g., a different processing tool).
  • a second particular object 458 may be selected based on one or more factors, which may include foreground determination, focus determination, and/or selection based on the content of the text string.
  • the second particular object 458 can be detected with a detection model then segmented with a segmentation model.
  • the second particular object 458 may be determined to be informative by the reasoner model and/or the planner model.
  • the planner model can then determine a next action based on the second particular object 458 and the multimodal query 452.
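  • An image search (e.g., with a search engine) may then be performed based on the second particular object 458 to determine a respective search result set 460.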
  • the respective search result set 460 can include images, knowledge graph data, web resource data, captions, location data, and/or text content data.
  • the respective search result set 460 may be determined to be informative by the reasoner model and/or the planner model.
  • the planner model can then determine a next action based on the respective search result set 460, the second particular object 458, and the multimodal query 452.
  • the respective search result set 460 and the multimodal query 452 can be processed with a generative language model (e.g., the reasoner model and/or planner model) to perform visual question answering to answer a reasoner model generated question 462.
  • the answer to the reasoner model generated question 462, the respective search result set 460, and the multimodal query 452 can be processed with a generative language model (e.g., the reasoner model and/or planner model) to generate a first predicted answer 464.
  • the reasoner model and/or the planner model may determine the first predicted answer 464 is incorrect and/or uninformative.
  • a search (e.g., with a search engine) may then be performed based on the multimodal query 452 and/or the answer to the reasoner model generated question 462 to determine web search results 466.
  • the web search results 466 and the multimodal query 452 may then be processed with a generative model (e.g., the reasoner model and/or the planner model) to generate a second predicted answer.
  • the second predicted answer may be determined to be accurate and may be provided as the response 468.
  • the automated visual information seeking workflow system 450 can leverage a planner model and/or a reasoner model to determine how to process the multimodal query 452 and successive data instances.
  • the process can include performing a plurality of subtasks, which may include utilizing a plurality of different processing tools based on determinations performed by the planner model and/or the reasoner model.
  • the automated visual information seeking workflow system 450 may generate a plurality of model-generated queries based on the initial multimodal query 452 in order to respond to the multimodal query 452.
  • the plurality of model-generated queries may include multimodal queries, which may include all and/or portions of the original image.
  • the plurality of model-generated queries may include outputs from the one or more processing tools as different tasks are being performed.
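  • The image search described for the workflow system 450 may embed the query image and then perform a nearest neighbor embedding determination. The following is a minimal sketch of that lookup using cosine similarity over a small in-memory index. The index contents, the three-dimensional embeddings, and the function names are illustrative assumptions, and a deployed system would likely use a learned image encoder and an approximate nearest neighbor index.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def nearest_neighbors(
    query_embedding: Sequence[float],
    index: Dict[str, Sequence[float]],
    k: int = 1,
) -> List[Tuple[str, float]]:
    """Return the k indexed items most similar to the query embedding."""
    scored = [
        (name, cosine_similarity(query_embedding, embedding))
        for name, embedding in index.items()
    ]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:k]


# Toy index of precomputed image embeddings (made-up values).
toy_index = {
    "taiko_drum.jpg": [0.9, 0.1, 0.0],
    "city_skyline.jpg": [0.0, 0.2, 0.9],
    "festival_parade.jpg": [0.6, 0.7, 0.1],
}
top = nearest_neighbors([1.0, 0.2, 0.0], toy_index, k=2)
```

  • The metadata stored alongside the returned neighbors (e.g., captions or entity tags) is what the planner model and/or the reasoner model would then inspect to decide whether the search was informative.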
  • Figure 5 depicts a block diagram of an example API tool call system 500 according to example embodiments of the present disclosure.
  • API tool call system 500 of the systems and methods disclosed herein can include a machine-learned model being leveraged to determine when, what, and how to utilize a plurality of data processing tools 504 to generate a response to an input prompt.
  • the API tool call system 500 can start 502 with an input prompt that may include text data, image data, audio data, latent encoding data, multimodal data, statistical data, and/or other data and may finish 506 with a response to the input prompt.
  • Generating a response to the input prompt can involve a machine-learned model iteratively obtaining and processing data to determine application programming interface calls that can be performed to utilize a plurality of data processing tools 504 for responding to the input prompt.
  • the plurality of data processing tools 504 can include a captioning tool, a select object tool, a visual question answering tool, an image search tool, a web search tool, a large language model short question-and-answer tool, and/or other tools.
  • the plurality of data processing tools 504 may process the input prompt, model-generated data, and/or the outputs of other tools.
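  • One way to picture the API tool call system 500 is a registry that maps tool names to callables, with the model emitting a small API-call record that is dispatched to the named tool. The sketch below assumes a dictionary-shaped call (`{"tool": ..., "input": ...}`) and two placeholder tools. These shapes and names are hypothetical and are not taken from the disclosure.

```python
from typing import Callable, Dict

ToolFn = Callable[[str], str]
TOOL_REGISTRY: Dict[str, ToolFn] = {}


def register_tool(name: str) -> Callable[[ToolFn], ToolFn]:
    """Decorator that adds a function to the registry under the given name."""
    def decorator(fn: ToolFn) -> ToolFn:
        TOOL_REGISTRY[name] = fn
        return fn
    return decorator


@register_tool("caption")
def caption_tool(payload: str) -> str:
    # Placeholder for a captioning tool.
    return f"caption for {payload}"


@register_tool("web_search")
def web_search_tool(payload: str) -> str:
    # Placeholder for a web search tool.
    return f"web results for {payload}"


def call_tool(api_call: Dict[str, str]) -> str:
    """Dispatch a model-generated API call to the named tool."""
    name = api_call["tool"]
    if name not in TOOL_REGISTRY:
        raise KeyError(f"unknown tool: {name}")
    return TOOL_REGISTRY[name](api_call["input"])


caption_output = call_tool({"tool": "caption", "input": "image_1"})
# A tool may also process the output of another tool.
search_output = call_tool({"tool": "web_search", "input": caption_output})
```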
  • Figure 6 depicts an illustration of an example user data collection interface 600 according to example embodiments of the present disclosure.
  • the user data collection interface 600 can include a plurality of user interface elements that can be interacted with to collect user data to generate few shot examples to condition a machine-learned model for visual information seeking planning and reasoning.
  • the user data collection interface 600 may display an input prompt 602 including an example input image and an example input question.
  • a plurality of image segments 604 of the input example image can then be provided to a user to be selected for obtaining selected image segment data.
  • a tools interface 606 can then be displayed to allow a user to select which data processing tools they would utilize to answer the question including what data they would provide to the data processing tool.
  • the output(s) 610 of a selected data processing tool can be obtained and provided to a user for display.
  • the user may then be provided with a plurality of output reasoning interface elements 608 for display.
  • the user can then select whether the output(s) 610 are useful, whether the end answer has been determined, whether an additional application programming interface call is to be performed, and/or if the answer cannot be found.
  • the user data collection interface 600 may be provided to a plurality of different users for a plurality of different visual information seeking tasks. The collected user behavior data can then be utilized to generate a transition graph that may be utilized by a pre-trained machine-learned language model for planning and reasoning.
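  • As a rough illustration of how collected user behavior data could be turned into a transition graph, the sketch below records which action users took after each action across several logged sessions. The trace format, the action names, and the `START` and `ANSWER` markers are assumptions made for this example.

```python
from collections import defaultdict
from typing import Dict, List, Set


def build_transition_graph(traces: List[List[str]]) -> Dict[str, Set[str]]:
    """Map each observed state to the set of actions users took next."""
    graph: Dict[str, Set[str]] = defaultdict(set)
    for trace in traces:
        for current, following in zip(trace, trace[1:]):
            graph[current].add(following)
    return dict(graph)


# Hypothetical logged sessions, one list of actions per user session.
traces = [
    ["START", "select_object", "image_search", "web_search", "ANSWER"],
    ["START", "caption", "web_search", "ANSWER"],
    ["START", "select_object", "image_search", "select_object",
     "image_search", "ANSWER"],
]
graph = build_transition_graph(traces)
```

  • The resulting mapping lists, for each state, the actions that at least one user chose, which is one possible way to restrict the actions a pre-trained language model may consider at that state.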
  • Figure 7 depicts a flow chart diagram of an example method to perform according to example embodiments of the present disclosure. Although Figure 7 depicts steps performed in a particular order for purposes of illustration and discussion, the methods of the present disclosure are not limited to the particularly illustrated order or arrangement.
  • a computing system can obtain input data.
  • the input data can include image data and text data.
  • the text data can include a query associated with the image data.
  • the input data may include text data, image data, audio data, video data, latent encoding data, statistical data, multimodal data, and/or other data.
  • the image data may be descriptive of one or more images captured with a user computing device. Alternatively and/or additionally, the image data may be descriptive of one or more images obtained from one or more web resources.
  • the computing system can process the input data with a machine-learned model to generate first planning data.
  • the first planning data can be descriptive of instructions to provide the image data to a first data processing tool.
  • the first planning data may include a model-generated text string to be processed with the image data.
  • the machine-learned model may have been conditioned on a training dataset including a plurality of training examples. Each training example of the plurality of training examples can include a training input, a training output, and a training rationale.
  • the training rationale can be descriptive of a sequence of processing instances and tool calls for determining the output data.
  • the machine-learned model may have been conditioned on a training dataset including sequence data.
  • the sequence data can be descriptive of a sequence of actions for generating a training response in response to obtaining a particular type of input.
  • the machine-learned model may have been conditioned on human input data descriptive of actions a human selected as being particular actions to perform when a particular input type is received.
  • the human input data may have been obtained via a user interface that provides a plurality of selectable options for a user.
  • the plurality of selectable options can include a plurality of external tools to call and a final output option.
  • the computing system can provide the image data to the first data processing tool to receive first output data.
  • the first output data can be generated with the first data processing tool based on the image data.
  • model-generated data and/or the text data may be provided to the first data processing tool to generate the first output data.
  • the first data processing tool may include an image processing tool, a text processing tool, a search engine, an augmentation model, and/or another data processing tool.
  • the first output data can include text data, audio data, image data, latent encoding data, multimodal data, and/or other data.
  • the computing system can process the first output data with the machine-learned model to generate second planning data.
  • the second planning data can be descriptive of instructions to provide a particular portion of the image data to a second data processing tool.
  • the second planning data can include model-generated data to provide to the second data processing tool.
  • the model-generated data can include text data, image data, audio data, latent encoding data, multimodal data, and/or other data.
  • the second data processing tool can include an image processing model, a search engine, a segmentation model, an augmentation model, and/or another data processing tool.
  • the computing system can provide the particular portion of the image data to the second data processing tool to receive second output data.
  • the second output data can be generated with the second data processing tool based on the particular portion of the image data.
  • the second data processing tool may perform a reverse image search to obtain data associated with an object depicted in the particular portion of the image data.
  • the computing system can process the text data and the second output data with the machine-learned model to generate response data.
  • the response data can be descriptive of a response to the query.
  • the response data can include text data, image data, audio data, latent encoding data, augmented reality data, virtual reality data, multimodal data, and/or other data.
  • the response data may include a natural language text string descriptive of the response.
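  • The training examples described above each include a training input, a training output, and a training rationale. One plausible way to use such examples for few-shot conditioning is to format them into a prompt that ends where the model should continue. The prompt layout, the field names, and the example text below are assumptions for illustration and are not specified by the disclosure.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class TrainingExample:
    training_input: str
    training_rationale: List[str]  # sequence of processing instances and tool calls
    training_output: str


def format_few_shot_prompt(examples: List[TrainingExample], new_input: str) -> str:
    """Concatenate worked examples, then leave the rationale open for the new input."""
    blocks = []
    for example in examples:
        steps = "\n".join(
            f"  {number}. {step}"
            for number, step in enumerate(example.training_rationale, start=1)
        )
        blocks.append(
            f"Input: {example.training_input}\n"
            f"Rationale:\n{steps}\n"
            f"Output: {example.training_output}"
        )
    blocks.append(f"Input: {new_input}\nRationale:")
    return "\n\n".join(blocks)


examples = [
    TrainingExample(
        training_input="[image] What is this object called?",
        training_rationale=["call select_object", "call image_search"],
        training_output="toy answer",
    )
]
prompt = format_few_shot_prompt(examples, "[image] Where was this taken?")
```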
  • Figure 8 depicts a flow chart diagram of an example method to perform according to example embodiments of the present disclosure. Although Figure 8 depicts steps performed in a particular order for purposes of illustration and discussion, the methods of the present disclosure are not limited to the particularly illustrated order or arrangement. The various steps of the method 800 can be omitted, rearranged, combined, and/or adapted in various ways without deviating from the scope of the present disclosure.
  • a computing system can obtain input data.
  • the input data can include image data and text data.
  • the text data can include a query associated with the image data.
  • the query can be descriptive of a question associated with a particular object depicted in the image data.
  • the computing system can process the input data with a machine-learned model to generate first planning data.
  • the first planning data can be descriptive of instructions to provide the image data to a first data processing tool.
  • the computing system can provide the image data to the first data processing tool and receive first output data from the first data processing tool based on the image data.
  • the first data processing tool can detect a plurality of objects depicted in the image data, generate a plurality of bounding boxes associated with the plurality of objects, generate a plurality of image segments based on the plurality of bounding boxes, classify each of the plurality of objects in the plurality of image segments to generate a plurality of object classifications, and generate the first output data.
  • the first output data can include the plurality of image segments and the plurality of object classifications.
  • the computing system can process the first output data with the machine-learned model to generate second planning data.
  • the second planning data can be descriptive of instructions to provide a particular portion of the image data to a second data processing tool.
  • the computing system can provide the particular portion of the image data to the second data processing tool and receive second output data from the second data processing tool based on the particular portion of the image data.
  • the second data processing tool can include an image search engine.
  • the second data processing tool can process one or more of the plurality of image segments with the image search engine to determine one or more web resources associated with the one or more of the plurality of image segments.
  • the one or more of the plurality of image segments can be selected by the machine-learned model based on the input data and the plurality of object classifications. Additionally and/or alternatively, the second data processing tool can generate the second output data based on the one or more web resources.
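  • The first data processing tool in this method turns detections into bounding boxes, crops image segments from those boxes, and pairs each segment with an object classification, after which one or more segments are selected based on the input data. The sketch below shows the cropping and a naive keyword-based selection on a tiny image represented as nested lists. The `Detection` type, the box convention, and the keyword matching are illustrative assumptions. In the described system, the machine-learned model makes the selection.

```python
from dataclasses import dataclass
from typing import List, Tuple

Image = List[List[int]]  # rows of pixel values


@dataclass
class Detection:
    label: str                       # object classification
    box: Tuple[int, int, int, int]   # (top, left, bottom, right), end-exclusive


def crop(image: Image, box: Tuple[int, int, int, int]) -> Image:
    """Cut the bounding box out of the image."""
    top, left, bottom, right = box
    return [row[left:right] for row in image[top:bottom]]


def segment_image(image: Image, detections: List[Detection]) -> List[Tuple[str, Image]]:
    """Produce (classification, image segment) pairs, one per detection."""
    return [(d.label, crop(image, d.box)) for d in detections]


def select_segments(query: str, segments: List[Tuple[str, Image]]) -> List[Tuple[str, Image]]:
    """Keep segments whose classification appears as a word in the query."""
    words = set(query.lower().replace("?", "").split())
    return [(label, seg) for label, seg in segments if label.lower() in words]


image = [[row * 4 + col for col in range(4)] for row in range(4)]
detections = [Detection("drum", (0, 0, 2, 2)), Detection("flag", (2, 1, 4, 4))]
segments = segment_image(image, detections)
selected = select_segments("When was the drum first used?", segments)
```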
  • Figure 9A depicts an illustration of an example visual information determination process 900 according to example embodiments of the present disclosure. In particular, Figure 9A depicts three example processes (920, 922, & 924).
  • a question 902 (“How many floors does this building have?”) and an image 904 (an image of a building) can be obtained.
  • An object selection 906 can occur to select a relevant image segment associated with the question 902. The image segment can then be processed with an image search tool 908 to identify similar images and relevant text data associated with the identified similar images.
  • the similar image(s) and respective relevant text data can be processed with a language model reasoner block 910 to determine a relevant answer to what the building is.
  • the building name can then be processed with a web search tool 912 to obtain one or more search results.
  • Relevant search result information can then be extracted using the language model reasoner block 910 to determine an answer 916 to the input question 902.
  • a similar process can be performed that may include a second object selection 906 loop based on a determined object identification being irrelevant based on processing a tool output with the language model reasoner block 910.
  • a similar process to 920 and 922 can be performed that may include the utilization of an LLM short QA tool for answering the input question 902.
  • Figure 9B depicts an illustration of an example visual query processing system 950 according to example embodiments of the present disclosure.
  • the visual query processing system 950 can employ dynamic decision-making to plan (e.g., find an optimal tool and query), execute the plan to obtain results, and then reason (e.g., estimate whether to continue or backtrack).
  • the visual query processing system 950 can obtain a visual query.
  • the visual query can be processed with the planner model 952 (e.g., a generative language model (e.g., an LLM)).
  • the visual query processing system 950 may begin processing with an initial query (e.g., a multimodal visual query). As additional information is retrieved, the initial query may be updated.
  • the initial query can be processed with the planner model 952 to determine a particular action to perform, which may cause an application programming interface call to be generated.
  • the application programming interface call can then be performed to transmit the initial query and instructions to a tool executor 954 to generate and/or obtain execution results 956.
  • the tool executor may execute the tool interactions to communicate with the one or more processing tools (e.g., image search engine, web search engine, vision language model, captioning model, VQA model processing, detection model, segmentation model, augmentation model, etc.).
  • the execution results 956 may be transmitted to a working memory 958 and may be utilized for future planning instances with the planner model 952.
  • the query state may be updated based on the execution results 956.
  • the execution results 956 can be processed with a reasoner model 960 (e.g., a generative language model (e.g., an LLM)) to determine whether the execution results 956 are informative 962. If the execution results 956 are determined to not be informative, the visual query processing system 950 may backtrack and perform another planning instance with the planner model 952. If the execution results 956 are determined to be informative, the query state may be updated based on the execution results 956 to generate an updated query 964, which may include updating and/or utilizing transition graphs 966 and/or in-context examples.
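  • The plan, execute, and reason cycle with backtracking described for the visual query processing system 950 can be sketched as follows. The planner, tool executor, and reasoner are passed in as plain functions, and the toy versions below are hypothetical stand-ins for the generative language models and tools. The `tried` list is one assumed way to make the planner choose a different action after an uninformative result.

```python
from typing import Callable, List, Optional, Tuple

Memory = List[Tuple[str, str]]  # (action, execution result) pairs


def seek(
    query: str,
    planner: Callable[[str, List[str]], Optional[str]],
    executor: Callable[[str, str], str],
    reasoner: Callable[[str, str], Tuple[bool, str, Optional[str]]],
    max_steps: int = 5,
) -> Tuple[Optional[str], Memory]:
    """Run plan -> execute -> reason until an answer is found or steps run out."""
    working_memory: Memory = []
    tried: List[str] = []  # actions already found uninformative for this query state
    for _ in range(max_steps):
        action = planner(query, tried)
        if action is None:
            break
        result = executor(action, query)
        working_memory.append((action, result))
        informative, updated_query, answer = reasoner(query, result)
        if not informative:
            tried.append(action)  # backtrack: keep the query state, plan again
            continue
        if answer is not None:
            return answer, working_memory
        query, tried = updated_query, []  # move on to the updated query state
    return None, working_memory


# Toy stand-ins (assumptions for illustration only).
def toy_planner(query: str, tried: List[str]) -> Optional[str]:
    for candidate in ["image_search", "web_search"]:
        if candidate not in tried:
            return candidate
    return None


def toy_executor(action: str, query: str) -> str:
    return {"image_search": "no match", "web_search": "toy answer"}[action]


def toy_reasoner(query: str, result: str) -> Tuple[bool, str, Optional[str]]:
    if result == "no match":
        return False, query, None
    return True, query, result


answer, memory = seek("multimodal query", toy_planner, toy_executor, toy_reasoner)
```

  • Every execution result is appended to the working memory, including uninformative ones, so later planning instances could take earlier failures into account.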
  • Figure 10 depicts a flow chart diagram of an example method to perform according to example embodiments of the present disclosure. Although Figure 10 depicts steps performed in a particular order for purposes of illustration and discussion, the methods of the present disclosure are not limited to the particularly illustrated order or arrangement. The various steps of the method 1000 can be omitted, rearranged, combined, and/or adapted in various ways without deviating from the scope of the present disclosure.
  • a computing system can obtain input data.
  • the input data can include image data and text data.
  • the text data can include a query associated with the image data.
  • the input data may be obtained via a graphical user interface, which may include a query input box, a selection window, and/or a download portal.
  • the computing system can process the input data with an artificial intelligence system to generate response data.
  • the response data can be descriptive of a response to the query.
  • the artificial intelligence system can include a language model planner block, a language model reasoner block, and a working memory.
  • the language model planner block may have been conditioned to determine one or more data processing tools to utilize for processing data associated with responding to the query.
  • the language model reasoner block may have been conditioned to process outputs of the one or more data processing tools to determine information associated with responding to the query.
  • the working memory can store acquired information obtained and generated with the artificial intelligence system.
  • the artificial intelligence system can include one or more machine-learned models conditioned on user behavior data.
  • the user behavior data can be processed to generate a transition graph that is descriptive of a determined sequence of decisions made by users when performing a particular information seeking task.
  • the transition graph can be descriptive of a plurality of distinct states and can indicate a particular set of actions available at each state of the plurality of distinct states.
  • the user behavior data may be utilized to condition the language model planner block and the language model reasoner block for particular context-based processing.
  • the one or more data processing tools can include a computer vision tool, a web search tool, and/or an image search tool.
  • the computer vision tool can include an object detection model, an optical character recognition model, an image captioning model, and/or a visual question-and-answer model.
  • the web search tool can retrieve open world knowledge from web resources.
  • the image search tool may identify relevant information from metadata associated with visually similar images.
  • the language model planner block may have been conditioned to generate a processing tool query to provide to the one or more data processing tools based on the input data.
  • the computing system can provide the response data as an output.
  • the response data may be provided to a user via a graphical user interface.
  • the graphical user interface may be a search interface, and the response data may be provided for display in a panel adjacent to one or more search results.
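  • Because the transition graph indicates a particular set of actions available at each state, a planner can be restricted to those actions when choosing a tool. The sketch below shows that restriction with a hand-written graph and made-up preference scores. The graph contents, the scores, and the function names are hypothetical, and in the described system the language model planner block would supply the preference.

```python
from typing import Dict, List, Optional, Sequence

# Hypothetical transition graph: state -> actions available at that state.
TRANSITION_GRAPH: Dict[str, List[str]] = {
    "START": ["caption", "select_object", "image_search"],
    "caption": ["web_search", "select_object"],
    "select_object": ["image_search", "vqa"],
    "image_search": ["web_search", "select_object", "ANSWER"],
    "vqa": ["web_search", "ANSWER"],
    "web_search": ["llm_short_qa", "ANSWER"],
    "llm_short_qa": ["ANSWER"],
}


def allowed_actions(
    graph: Dict[str, List[str]], state: str, tried: Sequence[str] = ()
) -> List[str]:
    """Actions the graph permits at this state, minus those already tried."""
    return [action for action in graph.get(state, []) if action not in tried]


def choose_action(
    graph: Dict[str, List[str]],
    state: str,
    scores: Dict[str, float],
    tried: Sequence[str] = (),
) -> Optional[str]:
    """Pick the highest-scoring permitted action, or None if none remain."""
    candidates = allowed_actions(graph, state, tried)
    if not candidates:
        return None
    return max(candidates, key=lambda action: scores.get(action, 0.0))


# Made-up planner preferences; web_search scores highest but is not allowed at START.
planner_scores = {"image_search": 0.7, "select_object": 0.5,
                  "caption": 0.2, "web_search": 0.9}
first = choose_action(TRANSITION_GRAPH, "START", planner_scores)
second = choose_action(TRANSITION_GRAPH, "START", planner_scores,
                       tried=["image_search"])
```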
  • Figure 11A depicts a block diagram of an example computing system 100 that performs visual query processing according to example embodiments of the present disclosure.
  • the system 100 includes a user computing system 102, a server computing system 130, and/or a third computing system 150 that are communicatively coupled over a network 180.
  • the user computing system 102 can include any type of computing device, such as, for example, a personal computing device (e.g., laptop or desktop), a mobile computing device (e.g., smartphone or tablet), a gaming console or controller, a wearable computing device, an embedded computing device, or any other type of computing device.
  • the user computing system 102 includes one or more processors 112 and a memory 114.
  • the one or more processors 112 can be any suitable processing device (e.g., a processor core, a microprocessor, an ASIC, a FPGA, a controller, a microcontroller, etc.) and can be one processor or a plurality of processors that are operatively connected.
  • the memory 114 can include one or more non-transitory computer-readable storage mediums, such as RAM, ROM, EEPROM, EPROM, flash memory devices, magnetic disks, etc., and combinations thereof.
  • the memory 114 can store data 116 and instructions 118 which are executed by the processor 112 to cause the user computing system 102 to perform operations.
  • the user computing system 102 can store or include one or more machine-learned models 120.
  • the machine-learned models 120 can be or can otherwise include various machine-learned models such as neural networks (e.g., deep neural networks) or other types of machine-learned models, including non-linear models and/or linear models.
  • Neural networks can include feed-forward neural networks, recurrent neural networks (e.g., long short-term memory recurrent neural networks), convolutional neural networks or other forms of neural networks.
  • the one or more machine-learned models 120 can be received from the server computing system 130 over network 180, stored in the user computing device memory 114, and then used or otherwise implemented by the one or more processors 112.
  • the user computing system 102 can implement multiple parallel instances of a single machine-learned model 120 (e.g., to perform parallel machine-learned model processing across multiple instances of input data and/or detected features).
  • the one or more machine-learned models 120 may include one or more detection models, one or more classification models, one or more segmentation models, one or more augmentation models, one or more generative models, one or more natural language processing models, one or more optical character recognition models, and/or one or more other machine-learned models.
  • the one or more machine-learned models 120 can include one or more transformer models.
  • the one or more machine-learned models 120 may include one or more neural radiance field models, one or more diffusion models, and/or one or more autoregressive language models.
  • the one or more machine-learned models 120 may be utilized to detect one or more object features. The detected object features may be classified and/or embedded.
  • the classification and/or the embedding may then be utilized to perform a search to determine one or more search results.
  • the one or more detected features may be utilized to determine an indicator (e.g., a user interface element that indicates a detected feature) is to be provided to indicate a feature has been detected.
  • the user may then select the indicator to cause a feature classification, embedding, and/or search to be performed.
  • the classification, the embedding, and/or the searching can be performed before the indicator is selected.
  • the one or more machine-learned models 120 can process image data, text data, audio data, and/or latent encoding data to generate output data that can include image data, text data, audio data, and/or latent encoding data.
  • the one or more machine-learned models 120 may perform optical character recognition, natural language processing, image classification, object classification, text classification, audio classification, context determination, action prediction, image correction, image augmentation, text augmentation, sentiment analysis, object detection, error detection, inpainting, video stabilization, audio correction, audio augmentation, and/or data segmentation (e.g., mask based segmentation).
  • Machine-learned model(s) can be or include one or multiple machine-learned models or model components.
  • Example machine-learned models can include neural networks (e.g., deep neural networks).
  • Example machine-learned models can include non-linear models or linear models.
  • Example machine-learned models can use other architectures in lieu of or in addition to neural networks.
  • Example machine-learned models can include decision tree based models, support vector machines, hidden Markov models, Bayesian networks, linear regression models, k-means clustering models, etc.
  • Example neural networks can include feed-forward neural networks, recurrent neural networks (RNNs), including long short-term memory (LSTM) based recurrent neural networks, convolutional neural networks (CNNs), diffusion models, generative-adversarial networks, or other forms of neural networks.
  • Example neural networks can be deep neural networks.
  • Some example machine-learned models can leverage an attention mechanism such as self-attention.
  • some example machine-learned models can include multi-headed self-attention models.
  • Machine-learned model(s) can include a single or multiple instances of the same model configured to operate on data from input(s).
  • Machine-learned model(s) can include an ensemble of different models that can cooperatively interact to process data from input(s).
  • machine-learned model(s) can employ a mixture-of-experts structure. See, e.g., Zhou et al., Mixture-of-Experts with Expert Choice Routing, ARXIV:2202.09368v2 (Oct. 14, 2022).
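
As a rough illustration of the mixture-of-experts structure mentioned above, the following sketch shows the core idea of expert choice routing, in which each expert selects the tokens it will process rather than each token selecting its experts. The function names, the `capacity` parameter, and the toy router logits are assumptions made for this example; the expert networks themselves are omitted, and this is not presented as the implementation used by the disclosure or by the cited paper.

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def expert_choice_route(router_logits, capacity):
    """Let every expert pick its `capacity` highest-affinity tokens.

    router_logits[t][e] is a (hypothetical) router logit for token t and
    expert e. Returns {expert index: [(token index, gate weight), ...]}.
    """
    num_tokens = len(router_logits)
    num_experts = len(router_logits[0])
    # Token-to-expert affinities: softmax over the expert dimension.
    affinity = [softmax(row) for row in router_logits]
    routing = {}
    for expert in range(num_experts):
        # Rank tokens by their affinity for this expert, highest first.
        ranked = sorted(
            range(num_tokens),
            key=lambda token: affinity[token][expert],
            reverse=True,
        )
        routing[expert] = [(t, affinity[t][expert]) for t in ranked[:capacity]]
    return routing


# Four tokens, two experts, each expert processes two tokens.
toy_logits = [[2.0, 0.0], [0.0, 2.0], [1.0, 0.0], [0.0, 1.0]]
toy_routing = expert_choice_route(toy_logits, capacity=2)
```

Because every expert takes a fixed number of tokens, the load is balanced by construction, although a given token may be chosen by several experts or by none.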
  • Input(s) can generally include or otherwise represent various types of data. Input(s) can include one type or many different types of data. Output(s) can be data of the same type(s) or of different types of data as compared to input(s). Output(s) can include one type or many different types of data.
  • Example data types for input(s) or output(s) include natural language text data, software code data (e.g., source code, object code, machine code, or any other form of computer-readable instructions or programming languages), machine code data (e.g., binary code, assembly code, or other forms of machine-readable instructions that can be executed directly by a computer's central processing unit), assembly code data (e.g., low-level programming languages that use symbolic representations of machine code instructions to program a processing unit), genetic data or other chemical or biochemical data, image data, audio data, audiovisual data, haptic data, biometric data, medical data, financial data, statistical data, geographical data, astronomical data, historical data, sensor data generally (e.g., digital or analog values, such as voltage or other absolute or relative level measurement values from a real or artificial input, such as from an audio sensor, light sensor, displacement sensor, etc.), and the like.
  • Data can be raw or processed and can be in any format or schema.
  • example combinations of data types include image data and audio data, image data and natural language data, natural language data and software code data, image data and biometric data, sensor data and medical data, etc. It is to be understood that any combination of data types in an input or an output can be present.
  • An example input can include one or multiple data types, such as the example data types noted above.
  • An example output can include one or multiple data types, such as the example data types noted above.
  • the data type(s) of input can be the same as or different from the data type(s) of output. It is to be understood that the example data types noted above are provided for illustrative purposes only.
  • one or more machine-learned models 140 can be included in or otherwise stored and implemented by the server computing system 130 that communicates with the user computing system 102 according to a client-server relationship.
  • the machine-learned models 140 can be implemented by the server computing system 130 as a portion of a web service (e.g., a viewfinder service, a visual search service, an image processing service, an ambient computing service, and/or an overlay application service).
  • one or more models 120 can be stored and implemented at the user computing system 102 and/or one or more models 140 can be stored and implemented at the server computing system 130.
  • the user computing system 102 can also include one or more user input components 122 that receive user input.
  • the user input component 122 can be a touch-sensitive component (e.g., a touch-sensitive display screen or a touch pad) that is sensitive to the touch of a user input object (e.g., a finger or a stylus).
  • the touch-sensitive component can serve to implement a virtual keyboard.
  • Other example user input components include a microphone, a traditional keyboard, or other means by which a user can provide user input.
  • the user computing system can store and/or provide one or more user interfaces 124, which may be associated with one or more applications.
  • the one or more user interfaces 124 can be configured to receive inputs and/or provide data for display (e.g., image data, text data, audio data, one or more user interface elements, an augmented-reality experience, a virtual reality experience, and/or other data for display).
  • the user interfaces 124 may be associated with one or more other computing systems (e.g., server computing system 130 and/or third party computing system 150).
  • the user interfaces 124 can include a viewfinder interface, a search interface, a generative model interface, a social media interface, and/or a media content gallery interface.
  • the user computing system 102 may include and/or receive data from one or more sensors 126.
  • the one or more sensors 126 may be housed in a housing component that houses the one or more processors 112, the memory 114, and/or one or more hardware components, which may store, and/or cause to perform, one or more software packets.
  • the one or more sensors 126 can include one or more image sensors (e.g., a camera), one or more lidar sensors, one or more audio sensors (e.g., a microphone), one or more inertial sensors (e.g., inertial measurement unit), one or more biological sensors (e.g., a heart rate sensor, a pulse sensor, a retinal sensor, and/or a fingerprint sensor), one or more infrared sensors, one or more location sensors (e.g., GPS), one or more touch sensors (e.g., a conductive touch sensor and/or a mechanical touch sensor), and/or one or more other sensors.
  • the one or more sensors can be utilized to obtain data associated with a user’s environment (e.g., an image of a user’s environment, a recording of the environment, and/or the location of the user).
  • the user computing system 102 may include, and/or be part of, a user computing device 104.
  • the user computing device 104 may include a mobile computing device (e.g., a smartphone or tablet), a desktop computer, a laptop computer, a smart wearable, and/or a smart appliance. Additionally and/or alternatively, the user computing system may obtain data from, and/or generate data with, the one or more user computing devices 104.
  • a camera of a smartphone may be utilized to capture image data descriptive of the environment, and/or an overlay application of the user computing device 104 can be utilized to track and/or process the data being provided to the user.
  • one or more sensors associated with a smart wearable may be utilized to obtain data about a user and/or about a user’s environment (e.g., image data can be obtained with a camera housed in a user’s smart glasses). Additionally and/or alternatively, the data may be obtained and uploaded from other user devices that may be specialized for data obtainment or generation.
  • the server computing system 130 includes one or more processors 132 and a memory 134.
  • the one or more processors 132 can be any suitable processing device (e.g., a processor core, a microprocessor, an ASIC, an FPGA, a controller, a microcontroller, etc.) and can be one processor or a plurality of processors that are operatively connected.
  • the memory 134 can include one or more non-transitory computer-readable storage mediums, such as RAM, ROM, EEPROM, EPROM, flash memory devices, magnetic disks, etc., and combinations thereof.
  • the memory 134 can store data 136 and instructions 138 which are executed by the processor 132 to cause the server computing system 130 to perform operations.
  • the server computing system 130 includes or is otherwise implemented by one or more server computing devices.
  • the server computing system 130 can store or otherwise include one or more machine-learned models 140.
  • the models 140 can be or can otherwise include various machine-learned models.
  • Example machine-learned models include neural networks or other multi-layer non-linear models.
  • Example neural networks include feed forward neural networks, deep neural networks, recurrent neural networks, and convolutional neural networks.
  • Example models 140 are discussed with reference to Figure 9B.
  • the server computing system 130 can include and/or be communicatively connected with a search engine 142 that may be utilized to crawl one or more databases (and/or resources).
  • the search engine 142 can process data from the user computing system 102, the server computing system 130, and/or the third party computing system 150 to determine one or more search results associated with the input data.
  • the search engine 142 may perform term based search, label based search, Boolean based searches, image search, embedding based search (e.g., nearest neighbor search), multimodal search, and/or one or more other search techniques.
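
The embedding based (nearest neighbor) search named above can be sketched as follows. This is a minimal brute-force example using cosine similarity; the index contents, identifiers, and function names are hypothetical, and a production search engine would typically rely on an approximate nearest neighbor index rather than a full scan.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # Treat zero vectors as matching nothing.
    return dot / (norm_a * norm_b)


def nearest_neighbors(query_embedding, index, k=1):
    """Return the ids of the k indexed items most similar to the query.

    `index` maps a (hypothetical) item id to its embedding vector.
    """
    scored = [
        (cosine_similarity(query_embedding, embedding), item_id)
        for item_id, embedding in index.items()
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [item_id for _, item_id in scored[:k]]


# Toy two-dimensional embeddings standing in for image or text embeddings.
toy_index = {"item_a": [1.0, 0.0], "item_b": [0.0, 1.0], "item_c": [1.0, 1.0]}
top_two = nearest_neighbors([0.9, 0.1], toy_index, k=2)
```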
  • the server computing system 130 may store and/or provide one or more user interfaces 144 for obtaining input data and/or providing output data to one or more users.
  • the one or more user interfaces 144 can include one or more user interface elements, which may include input fields, navigation tools, content chips, selectable tiles, widgets, data display carousels, dynamic animation, informational pop-ups, image augmentations, text-to-speech, speech-to-text, augmented-reality, virtual-reality, feedback loops, and/or other interface elements.
  • the user computing system 102 and/or the server computing system 130 can train the models 120 and/or 140 via interaction with the third party computing system 150 that is communicatively coupled over the network 180.
  • the third party computing system 150 can be separate from the server computing system 130 or can be a portion of the server computing system 130.
  • An example machine-learned model can include a generative model (e.g., a large language model, a foundation model, a vision language model, an image generation model, a text-to-image model, an audio generation model, and/or other generative models).
  • Training and/or tuning the machine-learned model can include obtaining a training instance.
  • a set of training data can include a plurality of training instances divided between multiple datasets (e.g., a training dataset, a validation dataset, or a testing dataset).
  • a training instance can be labeled or unlabeled.
  • Training and/or tuning can include processing, using one or more machine-learned models, the training instance to generate an output.
  • the output can be directly obtained from the one or more machine-learned models or can be a downstream result of a chain of processing operations that includes an output of the one or more machine-learned models.
  • Training and/or tuning can include receiving an evaluation signal associated with the output. The evaluation signal can be obtained using a loss function.
  • the evaluation signal can be computed using known ground-truth labels (e.g., supervised learning), predicted or estimated labels (e.g., semi- or self-supervised learning), or without labels (e.g., unsupervised learning).
  • the evaluation signal can be a reward (e.g., for reinforcement learning).
  • the reward can be computed using a machine-learned reward model configured to generate rewards based on output(s) received.
  • the reward can be computed using feedback data describing human feedback on the output(s).
  • Training and/or tuning can include updating the machine-learned model using the evaluation signal.
  • values for parameters of the machine-learned model(s) can be learned, in some embodiments, using various training or learning techniques, such as, for example, backwards propagation.
  • the evaluation signal can be backpropagated from the output (or another source of the evaluation signal) through the machine-learned model(s) to update one or more parameters of the model(s) (e.g., based on a gradient of the evaluation signal with respect to the parameter value(s)).
  • system(s) containing one or more machine-learned models can be trained in an end-to-end manner. Gradient descent techniques can be used to iteratively update the parameters over a number of training iterations.
  • performing backwards propagation of errors can include performing truncated backpropagation through time.
  • Training and/or tuning can include implementing a number of generalization techniques (e.g., weight decays, dropouts, etc.) to improve the generalization capability of the models being trained.
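
The training steps described above (generate an output, compute an evaluation signal with a loss function, and update parameters from its gradient) can be illustrated with a very small example. The sketch below fits a one-variable linear model with a mean squared error loss and plain gradient descent, with an optional weight decay term as one generalization technique. The model, data, and hyperparameter values are assumptions chosen for illustration only.

```python
def train_linear(data, lr=0.05, weight_decay=0.0, steps=2000):
    """Fit y = w * x + b by gradient descent on mean squared error.

    `data` is a list of (x, y) training instances with ground-truth labels.
    `weight_decay` adds an L2 penalty weight_decay * w**2 to the loss.
    """
    w, b = 0.0, 0.0  # Initialized state.
    n = len(data)
    for _ in range(steps):
        # Forward pass and gradient of the loss with respect to each parameter.
        grad_w = sum(2.0 * (w * x + b - y) * x for x, y in data) / n
        grad_b = sum(2.0 * (w * x + b - y) for x, y in data) / n
        grad_w += 2.0 * weight_decay * w  # Gradient of the L2 penalty.
        # Parameter update using the evaluation signal.
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b


# Labeled instances generated from y = 2x + 1.
toy_data = [(0.0, 1.0), (1.0, 3.0), (2.0, 5.0), (3.0, 7.0)]
w_plain, b_plain = train_linear(toy_data)
w_decayed, b_decayed = train_linear(toy_data, weight_decay=1.0)
```

With weight decay enabled the learned weight is pulled toward zero, trading some fit on the training data for a smaller parameter magnitude.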
  • the above training loop can be implemented for training a machine-learned model from an initialized state to a fully trained state (e.g., when the model exhibits a desired performance profile, such as based on accuracy, precision, recall, etc.).
  • the above training loop can be implemented for particular stages of a training procedure. For instance, in some implementations, the above training loop can be implemented for pre-training a machine-learned model.
  • Pre-training can include, for instance, large-scale training over potentially noisy data to achieve a broad base of performance levels across a variety of tasks/data types.
  • the above training loop can be implemented for fine-tuning a machine-learned model.
  • Fine-tuning can include, for instance, smaller-scale training on higher-quality (e.g., labeled, curated, etc.) data.
  • Fine-tuning can affect all or a portion of the parameters of a machine-learned model. For example, various portions of the machine-learned model can be “frozen” for certain training stages.
  • parameters associated with an embedding space can be “frozen” during fine-tuning (e.g., to retain information learned from a broader domain(s) than present in the fine-tuning dataset(s)).
  • An example fine-tuning approach includes reinforcement learning. Reinforcement learning can be based on user feedback on model performance during use.
  • the computing system 100 may utilize one or more soft prompts for conditioning the one or more machine-learned models (120 and/or 140) for downstream tasks.
  • the one or more soft prompts can include a set of tunable parameters that can be trained (or tuned) as the parameters of the one or more machine-learned models (120 and/or 140) are fixed.
  • the one or more soft prompts 124 can be trained for a specific task and/or a specific set of tasks. Alternatively and/or additionally, the one or more soft prompts 124 may be trained to condition the one or more machine-learned models (120 and/or 140) to perform inferences for a particular individual, one or more entities, and/or one or more tasks such that the output is tailored for that particular individual, particular entities, and/or particular task.
  • the one or more soft prompts 124 can be obtained and processed with one or more inputs by the one or more machine-learned models (120 and/or 140).
  • the one or more soft prompts can include a set of machine-learned weights.
  • the one or more soft prompts can include weights that were trained to condition a generative model to generate model-generated content with one or more particular attributes.
  • the one or more soft prompts can be utilized by a user to generate content based on the fine-tuning.
  • the one or more soft prompts can be extended to a plurality of tasks.
  • the computing system 100 may tune the set of parameters on a plurality of different content attributes and/or types.
  • the one or more soft prompts may include a plurality of learned vector representations that may be model-readable.
  • a particular soft prompt can be obtained based on a particular task, individual, content type, etc.
  • the particular soft prompt can include a set of learned parameters.
  • the set of learned parameters can be processed with the generative model to generate the model-generated image.
  • the user computing system 102 and/or the server computing system 130 may store one or more soft prompts associated with the particular user and/or particular task.
  • the soft prompt(s) can include a set of parameters.
  • the user computing system 102 and/or the server computing system 130 may leverage the set of parameters of the soft prompt(s) and a generative model to generate a model-generated content item.
  • the model-generated content item can be generated based on the set of parameters associated with the particular individual and/or task.
  • a soft prompt can be a set of parameters that can be processed with a generative model for downstream task conditioning.
  • the set of parameters can be limited and may be adjusted while the parameters of the pre-trained generative model stay fixed.
  • the set of parameters of the soft prompt can be utilized to condition the pre-trained generative model (e.g., the machine-learned image generation model and/or language model) for particular downstream tasks (e.g., response generation and/or image rendering).
  • the generative language model and/or one or more soft prompts can be trained to generate content with particular attributes.
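
The soft prompt tuning described above, in which a small set of prompt parameters is adjusted while the pre-trained model parameters stay fixed, can be sketched with a deliberately tiny stand-in model. Here the "pre-trained model" is a frozen linear scorer over the concatenation of the soft prompt and the input; this stand-in, the function names, and the numeric values are all assumptions for illustration and do not reflect an actual generative model.

```python
def frozen_model(weights, features):
    """Stand-in for a pre-trained model: a fixed linear scorer."""
    return sum(w * f for w, f in zip(weights, features))


def tune_soft_prompt(weights, examples, prompt_len, lr=0.1, steps=200):
    """Learn soft prompt values by gradient descent; `weights` is never modified.

    Each example is (input_features, target). The soft prompt is prepended
    to the input features before the frozen model processes them.
    """
    prompt = [0.0] * prompt_len
    for _ in range(steps):
        grads = [0.0] * prompt_len
        for features, target in examples:
            error = frozen_model(weights, prompt + features) - target
            for i in range(prompt_len):
                # d(squared error)/d(prompt[i]) = 2 * error * weights[i]
                grads[i] += 2.0 * error * weights[i] / len(examples)
        # Only the soft prompt parameters are updated.
        prompt = [p - lr * g for p, g in zip(prompt, grads)]
    return prompt


# Frozen weights: two positions for the soft prompt, one for the input.
pretrained_weights = [1.0, -0.5, 2.0]
task_examples = [([1.0], 3.0), ([2.0], 5.0)]  # Hypothetical downstream task.
learned_prompt = tune_soft_prompt(pretrained_weights, task_examples, prompt_len=2)
```

A different soft prompt could be learned for each task or user from the same frozen weights, which is what makes storing many soft prompts inexpensive compared with storing many fine-tuned models.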
  • the server computing system 130 can include a prompt library.
  • the prompt library can store a plurality of prompt templates (e.g., a plurality of hard prompt templates (e.g., text prompt templates)) and/or a plurality of soft prompts.
  • the plurality of prompt templates can include hard prompt templates (e.g., text string data) that may be combined with the user input to generate a more detailed and complete prompt for the generative model to process.
  • the templates can include text descriptive of the request.
  • the templates may be object-specific, user-specific, and/or content-specific.
  • the plurality of prompt templates may include few-shot examples.
  • the prompt library can store a plurality of soft prompts.
  • the plurality of soft prompts may be associated with a plurality of different content attributes and/or a plurality of different individuals.
  • the plurality of soft prompts can include learned parameters and/or learned weights that can be processed with the generative model to condition the generative model to generate content items with particular attributes.
  • the plurality of soft prompts may have been tuned by freezing the parameters of a pre-trained generative model, while the parameters of the soft prompt are learned based on a particular task and/or user.
  • the plurality of soft prompts can include a plurality of different soft prompts associated with a plurality of different users and/or a plurality of different sets of users.
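
As a simple illustration of the hard prompt templates described above, the sketch below combines a text template, a handful of few-shot examples, and a user input into one more complete prompt. The template wording, placeholder names, and example question and answer pairs are hypothetical.

```python
def build_prompt(template, few_shot_examples, user_input):
    """Fill a hard (text) prompt template with few-shot examples and user input.

    `template` must contain the placeholders {examples} and {user_input}.
    `few_shot_examples` is a list of (question, answer) string pairs.
    """
    shots = "\n".join(f"Q: {q}\nA: {a}" for q, a in few_shot_examples)
    return template.format(examples=shots, user_input=user_input)


# Hypothetical template that a prompt library might store.
TEMPLATE = (
    "Answer the question about the image.\n"
    "{examples}\n"
    "Q: {user_input}\n"
    "A:"
)

full_prompt = build_prompt(
    TEMPLATE,
    [("What breed is the dog?", "A corgi")],
    "What is this building?",
)
```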
  • the third party computing system 150 can include one or more processors 152 and a memory 154.
  • the one or more processors 152 can be any suitable processing device (e.g., a processor core, a microprocessor, an ASIC, an FPGA, a controller, a microcontroller, etc.) and can be one processor or a plurality of processors that are operatively connected.
  • the memory 154 can include one or more non-transitory computer-readable storage mediums, such as RAM, ROM, EEPROM, EPROM, flash memory devices, magnetic disks, etc., and combinations thereof.
  • the memory 154 can store data 156 and instructions 158 which are executed by the processor 152 to cause the third party computing system 150 to perform operations.
  • the third party computing system 150 includes or is otherwise implemented by one or more server computing devices.
  • the network 180 can be any type of communications network, such as a local area network (e.g., intranet), wide area network (e.g., Internet), or some combination thereof and can include any number of wired or wireless links.
  • communication over the network 180 can be carried via any type of wired and/or wireless connection, using a wide variety of communication protocols (e.g., TCP/IP, HTTP, SMTP, FTP), encodings or formats (e.g., HTML, XML), and/or protection schemes (e.g., VPN, secure HTTP, SSL).
  • the machine-learned models described in this specification may be used in a variety of tasks, applications, and/or use cases.
  • the input to the machine-learned model(s) of the present disclosure can be image data.
  • the machine-learned model(s) can process the image data to generate an output.
  • the machine-learned model(s) can process the image data to generate an image recognition output (e.g., a recognition of the image data, a latent embedding of the image data, an encoded representation of the image data, a hash of the image data, etc.).
  • the machine-learned model(s) can process the image data to generate an image segmentation output.
  • the machine-learned model(s) can process the image data to generate an image classification output.
  • the machine-learned model(s) can process the image data to generate an image data modification output (e.g., an alteration of the image data, etc.).
  • the machine-learned model(s) can process the image data to generate an encoded image data output (e.g., an encoded and/or compressed representation of the image data, etc.).
  • the machine-learned model(s) can process the image data to generate an upscaled image data output.
  • the machine-learned model(s) can process the image data to generate a prediction output.
  • the input to the machine-learned model(s) of the present disclosure can be text or natural language data.
  • the machine-learned model(s) can process the text or natural language data to generate an output.
  • the machine-learned model(s) can process the natural language data to generate a language encoding output.
  • the machine-learned model(s) can process the text or natural language data to generate a latent text embedding output.
  • the machine-learned model(s) can process the text or natural language data to generate a translation output.
  • the machine-learned model(s) can process the text or natural language data to generate a classification output.
  • the machine-learned model(s) can process the text or natural language data to generate a textual segmentation output.
  • the machine-learned model(s) can process the text or natural language data to generate a semantic intent output.
  • the machine-learned model(s) can process the text or natural language data to generate an upscaled text or natural language output (e.g., text or natural language data that is higher quality than the input text or natural language, etc.). As another example, the machine-learned model(s) can process the text or natural language data to generate a prediction output.
  • the input to the machine-learned model(s) of the present disclosure can be speech data.
  • the machine-learned model(s) can process the speech data to generate an output.
  • the machine-learned model(s) can process the speech data to generate a speech recognition output.
  • the machine-learned model(s) can process the speech data to generate a speech translation output.
  • the machine-learned model(s) can process the speech data to generate a latent embedding output.
  • the machine-learned model(s) can process the speech data to generate an encoded speech output (e.g., an encoded and/or compressed representation of the speech data, etc.).
  • the machine-learned model(s) can process the speech data to generate an upscaled speech output (e.g., speech data that is higher quality than the input speech data, etc.).
  • the machine-learned model(s) can process the speech data to generate a textual representation output (e.g., a textual representation of the input speech data, etc.).
  • the machine-learned model(s) can process the speech data to generate a prediction output.
  • the input to the machine-learned model(s) of the present disclosure can be sensor data.
  • the machine-learned model(s) can process the sensor data to generate an output.
  • the machine-learned model(s) can process the sensor data to generate a recognition output.
  • the machine-learned model(s) can process the sensor data to generate a prediction output.
  • the machine-learned model(s) can process the sensor data to generate a classification output.
  • the machine-learned model(s) can process the sensor data to generate a segmentation output.
  • the machine-learned model(s) can process the sensor data to generate a visualization output.
  • the machine-learned model(s) can process the sensor data to generate a diagnostic output.
  • the machine-learned model(s) can process the sensor data to generate a detection output.
  • the input includes visual data and the task is a computer vision task.
  • the input includes pixel data for one or more images and the task is an image processing task.
  • the image processing task can be image classification, where the output is a set of scores, each score corresponding to a different object class and representing the likelihood that the one or more images depict an object belonging to the object class.
  • the image processing task may be object detection, where the image processing output identifies one or more regions in the one or more images and, for each region, a likelihood that region depicts an object of interest.
  • the image processing task can be image segmentation, where the image processing output defines, for each pixel in the one or more images, a respective likelihood for each category in a predetermined set of categories.
  • the set of categories can be foreground and background.
  • the set of categories can be object classes.
  • the image processing task can be depth estimation, where the image processing output defines, for each pixel in the one or more images, a respective depth value.
  • the image processing task can be motion estimation, where the network input includes multiple images, and the image processing output defines, for each pixel of one of the input images, a motion of the scene depicted at the pixel between the images in the network input.
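
The image classification output described above, a set of scores with one score per object class, is commonly produced by normalizing raw model outputs with a softmax. The sketch below assumes hypothetical class names and logit values; it shows only the score normalization step, not an image model.

```python
import math


def class_scores(logits):
    """Convert raw per-class logits into likelihood scores that sum to 1.

    `logits` maps a class name to an unnormalized model output.
    """
    peak = max(logits.values())  # Subtracting the max keeps exp() stable.
    exps = {name: math.exp(value - peak) for name, value in logits.items()}
    total = sum(exps.values())
    return {name: e / total for name, e in exps.items()}


# Hypothetical logits for one image over three object classes.
scores = class_scores({"dog": 2.0, "cat": 0.5, "car": -1.0})
predicted_class = max(scores, key=scores.get)
```

The same normalization applied independently at every pixel gives the per-pixel category likelihoods described for image segmentation.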
  • the task can be a generative task, and the one or more machine-learned models (e.g., 120 and/or 140) can be configured to output content generated in view of one or more inputs.
  • the inputs can be or otherwise represent data of one or more modalities that encodes context for generating additional content.
  • the task can be a text completion task.
  • the machine-learned models can be configured to process the inputs that represent textual data and to generate the outputs that represent additional textual data that completes a textual sequence that includes the inputs.
  • the machine-learned models can be configured to generate the outputs to complete a sentence, paragraph, or portion of text that follows from a portion of text represented by inputs.
  • the task can be an instruction following task.
  • the machine-learned models can be configured to process the inputs that represent instructions to perform a function and to generate the outputs that advance a goal of satisfying the instruction function (e.g., at least a step of a multi-step procedure to perform the function).
  • the outputs can represent data of the same or of a different modality as the inputs.
  • the inputs can represent textual data (e.g., natural language instructions for a task to be performed) and the machine-learned models can process the inputs to generate the outputs that represent textual data responsive to the instructions (e.g., natural language responses, programming language responses, machine language responses, etc.).
  • the inputs can represent image data (e.g., image-based instructions for a task to be performed, optionally accompanied by textual instructions) and the machine-learned models can process the inputs to generate the outputs that represent textual data responsive to the instructions (e.g., natural language responses, programming language responses, machine language responses, etc.).
  • One or more outputs can be iteratively or recursively generated to sequentially process and accomplish steps toward accomplishing the requested functionality. For instance, an initial output can be executed by an external system or be processed by the machine-learned models to complete an initial step of performing a function. Multiple steps can be performed, with a final output being obtained that is responsive to the initial instructions.
  • the task can be a question answering task.
  • the machine-learned models can be configured to process the inputs that represent a question to answer and to generate the outputs that advance a goal of returning an answer to the question (e.g., at least a step of a multi-step procedure to perform the function).
  • the outputs can represent data of the same or of a different modality as the inputs.
  • the inputs can represent textual data (e.g., natural language instructions for a task to be performed) and the machine-learned models can process the inputs to generate the outputs that represent textual data responsive to the question (e.g., natural language responses, programming language responses, machine language responses, etc.).
  • the inputs can represent image data (e.g., image-based instructions for a task to be performed, optionally accompanied by textual instructions) and the machine-learned models can process the inputs to generate the outputs that represent textual data responsive to the question (e.g., natural language responses, programming language responses, machine language responses, etc.).
  • One or more outputs can be iteratively or recursively generated to sequentially process and accomplish steps toward answering the question.
  • an initial output can be executed by an external system or be processed by the machine-learned models to complete an initial step of obtaining an answer to the question (e.g., querying a database, performing a computation, executing a script, etc.). Multiple steps can be performed, with a final output being obtained that is responsive to the question.
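
The iterative question answering procedure described above, in which intermediate outputs are executed by an external system until a final answer is produced, can be sketched as a simple loop. In this sketch a scripted function stands in for the machine-learned model, and the tool names, database contents, and step limit are all hypothetical.

```python
def answer_question(question, policy, tools, max_steps=5):
    """Alternate between model outputs and external execution.

    `policy(context)` stands in for the model and returns either
    ("final", answer) or ("call", (tool_name, tool_argument)).
    """
    context = [("question", question)]
    for _ in range(max_steps):
        kind, payload = policy(context)
        if kind == "final":
            return payload  # Final output responsive to the question.
        tool_name, tool_argument = payload
        result = tools[tool_name](tool_argument)  # Executed externally.
        context.append((tool_name, result))  # Fed back for the next step.
    return None  # No answer within the step budget.


def scripted_policy(context):
    """Hypothetical stand-in for a model: look up a value, then double it."""
    observations = dict(context[1:])
    if "lookup" not in observations:
        return ("call", ("lookup", "tower_height_m"))
    if "compute" not in observations:
        return ("call", ("compute", (observations["lookup"], 2)))
    return ("final", observations["compute"])


toy_database = {"tower_height_m": 300}  # Made-up value for illustration.
toy_tools = {
    "lookup": lambda key: toy_database[key],  # Querying a database.
    "compute": lambda pair: pair[0] * pair[1],  # Performing a computation.
}
final_answer = answer_question(
    "How tall would two stacked towers be?", scripted_policy, toy_tools
)
```

Bounding the number of steps is a design choice in this sketch so that a policy that never emits a final output cannot loop indefinitely.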
  • the task can be an image generation task.
  • the machine-learned models can be configured to process the inputs that represent context regarding a desired portion of image content.
  • the context can include text data, image data, audio data, etc.
  • Machine-learned models can be configured to generate the outputs that represent image data that depicts imagery related to the context.
  • the machine-learned models can be configured to generate pixel data of an image.
  • Values for channels associated with the pixels in the pixel data can be selected based on the context (e.g., based on a probability determined based on the context).
  • the task can be an audio generation task.
  • Machine-learned models can be configured to process the inputs that represent context regarding a desired portion of audio content.
  • the context can include text data, image data, audio data, etc.
  • the machine-learned models can be configured to generate the outputs that represent audio data related to the context.
  • the machine-learned models can be configured to generate waveform data in the form of an image (e.g., a spectrogram). Values for channels associated with pixels of the image can be selected based on the context.
  • the machine-learned models can be configured to generate waveform data in the form of a sequence of discrete samples of a continuous waveform. Values of the sequence can be selected based on the context (e.g., based on a probability determined based on the context).
  • the task can be a data generation task.
  • Machine-learned models can be configured to process the inputs that represent context regarding a desired portion of data (e.g., data from various data domains, such as sensor data, image data, multimodal data, statistical data, etc.).
  • the desired data can be, for instance, synthetic data for training other machine-learned models.
  • the context can include arbitrary data types.
  • the machine-learned models can be configured to generate the outputs that represent data that aligns with the desired data.
  • the machine-learned models can be configured to generate data values for populating a dataset. Values for the data objects can be selected based on the context (e.g., based on a probability determined based on the context).
  • the user computing system may include a number of applications (e.g., applications 1 through N). Each application may include its own respective machine learning library and machine-learned model(s). For example, each application can include a machine-learned model.
  • Example applications include a text messaging application, an email application, a dictation application, a virtual keyboard application, a browser application, etc.
  • Each application can communicate with a number of other components of the computing device, such as, for example, one or more sensors, a context manager, a device state component, and/or additional components.
  • each application can communicate with each device component using an API (e.g., a public API).
  • the API used by each application is specific to that application.
  • the user computing system 102 can include a number of applications (e.g., applications 1 through N). Each application is in communication with a central intelligence layer.
  • Example applications include a text messaging application, an email application, a dictation application, a virtual keyboard application, a browser application, etc.
  • each application can communicate with the central intelligence layer (and model(s) stored therein) using an API (e.g., a common API across all applications).
  • the central intelligence layer can include a number of machine-learned models. For example, a respective machine-learned model can be provided for each application and managed by the central intelligence layer.
  • two or more applications can share a single machine-learned model.
  • the central intelligence layer can provide a single model for all of the applications.
  • the central intelligence layer is included within or otherwise implemented by an operating system of the computing system 100.
  • the central intelligence layer can communicate with a central device data layer.
  • the central device data layer can be a centralized repository of data for the computing system 100.
  • the central device data layer may communicate with a number of other components of the computing device, such as, for example, one or more sensors, a context manager, a device state component, and/or additional components.
  • the central device data layer can communicate with each device component using an API (e.g., a private API).
  • Figure 11B depicts a block diagram of an example computing system 50 that performs visual query processing according to example embodiments of the present disclosure.
  • the example computing system 50 can include one or more computing devices 52 that can be utilized to obtain, and/or generate, one or more datasets that can be processed by a sensor processing system 60 and/or an output determination system 80 to provide feedback to a user, which can include information on features in the one or more obtained datasets.
  • the one or more datasets can include image data, text data, audio data, multimodal data, latent encoding data, etc.
  • the one or more datasets may be obtained via one or more sensors associated with the one or more computing devices 52 (e.g., one or more sensors in the computing device 52). Additionally and/or alternatively, the one or more datasets can be stored data and/or retrieved data (e.g., data retrieved from a web resource).
  • images, text, and/or other content items may be interacted with by a user.
  • the interacted with content items can then be utilized to generate one or more determinations.
  • the one or more computing devices 52 can obtain, and/or generate, one or more datasets based on image capture, sensor tracking, data storage retrieval, content download (e.g., downloading an image or other content item via the internet from a web resource), and/or via one or more other techniques.
  • the one or more datasets can be processed with a sensor processing system 60.
  • the sensor processing system 60 may perform one or more processing techniques using one or more machine-learned models, one or more search engines, and/or one or more other processing techniques.
  • the one or more processing techniques can be performed in any combination and/or individually.
  • the one or more processing techniques can be performed in series and/or in parallel.
  • the one or more datasets can be processed with a context determination block 62, which may determine a context associated with one or more content items.
  • the context determination block 62 may identify and/or process metadata, user profile data (e.g., preferences, user search history, user browsing history, user purchase history, and/or user input data), previous interaction data, global trend data, location data, time data, and/or other data to determine a particular context associated with the user.
  • the context can be associated with an event, a determined trend, a particular action, a particular type of data, a particular environment, and/or another context associated with the user and/or the retrieved or obtained data.
  • the sensor processing system 60 may include an image preprocessing block 64.
  • the image preprocessing block 64 may be utilized to adjust one or more values of an obtained and/or received image to prepare the image to be processed by one or more machine-learned models and/or one or more search engines 74.
  • the image preprocessing block 64 may resize the image, adjust saturation values, adjust resolution, strip and/or add metadata, and/or perform one or more other operations.
  • the sensor processing system 60 can include one or more machine-learned models, which may include a detection model 66, a segmentation model 68, a classification model 70, an embedding model 72, and/or one or more other machine-learned models.
  • the sensor processing system 60 may include one or more detection models 66 that can be utilized to detect particular features in the processed dataset.
  • one or more images can be processed with the one or more detection models 66 to generate one or more bounding boxes associated with detected features in the one or more images.
  • one or more segmentation models 68 can be utilized to segment one or more portions of the dataset from the one or more datasets.
  • the one or more segmentation models 68 may utilize one or more segmentation masks (e.g., one or more segmentation masks manually generated and/or generated based on the one or more bounding boxes) to segment a portion of an image, a portion of an audio file, and/or a portion of text.
  • the segmentation may include isolating one or more detected objects and/or removing one or more detected objects from an image.
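  • As a minimal sketch of the bounding-box-driven segmentation described above, the following Python example builds a binary mask from detection boxes and uses it to isolate the detected region of an image. The function names (`mask_from_boxes`, `isolate`), the `(x0, y0, x1, y1)` box layout, and the nested-list image representation are assumptions made for illustration only; a rectangular box mask is a coarse stand-in for the output of a learned segmentation model.

```python
def mask_from_boxes(height, width, boxes):
    """Build a binary mask that is 1 inside any box and 0 elsewhere.

    Each box is (x0, y0, x1, y1) with exclusive upper bounds (an assumed layout).
    """
    mask = [[0] * width for _ in range(height)]
    for x0, y0, x1, y1 in boxes:
        # Clamp the box to the image so out-of-range boxes are tolerated.
        for y in range(max(0, y0), min(height, y1)):
            for x in range(max(0, x0), min(width, x1)):
                mask[y][x] = 1
    return mask


def isolate(image, mask, fill=0):
    """Keep pixels where the mask is 1 and replace all others with `fill`."""
    return [
        [pixel if keep else fill for pixel, keep in zip(row, mask_row)]
        for row, mask_row in zip(image, mask)
    ]


# Toy 3x3 single-channel image and one detection box covering the lower right.
image = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
mask = mask_from_boxes(3, 3, [(1, 1, 3, 3)])
isolated = isolate(image, mask)
```

  Removing a detected object rather than isolating it would correspond to inverting the mask before applying it.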
  • the one or more classification models 70 can be utilized to process image data, text data, audio data, latent encoding data, multimodal data, and/or other data to generate one or more classifications.
  • the one or more classification models 70 can include one or more image classification models, one or more object classification models, one or more text classification models, one or more audio classification models, and/or one or more other classification models.
  • the one or more classification models 70 can process data to determine one or more classifications.
  • data may be processed with one or more embedding models 72 to generate one or more embeddings.
  • one or more images can be processed with the one or more embedding models 72 to generate one or more image embeddings in an embedding space.
  • the one or more image embeddings may be associated with one or more image features of the one or more images.
  • the one or more embedding models 72 may be configured to process multimodal data to generate multimodal embeddings.
  • the one or more embeddings can be utilized for classification, search, and/or learning embedding space distributions.
  • the sensor processing system 60 may include one or more search engines 74 that can be utilized to perform one or more searches.
  • the one or more search engines 74 may crawl one or more databases (e.g., one or more local databases, one or more global databases, one or more private databases, one or more public databases, one or more specialized databases, and/or one or more general databases) to determine one or more search results.
  • the one or more search engines 74 may perform feature matching, text based search, embedding based search (e.g., k-nearest neighbor search), metadata based search, multimodal search, web resource search, image search, text search, and/or application search.
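  • As one illustration of the embedding based search mentioned above, the following Python sketch ranks stored embeddings by cosine similarity to a query embedding and returns the k nearest neighbors. The index contents, the item identifiers, and the function names are hypothetical, and a brute-force scan is used for clarity; a deployed system would more likely use an approximate nearest-neighbor index.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # Treat a zero vector as dissimilar to everything.
    return dot / (norm_a * norm_b)


def knn_search(query, index, k=2):
    """Return the ids of the k stored embeddings most similar to the query."""
    scored = [(cosine_similarity(query, vec), item_id) for item_id, vec in index.items()]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [item_id for _, item_id in scored[:k]]


# Hypothetical image embeddings keyed by content-item id.
index = {
    "cat_photo": [0.9, 0.1, 0.0],
    "dog_photo": [0.7, 0.3, 0.1],
    "car_photo": [0.0, 0.2, 0.9],
}
results = knn_search([1.0, 0.0, 0.0], index, k=2)
```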
  • the sensor processing system 60 may include one or more multimodal processing blocks 76, which can be utilized to aid in the processing of multimodal data.
  • the one or more multimodal processing blocks 76 may generate a multimodal query and/or a multimodal embedding to be processed by one or more machine-learned models and/or one or more search engines 74.
  • the output(s) of the sensor processing system 60 can then be processed with an output determination system 80 to determine one or more outputs to provide to a user.
  • the output determination system 80 may include heuristic based determinations, machine-learned model based determinations, user selection based determinations, and/or context based determinations.
  • the output determination system 80 may determine how and/or where to provide the one or more search results in a search results interface 82.
  • the output determination system 80 may determine how and/or where to provide the one or more machine-learned model outputs in a machine-learned model output interface 84.
  • the one or more search results and/or the one or more machine-learned model outputs may be provided for display via one or more user interface elements.
  • the one or more user interface elements may be overlaid over displayed data.
  • one or more detection indicators may be overlaid over detected objects in a viewfinder.
  • the one or more user interface elements may be selectable to perform one or more additional searches and/or one or more additional machine-learned model processes.
  • the user interface elements may be provided as specialized user interface elements for specific applications and/or may be provided uniformly across different applications.
  • the one or more user interface elements can include pop-up displays, interface overlays, interface tiles and/or chips, carousel interfaces, audio feedback, animations, interactive widgets, and/or other user interface elements.
  • data associated with the output(s) of the sensor processing system 60 may be utilized to generate and/or provide an augmented-reality experience and/or a virtual-reality experience 86.
  • the one or more obtained datasets may be processed to generate one or more augmented-reality rendering assets and/or one or more virtual-reality rendering assets, which can then be utilized to provide an augmented-reality experience and/or a virtual-reality experience 86 to a user.
  • the augmented-reality experience may render information associated with an environment into the respective environment.
  • objects related to the processed dataset(s) may be rendered into the user environment and/or a virtual environment.
  • Rendering dataset generation may include training one or more neural radiance field models to learn a three-dimensional representation for one or more objects.
  • one or more action prompts 88 may be determined based on the output(s) of the sensor processing system 60. For example, a search prompt, a purchase prompt, a generate prompt, a reservation prompt, a call prompt, a redirect prompt, and/or one or more other prompts may be determined to be associated with the output(s) of the sensor processing system 60. The one or more action prompts 88 may then be provided to the user via one or more selectable user interface elements.
  • a respective action of the respective action prompt may be performed (e.g., a search may be performed, a purchase application programming interface may be utilized, and/or another application may be opened).
  • the one or more datasets and/or the output(s) of the sensor processing system 60 may be processed with one or more generative models 90 to generate a model-generated content item that can then be provided to a user.
  • the generation may be prompted based on a user selection and/or may be automatically performed (e.g., automatically performed based on one or more conditions, which may be associated with a threshold amount of search results not being identified).
  • the one or more generative models 90 can include language models (e.g., large language models and/or vision language models), image generation models (e.g., text-to-image generation models and/or image augmentation models), audio generation models, video generation models, graph generation models, and/or other data generation models (e.g., other content generation models).
  • the one or more generative models 90 can include one or more transformer models, one or more convolutional neural networks, one or more recurrent neural networks, one or more feedforward neural networks, one or more generative adversarial networks, one or more self-attention models, one or more embedding models, one or more encoders, one or more decoders, and/or one or more other models.
  • the one or more generative models 90 can include one or more autoregressive models (e.g., a machine-learned model trained to generate predictive values based on previous behavior data) and/or one or more diffusion models (e.g., a machine-learned model trained to generate predicted data based on generating and processing distribution data associated with the input data).
  • the one or more generative models 90 can be trained to process input data and generate model-generated content items, which may include a plurality of predicted words, pixels, signals, and/or other data.
  • the model-generated content items may include novel content items that are not the same as any pre-existing work.
  • the one or more generative models 90 can leverage learned representations, sequences, and/or probability distributions to generate the content items, which may include phrases, storylines, settings, objects, characters, beats, lyrics, and/or other aspects that are not included in pre-existing content items.
  • the one or more generative models 90 may include a vision language model.
  • the vision language model can be trained, tuned, and/or configured to process image data and/or text data to generate a natural language output.
  • the vision language model may leverage a pre-trained large language model (e.g., a large autoregressive language model) with one or more encoders (e.g., one or more image encoders and/or one or more text encoders) to provide detailed natural language outputs that emulate natural language composed by a human.
  • the vision language model may be utilized for zero-shot image classification, few-shot image classification, image captioning, multimodal query distillation, multimodal question answering, and/or may be tuned and/or trained for a plurality of different tasks.
  • the vision language model can perform visual question answering, image caption generation, feature detection (e.g., content monitoring (e.g., for inappropriate content)), object detection, scene recognition, and/or other tasks.
  • the vision language model may leverage a pre-trained language model that may then be tuned for multimodality. Training and/or tuning of the vision language model can include image-text matching, masked-language modeling, multimodal fusing with cross attention, contrastive learning, prefix language model training, and/or other training techniques.
  • the vision language model may be trained to process an image to generate predicted text that is similar to ground truth text data (e.g., a ground truth caption for the image).
  • the vision language model may be trained to replace masked tokens of a natural language template with textual tokens descriptive of features depicted in an input image.
  • the training, tuning, and/or model inference may include multi-layer concatenation of visual and textual embedding features.
  • the vision language model may be trained and/or tuned via jointly learning image embedding and text embedding generation, which may include training and/or tuning a system to map embeddings to a joint feature embedding space that maps text features and image features into a shared embedding space.
  • the joint training may include image-text pair parallel embedding and/or may include triplet training.
  • the images may be utilized and/or processed as prefixes to the language model.
  • the one or more generative models 90 may be stored on-device and/or may be stored on a server computing system.
  • the one or more generative models 90 can perform on-device processing to determine suggested searches, suggested actions, and/or suggested prompts.
  • the one or more generative models 90 may include one or more compact vision language models that may include fewer parameters than a vision language model stored and operated by the server computing system.
  • the compact vision language model may be trained via distillation training.
  • the vision language model may process the display data to generate suggestions.
  • the display data can include a single image descriptive of a screenshot and/or may include image data, metadata, and/or other data descriptive of a period of time preceding the current displayed content (e.g., the applications, images, videos, messages, and/or other content viewed within the past 30 seconds).
  • the user computing device may generate and store a rolling buffer window (e.g., 30 seconds) of data descriptive of content displayed during the buffer. Once the time has elapsed, the data may be deleted.
  • the rolling buffer window data may be utilized to determine a context, which can be leveraged for query, content, action, and/or prompt suggestion.
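  • The rolling buffer window described above can be sketched as a time-bounded queue, assuming each displayed content item is recorded with a timestamp in seconds. The class name `RollingBuffer`, its methods, and the string content labels are hypothetical; timestamps are passed in explicitly so the behavior is easy to follow, whereas a device implementation would likely read a monotonic clock.

```python
from collections import deque


class RollingBuffer:
    """Keeps only content records that fall within the last `window_seconds`."""

    def __init__(self, window_seconds=30.0):
        self.window_seconds = window_seconds
        self._items = deque()  # (timestamp, content) pairs, oldest first

    def add(self, timestamp, content):
        """Record newly displayed content and drop anything that has expired."""
        self._items.append((timestamp, content))
        self._evict(timestamp)

    def _evict(self, now):
        # Delete records once they are older than the window.
        while self._items and now - self._items[0][0] > self.window_seconds:
            self._items.popleft()

    def snapshot(self, now):
        """Return the content still inside the window at time `now`."""
        self._evict(now)
        return [content for _, content in self._items]


buffer = RollingBuffer(window_seconds=30.0)
buffer.add(0.0, "messaging app")
buffer.add(10.0, "photo of a landmark")
buffer.add(35.0, "browser page")
recent = buffer.snapshot(35.0)
```

  The snapshot is the data from which a context could be determined for query, content, action, and/or prompt suggestion.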
  • the generative models 90 can include machine-learned sequence processing models.
  • An example system can pass inputs to sequence processing models.
  • Sequence processing models can include one or more machine-learned components.
  • Sequence processing models can process the data from inputs to obtain an input sequence.
  • Input sequence can include one or more input elements obtained from inputs.
  • the sequence processing model can process the input sequence using prediction layers to generate an output sequence.
  • the output sequence can include one or more output elements generated based on input sequence.
  • the system can generate outputs based on output sequence.
  • Sequence processing models can include one or multiple machine-learned model components configured to ingest, generate, or otherwise reason over sequences of information.
  • some example sequence processing models in the text domain are referred to as “Large Language Models,” or LLMs. See, e.g., PaLM 2 Technical Report, Google, https://ai.google/static/documents/palm2techreport.pdf (n.d.).
  • sequence processing models can operate in other domains, such as image domains, see, e.g., Dosovitskiy et al., An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale, arXiv:2010.11929v2 (Jun.3, 2021), audio domains, see, e.g., Agostinelli et al., MusicLM: Generating Music From Text, arXiv:2301.11325v1 (Jan.26, 2023), biochemical domains, see, e.g., Jumper et al., Highly accurate protein structure prediction with AlphaFold, 596 Nature 583 (Aug.26, 2021), by way of example.
  • Sequence processing models can process one or multiple types of data simultaneously. Sequence processing models can include relatively large models (e.g., more parameters, computationally expensive, etc.), relatively small models (e.g., fewer parameters, computationally lightweight, etc.), or both.
  • In general, sequence processing models can obtain an input sequence using data from inputs. For instance, input sequence can include a representation of data from inputs in a format understood by sequence processing models. One or more machine-learned components of sequence processing models can ingest the data from inputs, parse the data into pieces compatible with the processing architectures of sequence processing models (e.g., via “tokenization”), and project the pieces into an input space associated with prediction layers (e.g., via “embedding”).
  • Sequence processing models can ingest the data from inputs and parse the data into a sequence of elements to obtain input sequence. For example, a portion of input data from inputs can be broken down into pieces that collectively represent the content of the portion of the input data. The pieces can provide the elements of the sequence.
  • processing the input data can include tokenization.
  • a tokenizer may process a given portion of an input source and output a series of tokens (e.g., corresponding to input elements) that represent the portion of the input source.
  • Various approaches to tokenization can be used. For instance, textual input sources can be tokenized using a byte-pair encoding (BPE) technique.
  • Image-based input sources can be tokenized by extracting and serializing patches from an image.
  • arbitrary data types can be serialized and processed into an input sequence.
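  • To make the byte-pair encoding (BPE) tokenization mentioned above concrete, the following Python sketch learns a small list of merges from a toy corpus and applies them to a new word. It operates on characters rather than bytes for readability; the corpus, the tie-breaking rule, and the function names are assumptions for illustration and are not taken from any particular tokenizer library.

```python
from collections import Counter


def merge_pair(symbols, pair):
    """Replace each adjacent occurrence of `pair` with one merged symbol."""
    merged, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            merged.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            merged.append(symbols[i])
            i += 1
    return tuple(merged)


def learn_bpe_merges(words, num_merges):
    """Repeatedly merge the most frequent adjacent symbol pair in the corpus."""
    corpus = Counter(tuple(word) for word in words)
    merges = []
    for _ in range(num_merges):
        pair_counts = Counter()
        for symbols, freq in corpus.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break
        # Ties are broken by the pair itself so the result is deterministic.
        best = max(pair_counts, key=lambda p: (pair_counts[p], p))
        merges.append(best)
        updated = Counter()
        for symbols, freq in corpus.items():
            updated[merge_pair(symbols, best)] += freq
        corpus = updated
    return merges


def tokenize(word, merges):
    """Split a word into characters, then apply the learned merges in order."""
    symbols = tuple(word)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)


merges = learn_bpe_merges(["low", "low", "lower", "lowest"], num_merges=2)
tokens = tokenize("lowly", merges)
```

  In this toy run the frequent substring "low" becomes a single token, while the rarer suffix remains split into characters.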
  • Prediction layers can predict one or more output elements based on the input elements.
  • Prediction layers can include one or more machine-learned model architectures, such as one or more layers of learned parameters that manipulate and transform the inputs to extract higher-order meaning from, and relationships between, input elements. In this manner, for instance, example prediction layers can predict new output elements in view of the context provided by input sequence.
  • Prediction layers can evaluate associations between portions of input sequence and a particular output element. These associations can inform a prediction of the likelihood that a particular output follows the input context. For example, consider the textual snippet, “The carpenter’s toolbox was small and heavy. It was full of ___.” Example prediction layers can identify that “It” refers back to “toolbox” by determining a relationship between the respective embeddings.
  • Example prediction layers can also link “It” to the attributes of the toolbox, such as “small” and “heavy.” Based on these associations, prediction layers can, for instance, assign a higher probability to the word “nails” than to the word “sawdust.”
  • a transformer is an example architecture that can be used in prediction layers. See, e.g., Vaswani et al., Attention Is All You Need, arXiv:1706.03762v7 (Aug.2, 2023).
  • a transformer is an example of a machine-learned model architecture that uses an attention mechanism to compute associations between items within a context window.
  • the context window can include a sequence that contains input sequence and potentially one or more output elements.
  • a transformer block can include one or more attention layers and one or more post-attention layers (e.g., feedforward layers, such as a multi-layer perceptron).
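  • The attention mechanism referenced above can be illustrated with scaled dot-product attention as described in Vaswani et al. The following Python sketch is a simplified, single-head version written with plain lists; it assumes the query, key, and value vectors have already been produced, and it omits the learned projections, masking, and multiple heads that a full transformer block would include. The function names are illustrative.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention(queries, keys, values):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d)) V."""
    d = len(keys[0])
    outputs = []
    for q in queries:
        # Association between this query and every item in the context window.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        weights = softmax(scores)
        # Weighted average of the value vectors.
        out = [
            sum(w * v[j] for w, v in zip(weights, values))
            for j in range(len(values[0]))
        ]
        outputs.append(out)
    return outputs


# Two context items with identical keys receive equal weight,
# so the output is the mean of their value vectors.
attended = attention(
    queries=[[1.0, 0.0]],
    keys=[[1.0, 0.0], [1.0, 0.0]],
    values=[[2.0, 0.0], [4.0, 0.0]],
)
```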
  • Prediction layers can include other machine-learned model architectures in addition to or in lieu of transformer-based architectures. For example, recurrent neural networks (RNNs) and long short-term memory (LSTM) models can also be used, as well as convolutional neural networks (CNNs).
  • prediction layers can leverage various kinds of artificial neural networks that can understand or generate sequences of information.
  • Output sequence can include or otherwise represent the same or different data types as input sequence. For instance, input sequence can represent textual data, and output sequence can represent textual data.
  • the input sequence can represent image, audio, or audiovisual data, and the output sequence can represent textual data (e.g., describing the image, audio, or audiovisual data). It is to be understood that prediction layers, and any other interstitial model components of sequence processing models, can be configured to receive a variety of data types in input sequences and output a variety of data types in output sequences.
  • the output sequence can have various relationships to an input sequence. Output sequence can be a continuation of input sequence. The output sequence can be complementary to the input sequence. The output sequence can translate, transform, augment, or otherwise modify input sequence. The output sequence can answer, evaluate, confirm, or otherwise respond to input sequence.
  • the output sequence can implement (or describe instructions for implementing) an instruction provided via an input sequence.
  • the output sequence can be generated autoregressively. For instance, for some applications, an output of one or more prediction layers can be passed through one or more output layers (e.g., softmax layer) to obtain a probability distribution over an output vocabulary (e.g., a textual or symbolic vocabulary) conditioned on a set of input elements in a context window. In this manner, for instance, the output sequence can be autoregressively generated by sampling a likely next output element, adding that element to the context window, re-generating the probability distribution based on the updated context window, sampling the next output element, and so forth.
  • the output sequence can also be generated non-autoregressively. For instance, multiple output elements of the output sequence can be predicted together without explicit sequential conditioning on each other.
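  • The autoregressive generation loop described above can be sketched as follows. The `toy_next_logits` function is a hypothetical stand-in for the prediction layers (it hard-codes a strong preference for "nails" after "of", echoing the toolbox example), and the vocabulary, the end-of-sequence marker, and the function names are assumptions for illustration.

```python
import math
import random


def softmax(logits):
    """Convert logits into a probability distribution over the vocabulary."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def generate(next_logits, vocab, context, max_new, rng, eos="<eos>"):
    """Sample one element at a time, feeding each sample back into the context."""
    context = list(context)
    for _ in range(max_new):
        # Distribution over the vocabulary conditioned on the current context window.
        probs = softmax(next_logits(context))
        element = rng.choices(vocab, weights=probs, k=1)[0]
        if element == eos:
            break
        context.append(element)  # The updated context conditions the next step.
    return context


VOCAB = ["nails", "sawdust", "<eos>"]


def toy_next_logits(context):
    """Hypothetical stand-in for prediction layers; not a trained model."""
    if context[-1] == "of":
        return [50.0, -50.0, -50.0]  # Strongly prefer "nails" after "of".
    return [-50.0, -50.0, 50.0]  # Otherwise end the sequence.


generated = generate(
    toy_next_logits, VOCAB, ["It", "was", "full", "of"], max_new=5, rng=random.Random(0)
)
```

  Non-autoregressive generation would instead predict several output elements in one pass, without feeding each sampled element back into the context.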
  • the output sequence can include one or multiple portions or elements.
  • the output sequence can include multiple elements corresponding to multiple portions of a generated output sequence (e.g., a textual sentence, values of a discretized waveform, computer code, etc.).
  • the output sequence can include a single element associated with a classification output. For instance, an output “vocabulary” can include a set of classes into which an input sequence is to be classified.
  • a vision transformer block can pass latent state information to a multilayer perceptron that outputs a likely class value associated with an input image.
  • the output determination system 80 may process the one or more datasets and/or the output(s) of the sensor processing system 60 with a data augmentation block 92 to generate augmented data.
  • one or more images can be processed with the data augmentation block 92 to generate one or more augmented images.
  • the data augmentation can include data correction, data cropping, the removal of one or more features, the addition of one or more features, a resolution adjustment, a lighting adjustment, a saturation adjustment, and/or other augmentation.
  • the one or more datasets and/or the output(s) of the sensor processing system 60 may be stored based on a data storage block 94 determination.
  • the output(s) of the output determination system 80 can then be provided to a user via one or more output components of the user computing device 52.
  • one or more user interface elements associated with the one or more outputs can be provided for display via a visual display of the user computing device 52.
  • the processes may be performed iteratively and/or continuously.
  • One or more user inputs to the provided user interface elements may condition and/or affect successive processing loops.
  • the systems and methods disclosed herein can include an autonomous information seeking visual question answering framework.
  • the systems and methods can leverage a large language model (LLM) to dynamically strategize the utilization of external tools and to investigate their outputs, thereby acquiring knowledge that can be utilized to provide answers to the posed questions.
  • the task can present a combinatorial search space that demands a sequence of actions, including invoking application programming interfaces (APIs), analyzing the API tool call responses, and making informed decisions.
  • the systems and methods can include conducting a user study to collect a variety of instances of human decision-making when faced with one or more particular tasks.
  • the user behavior data (e.g., the collected user decision-making instances) can then be used to design a system that includes three components: a language model planner block (e.g., an LLM-powered planner) that dynamically determines which tool to use next, a language model reasoner block (e.g., an LLM-powered reasoner) that analyzes and extracts key information from the tool outputs, and a working memory component that retains the acquired information throughout the process.
  • the collected user behavior data can serve as a guide for the system in two key ways.
  • the systems and methods may generate a transition graph by analyzing the sequence of decisions made by users. The graph can delineate distinct states and can confine the set of actions available at each state.
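  • A minimal sketch of how a planner, a reasoner, and a working memory might interact under a transition graph is given below. The state names, the tool names, the graph edges, and the stub planner and reasoner are all hypothetical placeholders: in the described system the planner and reasoner would be LLM-powered and the graph would be derived from the collected user decision sequences, none of which is reproduced here.

```python
# Hypothetical transition graph: each state lists the actions allowed next.
TRANSITIONS = {
    "START": ["image_search", "caption"],
    "image_search": ["web_search", "ANSWER"],
    "caption": ["web_search", "ANSWER"],
    "web_search": ["ANSWER"],
}


def run_agent(question, planner, reasoner, tools, max_steps=5):
    """Alternate planning and reasoning until the planner chooses to answer."""
    memory = []  # Working memory: (tool name, extracted information) pairs.
    state = "START"
    for _ in range(max_steps):
        allowed = TRANSITIONS[state]
        action, query = planner(question, memory, allowed)
        if action not in allowed:
            raise ValueError(f"action {action!r} is not allowed in state {state!r}")
        if action == "ANSWER":
            return query, memory
        raw_output = tools[action](query)
        # The reasoner distills the tool output before it is stored.
        memory.append((action, reasoner(question, raw_output)))
        state = action
    return None, memory


def stub_planner(question, memory, allowed):
    """Placeholder for an LLM-powered planner."""
    if memory and "ANSWER" in allowed:
        return "ANSWER", memory[-1][1]
    return allowed[0], question


def stub_reasoner(question, raw_output):
    """Placeholder for an LLM-powered reasoner; here it only trims whitespace."""
    return raw_output.strip()


# Placeholder tools that return canned strings instead of calling real services.
tools = {
    "image_search": lambda query: "  Eiffel Tower  ",
    "caption": lambda query: "a tall iron tower",
    "web_search": lambda query: "Paris, France",
}

answer, memory = run_agent("What is this landmark?", stub_planner, stub_reasoner, tools)
```

  Confining the planner to the actions listed for the current state is what keeps the search space manageable in this sketch.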
  • Large language models (LLMs) include, for example, GPT3 (Brown et al., “Language models are few-shot learners,” ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS, 33:1877–1901, 2020), LaMDA (Kulshreshtha et al., “Towards a human-like open-domain chatbot,” ARXIV PREPRINT, 2020), PaLM (Chowdhery et al., “PaLM: Scaling language modeling with pathways,” ARXIV PREPRINT, 2022), BLOOM (Scao et al., “BLOOM: A 176b-parameter open-access multilingual language model,” CORR, abs/2211.05100, 2022), and LLaMA (Touvron et al.).
  • the language models can demonstrate emerging abilities like in-context learning, code generation, and common sense reasoning. Additionally and/or alternatively, language models can be adapted to handle multi-modal inputs and outputs involving both vision and language.
  • visual language models (VLMs) can be utilized for image captioning, visual question answering, and open vocabulary recognition.
  • LLMs excel beyond human capabilities in tasks involving textual information retrieval
  • the existing VLMs can perform inadequately on datasets designed for visual information seeking. Many of the visual questions in the datasets can be designed in such a way that they pose a challenge even for humans, often requiring the assistance of various APIs and web search to obtain the answer.
  • the systems and methods disclosed herein may obtain state-of-the-art results on visual information seeking tasks by integrating LLMs with three types of tools: (i) computer vision tools such as object detection, OCR, image captioning models, and VQA models, which aid in extracting visual information from the image, (ii) a web search tool that assists in retrieving open world knowledge and facts, and (iii) an image search tool that enables the system to glean relevant information from metadata associated with visually similar images.
  • the systems and methods may include an LLM agent that uses the tools via tree-search decision making.
  • the systems and methods may utilize an LLM-powered planner to dynamically determine which tool to use at each step and what query to send to it.
  • the systems and methods may employ an LLM-powered reasoner that scrutinizes the output returned by the tools and extracts the crucial information from them.
  • the systems and methods can use a working memory component.
  • Figure 4 can convey an example information seeking process performed by an example implementation.
  • the systems and methods can employ a two-stage strategy, namely plan and execute. Initially, the LLM can break down a question into a plan, typically represented as a structured program or a sequence of instructions. Following the breakdown, the necessary APIs can be activated to collect the required information. However, such a two-stage strategy may struggle in more complex real-world situations. In such cases, a comprehensive plan may not be inferred merely from the initial question.
  • the systems and methods disclosed herein can include a dynamic decision-making capability. Answering visual information seeking questions can be a highly complex task, requiring the planner to take multiple steps. At each of the steps, the planner may determine which API to call and what query to send. The systems and methods disclosed herein can opt for a dynamic approach as the API call outputs may be unpredictable. In particular, the systems and methods may determine decisions at each step based on the information acquired from previous API calls, enhancing the adaptability and effectiveness of the method.
  • [0235] In some implementations, a user study can be performed to gather a wide range of instances of human decision-making when using APIs to answer questions related to visual information seeking.
  • a structured framework can be utilized to direct the Large Language Model (LLM) to use the examples for making informed decisions regarding API selection and query formulation.
  • the collected user behavior can inform the system in two ways. First, by analyzing the sequence of user decisions, the systems and methods can construct a transition graph. The graph may delineate distinct states and may constrain the set of actions available at each state. Second, the systems and methods can use the examples of user decision-making to guide the planner and reasoner with pertinent contextual instances. The contextual examples can contribute to improving the performance and effectiveness of the system.
  • the systems and methods can include a visual question answering framework that leverages a large language model (LLM) to dynamically strategize the utilization of external tools and to investigate their outputs, thereby acquiring the necessary knowledge needed to provide answers to the posed questions. Additionally and/or alternatively, the systems and methods can leverage the human decision-making data collected from a user study to develop a structured framework. The framework can guide the Large Language Model (LLM) to utilize examples of human decision-making in making informed choices concerning API selection and query construction.
  • [0237] In some implementations, the systems and methods can employ a dynamic decision-making strategy designed to respond to visual information seeking queries. For example, the systems and methods can include three primary components.
  • the system can include a planner P, whose responsibility is to determine the subsequent action, including the appropriate API call and the query to be processed.
  • the system can have a working memory M that retains information about the results obtained from API executions.
  • the system can have a reasoner R, whose role is to process the outputs from the API calls. The reasoner can determine whether the obtained information is sufficient to produce the final response, or if additional data retrieval is determined to be needed.
  • the systems and methods can utilize the data collected from this study to construct a transition graph G shown in Figure 5, which outlines all the possible actions at each given state. Additionally and/or alternatively, the systems and methods can employ real-life decision-making examples E (i.e., users choose which tool at different states) to guide the planner in choosing the appropriate action at each stage of the process.
  • Algorithm 1 below can present the operations of the planner P.
  • the planner can undertake a series of steps each time a decision is required regarding which tool to employ and what query to send to the tool. Firstly, based on the present state, the planner can provide a range of potential subsequent actions $A_s$.
  • the potential action space $A_s$ may be large, making the search space intractable.
  • the planner can refer to the human decisions from the transition graph G to eliminate irrelevant actions.
  • the planner may exclude the actions that have already been taken before and are stored in the working memory M.
  • the procedure can include $A_s \leftarrow \phi(\mathit{state}, G, M)$.
  • the systems and methods can collect a set of relevant in-context examples $E_s$ that are assembled from the decisions previously made by humans during the user study relevant to actions $A_s$, that is $E_s \leftarrow \theta(E, A_s)$.
  • the planner can formulate a prompt, denoted by $p_s$.
  • the prompt $p_s$ can then be sent to the LLM which returns a structured answer, determining the next tool $t_s$ to be activated and the query $q_s$ to be dispatched to the data processing tool.
  • the action can be denoted by $t_s, q_s \leftarrow \mathrm{LLM}(p_s)$.
  • the design can allow the planner to be invoked multiple times throughout the process, thereby facilitating dynamic decision-making that gradually leads to answering the input query.
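The planner steps described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the disclosed implementation: the names `WorkingMemory`, `feasible_actions`, and `plan`, the `'tool | query'` reply format, and the fallback to the first feasible action are all hypothetical, and `llm` stands for any callable that maps a prompt string to a reply string.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional, Set, Tuple


@dataclass
class WorkingMemory:
    """Hypothetical stand-in for the working memory M."""
    question: str
    facts: List[str] = field(default_factory=list)
    # (state, tool) pairs that have already been tried
    taken: Set[Tuple[str, str]] = field(default_factory=set)


def feasible_actions(state: str, graph: Dict[str, List[str]],
                     memory: WorkingMemory) -> List[str]:
    """Actions the transition graph allows at `state`, minus those already taken."""
    return [tool for tool in graph.get(state, [])
            if (state, tool) not in memory.taken]


def plan(state: str, graph: Dict[str, List[str]],
         examples: List[Tuple[str, str]], memory: WorkingMemory,
         llm: Callable[[str], str]) -> Optional[Tuple[str, str]]:
    """Return (tool, query) for the next step, or None if nothing is feasible."""
    actions = feasible_actions(state, graph, memory)
    if not actions:
        return None
    # Keep only the in-context examples that concern a feasible action.
    shots = [demo for tool, demo in examples if tool in actions]
    prompt = "\n".join(
        shots
        + [f"Known facts: {'; '.join(memory.facts) or 'none'}",
           f"Question: {memory.question}",
           f"Choose one tool from {actions} and reply as 'tool | query'."]
    )
    tool, _, query = llm(prompt).partition("|")
    tool = tool.strip()
    if tool not in actions:
        # Assumed guard: ignore a reply that names an infeasible tool.
        tool = actions[0]
    return tool, query.strip()
```

Restricting the prompt to feasible actions keeps the choice presented to the language model small, which is the role the transition graph and the working memory play in the description above.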
  • Algorithm 2 can be descriptive of an overall decision-making workflow of automatic visual information seeking. The entire process can repeat until a satisfactory answer is produced. Initially, the working memory may be populated with the input visual question, and the initial state can be set to START. At each iteration, the system can first invoke the planner P to determine the next tool and the query to employ, as outlined in Algorithm 1. Subsequently, the selected external tool can execute and can deliver the output $o_s$.
  • the system can employ a reasoner R to analyze the output $o_s$, can extract the useful information, and can decide into which category the tool output falls: informative, uninformative, or final answer.
  • the method can utilize the LLM with appropriate prompting and in-context examples to perform the reasoning.
  • if the reasoner determines that the tool output contains the final answer, the model can generate and output the final response, thus concluding the task. If the model determines that the tool output is uninformative, the machine-learned model may revert to the planner to select another action based on the current state. If the language model finds the tool output to be useful, the machine-learned model may modify the state and transfer control back to the planner to make a new decision at the new state.
  • [0245] To illustrate with a tangible example, the systems and methods can refer to the output that the model would receive as depicted in Figure 6(c). There can be several entities within the answer.
  • the role of the reasoner may be twofold: to determine which entity is pertinent for responding to the question and to assess whether the model has obtained the necessary information to transition to the next state.
  • the approach, which employs dynamic decision-making coupled with backtracking, can differ from previous methods that follow a plan-then-execute paradigm.
  • the system can be structured to make decisions grounded to the results of current executions and to conduct iterative searches for tool combinations. The process may eventually yield an effective strategy to accomplish the task.
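The overall plan, execute, and reason loop with backtracking can be sketched as below. The sketch is illustrative only and rests on several assumptions that do not come from the disclosure: the function name `seek_answer`, the step budget `max_steps`, the convention that the new state is named after the tool that produced informative output, and the signatures of the `planner`, `tools`, and `reasoner` callables.

```python
from typing import Callable, Dict, List, Optional, Set, Tuple

INFORMATIVE = "informative"
UNINFORMATIVE = "uninformative"
FINAL = "final answer"

Planner = Callable[[str, List[str], Set[Tuple[str, str]]], Optional[Tuple[str, str]]]
Reasoner = Callable[[str, List[str]], Tuple[str, str]]


def seek_answer(question: str, planner: Planner,
                tools: Dict[str, Callable[[str], str]],
                reasoner: Reasoner, max_steps: int = 10) -> Optional[str]:
    """Repeat plan -> execute -> reason until a final answer or the step budget ends."""
    memory: List[str] = [question]          # working memory starts with the question
    state = "START"
    tried: Set[Tuple[str, str]] = set()     # (state, tool) pairs already attempted
    for _ in range(max_steps):
        action = planner(state, memory, tried)
        if action is None:                  # no feasible action is left
            return None
        tool, query = action
        tried.add((state, tool))
        output = tools[tool](query)         # execute the selected tool
        verdict, info = reasoner(output, memory)
        if verdict == FINAL:
            return info
        if verdict == INFORMATIVE:
            memory.append(info)
            state = tool                    # assumed: the state follows the last useful tool
        # UNINFORMATIVE: stay in the same state so the planner picks another action
    return None
```

Because an uninformative output leaves the state unchanged while the attempted action is recorded, the next planner call at that state naturally selects a different tool, which is one simple way to realize the backtracking behavior described above.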
  • the systems and methods can equip a system with a comprehensive suite of tools.
  • the suite of tools may include an image captioning model, a visual question answering model, an object detection model, an image search engine, a web search engine, an optical character recognition model, a language model tuned (and/or conditioned) for short question-and-answer tasks, and/or other data processing tools.
  • the image captioning model may employ a captioning model (e.g., the PALI 17B (Chern et al., “TPU-KNN: K nearest neighbor search at peak flop/s,” CORR, abs/2206.14286, 2022.)), which obtains state-of-the-art results for image captioning.
  • the tool may have the capability to generate captions for either the entire image or for a cropped image corresponding to the bounding box of a detected object.
  • the visual question answering model may utilize a VQA model (e.g., the PALI 17B), which may have been fine-tuned on a visual question-and-answer specific dataset (e.g., the VQA-v2 dataset (Goyal et al., “Making the V in VQA matter: Elevating the role of image understanding in visual question answering,” IN 2017 IEEE CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION, CVPR 2017, Honolulu, HI, USA, July 21-26, 2017, pages 6325–6334, IEEE COMPUTER SOCIETY, 2017.)).
  • the tool can intake an image and a question as inputs and can provide a text-based answer as the output.
  • the object detection model may use an object detector trained on a super-set of Open Images dataset (Kuznetsova et al., “The open images dataset v4: Unified image classification, object detection, and visual relationship detection at scale,” IJCV, 2020.) categories that may be provided by a visual search application programming interface (e.g., the Google Lens API (“Google lens,” Web interface available at https://images.google.com.)).
  • the image search engine may utilize a reverse image search system (e.g., Google Image Search) to obtain a broad range of information related to the image crop of a detected box (e.g., as provided in Google Lens API).
  • the information may encompass various details, such as knowledge graph entities, titles of associated products, and captions of analogous or identical images. The availability of these details can vary based on the image crop input provided to the reverse image search system (e.g., Google Image Search).
  • the planner may consider the utilization of each piece of information as a separate action. The consideration may be due to the fact that each information set may contain hundreds of tokens that necessitate complex processing and reasoning.
  • the optical character recognition model may process images with text to determine the depicted text.
  • images may include textual content such as street names or logos.
  • the systems and methods may leverage the Optical Character Recognition (OCR) feature available in an image processing application (e.g., the Google Lens API).
  • the web search engine may enable the system to acquire up-to-date world knowledge and retrieve relevant documents on any topic of interest.
  • the systems and methods may employ a web search platform (e.g., the Google Web Search API).
  • the web search engine may process a text-based query as input and may produce the following outputs: (i) related document links and snippets, (ii) in certain instances, a knowledge panel providing a direct answer to the query, and (iii) a plurality of questions (e.g., five questions) that are related to the input query. If a knowledge panel is available, the system may parse the knowledge panel into a sentence or a few sentences that summarize the knowledge panel information.
  • [0254] In some implementations, the systems and methods may incorporate a Language Model (LLM) powered question-answering component as another tool. The tool may process a query in text form and may generate an answer also in text form.
  • Many of the visual questions in existing datasets may ask for fine-grained answers, which poses a challenge even for humans, often requiring the assistance of various APIs and web searches for answers.
  • systems and methods may include performing a user study. For example, a user study may be performed to understand how humans utilize external tools to answer visual queries that involve seeking information.
  • the user may be provided with an identical set of tools as the tools accessible to the machine-learned model via an API. The users may be presented with the input image and question, along with image crops for each detected object.
  • tools such as an image captioning model, an object detection model, a visual question answering model, a web search engine, and/or an image search engine may be made available to the user.
  • the user may be offered one or multiple buttons associated with each box.
  • the diverse information may include details such as corresponding knowledge graph entities, captions of similar images, titles of associated related products, and captions of identical images.
  • An example set of tools and APIs is depicted in Figure 6(b).
  • the image captioning model, the object detection model, the image search engine, and/or the visual question answering model may be invoked, and the resulting output may be displayed to the user.
  • the system may record the sequence of actions taken by the user and the outputs that they receive at each step.
  • in Figure 6, an example of how a user performs four actions to answer the question is displayed: i) display entities in box 2, ii) show the caption of similar images to box 2, iii) conduct a search for “In what year was Harley-Davidson XA built?”, and iv) utilize PALM using the combination of the search output and the question “In what year was Harley-Davidson XA built?”.
  • the user may click on either of the two buttons: “Success! Found the Answer!” or “Couldn’t Find the Answer.” Subsequently, a new visual question may be presented to them.
  • the collected user behavior may serve as a guide for the system in two key ways.
  • the system may construct a transition graph by analyzing the sequence of decisions made by users.
  • the graph can define distinct states and may restrict the available set of actions at each state. For example, at the START state, the system may take one of these three actions: image captioning, visual question answering, or object detection.
  • Figure 5 can illustrate the transition graph that has been constructed based on the decision-making process of the users.
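One way a transition graph might be derived from the recorded user action sequences is sketched below. The sketch assumes, purely for illustration, that each state is identified with the most recent action taken and that every session begins at START; the function name `build_transition_graph` and the tool labels in the usage note are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, List


def build_transition_graph(sessions: Iterable[List[str]]) -> Dict[str, List[str]]:
    """Map each state to the sorted set of actions users took from that state."""
    graph = defaultdict(set)
    for actions in sessions:
        state = "START"
        for action in actions:
            graph[state].add(action)
            state = action          # assumed: the state is named after the last action
    return {state: sorted(nexts) for state, nexts in graph.items()}
```

For example, given the sessions `[["detect", "image_search", "web_search"], ["caption"], ["vqa"]]`, the START state would map to the three actions caption, detect, and vqa, which mirrors the restriction on the START state described above.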
  • the systems and methods may utilize the examples of user decision-making to guide the planner and reasoner with relevant contextual instances.
  • the in-context examples can aid in enhancing the performance and effectiveness of the system.
  • the user study may involve ten or more participants who collectively answered a total of 644 visual questions. During the study, the system may present users with visual questions that were randomly selected from one or more datasets.
  • the approach can allow the system to provide the participants with a varied and diverse set of visual questions to assess and respond to.
  • the systems and methods disclosed herein can include an approach that equips large language models (LLMs) with the ability to use a variety of tools for answering knowledge-intensive visual questions.
  • the methodology, anchored in human decision-making data collected from a user study, can employ a structured framework that uses an LLM-powered planner to dynamically decide on tool selection and query formation.
  • An LLM-powered reasoner can be tasked with processing and extracting key information from the output of the selected tool.
  • the systems and methods may iteratively employ the planner and reasoner to leverage different tools until all necessary information required to answer the visual question is amassed.
  • the experiments can follow the decision-making workflow in Algorithm 2 to implement AVIS to solve visual questions.
  • the system may write the basic instructions for describing each tool, and keep a pool of real user behavior examples for the selection of each tool, which may be collected in the user study.
  • the system may prepare the prompt based on the feasible action list $A_s$.
  • the system may write a prompt for all APIs that return a long list of results, including Object Detection, Product Detection, Web Image Search and Web Text Search, that guides the reasoner to extract the relevant information.
  • the system may design the reasoner in a way such that the “uninformative” answers can be detected.
  • the system may manually prepare several bad examples that do not provide any useful information, and pass them to the reasoner as a part of the prompt.
  • the system may use the frozen PALM 540B language model (Chowdhery et al. “PaLM: Scaling Language Modeling with Pathways,” arXiv (Apr.5, 2022), https://arxiv.org/abs/2204.02311.) for both the planner and the reasoner, with deterministic generation ensured by setting the temperature parameter to zero.
  • the experiment may use 10 examples as in-context prompts for each dataset, and report the VQA accuracy (Goyal et al., “Making the V in VQA matter: Elevating the role of image understanding in visual question answering,” In 2017 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2017, pages 6325–6334 (July 2017).) as the evaluation metric.
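For reference, the VQA accuracy metric in its commonly quoted simplified form credits a prediction with min(1, n/3), where n is the number of human annotators who gave the same answer. The sketch below implements only that simplified form; the official evaluation additionally normalizes answer strings and averages over annotator subsets, which is omitted here, and the function name `vqa_accuracy` is hypothetical.

```python
from typing import List


def vqa_accuracy(prediction: str, human_answers: List[str]) -> float:
    """Simplified VQA accuracy: min(1, matching human answers / 3)."""
    target = prediction.strip().lower()
    matches = sum(1 for answer in human_answers
                  if answer.strip().lower() == target)
    return min(1.0, matches / 3)
```

Under this form, an answer given by at least three annotators receives full credit, while an answer given by one or two annotators receives partial credit.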
  • the experiments can utilize one or more baselines. AVIS can have the ability to dynamically determine the relevant tools according to different states. To show that the design choice is useful, the experiments may add a number of baselines that do not include an LLM planner for dynamic decision making. Instead, they may follow a pre-determined sequence to call a list of tools.
  • the experiments may propose the following baselines: baseline-PALM w/ PALI* (which integrates the captions generated by PALI and the visual answers from PALI VQA, where PALI* denotes the combination of both the VQA and captioning tools), baseline-PALM w/ (PALI* + Object) (which in addition calls the object detection tool, and then integrates all object data, including products and text detected by OCR), and baseline-PALM w/ (PALI* + Object + Search) (a model which first selects a relevant object with the help of PALM, then sequentially executes the image search and Google search with the object name). The experiments may then call PALM again to answer the question.
  • the system may prepare a few-shot Chain-Of-Thought (COT) prompting (Wei et al., “Chain-of-Thought Prompting Elicits Reasoning in Large Language Models,” arXiv (Jan.28, 2022), https://arxiv.org/abs/2201.11903.), in which the COT prompt guides the model to explain why predictions are made based on the provided information.
  • the baselines can utilize a set of tools in a fixed order, without the capacity for dynamic decision making.
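For contrast with dynamic planning, a fixed-order baseline of the kind described above might look like the following sketch. It is an assumption-laden illustration: the function name `run_fixed_baseline`, the tool signature `(image, question) -> str`, and the prompt layout are hypothetical, and `llm` is any callable that maps a prompt to an answer.

```python
from typing import Callable, List, Tuple

Tool = Callable[[bytes, str], str]


def run_fixed_baseline(question: str, image: bytes,
                       tool_sequence: List[Tuple[str, Tool]],
                       llm: Callable[[str], str]) -> str:
    """Call every tool in a pre-determined order, then ask the LLM once."""
    context = []
    for name, tool in tool_sequence:
        # No planning step: each tool runs regardless of earlier outputs.
        context.append(f"{name}: {tool(image, question)}")
    prompt = "\n".join(context + [f"Question: {question}", "Answer:"])
    return llm(prompt)
```

Unlike the planner-driven loop, this sketch has no way to skip an unhelpful tool or to retry with a different one, which is the limitation the baselines are meant to expose.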
  • the experiments may evaluate the usefulness of each tool group (i.e., PALI*, Object, and Search) through an ablation study.
  • the first four rows can be results from their paper that do not use external knowledge, and the next two can be from their paper that use CLIP as knowledge source.
  • the tool PALI* denotes the frozen multi-task PALI-17B model for both visual question answering and image captioning; Object means object detection, and Search means image and text search.
  • Table 1 can present the results of AVIS and other baselines on the Infoseek wikidata dataset. Infoseek wikidata can be a challenging dataset that requires identifying highly specific entities.
  • PALI: “A Jointly-Scaled Multilingual Language-Image Model,” arXiv (Sep. 25, 2022).
  • AVIS, without fine-tuning and by leveraging a complete set of tools guided by 10 in-context examples, can achieve accuracies of 50.7 and 56.4 on the unseen entity and question splits, respectively.
  • the AVIS system can significantly outperform the fine-tuned results of PALI-17B, which are 16.0 and 20.7, as well as the PALM model augmented with CLIP knowledge, which are 21.9 and 18.6, respectively.
  • Table 1 can illustrate that improvements may not be solely due to the additional information provided by the external tools, but due to the dynamic decision-making pipeline.
  • the experiments can compare the results of AVIS with the three baselines that conduct sequential execution. While these baselines do improve the performance, the AVIS framework can outperform the best baseline model by up to 17.3 accuracy points. Note that AVIS and the baselines can use exactly the same set of tools. The considerable performance gap can convey the clear advantage of the dynamic decision-making design. Furthermore, the last block of Table 1 may show the importance of each tool. Removal of any of the tools can degrade the overall accuracy. Among the three tool groups, Object and Search can be more important than PALI, as they provide more fine-grained information crucial for the Infoseek dataset.
  • Table 2 can include visual question answering results (accuracy) on OK-VQA.
  • the tool PALI* denotes the frozen multi-task PALI-17B model for both visual question answering and image captioning.
  • Object can denote object detection, and search can denote image and text search.
  • the OK-VQA experiments are depicted in Table 2.
  • AVIS with few-shot in-context examples can achieve an accuracy of 60.2, higher than most of the existing methods tailored for the dataset, including KAT (Gui et al., “KAT: A Knowledge Augmented Transformer for Vision-and-Language,” arXiv (Dec. 16, 2021), https://arxiv.org/abs/2112.08614.), ReVIVE (Lin et al., “REVIVE: Regional Visual Representation Matters in Knowledge-Based Visual Question Answering,” arXiv (Jun. 2, 2022), https://arxiv.org/abs/2206.01201.), and REVEAL (Hu et al., “REVEAL: Retrieval-Augmented Visual-Language Pre-Training with Multi-Source Multimodal Knowledge Memory,” arXiv (Dec. 10, 2022), https://arxiv.org/abs/2212.05221.).
  • AVIS can achieve lower but comparable performance compared to the PALI model fine-tuned on OK-VQA. This difference, compared to Infoseek, may be attributed to the fact that most QA examples in OK-VQA rely more on commonsense knowledge than on fine-grained knowledge. Therefore, it may be feasible to encode such generic knowledge in the model parameters, and less external knowledge may be required.
  • the PALI zero-shot VQA model itself can achieve 41.6 accuracy, which can be significantly higher than in Infoseek, supporting this hypothesis. Table 2 can show that object detection is less crucial as a tool on this dataset, compared to PALI captioning and VQA.
  • One of the features of AVIS can be the ability to dynamically make decisions instead of executing a fixed sequence.
  • Figure 5 can present three examples of AVIS's dynamic planning and reasoning process. They can demonstrate the flexibility of AVIS to use different tools at various stages.
  • the reasoner design can enable AVIS to identify irrelevant information, backtrack to a previous state, and repeat the search. For instance, in the second example concerning the taxonomy of fungi, AVIS may make an incorrect decision by selecting a leaf object. However, the reasoner can identify that this is not relevant to the question, prompting AVIS to plan again. This time, the system may successfully select the object related to false turkey-tail fungi, leading to the correct answer, Stereum.
  • the training datasets can include visual question and answer datasets.
  • the training examples can include an image, a question about the image, and an answer to the question.
  • the systems and methods can decompose questions into a visual sub-question and a knowledge sub-question.
  • the systems and methods may be designed and/or utilized for visual question answering and/or other reasoning tasks.
  • the technology discussed herein makes reference to servers, databases, software applications, and other computer-based systems, as well as actions taken and information sent to and from such systems.
  • the inherent flexibility of computer-based systems allows for a great variety of possible configurations, combinations, and divisions of tasks and functionality between and among components. For instance, processes discussed herein can be implemented using a single device or component or multiple devices or components working in combination. Databases and applications can be implemented on a single system or distributed across multiple systems. Distributed components can operate sequentially or in parallel.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Databases & Information Systems (AREA)
  • Evolutionary Computation (AREA)
  • Computing Systems (AREA)
  • Medical Informatics (AREA)
  • Artificial Intelligence (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Multimedia (AREA)
  • Mathematical Physics (AREA)
  • Image Analysis (AREA)

Abstract

Systems and methods for generating a response to a visual information query can include obtaining input data, processing the input data with one or more machine-learned models, transmitting data to one or more data processing tools, and generating a response based on the input data and the one or more outputs of the one or more data processing tools. The one or more machine-learned models may iteratively process data to generate planning data that may include application programming interface calls to the one or more data processing tools. The one or more machine-learned models may be utilized for planning, reasoning, and response generation when obtaining and generating a response to a visual information query.

Description

AUTONOMOUS VISUAL INFORMATION SEEKING WITH MACHINE-LEARNED LANGUAGE MODELS

RELATED APPLICATIONS

[0001] The present application is based on and claims priority to United States Provisional Application Number 63/506,924 having a filing date of June 8, 2023. Applicant claims priority to and the benefit of such application and incorporates such application herein by reference in its entirety.

FIELD

[0002] The present disclosure relates generally to autonomous visual information seeking with machine-learned language models. More particularly, the present disclosure relates to leveraging one or more machine-learned language models for planning data processing calls and information reasoning to generate a response to a visual information query.

BACKGROUND

[0003] Large language models can display impressive language understanding and predictive capabilities. However, large language models may struggle with fact specific queries. Additionally, language models may not be trained for non-text based and/or non-embedding based data. For example, language models may be unable to properly process visual information queries associated with one or more images.
[0004] Additionally, understanding the world at large can be difficult. Whether an individual is trying to understand what the object in front of them is, trying to determine where else the object can be found, and/or trying to determine where an image on the internet was captured from, text searching alone can be difficult. In particular, users may struggle to determine which words to use. Additionally, the words may not be descriptive enough and/or abundant enough to generate desired results.

SUMMARY

[0005] Aspects and advantages of embodiments of the present disclosure will be set forth in part in the following description, or can be learned from the description, or can be learned through practice of the embodiments.
[0006] One example aspect of the present disclosure is directed to a computing system for visual information seeking. The system can include one or more processors and one or more non-transitory computer-readable media that collectively store instructions that, when executed by the one or more processors, cause the computing system to perform operations. The operations can include obtaining input data. The input data can include image data and text data. The text data can include a query associated with the image data. The operations can include processing the input data with a machine-learned model to generate first planning data. The first planning data can be descriptive of instructions to provide the input data to a first data processing tool. The operations can include transmitting, based on the first planning data, the input data to the first data processing tool to retrieve first output data. The operations can include processing the input data and the first output data with the machine-learned model to generate second planning data. The second planning data can be descriptive of instructions to transmit data to a second data processing tool. The operations can include transmitting, based on the second planning data, data to the second data processing tool to retrieve second output data and processing the input data and the second output data with the machine-learned model to generate response data. The response data can be descriptive of a response to the query. [0007] In some implementations, the first data processing tool can include an object detection model. The first output data can include one or more bounding boxes associated with one or more objects in the image data. The first output data can include one or more segmented portions of one or more images of the image data and caption data associated with the one or more segmented portions. 
The caption data can be descriptive of an object classification associated with one or more objects detected in the one or more segmented portions of one or more images. [0008] In some implementations, the second data processing tool can include a search engine. The second planning data can include a model-generated query. The model-generated query can be transmitted to the second data processing tool to retrieve the second output data. The model-generated query can be generated based on the input data and the first output data. The model-generated query can be descriptive of the query of the input data modified based on the first output data. In some implementations, the response data can include a natural language text string that is responsive to the query of the input data. [0009] In some implementations, the operations can include processing the input data and the second output data with the machine-learned model to generate third planning data. The third planning data can be descriptive of instructions to transmit data to a third data processing tool. The operations can include transmitting, based on the third planning data, data to the third data processing tool to retrieve third output data and processing the input data and the third output data with the machine-learned model to generate the response data. [0010] Another example aspect of the present disclosure is directed to a computer-implemented method for responding to visual prompts. The method can include obtaining, by a computing system including one or more processors, input data. The input data can include image data and text data. In some implementations, the text data can include a query associated with the image data. The method can include processing, by the computing system, the input data with a machine-learned model to generate first planning data. The first planning data can be descriptive of instructions to provide the image data to a first data processing tool. 
The method can include providing, by the computing system, the image data to the first data processing tool to receive first output data. The first output data can be generated with the first data processing tool based on the image data. The method can include processing, by the computing system, the first output data with the machine-learned model to generate second planning data. In some implementations, the second planning data can be descriptive of instructions to provide a particular portion of the image data to a second data processing tool. The method can include providing, by the computing system, the particular portion of the image data to the second data processing tool to receive second output data. The second output data can be generated with the second data processing tool based on the particular portion of the image data. The method can include processing, by the computing system, the text data and the second output data with the machine-learned model to generate response data. The response data can be descriptive of a response to the query. [0011] In some implementations, the machine-learned model may have been conditioned on a training dataset including a plurality of training examples. Each training example of the plurality of training examples can include a training input, a training output, and a training rationale. The training rationale can be descriptive of a sequence of processing instances and tool calls for determining the output data. The machine-learned model may have been conditioned on a training dataset including sequence data. In some implementations, sequence data can be descriptive of a sequence of actions for generating a training response in response to obtaining a particular type of input. The machine-learned model may have been conditioned on human input data descriptive of actions a human selected as being particular actions to perform when a particular input type is received. 
The human input data may have been obtained via a user interface that provides a plurality of selectable options for a user. In some implementations, the plurality of selectable options can include a plurality of external tools to call and a final output option. [0012] Another example aspect of the present disclosure is directed to one or more non-transitory computer-readable media that collectively store instructions that, when executed by one or more computing devices, cause the one or more computing devices to perform operations. The operations can include obtaining input data. The input data can include image data and text data. In some implementations, the text data can include a query associated with the image data. The operations can include processing the input data with a machine-learned model to generate first planning data. The first planning data can be descriptive of instructions to provide the image data to a first data processing tool. The operations can include providing the image data to the first data processing tool and receiving first output data from the first data processing tool based on the image data. The operations can include processing the first output data with the machine-learned model to generate second planning data. The second planning data can be descriptive of instructions to provide a particular portion of the image data to a second data processing tool. The operations can include providing the particular portion of the image data to the second data processing tool and receiving second output data from the second data processing tool based on the particular portion of the image data. The operations can include processing the text data and the second output data with the machine-learned model to generate response data. The response data can be descriptive of a response to the query. [0013] In some implementations, the query can be descriptive of a question associated with a particular object depicted in the image data. 
The first data processing tool can detect a plurality of objects depicted in the image data, generate a plurality of bounding boxes associated with the plurality of objects, generate a plurality of image segments based on the plurality of bounding boxes, classify each of the plurality of objects in the plurality of image segments to generate a plurality of object classifications, and generate the first output data. The first output data can include the plurality of image segments and the plurality of object classifications. In some implementations, the second data processing tool can include an image search engine. The second data processing tool can process one or more of the plurality of image segments with the image search engine to determine one or more web resources associated with the one or more of the plurality of image segments. The one or more of the plurality of image segments can be selected by the machine-learned model based on the input data and the plurality of object classifications. The second data processing tool can generate the second output data based on the one or more web resources. [0014] Another example aspect of the present disclosure is directed to a computing system. The system can include one or more processors and one or more non-transitory computer-readable media that collectively store instructions that, when executed by the one or more processors, cause the computing system to perform operations. The operations can include obtaining input data. The input data can include image data and text data. The text data can include a query associated with the image data. The operations can include processing the input data with an artificial intelligence system to generate response data. The response data can be descriptive of a response to the query. The artificial intelligence system can include a language model planner block, a language model reasoner block, and a working memory. 
In some implementations, the language model planner block may have been conditioned to determine one or more data processing tools to utilize for processing data associated with responding to the query. The language model reasoner block may have been conditioned to process outputs of the one or more data processing tools to determine information associated with responding to the query. The working memory can store acquired information obtained and generated with the artificial intelligence system. The operations can include providing the response data as an output. [0015] In some implementations, the artificial intelligence system can include one or more machine-learned models conditioned on user behavior data. The user behavior data can be processed to generate a transition graph that is descriptive of a determined sequence of decisions made by users when performing a particular information seeking task. The transition graph can be descriptive of a plurality of distinct states and can indicate a particular set of actions available at each state of the plurality of distinct states. The user behavior data may have been utilized to condition the language model planner block and the language model reasoner block for particular context-based processing. [0016] In some implementations, the one or more data processing tools can include at least one of a computer vision tool, a web search tool, or an image search tool. The computer vision tool can include at least one of an object detection model, an optical character recognition model, an image captioning model, or a visual question-and-answer model. The web search tool can retrieve open world knowledge from web resources. The image search tool can identify relevant information from metadata associated with visually similar images. 
In some implementations, the language model planner block may have been conditioned to generate a processing tool query to provide to the one or more data processing tools based on the input data. [0017] Other aspects of the present disclosure are directed to various systems, apparatuses, non-transitory computer-readable media, user interfaces, and electronic devices. [0018] These and other features, aspects, and advantages of various embodiments of the present disclosure will become better understood with reference to the following description and appended claims. The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate example embodiments of the present disclosure and, together with the description, serve to explain the related principles. BRIEF DESCRIPTION OF THE DRAWINGS [0019] Detailed discussion of embodiments directed to one of ordinary skill in the art is set forth in the specification, which makes reference to the appended figures, in which: [0020] Figure 1 depicts a block diagram of an example visual information determination system according to example embodiments of the present disclosure. [0021] Figure 2 depicts a block diagram of an example visual information seeking system according to example embodiments of the present disclosure. [0022] Figure 3 depicts a flow chart diagram of an example method to perform visual query processing according to example embodiments of the present disclosure. [0023] Figure 4A depicts a block diagram of an example automated visual information seeking system according to example embodiments of the present disclosure. [0024] Figure 4B depicts a block diagram of an example automated visual information seeking workflow system according to example embodiments of the present disclosure. [0025] Figure 5 depicts a block diagram of an example API tool call system according to example embodiments of the present disclosure. 
[0026] Figure 6 depicts an illustration of an example user data collection interface according to example embodiments of the present disclosure. [0027] Figure 7 depicts a flow chart diagram of an example method to perform visual query response generation according to example embodiments of the present disclosure. [0028] Figure 8 depicts a flow chart diagram of an example method to perform response generation according to example embodiments of the present disclosure. [0029] Figure 9A depicts an illustration of an example visual information determination process according to example embodiments of the present disclosure. [0030] Figure 9B depicts an illustration of an example visual query processing system according to example embodiments of the present disclosure. [0031] Figure 10 depicts a flow chart diagram of an example method to perform artificial intelligence system processing according to example embodiments of the present disclosure. [0032] Figure 11A depicts a block diagram of an example computing system that performs visual query processing according to example embodiments of the present disclosure. [0033] Figure 11B depicts a block diagram of an example computing system that performs visual query processing according to example embodiments of the present disclosure. [0034] Reference numerals that are repeated across plural figures are intended to identify the same features in various implementations. DETAILED DESCRIPTION [0035] Generally, the present disclosure is directed to systems and methods for visual information query processing and response generation. In particular, the systems and methods disclosed herein can leverage one or more machine-learned language models (e.g., a tuned and/or conditioned large language model) for planning and reasoning, which can include application programming interface tool calls. For example, input data including text data and image data can be obtained. 
The text data can be descriptive of a query associated with the image data (e.g., “When was this object invented?”). The input data can be processed with a machine-learned model to generate first planning data. The first planning data can include an application programming interface call to provide a first set of data (e.g., one or more images of the image data) to a first data processing tool (e.g., an object detection and captioning model). First output data (e.g., one or more segmented image patches with captions and/or classifications) can then be obtained from the first data processing tool based on the transmission of the first set of data. The first output data can then be processed with the machine-learned model to determine if a response can be generated and/or to generate second planning data to perform another data processing tool. The systems and methods can iteratively generate application programming interface calls and output data processing until a response (e.g., a response to the query) is generated. [0036] Responding to visual questions that necessitate external knowledge, such as “What event is commemorated by the building depicted in this image?”, can be a complex task. The task can present a combinatorial search space that demands a sequence of actions, including invoking APIs, analyzing their responses, and making informed decisions. Existing language models and/or search engines alone may struggle with the task as language understanding and web resource identification separately may not provide an adequate response. [0037] The systems and methods disclosed herein can leverage a machine-learned language model and one or more data processing tools to perform visual information seeking. In particular, the systems and methods can utilize a machine-learned language model for planning and reasoning. 
For example, the machine-learned language model can process input data and/or output data from a data processing tool to determine an action (e.g., a next action) for obtaining relevant information for responding to the input data. The machine-learned language model can determine application programming interface calls to request information from one or more data processing tools. The machine-learned language model can generate planning data that includes the API call and may include a model-generated query to be provided to the one or more data processing tools. [0038] Additionally and/or alternatively, the machine-learned language model can process the outputs from the one or more data processing tools to determine the relevant information from the outputs. The machine-learned language model can then determine whether further data processing tools are to be performed before generating the response data to provide to the user. [0039] The planning and reasoning processing can be performed iteratively until the machine-learned language model determines a response can be generated and provided. In some implementations, the systems and methods can include a working memory that stores the input data and the outputs of the one or more data processing models to track and utilize the obtained and generated data throughout the different stages of the visual information retrieval and responding process. [0040] The systems and methods can include conditioning the machine-learned model on action example sets. The action example sets can include collected user behavior data descriptive of user selections in a user interface for how the user would perform the visual information seeking task when utilizing a plurality of data processing tools. A transition graph may be constructed and/or learned based on the collected user behavior data. The transition graph may be associated with a particular task and/or a particular group of tasks. 
The conditioned machine-learned model can then perform information seeking planning and reasoning. [0041] The systems and methods of the present disclosure provide a number of technical effects and benefits. As one example, the system and methods can be utilized to leverage a machine-learned language model for tool processing planning and reasoning, which can enable the system to accurately and efficiently respond to visual information queries. In particular, a language model can be conditioned to iteratively process data to determine when and/or how to utilize one or more data processing tools (e.g., an object detection tool, an image captioning tool, a web search tool, an image search tool, etc.). Additionally and/or alternatively, the language model can be conditioned to process the outputs of the data processing tools to extract relevant information that can then be utilized to determine another API call and/or to generate the final response. In some implementations, a user interface can be utilized to collect user behavior data that can be utilized to condition the language model based on the actions performed by a set of users. [0042] Another example technical effect and benefit relates to improved computational efficiency and improvements in the functioning of a computing system. For example, a technical benefit of the systems and methods of the present disclosure is the ability to reduce the computational resources needed for machine-learned model visual information seeking by reducing the instances of useless tool calls. In particular, the language model may process data and generate planning data one state at a time in order to mitigate the instances of a tool being utilized in a useless manner. In particular, pre-planning of data processing tool uses for an entire pipeline can lead to instances in which an output of one tool may cause the use of another tool to be needless, counterproductive, redundant, and/or illogical. 
The systems and methods disclosed herein can iteratively utilize the machine-learned language model to generate planning data (e.g., API calls) based on the input data, tool outputs, and/or reasoning data. [0043] With reference now to the Figures, example embodiments of the present disclosure will be discussed in further detail. [0044] Figure 1 depicts a block diagram of an example visual information determination system 10 according to example embodiments of the present disclosure. In some implementations, the visual information determination system 10 is configured to receive, and/or obtain, a set of input data 12 descriptive of a prompt associated with requesting information associated with one or more images and, as a result of receipt of the input data 12, generate, determine, and/or provide response data 20 that is descriptive of a response to the prompt. Thus, in some implementations, the visual information determination system 10 can include a machine-learned model 14 that is operable to plan data processing tool 18 calls and reason whether further calls are to be performed before generating a response. [0045] In particular, the visual information determination system 10 can include obtaining input data 12. The input data 12 can include text data and image data descriptive of a prompt. The prompt may be associated with a request to receive information associated with one or more details in one or more images of the image data (e.g., “What is the origin of this building?”). [0046] The input data 12 can be processed with a machine-learned model 14 (e.g., a large language model) to generate planning data 16. The planning data 16 can be descriptive of an action to perform. For example, the planning data 16 can include an application programming interface call to transmit data to a data processing tool 18. 
In some implementations, the planning data 16 can include a model-generated dataset that may be generated to provide the data processing tool 18 with a particular set of data to obtain information. [0047] The data processing tool 18 may include an object detection model, an image classification model, an image captioning model, a segmentation model, an object classification model, a computer vision model, an optical character recognition model, an augmentation model, a generative model, a visual question answering model, a web search engine, an image search engine, and/or another tool. The data processing tool 18 may be separate from the machine-learned model 14 that generated the planning data 16. [0048] The output of the data processing tool 18 may be obtained and processed with the machine-learned model 14 to determine (or extract) the relevant information from the output. The output and the input data 12 can be processed with the machine-learned model 14 to determine whether another data processing tool call is to be performed. The generation of planning data 16, processing with data processing tool(s) 18, and processing of the output of the data processing tool(s) 18 may be performed until the machine-learned model 14 determines a response can be generated. If the machine-learned model 14 determines no further API calls are required to respond to the prompt, the machine-learned model 14 (e.g., a generative language model) may generate response data 20. The response data 20 may be descriptive of a response to the prompt and may include one or more natural language text strings. In some implementations, the response data 20 may include image data, links, latent encoding data, audio data, statistical data, multimodal data, and/or other data. [0049] Figure 2 depicts a block diagram of an example visual information seeking system 200 according to example embodiments of the present disclosure. 
The visual information seeking system 200 is similar to the visual information determination system 10 of Figure 1 except that the visual information seeking system 200 further includes a first data processing tool 208 and a second data processing tool 214. For example, the visual information seeking system 200 can utilize any number of different data processing tools to perform visual information seeking. [0050] In particular, input data 202 can be obtained from a user computing system. The input data 202 can be descriptive of a query associated with one or more features in one or more images of the input data 202. The input data 202 may be obtained via one or more user interfaces. [0051] The input data 202 can be processed with a machine-learned model 204 to generate first planning data 206. The machine-learned model 204 can include an LLM-powered planner block, an LLM-powered reasoner block, and an active memory. The LLM-powered planner block can determine when, what, and how to utilize one or more data processing tools. The LLM-powered reasoner block can extract relevant information from the outputs of the data processing tools. Additionally and/or alternatively, the LLM-powered planner block can determine when enough information is obtained to respond to the query. The active memory can continually obtain and store the data obtained and/or generated throughout the visual information seeking process. [0052] The first planning data 206 can include an application programming interface call to transmit a first set of data to a first data processing tool 208. The first set of data can include a portion of the input data 202 and/or a model-generated dataset. The first data processing tool 208 can process the first set of data to generate first output data 210. [0053] The first output data 210 can be obtained and then processed with the machine-learned model 204 to generate second planning data 212. 
The second planning data 212 can include an application programming interface call to transmit a second set of data to a second data processing tool 214. The second set of data can include a portion of the input data 202, a portion of the first output data 210, and/or a model-generated dataset. The second data processing tool 214 can process the second set of data to generate second output data 216. [0054] The first data processing tool 208 and the second data processing tool 214 may differ. For example, the first data processing tool 208 may include an image segmentation model and an object classification model, and the second data processing tool 214 may include one or more search engines. [0055] The second output data 216 can be obtained and processed with the machine-learned model 204. The input data 202, the first output data 210, and/or the second output data 216 can then be utilized to generate response data 218 descriptive of a response to the query of the input data 202. [0056] Figure 3 depicts a flow chart diagram of an example method 300 to perform visual query processing according to example embodiments of the present disclosure. Although Figure 3 depicts steps performed in a particular order for purposes of illustration and discussion, the methods of the present disclosure are not limited to the particularly illustrated order or arrangement. The various steps of the method 300 can be omitted, rearranged, combined, and/or adapted in various ways without deviating from the scope of the present disclosure. [0057] At 302, a computing system can obtain input data. The input data can include image data and text data. The text data can include a query associated with the image data. The image data can include one or more objects. 
Additionally and/or alternatively, the text data can be descriptive of one or more questions associated with object details for one or more objects depicted in one or more images of the image data (e.g., “What year was this building built?”, “What type of bird is this?”, and/or “How do you make this thing?”). [0058] At 304, the computing system can process the input data with a machine-learned model to generate first planning data. The first planning data can be descriptive of instructions to provide the input data to a first data processing tool. In some implementations, the first data processing tool can include an object detection model. The machine-learned model can include an autoregressive language model. In some implementations, the machine-learned model may be conditioned (e.g., parameter tuned and/or few shot example conditioned) for visual information seeking based planning and/or visual information seeking based reasoning. For example, the machine-learned model may be conditioned to determine when and/or what application programming interface calls are to be performed for different visual information seeking tasks. The machine-learned model may be conditioned to process the received outputs from the application programming interface calls to determine when and/or what relevant information was retrieved. In some implementations, the machine-learned model may be conditioned to iteratively determine API calls and process API outputs until a determined end output is received. The end output may then be processed to generate a response. [0059] At 306, the computing system can transmit, based on the first planning data, the input data to the first data processing tool to retrieve first output data. The first output data can include one or more bounding boxes associated with one or more objects in the image data. 
In some implementations, the first output data can include one or more segmented portions of one or more images of the image data and caption data associated with the one or more segmented portions. The caption data can be descriptive of an object classification associated with one or more objects detected in the one or more segmented portions of one or more images. [0060] At 308, the computing system can process the input data and the first output data with the machine-learned model to generate second planning data. The second planning data can be descriptive of instructions to transmit data to a second data processing tool. The second data processing tool can include a search engine. In some implementations, the second planning data can include a model-generated query. The model-generated query can be transmitted to the second data processing tool to retrieve the second output data. In some implementations, the model-generated query can be generated based on the input data and the first output data. The model-generated query can be descriptive of the query of the input data modified based on the first output data. [0061] At 310, the computing system can transmit, based on the second planning data, data to the second data processing tool to retrieve second output data. The first planning data and/or the second planning data may include a model-generated query that may be provided to and/or processed with the respective data processing model associated with the planning data. The second data processing tool may receive data via an application programming interface that was instructed to transmit the data based on the second planning data. [0062] At 312, the computing system can process the input data and the second output data with the machine-learned model to generate response data. The response data can be descriptive of a response to the query. The response data can include a natural language text string that is responsive to the query of the input data. 
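As a non-limiting illustration, steps 302 through 312 of the method 300 can be sketched as follows. This is a minimal sketch under stated assumptions, not the disclosed implementation: the names `InputData`, `PlanningData`, `StubModel`, `run_method_300`, and `TOOLS` are hypothetical, the machine-learned model is replaced with fixed stub rules, and the two tools return made-up placeholder strings instead of calling a real object detection model or search engine.

```python
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass(frozen=True)
class InputData:
    image: str  # placeholder for the image data (e.g., a file name)
    query: str  # text query associated with the image data


@dataclass(frozen=True)
class PlanningData:
    tool_name: str  # which data processing tool to call
    payload: str    # data to transmit to that tool


class StubModel:
    """Hypothetical stand-in for the machine-learned model."""

    def plan_first(self, inp: InputData) -> PlanningData:
        # Step 304: decide to send the image to an object detection tool.
        return PlanningData("object_detection", inp.image)

    def plan_second(self, inp: InputData, first_output: str) -> PlanningData:
        # Step 308: build a model-generated query, i.e., the original
        # query modified based on the first output data.
        rewritten = inp.query.replace("this building", first_output)
        return PlanningData("search", rewritten)

    def respond(self, inp: InputData, second_output: str) -> str:
        # Step 312: turn the second output data into a text response.
        return f"Answer: {second_output}"


def run_method_300(
    inp: InputData,
    model: StubModel,
    tools: Dict[str, Callable[[str], str]],
) -> str:
    first_plan = model.plan_first(inp)                                 # 304
    first_output = tools[first_plan.tool_name](first_plan.payload)     # 306
    second_plan = model.plan_second(inp, first_output)                 # 308
    second_output = tools[second_plan.tool_name](second_plan.payload)  # 310
    return model.respond(inp, second_output)                           # 312


# Made-up tool behavior, for illustration only.
TOOLS: Dict[str, Callable[[str], str]] = {
    "object_detection": lambda image: "the Example Tower",
    "search": lambda query: f"results for '{query}'",
}

result = run_method_300(
    InputData("photo.jpg", "What year was this building built?"),
    StubModel(),
    TOOLS,
)
print(result)
```

In this sketch each planning step sees the output of the previous tool before the next tool is chosen, which mirrors the state-by-state planning described above rather than pre-planning the entire pipeline.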
[0063] In some implementations, the computing system can process the input data and the second output data with the machine-learned model to generate third planning data. The third planning data can be descriptive of instructions to transmit data to a third data processing tool. The computing system can transmit, based on the third planning data, data to the third data processing tool to retrieve third output data and can process the input data and the third output data with the machine-learned model to generate the response data. [0064] Figure 4A depicts a block diagram of an example automated visual information seeking system 400 according to example embodiments of the present disclosure. The automated visual information seeking system 400 can obtain an image (e.g., an image depicting a parade ceremony) and a question 402 associated with the image (e.g., “When was the drum first used for this event?”). The question 402 can be processed with a plan with a large language model block 404, which may decompose the question and determine the image is to be transmitted to an object detection and captioning tool 406A. Additionally, the plan with a large language model block 404 may generate a visual question (e.g., “In the image, what is the drum and event?”). The tool call and the visual question may be part of a first planning dataset. [0065] The object detection and captioning tool 406A can process the image to generate a list of objects 408A that includes one or more segmented portions of the image with determined captions for each of the respective segmented portions. The segmented portions, respective captions, and the visual question may be processed with the plan with a large language model block 404 to generate second planning data. The second planning data can select a particular segmented portion and respective caption. The selected segmented portion of the image can then be transmitted to an image search tool 406B based on the second planning data. 
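As a non-limiting illustration, the selection of a particular segmented portion from the list of objects 408A can be sketched as follows. This is a minimal sketch under stated assumptions: in the system described above the selection is made by the plan with a large language model block 404, whereas the sketch swaps in a simple word-overlap score between the question and each caption. The names `Segment` and `select_segment`, the bounding box values, and the captions are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Segment:
    box: Tuple[int, int, int, int]  # (x_min, y_min, x_max, y_max), made-up pixels
    caption: str                    # caption produced by the captioning tool


def select_segment(question: str, segments: List[Segment]) -> Segment:
    """Pick the segment whose caption shares the most words with the question.

    Word overlap is only a stand-in for the language-model-based selection.
    """
    question_words = set(question.lower().replace("?", "").split())

    def overlap(segment: Segment) -> int:
        return len(question_words & set(segment.caption.lower().split()))

    # max() returns the first segment with the highest score on ties.
    return max(segments, key=overlap)


segments = [
    Segment((10, 40, 200, 260), "a person in a robe"),
    Segment((220, 80, 400, 300), "a large drum on a stand"),
    Segment((0, 0, 640, 60), "a wooden gate"),
]

chosen = select_segment("When was the drum first used for this event?", segments)
print(chosen.caption)
```

The chosen segment would then be the data transmitted to the image search tool 406B under the second planning data.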
[0066] The image search tool 406B can process the selected segmented portion to identify similar images with alt-texts 408B (e.g., web descriptions associated with similar images to the segmented portion). [0067] The similar images with alt-texts 408B can then be processed with a reason with the large language model block 410 to generate an answer to the visual question. The machine-learned model can then determine if the visual question is answered 412 and whether or not other segmented portions are to be processed with the image search tool 406B and/or another tool. [0068] The visual answer can then be processed with the plan with a large language model block 404 to generate third planning data. The third planning data can include a generated search query that includes the information from the visual answer (e.g., “When was Taiko first used for Aoi Festival?”) and an application programming interface call to transmit the generated search query to a web search tool 406C. [0069] The web search tool 406C can process the search query to determine search results including searched web pages 408C. The information scraped (or extracted) from the relevant searched web pages 408C can be processed with the reason with the large language model block 410 to generate an answer to the search query. The machine-learned model can determine whether the search query is answered 412 or whether further searching and scraping is to be performed. [0070] If the search query is answered, the response data descriptive of the response 414 to the search query can then be provided to the user that submitted the question 402 and image. [0071] Figure 4B depicts a block diagram of an example automated visual information seeking workflow system 450 according to example embodiments of the present disclosure. 
In particular, the automated visual information seeking workflow system 450 can obtain a multimodal query 452 (e.g., an image and a question) and generate a response 468 (e.g., an answer generated based on multi-tool usage) based on a plurality of processing instances with a plurality of different tools. [0072] A multimodal query 452 can be obtained from a user computing system. The multimodal query 452 can include an image and a text string descriptive of a question about one or more details from the image. The multimodal query 452 may be processed with the planner model to determine one or more processing actions (e.g., using one or more processing tools). [0073] An image search may first be performed with a search engine to determine an image search result 454 (e.g., a similar image and related metadata (e.g., a caption, entity tags, and/or location information)). The image search may include embedding the image of the multimodal query 452 and then performing a nearest neighbor embedding determination. Alternatively and/or additionally, the image search may include image matching. [0074] The image search may be determined to be uninformative, which may lead to backtracking. The planner model may process the image search result 454 and the multimodal query 452 to determine a next action (e.g., a different processing tool). [0075] A particular object 456 may be selected based on one or more factors, which may include foreground determination, focus determination, and/or selection based on the content of the text string. The particular object 456 can be detected with a detection model and then segmented with a segmentation model. [0076] The object selection may be determined to be uninformative, which may lead to backtracking. The planner model may process the particular object 456, the image search result 454, and the multimodal query 452 to determine a next action (e.g., a different processing tool). 
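As a non-limiting illustration of the nearest neighbor embedding determination mentioned in [0073], the following minimal sketch ranks indexed images by cosine similarity to a query embedding. The function names, the toy three-dimensional embeddings, and the metadata dictionaries are assumptions made for illustration only; the disclosure does not specify a similarity measure or an index structure, and a practical system would likely use a learned image encoder and an approximate nearest neighbor index.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)

def nearest_neighbors(query_embedding, index, k=1):
    """Return the k (score, metadata) pairs most similar to the query.

    `index` is a list of (embedding, metadata) pairs. The metadata dict is a
    hypothetical stand-in for a caption, entity tags, or location information.
    """
    scored = [(cosine_similarity(query_embedding, emb), meta) for emb, meta in index]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]

# Hypothetical toy index; real embeddings would come from an image encoder.
index = [
    ([1.0, 0.0, 0.0], {"caption": "taiko drum"}),
    ([0.0, 1.0, 0.0], {"caption": "parade float"}),
    ([0.9, 0.1, 0.0], {"caption": "festival drum"}),
]
results = nearest_neighbors([1.0, 0.05, 0.0], index, k=2)
```

In this sketch, a low top score could be one signal that the image search is uninformative and that backtracking to a different tool is appropriate.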
[0077] A second particular object 458 may be selected based on one or more factors, which may include foreground determination, focus determination, and/or selection based on the content of the text string. The second particular object 458 can be detected with a detection model and then segmented with a segmentation model. [0078] The second particular object 458 may be determined to be informative by the reasoner model and/or the planner model. The planner model can then determine a next action based on the second particular object 458 and the multimodal query 452. [0079] An image search (e.g., with a search engine) can be performed by processing the image segment that includes the second particular object 458 to determine a respective search result set 460. The respective search result set 460 can include images, knowledge graph data, web resource data, captions, location data, and/or text content data. [0080] The respective search result set 460 may be determined to be informative by the reasoner model and/or the planner model. The planner model can then determine a next action based on the respective search result set 460, the second particular object 458, and the multimodal query 452. [0081] The respective search result set 460 and the multimodal query 452 can be processed with a generative language model (e.g., the reasoner model and/or planner model) to perform visual question answering to answer a reasoner model generated question 462. The answer to the reasoner model generated question 462, the respective search result set 460, and the multimodal query 452 can be processed with a generative language model (e.g., the reasoner model and/or planner model) to generate a first predicted answer 464. The reasoner model and/or the planner model may determine the first predicted answer 464 is incorrect and/or uninformative. 
[0082] A search (e.g., with a search engine) may then be performed based on the multimodal query 452 and/or the answer to the reasoner model generated question 462 to determine web search results 466. [0083] The web search results 466 and the multimodal query 452 may then be processed with a generative model (e.g., the reasoner model and/or the planner model) to generate a second predicted answer. The second predicted answer may be determined to be accurate and may be provided as the response 468. [0084] In particular, the automated visual information seeking workflow system 450 can leverage a planner model and/or a reasoner model to determine how to process the multimodal query 452 and successive data instances. The process can include performing a plurality of subtasks, which may include utilizing a plurality of different processing tools based on determinations performed by the planner model and/or the reasoner model. The automated visual information seeking workflow system 450 may generate a plurality of model-generated queries based on the initial multimodal query 452 in order to respond to the multimodal query 452. The plurality of model-generated queries may include multimodal queries, which may include all and/or portions of the original image. The plurality of model-generated queries may include outputs from the one or more processing tools as different tasks are being performed. [0085] Figure 5 depicts a block diagram of an example API tool call system 500 according to example embodiments of the present disclosure. API tool call system 500 of the systems and methods disclosed herein can include a machine-learned model being leveraged to determine when, what, and how to utilize a plurality of data processing tools 504 to generate a response to an input prompt. 
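As a non-limiting illustration of how planning data might be turned into a tool call, the following minimal sketch dispatches a planning step to a named tool through a registry. The dictionary layout of the planning data, the tool names, and the stub tool functions are hypothetical and are not taken from the disclosure; actual tools would invoke detection models, search engines, or other services through their own application programming interfaces.

```python
from typing import Any, Callable, Dict

# Hypothetical stand-ins for data processing tools. Real tools would call
# captioning models or search engines rather than format strings.
def caption_tool(payload: Any) -> str:
    return f"caption for {payload}"

def image_search_tool(payload: Any) -> str:
    return f"similar images for {payload}"

def web_search_tool(payload: Any) -> str:
    return f"web results for {payload}"

# Registry mapping a tool name chosen by the planner to a callable.
TOOLS: Dict[str, Callable[[Any], str]] = {
    "caption": caption_tool,
    "image_search": image_search_tool,
    "web_search": web_search_tool,
}

def execute_plan(planning_data: Dict[str, Any]) -> str:
    """Dispatch one planning step to the tool it names."""
    name = planning_data["tool"]
    if name not in TOOLS:
        raise KeyError(f"unknown tool: {name}")
    return TOOLS[name](planning_data["payload"])

# Example planning step carrying a model-generated query (assumed format).
output = execute_plan({
    "tool": "web_search",
    "payload": "When was Taiko first used for Aoi Festival?",
})
```

Keeping the registry separate from the planner means a tool can be added or removed without changing how planning data is interpreted.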
[0086] The API tool call system 500 can start 502 with an input prompt that may include text data, image data, audio data, latent encoding data, multimodal data, statistical data, and/or other data and may finish 506 with a response to the input prompt. Generating a response to the input prompt can involve a machine-learned model iteratively obtaining and processing data to determine application programming interface calls that can be performed to utilize a plurality of data processing tools 504 for responding to the input prompt. [0087] The plurality of data processing tools 504 can include a captioning tool, a select object tool, a visual question answering tool, an image search tool, a web search tool, a large language model short question-and-answer tool, and/or other tools. The plurality of data processing tools 504 may process the input prompt, model-generated data, and/or the outputs of other tools. [0088] Figure 6 depicts an illustration of an example user data collection interface 600 according to example embodiments of the present disclosure. The user data collection interface 600 can include a plurality of user interface elements that can be interacted with to collect user data to generate few shot examples to condition a machine-learned model for visual information seeking planning and reasoning. [0089] For example, the user data collection interface 600 may display an input prompt 602 including an example input image and an example input question. A plurality of image segments 604 of the input example image can then be provided to a user to be selected for obtaining selected image segment data. [0090] A tools interface 606 can then be displayed to allow a user to select which data processing tools they would utilize to answer the question including what data they would provide to the data processing tool. The output(s) 610 of a selected data processing tool can be obtained and provided to a user for display. 
[0091] The user may then be provided with a plurality of output reasoning interface elements 608 for display. The user can then select whether the output(s) 610 are useful, whether the end answer has been determined, whether an additional application programming interface call is to be performed, and/or if the answer cannot be found. [0092] The user data collection interface 600 may be provided to a plurality of different users for a plurality of different visual information seeking tasks. The collected user behavior data can then be utilized to generate a transition graph that may be utilized by a pre-trained machine-learned language model for planning and reasoning. [0093] Figure 7 depicts a flow chart diagram of an example method to perform according to example embodiments of the present disclosure. Although Figure 7 depicts steps performed in a particular order for purposes of illustration and discussion, the methods of the present disclosure are not limited to the particularly illustrated order or arrangement. The various steps of the method 700 can be omitted, rearranged, combined, and/or adapted in various ways without deviating from the scope of the present disclosure. [0094] At 702, a computing system can obtain input data. The input data can include image data and text data. The text data can include a query associated with the image data. The input data may include text data, image data, audio data, video data, latent encoding data, statistical data, multimodal data, and/or other data. The image data may be descriptive of one or more images captured with a user computing device. Alternatively and/or additionally, the image data may be descriptive of one or more images obtained from one or more web resources. [0095] At 704, the computing system can process the input data with a machine-learned model to generate first planning data. The first planning data can be descriptive of instructions to provide the image data to a first data processing tool. 
The first planning data may include a model-generated text string to be processed with the image data. [0096] In some implementations, the machine-learned model may have been conditioned on a training dataset including a plurality of training examples. Each training example of the plurality of training examples can include a training input, a training output, and a training rationale. The training rationale can be descriptive of a sequence of processing instances and tool calls for determining the output data. Additionally and/or alternatively, the machine-learned model may have been conditioned on a training dataset including sequence data. The sequence data can be descriptive of a sequence of actions for generating a training response in response to obtaining a particular type of input. In some implementations, the machine-learned model may have been conditioned on human input data descriptive of actions a human selected as being particular actions to perform when a particular input type is received. The human input data may have been obtained via a user interface that provides a plurality of selectable options for a user. The plurality of selectable options can include a plurality of external tools to call and a final output option. [0097] At 706, the computing system can provide the image data to the first data processing tool to receive first output data. The first output data can be generated with the first data processing tool based on the image data. Alternatively and/or additionally, model-generated data and/or the text data may be provided to the first data processing tool to generate the first output data. The first data processing tool may include an image processing tool, a text processing tool, a search engine, an augmentation model, and/or another data processing tool. The first output data can include text data, audio data, image data, latent encoding data, multimodal data, and/or other data. 
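As a non-limiting illustration of the human input data of [0096] and the transition graph of [0092], the following minimal sketch aggregates recorded sequences of user-selected actions into a graph that lists, for each state, the actions observed to follow it. The session format, the action names, and the "START" sentinel are assumptions made for illustration; the disclosure does not specify how the transition graph is constructed.

```python
from collections import defaultdict
from typing import Dict, List

def build_transition_graph(sessions: List[List[str]]) -> Dict[str, List[str]]:
    """Map each state (the previous action) to the actions users took next."""
    graph = defaultdict(set)
    for actions in sessions:
        previous = "START"  # hypothetical sentinel for the initial state
        for action in actions:
            graph[previous].add(action)
            previous = action
    # Sort for a deterministic, readable result.
    return {state: sorted(next_actions) for state, next_actions in graph.items()}

# Hypothetical recorded sessions, one list of selected options per user task.
sessions = [
    ["caption", "select_object", "image_search", "web_search", "final_answer"],
    ["image_search", "select_object", "image_search", "final_answer"],
]
graph = build_transition_graph(sessions)
```

A planner could consult such a graph to restrict the candidate actions at each state to those that users were observed to take.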
[0098] At 708, the computing system can process the first output data with the machine-learned model to generate second planning data. The second planning data can be descriptive of instructions to provide a particular portion of the image data to a second data processing tool. The second planning data can include model-generated data to provide to the second data processing tool. The model-generated data can include text data, image data, audio data, latent encoding data, multimodal data, and/or other data. The second data processing tool can include an image processing model, a search engine, a segmentation model, an augmentation model, and/or another data processing tool. [0099] At 710, the computing system can provide the particular portion of the image data to the second data processing tool to receive second output data. The second output data can be generated with the second data processing tool based on the particular portion of the image data. The second data processing tool may perform a reverse image search to obtain data associated with an object depicted in the particular portion of the image data. [0100] At 712, the computing system can process the text data and the second output data with the machine-learned model to generate response data. The response data can be descriptive of a response to the query. The response data can include text data, image data, audio data, latent encoding data, augmented reality data, virtual reality data, multimodal data, and/or other data. The response data may include a natural language text string descriptive of the response. [0101] Figure 8 depicts a flow chart diagram of an example method to perform according to example embodiments of the present disclosure. Although Figure 8 depicts steps performed in a particular order for purposes of illustration and discussion, the methods of the present disclosure are not limited to the particularly illustrated order or arrangement. 
The various steps of the method 800 can be omitted, rearranged, combined, and/or adapted in various ways without deviating from the scope of the present disclosure. [0102] At 802, a computing system can obtain input data. The input data can include image data and text data. The text data can include a query associated with the image data. In some implementations, the query can be descriptive of a question associated with a particular object depicted in the image data. [0103] At 804, the computing system can process the input data with a machine-learned model to generate first planning data. The first planning data can be descriptive of instructions to provide the image data to a first data processing tool. [0104] At 806, the computing system can provide the image data to the first data processing tool and receive first output data from the first data processing tool based on the image data. [0105] In some implementations, the first data processing tool can detect a plurality of objects depicted in the image data, generate a plurality of bounding boxes associated with the plurality of objects, generate a plurality of image segments based on the plurality of bounding boxes, classify each of the plurality of objects in the plurality of image segments to generate a plurality of object classifications, and generate the first output data. The first output data can include the plurality of image segments and the plurality of object classifications. [0106] At 808, the computing system can process the first output data with the machine-learned model to generate second planning data. The second planning data can be descriptive of instructions to provide a particular portion of the image data to a second data processing tool. [0107] At 810, the computing system can provide the particular portion of the image data to the second data processing tool and receive second output data from the second data processing tool based on the particular portion of the image data. 
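As a non-limiting illustration of the first data processing tool of [0105], the following minimal sketch crops image segments from bounding boxes and pairs each segment with its object classification to form the first output data. The nested-list image representation, the `Detection` type, and the box convention (left, top, right, bottom, with right and bottom exclusive) are assumptions made for illustration; detection and classification themselves are taken as given rather than implemented.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

Image = List[List[int]]  # rows of pixel values; a toy stand-in for image data

@dataclass
class Detection:
    box: Tuple[int, int, int, int]  # (left, top, right, bottom), end-exclusive
    label: str                      # object classification for the box

def crop(image: Image, box: Tuple[int, int, int, int]) -> Image:
    """Return the image segment covered by the bounding box."""
    left, top, right, bottom = box
    return [row[left:right] for row in image[top:bottom]]

def build_first_output(image: Image, detections: List[Detection]) -> List[Dict]:
    """Pair each cropped segment with its classification."""
    return [
        {"segment": crop(image, d.box), "classification": d.label}
        for d in detections
    ]

# Toy 4x4 image whose pixel value encodes its position (row * 4 + column).
image = [[r * 4 + c for c in range(4)] for r in range(4)]
detections = [Detection((1, 0, 3, 2), "drum"), Detection((0, 2, 2, 4), "person")]
first_output = build_first_output(image, detections)
```

The machine-learned model could then select one of these segments, based on its classification and the query, as the particular portion of the image data to provide to the second data processing tool.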
[0108] In some implementations, the second data processing tool can include an image search engine. The second data processing tool can process one or more of the plurality of image segments with the image search engine to determine one or more web resources associated with the one or more of the plurality of image segments. The one or more of the plurality of image segments can be selected by the machine-learned model based on the input data and the plurality of object classifications. Additionally and/or alternatively, the second data processing tool can generate the second output data based on the one or more web resources. [0109] At 812, the computing system can process the text data and the second output data with the machine-learned model to generate response data. The response data can be descriptive of a response to the query. [0110] Figure 9A depicts an illustration of an example visual information determination process 900 according to example embodiments of the present disclosure. In particular, Figure 9A depicts three example processes (920, 922, & 924). [0111] At 920, a question 902 (“How many floors does this building have?”) and an image 904 (an image of a building) can be obtained. An object selection 906 can occur to select a relevant image segment associated with the question 902. The image segment can then be processed with an image search tool 908 to identify similar images and relevant text data associated with the identified similar images. The similar image(s) and respective relevant text data can be processed with a language model reasoner block 910 to determine a relevant answer to what the building is. The building name can then be processed with a web search tool 912 to obtain one or more search results. Relevant search result information can then be extracted using the language model reasoner block 910 to determine an answer 916 to the input question 902. 
[0112] At 922, a similar process can be performed that may include a second object selection 906 loop based on a determined object identification being irrelevant based on processing a tool output with the language model reasoner block 910. [0113] At 924, a similar process to 920 and 922 can be performed that may include the utilization of an LLM short QA tool for answering the input question 902. [0114] Figure 9B depicts an illustration of an example visual query processing system 950 according to example embodiments of the present disclosure. The visual query processing system 950 can employ dynamic decision-making to plan (e.g., find an optimal tool and query), execute the plan to obtain results, and then reason (e.g., estimate whether to continue or backtrack). [0115] The visual query processing system 950 can obtain a visual query. The visual query can be processed with the planner model 952 (e.g., a generative language model (e.g., an LLM)). The visual query processing system 950 may begin processing with an initial query (e.g., a multimodal visual query). As additional information is retrieved, the initial query may be updated. [0116] The initial query can be processed with the planner model 952 to determine a particular action to perform, which may cause an application programming interface call to be generated. The application programming interface call can then be performed to transmit the initial query and instructions to a tool executor 954 to generate and/or obtain execution results 956. The tool executor may execute the tool interactions to communicate with the one or more processing tools (e.g., image search engine, web search engine, vision language model, captioning model, VQA model processing, detection model, segmentation model, augmentation model, etc.). [0117] The execution results 956 may be transmitted to a working memory 958 and may be utilized for future planning instances with the planner model 952. 
The query state may be updated based on the execution results 956. [0118] The execution results 956 can be processed with a reasoner model 960 (e.g., a generative language model (e.g., an LLM)) to determine whether the execution results 956 are informative 962. If the execution results 956 are determined to not be informative, the visual query processing system 950 may backtrack and perform another planning instance with the planner model 952. If the execution results 956 are determined to be informative, the query state may be updated based on the execution results 956 to generate an updated query 964, which may include updating and/or utilizing transition graphs 966 and/or in-context examples. [0119] The updated query 964, the transition graphs 966, and/or in-context examples may then be transmitted to the planner model to determine additional actions and/or prompts to leverage to perform the response generation. The loop can be performed iteratively and may include five or more loops before response generation. [0120] Figure 10 depicts a flow chart diagram of an example method to perform according to example embodiments of the present disclosure. Although Figure 10 depicts steps performed in a particular order for purposes of illustration and discussion, the methods of the present disclosure are not limited to the particularly illustrated order or arrangement. The various steps of the method 1000 can be omitted, rearranged, combined, and/or adapted in various ways without deviating from the scope of the present disclosure. [0121] At 1002, a computing system can obtain input data. The input data can include image data and text data. The text data can include a query associated with the image data. The input data may be obtained via a graphical user interface, which may include a query input box, a selection window, and/or a download portal. 
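As a non-limiting illustration of the plan, execute, and reason loop of the visual query processing system 950 described in [0114] through [0119], the following minimal sketch runs a planner, a tool executor, and a reasoner in a loop with a working memory and backtracking. The function signatures, the rule-based toy planner and reasoner, and the step limit are hypothetical stand-ins; in the disclosure these roles are filled by generative language models and real data processing tools.

```python
from typing import Callable, List, Optional, Set, Tuple

def answer_visual_query(
    query: str,
    plan: Callable[[str, Set[str]], Optional[str]],
    execute: Callable[[str, str], str],
    reason: Callable[[str, str], Tuple[bool, str, bool]],
    max_steps: int = 10,
) -> Tuple[Optional[str], List[Tuple[str, str]]]:
    """Iterate plan -> execute -> reason until an answer or the step limit."""
    working_memory: List[Tuple[str, str]] = []  # (action, result) history
    state = query
    rejected: Set[str] = set()  # actions found uninformative at this state
    for _ in range(max_steps):
        action = plan(state, rejected)
        if action is None:  # no remaining action to try
            return None, working_memory
        result = execute(action, state)
        working_memory.append((action, result))
        informative, new_state, done = reason(state, result)
        if not informative:
            rejected.add(action)  # backtrack: keep state, plan a different action
            continue
        state, rejected = new_state, set()  # update the query state
        if done:
            return state, working_memory
    return None, working_memory

# Toy stand-ins (assumptions) for the planner, tool executor, and reasoner.
def toy_plan(state: str, rejected: Set[str]) -> Optional[str]:
    candidates = ["web_search"] if state.startswith("entity:") else ["image_search", "select_object"]
    return next((a for a in candidates if a not in rejected), None)

def toy_execute(action: str, state: str) -> str:
    return {"image_search": "no match", "select_object": "Taiko drum",
            "web_search": "answer: found via web search"}[action]

def toy_reason(state: str, result: str) -> Tuple[bool, str, bool]:
    if result == "no match":
        return False, state, False
    if result.startswith("answer:"):
        return True, result, True
    return True, f"entity: {result}", False

answer, memory = answer_visual_query("What is this drum?", toy_plan, toy_execute, toy_reason)
```

In this toy run the image search is judged uninformative, so the loop backtracks to object selection before proceeding to web search, which mirrors the backtracking behavior described for the reasoner model 960.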
[0122] At 1004, the computing system can process the input data with an artificial intelligence system to generate response data. The response data can be descriptive of a response to the query. The artificial intelligence system can include a language model planner block, a language model reasoner block, and a working memory. In some implementations, the language model planner block may have been conditioned to determine one or more data processing tools to utilize for processing data associated with responding to the query. Additionally and/or alternatively, the language model reasoner block may have been conditioned to process outputs of the one or more data processing tools to determine information associated with responding to the query. The working memory can store acquired information obtained and generated with the artificial intelligence system. [0123] In some implementations, the artificial intelligence system can include one or more machine-learned models conditioned on user behavior data. The user behavior data can be processed to generate a transition graph that is descriptive of a determined sequence of decisions made by users when performing a particular information seeking task. In some implementations, the transition graph can be descriptive of a plurality of distinct states and can indicate a particular set of actions available at each state of the plurality of distinct states. Additionally and/or alternatively, the user behavior data may be utilized to condition the language model planner block and the language model reasoner block for particular context-based processing. [0124] In some implementations, the one or more data processing tools can include a computer vision tool, a web search tool, and/or an image search tool. The computer vision tool can include an object detection model, an optical character recognition model, an image captioning model, and/or a visual question-and-answer model. 
Additionally and/or alternatively, the web search tool can retrieve open world knowledge from web resources. The image search tool may identify relevant information from metadata associated with visually similar images. In some implementations, the language model planner block may have been conditioned to generate a processing tool query to provide to the one or more data processing tools based on the input data. [0125] At 1006, the computing system can provide the response data as an output. The response data may be provided to a user via a graphical user interface. For example, the graphical user interface may be a search interface, and the response data may be provided for display in a panel adjacent to one or more search results. [0126] Figure 11A depicts a block diagram of an example computing system 100 that performs visual query processing according to example embodiments of the present disclosure. The system 100 includes a user computing system 102, a server computing system 130, and/or a third computing system 150 that are communicatively coupled over a network 180. [0127] The user computing system 102 can include any type of computing device, such as, for example, a personal computing device (e.g., laptop or desktop), a mobile computing device (e.g., smartphone or tablet), a gaming console or controller, a wearable computing device, an embedded computing device, or any other type of computing device. [0128] The user computing system 102 includes one or more processors 112 and a memory 114. The one or more processors 112 can be any suitable processing device (e.g., a processor core, a microprocessor, an ASIC, a FPGA, a controller, a microcontroller, etc.) and can be one processor or a plurality of processors that are operatively connected. The memory 114 can include one or more non-transitory computer-readable storage mediums, such as RAM, ROM, EEPROM, EPROM, flash memory devices, magnetic disks, etc., and combinations thereof. 
The memory 114 can store data 116 and instructions 118 which are executed by the processor 112 to cause the user computing system 102 to perform operations. [0129] In some implementations, the user computing system 102 can store or include one or more machine-learned models 120. For example, the machine-learned models 120 can be or can otherwise include various machine-learned models such as neural networks (e.g., deep neural networks) or other types of machine-learned models, including non-linear models and/or linear models. Neural networks can include feed-forward neural networks, recurrent neural networks (e.g., long short-term memory recurrent neural networks), convolutional neural networks or other forms of neural networks. [0130] In some implementations, the one or more machine-learned models 120 can be received from the server computing system 130 over network 180, stored in the user computing device memory 114, and then used or otherwise implemented by the one or more processors 112. In some implementations, the user computing system 102 can implement multiple parallel instances of a single machine-learned model 120 (e.g., to perform parallel machine-learned model processing across multiple instances of input data and/or detected features). [0131] More particularly, the one or more machine-learned models 120 may include one or more detection models, one or more classification models, one or more segmentation models, one or more augmentation models, one or more generative models, one or more natural language processing models, one or more optical character recognition models, and/or one or more other machine-learned models. The one or more machine-learned models 120 can include one or more transformer models. The one or more machine-learned models 120 may include one or more neural radiance field models, one or more diffusion models, and/or one or more autoregressive language models. 
[0132] The one or more machine-learned models 120 may be utilized to detect one or more object features. The detected object features may be classified and/or embedded. The classification and/or the embedding may then be utilized to perform a search to determine one or more search results. Alternatively and/or additionally, the one or more detected features may be utilized to determine an indicator (e.g., a user interface element that indicates a detected feature) is to be provided to indicate a feature has been detected. The user may then select the indicator to cause a feature classification, embedding, and/or search to be performed. In some implementations, the classification, the embedding, and/or the searching can be performed before the indicator is selected. [0133] In some implementations, the one or more machine-learned models 120 can process image data, text data, audio data, and/or latent encoding data to generate output data that can include image data, text data, audio data, and/or latent encoding data. The one or more machine-learned models 120 may perform optical character recognition, natural language processing, image classification, object classification, text classification, audio classification, context determination, action prediction, image correction, image augmentation, text augmentation, sentiment analysis, object detection, error detection, inpainting, video stabilization, audio correction, audio augmentation, and/or data segmentation (e.g., mask based segmentation). [0134] Machine-learned model(s) can be or include one or multiple machine-learned models or model components. Example machine-learned models can include neural networks (e.g., deep neural networks). Example machine-learned models can include non-linear models or linear models. Example machine-learned models can use other architectures in lieu of or in addition to neural networks. 
Example machine-learned models can include decision tree based models, support vector machines, hidden Markov models, Bayesian networks, linear regression models, k-means clustering models, etc. [0135] Example neural networks can include feed-forward neural networks, recurrent neural networks (RNNs), including long short-term memory (LSTM) based recurrent neural networks, convolutional neural networks (CNNs), diffusion models, generative-adversarial networks, or other forms of neural networks. Example neural networks can be deep neural networks. Some example machine-learned models can leverage an attention mechanism such as self-attention. For example, some example machine-learned models can include multi-headed self-attention models. [0136] Machine-learned model(s) can include a single or multiple instances of the same model configured to operate on data from input(s). Machine-learned model(s) can include an ensemble of different models that can cooperatively interact to process data from input(s). For example, machine-learned model(s) can employ a mixture-of-experts structure. See, e.g., Zhou et al., Mixture-of-Experts with Expert Choice Routing, ARXIV:2202.09368v2 (Oct. 14, 2022). [0137] Input(s) can generally include or otherwise represent various types of data. Input(s) can include one type or many different types of data. Output(s) can be data of the same type(s) or of different types of data as compared to input(s). Output(s) can include one type or many different types of data. 
[0138] Example data types for input(s) or output(s) include natural language text data, software code data (e.g., source code, object code, machine code, or any other form of computer-readable instructions or programming languages), machine code data (e.g., binary code, assembly code, or other forms of machine-readable instructions that can be executed directly by a computer's central processing unit), assembly code data (e.g., low-level programming languages that use symbolic representations of machine code instructions to program a processing unit), genetic data or other chemical or biochemical data, image data, audio data, audiovisual data, haptic data, biometric data, medical data, financial data, statistical data, geographical data, astronomical data, historical data, sensor data generally (e.g., digital or analog values, such as voltage or other absolute or relative level measurement values from a real or artificial input, such as from an audio sensor, light sensor, displacement sensor, etc.), and the like. Data can be raw or processed and can be in any format or schema. [0139] In multimodal inputs or outputs, example combinations of data types include image data and audio data, image data and natural language data, natural language data and software code data, image data and biometric data, sensor data and medical data, etc. It is to be understood that any combination of data types in an input or an output can be present. [0140] An example input can include one or multiple data types, such as the example data types noted above. An example output can include one or multiple data types, such as the example data types noted above. The data type(s) of input can be the same as or different from the data type(s) of output. It is to be understood that the example data types noted above are provided for illustrative purposes only. Data types contemplated within the scope of the present disclosure are not limited to those examples noted above. 
[0141] Additionally or alternatively, one or more machine-learned models 140 can be included in or otherwise stored and implemented by the server computing system 130 that communicates with the user computing system 102 according to a client-server relationship. For example, the machine-learned models 140 can be implemented by the server computing system 130 as a portion of a web service (e.g., a viewfinder service, a visual search service, an image processing service, an ambient computing service, and/or an overlay application service). Thus, one or more models 120 can be stored and implemented at the user computing system 102 and/or one or more models 140 can be stored and implemented at the server computing system 130. [0142] The user computing system 102 can also include one or more user input components 122 that receive user input. For example, the user input component 122 can be a touch-sensitive component (e.g., a touch-sensitive display screen or a touch pad) that is sensitive to the touch of a user input object (e.g., a finger or a stylus). The touch-sensitive component can serve to implement a virtual keyboard. Other example user input components include a microphone, a traditional keyboard, or other means by which a user can provide user input. [0143] In some implementations, the user computing system can store and/or provide one or more user interfaces 124, which may be associated with one or more applications. The one or more user interfaces 124 can be configured to receive inputs and/or provide data for display (e.g., image data, text data, audio data, one or more user interface elements, an augmented-reality experience, a virtual reality experience, and/or other data for display). The user interfaces 124 may be associated with one or more other computing systems (e.g., server computing system 130 and/or third party computing system 150). 
The user interfaces 124 can include a viewfinder interface, a search interface, a generative model interface, a social media interface, and/or a media content gallery interface. [0144] The user computing system 102 may include and/or receive data from one or more sensors 126. The one or more sensors 126 may be housed in a housing component that houses the one or more processors 112, the memory 114, and/or one or more hardware components, which may store, and/or cause to perform, one or more software packets. The one or more sensors 126 can include one or more image sensors (e.g., a camera), one or more lidar sensors, one or more audio sensors (e.g., a microphone), one or more inertial sensors (e.g., inertial measurement unit), one or more biological sensors (e.g., a heart rate sensor, a pulse sensor, a retinal sensor, and/or a fingerprint sensor), one or more infrared sensors, one or more location sensors (e.g., GPS), one or more touch sensors (e.g., a conductive touch sensor and/or a mechanical touch sensor), and/or one or more other sensors. The one or more sensors can be utilized to obtain data associated with a user’s environment (e.g., an image of a user’s environment, a recording of the environment, and/or the location of the user). [0145] The user computing system 102 may include, and/or be part of, a user computing device 104. The user computing device 104 may include a mobile computing device (e.g., a smartphone or tablet), a desktop computer, a laptop computer, a smart wearable, and/or a smart appliance. Additionally and/or alternatively, the user computing system may obtain data from, and/or generate data with, the one or more user computing devices 104. For example, a camera of a smartphone may be utilized to capture image data descriptive of the environment, and/or an overlay application of the user computing device 104 can be utilized to track and/or process the data being provided to the user. 
Similarly, one or more sensors associated with a smart wearable may be utilized to obtain data about a user and/or about a user’s environment (e.g., image data can be obtained with a camera housed in a user’s smart glasses). Additionally and/or alternatively, the data may be obtained and uploaded from other user devices that may be specialized for data obtainment or generation. [0146] The server computing system 130 includes one or more processors 132 and a memory 134. The one or more processors 132 can be any suitable processing device (e.g., a processor core, a microprocessor, an ASIC, a FPGA, a controller, a microcontroller, etc.) and can be one processor or a plurality of processors that are operatively connected. The memory 134 can include one or more non-transitory computer-readable storage mediums, such as RAM, ROM, EEPROM, EPROM, flash memory devices, magnetic disks, etc., and combinations thereof. The memory 134 can store data 136 and instructions 138 which are executed by the processor 132 to cause the server computing system 130 to perform operations. [0147] In some implementations, the server computing system 130 includes or is otherwise implemented by one or more server computing devices. In instances in which the server computing system 130 includes plural server computing devices, such server computing devices can operate according to sequential computing architectures, parallel computing architectures, or some combination thereof. [0148] As described above, the server computing system 130 can store or otherwise include one or more machine-learned models 140. For example, the models 140 can be or can otherwise include various machine-learned models. Example machine-learned models include neural networks or other multi-layer non-linear models. Example neural networks include feed forward neural networks, deep neural networks, recurrent neural networks, and convolutional neural networks. Example models 140 are discussed with reference to Figure 9B. 
[0149] Additionally and/or alternatively, the server computing system 130 can include and/or be communicatively connected with a search engine 142 that may be utilized to crawl one or more databases (and/or resources). The search engine 142 can process data from the user computing system 102, the server computing system 130, and/or the third party computing system 150 to determine one or more search results associated with the input data. The search engine 142 may perform term based search, label based search, Boolean based searches, image search, embedding based search (e.g., nearest neighbor search), multimodal search, and/or one or more other search techniques. [0150] The server computing system 130 may store and/or provide one or more user interfaces 144 for obtaining input data and/or providing output data to one or more users. The one or more user interfaces 144 can include one or more user interface elements, which may include input fields, navigation tools, content chips, selectable tiles, widgets, data display carousels, dynamic animation, informational pop-ups, image augmentations, text-to-speech, speech-to-text, augmented-reality, virtual-reality, feedback loops, and/or other interface elements. [0151] The user computing system 102 and/or the server computing system 130 can train the models 120 and/or 140 via interaction with the third party computing system 150 that is communicatively coupled over the network 180. The third party computing system 150 can be separate from the server computing system 130 or can be a portion of the server computing system 130. Alternatively and/or additionally, the third party computing system 150 may be associated with one or more web resources, one or more web platforms, one or more other users, and/or one or more contexts. 
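As a non-limiting illustration of the embedding based search (e.g., nearest neighbor search) mentioned in paragraph [0149], the following Python sketch ranks stored embeddings by cosine similarity to a query embedding. The function names, the example index, and its three-dimensional embeddings are hypothetical and chosen only for illustration; a production search engine may instead use approximate nearest neighbor techniques over a much larger index.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # Treat a zero vector as dissimilar to everything.
    return dot / (norm_a * norm_b)


def nearest_neighbors(query, index, k=1):
    """Return the keys of the k stored embeddings most similar to the query.

    This is an exhaustive (brute-force) scan, shown for clarity only.
    """
    scored = [(cosine_similarity(query, emb), key) for key, emb in index.items()]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [key for _, key in scored[:k]]


# Hypothetical index mapping result identifiers to embeddings.
example_index = {
    "dog": [0.9, 0.1, 0.0],
    "cat": [0.8, 0.3, 0.1],
    "car": [0.0, 0.2, 0.9],
}
top_two = nearest_neighbors([1.0, 0.0, 0.0], example_index, k=2)
```

Here an embedding generated from a detected object feature would play the role of the query, and the returned keys would identify candidate search results.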
[0152] An example machine-learned model can include a generative model (e.g., a large language model, a foundation model, a vision language model, an image generation model, a text-to-image model, an audio generation model, and/or other generative models). [0153] Training and/or tuning the machine-learned model can include obtaining a training instance. A set of training data can include a plurality of training instances divided between multiple datasets (e.g., a training dataset, a validation dataset, or testing dataset). A training instance can be labeled or unlabeled. The runtime inferences can form training instances when a model is trained using an evaluation of the model’s performance on that runtime instance (e.g., online training/learning). Example data types for the training instance and various tasks associated therewith are described throughout the present disclosure. [0154] Training and/or tuning can include processing, using one or more machine-learned models, the training instance to generate an output. The output can be directly obtained from the one or more machine-learned models or can be a downstream result of a chain of processing operations that includes an output of the one or more machine-learned models. [0155] Training and/or tuning can include receiving an evaluation signal associated with the output. The evaluation signal can be obtained using a loss function. Various determinations of loss can be used, such as mean squared error, likelihood loss, cross entropy loss, hinge loss, contrastive loss, or various other loss functions. The evaluation signal can be computed using known ground-truth labels (e.g., supervised learning), predicted or estimated labels (e.g., semi- or self-supervised learning), or without labels (e.g., unsupervised learning). The evaluation signal can be a reward (e.g., for reinforcement learning). The reward can be computed using a machine-learned reward model configured to generate rewards based on output(s) received. 
The reward can be computed using feedback data describing human feedback on the output(s). [0156] Training and/or tuning can include updating the machine-learned model using the evaluation signal. For example, values for parameters of the machine-learned model(s) can be learned, in some embodiments, using various training or learning techniques, such as, for example, backwards propagation. For example, the evaluation signal can be backpropagated from the output (or another source of the evaluation signal) through the machine-learned model(s) to update one or more parameters of the model(s) (e.g., based on a gradient of the evaluation signal with respect to the parameter value(s)). For example, system(s) containing one or more machine-learned models can be trained in an end-to-end manner. Gradient descent techniques can be used to iteratively update the parameters over a number of training iterations. In some implementations, performing backwards propagation of errors can include performing truncated backpropagation through time. Training and/or tuning can include implementing a number of generalization techniques (e.g., weight decays, dropouts, etc.) to improve the generalization capability of the models being trained. [0157] In some implementations, the above training loop can be implemented for training a machine-learned model from an initialized state to a fully trained state (e.g., when the model exhibits a desired performance profile, such as based on accuracy, precision, recall, etc.). [0158] In some implementations, the above training loop can be implemented for particular stages of a training procedure. For instance, in some implementations, the above training loop can be implemented for pre-training a machine-learned model. Pre-training can include, for instance, large-scale training over potentially noisy data to achieve a broad base of performance levels across a variety of tasks/data types. 
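As a non-limiting illustration of the training loop described above (processing training instances to generate outputs, obtaining an evaluation signal from a loss function, and updating parameters based on a gradient of that signal), the following Python sketch fits a one-variable linear model with a mean squared error loss and gradient descent. The function name, the learning rate, the step count, and the toy dataset are assumptions made only for illustration and do not reflect any particular model of the present disclosure.

```python
def train_linear(data, learning_rate=0.05, steps=2000):
    """Fit y = w * x + b by gradient descent on the mean squared error."""
    w, b = 0.0, 0.0  # Initialized state of the model parameters.
    n = len(data)
    for _ in range(steps):
        grad_w, grad_b = 0.0, 0.0
        for x, target in data:
            # Process the training instance to generate an output.
            error = (w * x + b) - target
            # Accumulate the gradient of the mean squared error.
            grad_w += 2.0 * error * x / n
            grad_b += 2.0 * error / n
        # Update the parameters using the evaluation signal's gradient.
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    return w, b


# Toy labeled training instances drawn from y = 2x + 1.
toy_data = [(0.0, 1.0), (1.0, 3.0), (2.0, 5.0), (3.0, 7.0)]
learned_w, learned_b = train_linear(toy_data)
```

The same loop structure applies when the parameters belong to a neural network, in which case the gradients are typically obtained by backpropagation rather than written out by hand.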
In some implementations, the above training loop can be implemented for fine-tuning a machine-learned model. Fine-tuning can include, for instance, smaller-scale training on higher-quality (e.g., labeled, curated, etc.) data. Fine-tuning can affect all or a portion of the parameters of a machine-learned model. For example, various portions of the machine-learned model can be “frozen” for certain training stages. For example, parameters associated with an embedding space can be “frozen” during fine-tuning (e.g., to retain information learned from a broader domain(s) than present in the fine-tuning dataset(s)). An example fine-tuning approach includes reinforcement learning. Reinforcement learning can be based on user feedback on model performance during use. [0159] In some implementations, the computing system 100 may utilize one or more soft prompts for conditioning the one or more machine-learned models (120 and/or 140) for downstream tasks. The one or more soft prompts can include a set of tunable parameters that can be trained (or tuned) as the parameters of the one or more machine-learned models (120 and/or 140) are fixed. The one or more soft prompts 124 can be trained for a specific task and/or a specific set of tasks. Alternatively and/or additionally, the one or more soft prompts 124 may be trained to condition the one or more machine-learned models (120 and/or 140) to perform inferences for a particular individual, one or more entities, and/or one or more tasks such that the output is tailored for that particular individual, particular entities, and/or particular task. The one or more soft prompts 124 can be obtained and processed with one or more inputs by the one or more machine-learned models (120 and/or 140). [0160] The one or more soft prompts can include a set of machine-learned weights. 
In particular, the one or more soft prompts can include weights that were trained to condition a generative model to generate model-generated content with one or more particular attributes. For example, the one or more soft prompts can be utilized by a user to generate content based on the fine-tuning. The one or more soft prompts can be extended to a plurality of tasks. For example, the computing system 100 may tune the set of parameters on a plurality of different content attributes and/or types. The one or more soft prompts may include a plurality of learned vector representations that may be model-readable. [0161] A particular soft prompt can be obtained based on a particular task, individual, content type, etc. The particular soft prompt can include a set of learned parameters. The set of learned parameters can be processed with the generative model to generate the model-generated image. [0162] The user computing system 102 and/or the server computing system 130 may store one or more soft prompts associated with the particular user and/or particular task. The soft prompt(s) can include a set of parameters. The user computing system 102 and/or the server computing system 130 may leverage the set of parameters of the soft prompt(s) and a generative model to generate a model-generated content item. In some implementations, the model-generated content item can be generated based on the set of parameters associated with the particular individual and/or task. [0163] The utilization of a soft prompt (i.e., a set of parameters that can be processed with a generative model for downstream task conditioning) can reduce the computational cost for parameter tuning for object-specific content generation by reducing the parameters to be tuned. The set of parameters can be limited and may be adjusted while the parameters of the pre-trained generative model stay fixed. 
The set of parameters of the soft prompt can be utilized to condition the pre-trained generative model (e.g., the machine-learned image generation model and/or language model) for particular downstream tasks (e.g., response generation and/or image rendering). [0164] In some implementations, the generative language model and/or one or more soft prompts (e.g., a set of machine-learned parameters that can be processed with the input by the generative language model) can be trained to generate content with particular attributes. [0165] In some implementations, the server computing system 130 can include a prompt library. The prompt library can store a plurality of prompt templates (e.g., a plurality of hard prompt templates (e.g., text prompt templates)) and/or a plurality of soft prompts. The plurality of prompt templates can include hard prompt templates (e.g., text string data) that may be combined with the user input to generate a more detailed and complete prompt for the generative model to process. The templates can include text descriptive of the request. The templates may be object-specific, user-specific, and/or content-specific. The plurality of prompt templates may include few-shot examples. [0166] The prompt library can store a plurality of soft prompts. The plurality of soft prompts may be associated with a plurality of different content attributes and/or a plurality of different individuals. The plurality of soft prompts can include learned parameters and/or learned weights that can be processed with the generative model to condition the generative model to generate content items with particular attributes. The plurality of soft prompts may have been tuned by freezing the parameters of a pre-trained generative model, while the parameters of the soft prompt are learned based on a particular task and/or user. 
The plurality of soft prompts can include a plurality of different soft prompts associated with a plurality of different users and/or a plurality of different sets of users. [0167] The third party computing system 150 can include one or more processors 152 and a memory 154. The one or more processors 152 can be any suitable processing device (e.g., a processor core, a microprocessor, an ASIC, a FPGA, a controller, a microcontroller, etc.) and can be one processor or a plurality of processors that are operatively connected. The memory 154 can include one or more non-transitory computer-readable storage mediums, such as RAM, ROM, EEPROM, EPROM, flash memory devices, magnetic disks, etc., and combinations thereof. The memory 154 can store data 156 and instructions 158 which are executed by the processor 152 to cause the third party computing system 150 to perform operations. In some implementations, the third party computing system 150 includes or is otherwise implemented by one or more server computing devices. [0168] The network 180 can be any type of communications network, such as a local area network (e.g., intranet), wide area network (e.g., Internet), or some combination thereof and can include any number of wired or wireless links. In general, communication over the network 180 can be carried via any type of wired and/or wireless connection, using a wide variety of communication protocols (e.g., TCP/IP, HTTP, SMTP, FTP), encodings or formats (e.g., HTML, XML), and/or protection schemes (e.g., VPN, secure HTTP, SSL). [0169] The machine-learned models described in this specification may be used in a variety of tasks, applications, and/or use cases. [0170] In some implementations, the input to the machine-learned model(s) of the present disclosure can be image data. The machine-learned model(s) can process the image data to generate an output. 
As an example, the machine-learned model(s) can process the image data to generate an image recognition output (e.g., a recognition of the image data, a latent embedding of the image data, an encoded representation of the image data, a hash of the image data, etc.). As another example, the machine-learned model(s) can process the image data to generate an image segmentation output. As another example, the machine-learned model(s) can process the image data to generate an image classification output. As another example, the machine-learned model(s) can process the image data to generate an image data modification output (e.g., an alteration of the image data, etc.). As another example, the machine-learned model(s) can process the image data to generate an encoded image data output (e.g., an encoded and/or compressed representation of the image data, etc.). As another example, the machine-learned model(s) can process the image data to generate an upscaled image data output. As another example, the machine-learned model(s) can process the image data to generate a prediction output. [0171] In some implementations, the input to the machine-learned model(s) of the present disclosure can be text or natural language data. The machine-learned model(s) can process the text or natural language data to generate an output. As an example, the machine-learned model(s) can process the natural language data to generate a language encoding output. As another example, the machine-learned model(s) can process the text or natural language data to generate a latent text embedding output. As another example, the machine-learned model(s) can process the text or natural language data to generate a translation output. As another example, the machine-learned model(s) can process the text or natural language data to generate a classification output. As another example, the machine-learned model(s) can process the text or natural language data to generate a textual segmentation output. 
As another example, the machine-learned model(s) can process the text or natural language data to generate a semantic intent output. As another example, the machine-learned model(s) can process the text or natural language data to generate an upscaled text or natural language output (e.g., text or natural language data that is higher quality than the input text or natural language, etc.). As another example, the machine-learned model(s) can process the text or natural language data to generate a prediction output. [0172] In some implementations, the input to the machine-learned model(s) of the present disclosure can be speech data. The machine-learned model(s) can process the speech data to generate an output. As an example, the machine-learned model(s) can process the speech data to generate a speech recognition output. As another example, the machine-learned model(s) can process the speech data to generate a speech translation output. As another example, the machine-learned model(s) can process the speech data to generate a latent embedding output. As another example, the machine-learned model(s) can process the speech data to generate an encoded speech output (e.g., an encoded and/or compressed representation of the speech data, etc.). As another example, the machine-learned model(s) can process the speech data to generate an upscaled speech output (e.g., speech data that is higher quality than the input speech data, etc.). As another example, the machine-learned model(s) can process the speech data to generate a textual representation output (e.g., a textual representation of the input speech data, etc.). As another example, the machine-learned model(s) can process the speech data to generate a prediction output. [0173] In some implementations, the input to the machine-learned model(s) of the present disclosure can be sensor data. The machine-learned model(s) can process the sensor data to generate an output. 
As an example, the machine-learned model(s) can process the sensor data to generate a recognition output. As another example, the machine-learned model(s) can process the sensor data to generate a prediction output. As another example, the machine-learned model(s) can process the sensor data to generate a classification output. As another example, the machine-learned model(s) can process the sensor data to generate a segmentation output. As another example, the machine-learned model(s) can process the sensor data to generate a visualization output. As another example, the machine-learned model(s) can process the sensor data to generate a diagnostic output. As another example, the machine-learned model(s) can process the sensor data to generate a detection output. [0174] In some cases, the input includes visual data and the task is a computer vision task. In some cases, the input includes pixel data for one or more images and the task is an image processing task. For example, the image processing task can be image classification, where the output is a set of scores, each score corresponding to a different object class and representing the likelihood that the one or more images depict an object belonging to the object class. The image processing task may be object detection, where the image processing output identifies one or more regions in the one or more images and, for each region, a likelihood that the region depicts an object of interest. As another example, the image processing task can be image segmentation, where the image processing output defines, for each pixel in the one or more images, a respective likelihood for each category in a predetermined set of categories. For example, the set of categories can be foreground and background. As another example, the set of categories can be object classes. 
As another example, the image processing task can be depth estimation, where the image processing output defines, for each pixel in the one or more images, a respective depth value. As another example, the image processing task can be motion estimation, where the network input includes multiple images, and the image processing output defines, for each pixel of one of the input images, a motion of the scene depicted at the pixel between the images in the network input. [0175] In some implementations, the task can be a generative task, and the one or more machine-learned models (e.g., 120 and/or 140) can be configured to output content generated in view of one or more inputs. For instance, the inputs can be or otherwise represent data of one or more modalities that encodes context for generating additional content. [0176] In some implementations, the task can be a text completion task. The machine-learned models can be configured to process the inputs that represent textual data and to generate the outputs that represent additional textual data that completes a textual sequence that includes the inputs. For instance, the machine-learned models can be configured to generate the outputs to complete a sentence, paragraph, or portion of text that follows from a portion of text represented by inputs. [0177] In some implementations, the task can be an instruction following task. The machine-learned models can be configured to process the inputs that represent instructions to perform a function and to generate the outputs that advance a goal of satisfying the instruction function (e.g., at least a step of a multi-step procedure to perform the function). The outputs can represent data of the same or of a different modality as the inputs. 
For instance, the inputs can represent textual data (e.g., natural language instructions for a task to be performed) and the machine-learned models can process the inputs to generate the outputs that represent textual data responsive to the instructions (e.g., natural language responses, programming language responses, machine language responses, etc.). The inputs can represent image data (e.g., image-based instructions for a task to be performed, optionally accompanied by textual instructions) and the machine-learned models can process the inputs to generate the outputs that represent textual data responsive to the instructions (e.g., natural language responses, programming language responses, machine language responses, etc.). One or more outputs can be iteratively or recursively generated to sequentially process and accomplish steps toward accomplishing the requested functionality. For instance, an initial output can be executed by an external system or be processed by the machine-learned models to complete an initial step of performing a function. Multiple steps can be performed, with a final output being obtained that is responsive to the initial instructions. [0178] In some implementations, the task can be a question answering task. The machine-learned models can be configured to process the inputs that represent a question to answer and to generate the outputs that advance a goal of returning an answer to the question (e.g., at least a step of a multi-step procedure to perform the function). The outputs can represent data of the same or of a different modality as the inputs. For instance, the inputs can represent textual data (e.g., natural language instructions for a task to be performed) and the machine-learned models can process the inputs to generate the outputs that represent textual data responsive to the question (e.g., natural language responses, programming language responses, machine language responses, etc.). 
The inputs can represent image data (e.g., image-based instructions for a task to be performed, optionally accompanied by textual instructions) and the machine-learned models can process the inputs to generate the outputs that represent textual data responsive to the question (e.g., natural language responses, programming language responses, machine language responses, etc.). One or more outputs can be iteratively or recursively generated to sequentially process and accomplish steps toward answering the question. For instance, an initial output can be executed by an external system or be processed by the machine-learned models to complete an initial step of obtaining an answer to the question (e.g., querying a database, performing a computation, executing a script, etc.). Multiple steps can be performed, with a final output being obtained that is responsive to the question. [0179] In some implementations, the task can be an image generation task. The machine-learned models can be configured to process the inputs that represent context regarding a desired portion of image content. The context can include text data, image data, audio data, etc. Machine-learned models can be configured to generate the outputs that represent image data that depicts imagery related to the context. For instance, the machine-learned models can be configured to generate pixel data of an image. Values for channels associated with the pixels in the pixel data can be selected based on the context (e.g., based on a probability determined based on the context). [0180] In some implementations, the task can be an audio generation task. Machine-learned models can be configured to process the inputs that represent context regarding a desired portion of audio content. The context can include text data, image data, audio data, etc. The machine-learned models can be configured to generate the outputs that represent audio data related to the context. 
For instance, the machine-learned models can be configured to generate waveform data in the form of an image (e.g., a spectrogram). Values for channels associated with pixels of the image can be selected based on the context. The machine-learned models can be configured to generate waveform data in the form of a sequence of discrete samples of a continuous waveform. Values of the sequence can be selected based on the context (e.g., based on a probability determined based on the context). [0181] In some implementations, the task can be a data generation task. Machine-learned models can be configured to process the inputs that represent context regarding a desired portion of data (e.g., data from various data domains, such as sensor data, image data, multimodal data, statistical data, etc.). The desired data can be, for instance, synthetic data for training other machine-learned models. The context can include arbitrary data types. The machine-learned models can be configured to generate the outputs that represent data that aligns with the desired data. For instance, the machine-learned models can be configured to generate data values for populating a dataset. Values for the data objects can be selected based on the context (e.g., based on a probability determined based on the context). [0182] The user computing system may include a number of applications (e.g., applications 1 through N). Each application may include its own respective machine learning library and machine-learned model(s). For example, each application can include a machine-learned model. Example applications include a text messaging application, an email application, a dictation application, a virtual keyboard application, a browser application, etc. [0183] Each application can communicate with a number of other components of the computing device, such as, for example, one or more sensors, a context manager, a device state component, and/or additional components. 
In some implementations, each application can communicate with each device component using an API (e.g., a public API). In some implementations, the API used by each application is specific to that application. [0184] The user computing system 102 can include a number of applications (e.g., applications 1 through N). Each application is in communication with a central intelligence layer. Example applications include a text messaging application, an email application, a dictation application, a virtual keyboard application, a browser application, etc. In some implementations, each application can communicate with the central intelligence layer (and model(s) stored therein) using an API (e.g., a common API across all applications). [0185] The central intelligence layer can include a number of machine-learned models. For example a respective machine-learned model (e.g., a model) can be provided for each application and managed by the central intelligence layer. In other implementations, two or more applications can share a single machine-learned model. For example, in some implementations, the central intelligence layer can provide a single model (e.g., a single model) for all of the applications. In some implementations, the central intelligence layer is included within or otherwise implemented by an operating system of the computing system 100. [0186] The central intelligence layer can communicate with a central device data layer. The central device data layer can be a centralized repository of data for the computing system 100. The central device data layer may communicate with a number of other components of the computing device, such as, for example, one or more sensors, a context manager, a device state component, and/or additional components. In some implementations, the central device data layer can communicate with each device component using an API (e.g., a private API). 
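By way of a non-limiting illustration, the arrangement of paragraphs [0184] and [0185], in which applications reach models through a common API and may either receive a respective model or share a single model, can be sketched as follows. The class name `CentralIntelligenceLayer`, the methods `register` and `infer`, and the treatment of a model as a plain callable are assumptions made for this sketch and do not appear in the disclosure.

```python
from typing import Callable, Dict, Optional

# Assumption: a "model" is any callable that maps an input string to an output string.
Model = Callable[[str], str]


class CentralIntelligenceLayer:
    """Hypothetical layer that manages models on behalf of applications."""

    def __init__(self, shared_model: Optional[Model] = None) -> None:
        # Optional single model shared by every application without its own model.
        self._shared_model = shared_model
        # Respective per-application models, keyed by an application identifier.
        self._per_app: Dict[str, Model] = {}

    def register(self, app_id: str, model: Model) -> None:
        """Provide a respective model for one application."""
        self._per_app[app_id] = model

    def infer(self, app_id: str, payload: str) -> str:
        """Common API: every application calls the same method."""
        model = self._per_app.get(app_id, self._shared_model)
        if model is None:
            raise LookupError(f"no model available for application {app_id!r}")
        return model(payload)
```

In this sketch, an application with a registered model is served by that model, and any other application falls back to the shared model when one is configured.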
[0187] Figure 11B depicts a block diagram of an example computing system 50 that performs visual query processing according to example embodiments of the present disclosure. In particular, the example computing system 50 can include one or more computing devices 52 that can be utilized to obtain, and/or generate, one or more datasets that can be processed by a sensor processing system 60 and/or an output determination system 80 to generate feedback for a user that provides information on features in the one or more obtained datasets. The one or more datasets can include image data, text data, audio data, multimodal data, latent encoding data, etc. The one or more datasets may be obtained via one or more sensors associated with the one or more computing devices 52 (e.g., one or more sensors in the computing device 52). Additionally and/or alternatively, the one or more datasets can be stored data and/or retrieved data (e.g., data retrieved from a web resource). For example, images, text, and/or other content items may be interacted with by a user. The interacted with content items can then be utilized to generate one or more determinations. [0188] The one or more computing devices 52 can obtain, and/or generate, one or more datasets based on image capture, sensor tracking, data storage retrieval, content download (e.g., downloading an image or other content item via the internet from a web resource), and/or via one or more other techniques. The one or more datasets can be processed with a sensor processing system 60. The sensor processing system 60 may perform one or more processing techniques using one or more machine-learned models, one or more search engines, and/or one or more other processing techniques. The one or more processing techniques can be performed in any combination and/or individually. The one or more processing techniques can be performed in series and/or in parallel. 
In particular, the one or more datasets can be processed with a context determination block 62, which may determine a context associated with one or more content items. The context determination block 62 may identify and/or process metadata, user profile data (e.g., preferences, user search history, user browsing history, user purchase history, and/or user input data), previous interaction data, global trend data, location data, time data, and/or other data to determine a particular context associated with the user. The context can be associated with an event, a determined trend, a particular action, a particular type of data, a particular environment, and/or another context associated with the user and/or the retrieved or obtained data. [0189] The sensor processing system 60 may include an image preprocessing block 64. The image preprocessing block 64 may be utilized to adjust one or more values of an obtained and/or received image to prepare the image to be processed by one or more machine-learned models and/or one or more search engines 74. The image preprocessing block 64 may resize the image, adjust saturation values, adjust resolution, strip and/or add metadata, and/or perform one or more other operations. [0190] In some implementations, the sensor processing system 60 can include one or more machine-learned models, which may include a detection model 66, a segmentation model 68, a classification model 70, an embedding model 72, and/or one or more other machine-learned models. For example, the sensor processing system 60 may include one or more detection models 66 that can be utilized to detect particular features in the processed dataset. In particular, one or more images can be processed with the one or more detection models 66 to generate one or more bounding boxes associated with detected features in the one or more images. 
[0191] Additionally and/or alternatively, one or more segmentation models 68 can be utilized to segment one or more portions of the dataset from the one or more datasets. For example, the one or more segmentation models 68 may utilize one or more segmentation masks (e.g., one or more segmentation masks manually generated and/or generated based on the one or more bounding boxes) to segment a portion of an image, a portion of an audio file, and/or a portion of text. The segmentation may include isolating one or more detected objects and/or removing one or more detected objects from an image. [0192] The one or more classification models 70 can be utilized to process image data, text data, audio data, latent encoding data, multimodal data, and/or other data to generate one or more classifications. The one or more classification models 70 can include one or more image classification models, one or more object classification models, one or more text classification models, one or more audio classification models, and/or one or more other classification models. The one or more classification models 70 can process data to determine one or more classifications. [0193] In some implementations, data may be processed with one or more embedding models 72 to generate one or more embeddings. For example, one or more images can be processed with the one or more embedding models 72 to generate one or more image embeddings in an embedding space. The one or more image embeddings may be associated with one or more image features of the one or more images. In some implementations, the one or more embedding models 72 may be configured to process multimodal data to generate multimodal embeddings. The one or more embeddings can be utilized for classification, search, and/or learning embedding space distributions. [0194] The sensor processing system 60 may include one or more search engines 74 that can be utilized to perform one or more searches. 
The one or more search engines 74 may crawl one or more databases (e.g., one or more local databases, one or more global databases, one or more private databases, one or more public databases, one or more specialized databases, and/or one or more general databases) to determine one or more search results. The one or more search engines 74 may perform feature matching, text based search, embedding based search (e.g., k-nearest neighbor search), metadata based search, multimodal search, web resource search, image search, text search, and/or application search. [0195] Additionally and/or alternatively, the sensor processing system 60 may include one or more multimodal processing blocks 76, which can be utilized to aid in the processing of multimodal data. The one or more multimodal processing blocks 76 may include generating a multimodal query and/or a multimodal embedding to be processed by one or more machine-learned models and/or one or more search engines 74. [0196] The output(s) of the sensor processing system 60 can then be processed with an output determination system 80 to determine one or more outputs to provide to a user. The output determination system 80 may include heuristic based determinations, machine-learned model based determinations, user selection based determinations, and/or context based determinations. [0197] The output determination system 80 may determine how and/or where to provide the one or more search results in a search results interface 82. Additionally and/or alternatively, the output determination system 80 may determine how and/or where to provide the one or more machine-learned model outputs in a machine-learned model output interface 84. In some implementations, the one or more search results and/or the one or more machine-learned model outputs may be provided for display via one or more user interface elements. The one or more user interface elements may be overlayed over displayed data. 
For example, one or more detection indicators may be overlayed over detected objects in a viewfinder. The one or more user interface elements may be selectable to perform one or more additional searches and/or one or more additional machine-learned model processes. In some implementations, the user interface elements may be provided as specialized user interface elements for specific applications and/or may be provided uniformly across different applications. The one or more user interface elements can include pop-up displays, interface overlays, interface tiles and/or chips, carousel interfaces, audio feedback, animations, interactive widgets, and/or other user interface elements. [0198] Additionally and/or alternatively, data associated with the output(s) of the sensor processing system 60 may be utilized to generate and/or provide an augmented-reality experience and/or a virtual-reality experience 86. For example, the one or more obtained datasets may be processed to generate one or more augmented-reality rendering assets and/or one or more virtual-reality rendering assets, which can then be utilized to provide an augmented-reality experience and/or a virtual-reality experience 86 to a user. The augmented-reality experience may render information associated with an environment into the respective environment. Alternatively and/or additionally, objects related to the processed dataset(s) may be rendered into the user environment and/or a virtual environment. Rendering dataset generation may include training one or more neural radiance field models to learn a three-dimensional representation for one or more objects. [0199] In some implementations, one or more action prompts 88 may be determined based on the output(s) of the sensor processing system 60. 
For example, a search prompt, a purchase prompt, a generate prompt, a reservation prompt, a call prompt, a redirect prompt, and/or one or more other prompts may be determined to be associated with the output(s) of the sensor processing system 60. The one or more action prompts 88 may then be provided to the user via one or more selectable user interface elements. In response to a selection of the one or more selectable user interface elements, a respective action of the respective action prompt may be performed (e.g., a search may be performed, a purchase application programming interface may be utilized, and/or another application may be opened). [0200] In some implementations, the one or more datasets and/or the output(s) of the sensor processing system 60 may be processed with one or more generative models 90 to generate a model-generated content item that can then be provided to a user. The generation may be prompted based on a user selection and/or may be automatically performed (e.g., automatically performed based on one or more conditions, which may be associated with a threshold amount of search results not being identified). 
[0202] The one or more generative models 90 can include language models (e.g., large language models and/or vision language models), image generation models (e.g., text-to-image generation models and/or image augmentation models), audio generation models, video generation models, graph generation models, and/or other data generation models (e.g., other content generation models). The one or more generative models 90 can include one or more transformer models, one or more convolutional neural networks, one or more recurrent neural networks, one or more feedforward neural networks, one or more generative adversarial networks, one or more self-attention models, one or more embedding models, one or more encoders, one or more decoders, and/or one or more other models. In some implementations, the one or more generative models 90 can include one or more autoregressive models (e.g., a machine-learned model trained to generate predictive values based on previous behavior data) and/or one or more diffusion models (e.g., a machine-learned model trained to generate predicted data based on generating and processing distribution data associated with the input data). [0203] The one or more generative models 90 can be trained to process input data and generate model-generated content items, which may include a plurality of predicted words, pixels, signals, and/or other data. The model-generated content items may include novel content items that are not the same as any pre-existing work. The one or more generative models 90 can leverage learned representations, sequences, and/or probability distributions to generate the content items, which may include phrases, storylines, settings, objects, characters, beats, lyrics, and/or other aspects that are not included in pre-existing content items. [0204] The one or more generative models 90 may include a vision language model. 
[0205] The vision language model can be trained, tuned, and/or configured to process image data and/or text data to generate a natural language output. The vision language model may leverage a pre-trained large language model (e.g., a large autoregressive language model) with one or more encoders (e.g., one or more image encoders and/or one or more text encoders) to provide detailed natural language outputs that emulate natural language composed by a human. [0206] The vision language model may be utilized for zero-shot image classification, few shot image classification, image captioning, multimodal query distillation, multimodal question and answering, and/or may be tuned and/or trained for a plurality of different tasks. The vision language model can perform visual question answering, image caption generation, feature detection (e.g., content monitoring (e.g., for inappropriate content)), object detection, scene recognition, and/or other tasks. [0207] The vision language model may leverage a pre-trained language model that may then be tuned for multimodality. Training and/or tuning of the vision language model can include image-text matching, masked-language modeling, multimodal fusing with cross attention, contrastive learning, prefix language model training, and/or other training techniques. For example, the vision language model may be trained to process an image to generate predicted text that is similar to ground truth text data (e.g., a ground truth caption for the image). In some implementations, the vision language model may be trained to replace masked tokens of a natural language template with textual tokens descriptive of features depicted in an input image. Alternatively and/or additionally, the training, tuning, and/or model inference may include multi-layer concatenation of visual and textual embedding features. 
In some implementations, the vision language model may be trained and/or tuned via jointly learning image embedding and text embedding generation, which may include training and/or tuning a system to map embeddings to a joint feature embedding space that maps text features and image features into a shared embedding space. The joint training may include image-text pair parallel embedding and/or may include triplet training. In some implementations, the images may be utilized and/or processed as prefixes to the language model. [0208] The one or more generative models 90 may be stored on-device and/or may be stored on a server computing system. In some implementations, the one or more generative models 90 can perform on-device processing to determine suggested searches, suggested actions, and/or suggested prompts. The one or more generative models 90 may include one or more compact vision language models that may include fewer parameters than a vision language model stored and operated by the server computing system. The compact vision language model may be trained via distillation training. In some implementations, the vision language model may process the display data to generate suggestions. The display data can include a single image descriptive of a screenshot and/or may include image data, metadata, and/or other data descriptive of a period of time preceding the current displayed content (e.g., the applications, images, videos, messages, and/or other content viewed within the past 30 seconds). The user computing device may generate and store a rolling buffer window (e.g., 30 seconds) of data descriptive of content displayed during the buffer. Once the time has elapsed, the data may be deleted. The rolling buffer window data may be utilized to determine a context, which can be leveraged for query, content, action, and/or prompt suggestion. [0209] In some implementations, the generative models 90 can include machine-learned sequence processing models. 
An example system can pass inputs to sequence processing models. Sequence processing models can include one or more machine-learned components. Sequence processing models can process the data from inputs to obtain an input sequence. Input sequence can include one or more input elements obtained from inputs. The sequence processing model can process the input sequence using prediction layers to generate an output sequence. The output sequence can include one or more output elements generated based on input sequence. The system can generate outputs based on output sequence. [0210] Sequence processing models can include one or multiple machine-learned model components configured to ingest, generate, or otherwise reason over sequences of information. For example, some example sequence processing models in the text domain are referred to as “Large Language Models,” or LLMs. See, e.g., PaLM 2 Technical Report, Google, https://ai.google/static/documents/palm2techreport.pdf (n.d.). Other example sequence processing models can operate in other domains, such as image domains, see, e.g., Dosovitskiy et al., An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale, arXiv:2010.11929v2 (Jun.3, 2021), audio domains, see, e.g., Agostinelli et al., MusicLM: Generating Music From Text, arXiv:2301.11325v1 (Jan.26, 2023), biochemical domains, see, e.g., Jumper et al., Highly accurate protein structure prediction with AlphaFold, 596 Nature 583 (Aug.26, 2021), by way of example. Sequence processing models can process one or multiple types of data simultaneously. Sequence processing models can include relatively large models (e.g., more parameters, computationally expensive, etc.), relatively small models (e.g., fewer parameters, computationally lightweight, etc.), or both. [0211] In general, sequence processing models can obtain an input sequence using data from inputs. 
For instance, input sequence can include a representation of data from inputs in a format understood by sequence processing models. One or more machine-learned components of sequence processing models can ingest the data from inputs, parse the data into pieces compatible with the processing architectures of sequence processing models (e.g., via “tokenization”), and project the pieces into an input space associated with prediction layers (e.g., via “embedding”). [0212] Sequence processing models can ingest the data from inputs and parse the data into a sequence of elements to obtain input sequence. For example, a portion of input data from inputs can be broken down into pieces that collectively represent the content of the portion of the input data. The pieces can provide the elements of the sequence. [0213] In some implementations, processing the input data can include tokenization. For example, a tokenizer may process a given portion of an input source and output a series of tokens (e.g., corresponding to input elements) that represent the portion of the input source. Various approaches to tokenization can be used. For instance, textual input sources can be tokenized using a byte-pair encoding (BPE) technique. See, e.g., Kudo et al., SentencePiece: A simple and language independent subword tokenizer and detokenizer for Neural Text Processing, Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing (System Demonstrations), pages 66–71 (October 31–November 4, 2018), https://aclanthology.org/D18-2012.pdf. Image-based input sources can be tokenized by extracting and serializing patches from an image. [0214] In general, arbitrary data types can be serialized and processed into an input sequence. [0215] Prediction layers can predict one or more output elements based on the input elements. 
Prediction layers can include one or more machine-learned model architectures, such as one or more layers of learned parameters that manipulate and transform the inputs to extract higher-order meaning from, and relationships between, input elements. In this manner, for instance, example prediction layers can predict new output elements in view of the context provided by input sequence. [0216] Prediction layers can evaluate associations between portions of input sequence and a particular output element. These associations can inform a prediction of the likelihood that a particular output follows the input context. For example, consider the textual snippet, “The carpenter’s toolbox was small and heavy. It was full of ___.” Example prediction layers can identify that “It” refers back to “toolbox” by determining a relationship between the respective embeddings. Example prediction layers can also link “It” to the attributes of the toolbox, such as “small” and “heavy.” Based on these associations, prediction layers can, for instance, assign a higher probability to the word “nails” than to the word “sawdust.” [0217] A transformer is an example architecture that can be used in prediction layers. See, e.g., Vaswani et al., Attention Is All You Need, arXiv:1706.03762v7 (Aug.2, 2023). A transformer is an example of a machine-learned model architecture that uses an attention mechanism to compute associations between items within a context window. The context window can include a sequence that contains input sequence and potentially one or more output elements. A transformer block can include one or more attention layers and one or more post-attention layers (e.g., feedforward layers, such as a multi-layer perceptron). [0218] Prediction layers can include other machine-learned model architectures in addition to or in lieu of transformer-based architectures. 
For example, recurrent neural networks (RNNs) and long short-term memory (LSTM) models can also be used, as well as convolutional neural networks (CNNs). In general, prediction layers can leverage various kinds of artificial neural networks that can understand or generate sequences of information. [0219] Output sequence can include or otherwise represent the same or different data types as input sequence. For instance, input sequence can represent textual data, and output sequence can represent textual data. The input sequence can represent image, audio, or audiovisual data, and output sequence can represent textual data (e.g., describing the image, audio, or audiovisual data). It is to be understood that prediction layers, and any other interstitial model components of sequence processing models, can be configured to receive a variety of data types in input sequences and output a variety of data types in output sequences. [0220] The output sequence can have various relationships to an input sequence. Output sequence can be a continuation of input sequence. The output sequence can be complementary to the input sequence. The output sequence can translate, transform, augment, or otherwise modify input sequence. The output sequence can answer, evaluate, confirm, or otherwise respond to input sequence. The output sequence can implement (or describe instructions for implementing) an instruction provided via an input sequence. [0221] The output sequence can be generated autoregressively. For instance, for some applications, an output of one or more prediction layers can be passed through one or more output layers (e.g., softmax layer) to obtain a probability distribution over an output vocabulary (e.g., a textual or symbolic vocabulary) conditioned on a set of input elements in a context window. 
In this manner, for instance, the output sequence can be autoregressively generated by sampling a likely next output element, adding that element to the context window, and re-generating the probability distribution based on the updated context window, and sampling a likely next output element, and so forth. [0222] The output sequence can also be generated non-autoregressively. For instance, multiple output elements of the output sequence can be predicted together without explicit sequential conditioning on each other. See, e.g., Saharia et al., Non-Autoregressive Machine Translation with Latent Alignments, arXiv:2004.07437v3 (Nov.16, 2020). [0223] The output sequence can include one or multiple portions or elements. In an example content generation configuration, the output sequence can include multiple elements corresponding to multiple portions of a generated output sequence (e.g., a textual sentence, values of a discretized waveform, computer code, etc.). In an example classification configuration, the output sequence can include a single element associated with a classification output. For instance, an output “vocabulary” can include a set of classes into which an input sequence is to be classified. For instance, a vision transformer block can pass latent state information to a multilayer perceptron that outputs a likely class value associated with an input image. [0224] The output determination system 80 may process the one or more datasets and/or the output(s) of the sensor processing system 60 with a data augmentation block 92 to generate augmented data. For example, one or more images can be processed with the data augmentation block 92 to generate one or more augmented images. The data augmentation can include data correction, data cropping, the removal of one or more features, the addition of one or more features, a resolution adjustment, a lighting adjustment, a saturation adjustment, and/or other augmentation. 
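As a non-limiting illustration of the autoregressive generation described in paragraph [0221], the following minimal sketch samples a next output element from a probability distribution, adds the element to the context window, and repeats. The function `toy_prediction_layers` is a hypothetical stand-in for trained prediction layers, and the integer vocabulary, the `eos_id` stop token, and the `greedy` flag are assumptions introduced for the example.

```python
import math
import random
from typing import List, Optional


def softmax(logits: List[float]) -> List[float]:
    """Output layer: turn raw scores into a probability distribution."""
    peak = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def toy_prediction_layers(context: List[int], vocab_size: int) -> List[float]:
    """Hypothetical stand-in for trained layers: one logit per vocabulary entry.

    It simply favors the token that follows the last token in the context.
    """
    favored = (context[-1] + 1) % vocab_size
    return [2.0 if token == favored else 0.0 for token in range(vocab_size)]


def generate(prompt: List[int], vocab_size: int, max_new_tokens: int,
             eos_id: Optional[int] = None, greedy: bool = False,
             rng: Optional[random.Random] = None) -> List[int]:
    """Autoregressively extend a non-empty prompt and return the new elements."""
    rng = rng or random.Random(0)
    context = list(prompt)  # the context window: inputs plus generated outputs
    for _ in range(max_new_tokens):
        probs = softmax(toy_prediction_layers(context, vocab_size))
        if greedy:
            next_token = max(range(vocab_size), key=probs.__getitem__)
        else:
            next_token = rng.choices(range(vocab_size), weights=probs, k=1)[0]
        context.append(next_token)  # condition the next step on this element
        if next_token == eos_id:
            break
    return context[len(prompt):]
```

With `greedy=True` the sketch always picks the most probable element. With `greedy=False` it samples from the distribution, so repeated runs with different random generators may produce different output sequences.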
[0225] In some implementations, the one or more datasets and/or the output(s) of the sensor processing system 60 may be stored based on a data storage block 94 determination. [0226] The output(s) of the output determination system 80 can then be provided to a user via one or more output components of the user computing device 52. For example, one or more user interface elements associated with the one or more outputs can be provided for display via a visual display of the user computing device 52. [0227] The processes may be performed iteratively and/or continuously. One or more user inputs to the provided user interface elements may condition and/or affect successive processing loops. [0228] The systems and methods disclosed herein can include an autonomous information seeking visual question answering framework. The systems and methods can leverage a large language model (LLM) to dynamically strategize the utilization of external tools and to investigate their outputs, thereby acquiring knowledge that can be utilized to provide answers to the posed questions. Responding to visual questions that necessitate external knowledge, such as “What event is commemorated by the building depicted in this image?”, can be a complex task. The task can present a combinatorial search space that demands a sequence of actions, including invoking application programming interfaces (APIs), analyzing the API tool call responses, and making informed decisions. Additionally and/or alternatively, the systems and methods can include conducting a user study to collect a variety of instances of human decision-making when faced with one or more particular tasks. 
The user behavior data (e.g., the collected user decision-making instances) can then be used to design a system that includes three components: a language model planner block (e.g., an LLM-powered planner) that dynamically determines which tool to use next, a language model reasoner block (e.g., an LLM-powered reasoner) that analyzes and extracts key information from the tool outputs, and a working memory component that retains the acquired information throughout the process. The collected user behavior data can serve as a guide for the system in two key ways. First, the systems and methods may generate a transition graph by analyzing the sequence of decisions made by users. The graph can delineate distinct states and can confine the set of actions available at each state. Second, the systems and methods can use examples of user decision-making to provide the LLM-powered planner and reasoner with relevant contextual instances, enhancing their capacity to make informed decisions. [0229] Large language models (LLMs) (e.g., GPT3 (Brown et al., “Language models are few-shot learners,” ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS, 33:1877–1901, 2020.), LaMDA (Kulshreshtha et al., “Towards a human-like open-domain chatbot,” ARXIV PREPRINT, 2020.), PALM (Chowdhery et al., “Palm: Scaling language modeling with pathways,” ARXIV PREPRINT, 2022.), BLOOM (Scao et al., “BLOOM: A 176b-parameter open-access multilingual language model,” CORR, abs/2211.05100, 2022.), and LLaMA (Touvron et al., “Llama: Open and efficient foundation language models,” ARXIV PREPRINT, 2023.)) may have the capacity to memorize and utilize a significant amount of world knowledge. The language models can demonstrate emerging abilities like in-context learning, code generation, and common sense reasoning. Additionally and/or alternatively, language models can be adapted to handle multi-modal inputs and outputs involving both vision and language. 
In particular, visual language models (VLMs) can be utilized for image captioning, visual question answering, and open vocabulary recognition. [0230] While LLMs excel beyond human capabilities in tasks involving textual information retrieval, the existing VLMs can perform inadequately on datasets designed for visual information seeking. Many of the visual questions in the datasets can be designed in such a way that they pose a challenge even for humans, often requiring the assistance of various APIs and web search to obtain the answer. Examples of such questions can include “where is this church located?”, “what species of butterfly is this?”, or “what is the brand of this dress?”. [0231] Existing vision-language models (VLMs) can struggle to answer such questions for several reasons. Firstly, they may not be trained with objectives that encourage them to discern fine-grained categories and details within images. Secondly, they may utilize a relatively smaller language model compared to state-of-the-art Large Language Models (LLMs), which can constrain their reasoning capabilities. Lastly, they may not compare the query image against a substantial corpus of images associated with varying metadata, unlike systems that employ image search techniques. [0232] To overcome these challenges, the systems and methods disclosed herein may obtain state-of-the-art results on visual information seeking tasks by integrating LLMs with three types of tools: (i) computer vision tools such as object detection, OCR, image captioning models, and VQA models, which aid in extracting visual information from the image, (ii) a web search tool that assists in retrieving open world knowledge and facts, and (iii) an image search tool that enables us to glean relevant information from metadata associated with visually similar images. The systems and methods may include an LLM agent that uses the tools via tree-search decision making. 
The systems and methods may utilize an LLM-powered planner to dynamically determine which tool to use at each step and what query to send to it. Additionally and/or alternatively, the systems and methods may employ an LLM-powered reasoner that scrutinizes the output returned by the tools and extracts the crucial information from them. To retain the information throughout the process, the systems and methods can use a working memory component. Figure 4 can convey an example information seeking process performed by an example implementation. [0233] In some implementations, the systems and methods can employ a two-stage strategy, namely plan and execute. Initially, the LLM can break down a question into a plan, typically represented as a structured program or a sequence of instructions. Following the breakdown, the necessary APIs can be activated to collect the required information. However, a two-stage strategy of this kind may struggle in more complex real-world situations. In such cases, a comprehensive plan may not be inferable from the initial question alone. Instead, the plan may require dynamic modification based on real-time feedback. [0234] To address the difficulty, the systems and methods disclosed herein can include a dynamic decision-making capability. Answering visual information seeking questions can be a highly complex task, requiring the planner to take multiple steps. At each of the steps, the planner may determine which API to call and what query to send. The systems and methods disclosed herein can opt for a dynamic approach as the API call outputs may be unpredictable. In particular, the systems and methods may determine decisions at each step based on the information acquired from previous API calls, enhancing the adaptability and effectiveness of the method. 
[0235] In some implementations, a user study can be performed to gather a wide range of instances of human decision-making when using APIs to answer questions related to visual information seeking. From the user data, a structured framework can be utilized to direct the Large Language Model (LLM) to use the examples for making informed decisions regarding API selection and query formulation. The collected user behavior can inform the system in two ways. First, by analyzing the sequence of user decisions, the systems and methods can construct a transition graph. The graph may delineate distinct states and may constrain the set of actions available at each state. Second, the systems and methods can use the examples of user decision-making to guide the planner and reasoner with pertinent contextual instances. The contextual examples can contribute to improving the performance and effectiveness of the system. [0236] In some implementations, the systems and methods can include a visual question answering framework that leverages a large language model (LLM) to dynamically strategize the utilization of external tools and to investigate their outputs, thereby acquiring the necessary knowledge needed to provide answers to the posed questions. Additionally and/or alternatively, the systems and methods can leverage the human decision-making data collected from a user study to develop a structured framework. The framework can guide the Large Language Model (LLM) to utilize examples of human decision-making in making informed choices concerning API selection and query construction. [0237] In some implementations, the systems and methods can employ a dynamic decision-making strategy designed to respond to visual information seeking queries. For example, the systems and methods can include three primary components. 
First, the system can include a planner 𝒫, whose responsibility is to determine the subsequent action, including the appropriate API call and the query to be processed. Second, the system can have a working memory ℳ that retains information about the results obtained from API executions. Lastly, the system can have a reasoner ℛ, whose role is to process the outputs from the API calls. The reasoner can determine whether the obtained information is sufficient to produce the final response, or if additional data retrieval is determined to be needed. [0238] Considering the potential intricacy of the task, a user study can be conducted to gather a broad range of examples of human decision-making process, when using tools to respond to visual information seeking queries. A structured framework for decision-making can then be utilized based on the user behavior data. The systems and methods can utilize the data collected from this study to construct a transition graph 𝒢 shown in Figure 5, which outlines all the possible actions at each given state. Additionally and/or alternatively, the systems and methods can employ real-life decision-making examples ℰ (i.e., users choose which tool at different states) to guide the planner in choosing the appropriate action at each stage of the process. [0239] The Algorithm 1 below can present the operations of the planner 𝒫. The planner can undertake a series of steps each time a decision is required regarding which tool to employ and what query to send to the tool. Firstly, based on the present state, the planner can provide a range of potential subsequent actions A_s. The potential action space A_s may be large, making the search space intractable. To address this issue, the planner can refer to the human decisions from the transition graph 𝒢 to eliminate irrelevant actions. The planner may exclude the actions that have already been taken before and are stored in the working memory ℳ. 
The procedure can include A_s ← φ(state, 𝒢, ℳ).
[0240] Algorithm 1 Planner 𝒫(state, 𝒢, ℰ, ℳ):
1: A_s ← φ(state, 𝒢, ℳ) ▷ Get the list of feasible actions A_s given the current state from a transition graph and the information in the working memory.
2: ℰ_s ← θ(ℰ, A_s) ▷ Get a list of in-context examples related to actions A_s.
3: p_s ← ψ(ℰ_s, ℳ) ▷ Build a prompt based on the in-context examples ℰ_s and the current working memory ℳ.
4: t_s, q_s ← LLM(p_s) ▷ Decide the next tool t_s to use and the query q_s to pass by feeding the prompt p_s to LLM.
[0241] Next, the systems and methods can collect a set of relevant in-context examples ℰ_s that are assembled from the decisions previously made by humans during the user study relevant to actions A_s, that is ℰ_s ← θ(ℰ, A_s). With the gathered in-context examples ℰ_s and the working memory ℳ that holds data collected from past tool interactions, the planner can formulate a prompt, denoted by p_s ← ψ(ℰ_s, ℳ). The prompt p_s can then be sent to the LLM which returns a structured answer, determining the next tool t_s to be activated and the query q_s to be dispatched to the data processing tool. The action can be denoted by t_s, q_s ← LLM(p_s). The design can allow the planner to be invoked multiple times throughout the process, thereby facilitating dynamic decision-making that gradually leads to answering the input query. [0242] The Algorithm 2 can be descriptive of an overall decision-making workflow of automatic visual information seeking. The entire process can repeat until a satisfactory answer is produced. Initially, the working memory may be populated with the input visual question, and the initial state can be set to START. At each iteration, the system can first invoke the planner 𝒫 to determine the next tool and the query to employ, as outlined in Algorithm 1. Subsequently, the selected external tool can execute and can deliver the output o_s. The output from the tools can be quite diverse, ranging from a list of identified objects, to a collection of similar images with their captions, to snippets of search results or knowledge graph entities.
[0243] Algorithm 2 AVIS Decision Making Workflow:
1: ℳ ← {input}, state ← START
2: t_s, q_s ← 𝒫(state, 𝒢, ℰ, ℳ) ▷ Call the planner 𝒫 to decide the next tool to use t_s and the query to pass to it q_s.
3: o_s ← Exec(t_s, q_s) ▷ Call tool t_s with query q_s and get output o_s.
4: ô_s ← ℛ(o_s, ℳ) ▷ Process the output and extract the key info ô_s using the reasoner ℛ.
5: ℳ.add(ô_s) ▷ Update the working memory.
6: switch ô_s do
7: case ô_s is not informative
8: goto(2) ▷ Go to line 2 to make decision at the same state, excluding t_s.
9: case ô_s has useful information
10: state ← t_s ▷ Update state.
11: goto(2) ▷ Go to line 2 to make decision for the next state.
12: case ô_s is ready as a final answer
13: ans ← ô_s ▷ Output answer.
[0244] Therefore, the system can employ a reasoner ℛ to analyze the output o_s, can extract the useful information, and can decide into which category the tool output falls: informative, uninformative, or final answer. The method can utilize the LLM with appropriate prompting and in-context examples to perform the reasoning. If the reasoner concludes that the model is ready to provide an answer, the model can generate and output the final response, thus concluding the task. If the model determines that the tool output is uninformative, the machine-learned model may revert back to the planner to select another action based on the current state. If the language model finds the tool output to be useful, the machine-learned model may modify the state and transfer control back to the planner to make a new decision at the new state. [0245] To illustrate with a tangible example, the systems and methods can refer to the output that the model would receive as depicted in Figure 6(c). There can be several entities within the answer. The role of the reasoner may be twofold: to determine which entity is pertinent for responding to the question and to assess whether the model has obtained the necessary information to transition to the next state. [0246] In some implementations, the approach, which employs dynamic decision-making coupled with backtracking, can differ from previous methods that follow a plan-then-execute paradigm. The system can be structured to make decisions grounded to the results of current executions and to conduct iterative searches for tool combinations. The process may eventually yield an effective strategy to accomplish the task. 
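The planner of Algorithm 1 and the workflow of Algorithm 2 can be illustrated with a minimal Python sketch. Everything named here is an assumption made for illustration and is not part of the disclosure: the graph contents, the `Memory` class, the `(state, tool)` bookkeeping used to exclude tried actions, and the stub LLM, reasoner, and tools. For brevity the sketch stops when no feasible action remains, rather than backtracking to an earlier state.

```python
from dataclasses import dataclass, field

# Hypothetical transition graph G: state -> tools that may be called next.
GRAPH = {
    "START": ["vqa", "object_detect"],
    "object_detect": ["image_search"],
}

@dataclass
class Memory:
    """Working memory M: the question, reasoner notes, and actions already tried."""
    question: str
    notes: list = field(default_factory=list)
    tried: set = field(default_factory=set)

def feasible_actions(state, graph, memory):
    """phi(state, G, M): actions allowed at this state that were not tried yet."""
    return [t for t in graph.get(state, []) if (state, t) not in memory.tried]

def planner(state, graph, examples, memory, llm):
    """Algorithm 1: pick the next tool t_s and query q_s, or None if stuck."""
    actions = feasible_actions(state, graph, memory)
    if not actions:
        return None
    # theta(E, A_s): keep only in-context examples about feasible actions.
    shots = [e for e in examples if e["tool"] in actions]
    # psi(E_s, M): assemble the prompt from examples and working memory.
    prompt = {"question": memory.question, "actions": actions,
              "examples": shots, "notes": list(memory.notes)}
    return llm(prompt)

def run_avis(question, graph, examples, tools, llm, reasoner, max_steps=10):
    """Algorithm 2: loop planner -> tool -> reasoner until a final answer."""
    memory, state = Memory(question), "START"
    for _ in range(max_steps):
        decision = planner(state, graph, examples, memory, llm)
        if decision is None:
            return None  # simplification: no backtracking to earlier states
        tool, query = decision
        memory.tried.add((state, tool))
        verdict, info = reasoner(tools[tool](query), memory)
        memory.notes.append((tool, verdict, info))
        if verdict == "final":
            return info
        if verdict == "informative":
            state = tool  # move to the state named after the tool just used
        # "uninformative": stay in the same state; the tool is now excluded
    return None

# Toy stand-ins so the sketch runs end to end (not real models or APIs).
def stub_llm(prompt):
    return prompt["actions"][0], prompt["question"]

def stub_reasoner(output, memory):
    if not output:
        return "uninformative", ""
    if output.startswith("ANSWER: "):
        return "final", output[len("ANSWER: "):]
    return "informative", output

TOOLS = {
    "vqa": lambda q: "",
    "object_detect": lambda q: "false turkey-tail fungus",
    "image_search": lambda q: "ANSWER: Stereum",
}

answer = run_avis("What genus is this fungus?", GRAPH, [], TOOLS,
                  stub_llm, stub_reasoner)
print(answer)
```

In this toy run the first tool returns nothing, so the reasoner marks it uninformative and the planner is called again at the same state with that tool excluded; the second tool is informative and advances the state, and the third yields the final answer.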
[0247] To respond effectively to visual queries that necessitate in-depth information retrieval, the systems and methods can equip a system with a comprehensive suite of tools. In particular, the suite of tools may include an image captioning model, a visual question answering model, an object detection model, an image search engine, a web search engine, an optical character recognition model, a language model tuned (and/or conditioned) for short question-and-answer tasks, and/or other data processing tools. [0248] The image captioning model may employ a captioning model (e.g., the PALI 17B (Chern et al., “TPU-KNN: K nearest neighbor search at peak flop/s,” CORR, abs/2206.14286, 2022.)), which obtains state-of-the-art results for image captioning. The tool may have the capability to generate captions for either the entire image or for a cropped image corresponding to the bounding box of a detected object. [0249] The visual question answering model may utilize a VQA model (e.g., the PALI 17B), which may have been fine-tuned on a visual question-and-answer specific dataset (e.g., the VQA-v2 dataset (Goyal et al., “Making the V in VQA matter: Elevating the role of image understanding in visual question answering,” IN 2017 IEEE CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION, CVPR 2017, Honolulu, HI, USA, July 21-26, 2017, pages 6325–6334, IEEE COMPUTER SOCIETY, 2017.)). The tool can intake an image and a question as inputs and can provide a text-based answer as the output. [0250] The object detection model may use an object detector trained on a super-set of Open Images dataset (Kuznetsova et al., “The open images dataset v4: Unified image classification, object detection, and visual relationship detection at scale,” IJCV, 2020.) categories that may be provided by a visual search application programming interface (e.g., the Google Lens API (“Google lens,” Web interface available at https://images.google.com.)). 
The systems and methods may use a high confidence threshold to only keep the top-ranked detected boxes for the input image. [0251] The image search engine may utilize a reverse image search system (e.g., Google Image Search) to obtain a broad range of information related to the image crop of a detected box (e.g., as provided in Google Lens API). The information may encompass various details, such as knowledge graph entities, titles of associated products, and captions of analogous or identical images. The availability of these details can vary based on the image crop input provided to the reverse image search system (e.g., Google Image Search). The planner may consider the utilization of each piece of information as a separate action. The consideration may be due to the fact that each information set may contain hundreds of tokens that necessitate complex processing and reasoning. [0252] The optical character recognition model may process images with text to determine the depicted text. For example, in some implementations, images may include textual content such as street names or logos. To detect and utilize this text, the systems and methods may leverage the Optical Character Recognition (OCR) feature available in an image processing application (e.g., the Google Lens API). [0253] The web search engine may enable the system to acquire up-to-date world knowledge and retrieve relevant documents on any topic of interest. For example, the systems and methods may employ a web search platform (e.g., the Google Web Search API). The web search engine may process a text-based query as input and may produce the following outputs: (i) related document links and snippets, (ii) in certain instances, a knowledge panel providing a direct answer to the query, and (iii) a plurality of questions (e.g., five questions) that are related to the input query. 
If a knowledge panel is available, the system may parse the knowledge panel into a sentence or a few sentences that summarize the knowledge panel information. [0254] In some implementations, the systems and methods may incorporate a large language model (LLM) powered question-answering component as another tool. The tool may process a query in text form and may generate an answer also in text form. The use of the LLM here as a question-answering tool may be distinct from the language model’s role in the planner or reasoner as outlined in Alg.1 and Alg.2. [0255] Many of the visual questions in existing datasets may ask for fine-grained answers, which poses a challenge even for humans, often requiring the assistance of various APIs and web searches for answers. In order to gather insights into the human decision-making process, the systems and methods may include performing a user study. For example, a user study may be performed to understand how humans utilize external tools to answer visual queries that involve seeking information. [0256] The user may be provided with an identical set of tools as the tools accessible to the machine-learned model via an API. The users may be presented with the input image and question, along with image crops for each detected object. Additionally and/or alternatively, tools such as an image captioning model, an object detection model, a visual question answering model, a web search engine, and/or an image search engine may be made available to the user. Based on the information obtained through image search for each cropped image, the user may be offered one or multiple buttons associated with each box. The user interface elements (e.g., the buttons) may provide the user with the ability to access diverse information pertaining to the image crop of the box. 
The diverse information may include details such as corresponding knowledge graph entities, captions of similar images, titles of related products, and captions of identical images. An example set of tools and APIs is depicted in Figure 6(b). [0257] When the user initiates an action, such as clicking on a button or submitting a query to web search, the image captioning model, the object detection model, the image search engine, and/or the visual question answering model, the corresponding tool may be invoked, and the resulting output may be displayed to the user. The system may record the sequence of actions taken by the user and the outputs that they receive at each step. For instance, in Figure 6, an example of how a user performs four actions to answer the question is displayed: i) display entities in box 2, ii) show the caption of similar images to box 2, iii) conduct a search for “In what year was Harley-Davidson XA built?”, and iv) utilize PALM using the combination of the search output and the question “In what year was Harley-Davidson XA built?”. When the user is prepared to proceed to the next question, the user may click on either of the two buttons: “Success! Found the Answer!” or “Couldn’t Find the Answer.” Subsequently, a new visual question may be presented to them. [0258] The collected user behavior may serve as a guide for the system in two key ways. Firstly, the system may construct a transition graph by analyzing the sequence of decisions made by users. The graph can define distinct states and may restrict the available set of actions at each state. For example, at the START state, the system may take one of these three actions: image captioning, visual question answering, or object detection. Figure 5 can illustrate the transition graph that has been constructed based on the decision-making process of the users. 
Secondly, the systems and methods may utilize the examples of user decision-making to guide the planner and reasoner with relevant contextual instances. The in-context examples can aid in enhancing the performance and effectiveness of the system. [0259] The user study may involve ten or more participants who collectively answered a total of 644 visual questions. During the study, the system may present users with visual questions that were randomly selected from one or more datasets. The approach can allow the system to provide the participants with a varied and diverse set of visual questions to assess and respond to. [0260] The systems and methods disclosed herein can include an approach that equips a large language model (LLM) with the ability to use a variety of tools for answering knowledge-intensive visual questions. The methodology, anchored in human decision-making data collected from a user study, can employ a structured framework that uses an LLM-powered planner to dynamically decide on tool selection and query formation. An LLM-powered reasoner can be tasked with processing and extracting key information from the output of the selected tool. The systems and methods may iteratively employ the planner and reasoner to leverage different tools until all necessary information required to answer the visual question is amassed. 
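The construction of a transition graph from recorded user decisions, as described in paragraph [0258], can be sketched as follows. This is a minimal illustration under stated assumptions: each study session is assumed to be an ordered list of action names, each state is assumed to be named after the last action taken (mirroring the state update in Algorithm 2), and the session contents and the function name `build_transition_graph` are hypothetical.

```python
from collections import defaultdict

def build_transition_graph(sessions):
    """Map each state to the set of actions users were observed to take from it."""
    graph = defaultdict(set)
    for actions in sessions:
        state = "START"
        for action in actions:
            graph[state].add(action)  # this action is allowed at this state
            state = action            # the next state is named after the action
    # Sort for a stable, readable result.
    return {state: sorted(nexts) for state, nexts in graph.items()}

# Hypothetical recorded sessions (one list of actions per answered question).
sessions = [
    ["object_detection", "image_search", "web_search", "llm_qa"],
    ["image_captioning", "web_search"],
    ["vqa"],
]

graph = build_transition_graph(sessions)
print(graph["START"])
```

Because only transitions that users actually made appear in the graph, looking up the current state yields a small candidate action list for the planner instead of the full action space.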
[0261] Experiments can be leveraged to evaluate AVIS on two visual question answering datasets: i) OK-VQA (Marino et al., “OK-VQA: A visual question answering benchmark requiring external knowledge,” In IEEE Conference on Computer Vision and Pattern Recognition, CVPR, pages 3195–3204 (2019).), which may rely on common-sense knowledge not observed in the given image; and ii) Infoseek wikidata (Chen et al., “Can Pre-trained Vision and Language Models Answer Visual Information-Seeking Questions?,” arXiv (Feb.23, 2023), https://arxiv.org/abs/2302.11713.), which further necessitates more fine-grained information that cannot be covered by common-sense knowledge. [0262] The experiments can follow the decision-making workflow in Alg.2 to implement AVIS to solve visual questions. For the Planner, the system may write the basic instructions for describing each tool, and keep a pool of real user behavior examples for each tool selection, which may be collected in the user study. At each step s, the system may prepare the prompt based on the feasible action list A_s. For the Reasoner, the system may write a prompt for each API that returns a long list of results, including Object Detection, Product Detection, Web Image Search and Web Text Search, that guides the reasoner to extract the relevant information. The system may design the reasoner in a way such that “uninformative” answers can be detected. In order to support this, the system may manually prepare several bad examples that do not provide any useful information and pass them to the reasoner as a part of the prompt. [0263] The system may use the frozen PALM 540B language model (Chowdhery et al. “PaLM: Scaling Language Modeling with Pathways,” arXiv (Apr.5, 2022), https://arxiv.org/abs/2204.02311.) for both the planner and the reasoner, with deterministic generation ensured by setting the temperature parameter to zero. 
The experiment may use 10 examples as in-context prompts for each dataset, and report the VQA accuracy (Goyal et al., “Making the V in VQA matter: Elevating the role of image understanding in visual question answering,” In 2017 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2017, pages 6325–6334 (July 2017).) as the evaluation metric. [0264] The experiments can utilize one or more baselines. AVIS can have the ability to dynamically determine the relevant tools according to different states. To show that the design choice is useful, the experiments may add a number of baselines that do not include an LLM planner for dynamic decision making. Instead, they may follow a pre-determined sequence to call a list of tools. The experiments may propose the following baselines: baseline-PALM w/ PALI (which integrates the captions generated by PALI and the visual answers from PALI VQA; PALI denotes the combination of both the VQA and captioning tools), baseline-PALM w/ (PALI + Object) (which in addition calls the object detection tool, and then integrates all object data, including products and text detected by OCR), and baseline-PALM w/ (PALI + Object + Search) (a model which first selects a relevant object with the help of PALM, then sequentially executes the image search and Google search with the object name). The experiments may then call PALM again to answer the question. For each of the three baselines, the system may prepare a few-shot Chain-Of-Thought (COT) prompt (Wei et al., “Chain-of-Thought Prompting Elicits Reasoning in Large Language Models,” arXiv (Jan.28, 2022), https://arxiv.org/abs/2201.11903.), in which the COT prompt guides the model to explain why predictions are made based on the provided information. The baselines can utilize a set of tools in a fixed order, without the capacity for dynamic decision making. [0265] The experiments may evaluate the usefulness of each tool group (i.e., PALI, Object, and Search) through an ablation study. 
The experiments can involve removing each tool group from the framework individually and assessing the impact on performance.
Model | Unseen Entity | Unseen Question
PALM (Q-only, few-shot) | 3.7 | 5.1
OFA (fine-tune) | 9.7 | 14.8
PALI (VQA, zero-shot) | 1.8 | 2.2
PALI (fine-tune) | 16.0 | 20.7
PALM w/ CLIP (few-shot + external knowledge) | 21.9 | 18.6
FiD w/ CLIP (fine-tune + external knowledge) | 20.7 | 18.1
(baselines without dynamic decision making, sequentially execute the tools)
baseline-PALM w/ (PALI∗, few-shot) | 12.8 | 14.9
baseline-PALM w/ (PALI∗ + Object, few-shot) | 31.3 | 36.1
baseline-PALM w/ (PALI∗ + Object + Search, few-shot) | 36.1 | 38.2
AVIS (ours, few-shot) | 50.7 | 56.4
w/o PALI∗ | 47.9 | 54.2
w/o Object | 41.2 | 48.4
w/o Search | 42.5 | 49.6
[0266] Table 1 can include visual question answering results (accuracy) on Infoseek Wikidata. The first four rows can be results from the Infoseek paper that do not use external knowledge, and the next two can be results from the same paper that use CLIP as a knowledge source. The tool PALI∗ denotes the frozen multi-task PALI-17B model for both visual question answering and image captioning. Object means object detection, and Search means image and text search. [0267] Table 1 can present the results of AVIS and other baselines on the Infoseek wikidata dataset. Infoseek wikidata can be a challenging dataset that requires identifying highly specific entities. Even robust visual-language models, such as OFA (Lu et al., “Unified-io: A unified model for vision, language, and multi-modal tasks,” CoRR, abs/2206.08916, (2022).) and PALI (Chen et al. “PaLI: A Jointly-Scaled Multilingual Language-Image Model,” arXiv (Sep.25, 2022), https://arxiv.org/abs/2209.06794.), can fail to yield high accuracy when fine-tuned on this dataset. However, AVIS, without fine-tuning and by leveraging a complete set of tools guided by 10 in-context examples, can achieve accuracies of 50.7 and 56.4 on the unseen entity and unseen question splits, respectively. 
The AVIS system can significantly outperform the fine-tuned results of PALI-17B, which are 16.0 and 20.7, as well as the PALM model augmented with CLIP knowledge, which are 21.9 and 18.6, respectively. [0268] Table 1 can illustrate that improvements may not be solely due to the additional information provided by the external tools, but due to the dynamic decision-making pipeline. The experiments can compare the results of AVIS with the three baselines that conduct sequential execution. While these baselines do improve the performance, our AVIS framework outperforms the best baseline model by up to 17.3 accuracy. Note that AVIS and the baselines can use exactly the same set of tools. The considerable performance gap can convey the clear advantage of dynamic decision-making design. Furthermore, the system may show the importance of each tool in the last block of Table 1. Removal of any of the tools can degrade the overall accuracy. Among the three tool groups, Object and Search can be more important than PALI, as they provide more fine-grained information crucial for the Infoseek dataset.
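The ablation comparison in the last block of Table 1 can be made concrete with a short worked calculation. The sketch below only subtracts the accuracies reported in Table 1; the variable names are illustrative and the rounding to one decimal place is an assumption made to avoid floating-point noise.

```python
# Accuracies from Table 1 as (unseen entity, unseen question).
FULL = (50.7, 56.4)
ABLATIONS = {
    "w/o PALI": (47.9, 54.2),
    "w/o Object": (41.2, 48.4),
    "w/o Search": (42.5, 49.6),
}

# Accuracy lost on each split when a tool group is removed.
drops = {
    name: tuple(round(full - ablated, 1) for full, ablated in zip(FULL, scores))
    for name, scores in ABLATIONS.items()
}

for name, (entity_drop, question_drop) in drops.items():
    print(f"{name}: -{entity_drop} (unseen entity), -{question_drop} (unseen question)")
```

Removing Object or Search costs roughly 7 to 10 points on each split, while removing PALI costs under 3, which is consistent with the statement that Object and Search matter more than PALI on Infoseek.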
Figure imgf000061_0001
Figure imgf000062_0001
[0269] Table 2 can include visual question answering results (accuracy) on OK-VQA. The tool PALI denotes the frozen multi-task PALI-17B model for both visual question answering and image captioning. Object can denote object detection, and search can denote image and text search. [0270] The OK-VQA experiments are depicted in Table 2. AVIS with few-shot in-context examples can achieve an accuracy of 60.2, higher than most of the existing methods tailored for the dataset, including KAT (Gui et al., “KAT: A Knowledge Augmented Transformer for Vision-and-Language,” arXiv (Dec.16, 2021), https://arxiv.org/abs/2112.08614.), ReVIVE (Lin et al., “REVIVE: Regional Visual Representation Matters in Knowledge-Based Visual Question Answering,” arXiv (Jun.2, 2022), https://arxiv.org/abs/2206.01201.), and REVEAL (Hu et al., “REVEAL: Retrieval-Augmented Visual-Language Pre-Training with Multi-Source Multimodal Knowledge Memory,” arXiv (Dec.10, 2022), https://arxiv.org/abs/2212.05221.). AVIS can achieve lower but comparable performance relative to the PALI model fine-tuned on OK-VQA. This difference, compared to Infoseek, may be attributed to the fact that most QA examples in OK-VQA rely more on commonsense knowledge than on fine-grained knowledge. Therefore, it may be feasible to encode such generic knowledge in the model parameters, so that less external knowledge is required. The PALI zero-shot VQA model itself can achieve 41.6 accuracy, which can be significantly higher than on Infoseek and which supports this hypothesis. Table 2 can show that object detection is less crucial as a tool on this dataset, compared to PALI captioning and VQA. [0271] One of the features of AVIS can be the ability to dynamically make decisions instead of executing a fixed sequence. Figure 5 can present three examples of the dynamic planning and reasoning process of AVIS. They can demonstrate the flexibility of AVIS to use different tools at various stages. 
The reasoner design can enable AVIS to identify irrelevant information, backtrack to a previous state, and repeat the search. For instance, in the second example concerning the taxonomy of fungi, AVIS may make an incorrect decision by selecting a leaf object. However, the reasoner can identify that this is not relevant to the question, prompting AVIS to plan again. This time, the system may successfully select the object related to false turkey-tail fungi, leading to the correct answer, Stereum. [0272] The training datasets can include visual question and answer datasets. The training examples can include an image, a question about the image, and an answer to the question. [0273] The systems and methods can decompose questions into a visual sub-question and a knowledge sub-question. [0274] The systems and methods may be designed and/or utilized for visual question answering and/or other reasoning tasks. [0275] The technology discussed herein makes reference to servers, databases, software applications, and other computer-based systems, as well as actions taken and information sent to and from such systems. The inherent flexibility of computer-based systems allows for a great variety of possible configurations, combinations, and divisions of tasks and functionality between and among components. For instance, processes discussed herein can be implemented using a single device or component or multiple devices or components working in combination. Databases and applications can be implemented on a single system or distributed across multiple systems. Distributed components can operate sequentially or in parallel. [0276] While the present subject matter has been described in detail with respect to various specific example embodiments thereof, each example is provided by way of explanation, not limitation of the disclosure. 
Those skilled in the art, upon attaining an understanding of the foregoing, can readily produce alterations to, variations of, and equivalents to such embodiments. Accordingly, the subject disclosure does not preclude inclusion of such modifications, variations and/or additions to the present subject matter as would be readily apparent to one of ordinary skill in the art. For instance, features illustrated or described as part of one embodiment can be used with another embodiment to yield a still further embodiment. Thus, it is intended that the present disclosure cover such alterations, variations, and equivalents.
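By way of non-limiting illustration only, the dynamic planning, reasoning, and backtracking behavior described above with respect to the fungi example of Figure 5 can be sketched in Python as follows. The sketch is an assumption-laden toy and is not the disclosed implementation: the names TRANSITIONS, TOOLS, DETECTED_OBJECTS, WorkingMemory, toy_planner, toy_reasoner, and run_sketch are hypothetical, the planner and reasoner are rule-based stand-ins for language model blocks, and the tools return hard-coded strings rather than calling any model or search service.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional, Set, Tuple

# Hypothetical transition graph: the actions the planner may choose in each state.
TRANSITIONS: Dict[str, List[str]] = {
    "start": ["select_object"],
    "select_object": ["image_search"],
    "image_search": ["answer"],
}

# Toy stand-ins for external tools; a real system would call models or search services.
DETECTED_OBJECTS: List[str] = ["leaf", "fungus"]
TOOLS: Dict[str, Callable[[str], str]] = {
    "select_object": lambda name: name,
    "image_search": lambda crop: {"fungus": "false turkey-tail (genus Stereum)"}.get(crop, ""),
    "answer": lambda fact: "Stereum" if "Stereum" in fact else "",
}


@dataclass
class WorkingMemory:
    """Holds the question, retained facts, rejected choices, and a step trace."""
    question: str
    facts: List[str] = field(default_factory=list)
    rejected: Set[Tuple[str, str]] = field(default_factory=set)
    trace: List[str] = field(default_factory=list)


def toy_planner(state: str, memory: WorkingMemory) -> Tuple[str, str]:
    """Pick the next action allowed by the graph, skipping rejected objects."""
    action = TRANSITIONS[state][0]
    if action == "select_object":
        candidates = [o for o in DETECTED_OBJECTS if (action, o) not in memory.rejected]
        return action, candidates[0]
    # Other actions consume the most recently retained fact.
    return action, memory.facts[-1]


def toy_reasoner(action: str, output: str) -> str:
    """Label a tool output as 'informative', 'uninformative', or 'final'."""
    if action == "select_object":
        return "informative" if output == "fungus" else "uninformative"
    if not output:
        return "uninformative"
    return "final" if action == "answer" else "informative"


def run_sketch(question: str, max_steps: int = 10) -> Tuple[Optional[str], WorkingMemory]:
    memory = WorkingMemory(question)
    state = "start"
    for _ in range(max_steps):
        action, argument = toy_planner(state, memory)
        output = TOOLS[action](argument)
        verdict = toy_reasoner(action, output)
        memory.trace.append(f"{action}({argument}) -> {verdict}")
        if verdict == "uninformative":
            # Backtrack: remain in the same state and plan again without this choice.
            memory.rejected.add((action, argument))
            continue
        memory.facts.append(output)
        if verdict == "final":
            return output, memory
        state = action
    return None, memory


if __name__ == "__main__":
    # The question wording is a hypothetical example.
    answer, memory = run_sketch("What genus does this fungus belong to?")
    print(answer)
    for step in memory.trace:
        print(step)
```

In this sketch, the leaf object is selected first, the reasoner stand-in labels it uninformative, and the loop plans again from the same state before selecting the fungus object, mirroring the backtracking described above. Keeping rejected choices in the working memory is one possible way to prevent the planner from repeating a choice that the reasoner has already discarded.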

Claims

WHAT IS CLAIMED IS:
1. A computing system for visual information seeking, the system comprising: one or more processors; and one or more non-transitory computer-readable media that collectively store instructions that, when executed by the one or more processors, cause the computing system to perform operations, the operations comprising: obtaining input data, wherein the input data comprises image data and text data, wherein the text data comprises a query associated with the image data; processing the input data with a machine-learned model to generate first planning data, wherein the first planning data is descriptive of instructions to provide the input data to a first data processing tool; transmitting, based on the first planning data, the input data to the first data processing tool to retrieve first output data; processing the input data and the first output data with the machine-learned model to generate second planning data, wherein the second planning data is descriptive of instructions to transmit data to a second data processing tool; transmitting, based on the second planning data, data to the second data processing tool to retrieve second output data; and processing the input data and the second output data with the machine-learned model to generate response data, wherein the response data is descriptive of a response to the query.
2. The system of claim 1, wherein the first data processing tool comprises an object detection model.
3. The system of claim 2, wherein the first output data comprises one or more bounding boxes associated with one or more objects in the image data.
4. The system of claim 2, wherein the first output data comprises one or more segmented portions of one or more images of the image data and caption data associated with the one or more segmented portions.
5. The system of claim 4, wherein the caption data is descriptive of an object classification associated with one or more objects detected in the one or more segmented portions of one or more images.
6. The system of claim 1, wherein the second data processing tool comprises a search engine.
7. The system of claim 6, wherein the second planning data comprises a model-generated query, and wherein the model-generated query is transmitted to the second data processing tool to retrieve the second output data.
8. The system of claim 7, wherein the model-generated query is generated based on the input data and the first output data, wherein the model-generated query is descriptive of the query of the input data modified based on the first output data.
9. The system of claim 1, wherein the response data comprises a natural language text string that is responsive to the query of the input data.
10. The system of claim 1, wherein the operations further comprise: processing the input data and the second output data with the machine-learned model to generate third planning data, wherein the third planning data is descriptive of instructions to transmit data to a third data processing tool; transmitting, based on the third planning data, data to the third data processing tool to retrieve third output data; and processing the input data and the third output data with the machine-learned model to generate the response data.
11. A computer-implemented method for responding to visual prompts, the method comprising: obtaining, by a computing system comprising one or more processors, input data, wherein the input data comprises image data and text data, wherein the text data comprises a query associated with the image data; processing, by the computing system, the input data with a machine-learned model to generate first planning data, wherein the first planning data is descriptive of instructions to provide the image data to a first data processing tool; providing, by the computing system, the image data to the first data processing tool to receive first output data, wherein the first output data is generated with the first data processing tool based on the image data; processing, by the computing system, the first output data with the machine-learned model to generate second planning data, wherein the second planning data is descriptive of instructions to provide a particular portion of the image data to a second data processing tool; providing, by the computing system, the particular portion of the image data to the second data processing tool to receive second output data, wherein the second output data is generated with the second data processing tool based on the particular portion of the image data; and processing, by the computing system, the text data and the second output data with the machine-learned model to generate response data, wherein the response data is descriptive of a response to the query.
12. The method of claim 11, wherein the machine-learned model was conditioned on a training dataset comprising a plurality of training examples, wherein each training example of the plurality of training examples comprises a training input, a training output, and a training rationale, wherein the training rationale is descriptive of a sequence of processing instances and tool calls for determining the output data.
13. The method of claim 11, wherein the machine-learned model was conditioned on a training dataset comprising sequence data, wherein sequence data is descriptive of a sequence of actions for generating a training response in response to obtaining a particular type of input.
14. The method of claim 11, wherein the machine-learned model was conditioned on human input data descriptive of actions a human selected as being particular actions to perform when a particular input type is received.
15. The method of claim 14, wherein the human input data was obtained via a user interface that provides a plurality of selectable options for a user.
16. The method of claim 15, wherein the plurality of selectable options comprise a plurality of external tools to call and a final output option.
17. One or more non-transitory computer-readable media that collectively store instructions that, when executed by one or more computing devices, cause the one or more computing devices to perform operations, the operations comprising: obtaining input data, wherein the input data comprises image data and text data, wherein the text data comprises a query associated with the image data; processing the input data with a machine-learned model to generate first planning data, wherein the first planning data is descriptive of instructions to provide the image data to a first data processing tool; providing the image data to the first data processing tool; receiving first output data from the first data processing tool based on the image data; processing the first output data with the machine-learned model to generate second planning data, wherein the second planning data is descriptive of instructions to provide a particular portion of the image data to a second data processing tool; providing the particular portion of the image data to the second data processing tool; receiving second output data from the second data processing tool based on the particular portion of the image data; and processing the text data and the second output data with the machine-learned model to generate response data, wherein the response data is descriptive of a response to the query.
18. The one or more non-transitory computer-readable media of claim 17, wherein the query is descriptive of a question associated with a particular object depicted in the image data.
19. The one or more non-transitory computer-readable media of claim 18, wherein the first data processing tool: detects a plurality of objects depicted in the image data; generates a plurality of bounding boxes associated with the plurality of objects; generates a plurality of image segments based on the plurality of bounding boxes; classifies each of the plurality of objects in the plurality of image segments to generate a plurality of object classifications; and generates the first output data, wherein the first output data comprises the plurality of image segments and the plurality of object classifications.
20. The one or more non-transitory computer-readable media of claim 19, wherein the second data processing tool comprises an image search engine, wherein the second data processing tool: processes one or more of the plurality of image segments with the image search engine to determine one or more web resources associated with the one or more of the plurality of image segments, wherein the one or more of the plurality of image segments are selected by the machine-learned model based on the input data and the plurality of object classifications; and generates the second output data based on the one or more web resources.
21. A computing system, the system comprising: one or more processors; and one or more non-transitory computer-readable media that collectively store instructions that, when executed by the one or more processors, cause the computing system to perform operations, the operations comprising: obtaining input data, wherein the input data comprises image data and text data, wherein the text data comprises a query associated with the image data; processing the input data with an artificial intelligence system to generate response data, wherein the response data is descriptive of a response to the query, wherein the artificial intelligence system comprises: a language model planner block, wherein the language model planner block was conditioned to determine one or more data processing tools to utilize for processing data associated with responding to the query; a language model reasoner block, wherein the language model reasoner block was conditioned to process outputs of the one or more data processing tools to determine information associated with responding to the query; and a working memory, wherein the working memory stores acquired information obtained and generated with the artificial intelligence system; and providing the response data as an output.
22. The system of claim 21, wherein the artificial intelligence system comprises one or more machine-learned models conditioned on user behavior data.
23. The system of claim 22, wherein the user behavior data is processed to generate a transition graph that is descriptive of a determined sequence of decisions made by users when performing a particular information seeking task.
24. The system of claim 23, wherein the transition graph is descriptive of a plurality of distinct states and indicates a particular set of actions available at each state of the plurality of distinct states.
25. The system of claim 22, wherein the user behavior data was utilized to condition the language model planner block and the language model reasoner block for particular context-based processing.
26. The system of claim 21, wherein the one or more data processing tools comprise at least one of a computer vision tool, a web search tool, or an image search tool.
27. The system of claim 26, wherein the computer vision tool comprises at least one of an object detection model, an optical character recognition model, an image captioning model, or a visual question-and-answer model.
28. The system of claim 26, wherein the web search tool retrieves open world knowledge from web resources.
29. The system of claim 26, wherein the image search tool identifies relevant information from metadata associated with visually similar images.
30. The system of claim 21, wherein the language model planner block was conditioned to generate a processing tool query to provide to the one or more data processing tools based on the input data.

Priority Applications (1)

Application Number Publication Number Priority Date Filing Date Title
EP24736245.2A EP4705905A1 (en) 2023-06-08 2024-06-04 Autonomous visual information seeking with machine-learned language models

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US202363506924P 2023-06-08 2023-06-08
US63/506,924 2023-06-08

Publications (1)

Publication Number Publication Date
WO2024254051A1 (en) 2024-12-12

Family

ID=91664626

Family Applications (1)

Application Number Status Publication Number Priority Date Filing Date Title
PCT/US2024/032378 Ceased WO2024254051A1 (en) 2023-06-08 2024-06-04 Autonomous visual information seeking with machine-learned language models

Country Status (2)

Country Link
EP (1) EP4705905A1 (en)
WO (1) WO2024254051A1 (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20200019628A1 (en) * 2018-07-16 2020-01-16 Microsoft Technology Licensing, Llc Visual intent triggering for visual search
US20210224332A1 (en) * 2020-01-22 2021-07-22 Adobe Inc. Chart question answering
US20230027713A1 (en) * 2021-07-21 2023-01-26 International Business Machines Corporation Neural-Symbolic Action Transformers for Video Question Answering
US20230077508A1 (en) * 2021-09-16 2023-03-16 Fujitsu Limited Method of generating inference model and information processing apparatus

Non-Patent Citations (25)

* Cited by examiner, † Cited by third party
Title
AGOSTINELLI ET AL.: "MusicLM: Generating Music From Text", ARXIV:2301.11325V1, 26 January 2023 (2023-01-26)
BROWN ET AL.: "Language models are few-shot learners", ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS, vol. 33, 2020, pages 1877 - 1901
CHEN ET AL.: "Can Pre-trained Vision and Language Models Answer Visual Information-Seeking Questions?", ARXIV, 23 February 2023 (2023-02-23), Retrieved from the Internet <URL:https://arxiv.org/abs/2302.11713.>
CHEN ET AL.: "PaLI: A Jointly-Scaled Multilingual Language-Image Model", ARXIV, 25 September 2022 (2022-09-25)
CHERN ET AL.: "TPU-KNN: K nearest neighbor search at peak flop/s", CORR, ABS/2206.14286, 2022
CHOWDHERY ET AL.: "Palm: Scaling language modeling with pathways", ARXIV PREPRINT, 2022
CHOWDHERY ET AL.: "PaLM: Scaling Language Modeling with Pathways", ARXIV, 5 April 2022 (2022-04-05)
DOSOVITSKIY ET AL.: "An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale", ARXIV:2010.11929V2, 3 June 2021 (2021-06-03)
GOYAL ET AL.: "Making the V in VQA matter: Elevating the role of image understanding in visual question answering", IN 2017 IEEE CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION, July 2017 (2017-07-01), pages 6325 - 6334, XP033249996, DOI: 10.1109/CVPR.2017.670
GUI ET AL.: "KAT: A Knowledge Augmented Transformer for Vision-and-Language", ARXIV, 16 December 2021 (2021-12-16), Retrieved from the Internet <URL:https://arxiv.org/abs/2112.08614.>
HU ET AL.: "REVEAL: Retrieval-Augmented Visual-Language Pre-Training with Multi-Source Multimodal Knowledge Memory", ARXIV, 10 December 2022 (2022-12-10), Retrieved from the Internet <URL:https://arxiv.org/abs/2212.05221.>
JUMPER ET AL.: "Highly accurate protein structure prediction with AlphaFold", NATURE, vol. 596, 26 August 2021 (2021-08-26), pages 583, XP055888904, DOI: 10.1038/s41586-021-03819-2
KUDO ET AL.: "SentencePiece: A simple and language independent subword tokenizer and detokenizer for Neural Text Processing", PROCEEDINGS OF THE 2018 CONFERENCE ON EMPIRICAL METHODS IN NATURAL LANGUAGE PROCESSING (SYSTEM DEMONSTRATIONS), 31 October 2018 (2018-10-31), pages 66 - 71, Retrieved from the Internet <URL:https://aclanthology.org/D18-2012.pdf>
KULSHRESHTHA ET AL.: "Towards a human-like open-domain chatbot", ARXIV PREPRINT, 2020
KUZNETSOVA ET AL.: "The open images dataset v4: Unified image classification, object detection, and visual relationship detection at scale", IJCV, 2020
LIN ET AL.: "REVIVE: Regional Visual Representation Matters in Knowledge-Based Visual Question Answering", ARXIV, 2 June 2022 (2022-06-02), Retrieved from the Internet <URL:https://arxiv.org/abs/2206.01201.>
LU ET AL.: "Unified-io: A unified model for vision, language, and multi-modal tasks", CORR, ABS/2206.08916, 2022
MARINO ET AL.: "OK-VQA: A visual question answering benchmark requiring external knowledge", IEEE CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION, 2019, pages 3195 - 3204
SAHARIA ET AL.: "Non-Autoregressive Machine Translation with Latent Alignments", ARXIV:2004.07437V3, 16 November 2020 (2020-11-16)
SCAO ET AL.: "BLOOM: A 176b-parameter open-access multilingual language model", CORR, ABS/2211.05100, 2022
TOUVRON ET AL.: "Llama: Open and efficient foundation language models", ARXIV PREPRINT, 2023
VASWANI ET AL.: "Attention Is All You Need", ARXIV: 1706.03762V7, 2 August 2023 (2023-08-02)
WEI ET AL.: "Chain-of-Thought Prompting Elicits Reasoning in Large Language Models", ARXIV, 28 January 2022 (2022-01-28), Retrieved from the Internet <URL:https://arxiv.org/abs/2201.11903.>
ZHOU ET AL.: "Mixture-of-Experts with Expert Choice Routing", ARXIV:2202.09368V2, 14 October 2022 (2022-10-14)

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20250193491A1 (en) * 2023-12-12 2025-06-12 Lumiere AI LLC Dynamic Conversation-Based Video Feedback System
US12489948B2 (en) * 2023-12-12 2025-12-02 Lumiere AI LLC Dynamic conversation-based video feedback system
CN119885136A (en) * 2024-12-30 2025-04-25 同济大学 Multi-mode identity verification method based on attention mechanism
CN119672499A (en) * 2025-02-19 2025-03-21 鹏城实验室 Visual decision model training method and related methods, devices, equipment and media
CN120875272A (en) * 2025-09-25 2025-10-31 浪潮通用软件有限公司 Business planning intelligent decomposition method, equipment and medium based on large model fine tuning

Also Published As

Publication number Publication date
EP4705905A1 (en) 2026-03-11

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 24736245

Country of ref document: EP

Kind code of ref document: A1

WWE Wipo information: entry into national phase

Ref document number: 2024736245

Country of ref document: EP

ENP Entry into the national phase

Ref document number: 2024736245

Country of ref document: EP

Effective date: 20251203

NENP Non-entry into the national phase

Ref country code: DE

WWP Wipo information: published in national office

Ref document number: 2024736245

Country of ref document: EP