Here is how it works. First, the system splits tasks between two language models: a "privileged LLM" (P-LLM) generates code that defines the steps to take, like calling a function to get the last email or sending a message. Think of this as the "planner module" that only processes direct user instructions.
Next, a "quarantined LLM" (Q-LLM) only parses unstructured data into structured outputs. Think of it as a temporary, isolated helper AI. It has no access to tools or memory and cannot take any actions, preventing it from being directly exploited. This is the "reader module" that extracts information but lacks permissions to execute actions. To further prevent information leakage, the Q-LLM uses a special boolean flag ("have_enough_information") to signal whether it can fulfill a parsing request, rather than potentially returning manipulated text back to the P-LLM if compromised.
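To make the flag concrete, here is a minimal sketch of what a Q-LLM result schema could look like. The field names other than "have_enough_information" are illustrative assumptions, and the stub function stands in for a real quarantined-model call:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class QLLMResult:
    have_enough_information: bool   # signals whether the parse succeeded
    value: Optional[str] = None     # the extracted structured value, if any

def query_quarantined_llm(instruction: str, data: str) -> QLLMResult:
    """Stand-in for the quarantined model: it parses untrusted `data`
    per `instruction`, with no tools, memory, or ability to act."""
    # A real Q-LLM call would go here; this stub simulates extracting
    # an email address so the control flow is visible.
    for token in data.split():
        if "@" in token:
            return QLLMResult(have_enough_information=True, value=token)
    return QLLMResult(have_enough_information=False)

result = query_quarantined_llm("Find Bob's email address", "Reach Bob at bob@example.com")
if result.have_enough_information:
    print(result.value)  # only structured output ever reaches the planner
```

The point of the boolean is that the planner branches on a typed flag rather than on free-form text a compromised Q-LLM could manipulate.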
The P-LLM never sees the content of emails or documents. It sees only that a value exists, such as "email = get_last_email()", and then writes code that operates on it. This separation ensures that malicious text cannot influence which actions the AI decides to take.
CaMeL's innovation extends beyond the dual-LLM approach. CaMeL converts the user's prompt into a sequence of steps described using code. Google DeepMind chose a locked-down subset of Python because every capable LLM is already adept at writing Python.
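One way to picture a "locked-down subset" is a static check that rejects generated plans using constructs outside the allowed language. This is a sketch under assumed rules (the forbidden-node list here is illustrative, not CaMeL's actual specification):

```python
import ast

# Assumed example rules: no imports, no attribute access, no lambdas,
# no while-loops. A real subset definition would be far more thorough.
FORBIDDEN_NODES = (ast.Import, ast.ImportFrom, ast.Attribute, ast.Lambda, ast.While)

def in_safe_subset(code: str) -> bool:
    """Return True if the generated plan uses only allowed constructs."""
    tree = ast.parse(code)
    return not any(isinstance(node, FORBIDDEN_NODES) for node in ast.walk(tree))

print(in_safe_subset("email = get_last_email()"))  # a plain tool call passes
print(in_safe_subset("import os"))                 # an import is rejected
```

Checking the plan before execution means a prompt-injected model cannot smuggle arbitrary Python past the interpreter.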
From prompt to secure execution
For example, Willison gives the example prompt "Find Bob's email in my last email and send him a reminder about tomorrow's meeting," which would convert into code like this:
email = get_last_email()
address = query_quarantined_llm(
    "Find Bob's email address in [email]",
    output_schema=EmailStr
)
send_email(
    subject="Meeting tomorrow",
    body="Remember our meeting tomorrow",
    recipient=address,
)
In this example, email is a potential source of untrusted tokens, which means the email address could be part of a prompt injection attack as well.
By using a special, secure interpreter to run this Python code, CaMeL can monitor it closely. As the code runs, the interpreter tracks where each piece of data comes from, which is called a "data trail." For instance, it notes that the address variable was created using information from the potentially untrusted email variable. It then applies security policies based on this data trail. This process involves CaMeL analyzing the structure of the generated Python code (using the ast library) and running it systematically.
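A toy version of this tracking can be built directly on the ast library. The sketch below (my simplification, not CaMeL's actual interpreter) walks each assignment, records which calls and variables it depends on, and propagates those dependencies so that address visibly traces back to the untrusted get_last_email source:

```python
import ast

# Assumed label: tools whose return values contain attacker-readable text.
UNTRUSTED_SOURCES = {"get_last_email"}

def data_trails(code: str) -> dict:
    """Map each assigned variable to the set of names it transitively depends on."""
    trails = {}
    for node in ast.walk(ast.parse(code)):
        if isinstance(node, ast.Assign) and isinstance(node.targets[0], ast.Name):
            target = node.targets[0].id
            deps = set()
            # collect every function name and variable used on the right-hand side
            for sub in ast.walk(node.value):
                if isinstance(sub, ast.Name):
                    deps.add(sub.id)
            # inherit the trails of any variable this one was derived from
            for dep in list(deps):
                deps |= trails.get(dep, set())
            trails[target] = deps
    return trails

plan = """
email = get_last_email()
address = query_quarantined_llm(email)
"""
tainted = data_trails(plan)["address"] & UNTRUSTED_SOURCES
print(sorted(tainted))  # prints ['get_last_email']
```

Before a sensitive action like send_email runs, a security policy can consult these trails and, for example, require user confirmation whenever an argument's trail touches an untrusted source.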