Gain insights into the process of building AI agents in production at Parcha. Discover the challenges we faced and the solutions we implemented to deploy and run these agents reliably.
For the past six months, we at Parcha have been building enterprise-grade AI Agents that instantly automate manual workflows in compliance and operations using existing policies, procedures, and tools. Now that we have the first set of Parcha agents in production for our initial design partners, here are some reflections and the lessons we have learned.
An agent is fundamentally software that uses Large Language Models (LLMs) to achieve a goal by constructing a plan and guiding its execution. At Parcha, these agents are designed around a few core components. Let's explore them in more detail.
Agent specification

This is the agent's profile and basic instructions about how it operates. Agents have expertise in a particular topic or field, a role, and the means to do their job: a set of tools and commands they can use. Here is a simple example of an agent's specification:
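The sketch below is illustrative rather than Parcha's actual configuration; the name, role, expertise, tools, and commands are hypothetical.

```python
# Hypothetical agent specification (illustrative only)
AGENT_SPEC = {
    "name": "kyb-analyst",
    "role": "Compliance analyst who performs Know Your Business (KYB) checks",
    "expertise": ["business verification", "FinCEN regulations"],
    "tools": ["business_registry_api", "document_ocr", "watchlist_screening"],
    "commands": [
        "check_business_registration",
        "verify_incorporation_document",
        "screen_owners_against_watchlists",
    ],
}
```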
Scratchpad

This is a space in the prompt to the language model where the agent adds tool results and observations as it executes its plan. These observations are then used as inputs to subsequent tools, to guide the rest of the execution plan, or to inform the final assessment.
Example of the scratchpad for an agent that performed one of the commands above:
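A hypothetical excerpt, in a thought/action/observation format, for the check_business_registration command from the sketch above:

```
Thought: I need to confirm the business is registered in the state it claims.
Action: check_business_registration
Action Input: {"company_name": "Acme Robotics, Inc.", "state": "DE"}
Observation: Registration found. Status: active. Incorporated 2019-03-12 in Delaware.
Thought: The registration matches the self-attested information. I can record this
result and move on to the next step of my plan.
```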
Standard operating procedure (SOP)

An SOP is the set of instructions the agent needs to follow to complete a task. The agent uses the SOP to construct a plan with the tools it has available and to determine whether it has all the information needed to make an assessment.
Here is an example of a simple SOP used to perform a Know Your Business check on a customer:
When a new company is onboarding onto BaaS’s Issuing platform, we are required to complete a KYB (Know Your Business) check on the customer based on FinCEN regulations. To complete the check we must carry out each of the following steps:
Final assessment

Our agents have specific instructions that dictate their output. This could be an assessment that approves or denies an application (or escalates it to a human), a report written from the observations retrieved, or a specific output the user wants to act on.
Example of the Know Your Business final assessment:
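An illustrative example of what such an assessment can look like (the company and findings are hypothetical):

```
Recommendation: APPROVE
Reasoning:
- Acme Robotics, Inc. is registered and in good standing in Delaware.
- The certificate of incorporation is valid and matches the self-attested
  company name and state of incorporation.
- No watchlist or sanctions hits were found for the listed beneficial owners.
```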
Our initial approach to building agents was fairly naive. Our objective was to see what was possible and validate that we could build AI agents using the same instructions humans would use to perform the task. The agent was simple: we used LangChain agents with a standard operating procedure (SOP) embedded in the agent's scratchpad. We wrapped custom-built API integrations into tools and made them available to the agent. The agent was triggered from a web front-end through a WebSocket connection, which stayed open and received updates from the agent until it completed its task. While this approach helped us build something quickly and get validation from our design partners, it had multiple shortcomings we had to address over time to prepare our agents for production-grade tasks.
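A rough sketch of that first iteration, assuming the classic LangChain agent API of the time; the tool functions and SOP string are hypothetical stand-ins:

```python
from langchain.agents import AgentType, Tool, initialize_agent
from langchain.chat_models import ChatOpenAI

def check_business_registration(query: str) -> str:
    """Hypothetical wrapper around our registration-lookup API integration."""
    ...

def screen_owners_against_watchlists(query: str) -> str:
    """Hypothetical wrapper around a watchlist-screening API integration."""
    ...

SOP = "When a new company is onboarding, complete a KYB check: ..."  # embedded in the prompt

tools = [
    Tool(
        name="check_business_registration",
        func=check_business_registration,
        description="Look up a company's registration status in a given state.",
    ),
    Tool(
        name="screen_owners_against_watchlists",
        func=screen_owners_against_watchlists,
        description="Screen the company's owners against sanctions and watchlists.",
    ),
]

agent = initialize_agent(
    tools,
    ChatOpenAI(temperature=0),
    agent=AgentType.ZERO_SHOT_REACT_DESCRIPTION,
    verbose=True,
)

# The web front-end held a WebSocket open and streamed the agent's updates
# back to the user until the run completed.
result = agent.run(f"{SOP}\n\nApplicant: Acme Robotics, Inc. (Delaware)")
```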
The rest of this post walks through the main shortcomings of that approach and how we addressed them.
Agents as asynchronous processes

We now run our agents as long-running processes, asynchronously. A web service can still trigger agents, but instead of communicating bi-directionally through WebSockets, they post updates using pub/sub. This simplified the communication interface between the agent and the customer and made our agents useful beyond a synchronous web service. Agents can still provide real-time status through server-sent events, and they can still request actions from the customer, like asking for clarification or waiting to receive a missing piece of information. Furthermore, agents can be triggered through an API, followed in a Slack channel (they start threads and provide updates as replies until completion), and evaluated at scale as headless processes. Since agents can be triggered and consumed through REST (polling and SSE), our customers can integrate them into their workflows without relying on a web interface.
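A minimal sketch of the update path, assuming Redis pub/sub and hypothetical channel and event names:

```python
import json
import redis

r = redis.Redis()

def publish_update(job_id: str, status: str, detail: str) -> None:
    """Post a status update from a running agent. Subscribers (the web service
    relaying server-sent events, a Slack bot, an evaluation harness) all
    consume the same channel."""
    r.publish(
        f"agent-updates:{job_id}",
        json.dumps({"status": status, "detail": detail}),
    )

# Inside the long-running agent process:
publish_update("job-123", "in_progress", "Verified business registration in Delaware")
publish_update("job-123", "needs_input", "Waiting for the certificate of incorporation")
publish_update("job-123", "complete", "Recommendation: approve")
```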
Coordinator and worker agents

After evaluating multiple real-world SOPs and shadow sessions with design partners, we realized the instructions could be decoupled into multiple SOPs. We developed the ability for agents to trigger other agents and consume their output. Now, instead of putting one very complex SOP into a single agent's scratchpad, we have a coordinator <> worker model. The coordinator develops an initial plan using the master SOP and delegates subsets of it to "worker" agents that gather evidence, draw conclusions on their local set of tasks, and report back. The coordinator then uses all the evidence the workers gathered to develop a final recommendation.
For example, in a KYB process, it's common to perform tasks like verifying the applicant's identity, verifying that a certificate of incorporation provided as a PDF is valid, and checking whether the owners are on any watchlist. These are multi-step processes: checking the certificate alone involves performing OCR on the document, validating it, extracting information from it, and comparing that information with what the applicant provided (usually fetched from an API). Instead of having one agent perform all these steps, a coordinator triggers a worker agent to perform each check and report back. The coordinator then decides whether, per the SOP, the customer should be approved. Since each agent has its own scratchpad, this helped us steer them to complete tasks with less noise in the context window. A simplified sketch of this split follows below.
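The class and method names in this sketch are illustrative, not Parcha's actual implementation:

```python
from dataclasses import dataclass, field

@dataclass
class WorkerAgent:
    """Runs one subset of the SOP with its own scratchpad and tools."""
    task: str
    scratchpad: list = field(default_factory=list)

    def run(self) -> dict:
        # ...plan, call tools, and record observations in self.scratchpad...
        return {"task": self.task, "conclusion": "pass", "evidence": self.scratchpad}

@dataclass
class CoordinatorAgent:
    """Plans from the master SOP, delegates to workers, and makes the final call."""

    def run(self, sop_tasks: list[str]) -> dict:
        evidence = [WorkerAgent(task=t).run() for t in sop_tasks]
        approved = all(e["conclusion"] == "pass" for e in evidence)
        return {"recommendation": "approve" if approved else "escalate", "evidence": evidence}

CoordinatorAgent().run([
    "verify the applicant's identity",
    "validate the certificate of incorporation",
    "screen the owners against watchlists",
])
```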
Don’t rush to judge
We faced a similar challenge when asking our agents to verify information from a document. In the KYB example above, the customer enters information (e.g., the company name) into a form and provides some evidence for it (e.g., an incorporation document). Our initial approach to a verification task was to put the text extracted from the document via OCR, along with the self-attested information, into the context window and ask the LLM whether the information matched. Since documents are long and full of potentially irrelevant information, the results were not great. We made this significantly better by separating extraction from judgment. We now ask the LLM to extract the relevant information from the document (e.g., is it valid? Which company is mentioned, and in which state or country is it incorporated?). Then, in a second trip to the LLM, we ask it to compare that extracted information (with the irrelevant parts of the document removed) against the self-attested information (e.g., the company details entered in the form). This not only worked much better, it also didn't increase token count or execution time significantly, since the second step uses a much smaller prompt.
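A sketch of the two-trip approach, with a hypothetical llm() helper standing in for the model call:

```python
import json

def llm(prompt: str) -> str:
    """Hypothetical helper that sends a prompt to the LLM and returns its reply."""
    ...

def verify_incorporation_document(ocr_text: str, self_attested: dict) -> dict:
    # Trip 1: extract only the relevant fields from the long, noisy OCR text.
    extracted = json.loads(llm(
        "From the document text below, return JSON with the keys "
        "'company_name', 'state_of_incorporation', and 'appears_valid'.\n\n"
        f"{ocr_text}"
    ))

    # Trip 2: compare the small extracted payload with the self-attested form data.
    # This prompt contains far fewer tokens than the full document did.
    return json.loads(llm(
        "Do these two records describe the same company? Return JSON with the keys "
        "'match' (true/false) and 'explanation'.\n\n"
        f"Extracted from document: {json.dumps(extracted)}\n"
        f"Self-attested by applicant: {json.dumps(self_attested)}"
    ))
```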
Keep it simple
The coordinator <> worker agent model introduced a challenge: since each agent has its own scratchpad, how do we share information between them? We had agents that needed the same information for two different tasks, and initially they both had to execute the same tools, which was inefficient. The obvious answer was to incorporate memory. There are many mechanisms for memory in agents and LLM apps: vector DBs, adding a subset of the previous steps to the scratchpad, and so on. Rather than complicating our stack or risking polluting the scratchpad with noise, we decided to leverage an in-memory store we were already using for communication: Redis. We now tell the agent which pieces of information it has available (the keys under which they are stored in Redis) and built our tool interface to pull inputs from the in-memory store. By only adding relevant memory to the scratchpad, we save on LLM tokens, keep the context window clean, and ensure worker agents pick up the right information whenever they need it.
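A minimal sketch of the idea, assuming Redis as the shared store; the key names are hypothetical:

```python
import json
import redis

r = redis.Redis()

def remember(job_id: str, key: str, value: dict) -> None:
    """A tool stores its output once so other workers can reuse it by key."""
    r.set(f"{job_id}:{key}", json.dumps(value))

def recall(job_id: str, key: str) -> dict | None:
    """Tools pull inputs from the shared store instead of re-running other tools."""
    raw = r.get(f"{job_id}:{key}")
    return json.loads(raw) if raw else None

# Worker A runs the registration lookup once and stores the result.
remember("job-123", "business_registration", {"status": "active", "state": "DE"})

# Worker B's scratchpad is only told that the key "business_registration" exists;
# its tool fetches the value when, and only when, it actually needs it.
registration = recall("job-123", "business_registration")
```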
Mistakes happen
Each Parcha agent interacts with multiple homegrown libraries and third-party services. Since the workflows are complex, there are cases where a step fails. An HTTP connection to an API may return an internal error, or the agent may choose the wrong input for a tool. Given how our agents were configured, any error would cause the agent to fail with no opportunity to recover. We now treat agents as asynchronous services with multiple failover mechanisms. They are queued and executed by worker processes (we use RQ for this), and we get alerted when they fail. More importantly, we are now leveraging well-typed exceptions in our tools to feed them back to the agent. If a tool fails, we send the exception name and message to the agent, allowing it to recover independently. This has helped us reduce the number of catastrophic failures significantly.
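A sketch of both pieces, queueing agent runs with RQ and returning typed tool exceptions to the agent; the exception names and the run_agent_job entry point are illustrative:

```python
import redis
from rq import Queue

class InvalidToolInputError(Exception):
    """Raised when the agent passes a malformed input to a tool."""

class UpstreamAPIError(Exception):
    """Raised when a third-party service returns an error."""

def run_tool(tool, **kwargs) -> str:
    """Run a tool; on failure, surface the error to the agent instead of crashing."""
    try:
        return tool(**kwargs)
    except (InvalidToolInputError, UpstreamAPIError) as exc:
        # The exception name and message land in the scratchpad, so the agent
        # can retry with different inputs or route around the failure.
        return f"ERROR {type(exc).__name__}: {exc}"

def run_agent_job(job_id: str) -> None:
    """Hypothetical entry point that builds and runs an agent for a given job."""
    ...

# Agent runs are queued and picked up by worker processes; failures trigger alerts.
queue = Queue("agents", connection=redis.Redis())
queue.enqueue(run_agent_job, "job-123")
```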
Building blocks
After taking multiple weeks to develop each of our first agents, we decided to focus on reusability and speed of building. We want to enable our customers to build their own agents quickly. To that end, we designed our agent and tool interfaces with composability and extensibility in mind. We now also have enough workflows from design partners to understand which building blocks we need to invest in. For example, many of our customers do some form of document extraction, so we built a tool that can easily be applied to any document extraction task. We built it once but already use it in most workflows: our customers can use it to extract information from an incorporation document and validate its veracity, or to calculate an individual's income from a pay stub. The work to adapt the tool to a specific new workflow is minimal.
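A sketch of what such a reusable building block can look like, parameterized by the fields each workflow needs; the llm() helper and field names are hypothetical:

```python
import json

def llm(prompt: str) -> str:
    """Hypothetical helper that sends a prompt to the LLM and returns its reply."""
    ...

def extract_document_fields(document_text: str, fields: dict[str, str]) -> dict:
    """Generic extraction tool: pass OCR'd text and a field -> description map."""
    field_spec = "\n".join(f"- {name}: {desc}" for name, desc in fields.items())
    return json.loads(llm(
        "Extract the following fields from the document and return them as JSON:\n"
        f"{field_spec}\n\nDocument:\n{document_text}"
    ))

incorporation_text = "..."  # OCR output from a certificate of incorporation
pay_stub_text = "..."       # OCR output from a pay stub

# The same tool serves very different workflows:
incorporation_fields = extract_document_fields(incorporation_text, {
    "company_name": "Legal name of the company",
    "state_of_incorporation": "State or country where the company is incorporated",
})
income_fields = extract_document_fields(pay_stub_text, {
    "gross_income": "Gross income for the pay period",
    "pay_period": "Start and end dates of the pay period",
})
```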
We are hiring a full-stack founding engineer in San Francisco. This person will be responsible for leading the architecture and development of the platform powering our AI Agents. They will partner with our founders and AI engineer to build bleeding-edge technology for real enterprise customers. Read more about this role here and contact miguel (@miguelriosEN on X or miguel@parcha.ai) to learn more.