Michael Goldverg from BNY Mellon presented on their journey with automating processes within the bank across thousands of people in multiple business departments. They needed to deal with interdependencies between departments, variations due to account/customer types, SLAs at the departmental and individual level, and thousands of daily process instances.
They use the approach of a single base model with thousands of variations, the “super model”, where the base model appears to include smaller ad hoc models (mostly snippets surrounding a single task that were initially all manual operations) that are assembled dynamically for any specific type of process. Sort of an accidental case management model at first glance, although I’d love to get a closer look at their model. There was a question about the number of elements in their model, which Michael estimated as “three dozen or so” tasks and a similar number of gateways, but he can’t share the actual model for confidentiality reasons.
They have a deployment architecture that allows for multiple clusters accessing a single operational database, where each cluster could have a unique version of the process model. Applications could then be configured to select the server cluster – and therefore the model version – at runtime, allowing for multiple models to be tested in a live environment. There’s also an automated process instance migration service that moves the live process instances if the old and new process models are not compatible. Their model changes constantly, and they update the production model at least once per week.
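To make the cluster-selection idea concrete, here's a minimal sketch in Python. It's my own illustration, not BNY Mellon's code: the cluster names, URLs, version labels and application names are all hypothetical, and a real setup would read this mapping from configuration rather than hard-coding it.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Cluster:
    name: str           # hypothetical cluster label
    base_url: str       # hypothetical engine endpoint
    model_version: str  # process model version deployed on this cluster


# All clusters share one operational database; only the deployed model differs.
CLUSTERS = {
    "stable": Cluster("stable", "https://engine-a.example.internal", "v41"),
    "candidate": Cluster("candidate", "https://engine-b.example.internal", "v42"),
}

# Per-application setting: which cluster (and therefore which model version) to use.
APP_CONFIG = {
    "onboarding-ui": "stable",
    "pilot-desk": "candidate",
}


def select_cluster(app_name: str, default: str = "stable") -> Cluster:
    """Resolve an application's cluster at runtime, falling back to a default."""
    return CLUSTERS[APP_CONFIG.get(app_name, default)]
```

Pointing one application at the candidate cluster is then a configuration change rather than a redeployment, which is roughly what lets a new model version be tried against live work.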
They’ve had to deal with optimistic locking exceptions (fairly common when you have a lot of parallel gateways and multiple instances of the engine) by introducing their own external locking mechanism, and by offloading some of this to the Camunda JobExecutor using asynchronous continuations, although that can cause a hit on performance. The hope is that this will be resolved when they move to the V8 engine, which doesn’t rely on a single relational database and is highly distributed by design.
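For anyone who hasn't run into optimistic locking before, here's a small in-memory sketch of why the exceptions happen and how a per-instance external lock avoids them. Everything here is an assumption for illustration: `ExecutionStore`, `lock_for` and `complete_branch` are hypothetical names, this is not Camunda's API, and I don't know how BNY Mellon's lock is actually built. A `threading.Lock` only works within one process, so a multi-node deployment would need a distributed lock instead.

```python
import threading


class OptimisticLockError(Exception):
    """Raised when a row changed between the read and the update."""


class ExecutionStore:
    """In-memory stand-in for a database table with a revision column."""

    def __init__(self):
        self._rows = {}                 # instance_id -> (revision, state)
        self._guard = threading.Lock()  # keeps each compare-and-update atomic

    def insert(self, instance_id, state):
        with self._guard:
            self._rows[instance_id] = (1, state)

    def read(self, instance_id):
        with self._guard:
            return self._rows[instance_id]

    def update(self, instance_id, expected_revision, new_state):
        with self._guard:
            revision, _ = self._rows[instance_id]
            if revision != expected_revision:
                # Someone else committed first: this writer's view is stale.
                raise OptimisticLockError(instance_id)
            self._rows[instance_id] = (revision + 1, new_state)


# Hypothetical external lock registry: one lock per process instance.
_locks = {}
_locks_guard = threading.Lock()


def lock_for(instance_id):
    with _locks_guard:
        return _locks.setdefault(instance_id, threading.Lock())


def complete_branch(store, instance_id, branch):
    """Read and update under the instance's lock so the revision is never stale."""
    with lock_for(instance_id):
        revision, state = store.read(instance_id)
        store.update(instance_id, revision, state + [branch])
```

Two parallel branches that both read revision 1 and then write will collide: the second `update` raises `OptimisticLockError`. Routing both through `complete_branch` serializes them, at the cost of the branches no longer progressing in parallel, which is the same kind of performance trade-off mentioned above.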
They run 50-100k transactions per day, and have hundreds of millions of tasks in the history database. They manage this with aggressive cleaning of the history database (currently set to 60 days), archiving the task history as PDFs in their content management system, where it remains discoverable. They are also very careful about the types of queries that they allow directly on the Camunda database, since a single poorly-constructed search can bring the database to its knees: this is why Camunda, like other vendors, discourages direct querying of its database.
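The archive-then-purge pattern can be sketched in a few lines. This is an assumption-laden illustration rather than their implementation: the record layout, the `archive` callback and the function names are hypothetical, and only the 60-day window comes from the talk.

```python
from datetime import datetime, timedelta

RETENTION = timedelta(days=60)  # retention window mentioned in the talk


def split_for_cleanup(history, now):
    """Partition history records into (keep, expired) by end time."""
    cutoff = now - RETENTION
    keep, expired = [], []
    for record in history:
        end_time = record["end_time"]
        # Still-running work has no end time and is never cleaned up.
        if end_time is not None and end_time < cutoff:
            expired.append(record)
        else:
            keep.append(record)
    return keep, expired


def cleanup(history, archive, now):
    """Archive expired records first, then drop them from the live history."""
    keep, expired = split_for_cleanup(history, now)
    for record in expired:
        archive(record)  # hypothetical hook, e.g. render to PDF and store it
    return keep
```

Archiving before deleting means a failure in the archive step leaves the record in the history database rather than losing it.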
There are a lot of trade-offs to be understood when it comes to performance optimization at this scale. Also some good points about starting your deployment with a more complex configuration, e.g., two servers even if one is adequate, so that you’re not working out the details of how to run the more complex configuration when you’re also trying to scale up quickly. Lots of details in Michael’s presentation that I’m not able to capture here; definitely check out the recorded video later if you have large deployment requirements.