Over the past few years, the Bank of England has radically changed the way it collects and analyzes its data, adopting enterprise open source technologies to modernize its infrastructure. At the Dataworks Summit last week in Barcelona, Adrian Waddy, technical manager of the bank's big data platform, and Nick Vaughan, data analytics and modeling expert, explained how they designed this new data platform and what they learned along the way.

When Mark Carney took over as Governor of the Bank of England in 2013, he commissioned an independent review of the institution which, when published, highlighted the need to make better use of data by centralizing it on a single platform. The Bank of England is responsible for £485 billion in assets, and through its real-time gross settlement (RTGS) service it processes an average of £650bn in transactions per day. The technical teams carry out regular stress tests to assess whether the bank could withstand various financial shocks.

More than 50 million transactions with EMIR

With the introduction of EMIR (European Market Infrastructure Regulation) in 2012, the bank's IT department had to collect even more data and automate reporting where possible. This led it to rethink its architecture from scratch. At the time, the financial institution operated 128 different data analysis systems and had to manage the costs of running them all.

“The bank’s analysts relied heavily on the internal network to find the data they needed,” Adrian Waddy said. “But even when they did manage to find it, sometimes the data to be combined was too big to process on their laptops and they had no specific place to do it.” Given London’s position as a financial center and the central bank’s regulatory role for institutions within the United Kingdom, the Bank of England needs to collect around 50 million transactions each day, with peaks of up to 85 million. “It was a radical change for us and we needed a different architecture,” pointed out Nick Vaughan.

A failed first attempt

In the first architecture implemented, data from trade repositories was decompressed into CSV files and stored in a “raw area” where a common set of schemas and structures was applied. The data was then loaded into tables and structured so that it could be queried via Apache Hive. This first iteration did not go without a hitch. “We had problems because we didn’t have the necessary skills internally at the time to build our own clusters,” explained Nick Vaughan. “And we couldn’t afford to invest in a huge data lake, so we quickly realized that we needed outside help.” The chosen vendor – whom the bank does not name – was to build and configure the storage infrastructure and install software on top of it. But, a few weeks before the planned go-live date, the vendor discontinued the product. Nick Vaughan laments: “The defeatists and others had a great time accusing us of having made a disastrous mistake, and the communications around that were difficult at the time.”
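The flow described above – decompressed CSV files landing in a raw area, a schema applied, then loading into Hive-queryable tables – can be sketched roughly as follows. This is a minimal illustration, not the bank's actual code: the schema, file paths and table names are hypothetical, and it assumes a PySpark environment with Hive support enabled.

```python
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, DoubleType, TimestampType

# Hypothetical sketch of the first-iteration flow: raw CSVs -> schema -> Hive table.
spark = (SparkSession.builder
         .appName("emir-raw-to-hive")
         .enableHiveSupport()   # assumes the cluster exposes a Hive metastore
         .getOrCreate())

# Illustrative schema; real EMIR trade reports carry far more fields.
trade_schema = StructType([
    StructField("trade_id", StringType(), nullable=False),
    StructField("counterparty", StringType(), nullable=True),
    StructField("notional", DoubleType(), nullable=True),
    StructField("reported_at", TimestampType(), nullable=True),
])

# Read the decompressed CSV files from the "raw area" and apply the schema.
raw_trades = (spark.read
              .option("header", "true")
              .schema(trade_schema)
              .csv("/data/raw/emir/*.csv"))  # hypothetical path

# Persist as a Hive table so analysts can query it with HiveQL.
(raw_trades.write
 .mode("append")
 .format("parquet")
 .saveAsTable("emir.trades"))
```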

“Afterwards, we realized that we had nevertheless gone from nothing to providing our analysts with access to EMIR data,” underlines the analytics expert. “We also developed our skills enormously in a short time, until we gained the confidence to build our second data hub. It will be five times larger and will run across multiple data centers. For the moment it’s off to a slow start, but we’re going to get a lot of value out of it.” One immediate contribution is that the hub allowed the bank’s researchers to study the impact that Brexit could have on derivatives markets in the United Kingdom. “We have a responsibility to report what we consider to be a risk to the financial system,” says Vaughan. “There was a risk that around £41 trillion of open derivatives contracts would become uncertain just months away from leaving the European Union. Fortunately, because we had these datasets and made this information public, those affected by this significant risk were able to manage and mitigate it, so the situation has now improved.”

Second iteration with Cloudera

The Bank of England is therefore now preparing to launch the second iteration of its data hub. Built in close collaboration with Cloudera – the Hadoop specialist which has since merged with Hortonworks – this data hub should be ready next year. “One of the main differences in terms of environment is that we will go from a single cluster to three separate production clusters,” said Adrian Waddy. “It largely mirrors Azure’s cloud offering and means we will be able to scale these clusters to the workload. The idea is to have the data lake; the acquisition cluster, which will be the engine running the jobs; then a query cluster tuned for low latency and holding the most recent data; and an analytics cluster containing the same data, to which some raw data could be added if the analysts so wish.”
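In practice, splitting work across acquisition, query and analytics clusters means directing each job to the right endpoint. The sketch below is purely illustrative – the hostnames, ports and routing rules are assumptions, not the bank's configuration – and uses PyHive, one common way to query Hive from Python.

```python
from pyhive import hive

# Hypothetical endpoints for the three production clusters described above.
CLUSTERS = {
    "acquisition": {"host": "acq-hive.example.internal", "port": 10000},
    "query":       {"host": "qry-hive.example.internal", "port": 10000},
    "analytics":   {"host": "ana-hive.example.internal", "port": 10000},
}

def run_query(sql: str, workload: str = "query"):
    """Send a HiveQL statement to the cluster suited to the workload:
    'query' for low-latency lookups on recent data, 'analytics' for
    heavier exploratory jobs, 'acquisition' for ingestion-side checks."""
    endpoint = CLUSTERS[workload]
    conn = hive.Connection(host=endpoint["host"], port=endpoint["port"])
    try:
        cursor = conn.cursor()
        cursor.execute(sql)
        return cursor.fetchall()
    finally:
        conn.close()

# A low-latency lookup goes to the query cluster ...
recent = run_query("SELECT COUNT(*) FROM emir.trades WHERE report_date = current_date()")

# ... while a heavy aggregation is sent to the analytics cluster.
exposure = run_query(
    "SELECT counterparty, SUM(notional) FROM emir.trades GROUP BY counterparty",
    workload="analytics",
)
```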

For governance, the bank will use Apache Atlas to build a data audit platform. Vaughan added that this will give the organization better visibility into what employees are doing, what data they are using and how they are drawing insights from it. The new data hub architecture runs on hyperconverged VxRack hardware running VMware, with EMC Isilon storage, offering 320TB of usable storage and approximately 10TB of RAM.
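As an illustration of what an Atlas-backed audit layer looks like in practice, the snippet below searches the Atlas metadata catalogue for Hive tables matching a term. It is a minimal sketch under stated assumptions: the host, credentials and search term are placeholders, and it assumes Atlas's v2 REST basic-search endpoint is reachable.

```python
import requests

# Hypothetical Atlas endpoint and credentials; replace with real values.
ATLAS_URL = "http://atlas.example.internal:21000"
AUTH = ("atlas_user", "atlas_password")

def find_hive_tables(term: str):
    """Run a basic search against Apache Atlas for Hive tables whose
    metadata matches the given term, returning name and qualifiedName."""
    resp = requests.get(
        f"{ATLAS_URL}/api/atlas/v2/search/basic",
        params={"typeName": "hive_table", "query": term, "limit": 25},
        auth=AUTH,
    )
    resp.raise_for_status()
    entities = resp.json().get("entities", [])
    return [(e["attributes"].get("name"), e["attributes"].get("qualifiedName"))
            for e in entities]

# Example: list tables whose metadata mentions EMIR trade reports.
for name, qualified_name in find_hive_tables("emir"):
    print(name, qualified_name)
```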

Lowering costs

The Bank of England was recently criticized by a select committee for its costly and inefficient IT spending. So are these changes part of a larger effort to improve the situation? Nick Vaughan told our colleagues at Computerworld that “the 2014 strategic review called for more efficient use of technology and fewer silos, so that every time we build a new system we don’t need a user acceptance test instance or an instance with its own infrastructure, management and licensing overhead. In fact, open source technology, on which we can put all our data and use it more efficiently, allows us to reduce costs and, above all, gives us the ability to exploit new data sets.”