Debashis Singh, CIO, Mphasis, discusses the need for systems to be operational 24×7 and how Mphasis developed a predictive mechanism to do just that.
With extensive experience in startups and large global corporates, Debashis Singh has an in-depth understanding of technology, traditional IT services including application development and maintenance, systems architecture, product and systems solutions, business process outsourcing, as well as cost and risk management. He has spent more than 24 years in the IT industry, and has been instrumental in helping drive business at the organisations he’s worked with. Currently the CIO at Mphasis, Debashis talks to us about how the company has developed a predictive mechanism that flags 85 per cent of failures before they occur.
The need for the predictive model
“Today, the biggest challenge for a CIO is the predictability of infrastructure availability. While most can line up the best partners and OEMs to support you when a problem occurs, it’s a different question altogether when it comes to avoiding those problems before they occur. Proactive and predictive support, however, is what we are moving towards, and this capability will decide how well one is technologically aligned to meet customer requirements. In banking, for example, systems are expected to be operational 24×7 without fail. This is not possible unless you have a mechanism in place to detect potential problems and correct them before they occur. With this aim in mind, we leveraged internal IT data to test a number of use cases and develop a suitable data model to predict possible failures ahead of time. Every system generates a humongous number of logs, and we found that if we correlated each and every log over a period of time, and used deep learning and the analytics engine, we could develop a system that can predict up to 85 per cent of the failures.”
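The core idea described above — correlating error logs over a period of time to flag trouble before it becomes an outage — can be illustrated with a minimal sketch. This is not Mphasis’ actual model (which uses deep learning and a full analytics engine); it is a simplified, hypothetical example in which a sliding window over per-interval error indicators trips a flag when the recent error rate crosses a threshold:

```python
from collections import deque

def failure_risk(error_flags, window=5, threshold=0.6):
    """Slide a window over a stream of per-interval error indicators
    (1 = at least one error log seen in that interval, 0 = clean)
    and flag intervals where the recent error rate crosses the
    threshold -- a crude stand-in for a predictive model."""
    recent = deque(maxlen=window)
    flags = []
    for e in error_flags:
        recent.append(e)
        rate = sum(recent) / len(recent)
        flags.append(rate >= threshold)
    return flags

# A burst of errors raises the windowed rate and trips the flag
# before the hypothetical device fails outright.
stream = [0, 0, 1, 0, 1, 1, 1, 0, 0, 0]
print(failure_risk(stream))
```

A real deployment would replace the fixed threshold with a trained model and the binary indicators with richer features extracted from correlated logs, but the windowed-history shape of the problem is the same.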
A collaborative effort
“An issue with a firewall, network switch, load balancer, IPS, server, or any other physical device can result in anything from a disk or memory failure to a network throughput issue and more. Mphasis’ in-house team of data scientists and analytics experts worked together with the technology team for a few quarters, taking all the logs from every system, correlating them with each other, testing out all the use cases, and finally developing the predictive model that we needed.
“Today, all large enterprises have something like a SIEM that collects data from each and every system in the form of logs. Similarly, Mphasis uses industry-standard tools for log collection from different systems for our internal purposes; we transcribed those logs into a centralised system, applied a number of filters and correlations, and analysed the data with a statistical model built using tools like R. The initial prototype took a few months to build, while tuning and testing took a few subsequent quarters. After a series of internal tests, we found the solution to be exceptionally effective. Based on this model we created our new service offering for our customers, and the response so far has been almost overwhelming. We are currently working on the beta version and are looking to optimise its working and offerings to provide an even better service.”
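The first stage of the pipeline described above — pulling raw logs from many systems into one centralised store, filtering out noise, and correlating the rest — can be sketched as follows. The log format and field names here are hypothetical (real SIEM tooling handles dozens of formats); the sketch only shows the parse–filter–group shape of the step:

```python
import re
from collections import defaultdict

# Hypothetical log line format: "<host> <LEVEL> <message>".
LOG_RE = re.compile(r"(?P<host>\S+) (?P<level>INFO|WARN|ERROR) (?P<msg>.*)")

def centralise(raw_lines):
    """Parse raw log lines into structured records, drop lines that
    don't match the expected format (a simple filter), and group
    (correlate) the survivors by originating host."""
    by_host = defaultdict(list)
    for line in raw_lines:
        m = LOG_RE.match(line)
        if m:
            by_host[m.group("host")].append((m.group("level"), m.group("msg")))
    return by_host

logs = [
    "fw01 ERROR disk failure imminent",
    "sw02 WARN throughput degraded",
    "garbage that matches no format",
]
print(dict(centralise(logs)))
```

Once logs are normalised into per-host (or per-device, per-time-window) groups like this, they can be handed to a statistical model in R or elsewhere for the actual prediction step.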
“The biggest challenge that we faced was connecting the data with the technology — in other words, aligning the thinking of the scientists and the engineers. For example, the line ‘11% error code XAV’, while perfectly interpretable by the technology team, was gibberish to the data scientists, who could not tell whether an 11 per cent error rate was good or bad. Translating the data to identify whether a particular error log was critical, major, or minor needed the two teams to work together constantly, which significantly increased the turnaround time. The error messages have a number of different connotations and roles: an overshot threshold is an incredibly dangerous occurrence on a router, but on a switch the same event does not carry the same criticality. Each line had to be correctly interpreted to define the statistical model that would give us the correct predictions. However, this being an emerging technology, there was no specific right or wrong way of doing things — we simply learnt on the go and adapted to the requirements as best as possible. So we had to carefully and precisely analyse each and every bit of data.”
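The translation problem described above — the same event class meaning different things on different devices — is essentially a lookup from (device type, event) to a severity label. A minimal sketch, with an entirely hypothetical severity table (the real mapping was what the two teams spent quarters building):

```python
# Hypothetical severity table: the same event can be critical on one
# device type and minor on another, as with the threshold overshoot
# on a router versus a switch described in the interview.
SEVERITY = {
    ("router", "threshold_overshoot"): "critical",
    ("switch", "threshold_overshoot"): "minor",
    ("server", "disk_error"): "major",
}

def classify(device_type, event, default="minor"):
    """Translate a (device type, event) pair into a severity label,
    falling back to a default for unmapped combinations."""
    return SEVERITY.get((device_type, event), default)

print(classify("router", "threshold_overshoot"))  # critical
print(classify("switch", "threshold_overshoot"))  # minor
```

Encoding the domain knowledge as an explicit table like this is one way to let data scientists consume the engineers’ interpretation without re-negotiating every log line.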
“In the current consumer-driven digital world, the expectation from end users is 100 per cent uptime of all services, with anywhere, anytime access. Can you imagine a mailing system or collaboration tool being unavailable even for a few minutes? Having a predictive system that proactively mitigates possible downtime is the need of the hour for every enterprise that wants to deliver a good customer experience.”
In the works
“We are currently also aiming to bring all our legacy systems onto the new-age computing environment, and are considering three different approaches. We could use a middle layer that simply compiles all the data from the legacy system and presents it in the new-age system directly; we could try a shift-and-replace strategy, where we’d replace certain legacy systems with new-age products and solutions; or, lastly, we could leave a system as is. Which approach suits each system depends on its size, complexity, criticality, geographical location, and more. Once we have done this we will be uniquely proficient in providing a new-age digital experience to our employees, and at the same time we will build the capability into a service offering for our customers.”