Diving into data lakes and standardization with big data architect, Brian Flüg

As credit unions embrace data and analytics, Big Data and Data Lakes are increasingly topics of interest. Just as “Data Warehouse” and “Business Intelligence” were novel concepts only a few years ago (and are nearly obsolete today), credit unions are now trying to understand how vast pools of data can benefit their bottom lines.

As a naturally collaborative industry, credit unions have a great advantage over their counterparts in the community bank space when it comes to pooling data. By voluntarily contributing their member and financial data to an industry-focused data lake, credit unions give themselves the opportunity to improve not only their own organizational results, but also the competitiveness and security of the industry as a whole.

A further advantage credit unions possess in this area is data standardization. What is meant by data standardization, and why is it important? OnApproach’s Big Data Architect, Brian Flüg, offers some insight into this topic:

Peter Keers (PK): Tell me a little about your background.

Brian Flüg (BF): I go way back. I have three decades in data, working with large data sets and volumes, starting out in the supercomputer mainframe world doing analytics for nuclear power plants that involved simulation, stress analysis, and vibration analysis. In some cases, there were football fields full of supercomputers doing cluster and high-volume data analytic calculations. It involved taking massive streams of input and making them quantifiable to give us results on the nuclear power plant.

That bridged me into mechanical engineering applications like digital wind tunnels and stress and vibration analysis on aircraft wings, for example. I also had some experience with Department of Defense (DOD) intelligence analytics. We had massive streams of information from which we tried to develop probability factors. This was just the beginning of the data explosion, the early stage in which supercomputing capability was needed. Of course, now we have distributed architectures where many small computers are organized together to give the same or better computing power at a fraction of the cost. Hadoop is an extension of this trend, in which we need a distributed, high-performance, cost-effective cluster processing type of solution.

PK: So, technologies like Hadoop have been a big factor in this Big Data trend?

BF: Absolutely. I’ve been involved with Hadoop and Big Data across the medical, entertainment, financial, manufacturing, consumer goods, DOD, and telecom industries, deploying this new, cost-effective Big Data technology. With the onset of the data explosion, technologies like data warehousing and data archiving did not have the performance, scalability, or cost effectiveness to keep up when it came to large pools of data from multiple organizations. Therefore, industry leaders were looking for technologies that could enable cluster-like performance at a reasonable cost. This demand led to the birth of Hadoop and the distributed file systems that support the evolution of cluster computing at the average business level.

PK: Why are Big Data and Data Lakes concepts important for credit unions?

BF: We’ve produced more data in the last 10 years than in our entire prior history, and with the onset of connected devices and the Internet of Things, it’s not stopping. One big problem with the data explosion is the inevitable need to combine multiple streams of data into a single repository for consumption. Ingestion, or acquisition, of this data is probably one of the biggest obstacles to getting it all into a structured, quantifiable form on which you can do analytical calculations.

PK: How does the Data Lake concept fit in here?

BF: With these large datasets, cost-effective performance and scalability are difficult to achieve, so a Data Lake architecture comes into play. It is wide and deep versus narrow and tall, like a traditional data warehouse or data archive. Therefore, it’s much easier and faster to add and remove data, even at large volumes. Data warehouse and archive technologies have trouble keeping up in terms of performance, scalability, and the removal of data.

The Data Lake architecture is distributed information. One piece of information can be shared appropriately where it’s needed, versus requiring multiple copies and tree structures whose schemas associate the various redundant copies with one another. As a result, data warehouses are becoming companies’ intellectual property; they are, you could say, your “golden copy.” With OnApproach M360, that intellectual property is the golden egg. Taking that and amplifying it across many sources from many companies to create a lake allows you to do complex analytics and enables the collaboration of intelligence from one organization to another. This is where the lake is starting to really offer some value.

NoSQL is a typical solution for data lakes. Why is that? Because it’s not a tree structure or a hierarchy. Everything is stored as key-value pairs, so you can get to anything at the same level. You don’t have to traverse a tree to get to a particular result.
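To make that contrast concrete, here is a minimal sketch in Python. The record layouts and key names are purely illustrative assumptions, not drawn from any particular data lake or core system.

```python
# Minimal sketch contrasting flat key-value access with hierarchical traversal.
# The record layouts and key names below are hypothetical, for illustration only.

# Hierarchical ("tree") layout: reaching a balance means walking the structure.
warehouse_record = {
    "member": {
        "accounts": {
            "loans": {
                "auto": {"balance": 18250.00}
            }
        }
    }
}
auto_balance = warehouse_record["member"]["accounts"]["loans"]["auto"]["balance"]

# Flat key-value layout: any attribute is one lookup away, no traversal required.
lake_record = {
    "member_id": "M-1001",
    "loan.auto.balance": 18250.00,
    "loan.auto.rate": 0.049,
    "share.savings.balance": 5300.00,
}
auto_balance = lake_record["loan.auto.balance"]
```

The point of the flat layout is that every attribute sits at the same level, so analytic queries don’t depend on knowing a hierarchy in advance.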

PK: So, it’s great to have large volumes of data that can be analyzed but preparing it for meaningful analysis sounds like a lot of work.

BF: One of the biggest heavy-lifting requirements of building a Data Lake is the ETL (extract, transform, load) process, and a big part of that is so-called unstructured data. I like to say information, because I believe there’s no such thing as unstructured data: if it’s structured and quantifiable, it’s data; if it isn’t, it’s information. Some people call that unstructured data.

An example is the receipt from the gas station that’s two blocks long for a stick of gum. This is old legacy data from a point-of-sale input on an older technology that was later ported to a newer one. Each system holds bits of this information in its own format. Different items in different databases, arriving at different speeds, make for huge volumes of this information. You have to manipulate this data and information to make it quantifiable so you can run analytics on it. It’s a huge undertaking.
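As a rough illustration of that kind of heavy lifting, the sketch below parses a hypothetical legacy point-of-sale line into a typed, quantifiable record. The input format and field names are invented for the example, not taken from any actual system.

```python
import re
from datetime import datetime

# A minimal sketch of the cleansing/normalization step described above:
# turning a raw, legacy point-of-sale line into a quantifiable record.
# The input format and field names are hypothetical, for illustration only.

RAW_LINE = "04/17/17  GUM SPEARMINT   $1.29  STORE#0042"

def normalize_receipt_line(line: str) -> dict:
    """Parse one raw receipt line into a structured, typed record."""
    match = re.match(
        r"(?P<date>\d{2}/\d{2}/\d{2})\s+(?P<item>.+?)\s+\$(?P<amount>[\d.]+)\s+STORE#(?P<store>\d+)",
        line.strip(),
    )
    if match is None:
        raise ValueError(f"Unrecognized receipt format: {line!r}")
    return {
        "transaction_date": datetime.strptime(match["date"], "%m/%d/%y").date().isoformat(),
        "item_description": match["item"].title(),
        "amount": float(match["amount"]),  # now numeric, so analytics can sum/average it
        "store_id": int(match["store"]),
    }

print(normalize_receipt_line(RAW_LINE))
```

Multiply this by every source format from every legacy system and the scale of the undertaking becomes clear.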

Traditionally, the ETL is executed on the Hadoop cluster itself. In some cases, that consumes all the resources of the cluster and there’s very little left over for the lake. In other cases, it’s a matter of finding tool sets to achieve this ETL. The variety of sources of this information can be wide and far, and there’s really no template to work from; you can’t just build one and apply it to all. In some cases you can reuse the same logic for ingestion, but cleansing, normalizing, unifying, and structuring this information is a huge, huge undertaking. I’ve seen Hadoop clusters spend two years and $3,000,000 doing nothing but trying to acquire information for the lake, clean it up, and prepare it to be quantifiable, calculable, pervasive, available, efficient, and effective.

PK: It seems some kind of standardization in the incoming data would make a big difference. Since M360 is a standardized data model, what implications does this have for the data lake?

BF: It would be astronomically huge. As I mentioned, I’ve been working with Hadoop as long as it has been around. I’ve been all over the country doing what I call rescue missions. These “rescue missions” were for companies that were unable to adequately prepare their data for analysis in the Data Lake. These companies would be two years into a project and still trying to make the data quantifiable so they could do analytics on it.

OnApproach M360 executes an on-site ETL process to cleanse the data, normalize it, structure it, and put it into various schemas, delivering it to the customer on site, analytically ready, calculated, and so forth. M360 does the heavy lifting, if you think about it: taking all these bits and bytes, converting them from one format to another, joining, converging, and merging them. That’s a lot of computing resources to store, move, and distribute, and it’s done by M360 so it doesn’t need to be done at the Data Lake. M360 is ETL in a box.
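For illustration only, here is a minimal sketch of the sort of source-to-standard mapping an on-site ETL step performs before data reaches the lake. The core-system field names and the target schema are hypothetical, not the actual M360 or CUFX layout.

```python
# Hypothetical mapping of source-specific field names into one consistent schema.
FIELD_MAPS = {
    "core_system_a": {"MBR_NBR": "member_id", "LN_BAL": "loan_balance", "OPEN_DT": "open_date"},
    "core_system_b": {"MemberNo": "member_id", "CurrentBalance": "loan_balance", "Opened": "open_date"},
}

def to_standard(record: dict, source: str) -> dict:
    """Rename source-specific fields into the common schema, dropping unmapped fields."""
    mapping = FIELD_MAPS[source]
    return {mapping[k]: v for k, v in record.items() if k in mapping}

raw_a = {"MBR_NBR": "1001", "LN_BAL": 18250.00, "OPEN_DT": "2017-04-17"}
raw_b = {"MemberNo": "2002", "CurrentBalance": 9300.50, "Opened": "2016-11-02"}

standardized = [to_standard(raw_a, "core_system_a"), to_standard(raw_b, "core_system_b")]
# Both records now share the same structure before they ever reach the lake.
```

Because the mapping happens at the source, records arriving from different credit unions already share one shape, which is what makes the pooled data immediately comparable.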

Doing this at the customer site eliminates the need to do it on the Hadoop Data Lake cluster. This allows the Data Lake to become a large, intelligent repository, pervasively available to any analytical tool anytime, all the time, and to respond very quickly while maintaining scalability and dynamic fluidity. You can add and remove data with ease. I’ve spent decades with Hadoop architectures, and with people doing Hadoop, trying to achieve what OnApproach M360 does today. It reduces risk, cost, and time to market in getting the Big Data results you want. Once a credit union installs M360, it has Global Big Data.

PK: The original value proposition for OnApproach M360 was that the credit union would have their internal data available in a structured format to support more effective analysis. But it sounds like another huge value is that their data can be contributed to the Data Lake in such a way that it’s instantly compatible with all the other M360 credit unions’ data in the Data Lake.

BF: Yes, I think what M360 does for the credit union industry is what the 270/271 transactions and other standardized healthcare information did for the medical industry. Since M360 follows the CUFX standard, there is a consistent, structured format across the industry. M360 brings a predictable, consistent format and structure, plus a delivery mechanism for big data analytics, practically overnight.

PK: In the healthcare space there are major players like Epic who have been instrumental in pushing along the evolution of electronic medical records. From your recollection, did they establish a standard for the electronic medical record that then became the standard for systems across their industry?

BF: Absolutely. There are standards called 270 and 271 that the government maintains for the transmission and control of digital medical data formats. They comprise about 4,900 different elements that define all of the medical characteristics for the schema and lay out the architecture for interacting with the HIPAA Eligibility Transaction System (HETS).

PK: Considering data standardization in the healthcare world, is the data standardization in M360 analogous?

BF: I firmly believe, and forecast, that M360 will bring a standard to the credit union and financial industries for data delivery and data storage similar to what exists in the medical world. The difference is they didn’t have an M360; they’ve done it the hard way. They have some of these Hadoop clusters, and they’ve written code at the cost of millions of dollars and years of people’s time. There are some mechanisms that facilitate the management of that data once it’s structured. The government controls the 270/271 standards, which force structured schemas that everyone must adhere to. I see M360 as an enablement tool to deliver a unified standard for credit unions and financial institutions.

PK: One of the things credit unions struggle with when they think about contributing their data to a Data Lake is their obligation to keep their members’ data secure. Yet they’d like to have the benefits of analysis made possible with the data contributed to the Data Lake. How did they handle the same privacy issue on the medical side?

BF: It is a slightly different environment because of the service providers. It’s a different architecture in terms of how the information is shared and who gets access to what, and a different entry point than it is for credit unions. If multiple hospitals joined their information, there might come a time when one hospital or one insurance company would know what you ate for dinner. Credit unions are more collaborative in terms of their information, so there are more buckets of confidential information that could be shared. By sharing that, however, we must obscure and protect the various types of sensitive information.

PK: What are some of the issues regarding securing that data so that it doesn’t point back to a particular member while at the same time getting the benefits of the Data Lake?

BF: We’ve looked at NIST (National Institute of Standards and Technology) standards and at GDPR requirements in terms of obfuscation. We’ve taken a holistic view of what’s out there today and what’s coming in this area. We are adding this capability to the M360 architecture with a series of applications and software code that obfuscate the data to meet these standards today and tomorrow. Once the data is shared to the lake, we can run analytics on it. The analytics will still work, but some views will be obscured. That information can then be sent back to the credit union, and the credit union can “unobscure” the data so it can see certain details.
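As an illustration of reversible field-level obfuscation of this general kind (not OnApproach’s actual implementation), the sketch below encrypts identifying fields with a key held only by the contributing credit union while leaving analytic fields such as balances in the clear. It assumes the Python cryptography package, and the field names are invented for the example.

```python
from cryptography.fernet import Fernet

# A minimal sketch of reversible field-level obfuscation, not OnApproach's
# actual implementation. Identifying fields are encrypted with a key that
# only the contributing credit union holds, while numeric fields needed for
# analytics (balances, rates) remain in the clear.

SENSITIVE_FIELDS = {"member_id", "member_name", "ssn"}  # hypothetical field names

def obfuscate(record: dict, key: bytes) -> dict:
    f = Fernet(key)
    return {
        k: f.encrypt(str(v).encode()).decode() if k in SENSITIVE_FIELDS else v
        for k, v in record.items()
    }

def unobscure(record: dict, key: bytes) -> dict:
    f = Fernet(key)
    return {
        k: f.decrypt(v.encode()).decode() if k in SENSITIVE_FIELDS else v
        for k, v in record.items()
    }

credit_union_key = Fernet.generate_key()  # held only by the credit union
member_record = {"member_id": "M-1001", "member_name": "Jane Doe",
                 "loan_balance": 18250.00, "loan_rate": 0.049}

shared = obfuscate(member_record, credit_union_key)    # safe to contribute to the lake
restored = unobscure(shared, credit_union_key)         # back at the credit union
```

In this sketch, analytics in the lake can still aggregate balances and rates, while only the key-holding credit union can recover who the record belongs to.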

PK: OK. When it’s in the lake, it’s completely safe from a privacy perspective, but at the same time there’s enough information there, like loan balances and other types of things, that can actually be analyzed and used to do, say, predictive analytics, machine learning, real-time analytics, etc.

BF: Over time, my thought is that the obfuscation might loosen up a little. Why do I say that? Because we’ve architected a leading-edge, high-performance, high-security, in-memory lake. This is fairly unique compared to what others have done in terms of its performance and security. At some point we may be able, depending on comfort level, to reduce the obfuscation, but only for some of what is visible at the analytic layer.

PK: Thank you for your time.

Pete Keers

Pete serves as an Engagement Manager at OnApproach. He has over 20 years of management reporting, information systems, and project management experience. He has held leadership roles in both business ... Web: www.onapproach.net