Personally Identifiable Information for Dummies

ibmpii

Have you heard or read the terms Personal Information or Personally Identifiable Information and not been entirely sure what they represent? It may be that they were mentioned in the perennial news article about a data breach in which thousands, or hundreds of thousands, of individuals’ credit card numbers, driver’s license numbers, addresses, and so on were stolen from company xyz. Or you may be in a job role where you are directly or indirectly affected by internal policy or external regulation around your stakeholders’ personal data – the latest such regulation (and the reason every company is all of a sudden asking if you want to continue hearing from them) is the European General Data Protection Regulation (GDPR). Wherever you are coming from, here is a quick and straightforward primer on PI/PII.

The first thing to understand about these and similar terms is that they are legal concepts, and their definitions can differ somewhat depending on the jurisdiction, e.g. whether you are in Canada, the United States, Europe, or Australia. However, whether you call it personal information, personal data, or personally identifiable information, the common thread is that it is information about a uniquely identified or identifiable individual – in simpler terms, the information includes an element or elements that can uniquely identify the person it relates to.

These elements which can uniquely identify a person can be broken down into ‘direct’ and ‘indirect’ identifiers. Direct identifiers are sufficient on their own to uniquely identify an individual and indeed are often used for the express purpose of determining individual identity. Examples of direct identifiers are:

  • Social Security Number/Social Insurance Number/National Identification Number
  • Driver’s license number
  • Employee serial number
  • Credit card number
  • Telephone number
  • Fingerprints
  • Address

Indirect identifiers are those which on their own would not be sufficient to uniquely identify an individual but when combined with other elements can be used to do so (see diagram below). Examples of indirect identifiers are:

  • State/Province or postal code of residence
  • Job role or name of employer
  • Gender
  • Age
  • School attended or attending

What complicates matters is that an identifier may be sufficient to uniquely identify some individuals but not others. A great example of this is full name. For someone named ‘John Smith’, full name and city of residence may not be sufficient to uniquely identify them, whereas for someone with a rarer full name it might be. From a data protection and privacy perspective it makes sense to apply the strictest possible interpretation of what could uniquely identify an individual.
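The ‘John Smith’ scenario can be sketched in a few lines of Python. The records below are invented purely for illustration – the point is just that each additional indirect identifier narrows the candidate set:

```python
# Toy illustration (invented records, not real data) of how combining
# indirect identifiers can narrow a population down to one person.
records = [
    {"name": "John Smith", "city": "Toronto", "age": 34, "employer": "Acme"},
    {"name": "John Smith", "city": "Toronto", "age": 51, "employer": "Initech"},
    {"name": "John Smith", "city": "Ottawa",  "age": 34, "employer": "Acme"},
]

def matches(records, **criteria):
    """Return the records matching every supplied identifier value."""
    return [r for r in records
            if all(r.get(k) == v for k, v in criteria.items())]

# Name + city alone is ambiguous for a common name...
print(len(matches(records, name="John Smith", city="Toronto")))          # 2 candidates

# ...but adding one more indirect identifier (age) pins down one individual.
print(len(matches(records, name="John Smith", city="Toronto", age=34)))  # 1
```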

PII

Now that we have defined Personal Information/Personally Identifiable Information, we can consider a special type of PI: Sensitive Personal Information. This is Personal Information which is considered sensitive because it could be used to cause substantial harm, embarrassment, inconvenience, or unfairness to an individual. The harm could be of a financial, employment, or reputational/social nature.

Again, it’s important to consider that what is SPI in terms of regulatory/legal requirements depends on the jurisdiction. Businesses may also have their own policy which specifies what is SPI for the purposes of their business practices, based on the jurisdictions they operate in. Examples of SPI are below and include many of the PI types mentioned previously which also happen to be direct identifiers:

  • Social Security Number/Social Insurance Number/National Identification Number
  • Driver’s license number
  • Credit card number

That is PI/PII and SPI/SPII in a nutshell. Hopefully this article has been straightforward to understand and has given you a grasp on these subjects.

 

 

Databases – To Containerize or Not To Containerize?


The microservices architecture is an increasingly popular application architecture approach which can improve time to market by structuring an application as a collection of loosely coupled services that implement specific business capabilities. While microservices and containers are not equivalent concepts (a microservices architecture can be achieved through different deployment mechanisms), containers are the most logical approach because they are lightweight, start quickly, and package each service together with its dependencies.

Coming from a data background, I was interested in how databases fit into a containerized microservices world. Following the microservices architecture, each microservice or business capability should keep its own data private or separate from that of another microservice. That doesn’t mean you need a separate database for each microservice – you can achieve this separation with distinct unrelated tables or separate schemas in the same database for each service. A one database per service approach allows more flexibility in that you can choose the right data store for the job and not have any changes by one service potentially impact another, though it involves more overhead.

However, whether you choose to isolate each service’s data at the table, schema, or database level, a more basic question is whether the database itself should be containerized. The most significant challenge in containerizing a database is that containers were originally designed to be stateless – persistent state or storage does not survive when the container is not running. Some mechanism is required to maintain persistent storage for the database when a container is moved or shut down/restarted.

stateless

In some cases, say for test or development environments, persistence may not be a big issue. But for production workloads where you cannot lose any data, you definitely need a way to persist the data across container restarts, and probably a backup/high availability strategy in case the storage system on the underlying server fails.

There are a few options for handling this challenge when building a microservices-based application with containers. The first approach is simply not to containerize the database. This works really well in a public cloud context where you have SaaS cloud data services that you can easily spin up, that require no management of the hardware and little to no software administration, and that in many cases have high availability built in. The diagram below describes a storefront shopping application built on the IBM Cloud using Kubernetes and Docker and bound to instances of Elasticsearch, Cloudant, and MySQL cloud data services (link to detail).

 

If you decide to containerize the database after all, you need to leverage a mechanism to persist the data store’s associated data outside of the container, either on the container host’s file system or on an external file system/storage area. The Docker documentation highlights volumes as the “preferred mechanism for persisting data generated by and used by Docker containers”. A quick summary of volumes:

  • A volume can be created by Docker at container creation time or afterward, or an existing volume can be mounted into a container
  • A host volume is created in the /var/lib/docker/volumes/ directory on the host machine, which is managed by Docker; this directory is what is mounted into the container
  • Volumes can be shared between multiple containers, although Docker does not automatically handle multiple containers writing to the same volume at the same time without corrupting data – that coordination is left to the applications
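As a minimal sketch of the pattern, here is what a named volume looks like in a Docker Compose file. The service and volume names are illustrative, and I’m assuming the official postgres image purely as an example – any containerized database works the same way:

```yaml
# docker-compose.yml -- minimal sketch of persisting database state in a
# named Docker volume (service/volume names are illustrative).
version: "3.8"
services:
  orders-db:
    image: postgres:15            # example image; any containerized DB is similar
    environment:
      POSTGRES_PASSWORD: example
    volumes:
      # Named volume mounted at the engine's data directory, so data
      # survives container restarts and re-creation.
      - orders-data:/var/lib/postgresql/data
volumes:
  orders-data:                    # managed by Docker under /var/lib/docker/volumes/
```

Removing the container leaves `orders-data` in place; the data is only gone when you explicitly delete the volume.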

An example of a containerized database leveraging volumes is Db2 Warehouse, which mounts a host file system on /mnt/clusterfs in one or more Db2 Warehouse containers. Db2 Warehouse can be deployed in an MPP or distributed flavour, which requires a POSIX-compliant cluster file system. A diagram of an MPP deployment leveraging the IBM Spectrum Scale (GPFS) file system is below.

Screen Shot 2018-08-24 at 2.29.10 PM

Another approach to using an external file system is volume drivers/plugins. These plugins abstract the application logic from the external storage systems and can provide additional functionality, such as high availability, backups, and data encryption for your data. For example, Flocker is a plugin that enables volumes to follow your containers when they move between hosts in your cluster. The VMware vSphere Storage plugin enables running containers backed by storage in a vSphere environment.

Whichever approach you take, the bottom line is: don’t forget the usual considerations for database deployment – performance, availability, security – when deciding how to handle data storage in a container world.

The Evolution of Data Governance

My very simple definition of Data Governance is: the management of data throughout its life cycle, ensuring it can be effectively found, accessed, and trusted by relevant stakeholders, secured against unauthorized access, and disposed of when appropriate. But how has Data Governance evolved over the past number of years?

Working with businesses on their governance initiatives over the past 5 or 6 years, I have certainly seen some marked shifts. Early adopters of a data governance program were primarily motivated by regulatory compliance requirements. A great example of this is the BCBS 239 regulation titled “Principles for effective risk data aggregation and risk reporting”. At a high level, this regulation requires the largest financial institutions in the world to have in place strong governance around risk data aggregation and risk reporting practices, and specifically to be able to prove the accuracy, integrity, and completeness of data used for risk calculations. Regulations around data exist within specific industries as well as across industries, and being non-compliant can result in steep fines or a halt in operations for a business. The security breaches so commonly reported in the media also bring tremendous damage to firms’ reputations.

What has been rapidly emerging as another driver for data governance is the need for effective self-service analytics on an organization’s full set of data assets (as well as external data) for all users within the organization. Data has been recognized as a valuable resource that allows companies to innovate and drive new business models. A data governance methodology and tooling can enable IT to work effectively with the business, to empower users in different business functions to leverage trusted data in an efficient, barrier-free, and safe way.

This is why I see more and more clients today looking to implement data governance programs, from retailers to higher education institutions to transportation firms. Compliance and risk management continue to be critical, with new regulations such as the EU’s General Data Protection Regulation, but more and more the conversation is around building a data lake, or agile analytics – Governance 2.0 if you will.

Screen Shot 2017-09-09 at 7.48.39 PM.png

 

 

Data Lake Repositories

In my previous post The Holy Grail of the Data Lake I offered a definition of a Data Lake and reviewed the key elements that make one up, referring to an IBM RedBook on the topic. The most obvious element is a set of Data Repositories – you need a place to store the different types of data coming into the Data Lake from your Systems of Record (i.e. the business-critical systems) and Systems of Engagement (i.e. mobile and web interfaces).

A viewpoint that was disseminated by Hadoop-only vendors at one point, and one that I still encounter with clients, is that the only solution you need for a Data Lake is Hadoop. The problem is that no single storage format or processing engine is appropriate or best for all workloads. Hadoop and associated Apache projects themselves comprise multiple data formats and engines. It’s notable that one of the Hadoop players, Cloudera, now positions itself as a data management platform provider rather than a Hadoop provider and breaks its offerings into “Analytic DB, Operational DB, Data Science and Engineering, and Cloudera Essentials” – screenshot from their site below. The bottom line is that you should not try to jam a square peg into a round hole: do your due diligence on which repository/engine, whether open source or proprietary, best addresses a given use case.

clouderamarketing

(No longer just Hadoop – Cloudera website)

The diagram below, taken from the previously mentioned RedBook, summarizes the different data domains Data Lake Repositories support. Something interesting called out here is that a Data Lake contains not just the data analytics will be performed on, but also metadata, or descriptive data about the data (Descriptive Data) – this data, as I will discuss, is critical to making the Data Lake useful for an organization’s lines of business.

lakerepositories

There are different ways to architect a Data Lake in terms of repositories and their uses, and this largely depends on an organization’s needs, but the above provides an idea of the different types of uses for data and corresponding repositories needed:

  1. Descriptive Data: metadata about the data assets in the Data Lake and a search index to allow users to easily find data for analytics use cases, supporting a “shop for data” experience. Information views refer to semantic or virtualized views of data providing a simplified view of some data sets for subsets of users.
  2. Deposited Data: an area for users to contribute their own data, or store intermediate data sets or analytic results they have developed.
  3. Historical Data:
    1. Operational History – historical data from systems of record; could be used for some reporting, as an archive for active applications, or maintained for compliance reasons for decommissioned applications. Can be considered a landing zone where data is in a similar format to that in the operational source.
    2. Audit – record of who is accessing what data in the reservoir.
  4. Harvested Data: data from outside the Data Lake which may have been cleansed, combined, or converted into a different form than that in the source applications in order to support analytics.
    1. Deep Data – supports different types of data at high volume, storing data with or without a schema, and supporting analytics on structured and semi-structured/unstructured data.
    2. Information Warehouse – Consolidated historical view of structured data for high performance analytics.
  5. Context Data – the organization’s operational data: master data, e.g. customer records; reference data, e.g. country code tables; and business content and media, e.g. PDFs and audio files.
  6. Published Data – data refined and targeted at particular consumers.

The Holy Grail of the Data Lake

Many, many businesses today are striving to build a “Data Lake” (/Data Reservoir/Logical Data Warehouse) for their organization. In my experience they are all undertaking this with the goal of making more agile, self-service, IT-independent analytics available to the LOBs. Often they also do not have a clear idea of what a successful Data Lake initiative really entails. Some simply deploy a Hadoop cluster and load in all their data with the expectation that this is all that is required, which leads to that other often-referenced concept, the “Data Swamp”.

A simple early definition of a Data Lake is “a storage repository that holds a vast amount of raw data in its native format until it is needed”.

IBM’s definition emphasizes the central role of a governance and metadata layer and that the Data Lake is a set of data repositories rather than single store: “a group of repositories, managed, governed, protected, connected by metadata and providing self service access”.

So keep in mind:

nohadoop

As well as this high level view of the components:

datalakehighlevel

Mandy Chessell is a Distinguished Engineer and Master Inventor in IBM’s Analytics CTO office, a thought leader on the Data Lake who has worked with customers such as ING on implementations of it. Her RedGuide and RedBook on the topic provide a wealth of information. She defines the three key elements of a Data Lake as follows:

Data Lake Repositories – provide platforms both for storing data and running analytics as close to the data as possible.

Data Lake Services – provide the ability to locate, access, prepare, transform, process, and move data in and out of the data reservoir repositories.

Information Management and Governance Fabric – provides the engines and libraries to govern and manage the data in the data reservoir. This set of capabilities includes validating and enhancing the quality of the data, protecting the data from misuse, and ensuring it is refreshed, retained, and eventually removed at appropriate points in its life cycle.

In my view it’s the Data Lake Services that pose the greatest challenge to deliver for most customers. This is because:

a) being able to locate the right data requires commitment and ownership from the LOBs to continuously catalog/label their data via a data catalog, and

b) while there are many tools for enabling self-service data movement, data virtualization/federation, and metadata management, I don’t believe there is a single out-of-the-box silver bullet for all applications, and the right solution may vary depending on your data repositories and priorities.

 

Quantifying the Value of Data Governance

Data Governance is often viewed as a cost of doing business rather than a driver of business value and innovation. But it is in fact critical to the agility of an organization’s analytics initiatives and its ability to make informed business decisions. The explosion of available data, the reduced cost of analyzing it, and pressure from the business to leverage it make delivering access to quality data more important than ever. What good is all that data if it takes you weeks to find what you need, get access to it, and validate that it is appropriate to base critical decisions on? So how do you put a dollar value on an investment in data governance software?

A starting point for the business case is the amount of time employees spend searching for, understanding, and cleansing data via highly manual processes, and what kind of improvement a fit-for-purpose tool set can deliver. How much time do your business analysts or IT staff spend doing the following:

  • Understanding what data sets they need to gain certain analytic insights
  • Determining the quality and origin of each data set
  • Understanding/agreeing on what a certain term or formula in a report means
  • Cleansing data

The other side of the coin is the revenue lost to the business through poor data quality, or through the business lacking insight into what data its analytic insights are based on and what data sets are available to it. Some potential ways to approach this are looking at:

  • Delayed rollout or failure of analytics initiatives
  • Cost of false claims not identified (fraud)
  • Fines associated with bad data, ungoverned processes or data exposure
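The first part of the case lends itself to a back-of-envelope calculation. A minimal sketch in Python is below – every figure is a placeholder to be replaced with your own numbers, and the 40% improvement is purely an assumed scenario, not a benchmark:

```python
# Back-of-envelope estimate of the annual cost of manual data wrangling.
# All numbers are placeholders -- substitute your own figures.
analysts           = 20     # staff doing the activities listed above
hours_per_week     = 10     # hours each spends finding/understanding/cleansing data
loaded_hourly_rate = 75.0   # fully loaded cost per hour, in dollars
weeks_per_year     = 48

annual_cost = analysts * hours_per_week * loaded_hourly_rate * weeks_per_year
print(f"Annual cost of manual data prep: ${annual_cost:,.0f}")   # $720,000

# Assumed scenario: fit-for-purpose tooling cuts that effort by 40%.
savings = annual_cost * 0.40
print(f"Estimated annual saving at 40% reduction: ${savings:,.0f}")   # $288,000
```

Crude as it is, putting even placeholder numbers in front of stakeholders tends to move the conversation from “cost of doing business” to return on investment.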

Data Integration on Hadoop – Why Reinvent the Wheel?

A common reason a new project is started on Hadoop, duplicating already existing capability, is that the existing solution is just not built to scale to large data volumes. Often that’s a valid argument, but in the case of Data Integration/Data Quality, there are many mature solutions out there in the market. Are they all hamstrung when it comes to big data integration?

IBM’s Information Server, a well-established Data Integration solution, initially featured capability that allowed pushdown of its workload to Hadoop via MapReduce. Of course MapReduce has in time been shown to not be the most performant tool and has essentially been superseded by Spark’s in-memory engine. But customers have been using the Information Server engine itself in its scale-out configuration for big data transformation for many years, in very large clusters. From this reality, I surmise, came the decision to unleash the Information Server engine directly as an application on YARN, as BigIntegrate and BigQuality. The diagram below shows how the engine runs on YARN – at the core of it is an Information Server Application Master which negotiates resources for IS processes with the ResourceManager.

isonyarn

How have other integration vendors designed their big data solutions? Talend, which initially also pushed workload down into MapReduce, has switched over to converting its jobs to Spark. This is logical since Spark is much faster than MapReduce, but I expect it also involves some significant coding effort to get right. Informatica’s approach seems a bit more confused, or nuanced – they promote their “Blaze” Informatica engine also running on YARN but suggest that their solution “supports multiple processing paradigms, such as MapReduce, Hive on Tez, Informatica Blaze, and Spark to execute each workload on the best possible processing engine” – link. I think this is just because at the end of the day the Informatica engine wasn’t built to handle true big data volumes.

There’s always the option of doing data integration directly with Hadoop itself, but there’s not much in the way of a complete solution there. You can use Sqoop to bring data in or out, but you’ll still end up writing HiveQL and hundreds of scripts.

 

 

 

Master Data Management (Still) at the Core of a Customer-Centric Strategy

Being able to compile an accurate view of an existing or prospective customer is a tremendous competitive differentiator. Knowing your customer can allow you to make the right upsell offer at the right time to drive increased revenue, to take the right action at the right time to prevent losing a dissatisfied customer, and to provide a tailored experience which builds customer loyalty/reduces customer churn.

Knowing your customer involves understanding who they are across your entire business and, in today’s social age, even outside of it. Different parts of the business may identify one customer slightly differently and hold different relevant pieces of information about that customer. How do you bring all this together? This is the domain of Master Data Management – building a single “golden” view of a customer from many different sources and allowing seamless access to this view for consuming applications. If you think enabling this is straightforward and can be built from scratch in-house, you should give your head a good shake. You can only re-invent so many wheels with your IT department.

With the explosion of available data, Master Data Management is more important than ever. Now, in addition to your various internal sources of record from which to build a picture of your customer you have countless social and other third-party sources. Matching customer identity across all of these and viewing relationships between customers is more crucial than ever!
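To give a flavour of why matching customer identity is harder than it looks, here is a deliberately naive Python sketch with invented records. Real MDM engines use probabilistic scoring, phonetic encodings, and survivorship rules; this toy key-based approach works for easy cases and fails quickly:

```python
import re

def match_key(rec):
    """Crude matching key: sorted lowercase name tokens + digits-only phone.
    Only a sketch -- real MDM matching is far more sophisticated."""
    name = re.sub(r"[^a-z ]", " ", rec["name"].lower())
    tokens = " ".join(sorted(name.split()))
    phone = re.sub(r"\D", "", rec.get("phone", ""))
    return (tokens, phone)

# Invented records for the same person, as three systems might hold them.
crm     = {"name": "Smith, John", "phone": "(416) 555-0199"}
billing = {"name": "john smith",  "phone": "416-555-0199"}
web     = {"name": "J. Smith",    "phone": "4165550199"}

print(match_key(crm) == match_key(billing))  # True: same key after normalization
print(match_key(crm) == match_key(web))      # False: an initial defeats this naive key
```

That last case – an initial instead of a first name – already requires fuzzy or probabilistic matching, which is exactly the wheel you don’t want to re-invent in-house.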

mdmcustomer

 

Speeding up Your Big Data Journey

Hadoop, and now Spark, can be made out to be the answer to every question, the cure for cancer – the Kool-Aid is very powerful. It’s important to remember that while these technologies are certainly game-changers, they do not solve every problem (yet!) and come with familiar challenges.

grail

It seems that when it comes to Hadoop/Spark, considerations around ease of implementation and total cost of ownership can easily go out the window. Some businesses believe they will stand up a cluster and reap tremendous insights within a week. They may also proceed into a Hadoop initiative without having defined use cases for their business. The reality is that with just the core components of the platform you will probably need to hire an army of data scientists/developers to get value out of your data. Not an issue for companies named Google and Amazon, but more so for most others.

Two things can make Hadoop a more palatable and realistic proposition for the masses:

  1. Analytics accelerators which make analytics on Hadoop accessible to people without a PhD in mathematics. IBM offers these kinds of “value-adds” – for example BigSheets, which allows analysis of data on Hadoop via a spreadsheet-like interface. SQL on Hadoop also provides a great easy win for the platform, allowing data discovery or the querying of archived data directly on Hadoop with SQL tools. IBM’s BigSQL has been shown to be a strong contender in this area.
  2. Cloud – eliminating the upfront setup of a cluster as well as on-going administration is huge. Not many of the major Hadoop vendors have a true SaaS offering for Hadoop. Nearly every customer is considering existing or new workloads to put on the cloud; Hadoop-as-a-service makes a lot of sense. IBM’s “eHaaS” offering is BigInsights on Cloud.

 

The In-Memory DB Hypewagon – Quick Thoughts

Recent years have seen many stories touting the impact of in-memory database management systems (IMDBMSes), which load all data into memory, as the new direction for ultra-fast, real-time processing. Each week a new VC-funded entrant seems to arrive with claims of being the fastest database ever, accelerating processing by 100X, 1000X, 10000X! Gartner late last year put out a Market Guide for In-Memory DBMS which advises IT decision-makers to investigate this disruptive technology. Poor old relational database management systems are slagged as relics – dinosaurs optimized for on-disk processing.

There’s no doubt the performance of any DBMS benefits from having enough memory to fit all data in memory. The reason IMDBMSes are presented by their proponents as gazillions of times faster than traditional databases (even when the latter are given enough memory to load all the data into cache) is that they are designed from the ground up to optimize for in-memory operations rather than disk-based processing.

It seems to me, however, that the optimizations that really account for the biggest performance gains can be, and in many cases have been, successfully built into our aforementioned poor “traditional” RDBMSes. In the case of analytics, I would say it is the columnar data structure that is really the game-changer, facilitating efficient aggregation and calculation on individual columns. DB2 10.5 added BLU Acceleration – with this capability you can create columnar storage tables in the database. You can provision enough memory for all of them to be loaded up, but BLU is also optimized for working on large data sets which do not fit into memory, minimizing I/O from disk. When SAP HANA, one of the supposed “pure” in-memory database players, first came out, SAP said that all data should and would be in memory. They have since backtracked with SAP IQ for Near-Line Storage for HANA and, more recently, HANA Dynamic Tiering for Extended Storage (for “non-active” data). The customer performance results I have seen with BLU are no less impressive than those of HANA, and are typically accomplished with a ton less hardware. Microsoft SQL Server and Oracle have also come out with their own columnar store offerings, the Columnstore Index and Database In-Memory, respectively.
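The intuition behind the columnar game-changer can be shown schematically in plain Python (this is purely illustrative – real engines add compression, vectorization, and SIMD on top):

```python
# Row layout: each record carries every attribute, so aggregating one
# column means touching (and, on disk, reading) all of them.
rows = [{"id": i, "region": "EMEA", "amount": i * 1.5, "notes": "x" * 50}
        for i in range(1000)]
total_row = sum(r["amount"] for r in rows)

# Columnar layout: values of one column sit contiguously, so a scan of
# 'amount' touches only the bytes it needs -- the basis of BLU-style
# columnar tables and columnstore indexes.
columns = {"id":     [r["id"] for r in rows],
           "amount": [r["amount"] for r in rows]}
total_col = sum(columns["amount"])

assert total_row == total_col  # same answer, very different I/O profile
print(total_col)
```

The answer is identical either way; what changes is how much data the engine must read to produce it, which is where the order-of-magnitude analytics gains come from.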

On the OLTP side of the house, I have not seen any real evidence that in-memory optimizations have resulted in tremendous performance acceleration. The major difference-maker, which is nothing new, is turning off logging so you avoid touching disk altogether, but this is not applicable to all application uses. While grandiose claims of XXXX-times speedup for OLTP are common from IMDBMS vendors, substantiated results are harder to find. I would refer you to a very funny old article on how these claims or “benchmarks” are often put together – MySQL is bazillion times faster than MemSQL, a response to a MemSQL “result”. SQL Server In-Memory OLTP, an offering by a traditional vendor, has received some plaudits. But from what I have seen, the acceleration is not broadly applicable to an entire workload, is quite limited without the use of C stored procedures (not exactly an in-memory optimization, or even a new one), and perhaps has something to do with addressing SQL Server locking issues. I absolutely understand that there are optimizations that can be and have been done for more efficient in-memory processing for OLTP; I just don’t know that they have delivered a game-changing performance boost.