{source} <iframe vheight="3280px" height="400px" width="100%" src="/ValueofData/index.htm" ></iframe> {/source}


In case there is content you want from the text, here's the outline view for this training.

The Value of Big Data

Robin Basham, Director Integrated Audit, Ellie Mae, Inc. CISA, CGEIT, CRISC, M.Ed, M.IT, VRP, CRP, HISP

Prepared for ISACA SV and IMA Palo Alto

With reference to “Infonomics: The Practice of Information Economics,” Doug Laney, and “Big data for the Masses: The Unique Challenge of Big Data Integration,” a Talend White Paper

The characteristics of Big Data – It’s just data
◦Limits and Benefits in use
◦Why we use Big Data
◦How we use Big Data
—Structured v. Unstructured Data
◦Web 3.0
◦So, is it big or just more BI?
—Overview of new technologies
◦Hiring these skills and creating these skills
◦Simply, what they do, how they fit into any solution
◦Complexity and interpretation risk, you get what you pay for
—Is Social Data on the Balance Sheet?
◦Risks in using social data
◦Problems caused by investing in social data
◦Gartner guidance to question data on the balance sheet
As Director, Integrated Audit at Ellie Mae, Robin Basham is accountable for creating and using a GRC program and for conducting SOX, SOC, ISMS, and various program-specific audits, including an FDIC examination.

The creator of Facilitated Compliance Management Software (4Point GRC), founder of EnterpriseGRC Solutions and Phoenix Business and Systems Process, Inc., ISACA SV Conference Director, an ITPreneurs partner, and board advisor for Holistic Information Security Practitioners, she provides Cloud Security & Virtualization Controls Management training in the San Francisco Bay Area. She’s known for successful GRC implementations, supplying overall design, development, and training to companies ranging from start-ups to the Fortune 500. She is a past president of the Association of Certified Green Technology Auditors (ACGTA), a frequent committee contributor to the ISACA Silicon Valley Chapter, liaison to the ITSMF SV chapter, and a frequent participant in the Cloud Security Alliance local chapter. EnterpriseGRC Solutions was recently added to the Cloud Credential Council and named to the certification committee of The Holistic Information Security Practitioner Institute (HISPI). EnterpriseGRC Solutions® is an active sponsor of the Information Systems Audit and Control Association, ISACA®, listed as a corporate sponsor and many-time CobiT® trainer for the ITGI. Visit http://enterprisegrc.com
Robin Basham, M.ED, M.IT, CISSP, CISA, CGEIT, CRISC, ACC, CRP, VRP, and HISP, Director Integrated Audit, Ellie Mae, Inc., CEO founder EnterpriseGRC Solutions Inc.®
What is Big Data?
When data sets became so large and complex that they could no longer be managed using on-hand database management tools, we saw an emergence of Big data technologies.
This new generation of technologies and architectures was designed to extract economic value from datasets by enabling high-velocity capture, discovery, and analysis.
As a result of their invention, we now experience an entirely new information economy: “Infonomics: The Practice of Information Economics.”
This presentation will highlight ways that Big Data, inadequately monitored and largely unregulated, puts our business strategy and Bay Area economy at Big Risk.
What is the Value of our Big Data?
Facebook “likes” and Twitter “tweets” are reported to represent $14 per “share” and $5 per “tweet”. *
Either a company will report an increase in revenue, or a company will pay for that tracked human behavior.
Either we prove that a human committed that behavior, we prove that the activity had a sales result, or we stop accepting false claims.
We need to do at least one.
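The test above can be made concrete with a little arithmetic. The following is a minimal sketch, not a valuation method from the presentation: only the $14 per share and $5 per tweet figures come from the slide, while the campaign counts and the `verified_fraction` parameter (the share of activity proven to come from real people) are hypothetical.

```python
# Reported per-action values quoted on the slide.
VALUE_PER_SHARE = 14.00  # dollars per Facebook "share"
VALUE_PER_TWEET = 5.00   # dollars per Twitter "tweet"


def claimed_value(shares: int, tweets: int) -> float:
    """Dollar value implied by the reported per-action rates."""
    return shares * VALUE_PER_SHARE + tweets * VALUE_PER_TWEET


def defensible_value(shares: int, tweets: int, verified_fraction: float) -> float:
    """Claimed value scaled by the fraction of activity proven to be human.

    verified_fraction is a hypothetical input, not a figure from the slide.
    """
    if not 0.0 <= verified_fraction <= 1.0:
        raise ValueError("verified_fraction must be between 0 and 1")
    return claimed_value(shares, tweets) * verified_fraction


# Hypothetical campaign numbers, for illustration only.
shares, tweets = 1_000, 2_000
print(claimed_value(shares, tweets))           # 24000.0
print(defensible_value(shares, tweets, 0.25))  # 6000.0
```

The gap between the two numbers is the part of the claim that nobody has yet tied to a human or to a sale.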
How We Use Big Data
Marketing Campaign Analysis
Recommendation Engine
Customer Retention and Churn Analysis
Social Graph Analysis
Capital Markets Analysis
Predictive Analytics
Risk Management
Rogue Trading
Fraud Detection
Retail Banking
Network Monitoring
Research And Development
Archiving
Please read more at the source: http://info.talend.com
How Should We Use Big Data?
While Business Intelligence has had some time to mature,  Big “social” data projects are new to the requirements of governance.
It appears we may not be equipped for rapidly evolving changes to enterprise management and data governance.
Limited Big Data Resources
Poor Data Quality = Big Risks
Project Governance not Fully Understood
What is a user worth? (Valuation)
What is a good user? (Validity)
What is a real user vs. a fake user? (Accuracy, Fraud)
Open Container V. Closed Container

The concept of tampering is critical to valid data sources.

  • Science, Manufacturing, Forensics, and Law all consider the source of information and the capacity that others would have had to alter it.
  • Big data has moved the emphasis of business intelligence from closed to open containers. 
  • How should this affect our willingness and decisions to use that body of data? 
  • Which companies will analyze and compile results and scores based on that information?
  • What will be our grounds to select those vendors and how will that selection be controlled by contract and SLA?
  • What is the Basis to Trust Those Who Measure Social Data?


Why Would We Use Big Data?
Marketing Campaign Analysis: identifying a target audience, the “right” person for the “right” products.
Big Data allows marketing teams to evaluate large volumes from new data sources, like click-stream data and call detail records, to increase the accuracy of the analysis.
Web 3.0:  With over a billion users, today's Internet is arguably the most successful human artifact ever created

This 6-minute video outlines the basic themes of the European Union's Future Internet initiative. These include: an Internet of Services, where services are ubiquitous; an Internet of Things where in principle every physical object becomes an online addressable resource; a Mobile Internet where 24/7 seamless connectivity over multiple devices is the norm; and the need for semantics in order to meet the challenges presented by the dramatic increase in the scale of content and users.

Isn’t Big Data Just More Data?
  • Which attributes would alert us to the use of Big Data? Aren’t we just using OLTP (On-Line Transactional Processing) and OLAP (On-Line Analytical Processing)? 
  • Big Data describes large volumes of a wide variety of data collected from various sources across the enterprise, including:
  • Transactional data from enterprise applications and databases
  • Social media data
  • Mobile device data
  • Unstructured data/documents
  • Machine-generated data
Structured V. Unstructured – Keeping Up
  • We, as auditors and business advisors, need to gain comfort with these new technologies, understanding their benefits and risks, adding capabilities to our workforce, and establishing ground rules for both application and project governance.
  • Traditionally, “Structured” Data is getting faster and bigger.  Machines can’t keep up
  • Traditionally “Unstructured” Data, such as news and research, leads the technologies that would allow its management and distribution.
What are the New Technologies?
Reference Slide: Hadoop
  • Hadoop was built to address the challenge of indexing the entire World Wide Web every
  • 2004 - Google developed a paradigm called MapReduce
  • 2005 - Yahoo! started Hadoop as an implementation of MapReduce
  • 2007 - Hadoop became an open source project
  • Hadoop has the basic constructs needed to perform computing:
  • It has a file system, a language to write programs, a way of managing the distribution of those programs over a distributed cluster, and a way of accepting the results of those programs. Ultimately the goal is to create a single result set.
  • With Hadoop, big data is distributed into pieces that are spread over a series of nodes running on commodity hardware.
Reference Slide: Pig
  • The Apache Pig project is a high-level data-flow programming language and execution framework for creating MapReduce programs used with Hadoop.
  • The abstract language for this platform is called Pig Latin and it abstracts the programming into a notation, which makes MapReduce programming similar to that of SQL for RDBMS systems.
  • Pig Latin is extended using UDF (User Defined Functions), which the user can write in Java and then call directly from the language.
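To show what "data flow" means here, the sketch below walks a small data set through filter, group, and aggregate steps, with one step delegated to a user-defined function. It is plain Python, not Pig Latin, and the UDF stand-in is written in Python rather than Java; the records, the threshold, and all names are hypothetical.

```python
from collections import defaultdict

# Hypothetical records: (customer, product, amount in dollars).
records = [
    ("alice", "loan", 250.0),
    ("bob", "card", 40.0),
    ("alice", "card", 60.0),
    ("carol", "loan", 5.0),
]


def is_material(amount: float) -> bool:
    """Stand-in for a user-defined function (UDF) called from the flow."""
    return amount >= 10.0


# Step 1: filter rows with the UDF.
filtered = [r for r in records if is_material(r[2])]

# Step 2: group the remaining rows by customer.
grouped = defaultdict(list)
for customer, product, amount in filtered:
    grouped[customer].append(amount)

# Step 3: aggregate each group into one total per customer.
totals = {customer: sum(amounts) for customer, amounts in grouped.items()}

print(totals)  # {'alice': 310.0, 'bob': 40.0}
```

Each step names its output and feeds the next one, which is the property that makes this style read more like a query than like hand-written MapReduce code.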

Reference Slide: Job Tracker

  • A Job Tracker is the entry point for a “map job” or process to be applied to the data. A map job is typically a query written in Java and is the first step in the MapReduce process.
  • The Job Tracker asks the name node to identify and locate the necessary data to complete the job. Once it has this information it submits the query to the relevant named nodes.
  • Any required processing of the data occurs within each named node, which provides the massively parallel characteristic of Map Reduce.  When each node has finished processing, it stores the results. The client then initiates a "Reduce" job.
  • The results are then aggregated to determine the “answer” to the original query. The client then accesses these results on the filesystem and can use them for whatever purpose.
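The map-then-reduce sequence described above can be imitated on one machine. This is a minimal sketch in plain Python, assuming three hypothetical "nodes" that each hold one block of text; it uses no Hadoop API and only illustrates the shape of the computation (local processing per node, then aggregation into a single result set).

```python
from collections import Counter
from functools import reduce

# Hypothetical data blocks, standing in for file pieces stored on three nodes.
node_blocks = [
    ["big data is just data"],
    ["data quality is big risk"],
    ["big data big risk"],
]


def map_job(lines):
    """Map step: runs independently on one node and counts words locally."""
    counts = Counter()
    for line in lines:
        counts.update(line.split())
    return counts


def reduce_job(left, right):
    """Reduce step: merges two partial results into one."""
    return left + right


# Each node processes only its own block and stores a partial result.
partial_results = [map_job(block) for block in node_blocks]

# The reduce job aggregates the partial results into a single result set.
answer = reduce(reduce_job, partial_results, Counter())

print(answer["big"], answer["data"], answer["risk"])  # 4 4 2
```

Because no map job needs data held by another node, the map step can run on every node at once; only the reduce step has to see all the partial results.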

Reference Slide: Hive and HiveQL

  • Apache Hive is a data warehouse infrastructure built on top of Hadoop (originally by Facebook) for providing data summarization, ad-hoc query, and analysis of large datasets.
  • Hive provides a mechanism to project structure onto this data and query the data using an SQL-like language called HiveQL.
  • HiveQL is used with business intelligence and visualization tools.
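The idea of projecting structure onto raw data and then querying it with SQL-like statements can be sketched with Python's built-in `sqlite3` module. SQLite stands in for Hive here purely for illustration: the query is ordinary SQL rather than HiveQL, and the click-stream lines, table name, and column names are hypothetical.

```python
import sqlite3

# Hypothetical raw click-stream lines: visitor, page, seconds on page.
raw_lines = [
    "alice,/pricing,30",
    "bob,/pricing,12",
    "alice,/signup,45",
]

conn = sqlite3.connect(":memory:")

# Projecting structure: name the columns and give them types.
conn.execute("CREATE TABLE clicks (visitor TEXT, page TEXT, seconds INTEGER)")
rows = []
for line in raw_lines:
    visitor, page, seconds = line.split(",")
    rows.append((visitor, page, int(seconds)))
conn.executemany("INSERT INTO clicks VALUES (?, ?, ?)", rows)

# An ad-hoc summarization query over the structured view.
summary = conn.execute(
    "SELECT page, COUNT(*), SUM(seconds) FROM clicks "
    "GROUP BY page ORDER BY page"
).fetchall()

print(summary)  # [('/pricing', 2, 42), ('/signup', 1, 45)]
conn.close()
```

The raw lines never change; the structure is a view laid over them, which is why the same data can later be queried under a different schema.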

Reference Slide: HBase, HCatalog

  • HBase is a non-relational database that runs on top of the Hadoop file system (HDFS). It is columnar and provides fault-tolerant storage and quick access to large quantities of sparse data. It also adds transactional capabilities to Hadoop, allowing users to conduct updates, inserts, and deletes. Facebook adopted it to serve its messaging systems, and it is used heavily by eBay as well.
  • HCatalog is a table and storage management service for data created using Apache Hadoop. It provides interoperability across data processing tools such as Pig, Map Reduce, Streaming, and Hive, along with a shared schema and data type mechanism.

Reference Slide: Flume, Oozie

  • Flume - is a system of agents that populate a Hadoop cluster. These agents are deployed across an IT infrastructure and collect data and integrate it back into Hadoop.
  • Oozie - coordinates jobs written in multiple languages such as Map Reduce, Pig and Hive. It is a workflow system that links these jobs and allows specification of order and dependencies between them.
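Specifying order and dependencies between jobs amounts to describing a dependency graph and running it in a valid order. The sketch below shows that idea with Python's standard `graphlib` module; it does not use Oozie's own workflow definition format, and the job names are hypothetical.

```python
from graphlib import TopologicalSorter

# Hypothetical workflow: each job maps to the set of jobs it depends on.
workflow = {
    "ingest_logs": set(),
    "clean_with_pig": {"ingest_logs"},
    "summarize_with_hive": {"clean_with_pig"},
    "score_with_mapreduce": {"clean_with_pig"},
    "publish_report": {"summarize_with_hive", "score_with_mapreduce"},
}

# static_order() yields every job only after all of its dependencies.
run_order = list(TopologicalSorter(workflow).static_order())

for job in run_order:
    print("run", job)
```

The two middle jobs depend only on the cleaning step, so a scheduler is free to run them in either order or side by side; the report always comes last.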

Reference Slide: Mahout, Sqoop

  • Mahout - is a data mining library that implements popular algorithms for clustering and statistical modeling in MapReduce.
  • Sqoop - is a set of data integration tools that allow Hadoop to exchange data with traditional relational databases and data warehouses.

Reference Slide: NoSQL

  • NoSQL (Not only SQL) - refers to a large class of data storage mechanisms that differ significantly from the well-known, traditional relational data stores (RDBMS). These technologies implement their own query language and are typically built on advanced programming structures for key/value relationships, defined objects, tabular methods or tuples.
  • NoSQL as a term is used to describe the wide range of data stores classified as big data. Some of the major flavors adopted within the big data world today include:
  • Cassandra
  • MongoDB
  • NuoDB
  • Couchbase
  • VoltDB

True or False: Social Data is a part of our Corporate Assets? “The Value of our Data”

  • Where is Data? Who’s using it?
  • What is Data Quality? What is Data Integrity?
  • How do we Enforce Data Retention?
  • What Is our Data Liability?
  • Can we spot the difference between an Illusion of Influence and Actual Influence?
  • What was my data cost in researching this or any topic?


Cloud will create 14 Million Jobs by 2014
Understanding Big Data Risk is Complex
Where is the data?
Can we trust the data?
(False negatives eventually self-correct, but do we have the time?)
Tracking, Liable, Exploit is Complex
Something on this page is “blocked”.
I am on the fence about buying that very dresser.  I wonder if McAfee caught that I was being tracked by an unsafe source? For this exercise, I elect to unblock.
Perhaps Data Governance Should Consider Source – Or Not use the Word “Governance”
Should the information we commingle with news differ from the information used for advertising?
What Is Our Reputational Risk?
  • How Will The Market Feel When they See What We Paid and What We Made?
  • How will we differentiate the use of big data, as opposed to big distraction?
  • Test if Functionality is Limited by Restricting Cookies – Know the Risk to Reader
  • This test shows that we can examine rank without being tracked
  • Can We Trust the Media to Recommend a Product?
  • If results favor a company’s investments, isn’t this a step along the path to fraud? This is not a dig on CNET. It’s a question about doing what everyone else is doing.

For example, the New York Times and the Wall Street Journal have set privacy and governance policies that would restrict this same behavior.

GPS Act – Example Law

The GPS Act, short for the Geolocation Privacy and Surveillance Act, is a bill co-sponsored by Senator Ron Wyden (D-OR) and Rep. Jason Chaffetz (R-UT) and introduced to the Senate and House in June 2011.

The bill would impose tighter restrictions on how and in what instances law enforcement agencies could legally obtain cell user location information, requiring a warrant in all cases except a few narrowly defined emergency situations, such as when an officer “reasonably determines” that there is risk of “immediate danger of death or serious physical injury to any person,” or “conspiratorial activities” relating to national security or “characteristic of organized crime.”

The bill has been read twice and is stalled in committee.

The GPS Act Supports Legitimate Investigations and Protects Privacy

Facebook Resolves User Right of Publicity Claims Concerning Sponsored Stories Advertising

Facebook recently settled a class-action lawsuit stemming from Facebook’s alleged unauthorized use of users’ photographs in ‘sponsored stories’ advertisements on its site. The class action plaintiffs alleged that Facebook’s use of their images in “Sponsored Stories” advertisements violated the plaintiffs’ rights under the California right of publicity statute, which reserves to the individual the right to control their image for commercial purposes. Under the terms of the settlement, Facebook agreed to pay a total of $20 million, with half of the settlement funds donated to charities and law schools, and the other half going to plaintiffs’ attorneys. Other than the three class representatives, no Facebook users will receive any funds from the settlement. 

Facebook users had been serving as unwitting brand promoters on the site, appearing without their permission or knowledge in promotional ‘stories’ featuring advertised products and services. Merely ‘liking’ a company or brand functioned as an effective opt-in that allowed Facebook to use the user’s image in that company or brand’s advertising on the site. The only means of withdrawing from the promotional use of one’s image was to ‘unlike’ the brand, which, prior to this settlement, was not an easy feat.  […]

 Published In: Civil Remedies Updates, Communications & Media Law Updates, Personal Injury Updates, Privacy Updates  © Kilpatrick Townsend 2012 | Attorney Advertising

When We Use Social Data for A Business Decision, Are We Protected Under CDA 230?
Section 230 of Title 47 of the United States Code (47 U.S.C. 230) was passed as part of the much-maligned Communications Decency Act of 1996. Many aspects of the CDA were unconstitutional restrictions on freedom of speech, but this section survived and has been a valuable defense for Internet intermediaries ever since. “By its plain language, 230 creates a federal immunity to any cause of action that would make service providers liable for information originating with a third-party user of the service.” Zeran v. America Online, Inc., 129 F.3d 327, 330 (4th Cir. 1997), cert. denied, 524 U.S. 937 (1998).
EFF maintains an archive of CDA cases: http://www.eff.org/legal/ISP_liability/CDA230/

Defamation: CDA Cases – consider Yelp and other reputation data

How Might Social Data Open a Company to Hate?
Is this a crowdsourcing benefit or a liability?
Risks in Life Logging - ENISA
R1 – Breach of privacy
R2 – Inappropriate secondary use of data
R3 – Malicious attacks on smart devices increase as their value to authenticate individuals and store personal data increases
R4 – Compliance with and enforcement of data protection legislation made more difficult
R5 – Discrimination and exclusion
R6 – Monitoring, cyber-stalking, child grooming and “friendly” surveillance
R7 – Unanticipated changes in citizens’ behavior and creation of an “obedient” citizen
R8 – Poor decision making/inability to make decisions
R9 – Psychological harm
R10 – Physical theft of property or private information from home environment
R11 – Reduction of choices available to individuals as consumers and user lock-in
R12 – Decrease of productivity

To log or not to log? - Risks and benefits of emerging life-logging applications

Security and Legal Aspects Issues Affecting Privacy
Should We Show Off our Connections?
April 28th, 2012 “Look at Me, me, me”
Who Are We Hurting When We Give Information Away?
Why don’t we regulate companies that exceed a million-user threshold?
Why Are We So Willing to Give Away our Data?
After ten minutes of viewing Collusion, I decide to block all tracking sites.
What Happens When We Turn Off Tracking?
What Do We Lose When We Turn Off Tracking? If a follower is worth $118, I lost $35K (I lost nothing)
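The arithmetic behind that claim is worth spelling out. A minimal sketch, using only the two figures on the slide ($118 per follower and a $35K loss); the variable names are illustrative.

```python
VALUE_PER_FOLLOWER = 118  # dollars, the figure quoted on the slide
CLAIMED_LOSS = 35_000     # dollars, the "$35K" from the slide

# Number of followers the claimed loss implies at the quoted rate.
implied_followers = CLAIMED_LOSS / VALUE_PER_FOLLOWER

print(round(implied_followers))  # 297
```

Roughly 297 followers account for the whole paper loss, while the realized loss, as the slide notes, is nothing.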
Can We Act On the Advice of Others?
What are the implications of driving business decisions that involve access or use of the web?
If followers are assets, then isn’t the discouraging of traffic a form of “asset theft”?
Who Are the Contributors to the Information We Trust? Can they have a Paid Agenda?
Should we better qualify contributors in our big data sets?
Bad Rating V. Buying a High Score
Affiliate Programs SHOULD NOT be represented as Community Rating
© Gartner, Doug Laney on Value of Data
Infonomics:  The Practice of Information Economics
The Value of Information
Why Put a Value on Information?
Information is a Unique Asset
Where Are Information Assets on the Balance Sheet?
Reasons to Acknowledge and Account for Information as an Asset
Measuring the Value of Your Information
Understanding Your True Information ROI
Securing Your Information
Influencing Your Corporate Valuation
Assessing Contractual Risks
Borrowing Against Information
Bartering With Information
Selling Information

Doug Laney is research vice president at Gartner

What is the Basis for Posting Value?
The sample shows a small business site, valued at one MILLIONTH of the other companies, that topped placement in a financially valued list. The company in first place has made no revenue as a result of that placement.
Unless We Enforce Standards in User Engagement Reporting We May Destroy Our Information Economy
What Are The Incentives to Lie?
How do we define Propaganda?
What constitutes illegal protest?
We Have to Define Free Speech and our Rights to Influence Others
Free speech is not for pay
Free speech is not automated in a batch of hundreds or thousands of communicated responses
We have to safeguard our rights to be counted
Instead of making it illegal to track me, perhaps it should be illegal to MISTRACK me.
Wash Your Hands Before You Eat
What are some of the things we should clean or delete?
Competitors from your first connections
Private Phone messages stored for playback on the internet
Facebook applications (unless the product is well understood and serves specific scope and function)
Anonymous identities (oxymoron?)
Connections representing bias or undue influence
Hateful comments, explicit content, personal content
Anything that stores your password and shares your identity with other applications
Yeah, me too. I use social media to track the security flaws in social media