In case there is content you want from the text, here's the outline view for this training.
The Value of Big Data
Robin Basham, Director Integrated Audit, Ellie Mae, Inc. CISA, CGEIT, CRISC, M.Ed, M.IT, VRP, CRP, HISP
Prepared for ISACA SV and IMA Palo Alto
With reference to “Infonomics: The Practice of Information Economics” by Doug Laney and “Big Data for the Masses: The Unique Challenge of Big Data Integration,” a Talend white paper
Robin Basham is the creator of Facilitated Compliance Management Software (4Point GRC) and founder of EnterpriseGRC Solutions and Phoenix Business and Systems Process, Inc.
An ISACA SV Conference Director, ITPreneurs partner, and board advisor for Holistic Information Security Practitioners, she provides Cloud Security & Virtualization Controls Management training in the San Francisco Bay Area. She’s known for successful GRC implementations, supplying overall design, development, and training to companies ranging from start-ups to the Fortune 500. She is a past president of the Association of Certified Green Technology Auditors (ACGTA), a frequent committee contributor to the ISACA Silicon Valley Chapter, liaison to the ITSMF SV chapter, and a frequent participant in the Cloud Security Alliance local chapter. EnterpriseGRC Solutions was recently added to the Cloud Credential Council and is named to the certification committee of The Holistic Information Security Practitioner Institute (HISPI). EnterpriseGRC Solutions® is an active sponsor of the Information Systems Audit and Control Association (ISACA®), listed as a corporate sponsor and many-time CobiT® trainer for the ITGI. Visit http://enterprisegrc.com
The concept of tampering is critical to valid data sources.
- Science, manufacturing, forensics, and law all consider the source of information and the capacity that others would have had to alter it.
- Big data has moved the emphasis of business intelligence from closed to open containers.
- How should this affect our willingness and decisions to use that body of data?
- Which companies will analyze and compile results and scores based on that information?
- What will be our grounds to select those vendors and how will that selection be controlled by contract and SLA?
- What is the Basis to Trust Those Who Measure Social Data?
This 6-minute video outlines the basic themes of the European Union's Future Internet initiative. These include: an Internet of Services, where services are ubiquitous; an Internet of Things, where in principle every physical object becomes an online addressable resource; a Mobile Internet, where 24/7 seamless connectivity over multiple devices is the norm; and the need for semantics in order to meet the challenges presented by the dramatic increase in the scale of content and users.
- Which attributes would alert us to the use of Big Data? Aren’t we just using OLTP (On-Line Transactional Processing) and OLAP (On-Line Analytical Processing)?
- Big Data describes large volumes of a wide variety of data collected from various sources across the enterprise, including:
- Transactional data from enterprise applications
- Social media data
- Mobile device data
- Unstructured data/documents
- Machine-generated data
- We, as auditors and business advisors, need to gain comfort with these new technologies, understanding their benefits and risks, adding capabilities to our workforce, and establishing ground rules for both application and project governance.
- Traditionally “structured” data is getting faster and bigger; machines can’t keep up.
- Traditionally “unstructured” data, such as news and research, drives the technologies that allow its management and distribution.
- Hadoop was built to address the challenge of indexing the entire World Wide Web every
- 2004 - Google developed a paradigm called MapReduce
- 2005 - Yahoo! started Hadoop as an implementation of MapReduce, 2007 - open source project
- Hadoop has the basic constructs needed to perform computing:
- It has a file system, a language to write programs, a way of managing the distribution of those programs over a distributed cluster, and a way of accepting the results of those programs. Ultimately the goal is to create a single result set.
With Hadoop, big data is distributed into pieces that are spread over a series of nodes running on commodity hardware.
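As a rough illustration of data being "distributed into pieces" over nodes, here is a minimal Python sketch. It is a simplification under stated assumptions: the node names, the tiny block size, and the round-robin placement rule are all hypothetical, and real Hadoop placement is considerably more involved.

```python
def split_into_blocks(data: bytes, block_size: int) -> list[bytes]:
    """Cut one large payload into fixed-size blocks (the last may be shorter)."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


def place_blocks(blocks, nodes, replicas=2):
    """Assign each block to `replicas` distinct nodes, round-robin.

    Hypothetical placement rule, used only to show the idea of spreading
    copies of each piece over several machines.
    """
    if replicas > len(nodes):
        raise ValueError("need at least as many nodes as replicas")
    placement = {node: [] for node in nodes}
    for block_id in range(len(blocks)):
        for r in range(replicas):
            node = nodes[(block_id + r) % len(nodes)]
            placement[node].append(block_id)
    return placement


data = b"x" * 1000                      # stand-in for a large file
blocks = split_into_blocks(data, block_size=256)
placement = place_blocks(blocks, ["node-a", "node-b", "node-c"], replicas=2)

for node, block_ids in placement.items():
    print(node, "holds blocks", block_ids)
```

Because every block lives on two nodes in this sketch, losing one node does not lose any piece of the data, which is the point of running on commodity hardware.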
Reference Slide: Pig
- The abstract language for this platform is called Pig Latin and it abstracts the programming into a notation, which makes MapReduce programming similar to that of SQL for RDBMS systems.
- Pig Latin is extended using UDF (User Defined Functions), which the user can write in Java and then call directly from the language.
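The kind of data flow a Pig Latin script expresses (filter, transform with a user-defined function, group, aggregate) can be sketched in plain Python. This is not Pig Latin syntax and does not run on Hadoop; the records, field names, and the `normalize_url` function are hypothetical and exist only to show where a UDF fits in the flow.

```python
from collections import defaultdict

# Hypothetical input: (user, url, seconds_on_page) records.
records = [
    ("ann", "http://example.com/Home", 12),
    ("bob", "http://example.com/home", 3),
    ("ann", "http://example.com/pricing", 40),
    ("cy", "http://example.com/Pricing", 25),
]


def normalize_url(url: str) -> str:
    """Stand-in for a UDF: custom logic the built-in operators lack."""
    return url.lower().rstrip("/")


# Filter step: keep visits longer than 5 seconds.
long_visits = [r for r in records if r[2] > 5]

# Transform step: apply the user-defined function to one field.
cleaned = [(user, normalize_url(url), secs) for user, url, secs in long_visits]

# Group step: collect the visit durations per URL.
groups = defaultdict(list)
for user, url, secs in cleaned:
    groups[url].append(secs)

# Aggregate step: (number of visits, total seconds) per URL.
summary = {url: (len(secs), sum(secs)) for url, secs in groups.items()}
print(summary)
```

The analyst describes the steps at this level; the platform is what turns them into distributed jobs.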
Reference Slide: Job Tracker
- A Job Tracker is the entry point for a “map job” or process to be applied to the data. A map job is typically a query written in Java and is the first step in the MapReduce process. The Job Tracker asks the name node to identify and locate the data needed to complete the job. Once it has this information, it submits the query to the relevant data nodes.
- Any required processing of the data occurs within each data node, which gives MapReduce its massively parallel character. When each node has finished processing, it stores the results. The client then initiates a "Reduce" job.
- The results are then aggregated to determine the “answer” to the original query. The client then accesses these results on the filesystem and can use them for whatever purpose.
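The map-then-reduce flow described on this slide can be sketched in a few lines of Python. This is a single-process simulation, not Hadoop code: the node names and their data are hypothetical, and the classic word-count task is chosen only because it is small enough to follow by eye.

```python
from collections import defaultdict

# Hypothetical cluster: each node holds its own slice of the data.
node_data = {
    "node-a": ["big data big value"],
    "node-b": ["data quality", "big risk"],
}


def map_phase(lines):
    """Runs where the data lives: emit (word, 1) for every word held."""
    return [(word, 1) for line in lines for word in line.split()]


def reduce_phase(pairs):
    """Aggregate the stored per-node results into a single result set."""
    totals = defaultdict(int)
    for word, count in pairs:
        totals[word] += count
    return dict(totals)


# Map step: the same function is applied on every node that holds data,
# and each node keeps its own intermediate results.
intermediate = {node: map_phase(lines) for node, lines in node_data.items()}

# Reduce step: gather every node's stored output and combine it.
all_pairs = [pair for pairs in intermediate.values() for pair in pairs]
word_counts = reduce_phase(all_pairs)
print(word_counts)
```

In a real cluster the map calls run in parallel on separate machines; here they run one after another, but the division of work is the same.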
Reference Slide: Hive and HiveQL
- Apache Hive is a data warehouse infrastructure built on top of Hadoop (originally by Facebook) for providing data summarization, ad-hoc query, and analysis of large datasets.
- Hive provides a mechanism to project structure onto this data and query the data using an SQL-like language called HiveQL.
- HiveQL queries can serve as a data source for business intelligence and visualization tools.
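To show what "projecting structure onto data and querying it with an SQL-like language" means in practice, here is a small Python sketch. It uses the standard-library `sqlite3` module as a stand-in: SQLite is not Hive, and the SQL shown is ordinary SQL rather than HiveQL. The table name, columns, and sample lines are hypothetical.

```python
import sqlite3

# Raw, tab-delimited lines as they might sit in a file: no schema attached.
raw_lines = [
    "2013-01-05\tmobile\t120",
    "2013-01-05\tweb\t300",
    "2013-01-06\tmobile\t80",
]

# Project a structure (day, channel, views) onto the raw text.
rows = []
for line in raw_lines:
    day, channel, views = line.split("\t")
    rows.append((day, channel, int(views)))

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE page_views (day TEXT, channel TEXT, views INTEGER)")
conn.executemany("INSERT INTO page_views VALUES (?, ?, ?)", rows)

# An ad-hoc summarization query of the kind an analyst would write.
query = """
    SELECT channel, SUM(views) AS total_views
    FROM page_views
    GROUP BY channel
    ORDER BY channel
"""
totals = conn.execute(query).fetchall()
conn.close()
print(totals)
```

The audit-relevant point is that the analyst works in a familiar query language while the underlying files stay in their raw form.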
Reference Slide: HBase, HCatalog
- HBase is a non-relational database that runs on top of the Hadoop file system (HDFS). It is columnar and provides fault-tolerant storage and quick access to large quantities of sparse data. It also adds transactional capabilities to Hadoop, allowing users to conduct updates, inserts, and deletes. Facebook adopted it to serve its messaging systems, and it is used heavily by eBay as well.
- HCatalog is a table and storage management service for data created using Apache Hadoop. It provides interoperability across data processing tools such as Pig, MapReduce, Streaming, and Hive, along with a shared schema and data type mechanism.
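The idea of sparse, column-oriented storage with inserts, updates, and deletes, as described for HBase above, can be pictured with a toy Python class. This is an in-memory illustration only and not the HBase API: the class name, method names, and the "family:qualifier" column labels are assumptions made for the sketch.

```python
from collections import defaultdict


class SparseTable:
    """Toy sparse store: row key -> {"family:qualifier": value}.

    Only cells that were actually written take up space, which is the
    property that makes sparse data cheap to hold.
    """

    def __init__(self):
        self._rows = defaultdict(dict)

    def put(self, row_key, column, value):
        """Insert a new cell or overwrite (update) an existing one."""
        self._rows[row_key][column] = value

    def get(self, row_key, column, default=None):
        """Read one cell; absent cells return the default."""
        return self._rows.get(row_key, {}).get(column, default)

    def delete(self, row_key, column):
        """Remove one cell if it exists."""
        self._rows.get(row_key, {}).pop(column, None)

    def stored_cells(self):
        """Count the cells that are physically present."""
        return sum(len(cols) for cols in self._rows.values())


table = SparseTable()
table.put("user1", "profile:name", "Ann")      # insert
table.put("user1", "msg:last", "hello")        # insert
table.put("user2", "profile:name", "Bob")      # insert
table.put("user1", "msg:last", "goodbye")      # update
table.delete("user2", "profile:name")          # delete

print(table.get("user1", "msg:last"), table.stored_cells())
```

Rows need not share the same columns, so there is no cost for the many cells that were never written.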
Reference Slide: Flume, Oozie
- Flume - is a system of agents that populate a Hadoop cluster. These agents are deployed across an IT infrastructure and collect data and integrate it back into Hadoop.
- Oozie - coordinates jobs written in multiple languages such as Map Reduce, Pig and Hive. It is a workflow system that links these jobs and allows specification of order and dependencies between them.
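The "order and dependencies" idea behind a workflow coordinator can be sketched with Python's standard-library `graphlib` module (Python 3.9 or later). This is not Oozie and does not use its workflow definition format; the job names and their dependencies are hypothetical.

```python
from graphlib import TopologicalSorter

# Hypothetical workflow: each job maps to the set of jobs it depends on.
workflow = {
    "collect_logs": set(),
    "clean_with_pig": {"collect_logs"},
    "count_with_mapreduce": {"clean_with_pig"},
    "report_with_hive": {"clean_with_pig", "count_with_mapreduce"},
}


def run_order(jobs):
    """Return an order in which every job runs after its dependencies.

    Raises graphlib.CycleError if the dependencies are circular.
    """
    return list(TopologicalSorter(jobs).static_order())


order = run_order(workflow)
for step, job in enumerate(order, start=1):
    print(step, job)
```

For auditors, the dependency map itself is a useful control artifact: it documents which outputs each job relies on.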
Reference Slide: Mahout, Sqoop
- Mahout - is a data mining library that implements popular algorithms for clustering and statistical modeling in MapReduce.
- Sqoop - is a set of data integration tools that allow Hadoop to exchange data with non-Hadoop data stores such as traditional relational databases and data warehouses.
Reference Slide: NoSQL
- NoSQL (Not only SQL) - refers to a large class of data storage mechanisms that differ significantly from the well-known, traditional relational data stores (RDBMS). These technologies implement their own query language and are typically built on advanced programming structures for key/value relationships, defined objects, tabular methods or tuples.
- NoSQL as a term is used to describe the wide range of data stores classified as big data. Some of the major flavors adopted within the big data world today include
- Couchbase and
True or False: Social Data is a part of our Corporate Assets? “The Value of our Data”
- Where is Data? Who’s using it?
- What is Data Quality? What is Data Integrity?
- How do we Enforce Data Retention?
- What Is our Data Liability?
- Can we spot the difference between an Illusion of Influence and Actual Influence?
- What was my data cost in researching this or any topic?
- How Will The Market Feel When they See What We Paid and What We Made?
- How will we differentiate the use of big data, as opposed to big distraction?
- Test if Functionality is Limited by Restricting Cookies – Know the Risk to Reader
- This test shows that we can examine rank without being tracked
- Can We Trust the Media to Recommend a Product?
- If results favor a company’s investments isn’t this a step along the path to fraud? This is not a dig on CNET. It’s a question about doing what everyone else is doing.
For example, the New York Times and the Wall Street Journal have set privacy and governance policies that would restrict this same behavior.
The GPS Act, short for the Geolocation Privacy and Surveillance Act, is a bill co-sponsored by Senator Ron Wyden (D-OR) and Rep. Jason Chaffetz (R-UT) and introduced to the Senate and House in June 2011.
The bill would impose tighter restrictions on how and in what instances law enforcement agencies could legally obtain cell user location information, requiring a warrant in all cases except a few narrowly defined emergency situations, such as when an officer “reasonably determines” that there is risk of “immediate danger of death or serious physical injury to any person,” or “conspiratorial activities” relating to national security or “characteristic of organized crime.”
The bill has been read twice and is stalled in committee.
The GPS Act Supports Legitimate Investigations and Protects Privacy
Facebook recently settled a class-action lawsuit stemming from Facebook’s alleged unauthorized use of users’ photographs in ‘sponsored stories’ advertisements on its site. The class action plaintiffs alleged that Facebook’s use of their images in “Sponsored Stories” advertisements violated the plaintiffs’ rights under the California right of publicity statute, which reserves to the individual the right to control their image for commercial purposes. Under the terms of the settlement, Facebook agreed to pay a total of $20 million, with half of the settlement funds donated to charities and law schools, and the other half going to plaintiffs’ attorneys. Other than the three class representatives, no Facebook users will receive any funds from the settlement.
Facebook users had been serving as unwitting brand promoters on the site, appearing without their permission or knowledge in promotional ‘stories’ featuring advertised products and services. Merely ‘liking’ a company or brand functioned as an effective opt-in that allowed Facebook to use the user’s image in that company or brand’s advertising on the site. The only means of withdrawing from the promotional use of one’s image was to ‘unlike’ the brand, which, prior to this settlement, was not an easy feat. […]
Defamation: CDA Cases – consider Yelp and other reputation data
To log or not to log? - Risks and benefits of emerging life-logging applications
Infonomics: The Practice of Information Economics
Doug Laney is research vice president at Gartner