Our world is a potential treasure trove for data scientists and analysts who can comb through massive amounts of data for new insights, research breakthroughs, undetected fraud or other yet-to-be-discovered purposes. But it also presents a problem for traditional relational databases and analytics tools, which were not built to handle the data being created. Another challenge is the mixed sources and formats, which include XML, log files, objects, text, binary and more.
"We have a lot of data in structured databases, traditional relational databases now, but we have data coming in from so many sources that trying to categorize that, classify it and get it entered into a traditional database is beyond the scope of our capabilities," said Jack Collins, director of the Advanced Biomedical Computing Center at the Frederick National Laboratory for Cancer Research. "Computer technology is growing rapidly, but the number of [full-time equivalent positions] that we have to work with this is not growing. We have to find a different way."
You can't have a conversation about Big Data for very long without talking about the elephant: Hadoop.
Hadoop is an open source software platform managed by the Apache Software Foundation that's very helpful in storing and managing vast amounts of data cheaply and efficiently.
But what makes it special?
Hadoop is more than just a faster, cheaper database and analytics tool. In some cases, the Hadoop framework lets users query datasets in previously unimaginable ways.
Basically, it's a way of storing enormous data sets across distributed clusters of servers and then running "distributed" analysis applications in each cluster.
Here's how Apache describes it:
The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models. It is designed to scale up from single servers to thousands of machines, each offering local computation and storage. Rather than rely on hardware to deliver high-availability, the library itself is designed to detect and handle failures at the application layer, so delivering a highly available service on top of a cluster of computers, each of which may be prone to failures.
Introducing Apache Hadoop: The Modern Data Operating System
But what is BIG DATA?
Big data is a popular term used to describe the exponential growth, availability and use of information, both structured and unstructured. Much has been written on the big data trend and how it can serve as the basis for innovation, differentiation and growth.
In this video, Antony Wildey from Oracle Retail explains what Big Data is, and why effective management of data is vital to retailers in gaining actionable insight into how to improve their business. It includes how Oracle can help businesses to use data from social networking sites such as Facebook and Twitter, and use sentiment analysis seamlessly to provide insight on product demand.