Currently this resource draws some inspiration from Big Data. (The original goal was not only to create a comprehensive collection of individual pieces, but also to bring them together, indicate the dependencies between them, and group them.)
The current goal is to organize the pieces into divisions in which each piece has a discernible learning objective, the overall objective being that a learner understands how Big Data works.
First, this page describes the big picture of how the technologies used in the context of Big Data interlink and can be grouped. After that, the technologies are discussed in more detail.
Data is the raw, unfiltered bits that different programs produce, in contrast to information, which has been processed. Big Data refers to the volume, velocity and variety of this raw data. Volume and velocity are usually measured in gigabytes or terabytes per unit of time, such as hours or days. Variety refers to the forms the data takes: it can be structured, as in a database like Oracle or MySQL, or unstructured, like a log file. A fourth V, veracity, has also been proposed; it refers to the truth of the data, that is, ensuring the data is cleaned and accurate.
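To make the variety dimension concrete, here is a minimal sketch in Python. The field names and the log format are made up for illustration; it simply contrasts a structured record, as it might appear as a row in a relational table, with an unstructured log line that has to be parsed before it can be queried.

```python
import re

# A structured record: every field has a name and an obvious type,
# as it would in a relational table (e.g. MySQL or Oracle).
order = {"order_id": 1042, "customer": "alice", "amount_eur": 19.99}

# An unstructured record: a raw web-server log line. The structure is
# only implicit and has to be recovered by parsing before querying.
log_line = '203.0.113.7 - - [12/Mar/2024:10:15:32 +0000] "GET /index.html HTTP/1.1" 200 5123'

# Hypothetical parsing step that turns the log line into structured fields.
match = re.match(r'(\S+) \S+ \S+ \[(.*?)\] "(\S+) (\S+) \S+" (\d+) (\d+)', log_line)
if match:
    ip, timestamp, method, path, status, size = match.groups()
    print(ip, method, path, status, size)
```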
Traditional data processing practices focused on centralized batch processes that ran against data sets with bounded growth. With the explosive growth of the Internet and the decreasing cost of both storage and data collection, these centralized resources were quickly overwhelmed.
The response to this was to harness the power of parallel computing. Before Big Data, parallel processing was expensive and specialized, requiring custom programming and special hardware. Big Data approaches instead use ordinary commodity hardware to form a large cluster that carries out the processing.
In contrast to conventional processing, where all of the data is brought to the CPU, Big Data sends the program that needs to be executed to where the various bits of data reside. In simple terms, it takes a single file, splits it into pieces, and stores those pieces on various physical computers around the cluster. When it needs to process that data, it figures out which computers hold parts of the file and sends the program to those computers. Each computer then processes its piece of the file at the same time, giving full parallelism. Depending on what information is needed, there may be some post-processing before a result is produced.
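The following Python sketch illustrates the idea only; it is not Hadoop's actual mechanism. Worker processes stand in for machines in a cluster: the input is split into chunks, each chunk is counted in parallel as if it lived on a different machine, and the partial results are merged in a post-processing step (a simple word count).

```python
from multiprocessing import Pool
from collections import Counter

def count_words(chunk):
    """Runs 'where the data is': each worker counts words in its own chunk."""
    counts = Counter()
    for line in chunk:
        counts.update(line.split())
    return counts

def split_into_chunks(lines, num_chunks):
    """Stand-in for splitting a file into pieces stored across the cluster."""
    return [lines[i::num_chunks] for i in range(num_chunks)]

if __name__ == "__main__":
    lines = ["big data moves code to data",
             "data is split into pieces",
             "each piece is processed in parallel"]
    chunks = split_into_chunks(lines, num_chunks=3)

    # Each chunk is handled by its own worker process, in parallel.
    with Pool(processes=3) as pool:
        partial_counts = pool.map(count_words, chunks)

    # Post-processing: merge the partial results into the final answer.
    total = sum(partial_counts, Counter())
    print(total.most_common(3))
```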
It is important to recognize that this process has some overhead, and that conventionally sized data would actually be slower to process because of it. With larger volumes of data, however, the processing time quickly dwarfs the overhead, providing significant savings.
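A rough back-of-the-envelope calculation shows the break-even point. The numbers below are made up purely for illustration: with a fixed job overhead and a per-gigabyte processing cost, the parallel approach only pays off once the data volume is large enough.

```python
# Made-up numbers purely for illustration.
overhead_s = 30.0             # fixed cost of scheduling and distributing the job
single_rate_s_per_gb = 60.0   # one machine processes 60 s worth of work per GB
workers = 10                  # number of machines in the cluster

for gigabytes in (0.1, 1, 100):
    single = gigabytes * single_rate_s_per_gb
    parallel = overhead_s + gigabytes * single_rate_s_per_gb / workers
    print(f"{gigabytes:>6} GB: single machine = {single:8.1f} s, cluster = {parallel:8.1f} s")
```

With these assumptions the cluster is slower for 0.1 GB, roughly breaks even below 1 GB, and is far faster at 100 GB.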
These learning resources focus on the various mechanisms that are put together to build a Big Data processing system. The emphasis is on the open-source Hadoop system, and we will examine file storage, batch and interactive processing, data organization, and other topics.
File Storage
In Hadoop, file storage is the responsibility of the Hadoop Distributed File System (HDFS). To learn more about how it works, follow the HDFS link.
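As a preview, the following sketch uses plain Python (not the HDFS API) to estimate how a large file would be laid out under common HDFS defaults of 128 MB blocks and a replication factor of 3; the HDFS page covers the real mechanics.

```python
def hdfs_layout(file_size_bytes, block_size=128 * 1024**2, replication=3):
    """Blocks a file occupies and the raw storage it consumes (illustrative helper)."""
    num_blocks = -(-file_size_bytes // block_size)   # ceiling division
    raw_storage = file_size_bytes * replication      # every block is replicated
    return num_blocks, raw_storage

# A hypothetical 10 GB log file.
blocks, raw = hdfs_layout(10 * 1024**3)
print(blocks, "blocks,", raw / 1024**3, "GB of raw storage")
```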
At the moment the structure is fairly simple: everything that belongs to Big Data is currently in Technologies. Over time, articles should be clustered into better groups, e.g. theoretical algorithms and data structures, implementations, etc.
Articles should have the following structure:
The discussion pages of each article should be used for further comments; please feel free to use them!
If you want to add or extend an article, feel free to do so, but please try to follow the proposed structure.
The initial idea for this collection came from the Reading Club on big data by René Pickhardt.