Could you define Beauty?
So is Big Data: it resists a single definition, so instead you could ask what its characteristics are.
Big Data is usually described along n "V" dimensions, where n keeps changing. Laney (2001) suggested Volume, Variety, and Velocity as the original 3 Vs; IBM later added Veracity (trustworthiness of the data) as a fourth V, and Oracle introduced Value.
So how would we process this Big Data? I use Hadoop and wish to learn Spark.
Hadoop is an open-source framework for analyzing large volumes of data, and it is divided into two modules: a processing module (MapReduce) and a file-system module (HDFS).
Hadoop splits the data into small chunks, processes each chunk on its own, then combines the partial results again (the divide-and-conquer principle we used in merge sort). Each chunk needs a core and some memory to be processed.
As a start, I need to define the location of my data: where will it reside?
The data will reside on the Hadoop file system (HDFS): fs.defaultFS : hdfs://rserver:9000/
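In a typical installation this property goes into core-site.xml; a minimal sketch, reusing the rserver:9000 address from above:

```xml
<!-- core-site.xml (sketch): point all clients and daemons at the HDFS namenode -->
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://rserver:9000/</value>
  </property>
</configuration>
```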
Then I define my resources (number of processors, RAM size, disk capacity, etc.); the properties are listed below, and a sketch of the configuration files they belong to follows the list.
- Number of cores, yarn.nodemanager.resource.cpu-vcores : 20
- Total memory, yarn.nodemanager.resource.memory-mb : 40960
- Total memory for containers, yarn.scheduler.maximum-allocation-mb : 35840
- Container memory, yarn.scheduler.minimum-allocation-mb : 1024
- Map container memory, mapreduce.map.memory.mb : 2048
- Map heap size, mapreduce.map.java.opts : -Xmx1600m
- Reduce container memory, mapreduce.reduce.memory.mb : 2048
- Reduce heap size, mapreduce.reduce.java.opts : -Xmx1600m
- Location for namenode, dfs.name.dir : file:///hadoopinfra/namenode
- Location for datanode, dfs.data.dir : file:///hadoopinfra/datanode
- Tmp directory, hadoop.tmp.dir : /hadoopinfra/tmp
- File chunk (block) size, dfs.block.size : 536870912 (512 MB)
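These properties are spread over the standard Hadoop configuration files: roughly, the yarn.* ones go in yarn-site.xml, the mapreduce.* ones in mapred-site.xml, the dfs.* ones in hdfs-site.xml, and hadoop.tmp.dir in core-site.xml. A sketch of how I would lay them out, copying the values from the list above:

```xml
<!-- yarn-site.xml (sketch): node resources and container allocation limits -->
<configuration>
  <property><name>yarn.nodemanager.resource.cpu-vcores</name><value>20</value></property>
  <property><name>yarn.nodemanager.resource.memory-mb</name><value>40960</value></property>
  <property><name>yarn.scheduler.maximum-allocation-mb</name><value>35840</value></property>
  <property><name>yarn.scheduler.minimum-allocation-mb</name><value>1024</value></property>
</configuration>

<!-- mapred-site.xml (sketch): per-task container memory and JVM heap sizes -->
<configuration>
  <property><name>mapreduce.map.memory.mb</name><value>2048</value></property>
  <property><name>mapreduce.map.java.opts</name><value>-Xmx1600m</value></property>
  <property><name>mapreduce.reduce.memory.mb</name><value>2048</value></property>
  <property><name>mapreduce.reduce.java.opts</name><value>-Xmx1600m</value></property>
</configuration>

<!-- hdfs-site.xml (sketch): namenode/datanode storage locations and block size -->
<configuration>
  <property><name>dfs.name.dir</name><value>file:///hadoopinfra/namenode</value></property>
  <property><name>dfs.data.dir</name><value>file:///hadoopinfra/datanode</value></property>
  <property><name>dfs.block.size</name><value>536870912</value></property>
</configuration>

<!-- core-site.xml (sketch): temporary working directory, next to fs.defaultFS shown earlier -->
<configuration>
  <property><name>hadoop.tmp.dir</name><value>/hadoopinfra/tmp</value></property>
</configuration>
```

Note how the heap size (-Xmx1600m) is kept below the container memory (2048 MB) so the JVM plus its overhead still fits inside the container.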
So what is a container? Hadoop 2 introduced YARN, which brought in the concept of a container. To execute a program we need a processor and memory to run the code on, so a container defines the context my program runs in (a number of cores and an amount of RAM).
Note that the map and reduce memory sizes are multiples of the container memory (here 2 x 1024 MB = 2048 MB).
For the framework to start, several daemons run in the background:
- NameNode, for managing the file-system tree.
- DataNode, for storing and retrieving blocks.
- Secondary NameNode.
- ResourceManager (one per cluster), for managing resources.
- History server.
Each daemon is allocated 1 GB of RAM by default to run on, so the total physical memory should exceed the memory available for containers by about 5 GB (one per running daemon): with the values above, 40960 MB - 35840 MB = 5120 MB, roughly 5 x 1024 MB.