Big Data Overview

Could you define beauty?
Big Data is much the same: it resists a precise definition, so instead we ask about its characteristics.
Big Data is described by n Vs, where n keeps changing. Laney (2001) suggested Volume, Variety, and Velocity as the first 3 Vs; IBM then added Veracity (the trustworthiness of the data) as a fourth V, and Oracle later introduced Value.

So how would we process this Big Data? I use Hadoop and wish to learn Spark.

Hadoop is an open-source framework for analyzing large volumes of data. It is split into two modules: a processing module (MapReduce) and a file-system module (HDFS).
Hadoop divides the data into small chunks, processes each chunk on its own, then combines the partial results, the divide-and-conquer principle we used to apply in merge sort; each chunk needs a core and some memory to run on. The sketch after this paragraph shows the classic word-count example of that flow.
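To make the map/combine flow concrete, here is a minimal sketch of the classic word-count job using the standard Hadoop MapReduce API (the class names and the input/output paths are illustrative):

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Map phase: each mapper processes one chunk (input split) on its own
  // and emits (word, 1) pairs.
  public static class TokenizerMapper
      extends Mapper<Object, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private final Text word = new Text();

    @Override
    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, one);
      }
    }
  }

  // Reduce phase: combines the per-chunk results back together,
  // summing the counts for each word.
  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class); // pre-combine on the map side
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));   // input dir in HDFS
    FileOutputFormat.setOutputPath(job, new Path(args[1])); // output dir in HDFS
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}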

As a start, I need to define the location of my data: where will it reside?
The data will reside in the Hadoop file system (HDFS): fs.defaultFS : hdfs://rserver:9000/
Then I define my resources (number of processors, RAM size, disk capacity, etc.); a small sketch for reading these settings back follows the list.
  1. Number of cores YARN can allocate, yarn.nodemanager.resource.cpu-vcores : 20
  2. Total memory YARN can allocate to containers, yarn.nodemanager.resource.memory-mb : 40960
  3. Maximum memory for a single container, yarn.scheduler.maximum-allocation-mb : 35840
  4. Minimum container memory (the allocation unit), yarn.scheduler.minimum-allocation-mb : 1024
  5. Map container memory, mapreduce.map.memory.mb : 2048
  6. Map heap size, mapreduce.map.java.opts : -Xmx1600m
  7. Reduce container memory, mapreduce.reduce.memory.mb : 2048
  8. Reduce heap size, mapreduce.reduce.java.opts : -Xmx1600m
  9. NameNode directory, dfs.name.dir : file:///hadoopinfra/namenode
  10. DataNode directory, dfs.data.dir : file:///hadoopinfra/datanode
  11. Temp directory, hadoop.tmp.dir : /hadoopinfra/tmp
  12. File chunk (block) size, dfs.block.size : 536870912 (512 MB)
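As a quick sanity check, here is a minimal sketch (the class name HdfsCheck is mine; it assumes the *-site.xml files carrying the properties above are on the classpath) that connects to HDFS and reads a few of these settings back:

import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsCheck {
  public static void main(String[] args) throws Exception {
    // Loads core-site.xml, hdfs-site.xml, yarn-site.xml and mapred-site.xml
    // from the classpath.
    Configuration conf = new Configuration();

    // fs.defaultFS value from this post; adjust to your own cluster.
    FileSystem fs = FileSystem.get(URI.create("hdfs://rserver:9000/"), conf);

    // Read back a few of the settings listed above.
    System.out.println("Block size  : " + fs.getDefaultBlockSize(new Path("/")));
    System.out.println("vcores      : " + conf.get("yarn.nodemanager.resource.cpu-vcores"));
    System.out.println("Node memory : " + conf.get("yarn.nodemanager.resource.memory-mb"));
    System.out.println("Map memory  : " + conf.get("mapreduce.map.memory.mb"));

    fs.close();
  }
}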
So what's a container? Hadoop 2 introduced YARN, and with it the concept of a container. To execute a program we need a processor and memory to run the code on, so a container defines the context my program runs in (a number of cores and an amount of RAM).

Note that map or reduce container memory should be a multiple of the minimum container allocation; YARN rounds requests up to the next such multiple, as the sketch below illustrates.
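The rounding rule is easy to express; this toy sketch (my own arithmetic, not YARN's actual code) shows how a memory request is normalized to a multiple of the minimum allocation:

public class NormalizeRequest {
  // Round a container memory request up to the next multiple of the
  // minimum allocation, the way the scheduler normalizes requests.
  static int normalize(int requestedMb, int minAllocationMb) {
    return ((requestedMb + minAllocationMb - 1) / minAllocationMb) * minAllocationMb;
  }

  public static void main(String[] args) {
    System.out.println(normalize(2048, 1024)); // 2048: already a multiple
    System.out.println(normalize(1500, 1024)); // 2048: rounded up
    System.out.println(normalize(3000, 1024)); // 3072: rounded up
  }
}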

For the framework to start, several daemons start in the background (you can list them with the jps command):
  1. NameNode, which manages the file-system tree.
  2. DataNode, which stores and retrieves blocks.
  3. Secondary NameNode, which periodically merges the namespace image with the edit log.
  4. ResourceManager (one per cluster), which manages cluster resources.
  5. History Server, which keeps records of completed jobs.

Each daemon is allocated 1 GB of RAM by default to run on, so total physical memory should exceed the memory available for containers by at least 5 GB (1 GB for each of the daemons above).
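Plugging in the numbers above (assuming all five daemons run on the same node):

  memory for containers : yarn.nodemanager.resource.memory-mb = 40960 MB
  memory for daemons    : 5 daemons x 1024 MB = 5120 MB
  minimum physical RAM  : 40960 + 5120 = 46080 MB (45 GB)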
