
digging


When you hear "digging", your first thought is probably of planting a tree.

How about digging in DATA?

1st: Hadoop is a framework for processing large chunks of data. It consists of 2 modules:

  • HDFS: Hadoop Distributed File System, "for managing files".
  • Map-Reduce: Hadoop's methodology for processing data, where a big chunk of data is divided into smaller chunks. Each chunk is directed to the map function, which extracts the needed data; the result then goes to the reduce function, where the actual processing we need takes place.
Hadoop works on the whole data set in one go, so it is considered batch processing.
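
To make the map and reduce steps concrete, here is a minimal sketch in plain Java. It does not use the Hadoop API: the class `MapReduceSketch` and its methods `map`, `shuffle`, `reduce` and `wordCount` are names made up for this illustration, and word counting is just an assumed example task. It only shows the idea of splitting data into chunks, mapping each chunk to key/value pairs, grouping by key, and reducing each group.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Illustration only: mimics the Map-Reduce flow in memory, without Hadoop.
public class MapReduceSketch {

    // Map step: extract the needed data from one chunk as (word, 1) pairs.
    static List<Map.Entry<String, Integer>> map(String chunk) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String word : chunk.toLowerCase().split("\\s+")) {
            if (!word.isEmpty()) {
                pairs.add(Map.entry(word, 1));
            }
        }
        return pairs;
    }

    // Shuffle step: group the values emitted by all map calls by their key.
    static Map<String, List<Integer>> shuffle(List<Map.Entry<String, Integer>> pairs) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : pairs) {
            grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
        }
        return grouped;
    }

    // Reduce step: the actual processing, here summing the counts of one word.
    static int reduce(String key, List<Integer> values) {
        int sum = 0;
        for (int value : values) {
            sum += value;
        }
        return sum;
    }

    // Runs map on every chunk, then shuffle, then reduce on every group.
    static Map<String, Integer> wordCount(List<String> chunks) {
        List<Map.Entry<String, Integer>> allPairs = new ArrayList<>();
        for (String chunk : chunks) {
            allPairs.addAll(map(chunk));
        }
        Map<String, Integer> result = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> group : shuffle(allPairs).entrySet()) {
            result.put(group.getKey(), reduce(group.getKey(), group.getValue()));
        }
        return result;
    }

    public static void main(String[] args) {
        // Two small "chunks" stand in for the pieces of a big file.
        System.out.println(wordCount(List.of("big data big", "data is big")));
        // prints {big=3, data=2, is=1}
    }
}
```

In a real cluster the chunks live on different nodes and the map calls run in parallel; here everything runs in one loop to keep the sketch short.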

2nd: Hadoop eco-system
It would be annoying if, each time you wished to do a task, you had to write Java code for the map function and the reduce function, compile the code, etc. The Hadoop eco-system provides us with tools that do this for us:

  1. PIG: a scripting language "that is translated in the background to a Map-Reduce job".
  2. Hive: a SQL-like query language "also translated to a Map-Reduce job".
  3. Impala: a SQL-like query engine "runs queries with its own engine instead of translating them to Map-Reduce jobs".
  4. SQOOP: for transferring bulk data between Apache Hadoop and structured datastores "RDBMS".
  5. HUE "Hadoop User Experience": a web interface that has editors and browsers for SQL, Hive, etc.
  6. OOZIE: for scheduling and managing workflows.
3rd: YARN "Yet Another Resource Negotiator"
YARN was introduced in the Hadoop 2 release for better management of resources. Resources are handed out as containers "Memory + Processor"; a container is the context where our map and reduce functions run.

  1. Resource Manager acts as the CEO over all resources: it knows who is occupied and who is free.
  2. Node Manager acts as a CXO under the CEO: it knows who is occupied and who is free on its own node only.
  3. Container is our worker, the context where the map and reduce functions reside.
i.e.:
A Hadoop cluster consists of many nodes "PCs". On each node there is one node manager that controls the resources on that specific node, and all node managers are managed by one resource manager.
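
The relation between the three roles can be sketched as a toy model in plain Java. This is not the YARN API: `YarnSketch`, `Container`, `NodeManager`, `ResourceManager`, `allocate` and `request` are hypothetical names, the node sizes are made up, and the "first node with enough room" rule is an assumption chosen for simplicity (real YARN uses pluggable schedulers).

```java
import java.util.ArrayList;
import java.util.List;

// Toy model of the roles described above; not real YARN code.
public class YarnSketch {

    // A container is a slice of one node's memory and processor.
    static final class Container {
        final String node;
        final int memoryMb;
        final int vcores;

        Container(String node, int memoryMb, int vcores) {
            this.node = node;
            this.memoryMb = memoryMb;
            this.vcores = vcores;
        }
    }

    // Knows what is occupied and what is free on its own node only.
    static final class NodeManager {
        final String name;
        int freeMemoryMb;
        int freeVcores;

        NodeManager(String name, int memoryMb, int vcores) {
            this.name = name;
            this.freeMemoryMb = memoryMb;
            this.freeVcores = vcores;
        }

        // Returns a container if this node has room, otherwise null.
        Container allocate(int memoryMb, int vcores) {
            if (memoryMb > freeMemoryMb || vcores > freeVcores) {
                return null;
            }
            freeMemoryMb -= memoryMb;
            freeVcores -= vcores;
            return new Container(name, memoryMb, vcores);
        }
    }

    // Sees every node manager and picks the first node that can host the request.
    static final class ResourceManager {
        final List<NodeManager> nodes = new ArrayList<>();

        Container request(int memoryMb, int vcores) {
            for (NodeManager node : nodes) {
                Container container = node.allocate(memoryMb, vcores);
                if (container != null) {
                    return container;
                }
            }
            return null; // no node in the cluster has enough free resources
        }
    }

    public static void main(String[] args) {
        ResourceManager rm = new ResourceManager();
        rm.nodes.add(new NodeManager("node-1", 2048, 2));
        rm.nodes.add(new NodeManager("node-2", 4096, 4));

        System.out.println(rm.request(2048, 2).node); // prints node-1
        System.out.println(rm.request(1024, 1).node); // prints node-2, node-1 is now full
        System.out.println(rm.request(8192, 1));      // prints null, request is too big
    }
}
```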


