
Container Storage

When traveling, we always think about our traveling bag: how our luggage will fit in it. Yet have you ever thought about how your bag itself will be stored or transferred? I think data should be thought of the same way: how it will be stored, and how it will be retrieved. Sizing matters too: how much space will the data occupy? Would it fit on one server, or will we need a cluster? Many questions will jump to mind, yet storing and retrieving are the most general ones. And so we move on to ask: SQL or NoSQL? I'd go for SQL in case of having multiple entities communicating with each other, with multiple relations between one another. Yet if what I need is simply a storage unit, then NoSQL would be my choice. Have a look at the CAP theorem and how each of consistency, availability, and partition tolerance will be satisfied by the DB engine you choose.
Recent posts

The post-office & the postman

If we were to talk about the old messaging system, there exist the post-office, the postman, and the mailbox. Each component had its own functionality that we look for when trying to visualize how those components would interact in a computerized version.

Simple scenario:
- Mail is added to a mailbox.
- The postman arrives, picks up the mail from the mailboxes in his area, and takes it to the post-office.
- The post-office organizes the mail by area.
- The postman takes the mail related to his area and "distributes it into mailboxes".
- A person can go to the post-office and pick up his own mail "in case of failure, or if he wishes for early delivery".

Mapping to a computerized version:
Scenario: the Observer design pattern, which can use a push or pull scenario to inform those who are registered for an event about its occurrence.
Components:
- Post-Office = Message-Broker
- Post-Office-Box = Message-Storage-Validity
- Mailbox = Topic/Queue
- Postman!!! Where's the postman?
Apache Kafka acts as a message broker which d…
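To make the mapping concrete, here is a minimal sketch of the two ends in Kafka's Java client API: a producer dropping mail into a topic-mailbox, and a consumer picking it up. The topic name, broker address, and group id are my own illustrative choices, not from the post.

```java
import java.time.Duration;
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class PostOfficeDemo {
    public static void main(String[] args) {
        // The "person" dropping mail into a mailbox (topic).
        Properties producerProps = new Properties();
        producerProps.put("bootstrap.servers", "localhost:9092"); // assumed broker address
        producerProps.put("key.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");
        producerProps.put("value.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");
        try (KafkaProducer<String, String> producer = new KafkaProducer<>(producerProps)) {
            producer.send(new ProducerRecord<>("mailbox", "area-1", "hello neighbour"));
        }

        // The "person" picking up his own mail from the post-office (broker).
        Properties consumerProps = new Properties();
        consumerProps.put("bootstrap.servers", "localhost:9092");
        consumerProps.put("group.id", "area-1-readers");
        consumerProps.put("auto.offset.reset", "earliest");
        consumerProps.put("key.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");
        consumerProps.put("value.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");
        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(consumerProps)) {
            consumer.subscribe(List.of("mailbox"));
            ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
            for (ConsumerRecord<String, String> record : records) {
                System.out.println(record.key() + ": " + record.value());
            }
        }
    }
}
```

Note how the broker plays the post-office: producer and consumer never talk to each other directly; they only leave and collect mail at the topic.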

Baby steps

Data aggregation is one of the roads we use to understand the diversity of our data. SQL "Structured Query Language" is the easiest way we use to do so. Below is how our SQL syntax maps to Pig or Spark.

SQL structure:

What to retrieve, stating which columns we choose to display from our data structure:
SQL: Select student, age From mathClass
Pig: namedMathClass = foreach mathClass generate (chararray) $0 as student:chararray, (int) $2 as age:int ;
Spark: namedMathClass = mathClass.map( row => (row(0), row(2)) )

Whether a row is to be added to our data-set or not, the "condition":
SQL: where age > 10
Pig: greater_10 = Filter namedMathClass by age > 10 ;
Spark: greater_10 = namedMathClass.filter( col => col(1) > 10 )

How to aggregate: we group similar data together in one bag, then apply our aggregate function on these bags:
SQL: Select age, Count(student) From mathClass group by age
Pig: groupAge = Group mathClass by age; Iterate_Age = For…
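As an end-to-end sketch of the same three steps in a single language, here is how they might look in Spark's Java API. The file name mathClass.csv and its header columns (student, age) are my illustrative assumptions, not from the post.

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import static org.apache.spark.sql.functions.col;
import static org.apache.spark.sql.functions.count;

public class MathClassAggregation {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("mathClass-aggregation")
                .master("local[*]")          // run locally for the sketch
                .getOrCreate();

        // Assumption: a CSV file with a header row "student,age" exists at this path.
        Dataset<Row> mathClass = spark.read()
                .option("header", "true")
                .option("inferSchema", "true")
                .csv("mathClass.csv");

        // What to retrieve: SELECT student, age FROM mathClass
        Dataset<Row> namedMathClass = mathClass.select(col("student"), col("age"));

        // The condition: WHERE age > 10
        Dataset<Row> greater10 = namedMathClass.filter(col("age").gt(10));

        // The aggregation: SELECT age, COUNT(student) FROM mathClass GROUP BY age
        Dataset<Row> countPerAge = greater10.groupBy(col("age"))
                .agg(count(col("student")));

        countPerAge.show();
        spark.stop();
    }
}
```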

digging

(Open the SVG image in a browser; use the arrow keys to navigate.)

When you say digging, the first thought most would have is that you are going to plant a tree. How about digging in DATA?

First: Hadoop is a framework for processing large chunks of data, consisting of 2 modules:
- HDFS: Hadoop Distributed File System, for managing files.
- Map-Reduce: Hadoop's methodology for processing data, where big chunks of data are divided into smaller chunks, each directed to the map function, which extracts the needed data, then to the reduce function, where the actual processing we need takes place.
Hadoop works on the whole data at one time, so it is considered batch processing.

Second: the Hadoop eco-system. It would be annoying if, each time you wished to do a task, you had to write Java code for the map function and the reduce function, compile the code, etc. The Hadoop eco-system provides us with tools that do this for us:
PIG: a scripting language that is translated in the background to a…
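For a feel of the hand-written Java the eco-system saves us from, here is a minimal map/reduce pair of the kind described above, using the classic word count as a stand-in example (my own illustrative choice, not from the post):

```java
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Map: extract the needed data from each input chunk (here: individual words).
class TokenMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        for (String token : value.toString().split("\\s+")) {
            word.set(token);
            context.write(word, ONE);   // emit (word, 1) for the reducer
        }
    }
}

// Reduce: the actual processing, summing the counts collected for each word.
class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable v : values) {
            sum += v.get();
        }
        context.write(key, new IntWritable(sum));
    }
}
```

On top of this you would still need a driver class, a build step, and a job submission; that boilerplate is exactly what tools like Pig generate for you.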

Big Data Overview

Could you define beauty? So is Big Data: it is itself a definition. You could, though, ask about its characteristics. Big data has n V-dimensions, where n often changes: Laney (2001) suggested Volume, Variety, and Velocity as the 3 Vs, then IBM added Veracity "realism" as the fourth V, and later Oracle introduced Value.

So how would we process this Big Data? I use Hadoop and wish to learn Spark. Hadoop is an open-source framework used for analyzing big chunks of data; it is divided into 2 modules: a map-reduce module and a file system module "HDFS". Hadoop divides the data into small chunks, starts processing each chunk on its own, then combines the chunks again "the divide-and-conquer principle we used in merge sort"; each chunk needs a core and memory to run on.

As a start, I need to define the location of my data. Where would my data reside? The data would reside in the Hadoop file system (HDFS):
fs.defaultFS : hdfs://rserver:9000/
Then I define my resources: number o…
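The fs.defaultFS property above normally lives in core-site.xml; as a sketch, the same setting can also be applied through Hadoop's Java Configuration API. The rserver:9000 address comes from the post; the rest is illustrative.

```java
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;

public class HdfsLocation {
    public static void main(String[] args) throws IOException {
        Configuration conf = new Configuration();

        // Same value as the fs.defaultFS property from the post:
        // every relative path now resolves against this HDFS namenode.
        conf.set("fs.defaultFS", "hdfs://rserver:9000/");

        FileSystem fs = FileSystem.get(conf);

        // Quick check that the client talks to the cluster we configured.
        System.out.println("Working directory: " + fs.getWorkingDirectory());
        fs.close();
    }
}
```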

Pipe is not always for smoking

Pipes, with IO & networking, are a way of communication, connecting two ends "program to a file, program to a program ...etc", as if a person were calling another: between those two persons there exists a pipe. For IO one end is a file, whereas for networking both ends are programs.

Basic operations:
- open(): InputStream, for opening a connection. A person dialing a number; the other end picks up.
- close(), for closing this connection. One person hangs up.
- write(byte[] bufferToWrite, int numOfBytesToWrite): numOfBytesActuallyWritten. A person talking.
- read(byte[] bufferToReadIn, int numOfBytesToRead): numOfBytesRead. A person listening.

The read() operation is a blocking one, so we can use available() to check whether any data exists in the input stream before doing a blocking read. We can also synchronize between the two ends by adding a buffer. For a file: FileInputStream --> BufferedInputStream --> DataInputStream. BufferedInputStream is used to read a stream of bytes; if we want to distinguish &…
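Here is a minimal sketch of the decorator chain described above, with available() guarding the blocking read. The file name is an illustrative assumption.

```java
import java.io.BufferedInputStream;
import java.io.DataInputStream;
import java.io.FileInputStream;
import java.io.IOException;

public class PipeDemo {
    public static void main(String[] args) throws IOException {
        // FileInputStream --> BufferedInputStream --> DataInputStream,
        // exactly the chain from the post: the buffer smooths out disk reads,
        // and DataInputStream adds typed reads on top of raw bytes.
        try (DataInputStream in = new DataInputStream(
                new BufferedInputStream(
                        new FileInputStream("data.bin")))) {   // open(): one end is a file

            // Non-blocking check before a blocking read.
            if (in.available() > 0) {
                byte[] buffer = new byte[in.available()];
                int bytesRead = in.read(buffer, 0, buffer.length); // the person listening
                System.out.println("Read " + bytesRead + " bytes");
            }
        } // close(): try-with-resources hangs up for us
    }
}
```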

my tailor uses Custom tag

Custom tags avoid placing Java code inside the JSP page "scripting", so that the JSP page can concentrate only on the presentation logic.

In the old "classic" model, a developer has to adjust the flow: from the start, where to go next, whether the body will be evaluated or not, all depending on the return values of overridden methods. An instance of the tag handler is reused, so you can't rely on the constructor to do initialization; it is better to use setPageContext(). Now everything is done by overriding just one method, and the simple tag model never reuses tag handler instances.

Body content: in practice, when the content is not processed at all:
JspWriter out = getJspContext().getOut();
out.print(" ... ");
getJspBody().invoke(null);
out.print(" ... ");
Modified to process the body content:
StringWriter writer = new StringWriter();
getJspBody().invoke(writer);
String bodyContent = writer.toString();
The .tld addresses the custom tag; so how to know about it explicitly "within web.xml"…
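A minimal sketch of a simple-model tag handler built around that one method, doTag(), capturing and processing the body as shown above. The tag handler's name and the upper-casing step are my illustrative assumptions.

```java
import java.io.IOException;
import java.io.StringWriter;
import javax.servlet.jsp.JspException;
import javax.servlet.jsp.JspWriter;
import javax.servlet.jsp.tagext.SimpleTagSupport;

// Simple tag model: a fresh instance per use, everything done in doTag().
public class ShoutTag extends SimpleTagSupport {

    @Override
    public void doTag() throws JspException, IOException {
        // Capture the body instead of writing it straight through,
        // so we can process it before output.
        StringWriter writer = new StringWriter();
        if (getJspBody() != null) {          // the tag may be used with an empty body
            getJspBody().invoke(writer);
        }
        String bodyContent = writer.toString();

        JspWriter out = getJspContext().getOut();
        out.print(bodyContent.toUpperCase()); // the "processing" step
    }
}
```

Passing null to invoke() would instead stream the body directly to the page, which is the unprocessed case shown first in the post.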