Skip to main content

Baby steps

Data aggregation is one of the roads we use to understand our data diversity. SQL "Selective Query Language" is the easiest way we use to do so. below is how to map our SQL syntax to Pig or Spark.

SQL Structure:
  • What to retrieve, stating which column we choose to display from my data structure
    • SQL: Select student, age From mathClass
    • Pig: namedMathClass = foreach mathClass generate (chararray) $0 as student:chararray, (int) $2 as age:int ;
    • Spark: namedMathClass = mathClass.map( row => row(0), row(2) )
  • Whether this row is to be added in our data-set or not "Condition"
    • SQL: where age > 10
    • Pig: greater_10 = Filter namedMathClass by age > 10 ;
    • Spark: greater_10 = namedMathClass.filter( col => col(1) > 10 )
  • How to aggregate, we group similar data together  in one bag then apply our aggregate function on this bags 
    • SQL: Select age, Count(student) From mathClass group by age
    • Pig:
      • groupAge = Group mathClass by age;
      • Iterate_Age = Foreach groupAge generate group as age, COUNT(mathClass) as total;
    • Spark: Iterate_Age = mathClass.groupBy("age").agg(count("student").alias("total")

A complete Spark example:


import org.apache.spark.sql.Row
import org.apache.spark.sql.types._
val file = sc.textFile("hdfs://MyServer:9000/path/to/data")
val splitData = file.map(line => line.split(";") ).
                map(row => Row(row(0).substring(0,10), row(1),row(2),row(3)))

// column names 
val schemaString = "day user Err_Code Err_MSG"
// map each column to a data type
val fields = schemaString.split(" ").
  map(fieldName => StructField(fieldName, StringType, nullable = true))
val schema = StructType(fields)
val dataDF = spark.createDataFrame(splitData, schema)
val aggregated = dataDF.groupBy("day","Err_Code").
agg(count("user").alias("user"),countDistinct("user").alias("unique"),
count(when($"Err_MSG"==="SUCCESS",true) ).alias("Condtion"))
val result = aggregated.rdd
result.saveAsTextFile("hdfs://MyServer:9000/path/to/saveAt")

Comments

Popular posts from this blog

The post-office & the postman

If we were to talk about old messaging system where there exist post-office, postman & mailbox. each component had its own functionality that we looked for when trying to visualize how those component where to interact in a computerized version. Simple scenario: Mail is added in mail box Postman arrive pick mails from his area mailboxes and take them to the post-office. Post-office organize mails by areas. Postman takes mails related to his area "distribute it in mailboxes". A person can go to post-office and  pick his own mail "in case of failure or wishes for early delivery". Mapping in a computerized version: Scenario: Observer design pattern which can use push or pull scenario, to inform those whom are registered for an event about its occurrence. Component: Post-Office = Message-Broker Post-Office-Box = Message-Storage-Validity Mailbox = Topic/Queue Postman !!! where's the postman ? Apache kafka act as a message broker which d...

Not all Ps sting

  If someone meant to say Ps and pronounce it Bees. would this confuse you :). Ps is for the P that is the start of Properties and Practice Each application should have some properties and follow certain practices. Properties: Below are 5 properties we should try to have in our application with a small description of how to include them Scalable, Scale => Increase workload (horizontally scaling) Statless, no state should be shared among different application instances,  Concurrency, concurrent processing = Threads. Loosely coupled, decompose the system into modules, each has minimal dependencies on each other "modularization", encapsulating code that changes together "High cohesion".  API first, Interfaces, implementation can be changed without affecting other application. favor distribution of work across different teams.  Backing Services, "DB, SMTP, FTP ..." , treating them as attached resources, meaning they can easily be changed. Manageable, changi...

exploring black box

Once loved watching a comic old movie, where a clause was often said:"what's in this ring box ?" accompanied with a surprising answer:"an elephant is in this ring box " :o Reflection discover whats in the black box "an object", could know what is in an object as if a mirror that reflects it content "ability to examine the run-time properties of the object " small ex wanted to return a generic list containing result of a sql query "as if hibernate" List<T>list=new ArrayList<T>(); ResultSetMetaData metadata=resultSet.getMetaData(); String[] columnName=new String[metadata.getColumnCount()]; for(int i=0;i<columnName.length;i++){//column name equivalent to object instance variable      columnName[i]=metadata.getColumnLabel(i+1); } int index=0; T t; while (resultSet.next()) {   list.add((T)cls.newInstance()); t=list.get(index); for(int i=0;i<columnName.length;i++){//initializati...