Baby steps

Data aggregation is one of the roads we use to understand the diversity of our data, and SQL (Structured Query Language) is the easiest way to express it. Below is how to map common SQL syntax to Pig and Spark.

SQL Structure (a complete Pig sketch combining the three steps follows this list):
  • What to retrieve: stating which columns we choose to display from our data structure
    • SQL: SELECT student, age FROM mathClass
    • Pig: namedMathClass = FOREACH mathClass GENERATE (chararray)$0 AS student, (int)$2 AS age;
    • Spark: namedMathClass = mathClass.map( row => (row(0), row(2).toInt) )
  • Whether a row is to be included in our data set or not, the "condition"
    • SQL: WHERE age > 10
    • Pig: greater_10 = FILTER namedMathClass BY age > 10;
    • Spark: greater_10 = namedMathClass.filter( row => row._2 > 10 )
  • How to aggregate: we group similar data together in one bag, then apply our aggregate function on these bags
    • SQL: SELECT age, COUNT(student) FROM mathClass GROUP BY age
    • Pig:
      • groupAge = GROUP mathClass BY age;
      • Iterate_Age = FOREACH groupAge GENERATE group AS age, COUNT(mathClass) AS total;
    • Spark: Iterate_Age = mathClass.groupBy("age").agg(count("student").alias("total"))
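
To see the three steps flow together, here is a minimal end-to-end Pig sketch; the input path, the tab delimiter, and the output path are assumptions for illustration:

mathClass = LOAD 'hdfs://MyServer:9000/path/to/mathClass' USING PigStorage('\t');
-- select: name and cast the columns we care about
namedMathClass = FOREACH mathClass GENERATE (chararray)$0 AS student, (int)$2 AS age;
-- condition: keep only rows where age is above 10
greater_10 = FILTER namedMathClass BY age > 10;
-- aggregate: one bag per age, then count the students in each bag
groupAge = GROUP greater_10 BY age;
Iterate_Age = FOREACH groupAge GENERATE group AS age, COUNT(greater_10) AS total;
STORE Iterate_Age INTO 'hdfs://MyServer:9000/path/to/output';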

A complete Spark example:


import org.apache.spark.sql.Row
import org.apache.spark.sql.types._
import org.apache.spark.sql.functions._  // count, countDistinct, when
import spark.implicits._                 // enables the $"column" syntax (pre-imported in spark-shell)

// read the raw file and split each line on ";" into a Row of four columns,
// trimming the first field to its first 10 characters (e.g. the date part of a timestamp)
val file = sc.textFile("hdfs://MyServer:9000/path/to/data")
val splitData = file.map(line => line.split(";")).
                map(row => Row(row(0).substring(0, 10), row(1), row(2), row(3)))

// column names
val schemaString = "day user Err_Code Err_MSG"
// map each column to a data type
val fields = schemaString.split(" ").
  map(fieldName => StructField(fieldName, StringType, nullable = true))
val schema = StructType(fields)
val dataDF = spark.createDataFrame(splitData, schema)

// per day and error code: total events, distinct users, and successful events
// (count ignores the nulls that when(...) produces for non-matching rows)
val aggregated = dataDF.groupBy("day", "Err_Code").
  agg(count("user").alias("user"),
      countDistinct("user").alias("unique"),
      count(when($"Err_MSG" === "SUCCESS", true)).alias("success"))

val result = aggregated.rdd
result.saveAsTextFile("hdfs://MyServer:9000/path/to/saveAt")
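
With Spark 2.x you can usually skip the RDD conversion and let the DataFrame writer persist the result directly, e.g. aggregated.write.option("header", "true").csv("hdfs://MyServer:9000/path/to/saveAt"); the RDD route above is shown only to connect the DataFrame world back to the plain-text output we started from.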
