Is HDInsight Hadoop?
Hadoop itself is not a Microsoft product — but HDInsight, Microsoft's managed distribution of it, is.
Hadoop is open source, built around MapReduce.
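As a rough illustration of the MapReduce model — a minimal Python sketch of the map and reduce phases, not Hadoop's actual Java API:

```python
from collections import defaultdict

def map_phase(line):
    # Map: emit a (word, 1) pair for every word in an input line.
    return [(word.lower(), 1) for word in line.split()]

def reduce_phase(pairs):
    # Reduce: sum the counts for each distinct key.
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return dict(totals)

lines = ["the quick brown fox", "the lazy dog"]
pairs = [pair for line in lines for pair in map_phase(line)]
counts = reduce_phase(pairs)
# counts["the"] == 2
```

In real Hadoop the map output is shuffled across the cluster so each reducer sees all pairs for its keys; here everything runs in one process just to show the shape.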
Flume and Sqoop import and export data into and out of Hadoop.
Domain specialists write a program and distribute the processing across all the nodes; structure is applied to the data as it is read.
Azure Data Factory aims to be SSIS in the cloud.
What is the difference between Sqoop, Flume, and Azure Data Factory? We've recommended Sqoop in the past. Flume handles log data; Sqoop handles data you can interact with relationally, and it is schema driven:
describe the schema and import it.
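A hedged sketch of what a schema-driven Sqoop import invocation can look like, built here as an argument list (the connection string, table, and target directory are made-up placeholders):

```python
# Sqoop reads the table's schema from the source database and generates
# the import job from it — you describe the source, it handles the rest.
sqoop_import = [
    "sqoop", "import",
    "--connect", "jdbc:sqlserver://example-server;database=Sales",  # hypothetical source
    "--table", "Orders",                                            # hypothetical table
    "--target-dir", "/data/orders",                                 # HDFS/blob destination
]
print(" ".join(sqoop_import))
```

In practice you would run this on a cluster head node (or hand the list to `subprocess.run`); the point is that the schema lives in the source database, not in the command.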
Why would you use Hadoop or HDInsight?
Money… it's cheaper to deal with large amounts of data on cheap hardware. Now we do it in the cloud, and you pay for compute instances only while they exist.
To have Hadoop plus blob storage – HDInsight.
To stop being charged, delete the cluster rather than just spinning it down.
Job scheduling service – Oozie.
Hive – a SQL-like query layer on top of the data.
Hive can store its metadata in a database.
The tables and columns map to directories in storage, and the metadata itself can live in an external database: store the metadata in SQL, then use that metadata to query against the blobs.
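To make the metadata-over-blobs idea concrete, here is a toy sketch — an invented structure, not the real Hive metastore schema — of metadata that maps a table definition onto a storage directory:

```python
# Toy metastore record: a table's schema plus the blob directory it points at.
# The account/container names are placeholders.
metadata = {
    "table": "weblogs",
    "columns": [("t1", "string"), ("t2", "int")],
    "location": "wasb://container@account.blob.core.windows.net/weblogs/",
}

def resolve_location(meta, table_name):
    # A query engine looks the table up in the metadata, then scans the blobs.
    if meta["table"] == table_name:
        return meta["location"]
    raise KeyError(table_name)

path = resolve_location(metadata, "weblogs")
```

The data never moves: the "table" is just a description pointing at files that already sit in blob storage.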
HBase is a NoSQL data store (great access times) and has a management API.
Storm – real-time event processing. You plug components together as a graph (a topology): spouts as inputs and bolts processing the stream between the parts, each implemented as a class. If you never take the cluster down, you don't have to deal with the offset again. It's made in Java, and you can extend it to work with any programming language.
Storm in Python.
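A minimal Python sketch of the spout/bolt idea — plain classes wired as a pipeline. A real Storm topology uses the Storm API (Python support typically goes through Storm's multi-lang protocol), so this only shows the shape:

```python
class Spout:
    # Source of the stream: emits raw events.
    def __init__(self, events):
        self.events = events

    def emit(self):
        for event in self.events:
            yield event

class UppercaseBolt:
    # A processing step: transforms each event and passes it on.
    def process(self, stream):
        for event in stream:
            yield event.upper()

class CountBolt:
    # A terminal step: counts events while forwarding them.
    def __init__(self):
        self.count = 0

    def process(self, stream):
        for event in stream:
            self.count += 1
            yield event

spout = Spout(["click", "view", "click"])
counter = CountBolt()
results = list(counter.process(UppercaseBolt().process(spout.emit())))
# results == ["CLICK", "VIEW", "CLICK"]; counter.count == 3
```

In Storm the same graph runs distributed and never terminates — the spout keeps emitting as long as events arrive.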
Spark is a cluster type that just came out of preview for Windows: near-real-time batch processing.
SPARK is the new one. <------ Hotness.
Java is the base for Hadoop, so you can use any JVM language; it's about 80% Java.
Spark is written in Scala, compared to Hadoop's Java.
R is popular with data scientists; you can install the R components onto the Hadoop cluster.
Azure Machine Learning – we use it for machine learning, analyzing data through a neural network.
PowerShell command line, Azure CLI (implemented in Node.js). SDKs: .NET / Python / Node.js.
A cluster is described as a group of resources in a template; you can feed that to the Azure service and it will "make that happen".
I can then create my template and reuse it however I want.
How do you feed it the template?
Through the command line, the SDKs, or the raw REST API. Wire it up to a button, and it will get the parameters and start the deployment.
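A sketch of the shape of a template deployment against the raw REST API. The payload below is illustrative only — a real HDInsight template has many more properties, and the subscription ID, resource group, template URI, and API version here are placeholders:

```python
import json

# Placeholders; a real call also needs an OAuth bearer token.
subscription_id = "00000000-0000-0000-0000-000000000000"
resource_group = "my-hdinsight-rg"
deployment_name = "hdinsight-demo"

# The deployment is a PUT to .../deployments/{name} with a body like this:
body = {
    "properties": {
        "mode": "Incremental",
        "templateLink": {
            # e.g. a raw link into the Azure Quickstart Templates repo
            "uri": "https://example.com/azuredeploy.json",
        },
        "parameters": {
            "clusterName": {"value": "demo-cluster"},
        },
    }
}

url = (
    f"https://management.azure.com/subscriptions/{subscription_id}"
    f"/resourcegroups/{resource_group}/providers/Microsoft.Resources"
    f"/deployments/{deployment_name}?api-version=2021-04-01"
)
payload = json.dumps(body)
```

The SDKs and the CLI build essentially this same request for you; the "button" mentioned above just collects the parameter values and fires it.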
There is a GitHub repository, Azure Quickstart Templates. Each template has a readme, and you can click to deploy it to Azure.
Search the repository for HDInsight to get the templates; you can spin up a predefined cluster.
HDInsight can usually be up and running in about 15 minutes.
Watch Hadoop spin up.
Click on a cluster; each cluster is its own entity.
Say I want to query logs. There's a chicken-and-egg problem: you need the data in the cloud to do data analysis there, and uploading it could take ages. HDInsight can query both.
If you have data locally there is the HDInsight application gateway, but even then you still have to deliver the data to the cloud.
How do I create a solution for HDInsight? Eclipse with Maven or Gradle, or VISUAL STUDIO <-- no, I'm not partial.
Abstractions versus MapReduce: a SQL-like query layer over Hadoop. You need to be able to assign rows and columns.
If you were a SQL developer you may want to write in SQL.
This is just base Hive. There isn't a Hive over Spark; there is Spark SQL. The data is still stored in a blob.
The way Hive works is that you create a table over the data. It's a CSV file: t1 as string, t2 as int.
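What "create a table over the CSV" means in practice is schema-on-read: the file stays a plain CSV, and the declared types (t1 as string, t2 as int) are applied only when rows are read. A minimal Python sketch of that idea:

```python
import csv
import io

# The declared Hive-style schema: column name -> Python type.
schema = [("t1", str), ("t2", int)]

def read_with_schema(csv_text, schema):
    # Schema-on-read: the raw file is untouched; types apply at scan time.
    rows = []
    for record in csv.reader(io.StringIO(csv_text)):
        rows.append({name: typ(value)
                     for (name, typ), value in zip(schema, record)})
    return rows

rows = read_with_schema("alpha,1\nbeta,2\n", schema)
# rows[0] == {"t1": "alpha", "t2": 1}
```

Because the schema is applied on read, you can drop the table (the metadata) without touching the underlying file, or lay a different schema over the same blobs.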
Compressed storage is fast, but then you can't read the data outside Hive unless you go through the MapReduce abstraction. Pig does the mapping. You will have to recreate the cluster.
Hive -> translated into MapReduce jobs.
Spark tries to run everything in memory, so it's faster.
Use Data Factory to get data to the cloud.