Azure Databricks is a fully managed Platform-as-a-Service (PaaS) offering, released on Feb 27, 2019, that leverages the Microsoft cloud to scale rapidly, host massive amounts of data effortlessly, and streamline workflows for better collaboration between business executives, data scientists, and engineers.
Azure Databricks, a first-party Microsoft service, was created through a year-long collaboration between the Microsoft and Databricks teams. It provides Databricks' Apache Spark-based analytics service as part of the Microsoft Azure platform.
Azure Databricks uses Azure Active Directory (AAD) as its security framework. Existing credentials can still be used for authorization, provided the appropriate security settings are configured. Access and identity control are all handled in the same environment. AAD also makes it easy to integrate with the rest of the Azure stack, including Data Lake Storage (as a data source or output), SQL Data Warehouse, Blob Storage, and Azure Event Hubs.
Blob storage can be used to share data with the world or to keep application data private. For those familiar with Azure HDInsight or Azure Data Lake Analytics, Databricks is a great alternative.
Connecting Azure Databricks to an Azure Storage Account
Create a storage account, then create a private container in it.
Upload the blob file into the container; you can download the sample file from the following link: https://csg10032000aeaa88a0.blob.core.windows.net/datafile/employe_data.csv
Open the blob's context menu, click Generate SAS, and copy the Blob SAS token. Store it somewhere safe; you will need it later.
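Pasting the raw SAS token straight into a notebook is fine for a quick demo, but a safer option is to keep it in a Databricks secret scope and read it at runtime. A minimal sketch, assuming a scope named blob-demo with a key named blob-sas that you have created beforehand (both names are hypothetical):

// blob-demo and blob-sas are hypothetical names; create them first,
// e.g. with the legacy Databricks CLI:
//   databricks secrets create-scope --scope blob-demo
//   databricks secrets put --scope blob-demo --key blob-sas
val sas = dbutils.secrets.get(scope = "blob-demo", key = "blob-sas")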
Create an Azure Databricks Workspace
Click Create and choose the subscription (if there is more than one), then select or create a resource group. Choose the region in which to create the Databricks workspace. Finally, select the pricing tier.
Make any remaining changes, then click Review + create and wait for validation.
Once validation is complete, click Create.
Once your deployment is complete, click on the Go to resource button.
Click Launch Workspace to open the Azure Databricks page.
Click Clusters in the left pane, then click Create Cluster. Enter a name for the cluster and set the Cluster Mode to Standard. Fill in the remaining configuration details as described below to create the cluster.
Now start your cluster and make sure it is in the Running state.
Click Workspace in the left pane. Another workspace pane appears; right-click inside it and select Create.
Now enter a name for your notebook and choose Scala as the default language. Next, select the cluster you created earlier and click Create.
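As a quick optional check that the notebook is attached to a running cluster, you can run a trivial Scala cell like this sketch before moving on:

// Print the cluster's Spark version and run a tiny job end-to-end
println(spark.version)
println(spark.range(1, 11).count()) // should print 10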
Copy the following code into your notebook to establish the connection with your storage account. Fill in containerName, storageAccountName, and sas with your own values before running it.

val containerName = ""
val storageAccountName = ""
val sas = ""
val config = "fs.azure.sas." + containerName + "." + storageAccountName + ".blob.core.windows.net"

// Mount the container into DBFS; mountPoint is where it will appear
dbutils.fs.mount(
  source = "wasbs://" + containerName + "@" + storageAccountName + ".blob.core.windows.net/",
  mountPoint = "/mnt/myfile",
  extraConfigs = Map(config -> sas))

// Read the uploaded CSV from the mount and display it
val mydf = spark.read
  .option("header", "true")
  .option("inferSchema", "true")
  .csv("/mnt/myfile/employe_data.csv")
display(mydf)
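If you prefer not to mount the container, the WASB driver also accepts the SAS token as a session configuration. A sketch reusing the same containerName, storageAccountName, and sas values as above:

// Register the SAS token for this Spark session instead of mounting
spark.conf.set(
  "fs.azure.sas." + containerName + "." + storageAccountName + ".blob.core.windows.net",
  sas)

// Read the CSV directly from the wasbs:// URL
val direct = spark.read
  .option("header", "true")
  .option("inferSchema", "true")
  .csv("wasbs://" + containerName + "@" + storageAccountName + ".blob.core.windows.net/employe_data.csv")
display(direct)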
If the employee data is displayed, you have successfully connected.
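When you are done, you can unmount the container so the SAS credentials are no longer attached to the mount point:

// Clean up the mount created earlier
dbutils.fs.unmount("/mnt/myfile")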