Power BI Shared Datasets: What is it? How does it work? Why should I care?
If you work with Power BI, you have probably heard a lot about certified datasets, and about shared datasets becoming available across multiple workspaces. In this article you will learn about:
- What is a shared dataset in Power BI?
- How can shared datasets help in Power BI development?
- Where do shared datasets fit in the Power BI architecture?
- How do shared datasets work behind the scenes in the Power BI service?
- What are certified and promoted datasets?
What is a shared dataset in Power BI?
When you create a Power BI report (let’s call it a *.PBIX file), the file has two components if the data connection mode is Import: a report and a dataset. In Power BI Desktop you can’t see that separation easily, unless you open Task Manager and find the dataset running behind the scenes under the Power BI Desktop task.
However, when you publish the PBIX file into the service (the Power BI website), you can easily see there are two objects.
- The report is the visualization layer of your Power BI implementation
- The dataset includes the data, tables, relationships, calculations, and connection to the data source.
You can schedule a refresh for the dataset, and it can connect to on-premises sources (through a gateway) or to cloud-based sources.
What is a shared dataset?
Now that you know about the dataset, let’s talk about the shared dataset. A shared dataset is a dataset shared between multiple reports. For a long time, you have been able to create a new report from an existing dataset through the Power BI website. This feature has been available since the early days of Power BI.
Around April 2017, it also became possible to create a report in Power BI Desktop that points to an existing dataset through a live connection.
Back then the feature was called Get Data from Power BI Service. It has since been renamed to Getting Data from Power BI Dataset.
With a shared dataset, multiple reports connect to one dataset. When that dataset gets refreshed, all of those reports show the new data. A shared dataset is one step closer to a multi-developer tenant in the Power BI environment.
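The relationship between reports and a shared dataset can be pictured with a small model. This is only a conceptual sketch in Python, not Power BI code: the `Dataset` and `Report` classes, their fields, and the sample rows are hypothetical names invented for illustration. It shows two reports holding a reference to the same dataset object, so a single refresh is visible to both.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import Optional


@dataclass
class Dataset:
    """Hypothetical stand-in for a Power BI dataset (data plus model)."""
    name: str
    rows: list = field(default_factory=list)
    last_refresh: Optional[datetime] = None

    def refresh(self, new_rows: list, when: datetime) -> None:
        # Replace the stored data, roughly as an import-mode refresh would.
        self.rows = list(new_rows)
        self.last_refresh = when


@dataclass
class Report:
    """Hypothetical stand-in for a report: visuals only, no data of its own."""
    name: str
    dataset: Dataset  # a reference to the shared dataset, not a copy

    def row_count(self) -> int:
        return len(self.dataset.rows)


sales = Dataset("Sales")
overview = Report("Sales Overview", sales)
detail = Report("Sales Detail", sales)

# One refresh on the shared dataset...
sales.refresh([{"amount": 100}, {"amount": 250}], datetime(2020, 1, 1, 6, 0))

# ...is seen by every report connected to it.
print(overview.row_count(), detail.row_count())  # prints "2 2"
```

The point of the sketch is the single `sales` object: neither report stores data, so there is only one place to refresh and one place to maintain the model.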
Sharing Datasets Across Multiple Workspaces
For a long time, sharing datasets was only possible inside a workspace. You could not use a dataset from workspace 1 as the source for a report in workspace 2. However, the feature recently became available, and you can now share a dataset across multiple workspaces. This is a significant update that changes the way Power BI development works.
When you get data from a Power BI dataset through Power BI Desktop, you can select which dataset you want to get data from.
How do shared datasets work behind the scenes?
When you share a dataset in the same workspace, everything is clear. You have one dataset to schedule refresh, and multiple reports connected to it. However, when you use a dataset shared from another workspace, you get something that might look a bit different.
When you get data from a Power BI dataset that lives in workspace 1 and then save your report in workspace 2, you might see something that looks like a copy of your dataset in workspace 2. You might say this is not a shared dataset but a copied dataset. In fact, what you see is just a link. Power BI brings a link to that dataset into the new workspace, and this link helps you to see when the dataset was last refreshed.
Here is what a linked dataset looks like, and how it differs from a normal dataset.
You cannot refresh a linked dataset, either manually or on a schedule. Refresh can only be configured on the main dataset. The linked dataset is just a link: it shows you the date and time of the last refresh, and gives you an easier way to generate more reports from that dataset.
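The behavior described above can be sketched as a small model. Again, this is a conceptual illustration in Python and not how the Power BI service is implemented: the `Dataset` and `LinkedDataset` classes and the choice of `PermissionError` are assumptions made for the example. The link exposes the original dataset's refresh time but refuses to be refreshed itself.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass
class Dataset:
    """Hypothetical model of the main dataset in its home workspace."""
    name: str
    workspace: str
    last_refresh: Optional[datetime] = None

    def refresh(self, when: datetime) -> None:
        self.last_refresh = when


@dataclass
class LinkedDataset:
    """Hypothetical model of the link that appears in another workspace."""
    source: Dataset
    workspace: str

    @property
    def last_refresh(self) -> Optional[datetime]:
        # The link only reports the refresh time of the main dataset.
        return self.source.last_refresh

    def refresh(self, when: datetime) -> None:
        # Refresh can only be configured on the main dataset.
        raise PermissionError(
            f"'{self.source.name}' is a link; refresh it in "
            f"'{self.source.workspace}'"
        )


main = Dataset("Sales", workspace="workspace 1")
link = LinkedDataset(main, workspace="workspace 2")

main.refresh(datetime(2020, 1, 1, 6, 0))
print(link.last_refresh)  # same timestamp as the main dataset

try:
    link.refresh(datetime(2020, 1, 2, 6, 0))
except PermissionError as err:
    print(err)
```

Because the link holds no data of its own, there is nothing in workspace 2 that could drift out of sync with the main dataset.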
Certified and Promoted Datasets
When Power BI developers get data from a Power BI dataset, they see all datasets from all workspaces that they have access to. This can be confusing, because there might be tons of datasets shared in the environment. Developers end up asking: Which of these can I use? Which of these are valid to use? Which of these are reconciled and tested?
A labeling system has been added to Power BI datasets to help in this scenario. You can mark datasets as certified or promoted. To get a dataset certified, there is an approval process that can ensure the dataset has passed some tests. Through this labeling system you can make clear which datasets can be used as a source and which can’t. You can also build the concept of gold, silver, and bronze datasets: gold datasets are fully tested and reconciled, and the levels go down to bronze datasets that haven’t been tested yet.
To use this labeling system, the creator of the dataset can go to the settings of the dataset.
In the settings you can set the Endorsement levels as below:
As you can see, the Certified option might not be available. The Power BI tenant administrator has the authority to enable that label in the tenant settings and to give access to whoever needs it.
The labeling system then lets Power BI developers see the endorsement level of each dataset offered as a shared dataset, and choose their source accordingly.
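As an illustration of how a team might apply these labels when choosing a source, here is a minimal Python sketch. It is not a Power BI API: the `Endorsement` enum, the `catalog` list, and the `pick_sources` helper are hypothetical, and the ordering (certified above promoted above unlabeled) is an assumption that matches the gold, silver, and bronze idea described above.

```python
from enum import IntEnum


class Endorsement(IntEnum):
    """Endorsement labels, ordered by assumed level of trust."""
    NONE = 0
    PROMOTED = 1
    CERTIFIED = 2


# Hypothetical catalog of datasets a developer has access to.
catalog = [
    {"name": "Sales draft", "workspace": "Sandbox", "endorsement": Endorsement.NONE},
    {"name": "Sales", "workspace": "Finance", "endorsement": Endorsement.CERTIFIED},
    {"name": "Inventory", "workspace": "Operations", "endorsement": Endorsement.PROMOTED},
]


def pick_sources(datasets, minimum=Endorsement.PROMOTED):
    """Return datasets at or above the minimum label, most trusted first."""
    eligible = [d for d in datasets if d["endorsement"] >= minimum]
    return sorted(eligible, key=lambda d: d["endorsement"], reverse=True)


for d in pick_sources(catalog):
    print(d["endorsement"].name, d["name"])
# CERTIFIED Sales
# PROMOTED Inventory
```

Passing `minimum=Endorsement.CERTIFIED` would restrict the result to certified datasets only, which is one way to express a "gold datasets only" policy.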
Shared Datasets in the Power BI architecture
Previously I have written about how dataflows and shared datasets can play an important role in a multi-developer Power BI implementation. In a nutshell, using dataflows ensures you can bring well-prepared data into a central area, which you can think of as a centralized data warehouse in Azure Data Lake. Using shared datasets, you can build data marts that can be used by multiple reports. Here is how the architecture works in a diagram view:
Instead of having silos of Power BI reports and files everywhere, you can build an architecture that works well with multiple developers, has less redundancy in the data, the code, and the logic, and is easier to maintain.
The shared dataset is not a new feature in Power BI, but the recently announced ability to share it between multiple workspaces is a game-changer for the architecture of a Power BI implementation. Using shared datasets, you can have centralized data models (data marts) that serve multiple reports. This approach reduces maintenance time and the redundancy of both code and data. The labeling system of certified and promoted datasets is also a great way to put process and governance in place, so that shared datasets have been through testing and reconciliation.
Resource Credit | RADACAD