3 keystones of a data project you don’t learn at school

Maxime Boillot
Published in Webedia I/O
Sep 6, 2022 · 6 min read


As a data analyst intern at Webedia Group, I spent three months working on a data project centered around Salesforce data. I quickly found out that improving a data project takes skills you don't learn in engineering school.

Beyond discovering the value of cloud services in a data project, I learned that a data project isn't just about technical tasks such as writing performant queries; it is also about producing high-quality data that is documented clearly enough to be shared and reused.

My data models were intended for the Head of Salesforce and international finance teams (Spain, Germany, Brazil, etc.), so the goal was to make the data source reusable by the whole data team as well as other teams.

Having worked on a dbt project for those three months, I discovered how dbt can help make a database reusable.

dbt (short for "data build tool") is a data transformation tool used by data engineers and data analysts to transform, test, and document data in a cloud data warehouse. Many features have been developed so far.

dbt transformation process

The first and main dbt feature we used is automation.

1. Why is automation a key point of a data project in a major company?

Automation in dbt keeps the data fresh. In a company, commercial and financial activities evolve quickly and often need to be monitored daily.

For this reason, a daily, weekly, or even monthly automated refresh of financial or commercial data is a primary need, and one that dbt fulfills easily.

How is dbt a life-changer for automation in a data project?

On top of that, dbt completely handles the execution order of models and sources, which is a huge time-saver for the data team.

dbt also handles parallel execution of queries: it is sufficient to set the number of threads in the profiles.yml file (which contains all the settings of the environment profile).
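For illustration, here is a minimal sketch of what such a profile can look like with BigQuery; the GCP project and dataset names are hypothetical, and only the threads setting matters for parallelism:

# profiles.yml: a minimal sketch of a BigQuery profile
my_dbt_project:
  target: mydbttestenvironment
  outputs:
    mydbttestenvironment:
      type: bigquery
      method: oauth
      project: my-gcp-project   # hypothetical GCP project
      dataset: analytics        # hypothetical dataset
      threads: 4                # run up to 4 models in parallel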

Lineage graph of a project focused on a single node of the project

As an example, dbt facilitates the execution of this part of the project: it automatically runs the sources first, then the revenue_schedule_by_media_campaign_and_campaign_item model, and finally the other models (four at a time, since we set the number of threads to 4 in the profiles.yml file).

dbt execution logs of the models linked to the revenue_schedule_by_media_campaign_and_campaign_item model

In practice, we launch the same command from a server every morning to execute all the SQL queries in the project, but one can also use tags and selectors to execute only some of them.

For example, with the following command you can execute every model in the project tagged "aggregated" in the documentation, and the results will be stored in the mydbttestenvironment environment:

dbt run --target mydbttestenvironment --select tag:aggregated
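For a tag like this to exist, it has to be declared somewhere in the project. One way is a config() block at the top of the model's SQL file; here is a sketch in which the query itself is illustrative:

-- a sketch: attaching the "aggregated" tag to a model via a config() block
{{ config(tags=['aggregated']) }}

select
    media_campaign_id,
    sum(scheduled_revenue) as total_revenue
from {{ ref('revenue_schedule_by_media_campaign_and_campaign_item') }}
group by media_campaign_id

Tags can also be applied to whole folders of models at once in dbt_project.yml.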

At Webedia Group, we only use the open-source version of dbt, which can be run only through a command-line interface (all the commands are listed at https://docs.getdbt.com/reference/dbt-commands). However, if you don't feel comfortable using a CLI, you can upgrade to their paid cloud product, dbt Cloud, which provides an integrated development environment.

2. Why do we have to document our project?

Then comes the need for documentation. Documenting is paramount for any project; in our case, it serves two major goals.

First, we document the project to make it reusable by other members of the data team. Equally, we document it so that the financial and commercial teams can reuse it and easily build their own dashboards.

How does dbt help us do it?

Concretely, we fill in a .yml file of the project with a piece of code declaring the description of each column.
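As a sketch, the declaration looks like this (the column names and descriptions below are illustrative):

# a sketch of the documentation block in the project's .yml file
version: 2

models:
  - name: revenue_schedule_by_media_campaign_and_campaign_item
    description: "Revenue schedule broken down by media campaign and campaign item."
    columns:
      - name: media_campaign_id
        description: "Salesforce identifier of the media campaign."
      - name: scheduled_revenue
        description: "Revenue amount scheduled for the period."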

Then we simply launch the two following commands, which build the documentation site and serve it locally:
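dbt docs generate   # compile the project and produce the documentation artifacts
dbt docs serve      # serve the documentation website locally (port 8080 by default)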

Then this page appears:

Documentation of a model on dbt

From this page, one can also access the following lineage graph, which is paramount for giving others a full view of the dependencies between the tables, data sources, and downstream applications/visualizations of your project.

The prerequisite for making the links between sources and models is to reference the source in a piece of Jinja template code. As an example, you can link the model stg_orders to jaffle_shop.payment by writing:

select * from {{ source('jaffle_shop', 'payment') }}
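For the source() call to resolve, the source itself must be declared in a .yml file of the project. A minimal sketch, where the database name is hypothetical:

# a sketch of the source declaration matching the Jinja call above
version: 2

sources:
  - name: jaffle_shop     # first argument of source()
    database: raw         # hypothetical location of the raw data
    tables:
      - name: payment     # second argument of source()
      - name: orders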

3. Easy test implementation? dbt test!

Furthermore, it’s always necessary to monitor the data that we extract every day.

The first reason is that the schema may have changed (for this need we use predefined generic tests) or the data may not have updated. But sometimes the team simply needs to keep control over the data that is extracted and ingested every day, to prevent input errors, or to be sure that the data preprocessing and modeling produced clean and complete information.

In addition, one of the main reasons for implementing tests in dbt is to warn data analysts when an update to a model's query has introduced a mistake in that model or in a model that depends on it.
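dbt ships with built-in generic tests (unique, not_null, accepted_values, relationships) that are attached to columns in the project's .yml file. A sketch, with a hypothetical column name:

# a sketch: built-in generic tests attached to a column
version: 2

models:
  - name: revenue_schedule_by_media_campaign_and_campaign_item
    columns:
      - name: campaign_item_id   # hypothetical column
        tests:
          - unique
          - not_null

Running dbt test then executes them and reports any failures in the logs, as below.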

Logs resulting from dbt test execution of a dbt built-in generic test

Additionally, one can add more generic tests through dbt packages (dbt_expectations is a good and user-friendly one!) covering row counts, value types, metrics, etc. Even more specific tests can be implemented in the dbt project if you know how to handle Jinja templates, as sketched below.
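For instance, a singular test is simply a SQL file in the tests/ directory; dbt reports a failure if the query returns any rows. A hypothetical example (the file name, column, and rule are illustrative):

-- tests/assert_no_negative_revenue.sql: a hypothetical singular test
-- dbt flags the test as failing if this query returns any rows
select *
from {{ ref('revenue_schedule_by_media_campaign_and_campaign_item') }}
where scheduled_revenue < 0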

Beyond that, one can find many more features in the dbt docs: seeds, sources, exposures, analyses, and metrics. And if nothing in this list meets your need, have a look at the different packages available on dbt Hub!

No technical limits for dbt?

However, dbt is still under development, and many useful resources of the project remain unexploited.

For example, there is no built-in way to easily monitor the results of tests, or even of model executions, over time. Is it possible anyway? The answer is yes!

The most efficient way we have found so far is to load the execution logs of the dbt commands into a SQL table and do some cleaning and BI modeling. Then you can create your own dashboard to monitor the execution of commands on your project.
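One possible starting point, assuming a Unix shell with jq available: every dbt invocation writes a target/run_results.json artifact, whose per-node status and timing can be flattened into CSV rows before loading them into the warehouse.

# a sketch: flatten each node's status and timing from dbt's run artifact
jq -r '.results[] | [.unique_id, .status, .execution_time] | @csv' \
    target/run_results.json >> dbt_run_results.csv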

Furthermore, the documentation is formatted by dbt and made available as an HTML page; dbt delivers no other clean output of the documentation.

Nonetheless, all the information is stored in a file named manifest.json, which you can use to create your own CSV files of documentation or to build your own documentation page.
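As a sketch of what that can look like, again assuming jq is available, this extracts every model's column descriptions from the artifact:

# a sketch: export (model, column, description) rows from dbt's manifest
jq -r '.nodes[] | select(.resource_type == "model")
    | .name as $model | .columns[] | [$model, .name, .description] | @csv' \
    target/manifest.json > documentation.csv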

Alternatively, one can plug in an external layer named Castor to bring more trust and visibility to your data source within your data team: it gives you access to a catalog and a more detailed lineage graph.

Just try dbt, you won't be disappointed

Even with these limitations, dbt remains a very powerful and promising tool in the data industry. So, if you were wondering how to ease your data transformations or make your database more reusable, you now know how dbt can help you!

Finally, I would like to thank Simon Gautrey and Emeric Trossat for supervising my work on BigQuery and dbt, as well as Franck Haible for taking the time to explain the concrete meaning and business rules of the Salesforce data I worked on and documented.
