In many of our consulting projects we have seen the same misconception: companies store petabytes of data, but when that data is needed to create new insights, it turns out to be unusable.
Time-series data contains structural breaks, such as product codes being merged over the years. Log data has missing values. Product data is highly ambiguous: the same product is listed under three different IDs, and project managers try to resolve it with fuzzy matching in Excel. A doomed endeavour.
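To make the problem concrete, here is a minimal sketch of fuzzy matching done programmatically rather than in Excel, using Python's standard library. The catalogue, IDs and threshold are purely illustrative:

```python
from difflib import SequenceMatcher

def similarity(a: str, b: str) -> float:
    """Normalized edit similarity between two product names (0..1)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def match_ids(names_by_id: dict[str, str], threshold: float = 0.85) -> list[tuple[str, str]]:
    """Return pairs of product IDs whose names are nearly identical."""
    ids = sorted(names_by_id)
    pairs = []
    for i, a in enumerate(ids):
        for b in ids[i + 1:]:
            if similarity(names_by_id[a], names_by_id[b]) >= threshold:
                pairs.append((a, b))
    return pairs

# Illustrative catalogue: the same pump listed under two different IDs.
catalogue = {
    "P-1001": "Hydraulic Pump 20kW",
    "X-77":   "hydraulic pump 20 kW",
    "A-3":    "Gearbox Type B",
}
print(match_ids(catalogue))  # → [('P-1001', 'X-77')]
```

Even this toy version beats manual spreadsheet matching, because it is repeatable and the threshold can be tuned against labeled examples.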
Most of our data science projects consist of all or most of the following steps:
Together we identify and review all data sources available in your business unit or company. More often than not, clients focus on a single source such as SAP and overlook unusual internal or external data sources along the way.
After clearly defining the overall goal to be reached with the help of data, we design a potential setup to gather the needed data points, taking into account governance, technical feasibility, licensing costs and other factors.
While nobody wants to talk about it, the reality is that roughly 80% of the effort in data science projects goes into acquiring and cleansing the required data. We help you with this step and make sure that you build your analytics and machine learning efforts on a solid foundation.
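What cleansing means in practice varies by source, but the recurring pattern is: drop records that cannot be repaired, impute those that can. A minimal sketch, with hypothetical log records and a simple median imputation chosen for illustration:

```python
from statistics import median

def cleanse(rows: list[dict]) -> list[dict]:
    """Drop rows without an ID; impute missing readings with the median."""
    valid = [r for r in rows if r.get("id") is not None]
    known = [r["reading"] for r in valid if r.get("reading") is not None]
    fill = median(known)
    return [
        {**r, "reading": r["reading"] if r.get("reading") is not None else fill}
        for r in valid
    ]

logs = [
    {"id": "a",  "reading": 4.0},
    {"id": None, "reading": 9.9},   # no ID: unusable, dropped
    {"id": "b",  "reading": None},  # missing value: imputed with median 5.0
    {"id": "c",  "reading": 6.0},
]
print(cleanse(logs))
```

In a real project the imputation strategy (median, forward-fill, model-based) is a deliberate choice made together with the domain experts, not a default.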
With the previously defined use cases in mind, we explore the cleansed data sources and verify that the final goal is in line with the available data quantity and quality. If problems surface at this stage, we can still go back to acquisition and cleansing, saving a lot of money that would otherwise be spent further down the process.
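One simple, automatable check at this stage is field completeness against per-use-case requirements. The fields, sample records and thresholds below are illustrative assumptions:

```python
def completeness(rows: list[dict], fields: list[str]) -> dict[str, float]:
    """Share of non-missing values per field."""
    n = len(rows)
    return {f: sum(r.get(f) is not None for r in rows) / n for f in fields}

def fit_for_use(rows: list[dict], requirements: dict[str, float]) -> bool:
    """True if every field meets its required completeness."""
    scores = completeness(rows, list(requirements))
    return all(scores[f] >= req for f, req in requirements.items())

sample = [
    {"price": 10.0, "region": "EU"},
    {"price": None, "region": "US"},
    {"price": 12.5, "region": None},
    {"price": 8.0,  "region": "EU"},
]
print(completeness(sample, ["price", "region"]))  # {'price': 0.75, 'region': 0.75}
print(fit_for_use(sample, {"price": 0.9, "region": 0.5}))  # False: price too sparse
```

A failing check here is exactly the signal to loop back to acquisition and cleansing before any modeling begins.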
For each of the desired use cases we then build a functional prototype to prove the added value. Together with all relevant stakeholders, we iterate on the prototype until we are confident it fulfills the need and offers great usability. Only then do we proceed to the final step of building a product.
After ensuring that we have sufficiently high-quality data and can fulfill the stakeholders' product requirements, we consolidate all steps into a fully working data product.