We work on the generation of synthetic datasets through simulation. Our systems are able to generate near-infinite permutations of complex, domain-specific, 3D environments. Within these environments, we can simulate a range of sensors, from standard RGB cameras to lidar, depth sensors, ultrasound, radar and more.
The known ground truth of the system allows for highly consistent, rich annotation. This exciting approach opens new paths for machine learning in cases where a lack of the right data is currently the limiting factor.
Why synthetic data?
- The ability to generate highly specific data at scale and with controlled variation: for example, the simulation of medical imaging techniques across highly controlled permutations of human anatomy, or the generation of a range of defects on generated strawberries or on CAD models for industry.
- The ability to generate data before a system is operational. This can enable the training of vision systems for production lines before these are operational or the generation of a range of sensor data through our systems for autonomous robot development.
- Rich labeling: as our 3D environments are defined by our systems, we can annotate any generated feature at will. This can be highly consistent labeling of product quality, the volume of individual beans, perfect segmentation of ultrasound images and much more.
- Full control over datasets: if biases are present in your real-world data, or if certain edge cases are underrepresented, synthetic data offers a truly data-centric approach to address these challenges.
How are these datasets generated?
A simulation consists of two main components:
- A definition of the project environment through 3D volumes, meshes and behavioral definitions.
- The simulation of the different sensors inside these 3D environments.
A modular approach to the development of geometry generators and modifiers is key. A parametric interface for each of these modules allows for the controlled generation of near-infinite permutations of geometry, materials and behavior.
A combination of modules can be compiled into a new module with its own high-level interface. A network of these modules can generate complex environments. This architecture allows for a scriptable interface to the simulation and computing farm.
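The idea of parametric modules that compile into a higher-level module can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the scene representation (a list of dictionaries) and the names `scatter_spheres`, `add_defects`, `compose`, `fruit_tray` and `generate` are hypothetical and do not describe our actual systems.

```python
import random
from typing import Callable, Dict, List

# Illustrative assumption: a scene is a plain list of object descriptions.
Scene = List[Dict[str, float]]
# A module takes a scene and a seeded random generator and returns a new scene.
Module = Callable[[Scene, random.Random], Scene]


def scatter_spheres(count: int, max_radius: float) -> Module:
    """Hypothetical geometry generator: adds `count` randomly placed spheres."""
    def run(scene: Scene, rng: random.Random) -> Scene:
        spheres = [
            {
                "x": rng.uniform(-1.0, 1.0),
                "y": rng.uniform(-1.0, 1.0),
                "radius": rng.uniform(0.1 * max_radius, max_radius),
                "defect": 0.0,
            }
            for _ in range(count)
        ]
        return scene + spheres
    return run


def add_defects(probability: float) -> Module:
    """Hypothetical modifier: flags each object as defective with a given probability."""
    def run(scene: Scene, rng: random.Random) -> Scene:
        return [
            {**obj, "defect": 1.0 if rng.random() < probability else obj["defect"]}
            for obj in scene
        ]
    return run


def compose(*modules: Module) -> Module:
    """Compile several modules into one module with the same interface."""
    def run(scene: Scene, rng: random.Random) -> Scene:
        for module in modules:
            scene = module(scene, rng)
        return scene
    return run


def fruit_tray(count: int, defect_rate: float) -> Module:
    """High-level module that exposes only two parameters to the user."""
    return compose(scatter_spheres(count, max_radius=0.05), add_defects(defect_rate))


def generate(module: Module, seed: int) -> Scene:
    """Run a module on an empty scene; the seed makes each permutation reproducible."""
    return module([], random.Random(seed))
```

Calling `generate(fruit_tray(count=10, defect_rate=0.3), seed=1)` returns one scene permutation, and because every object carries its own `defect` flag, the label is known by construction. Varying the seed and the parameters from a script is what would make such an interface suitable for driving a computing farm.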
Besides the simulated sensor data, we can use the known ground truth to generate rich labeling.
Three things are essential in developing an effective synthetic model:
- Domain knowledge of the project, by us or in cooperation with partners
- A close working loop with the AI team, whether it’s ours or yours
- A solid data validation pipeline
Architecture of our simulations
In the animation below, you find a clear visualization of the architecture of our simulations.
In a data-centric approach, we analyze the quality of the generated data by comparing the synthetic set to a small set of real-world data. This comparison is done in a suitable feature space.
Two important concepts to analyze are fidelity (how closely the synthetic data resembles the real data) and diversity (whether the synthetic data covers all the variation in the real data).
A range of advanced metrics is used to gather actionable intelligence to refine the synthetic model and improve the generated datasets.
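As an illustration of how fidelity and diversity can be estimated in a feature space, the sketch below uses a k-nearest-neighbor coverage estimate, in the spirit of precision and recall measures for generative models. This is one possible choice and not necessarily one of the metrics we use. The function names are hypothetical, and the feature vectors are assumed to come from a feature extractor that is not shown here.

```python
import math
from typing import List, Sequence, Tuple

Vector = Sequence[float]


def knn_radius(index: int, points: List[Vector], k: int) -> float:
    """Distance from one point to its k-th nearest neighbor in the same set."""
    distances = sorted(
        math.dist(points[index], other)
        for j, other in enumerate(points)
        if j != index
    )
    return distances[k - 1]  # requires k < len(points)


def coverage(queries: List[Vector], reference: List[Vector], k: int) -> float:
    """Fraction of queries that fall inside at least one k-NN ball of the reference set."""
    radii = [knn_radius(i, reference, k) for i in range(len(reference))]
    covered = sum(
        any(math.dist(q, r) <= radius for r, radius in zip(reference, radii))
        for q in queries
    )
    return covered / len(queries)


def fidelity_and_diversity(
    real: List[Vector], synthetic: List[Vector], k: int = 1
) -> Tuple[float, float]:
    """Return (fidelity, diversity) as two coverage fractions between 0 and 1."""
    fidelity = coverage(synthetic, real, k)   # synthetic samples that resemble real ones
    diversity = coverage(real, synthetic, k)  # real samples that the synthetic set reaches
    return fidelity, diversity
```

A synthetic set that clusters around a single real sample would score high on fidelity but low on diversity, which points to widening the parameter ranges of the simulation rather than improving its realism.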
Our way of working
We offer synthetic data in three different ways:
• as part of a larger Demcon high-tech or medical system or product development
• as part of a vision or AI project
• as a stand-alone service.
In all cases, we believe that close, iterative cooperation between the AI teams and the synthetic data team is key.
In the first phase of a typical project, we focus on reducing the domain gap between our simulation and a small set of real-world data, if available. In the subsequent phase, we expand the simulation to cover the full project domain.