Small data ops

Consider a small start-up project or company with a team of data scientists seeking to build ML models for narrow, well-defined problems. The team can be agile and highly collaborative. ML models are most likely trained locally on individual data scientists' computers and are then either forgotten about or scaled out and deployed to the cloud for inference. The team may lack a streamlined CI/CD approach for deploying models. They might manage to keep data in central or distributed sources that the team curates carefully, and to version and maintain the training code in a central repository.

When operations start to scale, the team can:

  • Run into situations where multiple people repeat much of the same work, such as preparing data, building ML pipelines that do the same job, or training similar types of ML models.

  • Work in silos, with little visibility into the parallel work of their teammates.

  • Incur huge, or higher than expected, costs due to this mundane, repeated work.

  • Find that code and data start to grow independently of each other.

  • Build artifacts that are not audited and are therefore often not reproducible.

The following characteristics are typical of small data ops:

  • The team consists only of data scientists.

  • Team members work only in Python environments and manage everything within the Python ecosystem. Python is often chosen because many ML libraries and tools are ready to plug and play for quick prototyping and solution building.

  • Little to no big data processing is required, as the data scientists work with small datasets (under 10 GB).

  • ML model development starts quickly on a local computer and then scales out to the cloud when massive compute resources are needed (a minimal sketch of this local prototyping loop follows this list).

  • There is heavy reliance on, and a high support requirement for, open source technologies such as PyTorch, TensorFlow, and scikit-learn for all types of ML, from classical learning to deep learning, both supervised and unsupervised.
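
To make this concrete, here is a minimal sketch of the quick, local prototyping loop described above, assuming scikit-learn and a small dataset that fits in memory. The dataset, model choice, and output file name are illustrative assumptions, not prescriptions from the text.

```python
# A minimal, self-contained sketch of the "train locally on small data" loop.
# The dataset, model, and output path below are illustrative assumptions.
import joblib
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Load a small in-memory dataset (well under the 10 GB ceiling discussed above).
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Quick prototyping: a plug-and-play scikit-learn model, trained on a laptop.
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

# Evaluate and persist the model locally; from here it might be forgotten
# about, or scaled out and deployed to the cloud for inference.
print(f"Test accuracy: {accuracy_score(y_test, model.predict(X_test)):.3f}")
joblib.dump(model, "model.joblib")
```

A loop like this is what makes the team productive early on, but it is also where the pain points above originate: the saved artifact is not audited, and nothing links the code version to the data it was trained on.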