Fall 2022 Data Management Must-Reads

“In almost every position in today’s world, decisions are made based on compiling and analyzing data and implementing strategies based on the data findings. In my world, there isn’t a day that goes by where I am not doing something with data.” – Stephen Howe, Analytics Starts with a Question: How to Better Understand Your Data

Copyright Clearance Center is excited to share the Fall 2022 edition of the “Data Management Must-Reads” series – a thoughtfully curated selection of important articles from the past few months that expound upon “can’t miss” developments in the world of data.

Managing the entire lifecycle of data

Companies often think of the data that they collect or ingest as static. However, data, like a natural resource, has a life cycle of its own. In June, an article from the IEEE Computer Society did an excellent job of discussing why the entire life cycle needs to be managed. The Importance of Data Lifecycle Management (DLM) and Best Practices describes the various stages of the life cycle and highlights the importance of curating and maintaining data.

The data quality challenge in action

Data quality is a well-known challenge for companies of all sizes. A group of engineers at LinkedIn looked at the problem of managing data quality at the scale of data that LinkedIn consumes. Towards data quality management at LinkedIn describes the architecture of the solution they developed, the “Data Health Monitor”, with the goal of improving the quality of the data that LinkedIn uses for machine learning efforts.

Going inside the BLOOM Project

Staying on the topic of machine learning, an article in the MIT Technology Review describes the BLOOM Project (BigScience Large Open-science Open-access Multilingual Language Model), which attempts to address a criticism that has been directed at language models: they are opaque, both in their source code and in the data that is used to train them. Inside a radical new project to democratize AI describes how the project designers hope to make their models as powerful as proprietary ones, but with a transparent process.

Research around databases

For those of you interested in fundamental research around databases, an article in the Communications of the ACM, The Seattle Report on Database Research, describes the most recent in an ongoing series of meetings (held since 1988) to identify promising areas of research for the next five years.

While the author of 8 Levels of Reproducibility: Future-Proofing Your Python Projects uses Python to discuss his ideas, the framework for reproducible research and coding that he lays out is applicable to data science projects using any language.
