Newly Released Datasets for ML/DL 💻

Julia Flash - Oct 14 '20 - Dev Community

Previously published on my personal blog

image
Credit: Marketscale

Everyone knows about the datasets from Kaggle and other sources like UCI, but there are treasure troves of data out there at your fingertips right now that you might not know of!

One that you may not be familiar with is the Data Asset Exchange (DAX), a repository of open source datasets recently published by enterprise research teams. There are 30+ datasets, many of them released just this year!

The folks behind this effort are from the Center for Open-Source Data & AI Technologies (CODAIT), a group of data scientists and open source developers dedicated to making open source AI easier to use. 🤗

codait max llama
An astounding llama depicted in the 4 styles available from the ‘Fast Neural Style Transfer’ model by CODAIT.
Credit: Nick Kasten

Where are the datasets?

You can find all the open source datasets here.

How do I download them?

Go to the link above and click on a dataset.
Then when you are on the landing page, click Get this dataset.

That's all there is to it!

image

Keep reading if you want to preview the data before downloading it, though! It's always smart to see what you are downloading first.

What are the data formats?

They come in CSV, JSON, WAV, JPG, PNG, IOB, HDF5 and others depending on what the data is collected for.
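Of the formats listed above, IOB is probably the least familiar: it is a plain-text tagging scheme where each token carries a B- (beginning), I- (inside), or O (outside) tag. Here is a minimal sketch of reading it in Python. It assumes one whitespace-separated "token tag" pair per line and a blank line between sentences; the exact layout of any given DAX file may differ, and the sample text is made up for illustration.

```python
from typing import List, Tuple


def parse_iob(text: str) -> List[List[Tuple[str, str]]]:
    """Split IOB text into sentences of (token, tag) pairs.

    Assumed layout: "token tag" per line, blank line between sentences.
    """
    sentences, current = [], []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            # A blank line closes the current sentence.
            if current:
                sentences.append(current)
                current = []
            continue
        parts = line.split()
        # First column is the token, last column is the tag.
        current.append((parts[0], parts[-1]))
    if current:
        sentences.append(current)
    return sentences


def extract_entities(sentence: List[Tuple[str, str]]) -> List[Tuple[str, str]]:
    """Group B-/I- tagged tokens into (entity_text, entity_type) spans."""
    entities = []
    tokens, label = [], None
    for token, tag in sentence:
        if tag.startswith("B-"):
            if tokens:
                entities.append((" ".join(tokens), label))
            tokens, label = [token], tag[2:]
        elif tag.startswith("I-") and tokens and tag[2:] == label:
            tokens.append(token)
        else:
            # "O" tags (and stray I- tags with no matching B-) end any open span.
            if tokens:
                entities.append((" ".join(tokens), label))
            tokens, label = [], None
    if tokens:
        entities.append((" ".join(tokens), label))
    return entities


# Made-up sample, not taken from any DAX dataset.
sample = """Julia B-PER
Flash I-PER
writes O
about O
Kaggle B-ORG

Wikipedia B-ORG
is O
large O
"""

sentences = parse_iob(sample)
print(extract_entities(sentences[0]))
```

CSV and JSON can be read with Python's built-in `csv` and `json` modules, so IOB is usually the only text format here that needs a small custom reader like this.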

image

Credit: Vince McKelvie

Can I look at the data before downloading it?

Of course! There are data previews for all the datasets here, where you can explore them before downloading.

Let's focus on one dataset as an example to check this out!

Get Started with free DAX Datasets

image

  1. Go to the datasets, then click on a dataset that interests you. That will take you to the landing page for that dataset, like the one I am going to walk through: "IBM Debater® Wikipedia Oriented Relatedness".

image

See the blue rectangle in the image that highlights the text Preview the data & notebooks?

image

  2. Click on that button, which will take you to a page like this for the dataset:

image

Now you can look through each tab to check it out!

image

I'll step through each of these 3 tabs here now so you can understand why they are provided for you.
image
Credit: Quasi Crystals


Dataset Metadata

The Dataset Metadata section shows you what the dataset is used for (Domain), who the authors are and where it came from, as well as a business case for how you could use it.

image

With the IBM Debater® Wikipedia Oriented Relatedness dataset, the business case is:

Automated Customer Service: Train a chatbot to label and compare user query's concept type with list of available concepts the chatbot is capable of discussing.

Dataset Preview

In the Dataset Preview tab, you see the actual dataset values as they are, be it images, POS (part-of-speech) tags, concept origins, or other facets of the data itself.

image

Dataset Glossary

Last but not least, the Dataset Glossary tab!
If you have no idea what certain words within the dataset mean, the glossary helps you understand the terms more thoroughly. I think it's cool because I tend to be one of those people. 🙃
For example, I once had to ask someone what "POS" meant while doing n-gram analysis.

image

Are there examples on how to use the data?

That was my question! Yes, there are example Jupyter notebooks provided for each dataset, which show an analysis using Python kernels.

Go back to the main page with all the datasets here.

Then click on a dataset, which will take you to its landing page, and click the Preview the data & notebooks button in the top-right of the page.

image

Then you will see in the navigation bar of the window these buttons:

image

Click on Preview Notebook.

This will lead you straight to a notebook which shows how to start with or use the dataset in full!

image

Pretty cool, huh?
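If you would rather take a first look at a downloaded CSV file on your own machine, here is a minimal sketch in plain Python. It is not taken from the DAX notebooks: the `preview_csv` helper and the `text`/`label` columns are made up for illustration, and it assumes the file has a header row.

```python
import csv
import io


def preview_csv(handle, n_rows=5):
    """Return the column names, the first n_rows records, and the total row count."""
    reader = csv.DictReader(handle)
    head = []
    total = 0
    for row in reader:
        if total < n_rows:
            head.append(row)
        total += 1
    # fieldnames is None for an empty file, so fall back to an empty list.
    columns = list(reader.fieldnames or [])
    return columns, head, total


# In-memory stand-in for a downloaded file (hypothetical columns and values).
demo = io.StringIO("text,label\nhello,greeting\nbye,farewell\nhi,greeting\n")
columns, head, total = preview_csv(demo, n_rows=2)

print("columns:", columns)
print("rows:", total)
for row in head:
    print(row)
```

For a real file, open it with `open(path, newline="", encoding="utf-8")` and pass the handle in; the encoding is an assumption, so adjust it if the file says otherwise.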

Thank you for going through this short tutorial on what the Data Asset Exchange datasets from the CODAIT team are, where to find them, and how to get started with them!

3 Favorite DAX Datasets

"The dataset consists of 100 discussion threads crawled from Ubuntu Forums discussions. Each message in each individual thread is assigned a dialog label out of following eight classes: question, repeat question, clarification, further details, solution, positive feedback, negative feedback, junk."

"VTC contains 7920 samples, each consisting of a video-text instruction pair and a compliance/non-compliance label. The dataset has over 1.2 million frames. We take a unique approach in data collection so that the dataset can be automatically augmented from a set of core videos. To answer growing concerns on data privacy, we carefully followed privacy preserving safe-guards in the generation of VTC dataset."

"The WikiText-103 dataset is a collection of over 100 million tokens extracted from the set of verified ‘Good’ and ‘Featured’ articles on Wikipedia."
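To make the first description above a bit more concrete: before training a classifier on the forum threads, the eight dialog classes usually need to become integer ids. Here is a minimal sketch of that step. The class names come from the quoted description, but their exact spelling in the files is an assumption, and the `thread` example is made up.

```python
# The eight dialog classes named in the dataset description.
DIALOG_LABELS = [
    "question",
    "repeat question",
    "clarification",
    "further details",
    "solution",
    "positive feedback",
    "negative feedback",
    "junk",
]

LABEL_TO_ID = {label: i for i, label in enumerate(DIALOG_LABELS)}
ID_TO_LABEL = {i: label for label, i in LABEL_TO_ID.items()}


def encode_labels(labels):
    """Map label strings to integer ids, normalizing case and whitespace."""
    ids = []
    for raw in labels:
        key = " ".join(raw.lower().split())
        if key not in LABEL_TO_ID:
            raise ValueError(f"unknown dialog label: {raw!r}")
        ids.append(LABEL_TO_ID[key])
    return ids


# Hypothetical label sequence for one thread, for illustration only.
thread = ["Question", "clarification", "further details", "Solution", "positive feedback"]
encoded = encode_labels(thread)
print(encoded)
print([ID_TO_LABEL[i] for i in encoded])
```

Raising on unknown labels, rather than silently mapping them to "junk", makes spelling mismatches between this list and the real files show up right away.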

Hope you enjoyed this article. Comment if you have any questions! 🐬

image
Credit: Rebloggy
