Objective: Generate a down-sampled FileDataset using random sampling from a larger FileDataset to be used in a Data Labeling project.
Details: I have a large FileDataset containing millions of images. Each filename identifies the 'section' the image was taken from, and a section may contain thousands of images. I want to randomly select a specific number of sections, keep all the images associated with those sections, and register the sample as a new dataset.
Please note that the code below is not a direct copy-and-paste: some elements, such as file paths and variable names, have been changed for confidentiality reasons.
import azureml.core
from azureml.core import Dataset, Datastore, Workspace
# Load the workspace from the saved config file
ws = Workspace.from_config()
# Define full dataset of interest and retrieve it
dataset_name = 'complete_2017'
data = Dataset.get_by_name(ws, dataset_name)
# Extract file references from dataset as relative paths
rel_filepaths = data.to_path()
# Stitch back in base directory path to get a list of absolute paths
src_folder = '/raw-data/2017'
abs_filepaths = [src_folder + path for path in rel_filepaths]
# Define regular expression pattern for extracting source section
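import re
# Raw string avoids invalid escape sequences; the dot before 'jpg' is escaped so it matches a literal '.'
pattern = re.compile(r'/(S.+)_image\d+\.jpg')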
# Create new list of all unique source sections
sections = sorted({m.group(1) for m in map(pattern.match, rel_filepaths) if m})
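# Randomly select sections
import random
num_sections = 5
set_seed = 221020
random.seed(set_seed) # for repeatability
# random.sample draws without replacement, so the selected sections are distinct
# (random.choices can return the same section more than once)
sample_sections = random.sample(sections, k = num_sections)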
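# Extract images related to the selected sections
# Compare the extracted section name exactly; a substring test would let 'S1' also match 'S10'
selected = set(sample_sections)
matching_images = [src_folder + path for path, m in zip(rel_filepaths, map(pattern.match, rel_filepaths)) if m and m.group(1) in selected]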
# Define datastore of interest
datastore = Datastore.get(ws, 'ml-datastore')
# Convert string paths to Azure Datapath objects and relate back to datastore
from azureml.data.datapath import DataPath
datastore_path = [DataPath(datastore, filepath) for filepath in matching_images]
# Generate new dataset using from_files() and filtered list of paths
sample = Dataset.File.from_files(datastore_path)
sample_name = 'random-section-sample'
sample_dataset = sample.register(workspace = ws, name = sample_name, description = 'Sampled sections from full dataset using set seed.')
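The section-selection logic above can be checked offline, without a workspace, before anything is registered. Below is a minimal sketch under the assumption that filenames follow the `/<section>_image<number>.jpg` convention from the code above. The helper name `sample_by_section` and the demo paths are hypothetical and used only for illustration.

```python
import random
import re

# Same filename convention as in the question: '/<section>_image<number>.jpg'
PATTERN = re.compile(r'/(S.+)_image\d+\.jpg')

def sample_by_section(rel_paths, num_sections, seed):
    """Pick num_sections distinct sections and return them with all their files."""
    by_section = {}
    for path in rel_paths:
        m = PATTERN.match(path)
        if m:  # paths that do not follow the convention are skipped
            by_section.setdefault(m.group(1), []).append(path)
    rng = random.Random(seed)  # local generator; global random state is untouched
    chosen = rng.sample(sorted(by_section), k=num_sections)  # without replacement
    files = [p for section in chosen for p in by_section[section]]
    return chosen, files

# Made-up paths for illustration only
demo_paths = ['/S1_image1.jpg', '/S1_image2.jpg', '/S10_image1.jpg',
              '/S2_image1.jpg', '/notes.txt']
chosen, files = sample_by_section(demo_paths, num_sections=2, seed=221020)
print(chosen)
print(files)
```

Grouping files by their extracted section name means each file is matched to exactly one section, and a fixed seed gives the same selection on every run as long as the input paths are unchanged.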
Issue: The code I've written with the Python SDK runs and the new FileDataset registers, but when I try to view the dataset details or use it in a Data Labeling project, I get the following error, even as Owner:
Access denied: Failed to authenticate data access with Workspace system assigned identity. Make sure to add the identity as Reader of the data service.
Additionally, under the Details tab, "Files in dataset" shows "Unknown" and "Total size of files in dataset" shows "Unavailable".
I haven't come across this issue anywhere else. I'm able to generate datasets in other ways, so I suspect it's an issue with the code given that I'm working with the data in an unconventional way.
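One possibly more conventional way to express the same sample is one wildcard path per selected section, instead of one DataPath per image. The sketch below only builds the pattern strings. It assumes that `Dataset.File.from_files` accepts wildcard paths (my understanding of SDK v1, not verified against this workspace). The helper `section_globs` is hypothetical, and this is not presented as a fix for the access error.

```python
from fnmatch import fnmatch

def section_globs(src_folder, sections):
    """Build one wildcard path per section, e.g. '/raw-data/2017/S1_image*.jpg'."""
    base = src_folder.rstrip('/')
    return [f"{base}/{section}_image*.jpg" for section in sections]

# Example section names are made up for illustration
globs = section_globs('/raw-data/2017', ['S1', 'S10'])
print(globs)

# The 'S1' pattern does not pick up files from section 'S10'
print(fnmatch('/raw-data/2017/S10_image1.jpg', globs[0]))

# With a workspace available, the patterns could be passed on like this (untested sketch):
# from azureml.core import Dataset
# from azureml.data.datapath import DataPath
# sample = Dataset.File.from_files(path=[DataPath(datastore, g) for g in globs])
```

This keeps the dataset definition to a handful of paths even when each section holds thousands of images.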
Additional Notes:
One of my colleagues discovered that the managed identities were blocking the preview functionality. Once that aspect of the identities was changed, we could preview the data and use it in a Data Labeling project.