Tags: python, numpy, deep-learning, rdkit

Deepchem disk data to numpy array


I am using the DeepChem wrapper for the GraphConvolution model as follows. My SMILES data is in a .csv file that consists of 5 molecules with their SMILES representations and their respective activities. The data can be accessed from here directly.

Importing the libraries:

from __future__ import division
from __future__ import print_function
from __future__ import unicode_literals
import numpy as np
import tensorflow as tf
import deepchem as dc
from deepchem.models.tensorgraph.models.graph_models import GraphConvModel

Loading the data and featurizing it so that it is suitable for graph convolution:

graph_featurizer = dc.feat.graph_features.ConvMolFeaturizer()
loader_train = dc.data.data_loader.CSVLoader(
    tasks=['Activity'], smiles_field="smiles", featurizer=graph_featurizer)
dataset_train = loader_train.featurize('./train_smiles_data.csv')

Analyzing the loaded and featurized data (My Try)

dataset_train.X

array([<deepchem.feat.mol_graphs.ConvMol object at 0x7f8bfc3ad748>,
       <deepchem.feat.mol_graphs.ConvMol object at 0x7f8bfc367828>,
       <deepchem.feat.mol_graphs.ConvMol object at 0x7f8bfc367208>,
       <deepchem.feat.mol_graphs.ConvMol object at 0x7f8bfc369c50>],
      dtype=object)


dataset_train.y

array([[2.71],
       [4.41],
       [3.77],
       [4.2 ]])

for x, y, w, id in dataset_train.itersamples():
    print(x, y, w, id)

<deepchem.feat.mol_graphs.ConvMol object at 0x7f8bfc3ad6a0> [2.71] [1.] CC1=C(O)C=CC=C1
<deepchem.feat.mol_graphs.ConvMol object at 0x7f8bfc30f518> [4.41] [1.] [O-][N+](=O)C1=CC=C(Br)S1
<deepchem.feat.mol_graphs.ConvMol object at 0x7f8bfc30f748> [3.77] [1.] CCC/C=C/C=O
<deepchem.feat.mol_graphs.ConvMol object at 0x7f8bfc30f940> [4.2] [1.] CCCCCC1=CC=CS1

What do I want?

As the output above shows, dataset_train.X is an array of objects such as <deepchem.feat.mol_graphs.ConvMol object at 0x7f8bfc3ad6a0>, not a numeric numpy array like dataset_train.y.

How do I know what type of data is stored in dataset_train.X? How can I see the data stored in dataset_train.X? In other words, how can I convert dataset_train.X into a format in which I can inspect the data?

I believe there should be some way to do that.


Solution

  • As noted in your previous question, dataset_train.X is an array of ConvMol objects. Each ConvMol object is a container for the features of one of your input molecules. These features are not stored as a plain numeric array like your targets in dataset_train.y, because they are more complex graph features. Look at the source code for the ConvMol object again, and at the source code for the ConvMolFeaturizer. You can then determine how you want to interpret these features:

    # Inspect features for molecule 0
    conv_feature = dataset_train.X[0]
    # Print the atom features
    print(conv_feature.get_atom_features())
    # Print the adjacency list
    print(conv_feature.get_adjacency_list())
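
If the goal is to get plain numpy arrays for inspection, one possible approach is to turn the two pieces of information printed above into dense arrays. The sketch below uses only NumPy and made-up stand-in data. It assumes that the adjacency list is a list with one entry per atom, each entry holding the indices of that atom's neighbours, and that the atom features form an (n_atoms, n_features) array. The helper adjacency_list_to_matrix and the toy_* variables are hypothetical names for illustration and are not part of DeepChem.

```python
import numpy as np


def adjacency_list_to_matrix(adj_list):
    """Build a dense, symmetric 0/1 adjacency matrix from per-atom neighbour lists."""
    n_atoms = len(adj_list)
    adj_matrix = np.zeros((n_atoms, n_atoms), dtype=np.int64)
    for atom, neighbours in enumerate(adj_list):
        for neighbour in neighbours:
            # Mark the bond in both directions so the matrix stays symmetric
            adj_matrix[atom, neighbour] = 1
            adj_matrix[neighbour, atom] = 1
    return adj_matrix


# Hypothetical stand-in for an adjacency list: a three-atom chain 0-1-2
toy_adj_list = [[1], [0, 2], [1]]
# Hypothetical stand-in for atom features: 3 atoms, 2 features each
toy_atom_features = np.array([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]])

toy_adj_matrix = adjacency_list_to_matrix(toy_adj_list)
print(toy_atom_features.shape)
print(toy_adj_matrix)
```

With real data, and if the assumptions above hold for your DeepChem version, you would pass conv_feature.get_adjacency_list() in place of toy_adj_list and wrap conv_feature.get_atom_features() in np.asarray(...) in place of toy_atom_features.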