Associating a combination's indices with a value

I am working on a program for which I need combinations of the distances between atoms, or various points in 3D space. Here's an example:

A file 'test' contains the following information:

Ti 1.0 1.0 1.0

O 0.0 2.0 0.0

O 0.0 0.0 0.0

Ti 1.0 3.0 4.0

O 2.0 5.0 0.0

I would like my code to compute all combinations of distances between the points (which I've done!), then, I need to count the number of times that the distance between one atom and another is less than 2.2.

That's confusing in words, so I'll show you what I've got so far.

#!/usr/bin/env python
import sys, math, scipy, itertools
import numpy as np

try:
    infile = sys.argv[1]

except:
    print "Needs file name"
    sys.exit(1)

#opening files for first part
ifile = open(infile, 'r')
coordslist = []

#Creating a file of just coordinates that can be 'mathed on'
for line in ifile:
    pair = line.split()
    atom = (pair[0]); x = float(pair[1]); y = float(pair[2]); z = float(pair[3])
    coordslist += [(x,y,z)]
ifile.close()

#Define distance
def distance(p0,p1):
    return math.sqrt((p0[0] - p1[0])**2 + (p0[1] - p1[1])**2 + (p0[2] - p1[2])**                                          2)

#Initializing for next section
dislist = []
bondslist = []

#Compute distances between all points 1-2, 1-3, 1-4, etc.
for p0, p1 in itertools.combinations(coordslist,2):
    print p0, p1, distance(p0,p1)
    dislist += [distance(p0, p1)]
    if distance(p0,p1) < 2.2:
        bondslist += [(p0, distance(p0,p1))]
print bondslist
print dislist

I wasn't sure if making these lists would help me or not. So far, they haven't.

The output is:

(1.0, 1.0, 1.0) (0.0, 2.0, 0.0) 1.73205080757

(1.0, 1.0, 1.0) (0.0, 0.0, 0.0) 1.73205080757

(1.0, 1.0, 1.0) (1.0, 3.0, 4.0) 3.60555127546

(1.0, 1.0, 1.0) (2.0, 5.0, 0.0) 4.24264068712

(0.0, 2.0, 0.0) (0.0, 0.0, 0.0) 2.0

(0.0, 2.0, 0.0) (1.0, 3.0, 4.0) 4.24264068712

(0.0, 2.0, 0.0) (2.0, 5.0, 0.0) 3.60555127546

(0.0, 0.0, 0.0) (1.0, 3.0, 4.0) 5.09901951359

(0.0, 0.0, 0.0) (2.0, 5.0, 0.0) 5.38516480713

(1.0, 3.0, 4.0) (2.0, 5.0, 0.0) 4.58257569496

[((1.0, 1.0, 1.0), 1.7320508075688772), ((1.0, 1.0, 1.0), 1.7320508075688772), ((0.0, 2.0, 0.0), 2.0)]

[1.7320508075688772, 1.7320508075688772, 3.605551275463989, 4.242640687119285, 2.0, 4.242640687119285, 3.605551275463989, 5.0990195135927845, 5.385164807134504, 4.58257569495584]

One thing I need from this output is the number of times each atom has a distance less than 2.2, for example:

1 2 (because atom 1 has two distances less than 2.2 associated with it)

2 2

3 2 

4 0

5 0

I also need to see what two atoms are making that less-than-2.2 distance. I'm doing this to calculate Pauling charges; this is where you need to look at an atom, determine how many bonds there are to it (atoms less than 2.2 angstroms away), then look at the atoms attached to that atom, and see how many atoms are attached to those. It's terribly frustrating, but it's all going to be dependent on keeping track of each atom instead of just their combinations. An array will probably be extremely useful.

I've checked here and here for help and I think I need to combine these methods in some way. Any help is unbelievably appreciated!

Solution

Before we begin, let me note that in case of crystals (and I have a bit of a suspicion that you're not dealing with a Ti2O3 molecule) you should be careful with the periodic boundary conditions, i.e. the two last atoms that are distant from everyone might be closer to an atom in a neighbouring cell.

What you're trying to do is very simple if you know what tools to use. You're looking for a method that will tell you the pairwise distance between all the points in a set. The function that does this is called pdist, scipy.spatial.distance.pdist to be precise. This can compute pairwise distance for arbitrary sets of points in any dimensions, with any kind of distance. In your specific case, the default Euclidean distance will do.

The pair-wise matrix distance of a set of points (with element [i,j] telling you the distance between points i and j) is symmetric by construction, with zeros in the diagonal. For this reason the usual implementations of pdist return only the off-diagonal elements on one side of the diagonal, and scipy's version is no exception. However, there's the handy scipy.spatial.distance.squareform function that will turn an array containing such a condensed version of a purely off-diagonal symmetrical matrix, and makes it full. From there it's easy to post-process.

Here's what I'd do:

import numpy as np
import scipy.spatial as ssp

# atoms and positions:
# Ti 1.0 1.0 1.0
# O  0.0 2.0 0.0
# O  0.0 0.0 0.0
# Ti 1.0 3.0 4.0
# O  2.0 5.0 0.0

# define positions as m*n array, where n is the dimensionality (3)
allpos = np.array([[1.,1,1],  # 1. is lazy for dtype=float64
                   [0,2,0], 
                   [0,0,0],
                   [1,3,4],
                   [2,5,0]])

# compute pairwise distances
alldist_condensed = ssp.distance.pdist(allpos)       # vector of off-diagonal elements on one side
alldist = ssp.distance.squareform(alldist_condensed) # full symmetric distance matrix

# set diagonals to nan (or inf) to avoid tainting our output later
fancy_index = np.arange(alldist.shape[0])
alldist[fancy_index,fancy_index] = np.nan

# find index of "near" neighbours
thresh = 2.2
neighbslist = [np.where(alldist[k,:]<thresh)[0] for k in range(alldist.shape[0])]  # the k'th element is an array containing the indices which are "close" to atom number k

# find total number of "near" neighbours
nearnum = [neighbs.size for neighbs in neighbslist] # the k'th element is the number of atoms which are "close" to atom number k

So for your specific case, alldist contains the full distance matrix:

array([[        nan,  1.73205081,  1.73205081,  3.60555128,  4.24264069],
       [ 1.73205081,         nan,  2.        ,  4.24264069,  3.60555128],
       [ 1.73205081,  2.        ,         nan,  5.09901951,  5.38516481],
       [ 3.60555128,  4.24264069,  5.09901951,         nan,  4.58257569],
       [ 4.24264069,  3.60555128,  5.38516481,  4.58257569,         nan]])

As you can see, I manually set the diagonal elements to np.nan. This is necessary since I'm intending to check the elements of this matrix that are less than thresh, and the zeros in the diagonal would surely qualify. In our case np.inf would have been an equally good choice for these elements, but then what if you wanted to get points that are further from each other than thresh? Clearly for that case -np.inf or np.nan would be acceptable (so I went with the latter).

The post-processing of near neighbours leads us out of the realm of numpy (you should always stick with numpy as long as you can, that's usually fastest). For each atom you want to get a list of those atoms that are near it. Well, this is not an object with constant length for each atom, so you can't store this nicely in an array. The logical conclusion is to use a list, but then you can go all python and use a list comprehension to construct this list (reminder from above):

neighbslist = [np.where(alldist[k,:]<thresh)[0] for k in range(alldist.shape[0])]  # the k'th element is an array containing the indices which are "close" to atom number k

Here np.where will find the indices within row k for which the distance is small enough, and the 1d array of indices is stored in the kth element of the resulting list neighbslist. Then it's trivial to check the length of these arrays for each atom, giving you your "number of near neihbours" list. Note that we could've cast the output of np.where to a list in the list comp to leave numpy entirely, but then we would have had to use len(neighbs) instead of neighbs.size in the next line.

So, you have two key variables, two lists to be precise; nearnum[k] is the number of "near" neighbours for atom k (with k in range(allpos.shape[0]), and neighbslist[k] is a 1d numpy array listing the near indices for atom k, so neighbslist[k][j] (for j in range(nearnum[k])) is a number in range(allpos.shape[0]) not equal to k. Come to think of it, this list-of-arrays construction is probably a bit ugly, so you should probably cast this object to a proper list of lists during construction (even if this means a bit of overhead).

I only noticed at the end that your input data is in a file. Not to worry, that can also be readily read using numpy! Assuming that those empty lines are not in your input name test, you can call

allpos = np.loadtxt('test',usecols=(1,2,3))

to read the position matrix into your variable. The usecols option lets numpy ignore the first column of the data, which is not numeric, and would cause problems. We don't really need that anyway.