Context: I’m processing Reddit data. There is too much data to handle, so I took a random sample of it. As a result, my network has a lot of "isolated" nodes (in quotes because isolated nodes usually have degree 0, whereas here I mean nodes of degree <= 2). An image is better than anything else:
The whole big gray ring is composed of nodes that are of degree 1 or 2.
Hence, I’d like to get rid of those nodes in order to have a more meaningful graph based on the sample I have.
Is this the correct approach? Is it feasible?
The potential problem with your current method is that it will bias your sample and change the topology of your network, so that it is no longer representative of the original population (i.e. the entire Reddit graph).
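To make the concern concrete, here is a minimal sketch of the pruning step from the question, assuming the graph is stored as a plain dict mapping each node to a set of neighbours. The helper names `build_adjacency` and `prune_low_degree` and the toy edge list are hypothetical, chosen only for illustration. It shows one way the topology shifts: removing low-degree nodes lowers the degree of the nodes that remain.

```python
from collections import defaultdict


def build_adjacency(edges):
    """Build an undirected adjacency dict {node: set(neighbours)} from edge pairs."""
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    return dict(adj)


def prune_low_degree(adj, max_degree=2):
    """Single pass: drop every node whose degree is <= max_degree.

    Returns the induced subgraph on the surviving nodes.
    """
    keep = {n for n, nbrs in adj.items() if len(nbrs) > max_degree}
    return {n: {m for m in adj[n] if m in keep} for n in keep}


# Toy graph (made up for illustration): nodes 4 and 5 are leaves.
edges = [(0, 1), (0, 2), (0, 3), (1, 2), (1, 3), (3, 4), (2, 5)]
adj = build_adjacency(edges)
pruned = prune_low_degree(adj)

# Degrees before and after pruning, for the surviving nodes only.
for node in sorted(pruned):
    print(node, len(adj[node]), "->", len(pruned[node]))
```

In this toy graph, nodes 2 and 3 have degree 3 before pruning and degree 2 afterwards, so a single pass leaves new low-degree nodes behind. Repeating the pass until nothing changes would instead give a k-core, which is a different object from the original sample again.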
There is no perfect solution to your problem; the right choice depends on which features you want to preserve. A common method with some empirical and theoretical support is random walk sampling. It can be costly, but it preserves some topological features (such as the long-tailed degree distribution of most networks) under a range of contexts. Other methods, such as snowball sampling or uniform random sampling (which you seem to have used here), have some of the same issues, but some undesirable subsampling may be necessary for computational reasons.
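As a rough illustration of random walk sampling, here is a minimal sketch of one common variant, a random walk with restart. The variant, the function name `random_walk_sample`, the restart probability of 0.15, and the step cap are all assumptions on my part, not something prescribed above. It also assumes the graph is a dict mapping each node to a set of neighbours and that node labels are sortable.

```python
import random


def random_walk_sample(adj, n_nodes, restart_prob=0.15, seed=None, max_steps=100_000):
    """Collect up to n_nodes distinct nodes by walking the graph from a random start.

    At each step the walker either jumps back to the start node (with
    probability restart_prob, or when it is stuck on a node without
    neighbours) or moves to a uniformly chosen neighbour.
    Returns the induced subgraph on the visited nodes.
    """
    rng = random.Random(seed)
    start = rng.choice(sorted(adj))  # sorted() keeps seeded runs reproducible
    current = start
    visited = {start}
    steps = 0
    # max_steps guards against looping forever in a small component.
    while len(visited) < n_nodes and steps < max_steps:
        steps += 1
        neighbours = adj[current]
        if not neighbours or rng.random() < restart_prob:
            current = start
        else:
            current = rng.choice(sorted(neighbours))
        visited.add(current)
    # Keep only edges whose endpoints were both visited.
    return {n: adj[n] & visited for n in visited}


# Toy graph (made up for illustration): 30 nodes, each linked to the next two.
toy = {i: set() for i in range(30)}
for i in range(30):
    for step in (1, 2):
        j = (i + step) % 30
        toy[i].add(j)
        toy[j].add(i)

sample = random_walk_sample(toy, n_nodes=10, seed=0)
print(len(sample), "nodes sampled")
```

Because the walker follows edges, every sampled node other than the start is reached through a neighbour, so the sample tends to contain far fewer stranded low-degree nodes than a uniform node sample. The cost is that you need neighbour lookups on the full graph, and if the walk starts in a small component it may return fewer than `n_nodes` nodes.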