Removing unnecessary nodes in graph

I have a graph that has two distinct classes of nodes, class A nodes and class B nodes.

Class A nodes are not connected to any other A nodes and class B nodes aren’t connected to any other B nodes, but B nodes are connected to A nodes and vice versa. Some B nodes are connected to lots of A nodes and most A nodes are connected to lots of B nodes.

I want to eliminate as many of the A nodes as possible from the graph.

I must keep all of the B nodes, and they must still be connected to at least one A node (preferably only one A node).

I can eliminate an A node when it has no B nodes connected only to it. Are there any algorithms that could find an optimal, or at least close to optimal, solution for which A nodes I can remove?

Solution

Old, Incorrect Answer, But Start Here

First, you need to recognize that you have a bipartite graph. That is, you can colour the nodes red and blue such that no edges connect a red node to a red node or a blue node to a blue node.

Next, recognize that you're trying to solve a vertex cover problem. From Wikipedia:

In the mathematical discipline of graph theory, a vertex cover (sometimes node cover) of a graph is a set of vertices such that each edge of the graph is incident to at least one vertex of the set. The problem of finding a minimum vertex cover is a classical optimization problem in computer science and is a typical example of an NP-hard optimization problem that has an approximation algorithm.

Since you have a special graph, it's reasonable to think that maybe the NP-hard doesn't apply to you. This thought brings us to Kőnig's theorem which relates the maximum matching problem to the minimum vertex cover problem. Once you know this, you can apply the Hopcroft–Karp algorithm to solve the problem in O(|E|√|V|) time, though you'll probably need to jigger it a bit to ensure you keep all the B nodes.

New, Correct Answer

It turns out this jiggering is the creation of a "constrained bipartitate graph vertex cover problem", which asks us if there is a vertex cover that uses less than a A-nodes and less than b B-nodes. The problem is NP-complete, so that's a no go. The jiggering was hard than I thought!

But using less than the minimum number of nodes isn't the constraint we want. We want to ensure that the minimum number of A-nodes is used and the maximum number of B-nodes.

Kőnig's theorem, above, is a special case of the maximum flow problem. Thinking about the problem in terms of flows brings us pretty quickly to minimum-cost flow problems.

In these problems we're given a graph whose edges have specified capacities and unit costs of transport. The goal is to find the minimum cost needed to move a supply of a given quantity from an arbitrary set of source nodes to an arbitrary set of sink nodes.

It turns out your problem can be converted into a minimum-cost flow problem. To do so, let us generate a source node that connects to all the A nodes and a sink node that connects to all the B nodes.

Now, let us make the cost of using a Source->A edge equal to 1 and give all other edges a cost of zero. Further, let us make the capacity of the Source->A edges equal to infinity and the capacity of all other edges equal to 1.

This looks like the following:

The red edges have Cost=1, Capacity=Inf. The blue edges have Cost=0, Capacity=1.

Now, solving the minimum flow problem becomes equivalent to using as few red edges as possible. Any red edge that isn't used allocates 0 flow to its corresponding A node and that node can be removed from the graph. Conversely, each B node can only pass 1 unit of flow to the sink, so all B nodes must be preserved in order for the problem to be solved.

Since we've recast your problem into this standard form, we can leverage existing tools to get a solution; namely, Google's Operation Research Tools.

Doing so gives the following answer to the above graph:

The red edges are unused and the black edges are used. Note that if a red edge emerges from the source the A-node it connects to generates no black edges. Note also that each B-node has at least one in-coming black edge. This satisfies the constraints you posed.

We can now detect the A-nodes to be removed by looking for Source->A edges with zero usage.

Source Code

The source code necessary to generate the foregoing figures and associated solutions is as follows:

#!/usr/bin/env python3

#Documentation: https://developers.google.com/optimization/flow/mincostflow
#Install dependency: pip3 install ortools

from __future__ import print_function
from ortools.graph import pywrapgraph
import matplotlib.pyplot as plt
import networkx as nx
import random
import sys

def GenerateGraph(Acount,Bcount):
  assert Acount>5
  assert Bcount>5 

  G = nx.DiGraph() #Directed graph

  source_node = Acount+Bcount
  sink_node   = source_node+1

  for a in range(Acount):
    for i in range(random.randint(0.2*Bcount,0.3*Bcount)): #Connect to 10-20% of the Bnodes
      b = Acount+random.randint(0,Bcount-1) #In the half-open range [0,Bcount). Offset from A's indices
      G.add_edge(source_node, a,         capacity=99999, unit_cost=1, usage=1)
      G.add_edge(a,           b,         capacity=1,     unit_cost=0, usage=1)         
      G.add_edge(b,           sink_node, capacity=1,     unit_cost=0, usage=1)
      G.node[a]['type'] = 'A'
      G.node[b]['type'] = 'B'

  G.node[source_node]['type']   = 'source'
  G.node[sink_node]['type']     = 'sink'
  G.node[source_node]['supply'] = Bcount
  G.node[sink_node]['supply']   = -Bcount

  return G

def VisualizeGraph(graph, color_type):
  gcopy = graph.copy()

  for p, d in graph.nodes(data=True):
    if d['type']=='source':
      source = p
    if d['type']=='sink':
      sink = p    

  Acount = len([1 for p,d in graph.nodes(data=True) if d['type']=='A'])
  Bcount = len([1 for p,d in graph.nodes(data=True) if d['type']=='B'])

  if color_type=='usage':
    edge_color = ['black' if d['usage']>0 else 'red' for u,v,d in graph.edges(data=True)]
  elif color_type=='unit_cost':
    edge_color = ['red' if d['unit_cost']>0 else 'blue' for u,v,d in graph.edges(data=True)]

  Ai  = 0
  Bi  = 0
  pos = dict()
  for p,d in graph.nodes(data=True):
    if d['type']=='source':
      pos[p] = (0, Acount/2)
    elif d['type']=='sink':
      pos[p] = (3, Bcount/2)
    elif d['type']=='A':
      pos[p] = (1, Ai)
      Ai += 1
    elif d['type']=='B':
      pos[p] = (2, Bi)
      Bi += 1      

  nx.draw(graph, pos=pos, edge_color=edge_color, arrows=False)

  plt.show()



def GenerateMinCostFlowProblemFromGraph(graph):
  start_nodes = []
  end_nodes   = []
  capacities  = []
  unit_costs  = []

  min_cost_flow = pywrapgraph.SimpleMinCostFlow()

  for node,neighbor,data in graph.edges(data=True):
    min_cost_flow.AddArcWithCapacityAndUnitCost(node, neighbor, data['capacity'], data['unit_cost'])

  supply = len([1 for p,d in graph.nodes(data=True) if d['type']=='B'])

  for p, d in graph.nodes(data=True):
    if (d['type']=='source' or d['type']=='sink') and 'supply' in d:
      min_cost_flow.SetNodeSupply(p, d['supply'])

  return min_cost_flow



def ColorGraphEdgesByUsage(graph, min_cost_flow):
  for i in range(min_cost_flow.NumArcs()):
    graph[min_cost_flow.Tail(i)][min_cost_flow.Head(i)]['usage'] = min_cost_flow.Flow(i)

def main():
  """MinCostFlow simple interface example."""

  # Define four parallel arrays: start_nodes, end_nodes, capacities, and unit costs
  # between each pair. For instance, the arc from node 0 to node 1 has a
  # capacity of 15 and a unit cost of 4.

  Acount = 20
  Bcount = 20
  graph = GenerateGraph(Acount, Bcount)

  VisualizeGraph(graph, 'unit_cost')

  min_cost_flow = GenerateMinCostFlowProblemFromGraph(graph)

  # Find the minimum cost flow between node 0 and node 4.
  if min_cost_flow.Solve() != min_cost_flow.OPTIMAL:
    print('Unable to find a solution! It is likely that one does not exist for this input.')
    sys.exit(-1)

  print('Minimum cost:', min_cost_flow.OptimalCost())

  ColorGraphEdgesByUsage(graph, min_cost_flow)

  VisualizeGraph(graph, 'usage')

if __name__ == '__main__':
  main()