## Problem

You have a huge graph and you need to find all paths from particular vertices.

Apache Spark has great module GraphX to work with big distributed graphs. But in my opinion it is well suitable for numeric computation based on graph structure such as PageRank. When you need to discover some path GraphX becomes too slow, because of amount of data that is need to be delivered between spark executors.

### Idea

So imagine we have our graph in form of pairs

And we have RDD with particular vertices which paths we need to find.

The idea of algorithm is simple: we will make iterative joins to get all paths in our graph. Ok lets find all paths from startVertices with length 1

RDD

So far so good. As I said before algorithm has iterative nature(so it is very good for Spark). Let`s create recursive function to find all paths

Here we use

Now we can reduce all paths from

###

This naive solution has several problems that need to be resolved in real life applications

**(a, b) -**means oriented edge from a to b.And we have RDD with particular vertices which paths we need to find.

The idea of algorithm is simple: we will make iterative joins to get all paths in our graph. Ok lets find all paths from startVertices with length 1

RDD

**initStep**now contains all paths with length 1 from given vertices.So far so good. As I said before algorithm has iterative nature(so it is very good for Spark). Let`s create recursive function to find all paths

Here we use

**cogroup**instead of join because we do not need to store join key and**join**in Spark is just a special cogroup. Method**count**will trigger computation on our RDD and also is a marker for**stepOver**to stop.Now we can reduce all paths from

**startVertices**###
**P.S.**

This naive solution has several problems that need to be resolved in real life applications- Your graph can have cycles - method
**stepOver**will never ends. To resolve this issue, you can use BloomFilter data structure - it will accumulate all previously "visited" vertices in**stepOver**method and filter them in cogroup operation. I prefer twitter-algebird library and it`s BloomFilter implementation(see https://github.com/twitter/algebird)