Source: https://xrpcommunity.blog/the-full-xrp-ledger-visualized/ (1.4m nodes, 2.6m edges)

- Very large graphs, millions of nodes & edges
- Sampling? (but the whole is more than the sum of its parts)
- Generating ‘summary statistics’ is also a challenge
- How do we visualize?

- Multi-dimensional
- numerous vertex and edge attributes
- time dimension
- uncertainty / data quality issues

- Scalability
- Resource constraints (unless we are Google)
- Most db systems do not support graph operations

Source: Cylynx - node propagation and decay on a transaction graph

Use cases:

- Community detection via label propagation
- Precomputing relations between entities
- Semi-supervised tagging
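As a sketch of the first use case, label propagation can be prototyped in a few lines of plain Python. This is an illustrative toy, not Cylynx's implementation: the node names, fixed update order, and lexicographic tie-breaking are all made up for determinism.

```python
from collections import defaultdict

def label_propagation(edge_list, n_iter=10):
    """Toy asynchronous label propagation: each node repeatedly adopts
    the most common label among its neighbours until nothing changes."""
    adj = defaultdict(list)
    for u, v in edge_list:
        adj[u].append(v)
        adj[v].append(u)
    labels = {n: n for n in adj}           # every node starts as its own community
    for _ in range(n_iter):
        changed = False
        for n in adj:                      # fixed order for determinism;
            counts = {}                    # real implementations randomize it
            for nb in adj[n]:
                counts[labels[nb]] = counts.get(labels[nb], 0) + 1
            # break count ties by taking the lexicographically larger label
            best = max(counts, key=lambda l: (counts[l], l))
            if labels[n] != best:
                labels[n], changed = best, True
        if not changed:
            break
    return labels

# Two triangles joined by a single bridge edge c-x.
edges = [("a", "b"), ("b", "c"), ("a", "c"),
         ("x", "y"), ("y", "z"), ("x", "z"),
         ("c", "x")]
labels = label_propagation(edges)          # settles into two communities
```

At scale the same idea runs as a streaming loop over the edge table rather than an in-memory adjacency dict.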

- Sparse transaction graph (100m nodes, 300m edges and growing)
- Heavily skewed degree distribution (&lt;10% of the nodes are involved in 80% of the transactions)
- Time order matters!
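The skew claim is cheap to verify on an edge stream. A toy sketch with a made-up edge list (in practice the edges would stream from the transaction table):

```python
from collections import Counter

# Toy edge list (src, dst); "a" is a deliberately high-degree hub.
edges = [("a", "b"), ("a", "c"), ("a", "d"), ("a", "e"),
         ("b", "c"), ("c", "d")]

deg = Counter()
for src, dst in edges:
    deg[src] += 1
    deg[dst] += 1

# Share of all edge endpoints touched by the single highest-degree node:
# a quick, streaming-friendly measure of how skewed the distribution is.
ranked = sorted(deg.values(), reverse=True)
share = ranked[0] / sum(ranked)
```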

**Vertices**

vertex_uid | attr1 | attr2 | attr3
---|---|---|---
se2bhd3s | 1 | 0 | 0
3fs34rfs | 0 | 1 | 0
3x6hk6qw | 0 | 0 | 1

**Edges**

txn_uid | from | to | value | timestamp
---|---|---|---|---
ASDJASD | se2bhd3s | 3fs34rfs | 10 | 11 Nov 2019
ASDAECD | 3fs34rfs | 3x6hk6qw | 20 | 12 Dec 2020
ESFDSGS | se2bhd3s | 3ds4rask | 30 | 20 Oct 2021

- Solution needs to be scaled to 10-100x larger graphs
- Exact values need to be calculated for every node / edge
- Should be fast enough to try out different parameters and settings across the entire dataset
- Incremental updates should be fast
- Minimize cost (ok that’s my requirement)

Source: Cylynx - loop de loop, visualizing paths between random start and end points of a mixer component

- Let’s parallelize it with Spark!
- Dump everything into a graph library!
- Let’s use a more efficient graph library!
- How about a simple loop?
- Let’s add a little magic to the loop

**Our current solution is at least 10x faster and 10x cheaper than the initial one**

- Use spark / graphframe to implement parallel processing
- Guarantees scalability, right?
- But, at what cost?

**Lesson learnt:** Use the right tools for the right job

*Pros*

- Nice api
- Visualization
- Easy to compute summary statistics
- Documentation

*Limitations*

- Very slow
- Tricky to deal with time data
- Memory hog

**Takeaway:** If you have a big data problem, skip networkx

Check out my repo for more performant libraries.

*Pros*

- Nice api
- Visualization
- Easy to compute summary statistics

*Limitations*

- Still not fast enough for our problem (though much faster than networkx)
- Tricky to deal with time data
- Memory intensive

**Takeaway:** Think about the problem, then re-think

Source: Cylynx - hierarchical partitions of a stochastic block model (drawn using graph-tool)

- For every transaction, we need to look up each vertex attribute
- Do a calculation based on that attribute and other edge attributes
- Store it back as a vertex attribute
- Repeat.
- Sound familiar?
- Hmm…Maybe we can get by with a simple loop
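The simple loop might look like this. It's a toy sketch, not Cylynx's actual computation: the uids are taken from the tables above, and the decay rule is illustrative.

```python
# Naive per-transaction loop: two k/v lookups (source and destination
# attributes) plus one write-back per edge. The dict stands in for the
# vertex table.
vertex_attrs = {"se2bhd3s": 1.0, "3fs34rfs": 0.0, "3x6hk6qw": 0.0}
txns = [("se2bhd3s", "3fs34rfs"),       # already sorted in time order
        ("3fs34rfs", "3x6hk6qw")]

DECAY = 0.5
for src, dst in txns:
    src_attr = vertex_attrs[src]        # k/v lookup 1
    dst_attr = vertex_attrs[dst]        # k/v lookup 2
    vertex_attrs[dst] = dst_attr + DECAY * src_attr   # store back
```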

- Let’s use each tool for what it’s best designed to do
- ‘Skip’ the vertex table
- Pre-merge vertex attributes to edges using our SQL system before looping
- Pre-sort the data, ‘stream’ it and loop over it in time order
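The pre-merge-then-stream idea can be sketched with sqlite3 standing in for the real SQL system; the schema and uids mirror the toy tables above, and the propagate-and-decay rule is illustrative.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE vertices (vertex_uid TEXT PRIMARY KEY, attr1 REAL);
CREATE TABLE edges (txn_uid TEXT, src TEXT, dst TEXT, value REAL, ts INTEGER);
""")
con.executemany("INSERT INTO vertices VALUES (?, ?)",
                [("se2bhd3s", 1.0), ("3fs34rfs", 0.0), ("3x6hk6qw", 0.0)])
con.executemany("INSERT INTO edges VALUES (?, ?, ?, ?, ?)",
                [("ASDJASD", "se2bhd3s", "3fs34rfs", 10, 1),
                 ("ASDAECD", "3fs34rfs", "3x6hk6qw", 20, 2)])

# Pre-merge: join the source vertex attribute onto every edge and sort by
# time, so the loop below never does a per-transaction vertex-table lookup.
rows = con.execute("""
    SELECT e.src, e.dst, v.attr1
    FROM edges e JOIN vertices v ON v.vertex_uid = e.src
    ORDER BY e.ts
""")

DECAY = 0.5
state = {}  # running vertex scores, updated as edges stream in time order
for src, dst, attr1 in rows:
    score = state.get(src, attr1)             # fall back to pre-merged attribute
    state[dst] = state.get(dst, 0.0) + DECAY * score  # toy propagate-and-decay
```

Only the running `state` dict lives in memory; the transactions themselves are streamed and discarded.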

*Pros*

- Very fast (removes 2 k/v lookups per transaction)
- Lower memory footprint (no need to store txn information)

**Can we do even better?**

Source: Cylynx - degree distribution of transactions

Source: Cylynx - what’s the catch?

- No need to store all the data on memory
- Slightly poorer performance but we can hold intermediate state in a k/v store
- Leverage the graph degree distribution findings
- Active accounts at t are more likely to be active in t+1
- Let’s use a LRU cache!
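A write-through LRU in front of the k/v store can be sketched in a few lines. This is a toy illustration of the idea, not Cylynx's implementation; a plain dict stands in for the SSD-backed store.

```python
from collections import OrderedDict

class LRUStateCache:
    """Write-through LRU cache in front of a slower k/v store.
    Hot (recently active) accounts stay in memory; cold ones fall
    back to the backing store, so results stay exact."""
    def __init__(self, backing, capacity):
        self.backing = backing
        self.capacity = capacity
        self.cache = OrderedDict()

    def get(self, key, default=0.0):
        if key in self.cache:
            self.cache.move_to_end(key)         # hit: mark as recently used
            return self.cache[key]
        value = self.backing.get(key, default)  # miss: fall back to the store
        self._remember(key, value)
        return value

    def set(self, key, value):
        self.backing[key] = value               # write-through keeps the store exact
        self._remember(key, value)

    def _remember(self, key, value):
        self.cache[key] = value
        self.cache.move_to_end(key)
        if len(self.cache) > self.capacity:
            self.cache.popitem(last=False)      # evict least recently used

disk = {}                     # stand-in for the SSD-backed k/v store
state = LRUStateCache(disk, capacity=2)
state.set("a", 1.0)
state.set("b", 2.0)
state.set("c", 3.0)           # "a" is evicted from the cache, not from disk
```

Because active accounts at t tend to be active at t+1, most lookups hit the small in-memory cache.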

*Pros*

- Fast
- Negligible memory footprint
- Cheap (trade memory for SSD)

Source: Cylynx - a few random components of our tagged transaction graph, visualized

- It’s worth spending time to think through the problem carefully
- Personally, I would skip networkx and start exploring with a more performant graph library
- Leveraging conventional SQL systems and a small compute machine, we can do very impressive calculations
- Our solution is 10x faster and cheaper than our previous one
- Still further room to optimize but it’s really unnecessary at this point

- Visualization tools are still quite primitive for big graphs
- How do we visualize large graphs meaningfully?
- How can we allow efficient exploration of neighbourhoods?

- Distributed graph computing
- How do we write tests on big graphs / monitor algorithm performance?
- How can we make graphDBs and graph computation software work seamlessly together?