Repositories and content addressable storage


SLIDE 1

Repositories and content addressable storage

A data repository needs to (among other things)

  • Make sure data remains safe and uncorrupted
  • Make sure data remains available
  • If data is changed, previous version should be kept

Solutions are available, but...

  • Links to data break: how do we make sure that a link, once created, never breaks?

○ Who keeps track of what is where?

  • What if two files have different names but the same content (duplication)?
  • Dealing with unexpected events

Many existing solutions use centralized systems

  • Single point of failure, single entity in control
  • What about doing all of the above at scale (big data, etc.)?
SLIDE 2

Repositories and content addressable storage

Possible solution: distributed and content addressed storage

Distributed = a resource is controlled by many. No single place, person, server, or entity has full control

Content addressed = things can be found based on their content

  • Create a digital fingerprint of vacation.jpg based on its content.
  • The fingerprint stays the same no matter where the file physically resides

Location addressed = things can be found based on a known location

  • C:\Photos\vacation.jpg
  • The identifier changes when the file is moved or renamed, even though the content doesn’t, e.g. C:\Pictures\Vacation\waterslide.jpg
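The difference between the two addressing schemes can be sketched in a few lines of Python. This is a minimal illustration, assuming SHA-256 as the fingerprint function and made-up file contents; the slides do not name a specific hash, and real systems such as IPFS wrap the digest in their own identifier format.

```python
import hashlib

def fingerprint(content: bytes) -> str:
    """Return a hex SHA-256 digest that depends only on the bytes given."""
    return hashlib.sha256(content).hexdigest()

# Two locations from the slides, holding the same (made-up) bytes.
files = {
    r"C:\Photos\vacation.jpg": b"example image bytes",
    r"C:\Pictures\Vacation\waterslide.jpg": b"example image bytes",
}

# The paths (location addresses) differ, the fingerprints (content addresses) do not.
fingerprints = {path: fingerprint(data) for path, data in files.items()}
same_content = len(set(fingerprints.values())) == 1
print(same_content)
```

Moving or renaming the file changes the dictionary key (the location) but never the fingerprint.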

SLIDE 3

Repositories and content addressable storage

Content addressable storage

  • The fingerprint (hash) of a given piece of content always stays the same, so it can be used to uniquely identify and de-duplicate data
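A toy content-addressed store makes the identification and de-duplication property concrete. The sketch below is an in-memory illustration only: the class name `ContentStore` and its methods are hypothetical, and SHA-256 is an assumed choice of hash.

```python
import hashlib

class ContentStore:
    """Toy in-memory content-addressed store (hypothetical, for illustration)."""

    def __init__(self):
        self._blobs = {}  # fingerprint (hex string) -> content (bytes)

    def put(self, content: bytes) -> str:
        key = hashlib.sha256(content).hexdigest()
        # Identical content maps to the same key, so it is stored only once.
        self._blobs.setdefault(key, content)
        return key

    def get(self, key: str) -> bytes:
        content = self._blobs[key]
        # Re-hash on read: a mismatch means the stored bytes were corrupted.
        if hashlib.sha256(content).hexdigest() != key:
            raise ValueError("stored content does not match its fingerprint")
        return content

    def __len__(self) -> int:
        return len(self._blobs)

store = ContentStore()
k1 = store.put(b"holiday photo")
k2 = store.put(b"holiday photo")   # same bytes, e.g. saved under another file name
k3 = store.put(b"another photo")
print(k1 == k2, len(store))
```

Because the key is derived from the content, the same check that finds the data also verifies that it is uncorrupted.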

Distributed content addressable storage

  • Decentralizes the table that keeps track of where the raw data associated with the fingerprints physically reside

○ Uses many participants, each having equal responsibility

  • No single point of failure, e.g., no single entity controls the lookup table
  • Links can stay around for as long as the network exists.
  • Can use the resources of participants to keep safe copies of the data, and use their bandwidth to speed up transfers
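One way to spread a lookup table over equal participants is to make each node responsible for the keys "closest" to its own ID. The sketch below assumes an XOR distance metric in the style of Kademlia-type distributed hash tables; the slides do not specify the scheme, and the node names, key name, and replica count are made up for illustration.

```python
import hashlib

def to_id(name: str) -> int:
    """Map a name to a 256-bit integer ID via SHA-256."""
    return int.from_bytes(hashlib.sha256(name.encode()).digest(), "big")

def responsible_nodes(key: int, node_ids: list, replicas: int = 2) -> list:
    """Pick the `replicas` node IDs closest to the key by XOR distance."""
    return sorted(node_ids, key=lambda node_id: node_id ^ key)[:replicas]

# Eight hypothetical participants, none of them special.
nodes = {to_id(f"node-{i}"): f"node-{i}" for i in range(8)}
key = to_id("fingerprint-of-vacation.jpg")
holders = responsible_nodes(key, list(nodes))

# If the closest node leaves, the next-closest nodes still answer for the key.
remaining = [n for n in nodes if n != holders[0]]
fallback = responsible_nodes(key, remaining)
print([nodes[n] for n in holders], [nodes[n] for n in fallback])
```

Every participant can compute the same answer locally, so no single entity has to hold or control the whole table.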

SLIDE 4

IPFS

IPFS is a content addressed distributed storage protocol

  • A single file system that is spread out on many computers (nodes)
SLIDE 5

IPFS and repositories

Generally: IPFS is a protocol rather than a service

The nodes form a distributed file system based on P2P technology (e.g., DHT for lookups)

Versioning and de-duplication are fundamentally part of it

Files are broken down into blocks.

  • Each block has a hash.
  • Blocks are linked in a tree-like structure.
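The block-and-hash idea can be sketched as a small hash tree. This is not IPFS's actual block format or chunking algorithm; the fixed block size, the pairwise (binary) linking, and SHA-256 are all assumptions chosen to keep the example short.

```python
import hashlib

def h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()

def split_blocks(content: bytes, block_size: int) -> list:
    """Cut content into fixed-size blocks (the last one may be shorter)."""
    return [content[i:i + block_size] for i in range(0, len(content), block_size)]

def merkle_root(blocks: list) -> bytes:
    """Hash each block, then hash groups of hashes upward until one root remains."""
    if not blocks:
        return h(b"")
    level = [h(block) for block in blocks]
    while len(level) > 1:
        # Pair neighbouring hashes; an unpaired last hash is re-hashed on its own.
        pairs = [level[i:i + 2] for i in range(0, len(level), 2)]
        level = [h(b"".join(pair)) for pair in pairs]
    return level[0]

original = b"abcdefghij" * 10                      # 100 bytes of made-up data
blocks = split_blocks(original, block_size=16)
root = merkle_root(blocks)

# Change only the last byte: the root changes, but most blocks are untouched.
edited = split_blocks(original[:-1] + b"X", block_size=16)
shared = sum(1 for a, b in zip(blocks, edited) if a == b)
print(len(blocks), shared, merkle_root(edited) != root)
```

Unchanged blocks keep their hashes, which is why versioning and de-duplication come almost for free in this kind of structure: a new version only needs to store the blocks that differ.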

“IPFS is actually more similar to a single bittorrent swarm exchanging git objects.”

Some interesting properties:

  • Can build services on top of it (client, server)
  • Can access IPFS content via standard HTTP using gateways (see figure) or FUSE.
  • Objects can be “pinned” so they aren’t garbage collected and always stay local
  • Possibility to create a private IPFS network (via modification of the bootstrap list)
  • Easy, quick

SLIDE 6

IPFS Gateway
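Gateway access over plain HTTP comes down to building a URL that contains the content identifier. The sketch below assumes the path-style form `<gateway>/ipfs/<CID>` and the public `https://ipfs.io` gateway as the default; the helper name `gateway_url` and the `<CID>` placeholder are hypothetical and stand in for a real identifier.

```python
def gateway_url(cid: str, gateway: str = "https://ipfs.io", path: str = "") -> str:
    """Build a path-style gateway URL of the form <gateway>/ipfs/<cid>[/<path>]."""
    url = f"{gateway.rstrip('/')}/ipfs/{cid}"
    if path:
        url += "/" + path.lstrip("/")
    return url

# "<CID>" is a placeholder, not a real content identifier.
print(gateway_url("<CID>"))
print(gateway_url("<CID>", path="photos/vacation.jpg"))
```

Any ordinary HTTP client or browser can then fetch the resulting URL, so a reader does not need to run an IPFS node.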

SLIDE 7

What IPFS isn’t

A cloud storage service or backup protocol.

  • Can’t upload stuff and disconnect

Files must be kept available by “pinning” them.

  • Unpinned files may be removed by garbage collection after some time
  • Who will pin files in addition to the owner?

○ Other interested parties?
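The relationship between pinning and garbage collection can be shown with a toy node. This is a simplified model, not IPFS's actual garbage collector: the class `ToyNode`, its methods, and the sample contents are hypothetical.

```python
import hashlib

class ToyNode:
    """Toy model of one node's local block store (hypothetical, for illustration)."""

    def __init__(self):
        self.blocks = {}     # fingerprint -> content held locally
        self.pinned = set()  # fingerprints that must survive garbage collection

    def add(self, content: bytes, pin: bool = False) -> str:
        key = hashlib.sha256(content).hexdigest()
        self.blocks[key] = content
        if pin:
            self.pinned.add(key)
        return key

    def collect_garbage(self) -> int:
        """Drop every block that is not pinned; return how many were removed."""
        unpinned = [key for key in self.blocks if key not in self.pinned]
        for key in unpinned:
            del self.blocks[key]
        return len(unpinned)

node = ToyNode()
kept = node.add(b"dataset v1", pin=True)
cached = node.add(b"something fetched once and never pinned")
removed = node.collect_garbage()
print(removed, kept in node.blocks, cached in node.blocks)
```

If no node anywhere pins a piece of content, nothing guarantees that a copy survives, which is why IPFS on its own is not a backup service.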

A blockchain-based system

Blockchain = immutable, publicly available & verifiable record of transactions

Can work with blockchain

  • Incentives for providing node resources
  • “Mining” a cryptocurrency for reward
  • Storing data in a blockchain is inefficient.

○ Combine: store transactions in the blockchain, data in IPFS.
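The combination in the last point can be sketched as a small ledger that records only fingerprints, with the data itself kept in a separate content-addressed store. Both parts are stand-ins: `ledger` is a plain hash-linked list rather than a real blockchain, `off_chain` is a dictionary rather than IPFS, and all names are hypothetical.

```python
import hashlib
import json

def sha256_hex(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()

off_chain = {}  # stand-in for content-addressed storage: fingerprint -> data
ledger = []     # stand-in for a blockchain: each record links to the previous one

def record(data: bytes) -> dict:
    """Store the data off-chain and append only its fingerprint to the ledger."""
    fingerprint = sha256_hex(data)
    off_chain[fingerprint] = data
    prev = ledger[-1]["record_hash"] if ledger else "0" * 64
    body = {"prev": prev, "data_fingerprint": fingerprint}
    entry = dict(body, record_hash=sha256_hex(json.dumps(body, sort_keys=True).encode()))
    ledger.append(entry)
    return entry

def verify() -> bool:
    """Check the links between records and the data each record points to."""
    prev = "0" * 64
    for entry in ledger:
        body = {"prev": entry["prev"], "data_fingerprint": entry["data_fingerprint"]}
        if entry["prev"] != prev:
            return False
        if entry["record_hash"] != sha256_hex(json.dumps(body, sort_keys=True).encode()):
            return False
        if sha256_hex(off_chain[entry["data_fingerprint"]]) != entry["data_fingerprint"]:
            return False
        prev = entry["record_hash"]
    return True

first = record(b"large dataset, version 1")
second = record(b"large dataset, version 2")
ok_before = verify()

# Tampering with a recorded fingerprint breaks verification.
ledger[0]["data_fingerprint"] = sha256_hex(b"tampered")
ok_after = verify()
print(ok_before, ok_after)
```

Each ledger record stays a fixed, small size no matter how large the data is, which is the efficiency argument for keeping the bulk data outside the chain.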