mirror of
https://github.com/project-zot/zot.git
synced 2026-06-18 21:48:04 +08:00
Add initial design document from PR #3733
Co-authored-by: rchincha <45800463+rchincha@users.noreply.github.com>
@@ -0,0 +1 @@
*.txt
@@ -0,0 +1,88 @@
# Image Stream Concept Design

This document describes a proposal for how image streaming could be implemented within zot.

## Background and Problem

Currently, when a blob is downloaded on demand from zot, zot first pulls the blob from upstream, commits the image to zot storage, and only then replies to the client. For large blobs, the client connection can time out while it waits for blob data.

This causes problems in environments such as Kubernetes, where an image pull may fail several times until zot has finished caching the image in its storage.

## Proposed Solution

With the proposed approach, zot makes blob data available to clients while it is still downloading the blob into local storage. In other words, a client is allowed to download data from a partially copied blob.

The first client to request an image that zot does not yet have triggers this in-flight download. Other clients that want the same blob can join at any point while the download is in progress.

## Solution Details and Proof of Concept

The fundamental concept is that a blob is divided into chunks of a fixed size. The chunk size can be made configurable through the zot config. Using this chunk size, zot tracks how many chunks have been written to disk and can therefore be read by clients.

### Assumptions

The size of a blob MUST be known beforehand so that the total number of chunks can be calculated.
This size is available in the manifest, as shown below:

```json
{
  "schemaVersion": 2,
  "mediaType": "application/vnd.oci.image.manifest.v1+json",
  "config": {
    "mediaType": "application/vnd.oci.image.config.v1+json",
    "digest": "sha256:4be0d2f67cae5ca4f622fc3deccdd754d8eb5a6d2f9034474a29f01c69470439",
    "size": 2950
  },
  "layers": [
    {
      "mediaType": "application/vnd.oci.image.layer.v1.tar+gzip",
      "digest": "sha256:2937c3216fda91408f3a19648766369102691c9a4d698d12d4a0eb6155c13ef1",
      "size": 52246758
    },
    {
      "mediaType": "application/vnd.oci.image.layer.v1.tar+gzip",
      "digest": "sha256:2d5283c2546119a67577a6cbf063f0d05f3b491f556f8415bbee461f073b6d04",
      "size": 25630769
    }
  ]
}
```

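As a rough illustration of the chunk calculation, the total number of chunks for a layer is a ceiling division of the manifest `size` by the chunk size. The sketch below assumes a hypothetical 4 MiB chunk size and a hypothetical helper named `chunkCount`; neither is an existing zot setting or function.

```go
package main

import "fmt"

// chunkCount returns how many fixed-size chunks are needed to hold a blob of
// blobSize bytes. The last chunk may be only partially filled.
// chunkCount is a hypothetical helper used for illustration only.
func chunkCount(blobSize, chunkSize int64) int64 {
	if blobSize <= 0 || chunkSize <= 0 {
		return 0
	}

	return (blobSize + chunkSize - 1) / chunkSize
}

func main() {
	// Assumed chunk size of 4 MiB; the real value would come from zot config.
	const chunkSize = 4 << 20

	// Layer sizes taken from the example manifest.
	for _, size := range []int64{52246758, 25630769} {
		fmt.Printf("%d bytes -> %d chunks\n", size, chunkCount(size, chunkSize))
	}
}
```

With these assumed values the two layers need 13 and 7 chunks respectively, and only the last chunk of each layer is shorter than the chunk size.
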
The zot temp store would support sparse storage of OCI images so that it can hold manifests and partially copied blobs.

### Example blob download flow

1. The client issues a request for a blob: `GET https://zothub.io/v2/golang/blobs/sha256:8ec9f1fd1cf4f5152e86d09c28013ce076b8c09d3a9f5850591be40273ff877e`
2. zot checks for the blob locally, but it is not present.
3. Because of on-demand sync, zot creates a `ChunkedImageCopier` that copies the blob from a regclient `blob.Reader` to zot temp storage. regclient has a `BlobGet` method that returns a `blob.Reader` instance ([documentation](https://pkg.go.dev/github.com/regclient/regclient@v0.11.1#RegClient.BlobGet)).
4. The `ChunkedImageCopier` calculates the number of chunks and begins the download to zot temp storage.
5. The client (the currently open HTTP connection) has an associated `InFlightImageCopier` object that tracks, per object, the number of chunks copied. It opens the temporary file to which the blob is being written and registers a Go channel with the `ChunkedImageCopier`. The `ChunkedImageCopier` announces the latest chunk number over this channel at the time of registration and again every time a new chunk has been written to disk.
6. The `InFlightImageCopier` receives the value from the channel and copies `(latestChunkNumber - numChunksCopied) * chunkSize` bytes from the open file descriptor to the connection's `io.Writer` implementation, repeating until all the chunks are copied.
7. The `InFlightImageCopier` keeps the connection and channel active until all the bytes are copied. If the client connection terminates during the copy, the channel is de-registered and closed.

Any new client that joins during the copy follows the same steps from step 5 onwards. It first copies as many chunks as are already available on disk. Once that is complete, its `InFlightImageCopier` waits for announcements over the channel and continues copying bytes until all chunks are copied.

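The byte count in step 6 can be sketched as follows. Because the last chunk of a blob is usually shorter than the chunk size, this sketch clamps the result to the blob size, which is an addition to the formula above. The helper name `bytesToCopy` and its parameters are hypothetical and not part of zot.

```go
package main

import "fmt"

// bytesToCopy returns how many bytes a client should read from the temp file
// after an announcement that latestChunk chunks are on disk, given that the
// client has already copied chunksCopied chunks.
// The result is clamped to blobSize so that a short final chunk is handled.
// bytesToCopy is a hypothetical helper used for illustration only.
func bytesToCopy(latestChunk, chunksCopied, chunkSize, blobSize int64) int64 {
	if latestChunk <= chunksCopied {
		return 0 // stale or duplicate announcement
	}

	start := chunksCopied * chunkSize
	end := latestChunk * chunkSize

	if end > blobSize {
		end = blobSize
	}

	if end <= start {
		return 0
	}

	return end - start
}

func main() {
	// A 104-byte blob split into 5-byte chunks has 21 chunks; the last holds 4 bytes.
	fmt.Println(bytesToCopy(1, 0, 5, 104))   // first chunk
	fmt.Println(bytesToCopy(21, 20, 5, 104)) // final, short chunk
	fmt.Println(bytesToCopy(21, 0, 5, 104))  // late joiner after completion
}
```

A client that joins late simply sees a larger gap between `latestChunk` and `chunksCopied`, so the same calculation covers both the catch-up phase and the steady state.
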
### Scaling up to images

For an image with multiple layers, zot can download several layers simultaneously and provide one `ChunkedImageCopier` for each blob being downloaded.
Clients are added as they make requests.

For completed blobs, the `ChunkedImageCopier` can announce the final chunk number upon registration.

Manifests are not subject to this flow because they are a prerequisite for streamed blob downloads. They would follow the usual flow, in which zot downloads first and then responds to the client.

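One way to keep a single copier per blob is a registry keyed by digest, so that the first request starts the download and later requests join it. The sketch below is only an assumption about how this could look: `inFlightRegistry`, `download`, and `join` are hypothetical names, and `download` stands in for a per-blob `ChunkedImageCopier`.

```go
package main

import (
	"fmt"
	"sync"
)

// download is a hypothetical stand-in for a per-blob ChunkedImageCopier.
type download struct {
	digest string
}

// inFlightRegistry tracks at most one active download per blob digest.
type inFlightRegistry struct {
	mu     sync.Mutex
	active map[string]*download
}

func newInFlightRegistry() *inFlightRegistry {
	return &inFlightRegistry{active: make(map[string]*download)}
}

// join returns the download for digest. The boolean is true only for the
// caller that created the entry and is therefore responsible for starting
// the upstream copy; every other caller joins the existing download.
func (r *inFlightRegistry) join(digest string) (*download, bool) {
	r.mu.Lock()
	defer r.mu.Unlock()

	if existing, ok := r.active[digest]; ok {
		return existing, false
	}

	created := &download{digest: digest}
	r.active[digest] = created

	return created, true
}

func main() {
	registry := newInFlightRegistry()
	digest := "sha256:2937c3216fda91408f3a19648766369102691c9a4d698d12d4a0eb6155c13ef1"

	first, started := registry.join(digest)
	fmt.Println("first client starts download:", started)

	second, started := registry.join(digest)
	fmt.Println("second client starts download:", started)
	fmt.Println("both share one download:", first == second)
}
```

An entry for a completed blob would eventually be removed once the blob has been committed to zot storage; that cleanup is left out of this sketch.
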
### Benefits of this design

1. All requests for an image that is being streamed follow a single standard flow, which makes the behavior easy to reason about.
2. It is relatively easy to keep track of clients because the `InFlightImageCopier` maintains the client state. Clients that disconnect are also handled cleanly: their subscription is terminated if any error is detected while writing to the `io.Writer` implementation.

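For disconnect handling to stay clean, the publisher should not block on a subscriber that has stopped reading. Since chunk numbers are cumulative, only the most recent announcement matters, so one option is a non-blocking send that replaces any stale value. The sketch below assumes a channel of capacity 1 with a single sender; `announceLatest` is a hypothetical helper and not part of the proof of concept.

```go
package main

import "fmt"

// announceLatest delivers latest to a subscriber without ever blocking.
// It assumes ch has capacity 1 and that only one goroutine sends on it.
// If the buffer still holds an older value, that value is discarded and
// replaced, because a newer chunk number supersedes every older one.
// announceLatest is a hypothetical helper used for illustration only.
func announceLatest(ch chan int, latest int) {
	for {
		select {
		case ch <- latest:
			return
		default:
		}

		// The buffer is full: drop the stale value (if the subscriber has
		// not taken it in the meantime) and retry the send.
		select {
		case <-ch:
		default:
		}
	}
}

func main() {
	ch := make(chan int, 1)

	// The subscriber is not reading, yet none of these calls block.
	for chunk := 1; chunk <= 3; chunk++ {
		announceLatest(ch, chunk)
	}

	fmt.Println("subscriber sees chunk", <-ch)
}
```

With this approach a slow or disconnected client costs the publisher nothing beyond one buffered integer, and the client still learns the newest chunk number the next time it reads.
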
### Possible Downsides with this design

1. Each client holds an open file descriptor to the temp file to which the blob is being written. If the number of clients is very high, this could result in a "too many open files" error.
2. Download speeds for the client would be affected by the configured chunk size.
3. regclient performs many checks during image copy that will not run if zot accesses the `blob.Reader` directly. This may need some discussion to ensure that access to the completed image, once all the blobs are streamed, is sane.

### Proof of concept

The `main.go` file in this directory contains a mock blob download in which characters in a buffer go through a simulated download into a file called `ondiskblob.txt`, which represents an OCI blob being written to disk. Two sample clients are used: one writes to the text file `client1.txt` and the other writes to stdout.

Running the program with `go run main.go` should result in lorem ipsum text being gradually written to three places: the two text files and stdout.
@@ -0,0 +1,3 @@
module imagestreamtest

go 1.25.5
@@ -0,0 +1,259 @@
package main

import (
	"errors"
	"io"
	"log"
	"os"
	"sync"
	"time"
)

const chunkSizeBytes = 5

// simulatedNetworkReader implements io.Reader with an artificial delay.
// Each Read sleeps for 2 seconds and returns at most chunkSizeBytes bytes to
// simulate a very slow network copy.
// The data supplied in the buffer is a stand-in for image blob data being
// transferred over the network.
type simulatedNetworkReader struct {
	src     []byte
	current int
}

func NewSimulatedNetworkReader(src []byte) *simulatedNetworkReader {
	return &simulatedNetworkReader{
		src:     src,
		current: 0,
	}
}

func (snr *simulatedNetworkReader) Read(b []byte) (n int, err error) {
	time.Sleep(2 * time.Second)

	if snr.current >= len(snr.src) {
		return 0, io.EOF
	}

	// Never write past the caller's buffer, which may be shorter than a chunk.
	limit := min(chunkSizeBytes, len(b))
	bytesRead := copy(b[:limit], snr.src[snr.current:])
	snr.current += bytesRead

	return bytesRead, nil
}

// InFlightImageCopier represents a client that wants to stream an image while it is being written to disk.
// The data is copied first from disk up to the latest chunk and further copies wait for an announcement
// over a channel when a new chunk has been written to disk.
type InFlightImageCopier struct {
	numChunksCopied int
	source          *ChunkedImageCopier
	dest            io.Writer
	sync.Mutex
}

func NewInFlightImageCopier(source *ChunkedImageCopier, dest io.Writer) *InFlightImageCopier {
	return &InFlightImageCopier{
		numChunksCopied: 0,
		source:          source,
		dest:            dest,
	}
}

func (ific *InFlightImageCopier) Copy() (err error) {
	inputFile, err := os.Open(ific.source.onDiskPath)
	if err != nil {
		log.Printf("failed to open read file: %s\n", err.Error())

		return err
	}
	defer inputFile.Close()

	// Register channel for latest chunk count updates
	chunkChan := make(chan int, 1)

	id := ific.source.Subscribe(chunkChan)
	// Unsubscribe on every exit path. The channel is deliberately not closed:
	// announcer goroutines may still hold it, and a send on a closed channel panics.
	defer ific.source.Unsubscribe(id)

	for {
		latestChunkNum := <-chunkChan

		ific.Lock()
		if latestChunkNum <= ific.numChunksCopied {
			// stale or duplicate announcement
			ific.Unlock()
			continue
		}

		bytesToCopy := (int64(latestChunkNum) - int64(ific.numChunksCopied)) * chunkSizeBytes

		// io.EOF is expected when the final chunk is shorter than chunkSizeBytes.
		_, err = io.CopyN(ific.dest, inputFile, bytesToCopy)
		if err != nil && !errors.Is(err, io.EOF) {
			ific.Unlock()
			log.Printf("failed disk copy: %s\n", err.Error())

			return err
		}

		ific.numChunksCopied = latestChunkNum
		ific.Unlock()

		if latestChunkNum == ific.source.numChunksTotal {
			// transfer is complete
			break
		}
	}

	return nil
}

// ChunkedImageCopier is a helper that splits an image into chunks based on chunkSize.
// It then copies chunks to disk.
// The latest chunk number is announced to channels of subscribers.
type ChunkedImageCopier struct {
	numChunksTotal  int
	numChunksOnDisk int

	onDiskPath      string
	inFlightReader  io.Reader
	clientMu        sync.Mutex
	clients         map[int]chan int
	numClientsTotal int
}

func NewChunkedImageCopier(destFilePath string, r io.Reader, numChunksTotal int) *ChunkedImageCopier {
	return &ChunkedImageCopier{
		numChunksTotal: numChunksTotal,
		onDiskPath:     destFilePath,
		inFlightReader: r,
		clients:        make(map[int]chan int),
	}
}

// Every time a new client is interested in the current blob, the client creates a subscription
// here with a channel on which the latest chunk info is sent.
func (cic *ChunkedImageCopier) Subscribe(channel chan int) int {
	cic.clientMu.Lock()
	defer cic.clientMu.Unlock()

	cic.clients[cic.numClientsTotal] = channel
	chanId := cic.numClientsTotal
	cic.numClientsTotal++

	// Announce the current number of available chunks.
	// The count is captured while clientMu is held so that the goroutine does not
	// read the field later. This read is only race-free if Transfer updates
	// numChunksOnDisk while holding clientMu.
	numChunksOnDisk := cic.numChunksOnDisk

	go func() {
		channel <- numChunksOnDisk
	}()

	return chanId
}

func (cic *ChunkedImageCopier) Unsubscribe(id int) {
	cic.clientMu.Lock()
	defer cic.clientMu.Unlock()

	delete(cic.clients, id)
}

// Transfer writes content from inFlightReader to disk while updating clients.
func (cic *ChunkedImageCopier) Transfer() {
	log.Println("starting writer")

	// O_TRUNC discards leftovers from a previous run.
	outputFile, err := os.OpenFile(cic.onDiskPath, os.O_WRONLY|os.O_CREATE|os.O_TRUNC, 0o644)
	if err != nil {
		log.Fatalf("failed to open write file: %s\n", err.Error())
	}
	defer outputFile.Close()

	var wg sync.WaitGroup

	for cic.numChunksOnDisk < cic.numChunksTotal {
		// simulates writing network resp body into a blob file with delay
		// io.EOF is expected when the final chunk is shorter than chunkSizeBytes.
		_, err = io.CopyN(outputFile, cic.inFlightReader, chunkSizeBytes)
		if err != nil && !errors.Is(err, io.EOF) {
			log.Fatalf("failed to copy bytes: %s\n", err.Error())
		}

		// Update the count under clientMu so that Subscribe observes a consistent value.
		cic.clientMu.Lock()
		cic.numChunksOnDisk++
		latest := cic.numChunksOnDisk

		// Update all clients about the new chunk
		// Clients always read the chunk from disk
		for _, c := range cic.clients {
			wg.Go(func() {
				c <- latest
			})
		}

		cic.clientMu.Unlock()
	}

	wg.Wait()
	log.Println("closing writer")
}

func chunkCountForBuffer(b []byte) int {
	chunkCount := len(b) / chunkSizeBytes
	remainder := len(b) % chunkSizeBytes

	if remainder > 0 {
		chunkCount++
	}

	return chunkCount
}

func main() {
	// 104 bytes - represents a single image blob
	buff := []byte("Lorem ipsum dolor sit amet, consectetur adipiscing elit. Duis consectetur pellentesque ultrices sit. S12")

	r := NewSimulatedNetworkReader(buff)

	msf := NewChunkedImageCopier("ondiskblob.txt", r, chunkCountForBuffer(buff))

	// client1.txt simulates an HTTP client receiving data over the network
	// O_TRUNC discards leftovers from a previous run.
	client1File, err := os.OpenFile("client1.txt", os.O_WRONLY|os.O_CREATE|os.O_TRUNC, 0o644)
	if err != nil {
		log.Fatalf("failed to open client file: %s\n", err.Error())
	}
	defer client1File.Close()

	client1 := NewInFlightImageCopier(msf, client1File)

	// stdout is also used as a client and simulates another client interested in the same blob
	client2 := NewInFlightImageCopier(msf, os.Stdout)

	var wg sync.WaitGroup

	// Simulates the network transfer starting first
	wg.Go(msf.Transfer)

	// Give Transfer time to create the on-disk file before the first client opens it
	time.Sleep(10 * time.Millisecond)

	wg.Go(func() {
		err := client1.Copy()
		if err != nil {
			log.Printf("client1: failed to copy: %s\n", err.Error())
		}
	})

	// Wait for a bit longer to test a case where a new client comes in during the middle of copy
	time.Sleep(5 * time.Second)

	wg.Go(func() {
		err := client2.Copy()
		if err != nil {
			log.Printf("client2: failed to copy: %s\n", err.Error())
		}
	})

	wg.Wait()
}