Architecture
Learn more about how it works
The primary abstraction is a connector. Connectors are components that interface with data stores and are responsible for fetching data from and writing data into them. Connectors conform to a defined gRPC interface, and any valid implementation is supported. Connectors can be source only, destination only, or both. Dsync is software that coordinates data activities between connectors. For example, data movement/replication is supported from a connector with source capabilities into a connector with sink capabilities. Verification is supported between two connectors that both have source capabilities. Dsync will ensure the connectors specified for any operation are compatible.
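The compatibility rule above can be sketched as follows. This is an illustrative model only; names like `Connector`, `can_replicate`, and `can_verify` are assumptions for the sketch, not the actual Dsync API.

```python
from dataclasses import dataclass

@dataclass
class Connector:
    """Illustrative stand-in for a connector's declared capabilities."""
    name: str
    supports_source: bool = False
    supports_sink: bool = False

def can_replicate(src: Connector, dst: Connector) -> bool:
    # Movement/replication needs source capabilities on one side
    # and sink capabilities on the other.
    return src.supports_source and dst.supports_sink

def can_verify(a: Connector, b: Connector) -> bool:
    # Verification compares data, so both sides must be readable.
    return a.supports_source and b.supports_source

mongo = Connector("mongodb", supports_source=True, supports_sink=True)
dynamo = Connector("dynamodb", supports_source=True)

print(can_replicate(dynamo, mongo))  # True
print(can_replicate(mongo, dynamo))  # False: dynamo declares no sink
print(can_verify(mongo, dynamo))     # True
```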
To support transforms, we also provide a gRPC interface for transforms. Transforms are useful when data needs to be mapped from a source to a destination.
Connector Interface
For the specifics of the gRPC interface, refer to the protobuf definitions.
Connectors must define the capabilities they support, along with basic information, via the `GetInfo` endpoint. For basic information, the type would generally specify the underlying database, such as MongoDB. The `Id` is used to determine whether a connector is the same as one seen previously, for resumability purposes.
A connector must provide a non-nil definition for at least one of the `Source` or `Sink` capabilities to describe whether it supports being a source, a destination, or both. Within each of these capabilities, a connector must specify which data types it supports. Currently, the only supported type is `MONGO_BSON`, which uses a BSON encoding and, by definition, has an `_id` field that acts as the primary key. This may not be true for all future types.
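A minimal sketch of what a `GetInfo` response might carry, given the rules above. The field and type names here are assumptions for illustration; the authoritative shapes are in the protobuf definitions.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Capability:
    # Data types this side can handle; MONGO_BSON is the only one today.
    data_types: list

@dataclass
class ConnectorInfo:
    id: str          # stable identity, used for resumability
    type: str        # underlying database, e.g. "MongoDB"
    source: Optional[Capability] = None
    sink: Optional[Capability] = None

def validate(info: ConnectorInfo) -> None:
    # At least one of Source or Sink must be non-nil.
    if info.source is None and info.sink is None:
        raise ValueError("connector must declare source and/or sink capability")

info = ConnectorInfo(id="conn-1", type="MongoDB",
                     source=Capability(data_types=["MONGO_BSON"]))
validate(info)  # passes: source capability is declared
```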
Source Connector
Source connectors support a single specified namespace by default, but can additionally be configured to declare support for multiple namespaces or for no specified namespace. These options simply provide flexibility; for example, no specified namespace can mean "everything", but the interpretation is up to the connector.
What string constitutes a namespace is also up to the individual connector. For example, it could be a table name for DynamoDB, or a fully qualified `db.collection` identifier for MongoDB.
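The namespace modes and per-connector namespace strings described above can be summarized in a small sketch; the enum and example values are purely illustrative.

```python
from enum import Enum

class NamespaceSupport(Enum):
    SINGLE = "single"      # default: one specified namespace
    MULTIPLE = "multiple"  # several specified namespaces
    NONE = "none"          # no namespace specified; often interpreted as "everything"

# The namespace string's meaning is connector-specific:
examples = {
    "dynamodb": "orders",      # a table name
    "mongodb": "mydb.orders",  # a fully qualified db.collection identifier
}

print(NamespaceSupport.NONE.value)  # none
```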
Source connectors implement the `GetNamespaceMetadata`, `GeneratePlan`, `ListData`, `StreamUpdates`, and `StreamLSN` endpoints. If a connector does not provide the LSN stream capability, then `StreamLSN` should simply return success right away.
`GetNamespaceMetadata` currently only asks for the size of the namespace, which may be an estimate.
`GeneratePlan` is an important endpoint for specifying how the source can be partitioned. For the initial sync, a source could provide various ranges for query parallelism. For updates, a single partition may be returned to represent everything by not specifying a namespace.
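One way a source might build such a plan is sketched below, assuming a simple integer key space. The `Partition` shape and function names are illustrative assumptions, not the real plan format.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Partition:
    namespace: Optional[str]  # None can represent "everything" for an update stream
    low: Optional[int] = None
    high: Optional[int] = None

def generate_initial_sync_plan(namespace: str, doc_count: int, num_partitions: int):
    """Split a namespace's key space into ranges for query parallelism."""
    step = max(1, doc_count // num_partitions)
    return [Partition(namespace, low=i, high=min(i + step, doc_count))
            for i in range(0, doc_count, step)]

def generate_update_plan():
    # A single partition with no namespace represents the whole stream.
    return [Partition(namespace=None)]

plan = generate_initial_sync_plan("mydb.orders", doc_count=100, num_partitions=4)
print(len(plan))  # 4 range partitions: [0,25), [25,50), [50,75), [75,100)
```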
`ListData` and `StreamUpdates` are used to fetch the data for the initial sync and update plans respectively. `ListData` must preserve the output of the last cursor until the next cursor is used, to account for failure scenarios. It is acceptable to expire the last cursor's validity after enough time has passed. On the updates side, the next cursor is used to update a checkpoint on the stream. Ideally, every update should come with a cursor so that the stream can be resumed from that point onwards.
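The cursor-preservation requirement on `ListData` can be sketched like this: a page stays re-fetchable under its cursor until the next cursor is actually used, so a caller that fails mid-page can safely retry. The class and method names are assumptions for illustration.

```python
class ListDataServer:
    """Sketch of ListData cursor semantics (illustrative, not the real API)."""

    def __init__(self, rows, page_size=2):
        self.rows, self.page_size = rows, page_size
        self.pages = {}  # cursor -> cached (page, next_cursor), kept until superseded

    def list_data(self, cursor=0):
        if cursor in self.pages:
            # Re-delivery after a failure: same cursor, same page.
            return self.pages[cursor]
        page = self.rows[cursor:cursor + self.page_size]
        next_cursor = cursor + len(page)
        # Using this cursor implies the previous one may now be expired.
        self.pages.pop(cursor - self.page_size, None)
        self.pages[cursor] = (page, next_cursor)
        return self.pages[cursor]

srv = ListDataServer([1, 2, 3, 4])
page, nxt = srv.list_data(0)
print(page, nxt)                         # [1, 2] 2
print(srv.list_data(0) == (page, nxt))   # True: retrying the same cursor is safe
```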
Destination Connector
Destination connectors implement the `WriteData` and `WriteUpdates` endpoints. These are both batch endpoints that should return success only once the provided data or updates are considered persisted. Furthermore, reapplying the same update should not create duplicate data in the underlying data store. Under the hood, both could share the same implementation; the primary difference is that `WriteData` may provide an opportunity to optimize a pure insert/upsert command.
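The idempotency requirement can be sketched by keying every write on the primary key (`_id` for the `MONGO_BSON` type), so replaying a batch leaves the store unchanged. The class, method, and update shapes below are illustrative assumptions.

```python
class Destination:
    """Sketch of idempotent batch writes keyed on _id (illustrative)."""

    def __init__(self):
        self.store = {}

    def write_data(self, docs):
        # Pure insert/upsert path: a real connector could optimize this
        # (e.g. a bulk upsert) since no deletes can appear here.
        for doc in docs:
            self.store[doc["_id"]] = doc
        return "ok"  # returned only once the data is persisted

    def write_updates(self, updates):
        for up in updates:
            if up["op"] == "delete":
                self.store.pop(up["_id"], None)
            else:
                # Inserts and updates both apply as upserts on _id.
                self.store[up["_id"]] = up["doc"]
        return "ok"

dst = Destination()
dst.write_data([{"_id": 1, "v": "a"}])
dst.write_data([{"_id": 1, "v": "a"}])  # replay: still exactly one document
print(len(dst.store))  # 1
```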
Transform Interface
For the specifics of the gRPC interface, refer to the protobuf definitions.
`GetTransformInfo` is implemented to determine which data-type-to-data-type mappings are supported. At the very least, this could be just the identity mapping from each data type to itself.
`GetTransform` is the endpoint that actually performs the transform. It should apply the requested transform to the `data` and/or `updates`, whichever are present, and may also update the `namespace`.
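Both transform endpoints can be sketched together. The request shape, the supported-mapping set, and the uppercase transform are all illustrative assumptions; the real message types live in the protobuf definitions.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class TransformRequest:
    namespace: str
    data: Optional[list] = None     # present for initial-sync batches
    updates: Optional[list] = None  # present for update batches

# At minimum, the identity mapping from a data type to itself.
SUPPORTED = {("MONGO_BSON", "MONGO_BSON")}

def get_transform_info():
    return SUPPORTED

def get_transform(req: TransformRequest) -> TransformRequest:
    # Illustrative transform: uppercase a field and rewrite the namespace.
    fix = lambda doc: {**doc, "name": doc.get("name", "").upper()}
    return TransformRequest(
        namespace=req.namespace.replace("src_db", "dst_db"),
        data=[fix(d) for d in req.data] if req.data is not None else None,
        updates=[fix(u) for u in req.updates] if req.updates is not None else None,
    )

out = get_transform(TransformRequest("src_db.users", data=[{"name": "ada"}]))
print(out.namespace)  # dst_db.users
print(out.data)       # [{'name': 'ADA'}]
```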