Introduction
This repository / book describes the process for proposing changes to Graph Protocol in the form of RFCs and Engineering Plans.
It also includes all approved, rejected and obsolete RFCs and Engineering Plans. For more details, see the following pages:
RFCs
What is an RFC?
An RFC describes a change to Graph Protocol, for example a new feature. Any
substantial change goes through the RFC process, where the change is described
in an RFC, is proposed as a pull request to the rfcs repository, is reviewed,
currently by the core team, and ultimately is either approved or rejected.
RFC process
1. Create a new RFC
RFCs are numbered, starting at 0001. To create a new RFC, create a new branch
of the rfcs repository. Check the existing RFCs to identify the next number to
use. Then, copy the RFC template to a new file in the rfcs/ directory. For example:
cp rfcs/0000-template.md rfcs/0015-fulltext-search.md
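Identifying the next number can also be scripted. The following is a minimal sketch, assuming RFC files follow the NNNN-slug.md pattern shown above. The helper name nextRfcFileName and the sample file names are hypothetical and not part of the rfcs repository tooling.

```typescript
// Hypothetical helper: derive the next RFC file name from the file names
// already present in the rfcs/ directory.
function nextRfcFileName(existing: string[], slug: string): string {
  const numbers = existing
    .map((name) => /^(\d{4})-/.exec(name))
    .filter((match): match is RegExpExecArray => match !== null)
    .map((match) => parseInt(match[1], 10));
  // The template (0000) counts as number 0, so the first real RFC is 0001.
  const next = Math.max(0, ...numbers) + 1;
  return `${("0000" + next).slice(-4)}-${slug}.md`;
}

// Made-up directory listing, for illustration only.
const existingRfcFiles = [
  "0000-template.md",
  "0001-subgraph-composition.md",
  "0014-example.md",
];
console.log(nextRfcFileName(existingRfcFiles, "fulltext-search"));
```

The sketch only inspects file names. It does not account for RFC numbers claimed by other open pull requests, which still need a manual check.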
Write the RFC, commit it to the branch and open a pull request in the rfcs
repository.
In addition to the RFC itself, the pull request must include the following changes:
- a link to the RFC on the Approved RFCs page, and
- a link to the RFC under Approved RFCs in SUMMARY.md.
2. RFC review
After an RFC has been submitted through a pull request, it is reviewed. At the time of writing, every RFC needs to be approved by
- at least one Graph Protocol founder, and
- at least one member of the core development team.
3. RFC approval
Once an RFC is approved, the RFC metadata (see the template) is updated and the pull request is merged by the original author or a Graph Protocol team member.
Approved RFCs
- RFC-0001: Subgraph Composition
- RFC-0002: Ethereum Tracing Cache
- RFC-0003: Mutations
- RFC-0004: Fulltext Search
- RFC-0005: Multi-Blockchain Support
RFC-0001: Subgraph Composition
- Author: Jannis Pohlmann
- RFC pull request: https://github.com/graphprotocol/rfcs/pull/1
- Obsoletes: -
- Date of submission: 2019-12-08
- Date of approval: -
- Approved by: -
Summary
Subgraph composition enables referencing, extending and querying entities across subgraph boundaries.
Goals & Motivation
The high-level goal of subgraph composition is to be able to compose subgraph schemas and data hierarchically. Imagine umbrella subgraphs that combine all the data from a domain (e.g. DeFi, job markets, music) through one unified, coherent API. This could allow reuse and governance at different levels and go all the way to the top, fulfilling the vision of the Graph.
The ability to reference, extend and query entities across subgraph boundaries enables several use cases:
- Linking entities across subgraphs.
- Extending entities defined in other subgraphs by adding new fields.
- Breaking down data silos by composing subgraphs and defining richer schemas without indexing the same data over and over again.
Subgraph composition is needed to avoid duplicated work, both in terms of developing subgraphs and indexing them. It is an essential part of the overall vision behind The Graph, as it allows isolated subgraphs to be combined into a complete, connected graph of the (decentralized) world's data.
Subgraph developers will benefit from the ability to reference data from other subgraphs, saving them development time and enabling richer data models. dApp developers will be able to leverage this to build more compelling applications. Node operators will benefit from subgraph composition by having better insight into which subgraphs are queried together, allowing them to make more informed decisions about which subgraphs to index.
Urgency
Due to the high impact of this feature and its important role in fulfilling the vision behind The Graph, it would be good to start working on this as early as possible.
Terminology
The feature is referred to as query-time subgraph composition, or subgraph composition for short.
Terms introduced and used in this RFC:
- Imported schema: The schema of another subgraph from which types are imported.
- Imported type: An entity type imported from another subgraph schema.
- Extended type: An entity type imported from another subgraph schema and extended in the subgraph that imports it.
- Local schema: The schema of the subgraph that imports from another subgraph.
- Local type: A type defined in the local schema.
Detailed Design
The sections below assume that there is a subgraph with the name
ethereum/mainnet that includes an Address entity type.
Composing Subgraphs By Importing Types
In order to reference entity types from another subgraph, a developer would first import these types from the other subgraph's schema.
Types can be imported either from a subgraph name or from a subgraph ID. Importing from a subgraph name means that the exact version of the imported subgraph will be identified at query time and its schema may change in arbitrary ways over time. Importing from a subgraph ID guarantees that the schema will never change but also means that the import points to a subgraph version that may become outdated over time.
Let's say a DAO subgraph contains a Proposal type that has a proposer field
that should link to an Ethereum address (think: Ethereum accounts or contracts)
and a transaction field that should link to an Ethereum transaction. The
developer would then write the DAO subgraph schema as follows:
type _Schema_
@import(
types: ["Address", { name: "Transaction", as: "EthereumTransaction" }],
from: { name: "ethereum/mainnet" }
)
type Proposal @entity {
id: ID!
proposer: Address!
transaction: EthereumTransaction!
}
This would then allow queries that follow the references to addresses and transactions, like
{
proposals {
proposer {
balance
address
}
transaction {
hash
block {
number
}
}
}
}
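To make the shape of the @import directive concrete, the sketch below models its arguments as TypeScript types and flattens them into one entry per imported type. All type and function names here are illustrative assumptions, not Graph Node APIs.

```typescript
// Illustrative model of the proposed @import directive arguments.
type ImportedTypeEntry = string | { name: string; as: string };
type ImportSource = { name: string } | { id: string };

interface SchemaImport {
  types: ImportedTypeEntry[];
  from: ImportSource;
}

interface ResolvedImport {
  localName: string; // name used in the local schema
  remoteName: string; // name in the imported schema
  from: ImportSource;
}

// Flatten all imports into one entry per imported type, applying aliases.
function resolveImports(imports: SchemaImport[]): ResolvedImport[] {
  const resolved: ResolvedImport[] = [];
  for (const imp of imports) {
    for (const entry of imp.types) {
      resolved.push(
        typeof entry === "string"
          ? { localName: entry, remoteName: entry, from: imp.from }
          : { localName: entry.as, remoteName: entry.name, from: imp.from }
      );
    }
  }
  return resolved;
}

// The import from the DAO schema above.
const daoImports: SchemaImport[] = [
  {
    types: ["Address", { name: "Transaction", as: "EthereumTransaction" }],
    from: { name: "ethereum/mainnet" },
  },
];
const resolvedDaoImports = resolveImports(daoImports);
console.log(resolvedDaoImports.map((r) => `${r.localName} <- ${r.remoteName}`));
```

Keeping both the local and the remote name per entry is one way a query engine could translate a field of type EthereumTransaction back to the Transaction type of the imported subgraph.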
Extending Types From Imported Schemas
Extending types from another subgraph involves several steps:
- Importing the entity types from the other subgraph.
- Extending these types with custom fields.
- Managing (e.g. creating) extended entities in subgraph mappings.
Let's say the DAO subgraph wants to extend the Ethereum Address type to
include the proposals created by each respective account. To achieve this, the
developer would write the following schema:
type _Schema_
@import(
types: ["Address"],
from: { name: "ethereum/mainnet" }
)
type Proposal @entity {
id: ID!
proposer: Address!
}
extend type Address {
proposals: [Proposal!]! @derivedFrom(field: "proposer")
}
This makes queries like the following possible, where the query can go "back"
from addresses to proposal entities, despite the Ethereum Address type
originally being defined in the ethereum/mainnet subgraph.
{
addresses {
id
proposals {
id
proposer {
id
}
}
}
}
In the above case, the proposals field on the extended type is derived, which
means that an implementation wouldn't have to create a local extension type in
the store. However, if proposals was defined as
extend type Address {
proposals: [Proposal!]!
}
then the subgraph mappings would have to create partial Address entities with
id and proposals fields for all addresses from which proposals were created.
At query time, these entity instances would have to be merged with the
original Address entities from the ethereum/mainnet subgraph.
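The query-time merge described above can be sketched as follows. This is a minimal illustration, assuming entities are plain records keyed by id and that local extension entities without a matching original are ignored. The RFC does not specify either detail, and the function name is hypothetical.

```typescript
// An entity is modelled as a plain record with a mandatory id.
type Entity = { id: string; [field: string]: unknown };

// Merge partial local extension entities into the original entities
// fetched from the imported subgraph. Fields from the extension are
// added to (or override) fields of the original with the same id.
function mergeExtendedEntities(originals: Entity[], extensions: Entity[]): Entity[] {
  const extensionById = new Map<string, Entity>();
  for (const extension of extensions) {
    extensionById.set(extension.id, extension);
  }
  return originals.map((original) => {
    const extension = extensionById.get(original.id);
    return extension ? { ...original, ...extension } : original;
  });
}

// Original Address entities (from ethereum/mainnet) and partial local ones.
const originalAddresses: Entity[] = [
  { id: "0x1", balance: "10" },
  { id: "0x2", balance: "20" },
];
const localAddressExtensions: Entity[] = [{ id: "0x1", proposals: ["p1"] }];
const mergedAddresses = mergeExtendedEntities(originalAddresses, localAddressExtensions);
console.log(JSON.stringify(mergedAddresses));
```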
Subgraph Availability
In the decentralized network, queries will be split and routed through the network based on what indexers are available and which subgraphs they index. At that point, failure to find an indexer for a subgraph that types were imported from will result in a query error. The error that a non-nullable field resolved to null bubbles up to the next nullable parent, in accordance with the GraphQL Spec.
Until the network is a reality, we are dealing with individual Graph Nodes.
Querying subgraphs whose imported entity types are not indexed on the same
node should therefore be handled with more tolerance. This RFC proposes that
entity reference fields that refer to imported types are converted to being
optional in the generated API schema. If the subgraph that the type is
imported from is not available on a node, such fields should resolve to null.
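The proposed conversion to optional fields can be illustrated with a small sketch. It assumes field types are available as GraphQL type strings such as "Address!" and that only the outermost non-null marker is dropped. Both the representation and the function name are assumptions made for illustration.

```typescript
// A field definition with its type written in GraphQL notation.
interface FieldDef {
  name: string;
  type: string; // e.g. "ID!", "Address!", "[Proposal!]!"
}

// Make every field that refers to an imported type nullable in the
// generated API schema, so it can resolve to null when the imported
// subgraph is not available on the node.
function relaxImportedFields(fields: FieldDef[], importedTypes: Set<string>): FieldDef[] {
  return fields.map((field) => {
    // Strip list brackets and non-null markers to find the named type.
    const namedType = field.type.replace(/[\[\]!]/g, "");
    const isRequired = field.type.charAt(field.type.length - 1) === "!";
    return importedTypes.has(namedType) && isRequired
      ? { name: field.name, type: field.type.slice(0, -1) }
      : field;
  });
}

// Fields of the Proposal type from the DAO schema above.
const proposalFields: FieldDef[] = [
  { name: "id", type: "ID!" },
  { name: "proposer", type: "Address!" },
  { name: "transaction", type: "EthereumTransaction!" },
];
const relaxedProposalFields = relaxImportedFields(
  proposalFields,
  new Set(["Address", "EthereumTransaction"])
);
console.log(relaxedProposalFields.map((f) => `${f.name}: ${f.type}`));
```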
Interfaces
Subgraph composition also supports interfaces in the ways outlined below.
Interfaces Can Be Imported From Other Subgraphs
The syntax for this is the same as that for importing types:
type _Schema_
@import(types: ["ERC20"], from: { name: "graphprotocol/erc20" })
Local Types Can Implement Imported Interfaces
This is achieved by importing the interface from another subgraph schema and implementing it in entity types:
type _Schema_
@import(types: ["ERC20"], from: { name: "graphprotocol/erc20" })
type MyToken implements ERC20 @entity {
# ...
}
Imported Types Can Be Extended To Implement Local Interfaces
This is achieved by importing the types from another subgraph schema, defining a
local interface and using extend to implement the interface on the imported
types:
type _Schema_
@import(types: [{ name: "Token", as: "LPT" }], from: { name: "livepeer/livepeer" })
@import(types: [{ name: "Token", as: "Rep" }], from: { name: "augur/augur" })
interface Token {
id: ID!
balance: BigInt!
}
extend type LPT implements Token {
# ...
}
extend type Rep implements Token {
# ...
}
Imported Types Can Be Extended To Implement Imported Interfaces
This is a combination of importing an interface, importing the types and extending them to implement the interface:
type _Schema_
@import(types: ["Token"], from: { name: "graphprotocol/token" })
@import(types: [{ name: "Token", as: "LPT" }], from: { name: "livepeer/livepeer" })
@import(types: [{ name: "Token", as: "Rep" }], from: { name: "augur/augur" })
extend type LPT implements Token {
# ...
}
extend type Rep implements Token {
# ...
}
Implementation Concerns For Interface Support
Querying across types from different subgraphs that implement the same interface may require a smart algorithm, especially when it comes to pagination. For instance, if the first 1000 entities for an interface are queried, this range of 1000 entities may be divided up between different local and imported types arbitrarily.
A naive algorithm could request 1000 entities from each subgraph, applying the selected filters and order, combine the results and cut off everything after the first 1000 items. This would generate a minimum of requests but would involve significant overfetching.
Another algorithm could just fetch the first item from each subgraph, then, based on that information, divide up the range in more optimal ways than the previous algorithm, and satisfy the query with more requests but with less overfetching.
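The naive algorithm can be sketched in a few lines. This is an illustration under simplifying assumptions: each subgraph is represented by a synchronous function that returns its first N matching entities already sorted, and ordering is by a single numeric key. The names Source, sourceOf and naiveInterfaceQuery are hypothetical.

```typescript
interface Item {
  id: string;
  sortKey: number;
}

// Stands in for one subgraph: returns up to `first` items in ascending
// sortKey order. A real implementation would be an asynchronous query.
type Source = (first: number) => Item[];

// Build an in-memory source from a list of items (for illustration only).
const sourceOf = (items: Item[]): Source => (first) =>
  [...items].sort((a, b) => a.sortKey - b.sortKey).slice(0, first);

// Naive algorithm: one request per source for the full range, then merge,
// sort and cut off. Fetches up to sources.length * first items in total.
function naiveInterfaceQuery(sources: Source[], first: number): Item[] {
  const combined: Item[] = [];
  for (const fetchFirst of sources) {
    combined.push(...fetchFirst(first));
  }
  combined.sort((a, b) => a.sortKey - b.sortKey);
  return combined.slice(0, first);
}

const subgraphA = sourceOf([
  { id: "a1", sortKey: 1 },
  { id: "a4", sortKey: 4 },
  { id: "a5", sortKey: 5 },
]);
const subgraphB = sourceOf([
  { id: "b2", sortKey: 2 },
  { id: "b3", sortKey: 3 },
  { id: "b6", sortKey: 6 },
]);
const firstThree = naiveInterfaceQuery([subgraphA, subgraphB], 3);
console.log(firstThree.map((item) => item.id));
```

In this example six items are fetched to return three, which is the overfetching the text refers to. The second algorithm would trade that for additional round trips.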
Compatibility
Subgraph composition is a purely additive, non-breaking change. Existing subgraphs remain valid without any migrations being necessary.
Drawbacks And Risks
Reasons that could speak against implementing this feature:
- Schema parsing and validation becomes more complicated. Especially validation of imported schemas may not always be possible, depending on whether and when the referenced subgraph is available on the Graph Node or not.
- Query execution becomes more complicated. The subgraph a type belongs to must be identified and local as well as imported versions of extended entities have to be queried separately and be merged before returning data to the client.
Alternatives
No alternatives have been considered.
There are other ways to compose subgraph schemas using GraphQL technologies such as schema stitching or Apollo Federation. However, schema stitching is being deprecated and Apollo Federation requires a centralized server to extend and merge the GraphQL API. Both of these solutions slow down queries.
Another reason not to use these is that GraphQL will only be one of several query languages supported in the future. Composition therefore has to be implemented in a query-language-agnostic way.
Open Questions
- Right now, interfaces require unique IDs across all the concrete entity types that implement them. This is not something we can guarantee any longer if these concrete types live in different subgraphs. So we have to handle this at query time (or must somehow disallow it, returning a query error).
It is also unclear what an individual interface entity lookup would look like if IDs are no longer guaranteed to be unique:
someInterface(id: "?????") { }
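One of the options mentioned above, detecting the problem at query time, could start from a check like the one below. It is only a sketch of how colliding IDs might be found in a combined result set. It does not answer the open question, and all names in it are hypothetical.

```typescript
// An interface entity together with the subgraph and type it came from.
interface InterfaceEntity {
  id: string;
  subgraph: string;
  typeName: string;
}

// Group entities by ID and return only the IDs that occur more than once.
function findIdCollisions(entities: InterfaceEntity[]): Map<string, InterfaceEntity[]> {
  const byId = new Map<string, InterfaceEntity[]>();
  for (const entity of entities) {
    const group = byId.get(entity.id) || [];
    group.push(entity);
    byId.set(entity.id, group);
  }
  const collisions = new Map<string, InterfaceEntity[]>();
  byId.forEach((group, id) => {
    if (group.length > 1) {
      collisions.set(id, group);
    }
  });
  return collisions;
}

// Two tokens from different subgraphs that happen to share an ID.
const tokenCollisions = findIdCollisions([
  { id: "1", subgraph: "livepeer/livepeer", typeName: "LPT" },
  { id: "1", subgraph: "augur/augur", typeName: "Rep" },
  { id: "2", subgraph: "augur/augur", typeName: "Rep" },
]);
console.log(Array.from(tokenCollisions.keys()));
```

A query engine could turn a non-empty result into a query error, or use the subgraph and type name to disambiguate the entities.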
RFC-0002: Ethereum Tracing Cache
- Author: Zac Burns
- RFC pull request: https://github.com/graphprotocol/rfcs/pull/4
- Obsoletes (if applicable): None
- Date of submission: 2019-12-13
- Date of approval: 2019-12-20
- Approved by: Jannis Pohlmann
Summary
This RFC proposes the creation of a local Ethereum tracing cache to speed up indexing of subgraphs which use block and/or call handlers.
Motivation
When indexing a subgraph that uses block and/or call handlers, it is necessary to extract calls from the trace of each block that a Graph Node indexes. It is expensive to acquire and process traces from Ethereum nodes in both money and time.
When developing a subgraph it is common to make changes and deploy those changes to a production Graph Node for testing. Each time a change is deployed, the Graph Node must re-sync the subgraph using the same traces that were used for the previous sync of the subgraph. The cost of acquiring the traces each time a change is deployed impacts a subgraph developer's ability to iterate and test quickly.
Urgency
None
Terminology
Ethereum cache: The new API proposed here.
Detailed Design
There is an existing EthereumCallCache for caching eth_call built into Graph Node today. This cache will be extended to support traces, and renamed to EthereumCache.
Compatibility
This change is backwards compatible. Existing code can continue to use the parity tracing API. Because the cache is local, each indexing node may delete the cache should the format or implementation of caching change. If the cache is invalidated in this way, the code will fall back to the existing methods for retrieving a trace and repopulate the cache.
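The fall-back behaviour can be pictured as a read-through cache. The sketch below is written in TypeScript purely for illustration and is not the EthereumCache interface. The class name, the Trace shape and the synchronous fetch function are all assumptions.

```typescript
// Simplified stand-in for a block trace.
type Trace = { blockNumber: number; calls: string[] };

// Read-through cache: serve traces from the local cache when present,
// otherwise fall back to the Ethereum node and repopulate the cache.
class TraceCacheSketch {
  private readonly entries = new Map<number, Trace>();
  private readonly fetchFromNode: (blockNumber: number) => Trace;

  constructor(fetchFromNode: (blockNumber: number) => Trace) {
    this.fetchFromNode = fetchFromNode;
  }

  get(blockNumber: number): Trace {
    const cached = this.entries.get(blockNumber);
    if (cached !== undefined) {
      return cached;
    }
    const trace = this.fetchFromNode(blockNumber);
    this.entries.set(blockNumber, trace);
    return trace;
  }

  // Deleting the cache is always safe; later reads repopulate it.
  clear(): void {
    this.entries.clear();
  }
}

let nodeRequests = 0;
const traceCache = new TraceCacheSketch((blockNumber) => {
  nodeRequests += 1; // counts the expensive requests to the node
  return { blockNumber, calls: [] };
});
traceCache.get(100); // miss: fetched from the node
traceCache.get(100); // hit: served from the cache
traceCache.clear(); // e.g. after a format change
traceCache.get(100); // miss again: cache is repopulated
console.log(nodeRequests);
```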
Drawbacks and Risks
Subgraphs which are not being actively developed will incur the overhead of storing traces, but will never reap the benefit of reading them back from the cache.
If this drawback is significant, it may be necessary to extend EthereumCache to provide a custom score for cache invalidation other than the current date. For example, trace_filter calls could be invalidated based on the latest update time for a subgraph requiring the trace. It is expected that a subgraph which has been updated recently is more likely to be updated again soon than a subgraph which has not been recently updated.
Alternatives
None
Open Questions
None
RFC-0003: Mutations
- Author: dOrg: Jordan Ellis, Nestor Amesty
- RFC pull request: URL
- Date of submission: 2019-12-20
- Date of approval: 2020-02-03
- Approved by: Jannis Pohlmann
Contents
- Summary
- Goals & Motivation
- Urgency
- Terminology
- Detailed Design
- Compatibility
- Drawbacks and Risks
- Alternatives
- Open Questions
Summary
GraphQL mutations allow developers to add executable functions to their schema. Callers can invoke these functions using GraphQL queries. An introduction to how mutations are defined and work can be found here. This RFC will assume the reader understands how to use GraphQL mutations in a traditional Web2 application. This proposal describes how mutations are added to The Graph's toolchain, and used to replace Web3 write operations the same way The Graph has replaced Web3 read operations.
Goals & Motivation
The Graph has created a read semantic layer that describes smart contract protocols, which has made it easier to build applications on top of complex protocols. Since dApps have two primary interactions with Web3 protocols (reading & writing), the next logical addition is write support.
Protocol developers that use a subgraph still often publish a Javascript wrapper library for their dApp developers (examples: DAOstack, ENS, LivePeer, DAI, Uniswap). This is done to help speed up dApp development and promote consistency with protocol usage patterns. With the addition of mutations to the Graph Protocol's GraphQL tooling, Web3 reading & writing can now both be invoked through GraphQL queries. dApp developers can now simply refer to a single GraphQL schema that defines the entire protocol.
Urgency
This is urgent from a developer experience point of view. This addition eliminates the need for protocol developers to manually wrap GraphQL query interfaces alongside developer-friendly write functions. Additionally, mutations provide a solution for optimistic UI updates, which is something dApp developers have been seeking for a long time (see here). Lastly, with the whole protocol now defined in GraphQL, existing application layer code generators can now be used to hasten dApp development (some examples).
Terminology
- Mutations: Collection of mutations.
- Mutation: A GraphQL mutation.
- Mutations Schema: A GraphQL schema that defines a type Mutation, which contains all mutations. Additionally, this schema can define other types to be used by the mutations, such as input and interface types.
- Mutations Manifest: A YAML manifest file that is used to add mutations to an existing subgraph manifest. This manifest can be stored in an external YAML file, or within the subgraph manifest's YAML file under the mutations property.
- Mutation Resolvers: Code module that contains all resolvers.
- Resolver: Function that is used to execute a mutation's logic.
- Mutation Context: A context object that's created for every mutation that's executed. It's passed as the 3rd argument to the resolver function.
- Mutation States: A collection of mutation states. One is created for each mutation being executed in a given query.
- Mutation State: The state of a mutation being executed. Also referred to in this document as "State". It is an aggregate of the core & extended states (see below). dApp developers can subscribe to the mutation's state upon execution of the mutation query. See the useMutation examples below.
- Core State: Default properties present within every mutation state. Some examples: events: Event[], uuid: string, and progress: number.
- Extended State: Properties the mutation developer defines. These are added alongside the core state properties in the mutation state. There are no bounds to what a developer can define here. See examples below.
- State Events: Events emitted by mutation resolvers. Also referred to in this document as "Events". Events are defined by a name: string and a payload: any. These events, once emitted, are given to reducer functions which then update the state accordingly.
- Core Events: Default events available to all mutations. Some examples: PROGRESS_UPDATE, TRANSACTION_CREATED, TRANSACTION_COMPLETED.
- Extended Events: Events the mutation developer defines. See examples below.
- State Reducers: A collection of state reducer functions.
- State Reducer: Reducers are responsible for translating events into state updates. They take the form of a function that has the inputs [event, current state], and returns the new state post-event. Also referred to in this document as "Reducer(s)".
- Core Reducers: Default reducers that handle the processing of the core events.
- Extended Reducers: Reducers the mutation developer defines. These reducers can be defined for any event, core or extended. The core & extended reducers are run one after another if both are defined for a given core event. See examples below.
- State Updater: The state updater object is used by the resolvers to dispatch events. It's passed to the resolvers through the mutation context like so: context.graph.state.
- State Builder: An object responsible for (1) initializing the state with initial values and (2) defining reducers for events.
- Core State Builder: A state builder that's defined by default. It's responsible for initializing the core state properties, and processing the core events with its reducers.
- Extended State Builder: A state builder defined by the mutation developer. It's responsible for initializing the extended state properties, and processing the extended events with its reducers.
- Mutations Config: Collection of config properties required by the mutation resolvers. Also referred to in this document as "Config". All resolvers share the same config. It's passed to the resolver through the mutation context like so: context.graph.config.
- Config Property: A single property within the config (ex: ipfs, ethereum, etc).
- Config Generator: A function that takes a config argument, and returns a config property. For example, "localhost:5001" as a config argument gets turned into a new IPFS client by the config generator.
- Config Argument: An initialization argument that's passed into the config generator function. This config argument is provided by the dApp developer.
- Optimistic Response: A response given to the dApp that predicts what the outcome of the mutation's execution will be. If it is incorrect, it will be overwritten with the actual result.
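The relationship between events, reducers and state defined in the terms above can be sketched as follows. This is a simplified, synchronous illustration. The applyEvent function, the rule that dispatched events are appended to the events list, and the simplified type definitions are assumptions rather than the API of the mutations package.

```typescript
interface StateEvent {
  name: string;
  payload: any;
}

// Core state properties named in the terminology above.
interface CoreState {
  uuid: string;
  progress: number;
  events: StateEvent[];
}

// A reducer turns the current state and an event payload into a partial update.
type Reducer<S> = (state: S, payload: any) => Partial<S>;
type Reducers<S> = { [eventName: string]: Reducer<S> | undefined };

// Apply one event: record it, then run the core reducer followed by the
// extended reducer if either is defined for this event name.
function applyEvent<S extends CoreState>(
  state: S,
  event: StateEvent,
  core: Reducers<S>,
  extended: Reducers<S>
): S {
  let next: S = { ...state, events: [...state.events, event] };
  for (const reducers of [core, extended]) {
    const reducer = reducers[event.name];
    if (reducer) {
      next = { ...next, ...reducer(next, event.payload) };
    }
  }
  return next;
}

// Extended state with one custom property.
interface MyState extends CoreState {
  myValue: string;
}
const coreReducers: Reducers<MyState> = {
  PROGRESS_UPDATE: (_state, payload) => ({ progress: payload.progress }),
};
const extendedReducers: Reducers<MyState> = {
  MY_EVENT: (_state, payload) => ({ myValue: payload.myValue }),
};

let mutationState: MyState = { uuid: "example-uuid", progress: 0, events: [], myValue: "" };
mutationState = applyEvent(mutationState, { name: "PROGRESS_UPDATE", payload: { progress: 0.5 } }, coreReducers, extendedReducers);
mutationState = applyEvent(mutationState, { name: "MY_EVENT", payload: { myValue: "hello" } }, coreReducers, extendedReducers);
console.log(mutationState.progress, mutationState.myValue, mutationState.events.length);
```

Returning partial updates keeps each reducer focused on the properties it owns, which matches the split between core and extended state.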
Detailed Design
The sections below illustrate how a developer would add mutations to an existing subgraph, and then add those mutations to a dApp.
Mutations Manifest
The subgraph manifest (subgraph.yaml) now has an extra property named mutations, which is the mutations manifest.
subgraph.yaml
specVersion: ...
...
mutations:
  repository: https://npmjs.com/package/...
  schema:
    file: ./mutations/schema.graphql
  resolvers:
    apiVersion: 0.0.1
    kind: javascript/es5
    file: ./mutations/index.js
    types: ./mutations/index.d.ts
dataSources: ...
...
Alternatively, the mutation manifest can be external like so:
subgraph.yaml
specVersion: ...
...
mutations:
  file: ./mutations/mutations.yaml
dataSources: ...
...
mutations/mutations.yaml
specVersion: ...
repository: https://npmjs.com/package/...
schema:
  file: ./schema.graphql
resolvers:
  apiVersion: 0.0.1
  kind: javascript/es5
  file: ./index.js
  types: ./index.d.ts
NOTE: resolvers.types is required. More on this below.
Mutations Schema
The mutations schema defines all of the mutations in the subgraph. The mutations schema builds on the subgraph schema, allowing the use of types from the subgraph schema, as well as defining new types that are used only in the context of mutations. For example, starting from a base subgraph schema:
schema.graphql
type MyEntity @entity {
id: ID!
name: String!
value: BigInt!
}
Developers can define mutations that reference these subgraph schema types. Additionally, new input and interface types can be defined for the mutations to use:
mutations/schema.graphql
input MyEntityOptions {
name: String!
value: BigInt!
}
interface NewNameSet {
oldName: String!
newName: String!
}
type Mutation {
createEntity(
options: MyEntityOptions!
): MyEntity!
setEntityName(
entity: MyEntity!
name: String!
): NewNameSet!
}
graph-cli handles the parsing and validating of these two schemas. It verifies that the mutations schema defines a type Mutation and that all of the mutations within it are defined in the resolvers module (see next section).
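The second check, that every mutation has a resolver, could look roughly like the sketch below. It assumes the mutation field names have already been extracted from the schema by a GraphQL parser. The function name and types are hypothetical and do not describe how graph-cli is implemented.

```typescript
// Shape of the resolvers object exported by the resolvers module.
type ResolverMap = { Mutation?: { [field: string]: unknown } };

// Return the names of all mutations that lack a resolver function.
function missingResolvers(mutationFields: string[], resolvers: ResolverMap): string[] {
  const defined: { [field: string]: unknown } = resolvers.Mutation || {};
  return mutationFields.filter((field) => typeof defined[field] !== "function");
}

// Field names of the type Mutation from the schema above.
const mutationFieldNames = ["createEntity", "setEntityName"];

// A resolvers module that only implements one of the two mutations.
const incompleteResolvers: ResolverMap = {
  Mutation: {
    createEntity: () => null,
  },
};
console.log(missingResolvers(mutationFieldNames, incompleteResolvers));
```

An empty result means the resolvers module covers the schema. Anything else could be reported as a validation error at build time.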
Mutation Resolvers
Each mutation within the schema must have a corresponding resolver function defined. Resolvers will be invoked by whatever engine executes the mutation queries (ex: Apollo Client). They are executed locally within the client application.
Mutation resolvers of kind javascript/es5 take the form of an ES5 JavaScript module. This module is expected to have a default export that contains the following properties:
- resolvers: MutationResolvers - The mutation resolver functions. The shape of this object must match the shape of the type Mutation defined above. See the example below for a demonstration of this. Resolvers have the following prototype, as defined in graphql-js:

  import { GraphQLFieldResolver } from 'graphql'

  interface MutationContext<
    TConfig extends ConfigGenerators,
    TState,
    TEventMap extends EventTypeMap
  > {
    [prop: string]: any,
    graph: {
      config: ConfigProperties<TConfig>,
      dataSources: DataSources,
      state: StateUpdater<TState, TEventMap>
    }
  }

  interface MutationResolvers<
    TConfig extends ConfigGenerators,
    TState,
    TEventMap extends EventTypeMap
  > {
    Mutation: {
      [field: string]: GraphQLFieldResolver<
        any,
        MutationContext<TConfig, TState, TEventMap>
      >
    }
  }

- config: ConfigGenerators - A collection of config generators. The config object is made up of properties, that can be nested, but all terminate in the form of a function with the prototype:

  type ConfigGenerator<TArg, TRet> = (arg: TArg) => TRet

  interface ConfigGenerators {
    [prop: string]: ConfigGenerator<any, any> | ConfigGenerators
  }

  See the example below for a demonstration of this.
- stateBuilder: StateBuilder (optional) - A state builder interface responsible for (1) initializing extended state properties and (2) reducing extended state events. State builders implement the following interface:

  type MutationState<TState> = CoreState & TState
  type MutationEvents<TEventMap> = CoreEvents & TEventMap

  interface StateBuilder<TState, TEventMap extends EventTypeMap> {
    getInitialState(uuid: string): TState,
    // Event Specific Reducers
    reducers?: {
      [TEvent in keyof MutationEvents<TEventMap>]?: (
        state: MutationState<TState>,
        payload: InferEventPayload<TEvent, TEventMap>
      ) => OptionalAsync<Partial<MutationState<TState>>>
    },
    // Catch-All Reducer
    reducer?: (
      state: MutationState<TState>,
      event: Event
    ) => OptionalAsync<Partial<MutationState<TState>>>
  }

  interface EventPayload { }

  interface Event {
    name: string
    payload: EventPayload
  }

  interface EventTypeMap {
    [name: string]: EventPayload
  }

  // Optionally support async functions
  type OptionalAsync<T> = Promise<T> | T

  // Infer the payload type from the event name, given an EventTypeMap
  type InferEventPayload<
    TEvent extends keyof TEvents,
    TEvents extends EventTypeMap
  > = TEvent extends keyof TEvents ? TEvents[TEvent] : any

  See the example below for a demonstration of this.
For example:
mutations/index.js
import {
Event,
EventPayload,
MutationContext,
MutationResolvers,
MutationState,
StateBuilder,
ProgressUpdateEvent
} from "@graphprotocol/mutations"
import gql from "graphql-tag"
import { ethers } from "ethers"
import {
AsyncSendable,
Web3Provider
} from "ethers/providers"
import IPFS from "ipfs"
// Typesafe Context
type Context = MutationContext<Config, State, EventMap>
/// Mutation Resolvers
const resolvers: MutationResolvers<Config, State, EventMap> = {
Mutation: {
async createEntity (source: any, args: any, context: Context) {
// Extract mutation arguments
const { name, value } = args.options
// Use config properties created by the
// config generator functions
const { ethereum, ipfs } = context.graph.config
// Create ethereum transactions...
// Fetch & upload to ipfs...
// Dispatch a state event through the state updater
const { state } = context.graph
await state.dispatch("PROGRESS_UPDATE", { progress: 0.5 })
// Dispatch a custom extended event
await state.dispatch("MY_EVENT", { myValue: "..." })
// Get a copy of the current state
const currentState = state.current
// Send another query using the same client.
// This query would result in the graph-node's
// entity store being fetched from. You could also
// execute another mutation here if desired.
const { client } = context
await client.query({
query: gql`
{
myEntity (id: "${id}") {
id
name
value
}
}`
})
...
},
async setEntityName (source: any, args: any, context: Context) {
...
}
}
}
/// Config Generators
type Config = typeof config
const config = {
// These function arguments are passed in by the dApp
ethereum: (arg: AsyncSendable): Web3Provider => {
return new ethers.providers.Web3Provider(arg)
},
ipfs: (arg: string): IPFS => {
return new IPFS(arg)
},
// Example of a custom config property
property: {
// Generators can be nested
a: (arg: string) => { },
b: (arg: string) => { }
}
}
/// (optional) Extended State, Events, and State Builder
// Extended State
interface State {
myValue: string
}
// Extended Events
interface MyEvent extends EventPayload {
myValue: string
}
type EventMap = {
"MY_EVENT": MyEvent
}
// Extended State Builder
const stateBuilder: StateBuilder<State, EventMap> = {
getInitialState(): State {
return {
myValue: ""
}
},
reducers: {
"MY_EVENT": async (state: MutationState<State>, payload: MyEvent) => {
return {
myValue: payload.myValue
}
},
"PROGRESS_UPDATE": (state: MutationState<State>, payload: ProgressUpdateEvent) => {
// Do something custom...
}
},
// Catch-all reducer...
reducer: (state: MutationState<State>, event: Event) => {
switch (event.name) {
case "TRANSACTION_CREATED":
// Do something custom...
break
}
}
}
export default {
resolvers,
config,
stateBuilder
}
// Required Types
export {
Config,
State,
EventMap,
MyEvent
}
NOTE: It's expected that the mutations manifest has a resolvers.types file defined. The following types must be defined in the .d.ts type definition file:
- Config
- State
- EventMap
- Any EventPayload interfaces defined within the EventMap
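The way nested config generators turn config arguments into config properties can be sketched as a small recursive function. This is a minimal illustration that assumes plain, synchronous config arguments. Arguments given as functions for dynamic fetching are not handled, and the createConfig name and the simplified types are hypothetical.

```typescript
// Simplified versions of the types described in this section.
type ConfigGenerator = (arg: any) => any;
interface ConfigGenerators {
  [prop: string]: ConfigGenerator | ConfigGenerators;
}
interface ConfigArguments {
  [prop: string]: any;
}

// Walk the generators tree and call every generator with the argument
// found at the same path in the arguments tree.
function createConfig(
  generators: ConfigGenerators,
  args: ConfigArguments
): { [prop: string]: any } {
  const config: { [prop: string]: any } = {};
  for (const key of Object.keys(generators)) {
    const generator = generators[key];
    const arg = args[key];
    config[key] =
      typeof generator === "function"
        ? generator(arg)
        : createConfig(generator, arg || {});
  }
  return config;
}

// Stand-in generators; a real module would create clients here.
const exampleGenerators: ConfigGenerators = {
  ipfs: (url: string) => ({ url }),
  property: {
    a: (value: string) => value.toUpperCase(),
  },
};
const exampleConfig = createConfig(exampleGenerators, {
  ipfs: "localhost:5001",
  property: { a: "x" },
});
console.log(JSON.stringify(exampleConfig));
```

The resulting object has the same nesting as the generators, so resolvers can read it through context.graph.config with matching property paths.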
dApp Integration
In addition to the resolvers module defined above, the dApp has access to a run-time API to help with the instantiation and execution of mutations. This package is called @graphprotocol/mutations and is defined like so:
- createMutations - Create a mutations interface which enables the user to execute a mutation query and configure the mutation module.

  interface CreateMutationsOptions<
    TConfig extends ConfigGenerators,
    TState,
    TEventMap extends EventTypeMap
  > {
    mutations: MutationsModule<TConfig, TState, TEventMap>,
    subgraph: string,
    node: string,
    config: ConfigArguments<TConfig>
    mutationExecutor?: MutationExecutor<TConfig, TState, TEventMap>
  }

  interface Mutations<
    TConfig extends ConfigGenerators,
    TState,
    TEventMap extends EventTypeMap
  > {
    execute: (query: MutationQuery<TConfig, TState, TEventMap>) => Promise<MutationResult>
    configure: (config: ConfigArguments<TConfig>) => void
  }

  const createMutations = <
    TConfig extends ConfigGenerators,
    TState = CoreState,
    TEventMap extends EventTypeMap = { },
  >(
    options: CreateMutationsOptions<TConfig, TState, TEventMap>
  ): Mutations<TConfig, TState, TEventMap> => { ... }

- createMutationsLink - Wrap the mutations created above in an ApolloLink.

  const createMutationsLink = <
    TConfig extends ConfigGenerators,
    TState,
    TEventMap extends EventTypeMap,
  > (
    { mutations }: { mutations: Mutations<TConfig, TState, TEventMap> }
  ): ApolloLink => { ... }
For applications using Apollo and React, a run-time API is available which mimics commonly used hooks and components for executing mutations, with the addition of having the mutation state available to the caller. This package is called @graphprotocol/mutations-apollo-react and is defined like so:
- useMutation - see https://www.apollographql.com/docs/react/data/mutations/#executing-a-mutation

  import { DocumentNode } from "graphql"
  import {
    ExecutionResult,
    MutationFunctionOptions,
    MutationResult,
    OperationVariables
  } from "@apollo/react-common"
  import { MutationHookOptions } from "@apollo/react-hooks"
  import { CoreState } from "@graphprotocol/mutations"

  type MutationStates<TState> = {
    [mutation: string]: MutationState<TState>
  }

  interface MutationResultWithState<TState, TData = any> extends MutationResult<TData> {
    state: MutationStates<TState>
  }

  type MutationTupleWithState<TState, TData, TVariables> = [
    (
      options?: MutationFunctionOptions<TData, TVariables>
    ) => Promise<ExecutionResult<TData>>,
    MutationResultWithState<TState, TData>
  ]

  const useMutation = <
    TState = CoreState,
    TData = any,
    TVariables = OperationVariables
  >(
    mutation: DocumentNode,
    mutationOptions: MutationHookOptions<TData, TVariables>
  ): MutationTupleWithState<TState, TData, TVariables> => { ... }

- Mutation - see https://www.howtographql.com/react-apollo/3-mutations-creating-links/

  interface MutationComponentOptionsWithState<
    TState,
    TData,
    TVariables
  > extends BaseMutationOptions<TData, TVariables> {
    mutation: DocumentNode
    children: (
      mutateFunction: MutationFunction<TData, TVariables>,
      result: MutationResultWithState<TState, TData>
    ) => JSX.Element | null
  }

  const Mutation = <
    TState = CoreState,
    TData = any,
    TVariables = OperationVariables
  >(
    props: MutationComponentOptionsWithState<TState, TData, TVariables>
  ): JSX.Element | null => { ... }
For example:
dApp/src/App.tsx
import {
createMutations,
createMutationsLink
} from "@graphprotocol/mutations"
import {
Mutation,
useMutation
} from "@graphprotocol/mutations-apollo-react"
import myMutations, { State } from "mutations-js-module"
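import { createHttpLink } from "apollo-link-http"
// The imports below are assumed to come from the Apollo Client 2.x
// packages that accompany apollo-link-http.
import { split } from "apollo-link"
import { getMainDefinition } from "apollo-utilities"
import { ApolloClient } from "apollo-client"
import { InMemoryCache } from "apollo-cache-inmemory"
import gql from "graphql-tag"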
const mutations = createMutations({
mutations: myMutations,
// Config args, which will be passed to the generators
config: {
// Config args can take the form of functions to allow
// for dynamic fetching behavior
ethereum: async (): Promise<AsyncSendable> => {
const { ethereum } = (window as any)
await ethereum.enable()
return ethereum
},
ipfs: "http://localhost:5001",
property: {
a: "...",
b: "..."
}
},
subgraph: "my-subgraph",
node: "http://localhost:8080"
})
// Create Apollo links to handle queries and mutation queries
const mutationLink = createMutationsLink({ mutations })
const queryLink = createHttpLink({
uri: "http://localhost:8080/subgraphs/name/my-subgraph"
})
// Create a root ApolloLink which splits queries between
// the two different operation links (query & mutation)
const link = split(
({ query }) => {
const node = getMainDefinition(query)
return node.kind === "OperationDefinition" &&
node.operation === "mutation"
},
mutationLink,
queryLink
)
// Create an Apollo Client
const client = new ApolloClient({
link,
cache: new InMemoryCache()
})
const CREATE_ENTITY = gql`
mutation createEntity($options: MyEntityOptions) {
createEntity(options: $options) {
id
name
value
}
}
`
// exec: execution function for the mutation query
// loading: https://www.apollographql.com/docs/react/data/mutations/#tracking-mutation-status
// state: mutation state instance
const [exec, { loading, state }] = useMutation<State>(
CREATE_ENTITY,
{
client,
variables: {
options: { name: "...", value: 5 }
}
}
)
// Access the mutation's state like so:
state.createEntity.myValue
// Optimistic responses can be used to update
// the UI before the execution has finished.
// More information can be found here:
// https://www.apollographql.com/docs/react/performance/optimistic-ui/
const [exec, { loading, state }] = useMutation(
CREATE_ENTITY,
{
optimisticResponse: {
__typename: "Mutation",
createEntity: {
__typename: "MyEntity",
name: "...",
value: 5,
// NOTE: ID must be known so the
// final response can be correlated.
// Please refer to Apollo's docs.
id: "id"
}
},
variables: {
options: { name: "...", value: 5 }
}
}
)
// Use the Mutation JSX Component
<Mutation
mutation={CREATE_ENTITY}
variables={{options: { name: "...", value: 5 }}}
>
{(exec, { loading, state }) => (
<button onClick={exec} />
)}
</Mutation>
Compatibility
No breaking changes will be introduced, as mutations are an optional add-on to a subgraph.
Drawbacks and Risks
Nothing apparent at the moment.
Alternatives
The existing alternative that protocol developers are creating for dApp developers has been described above.
Open Questions
-
How can mutations pick up where they left off in the event of an abrupt application shutdown? Since mutations can contain many different steps internally, it would be ideal to support continuing resolver execution after the dApp shuts down abruptly.
-
How can dApps understand what steps a given mutation will take during the course of its execution? dApps may want to present user-friendly progress updates, letting the user know that a given mutation is, for example, three quarters of the way through its execution, along with a high-level description of each step. I view this as closely tied to the previous open question, as we could support continuing resolver executions if we know which step is currently executing. A potential implementation could include adding a steps: Step[] property to the core state, where Step looks similar to:
interface Step {
  id: string
  title: string
  description: string
  status: 'pending' | 'processing' | 'error' | 'finished'
  current: boolean
  error?: Error
  data: any
}
This, plus a few core events & reducers, would be all we need to render UIs like the ones seen here: https://ant.design/components/steps/
-
Should dApps be able to define event handlers for mutation events? dApps may want to implement their own handlers for specific events emitted from mutations. These handlers would be different from the reducers, as we wouldn't want them to be able to modify the state. Instead they could store their own state elsewhere within the dApp based on the events.
-
Should the Graph Node's schema introspection endpoint respond with the "full" schema, including the mutations' schema? Developers could fetch the "full" schema by looking up the subgraph's manifest, reading the mutations.schema.file hash value, and fetching the full schema from IPFS. Should Graph Node support querying this full schema directly through the introspection endpoint?
-
Will server side execution ever be a reality? I have not thought of a trustless solution to this and am curious whether anyone has ideas for how we could make it possible.
-
Will The Graph Explorer support mutations? We could have the explorer client-side application dynamically fetch and include mutation resolver modules. Configuring the resolvers module dynamically is problematic though. Maybe there are a few known config properties that the explorer client supports, and for all others it allows the user to input config arguments (if they're base types).
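To make the open question about progress steps more concrete, here is a minimal sketch of how a dApp might turn a steps array into a progress summary. The Step shape is copied from the open question above; ProgressSummary and summarizeProgress are hypothetical names used only for illustration and are not part of the proposed API.

```typescript
// Step shape as suggested in the open question above.
interface Step {
  id: string
  title: string
  description: string
  status: 'pending' | 'processing' | 'error' | 'finished'
  current: boolean
  error?: Error
  data: any
}

// Hypothetical summary a UI could render, e.g. as "3/4".
interface ProgressSummary {
  finished: number
  total: number
  currentTitle: string | null
}

// Hypothetical helper: count finished steps and find the step marked current.
function summarizeProgress(steps: Step[]): ProgressSummary {
  const current = steps.filter(step => step.current)[0]
  return {
    finished: steps.filter(step => step.status === 'finished').length,
    total: steps.length,
    currentTitle: current ? current.title : null,
  }
}
```

A reducer that keeps steps up to date in the core state would let the dApp call a helper like this on every state update.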
RFC-0004: Fulltext Search
- Author
- Ford Nickels
- RFC pull request
- URL
- Obsoletes (if applicable)
- -
- Date of submission
- 2020-01-05
- Date of approval
- 2020-02-10
- Approved by
- Jannis Pohlmann
Contents
- Summary
- Goals & Motivation
- Urgency
- Terminology
- Detailed Design
- Compatibility
- Drawbacks and Risks
- Alternatives
- Open Questions
Summary
The fulltext search filter type is a feature of the GraphQL API that allows subgraph developers to specify language-specific, lexical, composite filters that end users can use in their queries. The fulltext search feature examines all words in a document, breaking the document into individual words and phrases (lexical analysis), and collapsing variations of words into a single index term (stemming).
Goals & Motivation
The current set of string filters available in the GraphQL API lacks fulltext search capabilities that enable efficient searches across entities and attributes. Wildcard string matching does provide string filtering, but users have come to expect the easy-to-use filtering that comes with fulltext search systems.
To facilitate building effective user interfaces, user-friendly query filtering is essential. Lexical, composite fulltext search filters can provide the tools necessary for front-end developers to implement powerful search bars that filter data across multiple fields of an Entity.
The proposed feature aims to provide tools for subgraph developers to define composite search APIs that can search across multiple fields and entities.
Urgency
A delay in adding the fulltext search feature will not create issues with current deployments. However, the feature will represent a realization of part of the long term vision for the query network. In addition, several high profile users have communicated that it may be a conversion blocker. Implementation should be prioritized.
Terminology
-
lexeme: a basic lexical unit of a language, consisting of one word or several words, considered as an abstract unit, and applied to a family of words related by form or meaning.
-
morphology (linguistics): the study of words, how they are formed, and their relationship to other words in the same language.
-
fulltext search index: the result of lexical and morphological analysis (stemming) of a set of text documents. It provides frequency and location for the language-specific stems found in the text documents being indexed.
-
ranking algorithm: "Ranking attempts to measure how relevant documents are to a particular query, so that when there are many matches the most relevant ones can be shown first." - Postgres Documentation
Algorithms:
- standard ranking: ranking based on the number of matching lexemes.
- cover density ranking: Cover density is similar to the standard fulltext search ranking except that the proximity of matching lexemes to each other is taken into consideration. This function requires lexeme positional information to perform its calculation, so it ignores any "stripped" lexemes in the index.
Detailed Design
Subgraph Schema
Part of the power of the fulltext search API is the flexibility, so it is important to expose a simple interface to facilitate useful applications of the index and aim to reduce the need to create new subgraphs for the express purpose of updating fulltext search fields.
For each fulltext search API a subgraph developer must be able to specify:
1. a language (specified using an ISO 639-1 code),
2. a set of text document fields to include,
3. relative weighting for each field,
4. a choice of ranking algorithm for sorting query result items.
The proposed process of adding one or more fulltext search APIs involves adding one or more fulltext directives to the _Schema_ type in the subgraph's GraphQL schema. Each fulltext definition will have four required top level parameters: name, language, algorithm, and include. The fulltext search definitions will be used to generate query fields on the GraphQL schema that will be exposed to the end user.
Enabling fulltext search across entities will be a powerful abstraction that allows users to search across all relevant entities in one query. Such a search will by definition have polymorphic results. To address this, a union type will be generated in the schema for the fulltext search results.
Validation of the fulltext definition will ensure that all fields referenced in the directive are valid String type fields. With subgraph composition it will be possible to easily create new subgraphs that add specific fulltext search capabilities to an existing subgraph.
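The validation rule above can be sketched roughly as follows. This is an illustration under stated assumptions, not the Graph Node implementation: the FulltextDefinition shape simply mirrors the four required parameters, while EntityFieldTypes (a map from entity name to field name to GraphQL type name) and validateFulltext are hypothetical names. Treating both String and String! as valid is also an assumption.

```typescript
// One @fulltext definition, mirroring the four required parameters.
interface FulltextField { name: string; weight: number }
interface FulltextInclude { entity: string; fields: FulltextField[] }
interface FulltextDefinition {
  name: string
  language: string
  algorithm: string
  include: FulltextInclude[]
}

// Hypothetical schema summary: entity name -> field name -> GraphQL type name.
type EntityFieldTypes = { [entity: string]: { [field: string]: string } }

// Returns one message per referenced entity or field that fails validation.
function validateFulltext(def: FulltextDefinition, schema: EntityFieldTypes): string[] {
  const problems: string[] = []
  for (const inc of def.include) {
    const fieldTypes = schema[inc.entity]
    if (fieldTypes === undefined) {
      problems.push('unknown entity ' + inc.entity)
      continue
    }
    for (const field of inc.fields) {
      const fieldType = fieldTypes[field.name]
      // Assumption: nullable and non-null String fields are both acceptable.
      if (fieldType !== 'String' && fieldType !== 'String!') {
        problems.push(inc.entity + '.' + field.name + ' is not a String field')
      }
    }
  }
  return problems
}
```

An empty result means every field referenced by the definition exists and is a String field.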
Example fulltext search definition:
type _Schema_
@fulltext(
name: "media"
...
)
@fulltext(
name: "search",
language: EN, # variant of `_FullTextLanguage` enum
algorithm: RANKED, # variant of `_FullTextAlgorithm` enum
include: [
{
entity: "Band",
fields: [
{ name: "name", weight: 5 },
]
},
{
entity: "Album",
fields: [
{ name: "title", weight: 5 },
]
},
{
entity: "Musician",
fields: [
{ name: "name", weight: 10 },
{ name: "bio", weight: 5 },
]
}
]
)
The schema generated from the above definition:
union _FulltextMediaEntity = ...
union _FulltextSearchEntity = Band | Album | Musician
type Query {
media...
search(text: String!, first: Int, skip: Int, block: Block_height): [_FulltextSearchEntity!]!
}
GraphQL Query interface
End users of the subgraph will have access to the fulltext search queries alongside the other queries available for each entity in the subgraph. In the case of a fulltext search defined across multiple entities, inline fragments may be used in the query to deal with the polymorphic result items. In the front-end, the __typename field can be used to distinguish the concrete entity types of the returned results.
In the text parameter supplied to the query there will be several operators available to the end user. Included are the and, or, and proximity operators (&, |, <->). The special proximity operator allows clients to specify the maximum distance between search terms: foo<3>bar is equivalent to requesting that foo and bar are at most three words apart.
Example query using inline fragments and the proximity operator:
query {
search(text: "Bob<3>run") {
__typename
... on Band { name label { id } }
... on Album { title numberOfTracks }
... on Musician { name bio }
}
}
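Front-ends will often build the text argument from user input rather than hard-coding it. The following sketch shows one way to compose the operators described above. The helper names allOf, anyOf and near are hypothetical, and the spacing around & and | is an assumption, since the RFC does not prescribe it.

```typescript
// Hypothetical helpers for composing the `text` argument of a fulltext query.

// All terms must match: joins terms with the and operator (&).
function allOf(...terms: string[]): string {
  return terms.join(' & ')
}

// Any term may match: joins terms with the or operator (|).
function anyOf(...terms: string[]): string {
  return terms.join(' | ')
}

// Proximity operator: near('foo', 'bar', 3) yields 'foo<3>bar'.
function near(left: string, right: string, distance: number): string {
  if (distance < 1 || Math.floor(distance) !== distance) {
    throw new RangeError('distance must be a positive integer')
  }
  return left + '<' + distance + '>' + right
}

// Build the query from the example above; JSON.stringify quotes the text.
const searchText = near('Bob', 'run', 3)
const searchQuery = `query { search(text: ${JSON.stringify(searchText)}) { __typename } }`
```

Real applications would normally pass the text as a GraphQL variable instead of interpolating it into the query string.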
Tools and Design
Fulltext search query system implementations often involve specific systems for storing and querying the text documents; however, in an effort to reduce system complexity and feature implementation time, I propose starting with extending the current store interface and storage implementation with fulltext search features rather than using a fulltext-specific interface and storage system.
A fulltext search field will get its own column in a table dedicated to fulltext data. The data stored will be the result of the lexical and morphological analysis of text documents performed on the fields included in the index. The fulltext search field will be created using the Postgres to_tsvector function and will be indexed using a GIN index. The subgraph developer will define a ranking algorithm to be used to sort query results, so the end-user facing API remains easy to use without any requirement to understand the ranking algorithms.
Compatibility
This proposal does not change any existing interfaces, so no migrations will be necessary for existing subgraph deployments.
Drawbacks and Risks
The proposed solution uses native Postgres fulltext features, and there is a nonzero probability that this choice results in slower than optimal write and read times; however, the tradeoff in implementation time and complexity and the existence of production use case testimonials temper my apprehension here.
In future phases of the network the storage layer may get a redesign with indexes being overhauled to facilitate query result verification. Postgres based fulltext search implementation would not be translatable to another storage system, so at the least a reevaluation of the tools used for analysis, indexing, and querying would be required.
Alternatives
An alternative design for the feature would allow more flexibility for Graph Node operators in their index implementation and create a marketplace for indexes. In this alternative, the definition of fulltext search indexes could be moved out of the subgraph schema. The subgraph would be deployed without them and they could be added later using a new Graph Explorer interface (in the Hosted Service context) or a JSON-RPC request directly to a Graph Node. Moving the creation of fulltext search indexes and queries out of the schema would mean that the definition of uniqueness for a subgraph does not include the custom indexes, so creating or updating an index would not require a new subgraph deployment and re-syncing of the subgraph. However, it also introduces significant added complexity. A separate query marketplace and discovery registry would be required for finding nodes with the needed subgraph-index combination.
Open Questions
Fulltext search queries introduce new issues with maintaining query result determinism, which will become a more pressing issue with the decentralized network. A fulltext search query and a dataset are not enough to determine the output of the query; the index is vital to establish a deterministic causal relationship to the output data. Query verification will need to take into account the query, the index, the underlying dataset, and the query result. Can we find a healthy compromise between being prescriptive about the indexes and algorithms in order to allow formal verification and allowing indexer node operators to experiment with algorithms and indexes in order to continue to improve query speed and results?
Since a fulltext search field is purely derivative of other Entity data, the addition or update of an @fulltext directive does not require a full blockchain resync; rather, the index itself just needs to be rebuilt. There is room for optimization in the future by allowing fulltext search definition updates without requiring a full subgraph resync.
RFC-0005: Multi-Blockchain Support
Obsolete RFCs
Obsolete RFCs are moved to the rfcs/obsolete directory in the rfcs repository. They are listed below for reference.
- No RFCs have been obsoleted yet.
Rejected RFCs
Rejected RFCs can be found by filtering open and closed pull requests by those that are labeled with rejected. This list can be found here.
Engineering Plans
What is an Engineering Plan?
Engineering Plans are plans to turn an RFC into an implementation in the core Graph Protocol tools like Graph Node, Graph CLI and Graph TS. Every substantial development effort that follows an RFC is planned in the form of an Engineering Plan.
Engineering Plan process
1. Create a new Engineering Plan
Like RFCs, Engineering Plans are numbered, starting at 0001. To create a new plan, create a new branch of the rfcs repository. Check the existing plans to identify the next number to use. Then, copy the Engineering Plan template to a new file in the engineering-plans/ directory. For example:
cp engineering-plans/0000-template.md engineering-plans/0015-fulltext-search.md
Write the Engineering Plan, commit it to the branch and open a pull request in the rfcs repository.
In addition to the Engineering Plan itself, the pull request must include the following changes:
- a link to the Engineering Plan on the Approved Engineering Plans page, and
- a link to the Engineering Plan under Approved Engineering Plans in SUMMARY.md.
2. Engineering Plan review
After an Engineering Plan has been submitted through a pull request, it is reviewed. At the time of writing, every Engineering Plan needs to be approved by
- the Tech Lead, and
- at least one member of the core development team.
3. Engineering Plan approval
Once an Engineering Plan is approved, the Engineering Plan metadata (see the template) is updated and the pull request is merged by the original author or a Graph Protocol team member.
Approved Engineering Plans
- PLAN-0001: GraphQL Query Prefetching
- PLAN-0002: Ethereum Tracing Cache
- PLAN-0003: Remove JSONB Storage
PLAN-0001: GraphQL Query Prefetching
- Author
- David Lutterkort
- Implements
- No RFC - no user visible changes
- Engineering Plan pull request
- https://github.com/graphprotocol/rfcs/pull/2
- Date of submission
- 2019-11-27
- Date of approval
- 2019-12-10
- Approved by
- Jannis Pohlmann, Leo Yvens
This is not really a plan, as it was written and discussed before we adopted the RFC process, but it contains important implementation details of how we process GraphQL queries.
Implementation Details for prefetch queries
Goal
For a GraphQL query of the form
query {
parents(filter) {
id
children(filter) {
id
}
}
}
we want to generate only two SQL queries: one to get the parents, and one to get the children for all those parents. The fact that children is nested under parents requires that we add a filter to the children query that restricts children to those related to the parents fetched by the first query. How exactly we filter the children query depends on how the relationship between parents and children is modeled in the GraphQL schema, and on whether one (or both) of the types involved are interfaces.
The rest of this writeup is concerned with how to generate the query for children, assuming we already retrieved the list of all parents.
The bulk of the implementation of this feature can be found in graphql/src/store/prefetch.rs, store/postgres/src/jsonb_queries.rs, and store/postgres/src/relational_queries.rs.
Handling first/skip
We never get all the children for a parent; instead we always have a first and skip argument in the children filter. Those arguments need to be applied to each parent individually by ranking the children for each parent according to the order defined by the children query. If the same child matches multiple parents, we need to make sure that it is considered separately for each parent as it might appear at different ranks for different parents. In SQL, we use a lateral join, essentially a for loop. For children that store the id of their parent in parent_id, we'd run the following query:
select c.*, p.id
from unnest({parent_ids}) as p(id)
cross join lateral
(select *
from children c
where c.parent_id = p.id
and .. other conditions on c ..
order by c.{sort_key}
limit {first}
offset {skip}) c
order by c.{sort_key}
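The "for loop" reading of the lateral join can be made explicit with an in-memory analogue. The sketch below is only an illustration of the semantics, not the code in prefetch.rs: ChildRow and childrenPerParent are hypothetical names, the sort key is assumed to be numeric, and each child is assumed to store exactly one parent_id.

```typescript
// Simplified child row: one parent id and a numeric sort key (assumptions).
interface ChildRow {
  id: string
  parent_id: string
  sortKey: number
}

// In-memory analogue of the lateral join: for each parent, order its
// children by the sort key and apply skip/first to that parent alone.
function childrenPerParent(
  parentIds: string[],
  children: ChildRow[],
  first: number,
  skip: number
): ChildRow[] {
  const result: ChildRow[] = []
  for (const parentId of parentIds) {
    const page = children
      .filter(child => child.parent_id === parentId) // where c.parent_id = p.id
      .sort((a, b) => a.sortKey - b.sortKey)         // order by c.{sort_key}
      .slice(skip, skip + first)                     // limit {first} offset {skip}
    result.push(...page)
  }
  // Outer "order by c.{sort_key}" across the windows of all parents.
  return result.sort((a, b) => a.sortKey - b.sortKey)
}
```

Applying first and skip once to the combined list instead would let a parent with many children crowd out the children of other parents, which is why the window is taken per parent.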
Handling parent/child relationships
How we get the children for a set of parents depends on how the relationship between the two is modeled. The interesting parameters there are whether parents store a list or a single child, and whether that field is derived, together with the same for children.
There are a total of 16 combinations of these four boolean variables; four of them, when both parent and child derive their fields, are not permissible. It also doesn't matter whether the child derives its parent field: when the parent field is not derived, we need to use that since that is the only place that contains the parent -> child relationship. When the parent field is derived, the child field can not be a derived field.
That leaves us with eight combinations of whether the parent and child store a list or a scalar value, and whether the parent is derived. For details on the GraphQL schema for each row in this table, see the section at the end. The Join cond column indicates how we can find the children for a given parent. The table refers to the four different kinds of join condition we might need as types A, B, C, and D.
Case | Parent list? | Parent derived? | Child list? | Join cond | Type |
---|---|---|---|---|---|
1 | TRUE | TRUE | TRUE | child.parents ∋ parent.id | A |
2 | FALSE | TRUE | TRUE | child.parents ∋ parent.id | A |
3 | TRUE | TRUE | FALSE | child.parent = parent.id | B |
4 | FALSE | TRUE | FALSE | child.parent = parent.id | B |
5 | TRUE | FALSE | TRUE | child.id ∈ parent.children | C |
6 | TRUE | FALSE | FALSE | child.id ∈ parent.children | C |
7 | FALSE | FALSE | TRUE | child.id = parent.child | D |
8 | FALSE | FALSE | FALSE | child.id = parent.child | D |
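The table collapses into a small decision function. The sketch below restates it in code; JoinType and joinType are hypothetical names chosen for this illustration and do not correspond to identifiers in the implementation.

```typescript
type JoinType = 'A' | 'B' | 'C' | 'D'

// Restates the table above:
// - parent derived: the child stores the link (A for a list of parents, B for a single parent)
// - parent not derived: the parent stores the link (C for a list of children, D for a single child)
function joinType(parentList: boolean, parentDerived: boolean, childList: boolean): JoinType {
  if (parentDerived) {
    return childList ? 'A' : 'B'
  }
  return parentList ? 'C' : 'D'
}
```

Note that parentList is irrelevant when the parent field is derived, and childList is irrelevant when it is not, which is why eight cases map onto only four join types.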
In addition to how the data about the parent/child relationship is stored, the multiplicity of the parent/child relationship also influences query generation: if each parent can have at most a single child, queries can be much simpler than if we have to account for multiple children per parent, which requires paginating them. We also need to detect cases where the mappings created multiple children per parent. We do this by adding a clause limit {parent_ids.len} + 1 to the query, so that if there is one parent with multiple children, we will select it, but still protect ourselves against mappings that produce catastrophically bad data with huge numbers of children per parent. The GraphQL execution logic will detect that there is a parent with multiple children, and generate an error.
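The reason limit {parent_ids.len} + 1 is sufficient follows from a counting argument: if the query returns fewer rows than the limit, every matching row is present and any duplicate parent is visible; if it returns the full limit, there are more rows than parents, so some parent must appear twice. A minimal sketch of the check on the execution side, using the hypothetical name findParentWithMultipleChildren:

```typescript
// Given the rows of a "single child per parent" query (already limited to
// parent_ids.len + 1 rows), return the id of a parent that appears more
// than once, or undefined if every parent has at most one child.
function findParentWithMultipleChildren(rows: { parent_id: string }[]): string | undefined {
  const seen: { [parentId: string]: boolean } = {}
  for (const row of rows) {
    if (seen[row.parent_id]) {
      return row.parent_id
    }
    seen[row.parent_id] = true
  }
  return undefined
}
```

A caller would turn a non-undefined result into the error described above.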
When we query children, we already have a list of all parents from running a previous query. To find the children, we need to have the id of the parent that child is related to, and, when the parent stores the ids of its children directly (types C and D) the child ids for each parent id.
The following queries all produce a relation that has the same columns as the table holding children, plus a column holding the id of the parent that the child belongs to.
Type A
Use when parent is derived and child stores a list of parents
Data needed to generate:
- children: name of child table
- parent_ids: list of parent ids
- parent_field: name of parents field (array) in child table
- single: boolean to indicate whether a parent has at most one child or not
The implementation uses an EntityLink::Direct for joins of this type.
Multiple children per parent
select c.*, p.id as parent_id
from unnest({parent_ids}) as p(id)
cross join lateral
(select *
from children c
where p.id = any(c.{parent_field})
and .. other conditions on c ..
order by c.{sort_key}
limit {first} offset {skip}) c
order by c.{sort_key}
Single child per parent
select c.*, p.id as parent_id
from unnest({parent_ids}) as p(id),
children c
where c.{parent_field} @> array[p.id]
and .. other conditions on c ..
limit {parent_ids.len} + 1
Type B
Use when parent is derived and child stores a single parent
Data needed to generate:
- children: name of child table
- parent_ids: list of parent ids
- parent_field: name of parent field (scalar) in child table
- single: boolean to indicate whether a parent has at most one child or not
The implementation uses an EntityLink::Direct for joins of this type.
Multiple children per parent
select c.*, p.id as parent_id
from unnest({parent_ids}) as p(id)
cross join lateral
(select *
from children c
where p.id = c.{parent_field}
and .. other conditions on c ..
order by c.{sort_key}
limit {first} offset {skip}) c
order by c.{sort_key}
Single child per parent
select c.*, c.{parent_field} as parent_id
from children c
where c.{parent_field} = any({parent_ids})
and .. other conditions on c ..
limit {parent_ids.len} + 1
Alternatively, this is worth a try, too:
select c.*, c.{parent_field} as parent_id
from unnest({parent_ids}) as p(id), children c
where c.{parent_field} = p.id
and .. other conditions on c ..
limit {parent_ids.len} + 1
Type C
Use when the parent stores a list of its children.
Data needed to generate:
- children: name of child table
- parent_ids: list of parent ids
- child_id_matrix: array of arrays where child_id_matrix[i] is an array containing the ids of the children for parent_ids[i]
The implementation uses an EntityLink::Parent for joins of this type.
Multiple children per parent
select c.*, p.id as parent_id
from rows from (unnest({parent_ids}), reduce_dim({child_id_matrix}))
as p(id, child_ids)
cross join lateral
(select *
from children c
where c.id = any(p.child_ids)
and .. other conditions on c ..
order by c.{sort_key}
limit {first} offset {skip}) c
order by c.{sort_key}
Note that reduce_dim is a custom function that is not part of ANSI SQL:2016 but is needed as there is no standard way to decompose a matrix into a table where each row contains one row of the matrix. The ROWS FROM construct is also not part of ANSI SQL.
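What the rows from (unnest(..), reduce_dim(..)) expression produces can be pictured as a positional zip of the parent ids with the rows of the child id matrix. The sketch below mimics that pairing in memory. ParentWithChildIds and expandParents are hypothetical names, and rejecting inputs of different lengths is a choice made for this sketch rather than a statement about how the SQL behaves.

```typescript
// One row of the relation p(id, child_ids) used in the Type C query.
interface ParentWithChildIds {
  id: string
  child_ids: string[]
}

// Pair parent_ids[i] with child_id_matrix[i].
function expandParents(parentIds: string[], childIdMatrix: string[][]): ParentWithChildIds[] {
  if (parentIds.length !== childIdMatrix.length) {
    // Sketch-only choice: refuse mismatched inputs instead of padding them.
    throw new Error('parent_ids and child_id_matrix must have the same length')
  }
  return parentIds.map((id, i) => ({ id, child_ids: childIdMatrix[i] }))
}
```

Each resulting row then drives the lateral subquery through the condition c.id = any(p.child_ids).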
Single child per parent
Not possible with relations of this type
Type D
Use when parent is not a list and not derived
Data needed to generate:
- children: name of child table
- parent_ids: list of parent ids
- child_ids: list of the id of the child for each parent such that child_ids[i] is the id of the child for parent_ids[i]
The implementation uses an EntityLink::Parent for joins of this type.
Multiple children per parent
Not possible with relations of this type
Single child per parent
select c.*, p.id as parent_id
from rows from (unnest({parent_ids}), unnest({child_ids})) as p(id, child_id),
children c
where c.id = p.child_id
and .. other conditions on c ..
The ROWS FROM construct is not part of ANSI SQL.
Handling interfaces
If the GraphQL type of the children is an interface, we need to take special care to form correct queries. Whether the parents are implementations of an interface or not does not matter, as we will have a full list of parents already loaded into memory when we build the query for the children. Whether the GraphQL type of the parents is an interface may influence from which parent attribute we get child ids for queries of type C and D.
When the GraphQL type of the children is an interface, we resolve the interface type into the concrete types implementing it, produce a query for each concrete child type and combine those queries via union all.
Since implementations of the same interface will generally differ in the schema they use, we can not form a union all of all the data in the tables for these concrete types, but have to first query only attributes that we know will be common to all entities implementing the interface, most notably the vid (a unique identifier that identifies the precise version of an entity), and then later fill in the details of each entity by converting it directly to JSON. A second reason to pass entities as JSON from the database is that it is impossible with Diesel to execute queries where the number and types of the columns of the result are not known at compile time.
We need to be careful, though, not to convert to JSONB too early, as that is slow when done for large numbers of rows. Deferring conversion is responsible for some of the complexity in these queries.
In the following, we only go through the queries for relational storage; for JSONB storage, there are similar considerations, though they are somewhat simpler as the union all in the below queries turns into an entity = any(..) clause with JSONB storage, and because we do not need to convert to JSONB data.
That means that when we deal with children that are an interface, we will first select only the following columns from each concrete child type (where exactly they come from depends on how the parent/child relationship is modeled):
select '{__typename}' as entity, c.vid, c.id, c.{sort_key}, p.id as parent_id
and then use that data to fill in the complete details of each concrete entity. The query type_query(children) is the query from the previous section according to the concrete type of children, but without the select, limit, offset or order by clauses. The overall structure of this query then is
with matches as (
select '{children.object}' as entity, c.vid, c.id,
c.{sort_key}, p.id as parent_id
from .. type_query(children) ..
union all
.. range over all child types ..
order by {sort_key}
limit {first} offset {skip})
select m.*, to_jsonb(c.*) as data
from matches m, {children.table} c
where c.vid = m.vid and m.entity = '{children.object}'
union all
.. range over all child tables ..
order by {sort_key}
The list all_parent_ids must contain the ids of all the parents for which we want to find children.
We have one children object for each concrete GraphQL type that we need to query, where children.table is the name of the database table in which these entities are stored, and children.object is the GraphQL typename for these children.
The code uses an EntityCollection::Window containing multiple EntityWindow instances to represent the most general form of querying for the children of a set of parents, the query given above.
When there is only one window, we can simplify the above query. The simplification basically inlines the matches CTE. That is important as CTEs in Postgres before Postgres 12 are optimization fences, even when they are only used once. We therefore reduce the two queries that Postgres executes above to one for the fairly common case that the children are not an interface. For each type of parent/child relationship, the resulting query is essentially the same as the one given in the section Handling parent/child relationships, except that the select clause is changed to select '{window.child_type}' as entity, to_jsonb(c.*) as data:
select '..' as entity, to_jsonb(c.*) as data, p.id as parent_id
from {expand_parents}
cross join lateral
(select *
from children c
where {linked_children}
and .. other conditions on c ..
order by c.{sort_key}
limit {first} offset {skip}) c
order by c.{sort_key}
Toplevel queries, i.e., queries where we have no parents and therefore do not restrict the children we return by parent ids, are represented in the code by an EntityCollection::All. If the GraphQL type of the children is an interface with multiple implementers, we can simplify the query by avoiding ranking and just using an ordinary order by clause:
with matches as (
-- Get uniform info for all matching children
select '{entity_type}' as entity, id, vid, {sort_key}
from {entity_table} c
where {query_filter}
union all
... range over all entity types
order by {sort_key} offset {query.skip} limit {query.first})
-- Get the full entity for each match
select m.entity, to_jsonb(c.*) as data, c.id, c.{sort_key}
from matches m, {entity_table} c
where c.vid = m.vid and m.entity = '{entity_type}'
union all
... range over all entity types
-- Make sure we return the children for each parent in the correct order
order by c.{sort_key}, c.id
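The two-phase structure of this query (collect uniform match rows across all implementing types, paginate, then load the full entity for each surviving match) can be illustrated with an in-memory analogue. This is a simplified sketch, not the generated SQL: StoredEntity, MatchRow and queryAllTypes are hypothetical names, the sort key is assumed to be numeric, and ties are broken by id in a single sort rather than in a separate outer order by.

```typescript
// A stored entity version; `vid` is unique within its own table.
interface StoredEntity {
  id: string
  vid: number
  sortKey: number
  [attribute: string]: unknown
}

// Uniform columns selected by the `matches` CTE.
interface MatchRow {
  entity: string
  id: string
  vid: number
  sortKey: number
}

function queryAllTypes(
  tables: { [entityType: string]: StoredEntity[] },
  filter: (e: StoredEntity) => boolean,
  skip: number,
  first: number
): { entity: string; data: StoredEntity }[] {
  // Phase 1: "union all" of the uniform columns over all entity types.
  const matches: MatchRow[] = []
  for (const entity of Object.keys(tables)) {
    for (const e of tables[entity]) {
      if (filter(e)) {
        matches.push({ entity, id: e.id, vid: e.vid, sortKey: e.sortKey })
      }
    }
  }
  // order by {sort_key} (ties broken by id), then offset/limit.
  matches.sort((a, b) => a.sortKey - b.sortKey || (a.id < b.id ? -1 : a.id > b.id ? 1 : 0))
  const page = matches.slice(skip, skip + first)
  // Phase 2: fetch the full entity for each match from its own table.
  return page.map(m => ({
    entity: m.entity,
    data: tables[m.entity].filter(e => e.vid === m.vid)[0],
  }))
}
```

Only the rows that survive pagination are expanded into full entities, which mirrors the goal of deferring the to_jsonb conversion.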
And finally, for the very common case of a toplevel GraphQL query for a concrete type, not an interface, we can further simplify this, again by essentially inlining the matches CTE, to:
select '{entity_type}' as entity, to_jsonb(c.*) as data
from {entity_table} c
where query.filter()
order by {query.order} offset {query.skip} limit {query.first}
Boring list of possible GraphQL models
These are the eight ways in which a parent/child relationship can be modeled. For brevity, I left the id attribute on each parent and child type out.
This list assumes that parent and child types are concrete types, i.e., that any interfaces involved in this query have already been resolved into their implementations and we are dealing with one pair of concrete parent/child types.
# Case 1
type Parent {
children: [Child] @derived
}
type Child {
parents: [Parent]
}
# Case 2
type Parent {
child: Child @derived
}
type Child {
parents: [Parent]
}
# Case 3
type Parent {
children: [Child] @derived
}
type Child {
parent: Parent
}
# Case 4
type Parent {
child: Child @derived
}
type Child {
parent: Parent
}
# Case 5
type Parent {
children: [Child]
}
type Child {
# doesn't matter
}
# Case 6
type Parent {
children: [Child]
}
type Child {
# doesn't matter
}
# Case 7
type Parent {
child: Child
}
type Child {
# doesn't matter
}
# Case 8
type Parent {
child: Child
}
type Child {
# doesn't matter
}
Resources
- PostgreSQL Manual
- Browsable SQL Grammar
- Wikipedia entry on ANSI SQL:2016 (the actual standard is not freely available)
PLAN-0002: Ethereum Tracing Cache
- Author
- Zachary Burns
- Implements
- RFC-0002 Ethereum Tracing Cache
- Engineering Plan pull request
- https://github.com/graphprotocol/rfcs/pull/9
- Date of submission
- 2019-12-20
- Date of approval
- 2020-01-07
- Approved by
- Jannis Pohlmann, Leo Yvens
Summary
Implements RFC-0002: Ethereum Tracing Cache
Implementation
These changes happen within or near ethereum_adapter.rs, store.rs and db_schema.rs.
Limitations
The problem of reorg turns out to be a particularly tricky one for the cache, mostly due to ranges of blocks being requested rather than individual hashes. To sidestep this problem, only blocks that are older than the reorg threshold will be eligible for caching.
Additionally, there are some subgraphs which may require traces from all or a substantial number of blocks and don't make effective use of filtering. In particular, subgraphs which specify a call handler without a contract address fall into this category. In order to prevent the cache from bloating, any use of Ethereum traces which does not filter on a contract address will bypass the cache.
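The two limitations above amount to a single eligibility check. The following is a minimal sketch of that check, not code from graph-node: the function name, the `head_block` and `reorg_threshold` parameters, and the use of `[u8; 20]` as a stand-in for `H160` are all assumptions made for illustration. It also assumes that "older than the reorg threshold" means strictly more than `reorg_threshold` blocks behind the chain head.

```rust
/// Stand-in for `H160` so that the sketch has no external dependencies.
pub type Address = [u8; 20];

/// Returns `true` if traces for `block_number` may be read from or written
/// to the cache. All parameter names are hypothetical.
pub fn is_cacheable(
    block_number: u64,
    head_block: u64,
    reorg_threshold: u64,
    contract_address: Option<Address>,
) -> bool {
    // Requests that do not filter on a contract address bypass the cache.
    if contract_address.is_none() {
        return false;
    }
    // Only blocks strictly older than the reorg threshold are eligible.
    // `saturating_sub` keeps short chains (head < threshold) uncacheable.
    block_number < head_block.saturating_sub(reorg_threshold)
}
```

With a head at block 100 and a threshold of 50, block 49 would be cacheable under these assumptions while block 50 would not.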
EthereumTraceCache
The implementation introduces the following trait, which is implemented primarily by Store.
use std::ops::RangeInclusive;

struct TracesInRange {
    range: RangeInclusive<u64>,
    traces: Vec<Trace>,
}

pub trait EthereumTraceCache: Send + Sync + 'static {
    /// Attempts to retrieve traces from the cache. Returns ranges which were retrieved.
    /// The results may not cover the entire range of blocks. It is up to the caller to decide
    /// what to do with ranges of blocks that are not cached.
    fn traces_for_blocks(
        contract_address: Option<H160>,
        blocks: RangeInclusive<u64>,
    ) -> Box<dyn Future<Output = Result<Vec<TracesInRange>, Error>>>;

    fn add(contract_address: Option<H160>, traces: Vec<TracesInRange>);
}
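Since the cache may return only part of the requested range, a caller has to work out which blocks are still missing. The sketch below shows one way to do that; the helper name `missing_ranges` is hypothetical, and it assumes the cached ranges are sorted by start block and do not overlap.

```rust
use std::ops::RangeInclusive;

/// Returns the parts of `requested` that are not covered by `cached`.
/// Assumes `cached` is sorted by start block and non-overlapping.
pub fn missing_ranges(
    requested: RangeInclusive<u64>,
    cached: &[RangeInclusive<u64>],
) -> Vec<RangeInclusive<u64>> {
    let mut missing = Vec::new();
    if requested.is_empty() {
        return missing;
    }
    let end = *requested.end();
    // First block not yet known to be covered; `None` once we are past `end`.
    let mut next = Some(*requested.start());
    for range in cached {
        let cursor = match next {
            Some(cursor) => cursor,
            None => break,
        };
        // Skip cached ranges that lie entirely before the cursor.
        if range.is_empty() || *range.end() < cursor {
            continue;
        }
        // Cached ranges entirely after the request cannot help.
        if *range.start() > end {
            break;
        }
        // Anything between the cursor and this cached range is a gap.
        if *range.start() > cursor {
            missing.push(cursor..=*range.start() - 1);
        }
        next = if *range.end() >= end {
            None
        } else {
            Some(*range.end() + 1)
        };
    }
    if let Some(cursor) = next {
        missing.push(cursor..=end);
    }
    missing
}
```

For a request of blocks 10 to 20 with cached ranges 12 to 14 and 18 to 25, this yields the gaps 10 to 11 and 15 to 17, which would then have to be fetched from Ethereum.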
Block schema
Each cached block will exist as its own row in the database in an eth_traces_cache table.
eth_traces_cache (id) {
    id -> Integer,
    network -> Text,
    block_number -> Integer,
    contract_address -> Bytea,
    traces -> Jsonb,
}
A multi-column index will be added on network, block_number, and contract_address.
It can be noted that in the eth_traces_cache table, the network column has very low cardinality. It is inefficient, for example, to store the string mainnet millions of times and to consider this value when querying. A data-oriented approach would be to partition these tables on the value of the network. It is expected that hash partitioning, available in Postgres 11, would be useful here, but the necessary dependencies won't be ready in time for this RFC. This may be revisited in the future.
Valid Cache Range
Because the absence of trace data for a block is a valid cache result, the database must maintain a data structure indicating which ranges of the cache are valid in an eth_traces_meta table. This table also makes it possible to eventually clean out old data.
This is the schema for that structure:
id -> Integer,
network -> Text,
start_block -> Integer,
end_block -> Integer,
contract_address -> Nullable<Bytea>,
accessed_at -> Date,
When inserting data into the cache, removing data from the cache, or reading the cache, a serialized transaction must be used to preserve atomicity between the valid cache range structure and the cached blocks. Care must be taken to not rely on any data read outside of the serialized transaction, and for the extent of the serialized transaction to not span any async contexts that rely on any Future outside of the database itself. The definition of the EthereumTraceCache trait is designed to uphold these guarantees.
In order to preserve space in the database, whenever a valid cache range is added, any adjacent and overlapping ranges are merged into it.
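Merging adjacent and overlapping ranges can be sketched in memory as follows. This is only an illustration of the merge rule: the function name `merge_ranges` is hypothetical, and the real implementation would do this inside the database transaction, separately for each network and contract address.

```rust
use std::ops::RangeInclusive;

/// Merges overlapping and adjacent block ranges into the smallest
/// equivalent set of ranges, sorted by start block.
pub fn merge_ranges(mut ranges: Vec<RangeInclusive<u64>>) -> Vec<RangeInclusive<u64>> {
    ranges.retain(|range| !range.is_empty());
    ranges.sort_by_key(|range| *range.start());

    let mut merged: Vec<RangeInclusive<u64>> = Vec::new();
    for range in ranges {
        if let Some(last) = merged.last_mut() {
            // Overlapping, or adjacent in the sense that `range` starts
            // directly after the block at which `last` ends.
            if *range.start() <= last.end().saturating_add(1) {
                if range.end() > last.end() {
                    *last = *last.start()..=*range.end();
                }
                continue;
            }
        }
        merged.push(range);
    }
    merged
}
```

For example, the ranges 5 to 7, 1 to 3, 4 to 4, 10 to 12 and 11 to 15 collapse into just two rows: 1 to 7 and 10 to 15.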
Cache usage
The primary user of the cache is EthereumAdapter<T> in the traces function.
The correct algorithm for retrieving traces from the cache is surprisingly nuanced. The complication arises from the interaction between multiple subgraphs which may require overlapping subsets of contract addresses. Because these subgraphs are indexed at different rates, different ranges of the cache can be valid for each contract address in a single query.
We want to minimize the cost of external requests for trace data. It is likely that it is better to...
- Make fewer requests
- Not ask for trace data that is already cached
- Ask for trace data for multiple contract addresses within the same block when possible.
There is one flow of data which upholds these invariants. In doing so it makes a tradeoff of increasing latency for the execution of a specific subgraph, but increases throughput of the whole system.
Within this graph:
- Edges which are labelled refer to some subset of the output data.
- Edges which are not labelled refer to the entire set of the output data.
- Each node executes once for each contiguous range of blocks. That is, it merges all incoming data before executing, and executes the minimum possible number of times.
- The example given is just for 2 addresses. The actual code must work on sets of addresses.
graph LR;
A[Block Range for Contract A & B]
A --> |Above Reorg Threshold| E
D[Get Cache A]
A --> |Below Reorg Threshold A| D
A --> |Below Reorg Threshold B| H
E[Ethereum A & B]
F[Ethereum A]
G[Ethereum B]
H[Get Cache B]
D --> |Found| M
H --> |Found| M
M[Result]
D --> |Missing| N
H --> |Missing| N
N[Overlap]
N --> |A & B| E
N --> |A| F
N --> |B| G
E --> M
K[Set Cache A]
L[Set Cache B]
E --> |B Below Reorg Threshold| L
E --> |A Below Reorg Threshold| K
F --> K
G --> L
F --> M
G --> M
This construction is designed to make the smallest number of the most efficient calls possible. It is not as complicated as it looks. The actual construction can be expressed as sequential steps with a set of filters preceding each step.
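The Overlap node in the graph can be sketched for the two-address case as a partition of the missing block ranges into three groups: blocks missing for both addresses (one combined request), blocks missing only for A, and blocks missing only for B. The names `Requests` and `partition_missing` are hypothetical, and the sketch assumes that block numbers stay below `u64::MAX`; the actual code would have to work on sets of addresses.

```rust
use std::ops::RangeInclusive;

type Blocks = RangeInclusive<u64>;

/// Block ranges to request from Ethereum, grouped by which addresses need them.
#[derive(Debug, Default, PartialEq)]
pub struct Requests {
    pub both: Vec<Blocks>,
    pub only_a: Vec<Blocks>,
    pub only_b: Vec<Blocks>,
}

fn contains(ranges: &[Blocks], block: u64) -> bool {
    ranges.iter().any(|range| range.contains(&block))
}

/// Appends `segment` to `out`, extending the last range when it is contiguous.
fn push_merged(out: &mut Vec<Blocks>, segment: Blocks) {
    if let Some(last) = out.last_mut() {
        if last.end().checked_add(1) == Some(*segment.start()) {
            *last = *last.start()..=*segment.end();
            return;
        }
    }
    out.push(segment);
}

pub fn partition_missing(missing_a: &[Blocks], missing_b: &[Blocks]) -> Requests {
    // Every block at which membership in either set can change starts a new
    // segment. Assumes no range ends at u64::MAX.
    let mut cuts: Vec<u64> = Vec::new();
    for range in missing_a.iter().chain(missing_b) {
        if range.is_empty() {
            continue;
        }
        cuts.push(*range.start());
        cuts.push(*range.end() + 1);
    }
    cuts.sort_unstable();
    cuts.dedup();

    let mut requests = Requests::default();
    for window in cuts.windows(2) {
        let (start, end) = (window[0], window[1] - 1);
        // Membership is constant within a segment, so testing `start` suffices.
        let target = match (contains(missing_a, start), contains(missing_b, start)) {
            (true, true) => &mut requests.both,
            (true, false) => &mut requests.only_a,
            (false, true) => &mut requests.only_b,
            (false, false) => continue,
        };
        push_merged(target, start..=end);
    }
    requests
}
```

If A is missing blocks 10 to 20 and B is missing blocks 15 to 25, this produces one combined request for 15 to 20, one request for A covering 10 to 14, and one for B covering 21 to 25.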
Useful dependencies
The feature deals a lot with ranges and sets. Operations like sum, subtract, merge, and find overlapping are used frequently. nested_intervals is a crate which provides some of these operations.
Tests
Benchmark
A temporary benchmark will be added for indexing a simple subgraph which uses call handlers. The benchmark will be run in these scenarios:
- Sync before changes
- Re-sync before changes
- Sync after changes
- Re-sync after changes
Ranges
Due to the complexity of the resource minimizing data workflow, it will be useful to have mocks for the cache and database which record their calls, and check that expected calls are made for tricky data sets.
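A recording mock of this kind could look roughly like the sketch below. Everything here is hypothetical: the `CacheCall` and `RecordingCache` names, the synchronous `get`/`set` methods, and the use of `[u8; 20]` in place of `H160`. A real mock would implement the EthereumTraceCache trait and return canned data.

```rust
use std::ops::RangeInclusive;
use std::sync::Mutex;

/// One recorded interaction with the mock cache.
#[derive(Debug, Clone, PartialEq)]
pub enum CacheCall {
    Get { contract: [u8; 20], blocks: RangeInclusive<u64> },
    Set { contract: [u8; 20], blocks: RangeInclusive<u64> },
}

/// A mock that stores nothing and only records the calls it receives.
#[derive(Default)]
pub struct RecordingCache {
    calls: Mutex<Vec<CacheCall>>,
}

impl RecordingCache {
    pub fn get(&self, contract: [u8; 20], blocks: RangeInclusive<u64>) {
        self.calls.lock().unwrap().push(CacheCall::Get { contract, blocks });
    }

    pub fn set(&self, contract: [u8; 20], blocks: RangeInclusive<u64>) {
        self.calls.lock().unwrap().push(CacheCall::Set { contract, blocks });
    }

    /// Returns a snapshot of all calls made so far, in order.
    pub fn calls(&self) -> Vec<CacheCall> {
        self.calls.lock().unwrap().clone()
    }
}
```

A test would drive the data flow against such a mock and then compare `calls()` with the exact sequence of reads and writes expected for a tricky data set.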
Database
A real database integration test will be added to test the add/remove from cache implementation to verify that it correctly merges blocks, handles concurrency issues, etc.
Migration
None
Documentation
None, aside from code comments
Implementation Plan
These estimates are inflated to account for the author's lack of experience with Postgres, Ethereum, Futures0.1, and The Graph in general.
- (1) Create benchmarks
- Postgres Cache
  - (0.5) Block Cache
  - (0.5) Trace Serialization/Deserialization
  - (1.0) Ranges Cache
  - (0.5) Concurrency/Transactions
  - (0.5) Tests against Postgres
- Data Flow
  - (3) Implementation
  - (1) Unit tests
- (0.5) Run Benchmarks
Total: 8
PLAN-0003: Remove JSONB Storage
- Author
- David Lutterkort
- Implements
- No RFC - no user visible changes
- Engineering Plan pull request
- https://github.com/graphprotocol/rfcs/pull/7
- Date of submission
- 2019-12-18
- Date of approval
- 2019-12-20
- Approved by
- Jess Ngo, Jannis Pohlmann
Summary
Remove JSONB storage from graph-node. That means that we want to remove
the old storage scheme, and only use relational storage going
forward. At a high level, removal has to touch the following areas:
- user subgraphs in the hosted service
- user subgraphs in self-hosted graph-node instances
- subgraph metadata in subgraphs.entities (see this issue)
- the graph-node code base
Because it touches so many different areas, JSONB storage removal will need to happen in several steps, the last being the actual removal of JSONB code. The first three areas above are independent of each other and can be worked on in parallel.
Implementation
User Subgraphs in the Hosted Service
We will need to communicate to users that they need to update their subgraphs if they still use JSONB storage. Currently, there are ~ 580 subgraphs (list) belonging to 220 different organizations using JSONB storage. It is quite likely that the vast majority of them are no longer needed and are simply left over from somebody trying something out.
We should contact users and tell them that we will delete their subgraph after a certain date (say 2020-02-01) unless they deploy a new version of the subgraph (with an explanation of why, of course). Redeploying their subgraph is all that is needed for those updates.
Self-hosted User Subgraphs
We will need to tell users that the 'old' JSONB storage is deprecated and support for it will be removed as of some target date, and that they need to redeploy their subgraph.
Users will need some documentation/tooling to help them understand
- which of their deployed subgraphs still use JSONB storage
- how to remove old subgraphs
- how to remove old deployments
Subgraph Metadata in subgraphs.entities
We can treat the subgraphs schema like a normal subgraph, with the
exception that some entities must not be versioned. For that, we will need
to adopt code that makes it possible to write entities to the store without
recording their version (or, more generally, so that there will only be one
version of the entity, tagged with a block range [0,)).
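The notation [0,) is a Postgres range literal with an inclusive lower bound of 0 and no upper bound. The sketch below illustrates why a single version tagged this way is visible at every block; the `BlockRange` type and its methods are hypothetical and not taken from graph-node.

```rust
/// A half-open block range `[start, end)`. `end == None` means the range is
/// unbounded above, as in the Postgres literal `[0,)`.
#[derive(Debug, Clone, Copy, PartialEq)]
pub struct BlockRange {
    pub start: u32,
    pub end: Option<u32>,
}

impl BlockRange {
    /// The single range carried by an entity that is not versioned.
    pub const UNVERSIONED: BlockRange = BlockRange { start: 0, end: None };

    /// Returns `true` if an entity version with this range is visible at `block`.
    pub fn contains(&self, block: u32) -> bool {
        self.start <= block && self.end.map_or(true, |end| block < end)
    }
}
```

A versioned entity would instead carry a bounded range such as `[5, 10)` and stop being visible at block 10, when the next version takes over.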
We will manually create the DDL for the subgraphs.graphql schema and run
that as part of a database migration. In that migration, we will also copy
the existing metadata from subgraphs.entities and
subgraphs.entity_history into their new tables.
The Code Base
Delete all code handling JSONB storage. This will mostly affect
entities.rs and jsonb_queries.rs in graph-store-postgres, but there
are also smaller things, for example that we no longer need the annotations
on Entity to serialize them to the JSON format that JSONB uses.
Tests
Most of the code-level changes are covered by the existing test suite. The major exception is that the migration of subgraph metadata needs to be tested and checked manually, using a recent dump of the production database.
Migration
See above on migrating data in the subgraphs schema.
Documentation
No user-facing documentation is needed.
Implementation Plan
No estimates yet, as we should first agree on this general course of action.
- Notify hosted users to update their subgraph or have it deleted by date X
- Mark JSONB storage as deprecated and announce when it will be removed
- Provide tool to ship with graph-node to delete unused deployments and unneeded subgraphs
- Add affordance to not version entities to relational storage code
- Write SQL migrations to create new subgraph metadata schema and copy existing data
- Delete old JSONB code
- On start of graph-node, add check for any deployments that still use JSONB storage and log warning messages telling users to redeploy (once the JSONB code has been deleted, this data cannot be accessed any more)
Open Questions
None
Obsolete Engineering Plans
Obsolete Engineering Plans are moved to the engineering-plans/obsolete
directory in the rfcs repository. They are listed below for reference.
- No Engineering Plans have been obsoleted yet.
Rejected Engineering Plans
Rejected Engineering Plans can be found by filtering open and closed pull
requests by those that are labeled with rejected. This list can be found
here.