Data Import Configuration

In our guide on using custom graph data, we introduced the basics of importing graph data using a simple YAML configuration. This section delves deeper, providing a thorough exploration of the extensive configuration options available for data import.

Supported data source

Currently we only support import data to graph from local csv files or odps table. See configuration loading_config.data_source.scheme.

Column mapping

When importing vertex and edge data into a graph, users must define how the raw data maps to the graph’s schema. This can be done using a YAML configuration, as shown:

- column:
    index: 0  # Column index in the data source
    name: col_name # If a column name is present
  property: property_name  # The mapped property name

The column mapping requirements differ based on the data source:

Import from CSV

You can provide either index, name, or both. If both index and name are specified, we will check whether they matches.

Import from ODPS Table

You just need to specify the name of the column, since the name is guaranteed to be unique in a odps table. The index is disregarded.

Sample Configuration for loading “Modern” Graph from csv files

To illustrate, let’s examine the modern_graph_import.yaml file. This configuration is designed for importing the “modern” graph and showcases the full range of configuration possibilities. We’ll dissect each configuration item in the sections that follow.

Note

Note that a @ is prepend to the location, which means all files will be uploaded from local.

loading_config: 
  x_csr_params:
    parallelism: 1
    build_csr_in_mem: true
    use_mmap_vector: true
  data_source:
    scheme: file
    location: "@/home/modern_graph/"
  format: 
    metadata: 
      delimiter: "|"  
      header_row: true
      quoting: false
      quote_char: '"'
      double_quote: true
      escape_char: '\'
      block_size: 4MB
vertex_mappings: 
  - type_name: person  
    inputs: 
      - person.csv  
    column_mappings:  
      - column:
          index: 0  
          name: id 
        property: id 
      - column:
          index: 1 
          name: name
        property: name
      - column:
          index: 2
          name: age
        property: age
  - type_name: software
    inputs:
      - software.csv
    column_mappings:
      - column:
          index: 0      
          name: id  
        property: id  
      - column:
          index: 1
          name: name
        property: name
      - column:
          index: 2
          name: lang
        property: lang
edge_mappings: 
  - type_triplet:
      edge: knows  
      source_vertex:  person 
      destination_vertex:  person  
    inputs: 
      - person_knows_person.csv 
    source_vertex_mappings:
      - column:  
          index: 0  
          name: src_id
        property: id
    destination_vertex_mappings:
      - column:  
          index: 1  
          name: dst_id
        property: id
    column_mappings:
      - column:
          index: 2
          name: weight
        property: weight
  - type_triplet:
      edge: created
      source_vertex: person
      destination_vertex: software
    inputs:
      -  person_created_software.csv
    source_vertex_mappings:
      - column:
          index: 0
          name: id
    destination_vertex_mappings:
      - column:
          index: 1
          name: id
    column_mappings:
      - column:
          index: 2
          name: weight
        property: weight

Sample configuration for loading “Modern Graph” from odps tables

graph: dd_graph
loading_config:
  data_source:
    scheme: odps  # file, odps
  import_option: init # init, overwrite
  format:
    type: arrow
vertex_mappings:
  - type_name: person  # must align with the schema
    inputs:
      - your_proj_name/table_name/partition_col_name=partition_name
    column_mappings:  
          - column:
              index: 0  
              name: id 
            property: id 
          - column:
              index: 1 
              name: name
            property: name
          - column:
              index: 2
              name: age
            property: age
  - type_name: software
    inputs:
      - your_proj_name/table_name/partition_col_name=partition_name
    column_mappings:
      - column:
          index: 0      
          name: id  
        property: id  
      - column:
          index: 1
          name: name
        property: name
      - column:
          index: 2
          name: lang
        property: lang
edge_mappings:
  - type_triplet:
      edge: knows
      source_vertex: person
      destination_vertex: person
    inputs:
      - your_proj_name/table_name/partition_col_name=partition_name
    source_vertex_mappings:
      - column:
          index: 0
          name: src_user_id
        property: id
    destination_vertex_mappings:
      - column:
          index: 1
          name: dst_user_id
        property: id
    column_mappings:
      - column:
          index: 2
          name: weight
        property: weight
  - type_triplet:
      edge: created
      source_vertex: person
      destination_vertex: software
    inputs:
      - your_proj_name/table_name/partition_col_name=partition_name
    source_vertex_mappings:
      - column:
          index: 0
          name: id
    destination_vertex_mappings:
      - column:
          index: 1
          name: id
    column_mappings:
      - column:
          index: 2
          name: weight
        property: weight

Configuration Breakdown

The table below offers a detailed breakdown of each configuration item. In this table, we use the “.” notation to represent the hierarchy within the YAML structure.

Key

Default

Descriptions

Mandatory

loading_config

N/A

Loading configurations

Yes

loading_config.data_source

N/A

Place that maintains the raw data

Yes

loading_config.data_source.location

N/A

Path to the data source in the container, which must be mapped from the host machine while initializing the service

Yes

loading_config.scheme

file

The source of input data. Currently only file and odps are supported

No

loading_config.format

N/A

The format of the raw data in CSV

Yes

loading_config.format.metadata

N/A

Mainly for configuring the options for reading CSV

Yes

loading_config.format.metadata.delimiter

‘|’

Delimiter used to split a row of data

Yes

loading_config.format.metadata.header_row

true

Indicate if the first row should be used as the header

No

loading_config.format.metadata.quoting

false

Whether quoting is used

No

loading_config.format.metadata.quote_char

‘”’

Quoting character (if quoting is true)

No

loading_config.format.metadata.double_quote

true

Whether a quote inside a value is double-quoted

No

loading_config.format.metadata.escaping

false

Whether escaping is used

No

loading_config.format.metadata.escape_char

‘\’

Escaping character (if escaping is true)

No

loading_config.format.metadata.null_values

[]

Recognized spellings for null values

No

loading_config.format.metadata.batch_size

4MB

The size of batch for reading from files

No

loading_config.x_csr_params.parallelism

1

Number of threads used for bulk loading

No

loading_config.x_csr_params.build_csr_in_mem

false

Whether to build csr fully in memory

No

loading_config.x_csr_params.use_mmap_vector

false

Whether to use mmap_vector rather than mmap_array for building

No

vertex_mappings

N/A

Define how to map the raw data into a graph vertex in the schema

Yes

vertex_mappings.type_name

N/A

Name of the vertex type

Yes

vertex_mappings.inputs

N/A

List of files containing raw data for the vertex type, relative to loading_config.data_source.location

Yes

vertex_mappings.column_mappings

The columns in the raw data will be automatically mapped to the properties, in given order

Define which column of the vertex data maps to a given property

No

vertex_mappings.column_mappings.column.index

N/A

Index of the column in the raw data

No

vertex_mappings.column_mappings.column.name

N/A

Name of the column in the raw data

No

vertex_mappings.column_mappings.property

N/A

Property in the schema to which the column maps.

No

edge_mappings

N/A

Define how to map the raw data into a graph edge in the schema

Yes

edge_mappings.type_triplet

N/A

A triplet that uniquely defines an edge

Yes

edge_mappings.type_triplet.edge

N/A

The type of the edge

Yes

edge_mappings.type_triplet.source_vertex

N/A

Type of the source vertex for the edge

Yes

edge_mappings.type_triplet.destination_vertex

N/A

Type of the destination vertex for the edge

Yes

edge_mappings.inputs

N/A

List of files containing raw data for the edge type, relative to loading_config.data_source.location

Yes

edge_mappings.source_vertex_mappings

The first column of the edge data

Define which columns of the edge data map to its source vertex.

No

edge_mappings.source_vertex_mappings.column.index

N/A

Index of the column in the raw data representing the source vertex.

No

edge_mappings.source_vertex_mappings.column.name

N/A

Name of the column in the raw data representing the source vertex.

No

edge_mappings.destination_vertex_mappings

The second column of the edge data

Define which columns of the edge data map to its source vertex.

No

edge_mappings.destination_vertex_mappings.column.index

N/A

Index of the column in the raw data representing the destination vertex.

No

edge_mappings.destination_vertex_mappings.column.name

N/A

Name of the column in the raw data representing the destination vertex.

No

edge_mappings.column_mappings

The columns in the raw data will be automatically mapped to the properties, in given order

Define which column of the edge data maps to a given property.

No

edge_mappings.column_mappings.column.index

N/A

Index of the column in the raw data

No

edge_mappings.column_mappings.column.name

N/A

Name of the column in the raw data

No

edge_mappings.column_mappings.property

N/A

Property in the schema to which the column maps.

No (but must present if column presents)

Loading Configurations

The loading_config section defines the primary settings for data loading. This includes specifying the data source, its format, and other metadata.

  • loading_config.data_source.location: This is the path to the data source within the container. It’s essential to ensure this location is mapped from the host machine when initializing the service.

  • loading_config.format.metadata: Mainly for configuring the options for reading CSV

    • loading_config.format.metadata.delimiter: Defines the delimiter used to split rows of data.

    • loading_config.format.metadata.header_row: Indicate if the first row should be used as the header.

    • loading_config.format.metadata.quoting: Whether quoting is used

    • loading_config.format.metadata.quote_char: Quoting character (if quoting is true)

    • loading_config.format.metadata.double_quote: Whether a quote inside a value is double-quoted

    • loading_config.format.metadata.escaping: Whether escaping is used

    • loading_config.format.metadata.escape_char: Escaping character (if escaping is true)

    • loading_config.format.metadata.block_size: The size of batch for reading from files

    • loading_config.format.metadata.null_values: Recognized spellings for null values

Vertex Mappings

The vertex_mappings section outlines how raw data maps to graph vertices based on the defined schema.

  • vertex_mappings.type_name: Specifies the name of the vertex type.

  • vertex_mappings.inputs: Lists the files containing raw data for the vertex type. These paths are relative to loading_config.data_source.location.

  • vertex_mappings.column_mappings.column**: Define which column of the vertex data maps to a given property. If not give, the columns in the raw data will be automatically mapped to the properties, in given order.

    • vertex_mappings.column_mappings.column.index: Index of the column in the raw data.

    • vertex_mappings.column_mappings.column.name: Name of the column in the raw data.

  • vertex_mappings.column_mappings.property: Property in the schema to which the column maps.

Edge Mappings

The edge_mappings section details how raw data maps to graph edges as per the schema.

  • edge_mappings.type_triplet: A triplet that uniquely defines an edge.

    • edge_mappings.type_triplet.edge: Indicates the type of the edge.

    • edge_mappings.type_triplet.source_vertex: Indicates the type of the source vertex.

    • edge_mappings.type_triplet.destination_vertex: Indicates the type of the destination vertex.

  • edge_mappings.inputs: Lists the files containing raw data for the edge type. These paths are relative to loading_config.data_source.location.

  • edge_mappings.source_vertex_mappings: Define which columns (can be multiple) of the edge map to its source vertex. If not given, the first column of the edge data will be used by default.

    • edge_mappings.source_vertex_mappings.column.index: Index of the column in the raw data representing the source vertex.

    • edge_mappings.source_vertex_mappings.column.name: Name of the column in the raw data representing the source vertex.

  • edge_mappings.destination_vertex_mappings: Define which columns (can be multiple) of the edge map to its destination vertex. If not given, the second column of the edge data will be used by default.

    • edge_mappings.destination_vertex_mappings.column.index: Index of the column in the raw data representing the destination vertex.

    • edge_mappings.destination_vertex_mappings.column.name: Name of the column in the raw data representing the destination vertex.

  • edge_mappings.column_mappings.column: Define which column of the edge data maps to a given property. If not give, the columns in the raw data will be automatically mapped to the properties, in given order.

    • edge_mappings.column_mappings.column.index: Index of the column in the raw data.

    • edge_mappings.column_mappings.column.name: Name of the column in the raw data.

  • edge_mappings.column_mappings.property: Property in the schema to which the column maps.

By understanding and appropriately setting these configurations, users can seamlessly import their graph data into GraphScope, ensuring data integrity and optimal performance.