Specifications for Importing a Manifest File into Huawei Cloud ModelArts (AI Development Platform)


October 30, 2023


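A Manifest file defines the mapping between annotated objects and their annotation content. With this import method, a dataset is imported by using a Manifest file. Manifest files can be imported from OBS. When importing a Manifest file from OBS, make sure the current user has permission to access the OBS path where the Manifest file is stored.

Manifest files must follow a fairly long list of rules, so importing new data from an OBS directory is recommended instead. Manifest import is typically used to migrate ModelArts data across regions or accounts: once you have labeled data with ModelArts in one region and published the dataset, the corresponding Manifest file is available under the dataset's output path. You can then use that Manifest file to import the dataset into ModelArts in another region or under another account. The imported data already carries its annotations, so no relabeling is needed, which improves development efficiency.

A Manifest file describes raw files and their annotations, and can be used for labeling, training, and inference. A Manifest file may also contain only raw file information without annotations, for example for inference or for creating an unlabeled dataset. A Manifest file must meet the following requirements:

Manifest files use UTF-8 encoding. The source value for text classification may contain Chinese characters; using Chinese in other fields is not recommended.
Manifest files use the JSON Lines format (jsonlines.org), with one JSON object per line.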

{"source": "/path/to/image1.jpg", "annotation": … }
{"source": "/path/to/image2.jpg", "annotation": … }
{"source": "/path/to/image3.jpg", "annotation": … }

For readability, the Manifest examples below are formatted as multi-line JSON objects.
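As a rough illustration of the one-object-per-line rule, the sketch below serializes entries as JSON Lines text and parses them back using only Python's standard `json` module. The helper names `to_manifest_lines` and `parse_manifest_lines` and the sample entries are hypothetical and not part of ModelArts.

```python
import json


def to_manifest_lines(entries):
    """Serialize each entry as a single-line JSON object (JSON Lines)."""
    # json.dumps without indent never emits a raw newline inside an object.
    # ensure_ascii=False keeps non-ASCII text readable; save the file as UTF-8.
    return "\n".join(json.dumps(e, ensure_ascii=False) for e in entries) + "\n"


def parse_manifest_lines(text):
    """Parse JSON Lines text back into a list of dicts, skipping blank lines."""
    return [json.loads(line) for line in text.split("\n") if line.strip()]


# Hypothetical entries without annotations (allowed for unlabeled data).
entries = [
    {"source": "s3://path/to/image1.jpg"},
    {"source": "s3://path/to/image2.jpg", "usage": "TRAIN"},
]
manifest_text = to_manifest_lines(entries)
print(manifest_text, end="")
```

A pretty-printed example copied from this page can be collapsed the same way: load it with `json.loads` and write it back with `json.dumps` before appending it to the Manifest file.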

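A Manifest file can be written by a user, produced by a third-party tool, or generated by ModelArts data labeling. There are no special requirements on the file name; any valid file name works. For internal convenience, files generated by ModelArts data labeling are named "DatasetName-VersionName.manifest", for example "animal-v201901231130304123.manifest".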

Image Classification

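Table 1 Field descriptions
Field	Mandatory	Description
source	Yes	URI of the annotated object. For the data source types and examples, see Table 2.
usage	No	Empty by default. Valid values: TRAIN: the object is used for training. EVAL: the object is used for evaluation. TEST: the object is used for testing. INFERENCE: the object is used for inference. If this field is omitted, the consumer decides how to use the object.
id	No	Sample ID exported by the system. It can be left out on import.
annotation	No	If not set, the object is unlabeled. The value is a list of objects; for details, see Table 3.
inference-loc	No	Present when the file is generated by an inference service. Indicates the location of the inference result file.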

Table 2 Data source types
Type	Example
OBS	"source":"s3://path-to-jpg"
Content	"source":"content://I love machine learning"

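Table 3 annotation object description
Field	Mandatory	Description
type	Yes	Label type. Valid values: image_classification: image classification. text_classification: text classification. text_entity: text named entity. object_detection: object detection. audio_classification: sound classification. audio_content: speech content. audio_segmentation: speech start and end points.
name	Yes/No	Mandatory for classification types, optional for other types. In this example, it is the image class name.
id	Yes/No	Label ID. Mandatory for triplets, optional for other types. In a triplet, entity label IDs have the form "E" plus a number, such as "E1" and "E2", and relation label IDs have the form "R" plus a number, such as "R1" and "R2".
property	No	Attributes of the annotation. In this example, the cat has two attributes: color and kind.
hard	No	Whether the annotation is a hard example. "True" means it is a hard example; "False" means it is not.
annotated-by	No	Defaults to "human", which indicates manual annotation. Valid value: human.
creation-time	No	Time when the annotation was created. This is the time the user wrote the annotation, not the time the Manifest file was generated.
confidence	No	Confidence of a machine annotation. The value ranges from 0 to 1.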
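To make Tables 1 and 3 concrete, here is a minimal sketch that assembles one image classification entry and prints it as a single Manifest line. It is not taken from ModelArts: the helper name `image_classification_entry`, the label, and the property values are hypothetical, and the "modelarts/" prefix on the type is an assumption that mirrors the other examples in this document.

```python
import json


def image_classification_entry(source, label, creation_time,
                               properties=None, usage=None):
    """Build one manifest entry for an image with a single class label."""
    annotation = {
        # Assumption: same "modelarts/" prefix as the other examples here.
        "type": "modelarts/image_classification",
        "name": label,                   # mandatory for classification types
        "annotated-by": "human",
        "creation-time": creation_time,  # when the label was written
    }
    if properties:
        annotation["property"] = properties
    entry = {"source": source, "annotation": [annotation]}
    if usage is not None:
        entry["usage"] = usage           # TRAIN, EVAL, TEST, or INFERENCE
    return entry


entry = image_classification_entry(
    "s3://path/to/image1.jpg",
    "cat",
    "2019-01-23 11:30:30",
    properties={"color": "white", "kind": "Persian cat"},  # hypothetical values
    usage="TRAIN",
)
line = json.dumps(entry)
print(line)
```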

Image Segmentation

{
    "annotation": [{
        "annotation-format": "PASCAL VOC",
        "type": "modelarts/image_segmentation",
        "annotation-loc": "s3://path/to/annotation/image1.xml",
        "creation-time": "2020-12-16 21:36:27",
        "annotated-by": "human"
    }],
    "usage": "train",
    "source": "s3://path/to/image1.jpg",
    "id": "16d196c19bf61994d7deccafa435398c",
    "sample-type": 0
}

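The "source", "usage", and "annotation" parameters are the same as for image classification. For details, see Table 1.
"annotation-loc": storage path of the annotation file. Mandatory for image segmentation and object detection, optional for other types.
"annotation-format": format of the annotation file. Optional; defaults to "PASCAL VOC", which is currently the only supported format.
"sample-type": sample format. 0 indicates image, 1 text, 2 audio, 4 table, and 6 video.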

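Table 4 PASCAL VOC format description
Field	Mandatory	Description
folder	Yes	Directory where the data source is located.
filename	Yes	Name of the annotated file.
size	Yes	Pixel information of the image. width: mandatory, image width. height: mandatory, image height. depth: mandatory, number of image channels.
segmented	Yes	Whether the image is used for segmentation.
mask_source	No	Path of the mask saved for image segmentation.
object	Yes	Object detection information. Multiple annotated objects produce multiple object elements. name: mandatory, class of the annotated content. pose: mandatory, shooting angle of the annotated content. truncated: mandatory, whether the annotated content is truncated (0 means not truncated). occluded: mandatory, whether the annotated content is occluded (0 means not occluded). difficult: mandatory, whether the annotated target is hard to recognize (0 means easy to recognize). confidence: optional, confidence of the annotated target, from 0 to 1. bndbox: mandatory, type of the bounding shape; for valid values, see Table 5. mask_color: mandatory, label color as an RGB value.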

Table 5 Bounding shape types
type	Shape	Annotation information
polygon	Polygon	Coordinates of each point. 100 100 200 100 250 150 200 200 100 200 50 150 100 100

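Example:

<annotation>
    <folder>NA</folder>
    <filename>image_0006.jpg</filename>
    <source>
        <database>Unknown</database>
    </source>
    <size>
        <width>230</width>
        <height>300</height>
        <depth>3</depth>
    </size>
    <segmented>1</segmented>
    <mask_source>obs://xianao/out/dataset-8153-Jmf5ylLjRmSacj9KevS/annotation/V001/segmentationClassRaw/image_0006.png</mask_source>
</annotation>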

Text Classification

{
    "source": "content://I like this product",
    "id": "XGDVGS",
    "annotation": [
        {
            "type": "modelarts/text_classification",
            "name": "positive",
            "annotated-by": "human",
            "creation-time": "2019-01-23 11:30:30"
        }
    ]
}

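The content field is the text being annotated (UTF-8 encoded; Chinese is allowed). The other parameters are the same as for image classification; see Table 1.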

Text Named Entity

{
    "source":"content://Michael Jordan is the most famous basketball player in the world.",
    "usage":"TRAIN",
    "annotation":[
        {
            "type":"modelarts/text_entity",
            "name":"Person",
            "property":{
                "@modelarts:start_index":0,
                "@modelarts:end_index":14
            },
            "annotated-by":"human",
            "creation-time":"2019-01-23 11:30:30"
        },
        {
            "type":"modelarts/text_entity",
            "name":"Category",
            "property":{
                "@modelarts:start_index":34,
                "@modelarts:end_index":44
            },
            "annotated-by":"human",
            "creation-time":"2019-01-23 11:30:30"
        }
    ]
}

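The "source", "usage", and "annotation" parameters are the same as for image classification. For details, see Table 1.

The property parameters are described in Table 6. For example, when "source":"content://Michael Jordan", extracting "Michael" corresponds to a "start_index" of 0 and an "end_index" of 7.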

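Table 6 property parameters
Parameter	Data type	Description
@modelarts:start_index	Integer	Start position of the text. Values start from 0, and the character at start_index is included.
@modelarts:end_index	Integer	End position of the text. The character at end_index is excluded.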
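Because start_index is inclusive and end_index is exclusive, the two values line up with a Python slice taken over the text that follows the content:// prefix. The sketch below checks this against the named entity example above; the helper name `extract_entities` is hypothetical.

```python
CONTENT_PREFIX = "content://"


def extract_entities(entry):
    """Return (label, text) pairs for every text_entity annotation."""
    # Indexes are counted on the text after the "content://" prefix.
    text = entry["source"][len(CONTENT_PREFIX):]
    spans = []
    for ann in entry.get("annotation", []):
        if ann["type"] != "modelarts/text_entity":
            continue
        start = ann["property"]["@modelarts:start_index"]  # inclusive
        end = ann["property"]["@modelarts:end_index"]      # exclusive
        spans.append((ann["name"], text[start:end]))
    return spans


entry = {
    "source": "content://Michael Jordan is the most famous basketball "
              "player in the world.",
    "annotation": [
        {"type": "modelarts/text_entity", "name": "Person",
         "property": {"@modelarts:start_index": 0,
                      "@modelarts:end_index": 14}},
        {"type": "modelarts/text_entity", "name": "Category",
         "property": {"@modelarts:start_index": 34,
                      "@modelarts:end_index": 44}},
    ],
}
entities = extract_entities(entry)
print(entities)
```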

Text Triplet

{
    "source":"content://\"Three Body\" is a series of long science fiction novels created by Liu Cix.",
    "usage":"TRAIN",
    "annotation":[
        {
            "type":"modelarts/text_entity",
            "name":"Person",
            "id":"E1",
            "property":{
                "@modelarts:start_index":67,
                "@modelarts:end_index":74
            },
            "annotated-by":"human",
            "creation-time":"2019-01-23 11:30:30"
        },
        {
            "type":"modelarts/text_entity",
            "name":"Book",
            "id":"E2",
            "property":{
                "@modelarts:start_index":0,
                "@modelarts:end_index":12
            },
            "annotated-by":"human",
            "creation-time":"2019-01-23 11:30:30"
        },
        {
            "type":"modelarts/text_triplet",
            "name":"Author",
            "id":"R1",
            "property":{
                "@modelarts:from":"E1",
                "@modelarts:to":"E2"
            },
            "annotated-by":"human",
            "creation-time":"2019-01-23 11:30:30"
        },
        {
            "type":"modelarts/text_triplet",
            "name":"Works",
            "id":"R2",
            "property":{
                "@modelarts:from":"E2",
                "@modelarts:to":"E1"
            },
            "annotated-by":"human",
            "creation-time":"2019-01-23 11:30:30"
        }
    ]
}

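The "source", "usage", and "annotation" parameters are the same as for image classification. For details, see Table 1.

The property parameters are described in Table 7. "@modelarts:start_index" and "@modelarts:end_index" have the same meaning as for text named entities. For example, when the source text is "Three Body" is a series of long science fiction novels created by Liu Cix., "Liu Cix" is the Person entity, "Three Body" is the Book entity, the relation from Person to Book is Author, and the relation from Book to Person is Works.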

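Table 7 property parameters
Parameter	Data type	Description
@modelarts:start_index	Integer	Start position of the triplet entity. Values start from 0, and the character at start_index is included.
@modelarts:end_index	Integer	End position of the triplet entity. The character at end_index is excluded.
@modelarts:from	String	ID of the entity where the triplet relation starts.
@modelarts:to	String	ID of the entity the triplet relation points to.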
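One way to read the triplet annotations is to index the entities by id first and then follow each relation's from and to references. The sketch below does this for the example above; the helper name `resolve_triplets` is hypothetical, and only the fields needed for the lookup are reproduced.

```python
def resolve_triplets(text, annotations):
    """Return (from_text, relation_name, to_text) for each text_triplet."""
    # First pass: map entity ids ("E1", "E2", ...) to the text they cover.
    entities = {}
    for ann in annotations:
        if ann["type"] == "modelarts/text_entity":
            prop = ann["property"]
            start = prop["@modelarts:start_index"]
            end = prop["@modelarts:end_index"]
            entities[ann["id"]] = text[start:end]
    # Second pass: follow the from/to ids of each relation ("R1", "R2", ...).
    triplets = []
    for ann in annotations:
        if ann["type"] == "modelarts/text_triplet":
            prop = ann["property"]
            triplets.append((entities[prop["@modelarts:from"]],
                             ann["name"],
                             entities[prop["@modelarts:to"]]))
    return triplets


text = ('"Three Body" is a series of long science fiction novels '
        'created by Liu Cix.')
annotations = [
    {"type": "modelarts/text_entity", "name": "Person", "id": "E1",
     "property": {"@modelarts:start_index": 67, "@modelarts:end_index": 74}},
    {"type": "modelarts/text_entity", "name": "Book", "id": "E2",
     "property": {"@modelarts:start_index": 0, "@modelarts:end_index": 12}},
    {"type": "modelarts/text_triplet", "name": "Author", "id": "R1",
     "property": {"@modelarts:from": "E1", "@modelarts:to": "E2"}},
    {"type": "modelarts/text_triplet", "name": "Works", "id": "R2",
     "property": {"@modelarts:from": "E2", "@modelarts:to": "E1"}},
]
triplets = resolve_triplets(text, annotations)
print(triplets)
```

A relation that references an undefined entity id raises a KeyError here, which can serve as a simple consistency check before import.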

Object Detection

{
    "source":"s3://path/to/image1.jpg",
    "usage":"TRAIN",
    "hard":"true",
    "hard-coefficient":0.8,
    "annotation": [
        {
            "type":"modelarts/object_detection",
            "annotation-loc": "s3://path/to/annotation1.xml",
            "annotation-format":"PASCAL VOC",
            "annotated-by":"human",
            "creation-time":"2019-01-23 11:30:30"                
        }]
}

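The "source", "usage", and "annotation" parameters are the same as for image classification. For details, see Table 1.
"annotation-loc": storage path of the annotation file. Mandatory for object detection and image segmentation, optional for other types.
"annotation-format": format of the annotation file. Optional; defaults to "PASCAL VOC", which is currently the only supported format.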

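Table 8 PASCAL VOC format description
Field	Mandatory	Description
folder	Yes	Directory where the data source is located.
filename	Yes	Name of the annotated file.
size	Yes	Pixel information of the image. width: mandatory, image width. height: mandatory, image height. depth: mandatory, number of image channels.
segmented	Yes	Whether the image is used for segmentation.
object	Yes	Object detection information. Multiple annotated objects produce multiple object elements. name: mandatory, class of the annotated content. pose: mandatory, shooting angle of the annotated content. truncated: mandatory, whether the annotated content is truncated (0 means not truncated). occluded: mandatory, whether the annotated content is occluded (0 means not occluded). difficult: mandatory, whether the annotated target is hard to recognize (0 means easy to recognize). confidence: optional, confidence of the annotated target, from 0 to 1. bndbox: mandatory, type of the bounding shape; for valid values, see Table 9.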

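Table 9 Bounding shape types
type	Shape	Annotation information
point	Point	Coordinates of the point. 100 100
line	Line	Coordinates of each point. 100 100 200 200
bndbox	Rectangle	Coordinates of the upper-left and lower-right points. 100 100 200 200
polygon	Polygon	Coordinates of each point. 100 100 200 100 250 150 200 200 100 200 50 150
circle	Circle	Center coordinates and radius. 100 100 50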

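Example:

<annotation>
   <folder>test_data</folder>
   <filename>260730932.jpg</filename>
   <size>
       <width>767</width>
       <height>959</height>
       <depth>3</depth>
   </size>
   <segmented>0</segmented>
</annotation>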
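The header fields of a PASCAL VOC annotation file can be read with Python's standard `xml.etree.ElementTree`. The sketch below is illustrative only: the `annotation` root element follows the common PASCAL VOC layout and is an assumption here, the helper name `read_voc_header` is hypothetical, and the `object` elements are left out.

```python
import xml.etree.ElementTree as ET

# Header values taken from the object detection example; the <annotation>
# root element is assumed from the usual PASCAL VOC layout.
VOC_XML = """<annotation>
   <folder>test_data</folder>
   <filename>260730932.jpg</filename>
   <size>
       <width>767</width>
       <height>959</height>
       <depth>3</depth>
   </size>
   <segmented>0</segmented>
</annotation>"""


def read_voc_header(xml_text):
    """Extract the mandatory header fields of a PASCAL VOC annotation."""
    root = ET.fromstring(xml_text)
    size = root.find("size")
    return {
        "folder": root.findtext("folder"),
        "filename": root.findtext("filename"),
        "width": int(size.findtext("width")),
        "height": int(size.findtext("height")),
        "depth": int(size.findtext("depth")),
        "segmented": root.findtext("segmented") == "1",
    }


header = read_voc_header(VOC_XML)
print(header)
```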

Sound Classification

{
    "source": "s3://path/to/pets.wav",
    "annotation": [
        {
            "type": "modelarts/audio_classification",
            "name":"cat",    
            "annotated-by":"human",
            "creation-time":"2019-01-23 11:30:30"
        } 
    ]
}

The "source", "usage", and "annotation" parameters are the same as for image classification. For details, see Table 1.

Speech Content

{
    "source":"s3://path/to/audio1.wav",
    "annotation":[
        {
            "type":"modelarts/audio_content",
            "property":{
                "@modelarts:content":"Today is a good day."
            },
            "annotated-by":"human",
            "creation-time":"2019-01-23 11:30:30"
        }
    ]
}

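The "source", "usage", and "annotation" parameters are the same as for image classification. For details, see Table 1.
The "@modelarts:content" parameter in "property" is a String that holds the speech content.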

Speech Segmentation

{
    "source":"s3://path/to/audio1.wav",
    "usage":"TRAIN",
    "annotation":[
        {
            "type":"modelarts/audio_segmentation",
            "property":{
                "@modelarts:start_time":"00:01:10.123",
                "@modelarts:end_time":"00:01:15.456",
                "@modelarts:source":"Tom",
                "@modelarts:content":"How are you?"
            },
            "annotated-by":"human",
            "creation-time":"2019-01-23 11:30:30"
        },
        {
            "type":"modelarts/audio_segmentation",
            "property":{
                "@modelarts:start_time":"00:01:22.754",
                "@modelarts:end_time":"00:01:24.145",
                "@modelarts:source":"Jerry",
                "@modelarts:content":"I'm fine, thank you."
            },
            "annotated-by":"human",
            "creation-time":"2019-01-23 11:30:30"
        }
    ]
}

The "source", "usage", and "annotation" parameters are the same as for image classification. For details, see Table 1.
The "property" parameters are described in Table 10.

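Table 10 property parameters
Parameter	Data type	Description
@modelarts:start_time	String	Start time of the sound, in the format "hh:mm:ss.SSS", where "hh" is hours, "mm" is minutes, "ss" is seconds, and "SSS" is milliseconds.
@modelarts:end_time	String	End time of the sound, in the format "hh:mm:ss.SSS", where "hh" is hours, "mm" is minutes, "ss" is seconds, and "SSS" is milliseconds.
@modelarts:source	String	Source of the sound.
@modelarts:content	String	Content of the sound.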
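The "hh:mm:ss.SSS" strings can be turned into durations with the standard library. The following sketch, with the hypothetical helpers `parse_timestamp` and `segment_duration_ms`, computes the length of a segment and rejects one whose end precedes its start; that check is an assumption about what a valid segment looks like, not a documented ModelArts rule.

```python
from datetime import timedelta


def parse_timestamp(value):
    """Convert an "hh:mm:ss.SSS" string into a timedelta."""
    hours, minutes, seconds = value.split(":")
    secs, millis = seconds.split(".")
    return timedelta(hours=int(hours), minutes=int(minutes),
                     seconds=int(secs), milliseconds=int(millis))


def segment_duration_ms(prop):
    """Return the segment length in milliseconds."""
    start = parse_timestamp(prop["@modelarts:start_time"])
    end = parse_timestamp(prop["@modelarts:end_time"])
    if end < start:
        # Assumed sanity check, not a documented ModelArts rule.
        raise ValueError("end_time is earlier than start_time")
    return (end - start) // timedelta(milliseconds=1)


prop = {
    "@modelarts:start_time": "00:01:10.123",
    "@modelarts:end_time": "00:01:15.456",
    "@modelarts:source": "Tom",
    "@modelarts:content": "How are you?",
}
duration = segment_duration_ms(prop)
print(duration)
```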

Video Annotation

{
	"annotation": [{
		"annotation-format": "PASCAL VOC",
		"type": "modelarts/object_detection",
		"annotation-loc": "s3://path/to/annotation1_t1.473722.xml",
		"creation-time": "2020-10-09 14:08:24",
		"annotated-by": "human"
	}],
	"usage": "train",
	"property": {
		"@modelarts:parent_duration": 8,
		"@modelarts:parent_source": "s3://path/to/annotation1.mp4",
		"@modelarts:time_in_video": 1.473722
	},
	"source": "s3://input/path/to/annotation1_t1.473722.jpg",
	"id": "43d88677c1e9a971eeb692a80534b5d5",
	"sample-type": 0
}

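The "source", "usage", and "annotation" parameters are the same as for image classification. For details, see Table 1.
"annotation-loc": storage path of the annotation file. Mandatory for object detection, optional for other types.
"annotation-format": format of the annotation file. Optional; defaults to "PASCAL VOC", which is currently the only supported format.
"sample-type": sample format. 0 indicates image, 1 text, 2 audio, 4 table, and 6 video.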

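Table 11 property parameters
Parameter	Data type	Description
@modelarts:parent_duration	Double	Duration of the annotated video, in seconds.
@modelarts:time_in_video	Double	Timestamp of the annotated video frame, in seconds.
@modelarts:parent_source	String	OBS path of the annotated video.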

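Table 12 PASCAL VOC format description
Field	Mandatory	Description
folder	Yes	Directory where the data source is located.
filename	Yes	Name of the annotated file.
size	Yes	Pixel information of the image. width: mandatory, image width. height: mandatory, image height. depth: mandatory, number of image channels.
segmented	Yes	Whether the image is used for segmentation.
object	Yes	Object detection information. Multiple annotated objects produce multiple object elements. name: mandatory, class of the annotated content. pose: mandatory, shooting angle of the annotated content. truncated: mandatory, whether the annotated content is truncated (0 means not truncated). occluded: mandatory, whether the annotated content is occluded (0 means not occluded). difficult: mandatory, whether the annotated target is hard to recognize (0 means easy to recognize). confidence: optional, confidence of the annotated target, from 0 to 1. bndbox: mandatory, type of the bounding shape; for valid values, see Table 13.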

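Table 13 Bounding shape types
type	Shape	Annotation information
point	Point	Coordinates of the point. 100 100
line	Line	Coordinates of each point. 100 100 200 200
bndbox	Rectangle	Coordinates of the upper-left and lower-right points. 100 100 200 200
polygon	Polygon	Coordinates of each point. 100 100 200 100 250 150 200 200 100 200 50 150
circle	Circle	Center coordinates and radius. 100 100 50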

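Example:

<annotation>
   <folder>test_data</folder>
   <filename>260730932_t1.473722.jpg.jpg</filename>
   <size>
       <width>767</width>
       <height>959</height>
       <depth>3</depth>
   </size>
   <segmented>0</segmented>
</annotation>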
