---
pipeline_tag: sentence-similarity
license: apache-2.0
tags:
- text2vec
- feature-extraction
- sentence-similarity
- transformers
datasets:
- shibing624/nli_zh
language:
- zh
metrics:
- spearmanr
library_name: transformers
---
# shibing624/text2vec-base-chinese

This is a CoSENT (Cosine Sentence) model: shibing624/text2vec-base-chinese.
It maps sentences to a 768-dimensional dense vector space and can be used for tasks
like sentence embeddings, text matching, or semantic search.
## Evaluation

For an automated evaluation of this model, see the *Evaluation Benchmark*: [text2vec](https://github.com/shibing624/text2vec)

- Chinese text matching task:

| Arch | BaseModel | Model | ATEC | BQ | LCQMC | PAWSX | STS-B | SOHU-dd | SOHU-dc | Avg | QPS |
|:-----------|:----------------------------------|:--------------------------------------------------------------------------------------------------------------------------------------------------|:-----:|:-----:|:-----:|:-----:|:-----:|:-------:|:-------:|:---------:|:-----:|
| Word2Vec | word2vec | [w2v-light-tencent-chinese](https://ai.tencent.com/ailab/nlp/en/download.html) | 20.00 | 31.49 | 59.46 | 2.57 | 55.78 | 55.04 | 20.70 | 35.03 | 23769 |
| SBERT | xlm-roberta-base | [sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2](https://huggingface.co/sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2) | 18.42 | 38.52 | 63.96 | 10.14 | 78.90 | 63.01 | 52.28 | 46.46 | 3138 |
| Instructor | hfl/chinese-roberta-wwm-ext | [moka-ai/m3e-base](https://huggingface.co/moka-ai/m3e-base) | 41.27 | 63.81 | 74.87 | 12.20 | 76.96 | 75.83 | 60.55 | 57.93 | 2980 |
| CoSENT | hfl/chinese-macbert-base | [shibing624/text2vec-base-chinese](https://huggingface.co/shibing624/text2vec-base-chinese) | 31.93 | 42.67 | 70.16 | 17.21 | 79.30 | 70.27 | 50.42 | 51.61 | 3008 |
| CoSENT | hfl/chinese-lert-large | [GanymedeNil/text2vec-large-chinese](https://huggingface.co/GanymedeNil/text2vec-large-chinese) | 32.61 | 44.59 | 69.30 | 14.51 | 79.44 | 73.01 | 59.04 | 53.12 | 2092 |
| CoSENT | nghuyong/ernie-3.0-base-zh | [shibing624/text2vec-base-chinese-sentence](https://huggingface.co/shibing624/text2vec-base-chinese-sentence) | 43.37 | 61.43 | 73.48 | 38.90 | 78.25 | 70.60 | 53.08 | 59.87 | 3089 |
| CoSENT | nghuyong/ernie-3.0-base-zh | [shibing624/text2vec-base-chinese-paraphrase](https://huggingface.co/shibing624/text2vec-base-chinese-paraphrase) | 44.89 | 63.58 | 74.24 | 40.90 | 78.93 | 76.70 | 63.30 | 63.08 | 3066 |
| CoSENT | sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2 | [shibing624/text2vec-base-multilingual](https://huggingface.co/shibing624/text2vec-base-multilingual) | 32.39 | 50.33 | 65.64 | 32.56 | 74.45 | 68.88 | 51.17 | 53.67 | 4004 |

Notes:

- Evaluation metric: Spearman correlation coefficient.
- The `shibing624/text2vec-base-chinese` model was trained with the CoSENT method on Chinese STS-B data, starting from `hfl/chinese-macbert-base`, and scores well on the Chinese STS-B test set. Run [examples/training_sup_text_matching_model.py](https://github.com/shibing624/text2vec/blob/master/examples/training_sup_text_matching_model.py) to train the model. The model files are on the HF model hub. Recommended for general-purpose Chinese semantic matching.
- The `shibing624/text2vec-base-chinese-sentence` model was trained with the CoSENT method, starting from `nghuyong/ernie-3.0-base-zh`, on the manually curated Chinese STS dataset [shibing624/nli-zh-all/text2vec-base-chinese-sentence-dataset](https://huggingface.co/datasets/shibing624/nli-zh-all/tree/main/text2vec-base-chinese-sentence-dataset), and scores well on the Chinese NLI test sets. Run [examples/training_sup_text_matching_model_jsonl_data.py](https://github.com/shibing624/text2vec/blob/master/examples/training_sup_text_matching_model_jsonl_data.py) to train the model. The model files are on the HF model hub. Recommended for Chinese s2s (sentence vs. sentence) semantic matching.
- The `shibing624/text2vec-base-chinese-paraphrase` model was trained with the CoSENT method, starting from `nghuyong/ernie-3.0-base-zh`, on the manually curated Chinese STS dataset [shibing624/nli-zh-all/text2vec-base-chinese-paraphrase-dataset](https://huggingface.co/datasets/shibing624/nli-zh-all/tree/main/text2vec-base-chinese-paraphrase-dataset). Compared with [shibing624/nli-zh-all/text2vec-base-chinese-sentence-dataset](https://huggingface.co/datasets/shibing624/nli-zh-all/tree/main/text2vec-base-chinese-sentence-dataset), this dataset adds s2p (sentence to paraphrase) data, which strengthens the model's representation of long text. The model reaches SOTA on the Chinese NLI test sets. Run [examples/training_sup_text_matching_model_jsonl_data.py](https://github.com/shibing624/text2vec/blob/master/examples/training_sup_text_matching_model_jsonl_data.py) to train the model. The model files are on the HF model hub. Recommended for Chinese s2p (sentence vs. paragraph) semantic matching.
- The `sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2` model was trained with SBERT. It is the multilingual version of `paraphrase-MiniLM-L12-v2` and supports Chinese, English, and other languages.
- `w2v-light-tencent-chinese` is a Word2Vec model built from Tencent word vectors. It loads on CPU and suits Chinese literal (surface-form) matching and cold-start situations where little data is available.
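Since the benchmark reports the Spearman coefficient, a small sketch of how that metric can be computed may help. It assumes the usual setup in which a model's cosine similarities are compared against gold labels. The function names and toy data below are hypothetical and are not taken from the text2vec evaluation code; in practice a library routine such as `scipy.stats.spearmanr` would normally be used instead.

```python
from math import sqrt


def rank(values):
    """Return 1-based average ranks (tied values share the mean of their positions)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def spearman(x, y):
    """Spearman correlation is the Pearson correlation of the ranks."""
    return pearson(rank(x), rank(y))


# Hypothetical toy data: predicted cosine similarities vs. gold labels.
predicted = [0.91, 0.15, 0.62, 0.40]
gold = [1, 0, 1, 0]
print(round(spearman(predicted, gold), 4))
```

Only the ordering of the predicted scores matters here, not their absolute values, which is why a rank-based metric suits similarity models whose scores are not calibrated.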
## Usage (text2vec)

Using this model is easy when you have [text2vec](https://github.com/shibing624/text2vec) installed:
```
pip install -U text2vec
```
Then you can use the model like this:
```python
from text2vec import SentenceModel

sentences = ['如何更换花呗绑定银行卡', '花呗更改绑定银行卡']

# Load the model from the HuggingFace Hub and encode the sentences
model = SentenceModel('shibing624/text2vec-base-chinese')
embeddings = model.encode(sentences)
print(embeddings)
```
## Usage (HuggingFace Transformers)

Without [text2vec](https://github.com/shibing624/text2vec), you can use the model like this:
first pass your input through the transformer model, then apply the right pooling operation on top of the contextualized word embeddings.

Install transformers:
```
pip install transformers
```
Then load the model and predict:
```python
from transformers import BertTokenizer, BertModel
import torch


# Mean Pooling - Take attention mask into account for correct averaging
def mean_pooling(model_output, attention_mask):
    token_embeddings = model_output[0]  # First element of model_output contains all token embeddings
    input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
    return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)


# Load model from HuggingFace Hub
tokenizer = BertTokenizer.from_pretrained('shibing624/text2vec-base-chinese')
model = BertModel.from_pretrained('shibing624/text2vec-base-chinese')
sentences = ['如何更换花呗绑定银行卡', '花呗更改绑定银行卡']

# Tokenize sentences
encoded_input = tokenizer(sentences, padding=True, truncation=True, return_tensors='pt')

# Compute token embeddings
with torch.no_grad():
    model_output = model(**encoded_input)

# Perform pooling. In this case, mean pooling.
sentence_embeddings = mean_pooling(model_output, encoded_input['attention_mask'])
print("Sentence embeddings:")
print(sentence_embeddings)
```
## Usage (sentence-transformers)

[sentence-transformers](https://github.com/UKPLab/sentence-transformers) is a popular library to compute dense vector representations for sentences.

Install sentence-transformers:
```
pip install -U sentence-transformers
```
Then load the model and predict:
```python
from sentence_transformers import SentenceTransformer

m = SentenceTransformer("shibing624/text2vec-base-chinese")
sentences = ['如何更换花呗绑定银行卡', '花呗更改绑定银行卡']

sentence_embeddings = m.encode(sentences)
print("Sentence embeddings:")
print(sentence_embeddings)
```
## Full Model Architecture
```
CoSENT(
  (0): Transformer({'max_seq_length': 128, 'do_lower_case': False}) with Transformer model: BertModel
  (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_mean_tokens': True})
)
```
## Intended uses

Our model is intended to be used as a sentence and short paragraph encoder. Given an input text, it outputs a vector that captures
the semantic information. The sentence vector may be used for information retrieval, clustering, or sentence similarity tasks.

By default, input text longer than 256 word pieces is truncated.
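To illustrate the retrieval use case, here is a minimal sketch of ranking candidates by cosine similarity to a query vector. The tiny 3-dimensional vectors stand in for real 768-dimensional embeddings, and the helper names `cosine_similarity` and `rank_by_similarity` are hypothetical rather than part of any library mentioned here.

```python
from math import sqrt


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sqrt(sum(x * x for x in a))
    norm_b = sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rank_by_similarity(query_vec, corpus_vecs):
    """Return (index, score) pairs sorted from most to least similar."""
    scores = [(i, cosine_similarity(query_vec, v)) for i, v in enumerate(corpus_vecs)]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


# Toy vectors standing in for sentence embeddings (assumed data).
query = [1.0, 0.0, 1.0]
corpus = [[0.0, 1.0, 0.0], [2.0, 0.0, 2.0], [1.0, 1.0, 0.0]]

for index, score in rank_by_similarity(query, corpus):
    print(index, round(score, 3))
```

With real embeddings, `query` and `corpus` would be rows of the array returned by `encode` in the usage sections above. Cosine similarity ignores vector length, so the second corpus vector ranks first even though it is twice as long as the query.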
## Training procedure

### Pre-training

We use the pretrained [`hfl/chinese-macbert-base`](https://huggingface.co/hfl/chinese-macbert-base) model.
Please refer to the model card for more detailed information about the pre-training procedure.

### Fine-tuning

We fine-tune the model using a contrastive objective. Formally, we compute the cosine similarity for each
possible sentence pair in the batch.
We then apply a rank loss that compares true pairs with false pairs.
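The rank loss described above can be sketched roughly as follows. This follows one commonly cited formulation of the CoSENT loss and is an assumption, not a copy of the training code: each example is a sentence pair with a label and a cosine similarity, and the loss penalizes any lower-labeled pair that scores higher than a higher-labeled one. The function name `cosent_loss` and the scale value of 20 are likewise assumptions.

```python
from math import exp, log


def cosent_loss(cos_sims, labels, scale=20.0):
    """Ranking loss: log(1 + sum of exp(scale * (cos_low - cos_high))).

    One term is added for every ordered pair of examples (i, j) with
    labels[i] > labels[j], i.e. example i should score higher than example j.
    """
    total = 0.0
    for i, label_i in enumerate(labels):
        for j, label_j in enumerate(labels):
            if label_i > label_j:
                total += exp(scale * (cos_sims[j] - cos_sims[i]))
    return log(1.0 + total)


# Assumed toy batch: one true pair (label 1) and one false pair (label 0).
well_ordered = cosent_loss([0.9, 0.1], [1, 0])   # true pair scores higher
mis_ordered = cosent_loss([0.1, 0.9], [1, 0])    # false pair scores higher
print(well_ordered < mis_ordered)
```

The loss depends only on the relative order of the cosine similarities, which fits an evaluation based on the Spearman coefficient.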
#### Hyper parameters

- training dataset: https://huggingface.co/datasets/shibing624/nli_zh
- max_seq_length: 128
- best epoch: 5
- sentence embedding dim: 768
## Citing & Authors

This model was trained by [text2vec](https://github.com/shibing624/text2vec).

If you find this model helpful, feel free to cite:
```bibtex
@software{text2vec,
  author = {Xu Ming},
  title = {text2vec: A Tool for Text to Vector},
  year = {2022},
  url = {https://github.com/shibing624/text2vec},
}
```