Question answering with fine-tuned BERT
In this task, the input is a question along with a paragraph containing the answer. Our objective is to extract the answer to the question from the given paragraph.
For instance:
Question: "What is immune system?"
Paragraph: "The immune system is a system of many biological structures and processes within an organism that protects against disease. To function properly, an immune system must detect a wide variety of agents, known as pathogens, from viruses to parasitic worms, and distinguish them from the organism's own healthy tissue."
The model's job is to extract the answer from the given paragraph.
Answer: "a system of many biological structures and processes within an organism that protects against disease"
How does BERT know where the answer starts and where it ends? In the paragraph below, the answer span it must locate is highlighted:
Paragraph: "The immune system is a system of many biological structures and processes within an organism that protects against disease
. To function properly, an immune system must detect a wide variety of agents, known as pathogens, from viruses to parasitic worms, and distinguish them from the organism's own healthy tissue."
To find the starting and ending indices, we use two vectors, a start vector $S$ and an end vector $E$, which are learned during the training phase.
Steps:
Compute the probability of each token being the start/end of the answer:
- Compute the dot product between the BERT representation $R_i$ of each token $i$ and $S$ (for the start) or $E$ (for the end).
- Apply the softmax function over all tokens.
- Select the token with the highest probability as the start/end token (see the sketch after this list).
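Concretely, the probability that token $i$ starts the answer is $P_i = \frac{e^{S \cdot R_i}}{\sum_j e^{S \cdot R_j}}$, and analogously with $E$ for the end. Below is a minimal sketch of this selection step; the names R, S, E and their shapes are illustrative, not taken from any library.

import torch

def pick_span(R, S, E):
    # R: (seq_len, hidden) token representations; S, E: (hidden,) learned vectors
    start_probs = torch.softmax(R @ S, dim=0)  # P_i = exp(S.R_i) / sum_j exp(S.R_j)
    end_probs = torch.softmax(R @ E, dim=0)
    # return the most likely start and end positions
    return start_probs.argmax().item(), end_probs.argmax().item()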
The full pipeline:
- Tokenize the question-paragraph pair.
- Use a pre-trained BERT to extract token representations.
- Compute the start/end tokens (using the steps above).
- Select the text span containing the answer using the start/end indices.
Let's see it in action.
from transformers import BertForQuestionAnswering, BertTokenizer
import numpy as np
import torch
Let's download and load the model. We will use a BERT model fine-tuned on SQuAD (the Stanford Question Answering Dataset): bert-large-uncased-whole-word-masking-finetuned-squad.
model = BertForQuestionAnswering.from_pretrained("google-bert/bert-large-uncased-whole-word-masking-finetuned-squad")
Some weights of the model checkpoint at google-bert/bert-large-uncased-whole-word-masking-finetuned-squad were not used when initializing BertForQuestionAnswering: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight']
- This IS expected if you are initializing BertForQuestionAnswering from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForQuestionAnswering from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
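Under the hood, BertForQuestionAnswering adds a single linear layer on top of BERT that maps each token's hidden state to two scores, playing the role of the $S$ and $E$ vectors described above. Assuming the attribute name used by the transformers implementation (qa_outputs), we can inspect it:

print(model.qa_outputs)
# expected: Linear(in_features=1024, out_features=2, bias=True)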
tokenizer = BertTokenizer.from_pretrained("google-bert/bert-large-uncased-whole-word-masking-finetuned-squad")
Let's preprocess the input. Note that we add the special tokens [CLS] (start of input) and [SEP] (segment separator) to the strings manually.
Q = "[CLS] What is the immune system? [SEP]"
P = "The immune system is a system of many biological structures and processes within an organism that protects against disease. To function properly, an immune system must detect a wide variety of agents, known as pathogens, from viruses to parasitic worms, and distinguish them from the organism's own healthy tissue. [SEP]"
Tokenize the question and the paragraph.
q_tokens = tokenizer.tokenize(Q)
p_tokens = tokenizer.tokenize(P)
Combine the question and paragraph tokens and get their input IDs.
tokens = q_tokens + p_tokens
input_ids = tokenizer.convert_tokens_to_ids(tokens)
Create the segment IDs: assign 0 to the question tokens and 1 to the paragraph tokens.
segment_ids = [0] * len(q_tokens) + [1] * len(p_tokens)
input_ids = torch.tensor([input_ids])      # add a batch dimension
segment_ids = torch.tensor([segment_ids])
print(f"input: {input_ids.shape}, segment: {segment_ids.shape}")
input: torch.Size([1, 67]), segment: torch.Size([1, 67])
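As an aside, the tokenizer can build all of these inputs in one call when given a question/paragraph pair: it inserts [CLS] and [SEP] itself and returns token_type_ids (our segment IDs). A sketch, equivalent to the manual steps above:

# strip our manual special tokens; the tokenizer adds its own
question_text = Q.replace("[CLS]", "").replace("[SEP]", "").strip()
paragraph_text = P.replace("[SEP]", "").strip()
encoding = tokenizer(question_text, paragraph_text, return_tensors="pt")
# encoding["input_ids"] and encoding["token_type_ids"] should match input_ids and segment_ids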
Now, let's feed the input_ids and segment_ids to the model, which returns a start score and an end score for every token.
output = model(input_ids, token_type_ids=segment_ids)
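Since we are only running inference, it is good practice (though not required here) to disable gradient tracking; an equivalent forward pass as a sketch:

with torch.no_grad():
    output = model(input_ids, token_type_ids=segment_ids)

# one start score and one end score per input token
print(output.start_logits.shape, output.end_logits.shape)
# expected: torch.Size([1, 67]) torch.Size([1, 67])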
Let's get the start and end indices from the logits.
s_index = torch.argmax(output.start_logits)  # most likely start token
e_index = torch.argmax(output.end_logits)    # most likely end token
Now we extract the tokens (words) from the start index to the end index and print them.
print(' '.join(tokens[s_index: e_index+1]))
a system of many biological structures and processes within an organism that protects against disease
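One caveat: WordPiece can split rarer words into subword tokens prefixed with ##, in which case joining with spaces leaves artifacts like "im ##mun ##e". A simple cleanup, as a sketch:

answer = ' '.join(tokens[s_index: e_index + 1])
# stitch subword pieces back together: "im ##mun ##e" -> "immune"
print(answer.replace(' ##', ''))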