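As @theta pointed out, "Split a PDF based on outline" has the code needed to extract page numbers. If you find that complicated, I copied the part of that code that maps page IDs to page numbers and turned it into a function. Here is a working example that prints the page number of bookmark o[0]: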
from PyPDF2 import PdfFileReader


def _setup_page_id_to_num(pdf, pages=None, _result=None, _num_pages=None):
    # Recursively walk the page tree, mapping each object ID to a 0-based page index.
    if _result is None:
        _result = {}
    if pages is None:
        _num_pages = []
        pages = pdf.trailer["/Root"].getObject()["/Pages"].getObject()
    t = pages["/Type"]
    if t == "/Pages":
        for page in pages["/Kids"]:
            _result[page.idnum] = len(_num_pages)
            _setup_page_id_to_num(pdf, page.getObject(), _result, _num_pages)
    elif t == "/Page":
        _num_pages.append(1)
    return _result


# main
with open('document.pdf', 'rb') as f:
    p = PdfFileReader(f)
    # map page ids to page numbers
    pg_id_num_map = _setup_page_id_to_num(p)
    o = p.getOutlines()
    # o[0] must be a bookmark (Destination), not a nested list of bookmarks
    pg_num = pg_id_num_map[o[0].page.idnum] + 1
    print(pg_num)
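@theta This may be too late for you, but it might help others :) By the way, this is my first post on Stack Overflow, so please excuse me if I did not follow the usual format.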
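To see what the mapping function in this answer does without needing a PDF file, here is a minimal sketch that runs the same kind of depth-first walk over a toy page tree. The `Node` class, its `kind` field, and the object numbers are hypothetical stand-ins for PyPDF2's indirect objects, not part of the library, and this simplified variant records only leaf pages.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Node:
    """Hypothetical stand-in for an indirect PDF object in the page tree."""
    idnum: int                       # indirect object number
    kind: str                        # "/Pages" (branch) or "/Page" (leaf)
    kids: List["Node"] = field(default_factory=list)


def page_id_to_num(node: Node,
                   result: Optional[Dict[int, int]] = None,
                   counter: Optional[List[int]] = None) -> Dict[int, int]:
    """Walk the tree depth-first, mapping object number -> 0-based page index."""
    if result is None:
        result = {}
    if counter is None:
        counter = [0]                # mutable counter shared by all recursive calls
    if node.kind == "/Pages":
        for kid in node.kids:
            page_id_to_num(kid, result, counter)
    elif node.kind == "/Page":
        result[node.idnum] = counter[0]
        counter[0] += 1
    return result


# A nested tree: the root holds one page directly and two more under a branch.
root = Node(1, "/Pages", [
    Node(4, "/Page"),
    Node(7, "/Pages", [Node(10, "/Page"), Node(13, "/Page")]),
])

mapping = page_id_to_num(root)
print(mapping)  # {4: 0, 10: 1, 13: 2}
```

Adding 1 to a looked-up value gives the 1-based page number, which is what the `+ 1` in the answer's code does.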
If you also want the exact position of each bookmark on its page, this will make your job easier:
from PyPDF2 import PdfFileReader
from PyPDF2.generic import Destination


def _setup_page_id_to_num(pdf, pages=None, _result=None, _num_pages=None):
    # Recursively walk the page tree, mapping each object ID to a 0-based page index.
    if _result is None:
        _result = {}
    if pages is None:
        _num_pages = []
        pages = pdf.trailer["/Root"].getObject()["/Pages"].getObject()
    t = pages["/Type"]
    if t == "/Pages":
        for page in pages["/Kids"]:
            _result[page.idnum] = len(_num_pages)
            _setup_page_id_to_num(pdf, page.getObject(), _result, _num_pages)
    elif t == "/Page":
        _num_pages.append(1)
    return _result


def outlines_pg_zoom_info(outlines, pg_id_num_map, result=None):
    # Flatten the (possibly nested) outline into {first word of title: position info}.
    if result is None:
        result = dict()
    if isinstance(outlines, list):
        for outline in outlines:
            result = outlines_pg_zoom_info(outline, pg_id_num_map, result)
    elif isinstance(outlines, Destination):
        title = outlines['/Title']
        # Not every destination type carries /Top and /Left (e.g. /Fit), hence .get()
        result[title.split()[0]] = dict(
            title=title,
            top=outlines.get('/Top'),
            left=outlines.get('/Left'),
            page=pg_id_num_map[outlines.page.idnum] + 1,
        )
    return result


# main
pdf_name = 'document.pdf'
with open(pdf_name, 'rb') as f:
    pdf = PdfFileReader(f)
    # map page ids to page numbers
    pg_id_num_map = _setup_page_id_to_num(pdf)
    outlines = pdf.getOutlines()
    bookmarks_info = outlines_pg_zoom_info(outlines, pg_id_num_map)
print(bookmarks_info)
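As a small illustration of how the resulting dictionary might be used, the sketch below puts bookmarks into reading order. The sample data and the `reading_order` helper are hypothetical. They only mimic the shape of `bookmarks_info` (keys `title`, `top`, `left`, `page`) and assume every entry has a numeric `top`, which real files do not guarantee.

```python
# Hypothetical sample shaped like the dictionary printed above.
bookmarks_info = {
    "2": {"title": "2 Methods", "top": 720, "left": 72, "page": 3},
    "1": {"title": "1 Introduction", "top": 500, "left": 72, "page": 1},
    "2.1": {"title": "2.1 Setup", "top": 310, "left": 90, "page": 3},
}


def reading_order(info):
    """Sort bookmarks by page, then from the top of the page downward."""
    # PDF y-coordinates grow upward, so a larger "top" is higher on the page.
    return sorted(info.values(), key=lambda b: (b["page"], -b["top"]))


for b in reading_order(bookmarks_info):
    print(f'p.{b["page"]:>2}  top={b["top"]:>4}  {b["title"]}')
```

Sorting on the negated `top` value is a design choice that follows from the default PDF coordinate system, whose origin is the bottom-left corner of the page.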
from typing import List

from PyPDF2 import PdfFileReader
from PyPDF2.generic import Destination


def get_outlines(pdf_filepath: str) -> List[Destination]:
    """Get the bookmarks of a PDF file."""
    with open(pdf_filepath, "rb") as fp:
        pdf_file_reader = PdfFileReader(fp)
        outlines = pdf_file_reader.getOutlines()
    return outlines


print(get_outlines("PDF-export-example.pdf"))
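pyPdf.pdf.Destination has many attributes, but I cannot find one that gives the page number the bookmark refers to. How can I get the page number of a bookmark?
For example, outlines[1].page.idnum returns a number that is roughly 3 times the page number referenced in the PDF document. I assume it refers to some object smaller than a page, because running .page.idnum over the whole document outline returns an array of numbers that is not even linearly related to the "real" destination page numbers in the PDF document; it is only roughly a multiple of 3.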
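The value of .page.idnum is the indirect object number of the page object, not a page index, so it only matches the page number by coincidence. The sketch below shows one plausible way a "roughly times 3" pattern could arise. It assumes a hypothetical PDF producer that writes each page together with two helper objects; `assign_object_numbers`, `objects_per_page`, and `first_id` are illustrative names, not part of any library, and real files can number objects in any order.

```python
def assign_object_numbers(num_pages, objects_per_page=3, first_id=4):
    """Return the object number a hypothetical producer gives each page.

    Assumes every page is written together with (objects_per_page - 1)
    helper objects, e.g. a content stream and a resources dictionary.
    """
    return [first_id + i * objects_per_page for i in range(num_pages)]


page_ids = assign_object_numbers(5)
print(page_ids)  # [4, 7, 10, 13, 16]

# Turning an object number back into a 1-based page number needs a lookup table.
id_to_page = {obj_id: n for n, obj_id in enumerate(page_ids, start=1)}
print(id_to_page[13])  # 4
```

Because nothing forces a producer to number objects this regularly, a lookup table built from the actual page tree is the dependable route, rather than dividing idnum by a constant.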
Created: 2023-05-08 17:22:43