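As @theta pointed out, "Split a PDF based on outline" has the code you need to extract page numbers. If that feels complicated, I copied the part of the code that maps page IDs to page numbers and turned it into a function. Here is a working example that prints the page number of bookmark o[0]: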
    from PyPDF2 import PdfFileReader


    def _setup_page_id_to_num(pdf, pages=None, _result=None, _num_pages=None):
        if _result is None:
            _result = {}
        if pages is None:
            # first call: start at the root of the page tree
            _num_pages = []
            pages = pdf.trailer["/Root"].getObject()["/Pages"].getObject()
        t = pages["/Type"]
        if t == "/Pages":
            # intermediate node: record each kid's id, then descend into it
            for page in pages["/Kids"]:
                _result[page.idnum] = len(_num_pages)
                _setup_page_id_to_num(pdf, page.getObject(), _result, _num_pages)
        elif t == "/Page":
            # leaf node: one more page has been seen
            _num_pages.append(1)
        return _result


    # main
    with open('document.pdf', 'rb') as f:
        p = PdfFileReader(f)
        # map page ids to page numbers
        pg_id_num_map = _setup_page_id_to_num(p)
        o = p.getOutlines()
        # the map is 0-based, so add 1 to get the human-readable page number
        pg_num = pg_id_num_map[o[0].page.idnum] + 1
        print(pg_num)

@theta, this may be too late for you, but it might help others :) By the way, this is my first post on Stack Overflow, so please excuse me if I haven't followed the usual format.
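To see how the page-tree walk in the function above works without a PDF at hand, here is a minimal sketch on a toy object table. The `OBJECTS` dictionary, its object numbers, and the helper name `page_id_to_index` are all invented for illustration. It is a simplified variant of the answer's function (it records only leaf `/Page` nodes), not PyPDF2's own implementation.

```python
# Toy stand-in for a PDF object table: object number -> dictionary.
# The numbers and the layout are made up for this example.
OBJECTS = {
    1: {"/Type": "/Pages", "/Kids": [2, 7]},  # root of the page tree
    2: {"/Type": "/Pages", "/Kids": [3, 5]},  # intermediate node
    3: {"/Type": "/Page"},                    # first page
    5: {"/Type": "/Page"},                    # second page
    7: {"/Type": "/Page"},                    # third page
}


def page_id_to_index(objects, node_id, result=None, counter=None):
    """Depth-first walk of the page tree.

    Returns {object id of a page: 0-based page index}.
    """
    if result is None:
        result = {}
        counter = [0]  # mutable counter shared across recursive calls
    node = objects[node_id]
    if node["/Type"] == "/Pages":
        # intermediate node: visit the kids in document order
        for kid_id in node["/Kids"]:
            page_id_to_index(objects, kid_id, result, counter)
    elif node["/Type"] == "/Page":
        # leaf node: this is the next page in reading order
        result[node_id] = counter[0]
        counter[0] += 1
    return result


mapping = page_id_to_index(OBJECTS, 1)
print(mapping)  # {3: 0, 5: 1, 7: 2}
```

The object numbers (3, 5, 7) say nothing about page order by themselves; only the position of each page in the depth-first traversal does, which is why the lookup table is needed.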
If you want the exact position of each bookmark on its page, this will make your job easier:
    from PyPDF2 import PdfFileReader
    # same class the original reached as pyPdf.pdf.Destination
    from PyPDF2.generic import Destination


    def _setup_page_id_to_num(pdf, pages=None, _result=None, _num_pages=None):
        if _result is None:
            _result = {}
        if pages is None:
            _num_pages = []
            pages = pdf.trailer["/Root"].getObject()["/Pages"].getObject()
        t = pages["/Type"]
        if t == "/Pages":
            for page in pages["/Kids"]:
                _result[page.idnum] = len(_num_pages)
                _setup_page_id_to_num(pdf, page.getObject(), _result, _num_pages)
        elif t == "/Page":
            _num_pages.append(1)
        return _result


    def outlines_pg_zoom_info(outlines, pg_id_num_map, result=None):
        if result is None:
            result = dict()
        if isinstance(outlines, list):
            # nested lists hold child bookmarks; recurse into them
            for outline in outlines:
                result = outlines_pg_zoom_info(outline, pg_id_num_map, result)
        elif isinstance(outlines, Destination):
            title = outlines['/Title']
            # key: first word of the title; not every destination type
            # defines /Top and /Left, so .get() returns None when absent
            result[title.split()[0]] = dict(
                title=title,
                top=outlines.get('/Top'),
                left=outlines.get('/Left'),
                page=pg_id_num_map[outlines.page.idnum] + 1,
            )
        return result


    # main
    pdf_name = 'document.pdf'
    with open(pdf_name, 'rb') as f:
        pdf = PdfFileReader(f)
        # map page ids to page numbers
        pg_id_num_map = _setup_page_id_to_num(pdf)
        outlines = pdf.getOutlines()
        bookmarks_info = outlines_pg_zoom_info(outlines, pg_id_num_map)
        print(bookmarks_info)

The original question, for reference:

    from typing import List

    from PyPDF2 import PdfFileReader
    from PyPDF2.generic import Destination


    def get_outlines(pdf_filepath: str) -> List[Destination]:
        """Get the bookmarks of a PDF file."""
        with open(pdf_filepath, "rb") as fp:
            pdf_file_reader = PdfFileReader(fp)
            outlines = pdf_file_reader.getOutlines()
        return outlines


    print(get_outlines("PDF-export-example.pdf"))

pyPdf.pdf.Destination has many attributes, but I can't find any that refers to the page number of the bookmark. How can I get the page number of a bookmark?
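For example, outlines[1].page.idnum returns a number that is roughly 3 times the referenced page number in the PDF document. I assume it refers to some object smaller than a page, because running .page.idnum over the whole document outline returns an array of numbers that is not even linearly related to the "real" page-number destinations in the PDF; it is only roughly a multiple of 3.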
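One plausible explanation for the "roughly 3×" observation is that `.page.idnum` is the object number of the page object in the file, and the writer also emits other indirect objects for each page. The sketch below models that with an assumed layout of three objects per page (a content stream, a resources dictionary, then the page object). The layout and the name `build_toy_pdf` are assumptions for illustration; real files vary, which would also explain why the relation is not exactly linear.

```python
def build_toy_pdf(num_pages):
    """Return the object numbers a simple, hypothetical writer might
    give to the page objects of a document with num_pages pages."""
    next_id = 3  # pretend objects 1 and 2 are the catalog and page-tree root
    page_ids = []
    for _ in range(num_pages):
        # assumed layout: content stream, resources, then the page itself
        page_id = next_id + 2
        page_ids.append(page_id)
        next_id += 3
    return page_ids


page_ids = build_toy_pdf(5)

# The fix is the same as in the answers: look the object number up in a
# table built from the page order instead of dividing by a constant.
id_to_page = {obj_id: n for n, obj_id in enumerate(page_ids, start=1)}

for obj_id, page_no in id_to_page.items():
    print(obj_id, page_no, round(obj_id / page_no, 2))
```

In this toy layout the ratio drifts toward 3 as the page number grows but never equals it, so a lookup table is the reliable route. PyPDF2 1.x also documents `PdfFileReader.getDestinationPageNumber(destination)`, which, as far as I know, performs this lookup and returns a 0-based page index.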
Created: 2023-05-08 17:22:43