Reading a PDF File Paragraph by Paragraph with PDFBox

Add the dependency

    <dependency>
        <groupId>org.apache.pdfbox</groupId>
        <artifactId>pdfbox</artifactId>
        <version>2.0.4</version>
    </dependency>

Example: reading the content

Code example

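    import java.util.ArrayList;
    import java.util.Arrays;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;
    import java.util.Objects;
    import java.util.stream.Collectors;

    import org.apache.commons.lang3.StringUtils; // assumed source of StringUtils.isNotBlank
    import org.apache.pdfbox.pdmodel.PDDocument;
    import org.apache.pdfbox.text.PDFTextStripper;
    import org.apache.pdfbox.text.TextPosition;

    // The statements below belong inside a method that declares IOException and
    // has an open InputStream named inputStream. CoordinatesEntity is a simple
    // holder for one character (getLine()) and its coordinates (getX(), getY()).
    Map<String, Object> result = new HashMap<>(); // assumed type; not declared in the original snippet
    List<CoordinatesEntity> response = new ArrayList<>();

    // Load the PDF; try-with-resources closes it even if extraction fails
    try (PDDocument doc = PDDocument.load(inputStream)) {
        int pageNumber = doc.getNumberOfPages();
        PDFTextStripper stripper = new PDFTextStripper() {
            @Override
            protected void writeString(String text, List<TextPosition> textPositions) {
                // Record every character together with its adjusted x/y position
                for (TextPosition position : textPositions) {
                    response.add(new CoordinatesEntity(
                            position.getUnicode(), position.getXDirAdj(), position.getYDirAdj()));
                }
            }
        };
        stripper.setSortByPosition(true);
        stripper.setStartPage(1);
        stripper.setEndPage(pageNumber);
        stripper.getText(doc);
    }

    if (!response.isEmpty()) { // a PDF without extractable text yields no characters
        // The first character read marks the start of the title line
        Float x = response.get(0).getX();
        Float y = response.get(0).getY();
        StringBuilder title = new StringBuilder(response.get(0).getLine());
        StringBuilder content = new StringBuilder();
        float contentFirstX = -1;
        float contentFirstY = -1;
        float contentX;
        float contentY;
        float lastY = -1;
        for (CoordinatesEntity entity : response) {
            if (entity.getX() > x && Objects.equals(entity.getY(), y)) {
                // Same line as the first character: part of the title
                title.append(entity.getLine());
            } else if (!Objects.equals(entity.getY(), y)) {
                // Any other line belongs to the body. The original also required
                // x != title x, which silently dropped body characters at that x.
                if (contentFirstX == -1 && contentFirstY == -1) {
                    // x and y of the first paragraph's first character
                    contentFirstX = entity.getX();
                    contentFirstY = entity.getY();
                }
                // x and y of the current body character
                contentX = entity.getX();
                contentY = entity.getY();
                if (contentX == contentFirstX && contentY != lastY) {
                    // A new line starting at the paragraph x: begin a new paragraph.
                    // "\n" replaces the original "/n", which also split on a literal "/n" in the text.
                    content.append("\n").append(entity.getLine());
                } else {
                    content.append(entity.getLine());
                }
                lastY = contentY;
            }
        }
        List<String> list = Arrays.stream(content.toString().split("\n"))
                .filter(StringUtils::isNotBlank)
                .collect(Collectors.toList());
        result.put("title", title.toString());
        result.put("content", list);
    }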
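
The snippet above relies on a CoordinatesEntity class that is not shown in this post. A minimal sketch of what it might look like follows, assuming it only holds one character and its coordinates. The field names and the package-private visibility are guesses inferred from the getLine(), getX() and getY() calls above, not the original class.

```java
// Hypothetical sketch of CoordinatesEntity; the real class is not shown in the post.
final class CoordinatesEntity {
    // Text of a single character, filled from TextPosition.getUnicode() above
    private final String line;
    // Horizontal position, filled from TextPosition.getXDirAdj() above
    private final Float x;
    // Vertical position, filled from TextPosition.getYDirAdj() above
    private final Float y;

    CoordinatesEntity(String line, Float x, Float y) {
        this.line = line;
        this.x = x;
        this.y = y;
    }

    String getLine() {
        return line;
    }

    Float getX() {
        return x;
    }

    Float getY() {
        return y;
    }
}
```

The boxed Float type matches the `Float x = response.get(0).getX()` assignments in the snippet and is why the comparisons there go through Objects.equals rather than ==.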

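Approach:

First, wrap each character's text (the line field) and its x and y coordinates in an object and add it to a list.
Enable sorting by position and set the start and end pages.
The first line of the document is treated as its title. Record the x and y of the first character read as the title's start coordinates, titleX and titleY (named x and y in the code). Then iterate over the list and compare coordinates: if x > titleX && y == titleY, append the character's line value to the title. A change in y means a new line has started.
Record the coordinates of the first character after that line break as contentFirstX and contentFirstY, and keep lastY as the y coordinate of the previous character. If contentX == contentFirstX && contentY != lastY, that is, the character sits at the x where the first paragraph starts and its y differs from the previous character's y, it is the first character of a new paragraph.
Finally, wrap title and content and return them.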
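
The steps above can be tried without a PDF by feeding hand-made character positions into the same comparison logic. The following is a minimal sketch, not the code from this post: ParagraphSplitter, Glyph and Result are hypothetical names, Glyph stands in for CoordinatesEntity, and the input is assumed to be a single page already sorted by position.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper that applies the title/paragraph rules to plain data.
final class ParagraphSplitter {

    /** One character with its position (stand-in for CoordinatesEntity). */
    static final class Glyph {
        final String text;
        final float x;
        final float y;

        Glyph(String text, float x, float y) {
            this.text = text;
            this.x = x;
            this.y = y;
        }
    }

    /** The title line plus the body split into paragraphs. */
    static final class Result {
        final String title;
        final List<String> paragraphs;

        Result(String title, List<String> paragraphs) {
            this.title = title;
            this.paragraphs = paragraphs;
        }
    }

    static Result split(List<Glyph> glyphs) {
        List<String> paragraphs = new ArrayList<>();
        if (glyphs.isEmpty()) {
            return new Result("", paragraphs);
        }
        // Every character on the first character's line belongs to the title
        float titleY = glyphs.get(0).y;
        StringBuilder title = new StringBuilder();
        StringBuilder current = new StringBuilder();
        boolean seenContent = false;
        float contentFirstX = 0f;
        float lastY = 0f;
        for (Glyph g : glyphs) {
            if (g.y == titleY) {
                title.append(g.text);
                continue;
            }
            if (!seenContent) {
                // First body character: remember the x where paragraphs start
                contentFirstX = g.x;
                seenContent = true;
            } else if (g.x == contentFirstX && g.y != lastY) {
                // A new line that starts at the paragraph x: close the previous paragraph
                paragraphs.add(current.toString());
                current.setLength(0);
            }
            current.append(g.text);
            lastY = g.y;
        }
        if (current.length() > 0) {
            paragraphs.add(current.toString());
        }
        return new Result(title.toString(), paragraphs);
    }
}
```

For example, a title "Hi" on one line, an indented paragraph "Ab" that continues as "cd" on a line starting further left, and a second indented paragraph "Ef" should come back as the title "Hi" and the paragraphs "Abcd" and "Ef". Like the code in this post, the sketch compares float coordinates exactly; real PDFs may need a small tolerance, and a multi-page document would need the page number taken into account because y restarts on every page.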
