Athanasios Oikonomou
d735b036fe
fix: handle unicode filenames in external document loader
...
Files with special characters in their names (e.g., ü.pdf) caused issues since HTTP headers only allow Latin-1 characters.
This change URL-encodes `X-Filename` before adding it to request headers, preventing failures when uploading or processing such files.
Fixes: #17000
2025-08-28 22:19:50 +03:00
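The header fix described in the commit above can be sketched roughly as follows. This is a minimal illustration, not the project's actual loader code: the helper name `build_headers` is hypothetical, and using `urllib.parse.quote` for the encoding is an assumption. Only the `X-Filename` header name comes from the commit message.

```python
from urllib.parse import quote, unquote


def build_headers(filename: str) -> dict[str, str]:
    """Return request headers with the filename percent-encoded (hypothetical helper)."""
    # Percent-encoding produces an ASCII-only value, so any Unicode
    # filename can be placed in an HTTP header safely.
    return {"X-Filename": quote(filename)}


headers = build_headers("ü.pdf")
print(headers["X-Filename"])           # percent-encoded, ASCII-only value
print(unquote(headers["X-Filename"]))  # the receiver can recover the original name
```

A receiving service would need to decode the header value (for example with `unquote`) to get the original filename back.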
Timothy Jaeryang Baek
2bb6063dcb
refac/fix: marker
2025-08-28 03:03:31 +04:00
Selene Blok
5051bfe7ab
feat(document retrieval): Authenticate Azure Document Intelligence using DefaultAzureCredential if API key is not provided
2025-08-22 16:15:43 +02:00
Timothy Jaeryang Baek
e8696c63fe
refac
2025-08-04 15:23:43 +04:00
Tim Jaeryang Baek
5db60ca34f
Merge pull request #15903 from Hisma/marker-api-update
...
feat: Add configurable API URL (for self-hosting) and additional_config parameter for Datalab Marker API
2025-08-04 15:21:03 +04:00
Hisma
21337a2fd8
ci fix
2025-07-22 22:07:14 -04:00
Hisma
a99e20cc3d
add format_lines
2025-07-22 21:06:29 -04:00
Hisma
f31cc07a9d
feat: update marker api
2025-07-22 20:49:28 -04:00
bekzod
4bc054a347
Update docling endpoint
2025-07-16 20:40:13 +05:00
expruc
453a2bd9b5
fixed issue where text/html files were being detected as text when loaded
2025-07-06 20:10:26 +03:00
Tim Jaeryang Baek
600344f2e8
Merge pull request #15510 from kopero2000/bug/oauth_logout_fix
...
fix/oauth logout fix
2025-07-04 10:30:02 +04:00
Bela Vizi
9623ef4360
add trust_env to ClientSession
2025-07-02 17:59:56 +02:00
Timothy Jaeryang Baek
81b8267e85
feat: odt file parse support
2025-06-19 18:39:00 +04:00
Timothy Jaeryang Baek
7753f57d42
chore: format
2025-06-16 13:48:50 +04:00
Tim Jaeryang Baek
c5b48ec551
Merge pull request #14992 from sreesdas/dev
...
Fix: Added support for multiple pages in external document loader
2025-06-16 11:01:33 +04:00
sree
62bfe73964
Fix: Added support for multiple pages in external document loader, added filename in api request header
2025-06-15 19:59:05 +05:30
Vaclav Cerny
4bbc32efa6
fix: serialize picture description parameters to JSON in DoclingLoader
2025-06-11 20:00:25 +02:00
Timothy Jaeryang Baek
7f75acff96
chore: format
2025-06-08 22:08:25 +04:00
Timothy Jaeryang Baek
0cd400f5ee
refac: docling picture describe params
2025-06-08 20:02:14 +04:00
Tim Jaeryang Baek
6bf393a480
Merge pull request #14787 from vaclcer/vaclavs-custom-docling
...
feat: Customize Docling's "Describe Pictures" feature
2025-06-08 19:02:36 +04:00
Tim Jaeryang Baek
50d9a2ac58
Merge pull request #14781 from lucyknada/patch-2
...
fix: fix #14752 and add manual transcription retrieval
2025-06-08 18:40:28 +04:00
Vaclav Cerny
99f05561f8
Add configuration options for picture description modes and update related components
2025-06-08 16:30:26 +02:00
lucy
b0965a8184
fixes #14752 and adds manual transcription option
2025-06-08 14:26:24 +02:00
Timothy Jaeryang Baek
5e35aab292
chore: format
2025-06-05 01:12:28 +04:00
Vaclav Cerny
9772c18b20
fix(loader): remove deprecated picture description configuration
2025-06-04 17:21:44 +02:00
Vaclav Cerny
c71236ba07
feat(loader): enhance picture description prompt for improved detail and clarity
2025-06-04 14:25:31 +02:00
Vaclav Cerny
c4278f4784
fix description vs classification mismatch
2025-06-04 14:13:00 +02:00
Vaclav Cerny
8644e81a1c
feat(loader): add picture description configuration for DoclingLoader
2025-06-04 12:34:39 +02:00
Timothy Jaeryang Baek
4d364e2967
refac: remove msg from known type
2025-06-03 16:27:28 +04:00
PVBLIC Foundation
cf3635ba25
Update mistral.py
...
1. Intelligent Error Handling
  - Added _is_retryable_error() method to distinguish retryable vs non-retryable errors
  - Prevents unnecessary retries on client errors (4xx) that won't succeed
  - Caps retry delay at 30 seconds to prevent excessive waiting
2. Optimized Timeout Configuration
  - Upload: Capped at 2 minutes (was using full 5-minute timeout)
  - URL requests: 30 seconds (should be fast)
  - OCR processing: Full timeout (can take time)
  - Cleanup: 30 seconds (should be quick)
3. Enhanced Connection Pool
  - Increased connection limits: 20 total, 10 per host
  - Longer DNS cache TTL (10 minutes vs 5 minutes)
  - Increased keepalive timeout (60s vs 30s)
  - Added async DNS resolver for better performance
  - Granular timeout controls (connect, read, total)
4. Concurrency Control for Batch Processing
  - Added semaphore-based concurrency control (default: 5 concurrent)
  - Avoids overwhelming the API while maintaining throughput
  - Configurable concurrency limit per workload
5. Memory Efficient Result Processing
  - Early exit for empty content validation
  - Better error metadata for debugging
  - Added content length tracking
  - Streamlined page processing logic
6. General Performance Improvements
  - Better error logging with truncated responses
  - Optimized metadata creation
  - Improved debug logging efficiency
2025-05-30 20:06:29 -07:00
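The retry and concurrency points in the commit above (items 1 and 4) can be illustrated with a short sketch. It is not the code from mistral.py: the function names, the exponential backoff formula, and the decision to treat HTTP 429 as retryable are all assumptions made for illustration. Only the 30-second delay cap, the "do not retry client errors" rule, and the default of 5 concurrent tasks come from the commit message.

```python
import asyncio


def is_retryable_error(status: int) -> bool:
    """Hypothetical classifier: retry server errors, skip most client errors."""
    # Assumption: 429 (rate limited) is worth retrying; other 4xx are not.
    return status == 429 or 500 <= status < 600


def retry_delay(attempt: int, base: float = 1.0, cap: float = 30.0) -> float:
    """Exponential backoff (assumed formula), capped at 30 seconds."""
    return min(cap, base * (2 ** attempt))


async def process_all(items, worker, max_concurrent: int = 5):
    """Run worker(item) for every item, at most max_concurrent at a time."""
    semaphore = asyncio.Semaphore(max_concurrent)

    async def guarded(item):
        # The semaphore blocks here once max_concurrent workers are running.
        async with semaphore:
            return await worker(item)

    # gather preserves input order in its result list.
    return await asyncio.gather(*(guarded(item) for item in items))
```

With this shape, the concurrency limit is a plain argument, so a caller could raise or lower it per workload without touching the worker itself.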
Timothy Jaeryang Baek
7dc7d5c028
refac: PLEASE FOLLOW EXISTING CONVENTION
2025-05-29 03:47:02 +04:00
Timothy Jaeryang Baek
551597b9cc
chore: format
2025-05-29 02:36:33 +04:00
Hisma
e12a79c0e2
fix: handle json output format correctly
2025-05-27 01:12:03 -04:00
Hisma
a9405cc101
feat: Marker api content extraction support
2025-05-27 00:44:07 -04:00
Timothy Jaeryang Baek
8b5e89eada
chore: format
2025-05-24 00:43:38 +04:00
PVBLIC Foundation
bf193dfb5d
Update mistral.py
2025-05-23 10:00:19 -07:00
sree
f408b08965
minor bug fix for external document loader not working
2025-05-20 11:10:23 +05:30
Timothy Jaeryang Baek
8732b64b6b
feat: external document loader support
2025-05-14 22:28:40 +04:00
Timothy Jaeryang Baek
de70d0cb64
feat: docling do picture description support
2025-05-14 21:26:49 +04:00
Timothy Jaeryang Baek
6359cb55fe
chore: format
2025-05-07 02:01:03 +04:00
Tim Jaeryang Baek
ea07e242f5
Merge pull request #13528 from Classic298/dev
...
feat: Enhance YouTube Transcription Loader for multi-language support
2025-05-07 00:44:45 +04:00
Classic298
1dcbec71ec
Update youtube.py
2025-05-06 17:14:00 +02:00
Classic298
87dcbd198c
Update youtube.py
2025-05-06 17:11:03 +02:00
Classic298
d7927506f1
Update youtube.py
2025-05-06 17:06:21 +02:00
Classic298
f65dc715f9
Update youtube.py
2025-05-06 16:30:18 +02:00
Classic298
c69278c13c
Update youtube.py
2025-05-06 16:24:27 +02:00
Classic298
a129e0954e
Update youtube.py
2025-05-06 16:22:40 +02:00
Classic298
5e1cb76b93
Update youtube.py
2025-05-06 16:16:58 +02:00
Timothy Jaeryang Baek
e63b8b3879
refac
2025-05-06 00:46:32 +04:00
Timothy Jaeryang Baek
27da31dc83
fix: tikaloader extract images
2025-05-05 23:40:34 +04:00